Protein structure analysis – I


In this lecture we will discuss the properties that can be derived from protein 3D structures.
In the last class we discussed protein 3D structures. What did we discuss? How to obtain the
3D structures experimentally: X-ray crystallography, NMR spectroscopy and electron microscopy.
There are also several Nobel laureates recognized for understanding the structure and function
of globular proteins using X-ray crystallography or NMR spectroscopy.
All the structures solved by X-ray crystallography, NMR or electron microscopy are deposited
in a database. What is the name of the database? The Protein Data Bank (PDB). The Protein Data
Bank is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), and there
are several sites, in Japan, in the US and in Europe, so that we can access it from different
parts of the world. What are the major contents of the Protein Data Bank? There is the header
information.
The header gives the name, the source, the resolution, the secondary structures and the
publications, and then we have the data and the coordinates. What information can we obtain
from the coordinates? The atom name, atom number, residue name, residue number, chain
information, the x, y, z coordinates, the B-factor and the occupancy. All this information is
given for every atom in a protein. Then we discussed visualization tools. Several commonly
available tools let you view the structures, manipulate them in different directions, calculate
bond lengths, bond angles and torsion angles, introduce mutations and so on. Then we discussed
PyMOL in detail. What are the potential applications of PyMOL? We can calculate distances and
get different measurements, display structures and make high-quality figures, do mutation
analysis, and examine interactions, whether hydrophobic, electrostatic or other interactions.
So it has various applications. If you go through the PyMOL tutorials you will find several
options, and you can use PyMOL effectively.
Protein structures are mainly classified into four different classes, depending on the
secondary structures present in them. What are the secondary structures? Alpha helices and beta
strands. Based on how these secondary structures are distributed in the 3D structure, we
classify proteins into four groups. The major one is the all-alpha proteins. An all-alpha
protein contains mainly alpha helices: it is dominated by alpha helices, with more than forty
percent alpha helices and few beta strands. Here is an example where you can see several alpha
helices: this is the structure of myoglobin.
Myoglobin contains alpha helices but no beta strands. All-alpha proteins are well known because
the first solved structure of a globular protein contains alpha helices. Among membrane proteins
too, the first solved membrane protein structure, the photosynthetic reaction centre, also
contains several alpha helices. So all-alpha proteins are predominant: if you look at all the
protein structures, many of them belong to the all-alpha class. Then the second class. Like all
alpha, some proteins have predominantly beta strands, and these proteins are called all-beta
proteins. You can see the dominance of beta strands in the structures.
So we can see a lot of beta strands, for example more than forty percent beta strands, and few
alpha helices. This is an example: concanavalin A, which contains several beta strands. So one
class contains mainly alpha helices and the second class contains mainly beta strands. The next
possibility is that a protein contains both helices and strands. Based on the location of these
helices and strands, such proteins are classified into two groups, alpha plus beta and alpha by
beta, depending on whether the alpha helices and beta strands are segregated or mixed with each
other. One class is the alpha plus beta proteins. In this case the helices and strands tend to
segregate. You can see more than fifteen percent alpha helices and more than ten percent beta
strands; here you can see the helices and here the strands, segregated from each other.
Lysozyme is one example of the alpha plus beta proteins.
The fourth type is the alpha by beta proteins. Here also you can see helices and strands, but
they are mixed with each other, with more than fifteen percent helices and more than ten percent
beta strands. These numbers are for understanding; they are not hard and fast rules. For
example, this is a TIM barrel: it contains eight alpha helices and eight beta strands, so we see
alpha helix, beta strand, alpha helix, beta strand and so on. You can see this in the structures
of the TIM barrel proteins. So how can we get the information
whether a protein belongs to the all-alpha, all-beta, alpha plus beta or alpha by beta class? To
understand these characteristic features there are several databases, and the major one is SCOP.
It classifies proteins based on the different structural classes, and not only the structural
classes: it also has several sub-classifications based on how they fold, which family they
belong to and which superfamily. SCOP gives a comprehensive description of the structural and
evolutionary relationships of proteins of known structure. They started from the structures
available in the Protein Data Bank and classified the proteins based on the information
available in the structural data. There are hierarchy levels: family, superfamily, fold and
class. Take a protein: first it belongs to some family; different families that are close come
together to form a superfamily; different superfamilies that share the same fold are grouped
under that fold; and different folds fall into the same class. For example, in the structural
class all alpha we can see the four-helical bundle fold and other types of helical folds. All
these helical proteins, with their different folds, are grouped together under the class all
alpha. As one example, take
lysozyme. The family is lysozyme, the fold is lysozyme-like, with a common alpha plus beta motif,
and the structural class is alpha plus beta. You can identify the protein with a six-letter
code. The first four characters are the PDB code. How is a PDB ID represented? The first
character is always numeric, and the next three can be either letters or digits, for example
1LZ1. After the first four characters comes the chain information, if there is any. A protein
may contain different chains. How many chains are there in hemoglobin? Four: two alpha and two
beta. If the four chains are A, B, C and D, then the chain appears here: A for the A chain, B for
the B chain and so on. The sixth character is the domain. A large protein can contain several
domains, each of which is stable; if you know the domain information, you put the number 1, 2, 3
and so on. This is how they represent the structure of any protein.
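The naming scheme just described can be sketched in a few lines of Python. This is only an illustration of the scheme as stated in the lecture (four-character PDB ID, then chain, then domain number); the function names and the code `1abcA2` are hypothetical, and the identifiers used by the actual database may be laid out differently.

```python
import re

# PDB ID convention described above: one digit followed by three
# characters that may be letters or digits.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")

def is_pdb_id(code):
    """Return True if `code` looks like a four-character PDB ID."""
    return bool(PDB_ID.match(code))

def split_domain_code(code):
    """Split a six-character code into (pdb_id, chain, domain).

    Hypothetical helper: characters 1-4 are the PDB ID, character 5 is
    the chain and character 6 is the domain number.
    """
    if len(code) != 6 or not is_pdb_id(code[:4]):
        raise ValueError(f"not a six-character domain code: {code!r}")
    return code[:4].upper(), code[4], code[5]

print(is_pdb_id("1LZ1"))            # first character numeric, rest alphanumeric
print(is_pdb_id("ALZ1"))            # first character is not numeric
print(split_domain_code("1abcA2"))  # made-up code: chain A, domain 2
```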
This is the web server, the SCOP database. You can search the database with any PDB ID or with a
name. Here, starting from the SCOP root, the protein belongs to the class alpha plus beta, the
fold is lysozyme-like, then come the superfamily and the family; the protein name is lysozyme
and the species is Homo sapiens, so this is the human protein. Then they give the other entries
relevant to that particular protein. In this case it is lysozyme, so they give details of the
other entries that are similar to lysozyme. So here is the web server, the SCOP server.
You can access this site and see the structural class information for several proteins.
Similarly there is another server, also a kind of database, called CATH. Here also they classify
the structures based on class, similar to SCOP, and then on architecture, topology and
superfamily. The word CATH stands for C for class, A for architecture, T for topology and H for
homologous superfamily. So what are C, A, T and H? C, the class, is the simplest level. We just
discussed the four different classes: all alpha, all beta, alpha plus beta and alpha by beta.
Then comes the architecture, which summarizes the orientation of the structural units, such as
barrels, sandwiches and so on. Then the topology level, where you see the sequence connectivity:
how the members of an architecture are connected. Sometimes proteins have the same architecture
but different topologies, depending on how the alpha helices and beta strands are connected with
each other. The last one, H, is the homologous superfamily: similar structure and function. So
based on the connectivity, the orientation of the secondary structures and similar structure and
function, they classify proteins into different groups. That is why it is called CATH: C for
class, A for architecture, T for topology and H for homologous superfamily. This is the CATH
database. Here, for the same human lysozyme, the class is mainly alpha, the architecture has the
number 1.10, orthogonal bundle, then lysozyme, and the superfamily belongs to the hydrolases.
In SCOP, what is the class for lysozyme? They put alpha plus beta, but here in CATH they put all
alpha. In most cases we see similar classifications, and in some cases we see differences. I
will explain why there is a difference for the same protein in different databases. If you take
almost all the structures in the Protein Data Bank and classify them with SCOP or CATH, in most
cases you get a similar classification; there is not much difference.
In some cases different assignments are possible in SCOP and CATH, especially for the alpha plus
beta or alpha by beta proteins. How does this happen? How do you classify all alpha? Mostly
alpha helices. How do you classify alpha plus beta? It has both alpha helices and beta strands.
It depends on the cutoff values: for example, depending on whether you take ten percent or
fifteen percent, a protein belongs either to all alpha or to a mixed class, alpha plus beta or
alpha by beta. If the helical content is high, the protein can be all alpha. Sometimes the
helical content is high but the protein also contains beta strands, and then we have a conflict.
If we consider only the high helical content, it can be classified as all alpha.
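The effect of the cutoff can be illustrated with a toy rule in Python. This is a sketch under stated assumptions, not the procedure used by SCOP or CATH: the function `classify_by_content`, its thresholds and the percentages below are all illustrative choices.

```python
def classify_by_content(helix_pct, strand_pct, helix_cut, strand_cut):
    """Assign a coarse class from secondary-structure percentages.

    Toy rule: a secondary-structure type counts as present only if its
    percentage exceeds the corresponding cutoff.
    """
    has_helix = helix_pct > helix_cut
    has_strand = strand_pct > strand_cut
    if has_helix and has_strand:
        return "mixed (alpha+beta or alpha/beta)"
    if has_helix:
        return "all alpha"
    if has_strand:
        return "all beta"
    return "unassigned"

# A helix-rich protein with a small strand content (illustrative values).
helix, strand = 35.0, 8.0

# With a 10% strand cutoff the strands are ignored.
print(classify_by_content(helix, strand, helix_cut=30.0, strand_cut=10.0))

# With a 5% strand cutoff the same protein falls into a mixed class.
print(classify_by_content(helix, strand, helix_cut=30.0, strand_cut=5.0))
```

The same protein changes class only because the strand cutoff moved, which is the kind of conflict described above.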
In the case of human lysozyme, the alpha-helical content is thirty-one percent and the
beta-strand content is eight percent. Because of the high alpha content, more than thirty
percent, it is classified as an all-alpha protein in CATH. Because both helices and strands are
present, more than five percent strand and more than thirty percent helix, it is classified as
alpha plus beta in SCOP. This happens only for a few structures. In such cases we need to check
these databases to see whether they are consistent, and then, based on our requirements, classify
the protein as all alpha or alpha plus beta. Then, if we have the structures, which we can get
from the Protein Data Bank, we can derive several parameters, just as we derived various
properties from the sequence: amino acid occurrence, composition, molecular weight, average
property values, hydrophobicity, hydrophobicity profiles, dipeptide composition, alignment,
multiple sequence alignment and conservation. Likewise, if we have the 3D structures we can do
better, because 3D structures contain more information than amino acid sequences.
So we can derive various parameters, and the important aspect is that whatever parameters we
derive from the structures have applications, to understand structure or function or anything
related to diseases. I will explain some of the features. For example, contact maps: this is the
simplest one we can construct from protein 3D structures. Then accessible surface area, and
contact order, which depends on how two residues are in contact in the protein structure.
Long-range order depends on the contacts between residues that are close in space but far apart
in the sequence. Some residues have more contacts and some have fewer. For example, in a class,
the class representative keeps a large number of contacts, while some students have fewer
contacts. These contacts are important: the residues that have more contacts have a higher
influence. This is the principle of the multiple contact index. Then we can derive other factors
such as hydrophobicity, buriedness, transfer free energy, how the accessibility is reduced on
going from the unfolded state to the folded state, and the different interactions: cation-π
interactions, electrostatic interactions, hydrophobic interactions and so on.
We will explain some of these parameters and how to derive them. The first and simplest one we
can obtain from 3D structures is the contact map. What is a contact map? It simply represents
the distances between different residues in a protein structure. In a 3D structure the atoms and
residues are located by their x, y, z coordinates, and in the contact map we see whether
residues one and two are in contact with each other, whether one and five are in contact, and so
on. It is a representation of the 3D structure as a 2D graph. We have the amino acid sequence on
the x axis and the amino acid sequence on the y axis: one, two, three and so on. Now the
question is whether two residues are in contact or not. If they are in contact, put a dot; if
there is no contact, leave it blank. Then you get a matrix that shows which residues are in
contact. I will show an example. Take two residues, i and j. The matrix element for i and j is
one if the residues are closer than a specific distance threshold. So we need information on the
distance, and we need to decide which atoms to consider when measuring it. These are the two
aspects we need in order to construct the contact map. This is the example of the
contact map I showed earlier. The amino acid sequence is on the x axis and on the y axis, and
there is a dot if i and j are closer than the specific threshold; otherwise it is blank. What
can you infer from this graph? The diagonal is always present. What does it mean? These are
nearby residues, which are within van der Waals contact. One and two, one and three, two and
three, two and one are always in contact, so the diagonal is always present. Close to the
diagonal, contacts are present in most cases, though in some cases there are none. But look at
these specific cases: here the pair is very far apart; this residue is around ten and this one
is around three hundred. These residues are close in space but far away in the sequence. I will
explain this in detail. When we define a contact, as I discussed earlier, there are two aspects:
we need to fix the atoms, and we need to fix the distance threshold. For the atoms there are
various ways to define it. You can consider only the C-alpha atoms, which is the simplest, and
look at the distances; or you can use the C-beta atoms, because the C-beta atoms give a better
view of the interior of the protein. This is why many researchers use C-beta atoms.
Can we use C-beta atoms for all the residues? No, because glycine does not have a C-beta. In
that case they use C-alpha for glycine and C-beta for all other residues, since C-beta
represents the residue better than C-alpha. Or you can use any atom: to decide whether residue
five and residue ten are in contact, you check whether any of their heavy atoms are within the
specific distance. Or you can use the centroid: for any residue we can compute the centroid
x, y, z coordinates, so that every residue is represented by its centroid; then we calculate the
distances and apply the cutoff.
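The atom choices listed above can be sketched as follows. This is a minimal illustration, assuming each residue is given as a dictionary of heavy-atom names to coordinates; the function names and the two small residues are made up for the example.

```python
import math

def centroid(atoms):
    """Mean position of a residue's atoms; `atoms` maps atom name -> (x, y, z)."""
    n = len(atoms)
    return tuple(sum(xyz[k] for xyz in atoms.values()) / n for k in range(3))

def representative(res_name, atoms, mode="CB"):
    """Pick one point per residue: C-beta (C-alpha for glycine), C-alpha, or centroid."""
    if mode == "centroid":
        return centroid(atoms)
    if mode == "CB" and res_name != "GLY" and "CB" in atoms:
        return atoms["CB"]
    return atoms["CA"]

def any_heavy_atom_contact(atoms_a, atoms_b, cutoff=4.0):
    """True if any atom pair from the two residues lies within `cutoff` angstrom.

    Assumes the dictionaries contain heavy atoms only (no hydrogens).
    """
    return any(math.dist(p, q) <= cutoff
               for p in atoms_a.values() for q in atoms_b.values())

# Made-up coordinates, for illustration only.
gly = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.0, 1.4, 0.0)}
ala = {"N": (3.0, 3.0, 0.0), "CA": (4.0, 4.0, 0.0),
       "CB": (4.0, 5.5, 0.0), "C": (5.0, 3.5, 0.0)}

print(representative("GLY", gly))         # glycine falls back to C-alpha
print(representative("ALA", ala))         # other residues use C-beta
print(representative("ALA", ala, "centroid"))
print(any_heavy_atom_contact(gly, ala))   # all-atom criterion, 4 angstrom
```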
So there are various ways to choose the atoms: C-alpha, C-beta, any heavy atom, or the centroid.
Then the distance: four, five, six angstrom, which one should we use? It depends on the atoms
you consider. With C-alpha or C-beta you can use a distance of six to twelve angstrom, because
only one atom per residue is considered. With all atoms you should reduce it to four or five
angstrom; otherwise you will get too many contacts. So you define the distance depending on
whether you use C-alpha, C-beta or all atoms. Now, here are the coordinates. This is the residue
name. Where are the coordinates? Here, the x, y, z coordinates. What is the next one? The
occupancy, and this one is the B-factor. For any protein structure, as I showed earlier with one
example from the Protein Data Bank, we get the same representation, with the x, y, z
coordinates, and we can use these coordinates to construct contact maps. For example, here I
show the x, y, z coordinates. These are all C-alpha atoms, because I extracted the C-alpha atoms
from the file. For this C-alpha atom, for instance, the coordinates are about thirty-six, minus
twenty-three and minus eight. I extracted all the C-alpha atoms.
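As a minimal sketch of this extraction step, the following Python reads the C-alpha atoms from ATOM records using the fixed column positions of the PDB format. The function name `parse_ca_atoms` and the three sample lines are made up for illustration; they are not the records shown in the lecture.

```python
def parse_ca_atoms(pdb_lines):
    """Return (res_name, chain, res_number, (x, y, z)) for each C-alpha atom.

    Relies on the fixed-column layout of PDB ATOM records:
    atom name in columns 13-16, residue name 18-20, chain 22,
    residue number 23-26, and x, y, z in columns 31-38, 39-46, 47-54.
    """
    ca_atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                      # skip HETATM, header lines, etc.
        if line[12:16].strip() != "CA":
            continue                      # keep only C-alpha atoms
        res_name = line[17:20]
        chain = line[21]
        res_number = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        ca_atoms.append((res_name, chain, res_number, xyz))
    return ca_atoms

# Hypothetical ATOM records (made-up coordinates).
sample = [
    "ATOM      1  N   MET A   1      35.100 -22.500  -7.200  1.00 20.00",
    "ATOM      2  CA  MET A   1      36.400 -23.000  -8.000  1.00 20.00",
    "ATOM      9  CA  ASN A   2      38.900 -20.300  -8.900  1.00 20.00",
]

ca_atoms = parse_ca_atoms(sample)
for atom in ca_atoms:
    print(atom)
```

For real files a dedicated parser is usually a safer choice, since this sketch ignores alternate locations, multiple models and insertion codes.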
In this case, which distance is better to use? Six to twelve angstrom, or up to fourteen
angstrom. Here I use a distance of eight angstrom between the C-alpha atoms to construct the
map. How do we construct a contact map? The x axis is the sequence, and the y axis is also the
sequence. Here are the residues: one is methionine, two is asparagine, three is isoleucine and
so on. We can keep the residue names or put the numbers: one, two, three, four, five, six,
seven, eight, nine, ten.
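The construction can be sketched in Python as below, assuming one C-alpha coordinate per residue and the eight angstrom cutoff mentioned here. The function name and the coordinates (six points on a straight line, 3.8 angstrom apart) are hypothetical and serve only to make the example runnable.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Build a symmetric 0/1 matrix: 1 if two residues are within `cutoff` angstrom."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = 1
                cmap[j][i] = 1   # if i contacts j, then j contacts i
    return cmap

# Hypothetical C-alpha positions: six residues on a line, 3.8 angstrom apart.
coords = [(3.8 * k, 0.0, 0.0) for k in range(6)]
cmap = contact_map(coords)

# Print the map with a mark for each contact and a dot for no contact.
for row in cmap:
    print("".join("*" if c else "." for c in row))
```

Only the upper triangle is computed and then mirrored, which is why the matrix comes out symmetric by construction.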
So we have one to ten on both axes. Take residue one. One and two: in contact, yes. One and
three: yes. One and four: not in contact. One and five: not in contact, or maybe in contact; we
will see. One and six: not in contact. One and seven: no. One and eight: no. One and nine: no.
One and ten: no, because the coordinates, twelve and twenty, are already too far apart.
Then residue two. Two and three: yes. Two and four: yes. Two and five: yes. Two and six: yes.
Two and seven: no. Two and eight: no. Two and nine: no. Two and ten: no.
Then residue three. Three and four: yes. Three and five: yes. Three and six: yes. Three and
seven: yes. Three and eight: no. Three and nine: no. Three and ten: no.
Then residue four. Four and five: yes. Four and six: yes. Four and seven: yes. Four and eight:
probably yes. Four and nine: no. Four and ten: no.
Then residue five. Five and six: yes. Five and seven: yes. Five and eight: yes. Five and nine:
yes. Five and ten: yes.
Then residue six. Six and seven: yes. Six and eight, six and nine: checking the coordinates,
yes. Six and ten: yes.
Then residue seven. Seven and eight: yes. Seven and nine: yes. Seven and ten: yes. Then eight
and nine: yes. Then eight and ten, and nine and ten. So if you
consider the matrix, will we get a symmetric matrix or a non-symmetric matrix? Symmetric. Why is
it symmetric? Because if two and three are in contact, then three and two are also in contact.
This is unlike the amino acid sequence, where for neighboring residue pairs the order matters
and we do not get a symmetric matrix. Here, if two and three are in contact, three and two are
also in contact, so we get a symmetric matrix. You can see the diagonal, and from it you can
easily say that the residues are influenced by short-range contacts, from residues that are very
close in the sequence. You can also see that, near the diagonal, residues two or three positions
away are also contributing; they also interact in the protein structure. Here we considered only
ten residues, which is why we could not see long-range contacts, but in the earlier figure we
can see the long-range contacts. Depending on these contacts, based on the distance in space and
how the residues are located in the sequence, we can classify them into different types of
contacts. A contact is short range when it is short in terms of sequence, because the space is
fixed: when we construct the map, the distance is fixed at six, seven or eight angstrom.
Now the difference is only at the sequence level: we look at whether the residues are very close
in the sequence or how far apart they are in the sequence. For example, take this residue T as
the central one, with a radius of eight angstrom. We construct this sphere and there are several
residues inside it. These residues are located along the sequence, which I give below, with the
central T here, and we can see how they are distributed along the sequence. Based on this
distribution of the residues along the sequence, we classify the contacts into different types.
Short-range contacts are with residues close to the central residue, within plus or minus one or
plus or minus two. Almost all residues have short-range contacts. If we take any residue, how
many short-range contacts will it have? Four. For example, residue number five has residues
three, four, six and seven, so four contacts. Then we can see whether a residue has contacts
with residues at plus or minus three or plus or minus four.
These are the medium-range contacts. We use three or four because we need to represent the
alpha helices. How are alpha helices formed, and what is the range? i and i plus four. So those
contacts fall within this range. Long-range contacts are those separated by more than four
residues. This range is very broad, so we can divide it into smaller bins, four to ten, eleven
to twenty and so on, and see in which range the contacts fall in the different structural
classes. Here I show a figure with the contacts of T152 of lysozyme. These are real contacts: if
you take the PDB structure, take the threonine at 152 and collect the residues within the limit
of eight angstrom, you get this figure. From this you can define the short-, medium- and
long-range contacts. Some residues, for example F153 and the residue at 151, are just
neighboring residues, and they form short-range contacts. Others, for example 156 and 155, are
three to four residues apart, so they make medium-range contacts. And some residues are far away
in the sequence but close in space, for example T152 and A98. What is the distance at the
sequence level? Fifty-four residues. They are far apart in the sequence, so these residues
contribute long-range contacts. So, given a structure, you can divide the contacts into short
range, medium range and long range. We can also interpret them in terms of alpha helices and
beta strands: alpha helices are dominated mainly by medium-range interactions, and beta strands
are dominated by long-range interactions. Why? Because of the hydrogen bonding patterns in alpha
helices and beta strands. In this map, the contacts on the diagonal are mainly short range, the
ones very close to the diagonal are medium range, and the ones here are long range. You can also
locate the secondary structures: these are beta strands, because there are many long-range
contacts, and these are mainly alpha helices.
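The short-, medium- and long-range definitions given above depend only on the separation in the sequence, so they can be written as a small Python function. The function name `contact_range` is hypothetical; the residue numbers in the usage lines are the T152 examples from the lecture.

```python
def contact_range(i, j):
    """Classify a contact between residues i and j by sequence separation.

    short:  1 or 2 residues apart
    medium: 3 or 4 residues apart
    long:   more than 4 residues apart
    """
    separation = abs(i - j)
    if separation == 0:
        raise ValueError("a residue is not in contact with itself")
    if separation <= 2:
        return "short"
    if separation <= 4:
        return "medium"
    return "long"

# Partners of residue 152 mentioned in the lecture.
for partner in (153, 156, 98):
    print(152, partner, abs(152 - partner), contact_range(152, partner))
```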
From the distribution of secondary structures and contacts you can easily say which region
belongs to helices and which to beta strands, based on these patterns; the long-range contacts
are mainly far from the diagonal. Now I show the data for the four structural classes we
discussed. What are the four structural classes? All alpha, all beta, alpha plus beta and alpha
by beta. The all-alpha class is dominated by helices, and helices are dominated mainly by
medium-range interactions. Here we have different residue intervals, namely three to four, then
four to ten, and so on. More than twenty-five percent of the contacts are seen mainly at the
four-to-ten interval, and the fraction is much lower and goes down for the other intervals. The
long-range contacts are very few in the case of all-alpha proteins. But now look at the all-beta
proteins.
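A distribution of contacts over residue intervals, like the one discussed here, can be computed with a short Python sketch. The bin edges below are an assumption chosen so that the bins do not overlap (the lecture quotes intervals such as four to ten and eleven to twenty), and the contact list is made up.

```python
from collections import Counter

# Assumed, non-overlapping residue intervals (sequence separation).
BINS = [(1, 2), (3, 4), (5, 10), (11, 20), (21, 30), (31, 40), (41, 50)]

def interval_label(separation):
    """Return the label of the interval containing `separation`."""
    for low, high in BINS:
        if low <= separation <= high:
            return f"{low}-{high}"
    return ">50"

def interval_percentages(contacts):
    """Percentage of contacts per interval; `contacts` holds (i, j) residue pairs."""
    counts = Counter(interval_label(abs(i - j)) for i, j in contacts if i != j)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

# Hypothetical contact list, for illustration only.
contacts = [(1, 2), (1, 4), (1, 5), (2, 9), (3, 20)]
percentages = interval_percentages(contacts)
for label, value in percentages.items():
    print(label, value)
```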
In all-beta proteins the dominant range is eleven to twenty, mainly because in antiparallel beta
strands the hydrogen bonding partners are more than ten residues apart. So the eleven-to-twenty
interval dominates in all-beta proteins. The alpha plus beta proteins are in between: this is
all alpha, this is all beta, and alpha plus beta lies between the two. But look at the alpha by
beta proteins. What is special about alpha by beta proteins? They alternate: one alpha, one
beta. So if you take two beta strands, or two alpha helices, the distance between them will be
larger. Take beta, alpha, beta: the alpha helix in between spans more than ten to fifteen
residues. This is the reason why, in this type of protein, the twenty-one-to-thirty range is
dominant. In several TIM barrel proteins, the hydrogen bonding pattern is between residues
twenty-one to thirty apart, which is why the twenty-one-to-thirty range dominates. So it makes
sense, and if you look at the structures you can relate the secondary structures and the
structural class to the number of contacts in protein structures. Here is another example: this
is all alpha and this
is all beta. Look at these figures of the number of contacts, mainly the long-range contacts.
Which one has more long-range contacts? All beta, as expected. There you can see up to thirteen
contacts, but in the all-alpha protein there are very few, and in most cases zero: many residues
have no long-range contact in all-alpha proteins. In the all-beta protein very few residues are
like that; most of them have more than six long-range contacts. What do you expect for alpha
plus beta and alpha by beta? In alpha plus beta you see a combination of the two: one part
should have few long-range contacts and another part more. You can see it here: this part has
more long-range contacts and this part has fewer, and in most cases zero. For alpha by beta you
can see the pattern alternates: high and low, high and low. So in terms of these contacts you
can easily understand the different structural classes, how the residues interact and what
patterns they make. Then you can look at the different functional aspects: what contacts the
residues make, and how we can superimpose different structures to understand the
different functions. So now we have the contact maps. What information do you need to construct
a contact map? You need the coordinates, and you need to choose the atoms and the distance.
Depending on the atom type we fix a distance; if two residues are in contact we put a dot, and
in this way we construct the contact map. It is simply a representation of the 3D structure in
2D space. From it we can understand how the residues are in contact with each other in the
different structural classes.
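The per-residue number of long-range contacts compared across the structural classes can be counted directly from a contact map. The sketch below assumes a 0/1 matrix and counts partners more than four residues away; the function name and the small hairpin-like map are hypothetical.

```python
def long_range_counts(cmap, min_separation=5):
    """Number of long-range contacts for each residue in a 0/1 contact map.

    A contact counts as long range when the two residues are at least
    `min_separation` positions apart, i.e. more than four by default.
    """
    n = len(cmap)
    return [
        sum(1 for j in range(n) if cmap[i][j] and abs(i - j) >= min_separation)
        for i in range(n)
    ]

# Hypothetical 8-residue map: neighbors within two positions are in contact,
# and residue i also pairs with residue 7 - i, as in a simple hairpin.
n = 8
cmap = [[1 if abs(i - j) <= 2 or i + j == n - 1 else 0 for j in range(n)]
        for i in range(n)]

counts = long_range_counts(cmap)
print(counts)
```

In this toy map only the two ends of the chain pick up long-range contacts, because the pairs near the turn are too close in sequence to count.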