Classifying Proteins with Profile HMMs

We are now ready to classify proteins
into families using Hidden Markov Models. Let’s assume that we are
given the multiple alignment of already known members
of the protein family and that we have constructed
an HMM based on this family. And now, a new protein comes in, and we want to decide whether it
belongs to the given protein family. The easiest thing to do is simply to apply
the Viterbi algorithm to find an optimal hidden path and decide
whether this protein belongs to the family depending on whether
the probability of this path exceeds a threshold. That's what is shown on this slide. If the probability threshold is satisfied, then we classify the protein
as belonging to this family and extend the alignment of proteins in
this family as shown on this slide. So, to do this classification, we
need to construct the Viterbi graph for the given HMM diagram. Please note that we have not discussed yet how to construct the Viterbi graph for
an HMM diagram. For example, how many rows
will this Viterbi graph have? Well, as before, the number of rows
in the Viterbi graph should be equal to the number of
states in this HMM diagram. So, a large number of rows. And
here I have shown the Viterbi graph, taking the liberty of assigning
a separate column to every state the HMM passes through. Does this make sense? So, the number of columns in
the Viterbi graph shown here is equal to the number of visited states. There’s one problem with
this Viterbi graph: we don't know in advance how many states will be visited when we
look at the newly sequenced protein. And therefore, it's not even clear
how to construct this graph. For example, in this case, it
has two deletion states, but how would we know this before
constructing the Viterbi graph? And therefore, we should actually
try to think about how to construct the Viterbi graph with the number
of columns based on the number of emitted symbols, something that we know,
rather than the number of visited states, something that we don’t know in the case
of silent states such as deletion states. And therefore, by definition,
the number of columns in the Viterbi graph should be equal to
the number of emitted symbols, which means it should be smaller
than the number of columns in this erroneously
constructed Viterbi graph. But here is the correctly
constructed Viterbi graph. Every time we move to the deletion
state and do not emit a symbol, we actually transform the edge into
a vertical edge within the same column. In this case,
the number of columns in the Viterbi graph is equal to the number of emitted symbols,
and the Viterbi algorithm is then well-defined. The only thing we need to
add is a zeroth column, containing only silent states, as a
technical detail. So we have just reduced
the problem of alignment with a profile HMM to the following
problem: Align a new sequence to a family of aligned
sequences using the profile HMM. The input is a multiple alignment Alignment,
a string Text, a threshold "theta" describing the maximum fraction of insertions
per column, and a pseudocount "sigma". The output is an optimal hidden path
emitting Text in the profile HMM, denoted HMM(Alignment, theta, sigma). Let's take a look at the hidden path
constructed in the HMM diagram, and of course this hidden path
corresponds to a path in the usual alignment graph
that we considered before. The question then arises: We just described
a rather complex decoding algorithm for constructing the hidden path in the HMM. But wouldn't it be easier to just
construct the same alignment path using the traditional alignment technique? Or, in other words,
have I wasted your time? Indeed, the recurrence for
the Viterbi algorithm is very similar to the recurrence for
the traditional sequence alignment. For the Viterbi algorithm, the score of
a node is equal to the maximum of the three scores of its preceding states. For the alignment,
again, the score of a node is equal to the maximum of the three
scores of its predecessors. So it indeed looks like
I’ve wasted your time. However, notice that
the choice of alignment paths is now based on varying transition and
emission probabilities within different columns, and therefore,
these column-specific scoring parameters allow us to detect subtle
similarities using HMMs, similarities that traditional
sequence alignment often misses.
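To make the vertical-edge idea concrete, here is a minimal Python sketch of the Viterbi algorithm over a Viterbi graph whose columns correspond to emitted symbols: a transition into a silent deletion state stays within the current column, and processing states in topological order within each column keeps every score well-defined before it is used. The tiny two-column profile HMM below, and all of its probabilities, are invented for illustration and are not taken from the lecture.

```python
from math import log

# Toy profile HMM (illustrative numbers): S -> {M1, D1} -> {M2, D2} -> E.
# Match states M1, M2 emit a symbol; S, D1, D2, E are silent.
TRANS = {  # transition probabilities
    "S":  {"M1": 0.9, "D1": 0.1},
    "M1": {"M2": 0.9, "D2": 0.1},
    "D1": {"M2": 0.9, "D2": 0.1},
    "M2": {"E": 1.0},
    "D2": {"E": 1.0},
}
EMIT = {  # emission probabilities; silent states emit nothing
    "M1": {"A": 0.8, "C": 0.2},
    "M2": {"A": 0.2, "C": 0.8},
}
SILENT = {"S", "D1", "D2", "E"}
# Topological order over the silent edges: when a vertical (silent) edge
# is relaxed within a column, the predecessor's score is already final.
ORDER = ["S", "M1", "D1", "M2", "D2", "E"]

def viterbi(text):
    """Return an optimal hidden path emitting text, and its log-probability."""
    n = len(text)
    # score[(state, i)] = best log-probability of a path that has emitted
    # the first i symbols of text and ends in `state`
    score, back = {("S", 0): 0.0}, {}
    for i in range(n + 1):          # one column per emitted symbol (plus column 0)
        for state in ORDER:
            cur = score.get((state, i))
            if cur is None:
                continue
            for nxt, p in TRANS.get(state, {}).items():
                if nxt in SILENT:    # vertical edge: stay in the same column
                    cand, key = cur + log(p), (nxt, i)
                else:                # emitting edge: advance to the next column
                    if i == n:
                        continue
                    e = EMIT[nxt].get(text[i], 1e-12)  # tiny floor for unseen symbols
                    cand, key = cur + log(p) + log(e), (nxt, i + 1)
                if cand > score.get(key, float("-inf")):
                    score[key], back[key] = cand, (state, i)
    # trace the optimal hidden path back from the end state
    path, node = [], ("E", n)
    while node in back:
        path.append(node[0])
        node = back[node]
    return ["S"] + path[::-1], score[("E", n)]
```

For the string "AC" the best path uses both match states, while for the single symbol "C" the best path skips the first column through the deletion state D1; in both cases the number of columns in the graph is determined by the number of emitted symbols, not by the number of visited states.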