We are now ready to classify proteins into families using hidden Markov models. Let's assume that we are given a multiple alignment of already known members of a protein family and that we have constructed an HMM based on this family. Now a new protein comes in, and we want to decide whether it belongs to the given protein family. The easiest thing to do is simply to apply the Viterbi algorithm to find an optimal hidden path and then decide whether the protein belongs to the family depending on whether the probability of this path clears a threshold. That is what is shown on this slide. If the probability threshold is satisfied, we classify the protein as belonging to this family and extend the alignment of proteins in this family, as shown on this slide.
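In code, this decision rule is just a threshold test on the Viterbi score. Here is a minimal sketch in Python; the function names, the log-space comparison, and the viterbi_profile helper (sketched later in this section) are illustrative assumptions, not the course's API.

```python
import math

def belongs_to_family(text, states, transition, emission, threshold):
    """Classify `text` by thresholding the probability of its optimal
    hidden path. `viterbi_profile` (sketched later in this section)
    returns that probability in log space, so the threshold is moved
    into log space as well. All names here are illustrative."""
    log_p = viterbi_profile(text, states, transition, emission)
    return log_p > math.log(threshold)
```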

So, to do this classification, we need to construct the Viterbi graph for the given HMM diagram. Please note that we have not yet discussed how to construct the Viterbi graph for an HMM diagram. For example, how many rows will this Viterbi graph have? Well, as before, the number of rows in the Viterbi graph should equal the number of states in the HMM diagram, so a large number of rows. Here I show the Viterbi graph, where I took the liberty of assigning a column to every state the HMM passes through. Does it make sense? The number of columns in the Viterbi graph shown here equals the number of visited states.

There is one problem with this Viterbi graph: we do not know in advance how many states will be visited when we process a newly sequenced protein, and therefore it is not even clear how to construct this graph. For example, in this case the path visits two deletion states, but how would we know this before constructing the Viterbi graph? We should therefore try to construct the Viterbi graph with the number of columns based on the number of emitted symbols, something we know, rather than on the number of visited states, something we do not know in the presence of silent states such as deletion states.

Therefore, by definition, the number of columns in the Viterbi graph should equal the number of emitted symbols, which means it should be smaller than the number of columns in the erroneously constructed Viterbi graph above. Here is the correctly constructed Viterbi graph: every time we move into a deletion state and do not emit a symbol, the corresponding edge becomes a vertical edge within the same column. Now the number of columns in the Viterbi graph equals the number of emitted symbols, and the Viterbi algorithm is well defined. The only thing we need to add, as a technical detail, is a zeroth column that contains only silent states.
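To make this concrete, here is a minimal log-space Viterbi sketch in Python that treats deletion states as silent (vertical edges within a column) and reserves column 0 for silent states. The dictionary-based transition and emission tables, the state-naming convention, and the function name are my assumptions for illustration, not the course's reference implementation.

```python
import math

NEG_INF = float("-inf")

def viterbi_profile(x, states, transition, emission):
    """Viterbi over a profile HMM whose deletion states are silent.

    Assumed conventions: `states` is topologically ordered, e.g.
    ["S", "I0", "M1", "D1", "I1", ..., "E"]; `transition[b][a]` and
    `emission[a][symbol]` are probabilities; states named "S", "E",
    or "D..." emit nothing. Column j of the Viterbi graph is reached
    after emitting j symbols, so the graph has len(x) + 1 columns and
    column 0 contains only silent states."""
    def is_silent(a):
        return a[0] in ("S", "D", "E")

    n = len(x)
    # score[j][a] = log-probability of the best path that sits in
    # state a after emitting the first j symbols of x
    score = [{a: NEG_INF for a in states} for _ in range(n + 1)]
    score[0]["S"] = 0.0

    for j in range(n + 1):
        for a in states:          # topological order matters: a silent
            if a == "S":          # state in a column is scored after its
                continue          # same-column predecessors
            for b in states:
                p = transition.get(b, {}).get(a, 0.0)
                if p == 0.0:
                    continue
                if is_silent(a):  # silent state: vertical edge, no symbol
                    prev, emit = score[j][b], 0.0
                else:             # emitting state: consumes x[j - 1]
                    if j == 0:
                        continue
                    e = emission.get(a, {}).get(x[j - 1], 0.0)
                    if e == 0.0:
                        continue
                    prev, emit = score[j - 1][b], math.log(e)
                cand = prev + math.log(p) + emit
                if cand > score[j][a]:
                    score[j][a] = cand

    return score[n]["E"]  # add backpointers to recover the hidden path
```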

So we have just reduced the problem of alignment with a profile HMM to the following problem: align a new sequence to a family of aligned sequences using a profile HMM. The input is a multiple alignment Alignment, a string Text, a threshold θ describing the maximum fraction of insertions per column, and a pseudocount σ. The output is an optimal hidden path emitting Text in the profile HMM, denoted HMM(Alignment, θ, σ).
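As a reminder of how HMM(Alignment, θ, σ) is built from the multiple alignment, the sketch below shows only the first step: deciding which alignment columns seed match and deletion states. The function name and the '-' gap symbol are assumptions; counting transitions and emissions and adding the pseudocount σ would follow this classification.

```python
def seed_columns(alignment, theta):
    """Return indices of alignment columns that seed the match/deletion
    states of HMM(Alignment, theta, sigma). A column whose fraction of
    gap symbols exceeds theta is treated as an insertion column and is
    skipped. (Whether ties count as insertions is a convention detail.)"""
    n_rows = len(alignment)
    return [
        j for j in range(len(alignment[0]))
        if sum(row[j] == "-" for row in alignment) / n_rows <= theta
    ]

# Example: with theta = 0.35, the all-gap middle column is skipped.
print(seed_columns(["AC-DE", "AG-DE", "A--DE"], 0.35))  # [0, 1, 3, 4]
```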

Let's now take a look at the hidden path constructed in the HMM diagram. Of course, this hidden path corresponds to a path in the usual alignment graph that we considered before. The question then arises: we have just described a rather complex decoding algorithm for constructing the hidden path in the HMM, but wouldn't it be easier to construct the same alignment path using the traditional alignment technique? Or, in other words, have I wasted your time?

Indeed, the recurrence for the Viterbi algorithm is very similar to the recurrence for traditional sequence alignment. In the Viterbi algorithm, the score of a node is the maximum of the three scores coming from its preceding states. In the alignment recurrence, the score of a node is again the maximum of the three scores of its predecessors.
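Written side by side in generic log-space notation (which may differ from the slide's), the two recurrences are:

```latex
% Viterbi: best log-score of being in state k after emitting i symbols
% (for a silent state k, drop the emission term and use s_{l,i})
s_{k,i} = \max_{\text{states } l}
          \Bigl\{ s_{l,\,i-1} + \log \mathrm{transition}(l, k)
                              + \log \mathrm{emission}_k(x_i) \Bigr\}

% Traditional alignment: best score of a path ending at node (i, j);
% here \sigma is the indel penalty, not the pseudocount from before
s_{i,j} = \max \Bigl\{ s_{i-1,\,j} - \sigma,\;
                       s_{i,\,j-1} - \sigma,\;
                       s_{i-1,\,j-1} + \mathrm{score}(v_i, w_j) \Bigr\}
```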

So it indeed looks like I have wasted your time. However, notice that the choice of alignment path is now based on transition and emission probabilities that vary from column to column. These position-specific scoring parameters allow HMMs to detect subtle similarities, similarities that traditional sequence alignment often misses.