Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004

Delete states (circles): silent, or null, states. They do not match any residues; they are there so that it is possible to jump over one or more columns. They model the case where just a few of the sequences have a “-” at a position.
Pseudo-counts: It is dangerous to estimate a probability distribution from just a few observed amino acids. If there are two sequences, both with Leu at a position, then P = 1 for Leu but P = 0 for all other residues at this position. But we know that Val often substitutes for Leu. The probability of a whole sequence easily becomes 0 if a single Leu is substituted by a Val; equivalently, the log-odds score is minus infinity. How to avoid “over-fitting” (strong conclusions drawn from very little evidence)? Use pseudocounts: pretend to have more counts than those from the data. A. Add 1 to all the counts: Leu: 3/22, each other a.a.: 1/22 (2 observed counts + 20 pseudocounts = 22).

Adding 1 to all counts is the same as assuming a priori that all a.a. are equally likely. Another approach: use the background composition as pseudocounts.
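A minimal Python sketch of both pseudocount schemes, for a single alignment column. The function names, the `weight` parameter, and the toy background composition are assumptions made for illustration; they are not from the lecture, and the background values are made up rather than measured. With add-one pseudocounts, a column of two Leu has 2 observed counts plus 20 pseudocounts, so every probability has denominator 22.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_probabilities(column, pseudocounts=None):
    """Estimate residue probabilities for one alignment column.

    column: string of residues observed at this position (gaps are ignored).
    pseudocounts: dict residue -> pseudocount; defaults to 1 for every residue.
    """
    if pseudocounts is None:
        pseudocounts = {aa: 1.0 for aa in AMINO_ACIDS}
    counts = Counter(column)
    total = sum(counts[aa] + pseudocounts[aa] for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocounts[aa]) / total for aa in AMINO_ACIDS}


def background_pseudocounts(background, weight=20.0):
    """Scale a background composition into pseudocounts that sum to `weight`.

    `weight` (hypothetical parameter) controls how strongly the prior counts
    compete with the observed counts.
    """
    return {aa: weight * background[aa] for aa in AMINO_ACIDS}


# Scheme A: add 1 to every count, for a column with two Leu.
add_one = column_probabilities("LL")
print(add_one["L"], add_one["V"])

# Scheme B: background composition as pseudocounts.
# Made-up background (NOT measured data): Leu twice as common as the rest.
toy_background = {aa: (2 / 21 if aa == "L" else 1 / 21) for aa in AMINO_ACIDS}
with_background = column_probabilities("LL", background_pseudocounts(toy_background))
print(with_background["L"], with_background["V"])
```

In both schemes no residue gets probability 0, so a single Leu to Val substitution no longer drives the probability of the whole sequence to 0.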

Searching a database with an HMM: We know how to calculate the probability of a sequence in the alignment: multiply all the probabilities (or add the log-odds scores) in the model along the path followed by that sequence. For sequences not in the alignment, we do not know the path. Find a path through the model where the new sequence fits well; we can then score it as before. This requires “aligning” the sequence to the model: assigning a state to each residue in the sequence. A given sequence can have many alignments.

E.g., a protein has a.a. A1, A2, A3, …; the HMM has states M1, M2, M3, … for match states and I1, I2, I3, … for insertion states. An alignment: A1 matches M1, A2 and A3 match I1, A4 matches M2, A5 matches M6 (after passing through three delete states). For each alignment, we can calculate the probability of the sequence,
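The example alignment can be written as the state path M1, I1, I1, M2, D3, D4, D5, M6 and scored by adding log-probabilities along it. The Python sketch below assumes a `Begin` state, the state-naming scheme, and every transition and emission probability shown; all of these are hypothetical values chosen for illustration, not parameters from the lecture. It sums log-probabilities rather than log-odds; a log-odds score would additionally subtract the log background probability of each residue.

```python
import math


def path_log_prob(sequence, path, emit, trans, start="Begin"):
    """Log-probability of `sequence` following `path` through the model.

    Match (M*) and insert (I*) states each emit one residue; delete (D*)
    states are silent and contribute only their transition probability.
    """
    logp = 0.0
    pos = 0          # index of the next residue to be emitted
    prev = start
    for state in path:
        logp += math.log(trans[(prev, state)])
        if not state.startswith("D"):
            if pos >= len(sequence):
                raise ValueError("path emits more residues than the sequence has")
            logp += math.log(emit[state][sequence[pos]])
            pos += 1
        prev = state
    if pos != len(sequence):
        raise ValueError("path does not consume the whole sequence")
    return logp


# Hypothetical five-residue protein A1..A5 and the alignment from the example.
sequence = "LVVAG"
path = ["M1", "I1", "I1", "M2", "D3", "D4", "D5", "M6"]

# Made-up parameters; only the entries this path needs are listed.
trans = {
    ("Begin", "M1"): 0.9, ("M1", "I1"): 0.1, ("I1", "I1"): 0.4,
    ("I1", "M2"): 0.6, ("M2", "D3"): 0.1, ("D3", "D4"): 0.5,
    ("D4", "D5"): 0.5, ("D5", "M6"): 0.5,
}
emit = {
    "M1": {"L": 0.5},
    "I1": {"V": 0.05},
    "M2": {"A": 0.4},
    "M6": {"G": 0.6},
}

log_p = path_log_prob(sequence, path, emit, trans)
print(log_p, math.exp(log_p))
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow for long sequences. A different alignment of the same sequence corresponds to a different `path` and generally gives a different score.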
