Motif Search CMSC 423

Sequence Profiles
[Figure: aligned sequences of the CCT domain, often found near one end of plant proteins.]
Suppose we want to search for other examples of this domain. How can we represent the pattern implied by these sequences? One way is a sequence profile.
Sequence Profiles (PSSM)
[Figure: heat map of the profile. Rows: amino acids (A, C, D, E, ..., T, V, W, Y). Columns: motif positions 1 to 19. Each column sums to 1.]
Color indicates the probability that the i-th position has the given amino acid x, written e_i(x).
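
One way to obtain the probabilities e_i(x) is to count letters column by column in a set of aligned example sequences. The sketch below assumes ungapped sequences of equal length; the function name `build_profile` and the toy alignment are hypothetical and are not taken from the CCT domain data.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_profile(aligned_seqs, alphabet=AMINO_ACIDS):
    """Return a list of dicts; profile[i][x] is e_i(x), the fraction of
    sequences that have letter x at position i."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    profile = []
    for i in range(length):
        # Count how often each letter appears in column i.
        counts = Counter(s[i] for s in aligned_seqs)
        profile.append({a: counts[a] / len(aligned_seqs) for a in alphabet})
    return profile

# Hypothetical toy alignment (4 sequences, 4 positions).
toy = ["ACDA", "ACEA", "TCDA", "ACDY"]
profile = build_profile(toy)
print(profile[0]["A"], profile[0]["T"])
```

Each entry of `profile` corresponds to one column of the heat map, so its values sum to 1.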

Sequence Logos
[Figure: sequence logo; x-axis: motif position.]
The height of a letter reflects the fraction of time that letter is observed at that position. (The total height of all the letters in a column reflects how conserved the column is.)
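
The slide does not give a formula for the heights. A common convention, assumed here, sets the total column height to the information content log2(alphabet size) minus the column's entropy, and gives each letter its observed fraction of that height. The sketch below follows that convention without any small-sample correction; `logo_heights` is a hypothetical name.

```python
import math

def logo_heights(column, alphabet_size=20):
    """column maps letter -> observed fraction at one motif position.
    Returns letter -> height in bits under the assumed convention."""
    # Shannon entropy of the column; zero-probability letters contribute 0.
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    # Conserved columns have low entropy and therefore a tall stack.
    info = math.log2(alphabet_size) - entropy
    return {a: p * info for a, p in column.items() if p > 0}

# Hypothetical columns: one fully conserved, one split evenly between two letters.
conserved = logo_heights({"W": 1.0})
split = logo_heights({"D": 0.5, "E": 0.5})
print(sum(conserved.values()), sum(split.values()))
```

The fully conserved column gets the maximum height of log2(20) bits, and the split column is exactly one bit shorter.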
Scoring a Sequence
MRGSAMASINDSKILSLQNKKNALVDTSGYNAEVRVGDNVQLNTIYTNDFKLSSSGDKIIVN
x = the string to score (length L, the number of motif positions); M = the profile, where color indicates the probability that the i-th position has the given amino acid, e_i(x).
$$\mathrm{Score}(x) = \Pr(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$
Score of a string according to profile M = product of the probabilities that you would observe the given letters.
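
As a minimal sketch of this score, the code below multiplies the per-position probabilities for one window and then slides the window along a longer sequence to find the best-scoring start. The three-letter profile, the function names, and the sliding-window search are illustrative assumptions, not something the slide specifies.

```python
def score(x, profile):
    """Pr(x | M): product over positions i of e_i(x_i)."""
    if len(x) != len(profile):
        raise ValueError("string and profile must have the same length")
    result = 1.0
    for i, ch in enumerate(x):
        # A letter missing from a column is treated as probability 0.
        result *= profile[i].get(ch, 0.0)
    return result

def best_window(seq, profile):
    """Return (score, start) of the highest-scoring length-L window of seq."""
    L = len(profile)
    return max((score(seq[s:s + L], profile), s)
               for s in range(len(seq) - L + 1))

# Hypothetical 3-position profile over a reduced alphabet {A, C, D}.
toy_profile = [
    {"A": 0.8, "C": 0.1, "D": 0.1},
    {"A": 0.1, "C": 0.8, "D": 0.1},
    {"A": 0.25, "C": 0.25, "D": 0.5},
]
best_score, best_start = best_window("DDACDAA", toy_profile)
print(best_start, round(best_score, 4))
```

Here the window `ACD` starting at index 2 wins, with a score of about 0.8 * 0.8 * 0.5 = 0.32.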

Background Frequencies
We are interested in how different this motif position is from what we expect by chance. Correct for "expect by chance" by dividing by the probability of observing x in a random string:
$$\mathrm{ScoreCorrected}(x) = \frac{\Pr(x \mid M)}{\Pr(x \mid \text{background})} = \prod_{i=1}^{L} \frac{e_i(x_i)}{b(x_i)}$$
b(x_i) := probability of observing character x_i at random. Usually computed as (# of occurrences of x_i in the entire string) / (length of the string).
Often, to avoid multiplying lots of terms, we take the log and then sum:
$$\mathrm{ScoreCorrectedLog}(x) = \log \prod_{i=1}^{L} \frac{e_i(x_i)}{b(x_i)} = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{b(x_i)}$$
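
The two formulas can be sketched as follows. The background is estimated from letter counts in the whole string, as described above. The toy profile and the names `background` and `log_odds_score` are hypothetical, and returning negative infinity when e_i(x_i) is 0 is a choice made for this sketch.

```python
import math
from collections import Counter

def background(seq):
    """b(c) = (# of occurrences of c in seq) / (length of seq)."""
    counts = Counter(seq)
    return {ch: n / len(seq) for ch, n in counts.items()}

def log_odds_score(x, profile, b):
    """Sum over i of log(e_i(x_i) / b(x_i)). Assumes every letter of x
    occurs in the string used to estimate b."""
    total = 0.0
    for i, ch in enumerate(x):
        e = profile[i].get(ch, 0.0)
        if e == 0.0:
            return float("-inf")  # the profile gives this letter probability 0
        total += math.log(e / b[ch])
    return total

# Hypothetical 3-position profile and background string.
toy_profile = [
    {"A": 0.8, "C": 0.1, "D": 0.1},
    {"A": 0.1, "C": 0.8, "D": 0.1},
    {"A": 0.25, "C": 0.25, "D": 0.5},
]
b = background("AACD")  # A: 0.5, C: 0.25, D: 0.25
s = log_odds_score("ACD", toy_profile, b)
print(s > 0)
```

A positive value means the window is more likely under the profile than under the background; a negative value means the opposite. Summing logs also avoids the underflow that a long product of small probabilities can cause.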
The PSSM doesn't handle either:
- insertions of characters in the string that are not in the profile;
- deletions of positions in the profile (that don't have a match in the string).
A solution: use an HMM to model the profile!
AMASINDSKILSLQ-NKKNALVD

