
Computational Biology, Part 8: Representing and Finding Sequence Features
Robert F. Murphy
Copyright © 1996-2006. All rights reserved.

Sequence Analysis Tasks
⇒ Representing sequence features, and finding sequence features using consensus sequences and frequency matrices

Definitions
- A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function.

Sequence features
- Features following an exact pattern
  - restriction enzyme recognition sites
- Features with approximate patterns
  - promoters
  - transcription initiation sites
  - transcription termination sites
  - polyadenylation sites
  - ribosome binding sites
  - protein features

Consensus sequences
- A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature.
- Consensus sequences are regular expressions.

Finding occurrences of consensus sequences
- Example: recognition sites for restriction enzymes
  - EcoRI recognizes GAATTC
  - AccI recognizes GTMKAC
- Basic algorithm:
  - Start with the first character of the sequence to be searched.
  - See if the enzyme site matches starting at that position.
  - Advance to the next character of the sequence to be searched.
  - Repeat the previous two steps until all positions have been tested.

Interactive Demonstration
- (A1 Pattern matching demo)

Block Diagram for Search with a Consensus Sequence
- Inputs: a consensus sequence (in IUB codes) and the sequence to be searched
- Output of the search engine: a list of positions where matches occur

Describing features using frequency matrices
- Goal: describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences.
- We need to describe how often particular bases are found in particular positions in a sequence feature.
- Definition: for a feature of length m over an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature.

Frequency matrices (continued)
- Three uses of frequency matrices:
  - Describe a sequence feature.
  - Calculate the probability of occurrence of the feature in a random sequence.
  - Calculate the degree of match between a new sequence and the feature.

Interactive Demonstration
- (A4 Frequency matrix demo)

Frequency Matrices, PSSMs, and Profiles
- A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores.
- PSSMs are also called Position Weight Matrices (PWMs) or Profiles.

Methods for converting frequency matrices to PSSMs
- Using the log ratio of observed to expected frequency:

    score(j, i) = \log \frac{m(j, i)}{f(j)}

  where m(j, i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences).
- Using an amino acid substitution matrix (Dayhoff similarity matrix) [see later].

Pseudo-counts
- How do we get a score for a position with zero counts for a particular character? We can't take log(0).
- Solution: add a small number (a pseudo-count) to all positions with zero frequency, as in the sketch below.
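To make the conversion concrete, here is a minimal Python sketch (not from the original lecture; the names build_pssm and score_window, the pseudo-count value, and the toy EcoRI-like alignment are illustrative assumptions). It builds a log-odds PSSM from aligned sites and scores a candidate window by looking up each base; for simplicity the pseudo-count is added to every count, not only the zero ones:

    import math

    def build_pssm(aligned_sites, background, pseudo=0.01):
        """Build a log-odds PSSM from equal-length aligned DNA sites."""
        m = len(aligned_sites[0])          # feature length
        n = len(aligned_sites)             # number of aligned sites
        pssm = []
        for i in range(m):
            column = [site[i] for site in aligned_sites]
            scores = {}
            for j in "ACGT":
                # Observed frequency m(j, i), smoothed with a pseudo-count
                # so no frequency is exactly zero.
                freq = (column.count(j) + pseudo) / (n + 4 * pseudo)
                # score(j, i) = log( m(j, i) / f(j) )
                scores[j] = math.log2(freq / background[j])
            pssm.append(scores)
        return pssm

    def score_window(pssm, sequence, start):
        # Sum the per-position scores by "looking up" each base.
        return sum(pssm[i][sequence[start + i]] for i in range(len(pssm)))

    sites = ["GAATTC", "GAATTC", "GATTTC", "GAATTC"]   # toy alignment
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background
    pssm = build_pssm(sites, bg)
    print(round(score_window(pssm, "TTGAATTCAA", 2), 2))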
Finding occurrences of a sequence feature using a Profile
- As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches.
- For each position, we calculate a score by "looking up" the value corresponding to the base at that position.

Interactive Demonstration
- (A5 Searching with Profile demo)

Block Diagram for Building a PSSM
- Inputs: a set of aligned sequence features and the expected frequencies of each sequence element
- Output of the PSSM builder: a PSSM

Block Diagram for Searching with a PSSM
- Inputs: a PSSM, a threshold, and a set of sequences to search
- Outputs of the PSSM search: the sequences that match above the threshold, and the positions and scores of the matches

Block Diagram for Searching for Sequences Related to a Family with a PSSM
- Stage 1: a set of aligned sequence features and the expected frequencies of each sequence element feed the PSSM builder, which produces a PSSM.
- Stage 2: the PSSM, a threshold, and a set of sequences to search feed the PSSM search, which returns the sequences that match above the threshold and the positions and scores of the matches.

Consensus sequences vs. frequency matrices
- Should I use a consensus sequence or a frequency matrix to describe my site?
  - If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence.
    - Example: restriction enzyme recognition sites
  - If some allowed characters are "better" than others, use a frequency matrix.
    - Example: promoter sequences
- Advantages of consensus sequences: smaller description, quicker comparison.
- Disadvantage: quantitative information on preferences at certain locations is lost.

Sequence Analysis Tasks
⇒ Representing and finding sequence features using hidden Markov models

Markov chains
- If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain.
- A Markov chain is defined as a sequence of states in which each state depends only on the previous state.

Formalism for Markov chains
- M = (Q, π, P) is a Markov chain, where:
  - Q = vector (1, .., n) is the list of states
    - Q(1) = A, Q(2) = C, Q(3) = G, Q(4) = T for DNA
  - π = vector (p_1, .., p_n) is the initial probability of each state
    - π(i) = p_Q(i) (e.g., π(1) = p_A for DNA)
  - P = n x n matrix in which the entry in row i and column j is the probability of observing state j if the previous state is i, and the entries in each row sum to 1 (≡ the dinucleotide probabilities)
    - P(i, j) = p_Q(i)Q(j) (e.g., P(1, 2) = p_AC for DNA)

Generating Markov chains
- Given Q, π, P (and a random number generator), we can generate sequences that are members of the Markov chain M (see the sketch below).
- If π and P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others.
- If π and P are derived from a sampled set of sequences, the family of sequences generated by M will be the population from which that set was sampled.

Interactive Demonstration
- (A11 Markov chains)
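As a concrete illustration of the generation step (a sketch, not the course's demo code; the particular π and P values below are made-up assumptions), the following Python emits one member of M = (Q, π, P):

    import random

    def generate_markov_chain(alphabet, pi, P, length, seed=None):
        """Generate one member of the Markov chain M = (Q, pi, P).

        alphabet: list of states, e.g. ["A", "C", "G", "T"]
        pi: initial probability of each state (sums to 1)
        P: row-stochastic matrix; P[i][j] = probability of state j
           given that the previous state was i
        """
        rng = random.Random(seed)
        # Choose the first state from the initial distribution pi.
        state = rng.choices(range(len(alphabet)), weights=pi)[0]
        chain = [alphabet[state]]
        for _ in range(length - 1):
            # Each new state depends only on the previous state.
            state = rng.choices(range(len(alphabet)), weights=P[state])[0]
            chain.append(alphabet[state])
        return "".join(chain)

    alphabet = ["A", "C", "G", "T"]
    pi = [0.25, 0.25, 0.25, 0.25]
    P = [[0.20, 0.30, 0.30, 0.20],
         [0.30, 0.20, 0.20, 0.30],
         [0.25, 0.25, 0.25, 0.25],
         [0.20, 0.30, 0.30, 0.20]]
    print(generate_markov_chain(alphabet, pi, P, 20, seed=1))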
Discriminating between two states with Markov chains
- To determine which of two models a sequence x of length L is more likely to have resulted from, we calculate a log-odds score:

    S(x) = \log \frac{P(x \mid \mathrm{model}^{+})}{P(x \mid \mathrm{model}^{-})}
         = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}
         = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}

State probabilities for the + and − models
- Given example sequences that are from either the + model (CpG island) or the − model (not a CpG island), we can calculate the probability that each nucleotide will occur for each model (the a values for each model).
- Transition probabilities for the + model (row = previous nucleotide, column = current nucleotide):

         A      C      G      T
    A  0.180  0.274  0.426  0.120
    C  0.171  0.368  0.274  0.188
    G  0.161  0.339  0.375  0.125
    T  0.079  0.355  0.384  0.182

- Transition probabilities for the − model:

         A      C      G      T
    A  0.300  0.205  0.285  0.210
    C  0.322  0.298  0.078  0.302
    G  0.248  0.246  0.298  0.208
    T  0.177  0.239  0.292  0.292

Transition probabilities converted to log likelihood ratios (β)

         A       C       G       T
    A  -0.740   0.419   0.580  -0.803
    C  -0.913   0.302   1.812  -0.685
    G  -0.624   0.461   0.331  -0.730
    T  -1.169   0.573   0.393  -0.679

Example
- What is the relative probability of C+G+C+ compared with C-G-C-?
- First calculate the log-odds ratio: S(CGC) = β(CG) + β(GC) = 1.812 + 0.461 = 2.273
- Convert to a relative probability: 2^2.273 = 4.833
- The relative probability is the ratio of (+) to (−): P(+) = 4.833 P(−)
- Convert to percentages: P(+) + P(−) = 1, so 4.833 P(−) + P(−) = 1 and P(−) = 1/5.833 = 17%
- Conclusion: P(+) = 83%, P(−) = 17%

Block Diagram for Generating Sequences with a Markov Model
- Inputs: alphabet, initial probabilities, transition probabilities, number of characters to generate
- Output of the Markov model sequence generator: a sequence

Hidden Markov models
- "Hidden" connotes that the sequence is generated by two or more states that have different transition probability matrices.

More definitions
- π_i = the state at position i in a path
- a_{kl} = P(π_i = l | π_{i-1} = k)
  - the probability of going from one state to another (the "transition probability")
- e_k(b) = P(x_i = b | π_i = k)
  - the probability of emitting a b when in state k (the "emission probability")

Decoding
- The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence.
- This is called "decoding" in the jargon of speech recognition.

More definitions
- We can calculate the joint probability of a sequence x and a state sequence π:

    P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}, \quad \text{requiring } \pi_{L+1} = 0

Determining the optimal path: the Viterbi algorithm
- The Viterbi algorithm is a form of dynamic programming.
- Definition: let v_k(i) be the probability of the most probable path ending in state k with observation i.
- Initialisation (i = 0): v_0(0) = 1, v_k(0) = 0 for k > 0
- Recursion (i = 1..L): v_l(i) = e_l(x_i) \max_k [ v_k(i-1) a_{kl} ];  ptr_i(l) = \arg\max_k [ v_k(i-1) a_{kl} ]
- Termination: P(x, π*) = \max_k [ v_k(L) a_{k0} ];  π*_L = \arg\max_k [ v_k(L) a_{k0} ]
- Traceback (i = L..1): π*_{i-1} = ptr_i(π*_i)

Block Diagram for Viterbi Algorithm
- Inputs: alphabet, initial probabilities, transition probabilities, sequence, position i, state k
- Output of the Viterbi algorithm: the probability that the sequence was generated with position i being in state k

Multiple paths can give the same sequence
- The Viterbi algorithm finds the most likely path given a sequence.
- Other paths could also give rise to the same sequence.
- How do we calculate the probability of a sequence given an HMM?
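Here is a minimal Python sketch of the Viterbi recursion just described (illustrative only; the two-state "+/−" model, its emission values, and the uniform end probabilities a_k0 are assumptions, not values from the lecture):

    def viterbi(x, states, init, trans, emit, end):
        """Most probable state path for observed sequence x.

        init[k] = a_0k, trans[k][l] = a_kl, emit[k][b] = e_k(b),
        end[k] = a_k0, mirroring the slide notation.
        """
        # Initialisation: v_k(1) = a_0k * e_k(x_1)
        v = [{k: init[k] * emit[k][x[0]] for k in states}]
        ptr = []
        for i in range(1, len(x)):
            row, back = {}, {}
            for l in states:
                # Recursion: v_l(i) = e_l(x_i) max_k v_k(i-1) a_kl,
                # remembering the maximizing predecessor for traceback.
                best = max(states, key=lambda k: v[i - 1][k] * trans[k][l])
                back[l] = best
                row[l] = emit[l][x[i]] * v[i - 1][best] * trans[best][l]
            v.append(row)
            ptr.append(back)
        # Termination: include the transition back to the end state.
        last = max(states, key=lambda k: v[-1][k] * end[k])
        path = [last]
        for back in reversed(ptr):       # traceback
            path.append(back[path[-1]])
        return "".join(reversed(path))

    states = "+-"
    init = {"+": 0.5, "-": 0.5}
    end = {"+": 1.0, "-": 1.0}           # uniform a_k0 for simplicity
    trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
    emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
            "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
    print(viterbi("ATGCGCGCTA", states, init, trans, emit, end))

A production implementation would work with log probabilities, since the products above underflow for long sequences.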
Probability of a sequence
- Sum the probabilities of all possible paths that give that sequence.
- Let P(x) be the probability of observing sequence x given an HMM:

    P(x) = \sum_{\pi} P(x, \pi)

- We can find P(x) using a variation of the Viterbi algorithm that uses a sum in place of the max.
- This is called the forward algorithm.
- Replace v_k(i) with f_k(i) = P(x_1 .. x_i, π_i = k).

Forward algorithm
- Initialisation (i = 0): f_0(0) = 1, f_k(0) = 0 for k > 0
- Recursion (i = 1..L): f_l(i) = e_l(x_i) \sum_k f_k(i-1) \, a_{kl}
- Termination: P(x) = \sum_k f_k(L) \, a_{k0}

Backward algorithm
- We may need to know the probability that a particular observation x_i came from a particular state k given a sequence x, i.e., P(π_i = k | x).
- Use an algorithm analogous to the forward algorithm, but starting from the end.
- Initialisation: b_k(L) = a_{k0} for all k
- Recursion (i = L-1, .., 1): b_k(i) = \sum_l a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)
- Termination: P(x) = \sum_l a_{0l} \, e_l(x_1) \, b_l(1)

Estimating the probability of a state at a particular position
- Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position:

    P(\pi_i = k \mid x) = \frac{f_k(i) \, b_k(i)}{P(x)}
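The forward, backward, and posterior formulas above can be combined in a short sketch (again illustrative, reusing the same assumed toy two-state model as in the Viterbi sketch; like that sketch, a real implementation would use log space or scaling):

    def forward(x, states, init, trans, emit, end):
        # f_k(i) = P(x_1..x_i, pi_i = k); init[k] = a_0k, end[k] = a_k0
        f = [{k: init[k] * emit[k][x[0]] for k in states}]
        for i in range(1, len(x)):
            f.append({l: emit[l][x[i]] *
                         sum(f[i - 1][k] * trans[k][l] for k in states)
                      for l in states})
        px = sum(f[-1][k] * end[k] for k in states)  # P(x) = sum_k f_k(L) a_k0
        return f, px

    def backward(x, states, trans, emit, end):
        # b_k(i) = P(x_{i+1}..x_L | pi_i = k), built from the end.
        b = [{k: end[k] for k in states}]            # b_k(L) = a_k0
        for i in range(len(x) - 2, -1, -1):
            # b[0] currently holds the values for position i+1.
            b.insert(0, {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[0][l]
                                for l in states)
                         for k in states})
        return b

    def posterior(x, states, init, trans, emit, end):
        # P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)
        f, px = forward(x, states, init, trans, emit, end)
        b = backward(x, states, trans, emit, end)
        return [{k: f[i][k] * b[i][k] / px for k in states}
                for i in range(len(x))]

    # Same toy two-state model assumed in the Viterbi sketch.
    states = "+-"
    init = {"+": 0.5, "-": 0.5}
    end = {"+": 1.0, "-": 1.0}
    trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
    emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
            "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
    post = posterior("ATGCGCGCTA", states, init, trans, emit, end)
    print({k: round(post[4][k], 3) for k in states})  # state posteriors at position 5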