1 Hidden Markov Models (HMMs)
(Lecture for CS397CXZ Algorithms in Bioinformatics)
Feb. 20, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

2 Motivation: the CpG island problem

Methylation in the human genome: the mutation CG -> TG happens in most places, except in the start regions of genes. CpG islands are stretches of 100-1,000 bases before a gene starts.

Questions:
Q1: Given a short stretch of genomic sequence, how would we decide whether or not it comes from a CpG island?
Q2: Given a long sequence, how would we find the CpG islands in it?

3 Answer to Q1: Bayes classifier

Hypothesis space: H = {H_CpG, H_Other}
Evidence: X = ATCGTTC

By Bayes' rule, the posterior combines the likelihood of the evidence (a generative model) with the prior probability:

    P(H_CpG | X)   = P(X | H_CpG) P(H_CpG) / P(X)
    P(H_Other | X) = P(X | H_Other) P(H_Other) / P(X)

To compare the two hypotheses, P(X) cancels:

    P(H_CpG | X) / P(H_Other | X) = [P(X | H_CpG) P(H_CpG)] / [P(X | H_Other) P(H_Other)]

So we need two generative models for sequences: p(X | H_CpG) and p(X | H_Other).

4 A simple model for sequences: p(X)

By the probability chain rule,

    p(X) = p(X_1 X_2 ... X_n) = p(X_1) p(X_2 | X_1) ... p(X_n | X_1 ... X_{n-1})

Unigram model (assume independence):

    p(X) = prod_{i=1..n} p(X_i)

Bigram model (capture some dependence):

    p(X) = p(X_1) prod_{i=2..n} p(X_i | X_{i-1})

Example unigram models:

    P(x | H_CpG):   P(A) = 0.25  P(T) = 0.25  P(C) = 0.25  P(G) = 0.25
    P(x | H_Other): P(A) = 0.25  P(T) = 0.40  P(C) = 0.10  P(G) = 0.25

Compare X = ATTG vs. X = ATCG: ATTG is more likely under H_Other (0.25 x 0.40 x 0.40 x 0.25 = 0.01 vs. 0.25^4 = 0.0039), while ATCG is more likely under H_CpG (0.0039 vs. 0.25 x 0.40 x 0.10 x 0.25 = 0.0025).

5 Answer to Q2: Hidden Markov Model

    X = ATTGATGCAAAAGGGGGATCGGGCGATATAAAATTTG
        (Other ... CpG island ... Other)

How can we identify a CpG island in a long sequence?
Idea 1: Test each window of a fixed number of nucleotides.
Idea 2: Classify the whole sequence by labeling every position (O = Other, C = CpG island):

    Class label S_1: OOOO...O
    Class label S_2: OOOO...OCC
    Class label S_i: OOOOOCC..COO
    Class label S_N: CCCC...CC

Choose the label sequence with the highest posterior:

    S* = argmax_S P(S | X) = argmax_S P(S, X)

(The second equality holds because P(S | X) = P(S, X) / P(X) and P(X) does not depend on S.)
For the example above, S* = OOOOOCC..COO.

6 An HMM is just one way of modeling p(X, S).

7 A simple HMM

Parameters (B = background/"Other", I = CpG island):

Initial state probabilities:
    p(B) = 0.5    p(I) = 0.5
State transition probabilities:
    p(B | B) = 0.8    p(I | B) = 0.2
    p(B | I) = 0.5    p(I | I) = 0.5
Output probabilities:
    P(x | I) = P(x | H_CpG):   P(a|I) = 0.25  P(t|I) = 0.25  P(c|I) = 0.25  P(g|I) = 0.25
    P(x | B) = P(x | H_Other): P(a|B) = 0.25  P(t|B) = 0.40  P(c|B) = 0.10  P(g|B) = 0.25

8 A general definition of HMM

An HMM is specified by lambda = (S, V, B, A, pi), where

    S = {s_1, ..., s_N}: the set of N states
    V = {v_1, ..., v_M}: the set of M output symbols
    B = {b_i(v_k)}: output probabilities, where b_i(v_k) is the probability of generating v_k at state s_i
    A = {a_ij}, 1 <= i, j <= N: state transition probabilities, with sum_{j=1..N} a_ij = 1 for each i
    pi = {pi_i}: initial state probabilities, where pi_i is the probability of starting at state s_i, with sum_{i=1..N} pi_i = 1
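Slides 4-8 can be made concrete with a short sketch. The code below is not from the lecture: it implements the unigram comparison from slide 4 and a dynamic-programming (Viterbi-style) decoding of S* = argmax_S p(S, X) for the simple B/I HMM on slide 7. The Viterbi algorithm is not named in this preview, and all function and variable names are my own.

```python
# Sketch of the CpG-island models from the lecture (parameters copied
# from slides 4 and 7; the code itself is an illustration, not the
# lecture's own material).
import math

states = ["B", "I"]                                # B = Other, I = CpG island
init = {"B": 0.5, "I": 0.5}                        # initial state probs
trans = {"B": {"B": 0.8, "I": 0.2},                # p(next state | current)
         "I": {"B": 0.5, "I": 0.5}}
emit = {"B": {"a": 0.25, "t": 0.40, "c": 0.10, "g": 0.25},   # p(x | H_Other)
        "I": {"a": 0.25, "t": 0.25, "c": 0.25, "g": 0.25}}   # p(x | H_CpG)

def unigram_loglik(x, model):
    """Slide 4: log p(X) under an independence (unigram) model."""
    return sum(math.log(model[ch]) for ch in x)

def viterbi(x):
    """Return (best log-prob, best state path) for sequence x,
    i.e. S* = argmax_S p(S, X), computed by dynamic programming."""
    # delta[s] = best log p(S_1..t, X_1..t) over paths ending in state s
    delta = {s: math.log(init[s]) + math.log(emit[s][x[0]]) for s in states}
    back = []                        # back[t][s] = best predecessor of s
    for obs in x[1:]:
        ptr, new_delta = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] + math.log(trans[r][s]))
            ptr[s] = prev
            new_delta[s] = (delta[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][obs]))
        back.append(ptr)
        delta = new_delta
    last = max(states, key=lambda s: delta[s])     # trace back the best path
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return delta[last], "".join(reversed(path))

# Q1 (slide 4): with equal priors, compare likelihoods directly.
for seq in ("attg", "atcg"):
    ratio = unigram_loglik(seq, emit["I"]) - unigram_loglik(seq, emit["B"])
    print(seq, "->", "CpG" if ratio > 0 else "Other")

# Q2 (slide 5): decode a state path; runs of "I" mark decoded islands.
print(viterbi("attgatgcaaaaggggg")[1])
```

Working in log space avoids floating-point underflow on long sequences, and with the equal priors of slide 7 the prior terms cancel in the unigram comparison.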
This note was uploaded on 02/13/2012 for the course CS 91.510 taught by Professor Staff during the Fall '09 term at UMass Lowell.