CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 6: Hidden Markov Models

Outline
- Markov Chains
- Hidden Markov Models
- Three Algorithms for HMMs
  - The Forward Algorithm
  - The Viterbi Algorithm
  - The Baum-Welch (EM) Algorithm
- Applications:
  - The Ice Cream Task
  - Part of Speech Tagging
  - Biology: Gene Finding

Definitions
- A weighted finite-state automaton (WFSA)
  - An FSA with probabilities on the arcs
  - The probabilities on the arcs leaving any state must sum to one
- A Markov chain (or observable Markov Model)
  - A special case of a weighted FSA in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems
  - Useful for assigning probabilities to unambiguous sequences

[Figure: Markov chain for weather]
[Figure: Markov chain for words]

Markov chain = "First-order observable Markov Model"
- A set of states Q = q_1, q_2, ..., q_N; the state at time t is q_t
- Transition probabilities
  - A set of probabilities A = a_01, a_02, ..., a_n1, ..., a_nn
  - Each a_ij represents the probability of transitioning from state i to state j
  - The set of these is the transition probability matrix A
  - a_ij = P(q_t = j | q_{t-1} = i),  1 ≤ i, j ≤ N
  - ∑_{j=1}^{N} a_ij = 1,  1 ≤ i ≤ N
- Distinguished start and end states

Markov chain = "First-order observable Markov Model"
- Current state only depends on previous state
- Markov Assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})

Another representation for the start state
- Instead of a start state, a special initial probability vector π
- An initial distribution over the probability of start states: π_i = P(q_1 = i),  1 ≤ i ≤ N
- Constraint: ∑_{j=1}^{N} π_j = 1

[Figure: The weather figure using π]
[Figure: The weather figure: specific example]

Markov chain for weather
- What is the probability of 4 consecutive warm days?
- Sequence is warm-warm-warm-warm, i.e., state sequence is 3-3-3-3
- P(3, 3, 3, 3) = π_3 a_33 a_33 a_33 = 0.2 × (0.6)^3 = 0.0432
- (A short code sketch of this computation follows this section.)

How about?
- Hot hot hot hot
- Cold hot cold hot
- What does the difference in these probabilities tell you about the real-world weather info encoded in the figure?

HMM for Ice Cream
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Baltimore, MD for the summer of 2008
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was

Hidden Markov Model
- For Markov chains, the output symbols are the same as the states
  - See hot weather: we're in state hot
- But in named-entity or part-of-speech tagging (and speech recognition and other things)
  - The output symbols are words
  - But the hidden states are something else: part-of-speech tags, named entity tags
- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states
- This means we don't know which state we are in
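The four-consecutive-warm-days computation above is small enough to check directly. Below is a minimal Python sketch; the only values taken from the slide are π_3 = 0.2 and a_33 = 0.6, and the rest of the weather chain is left unspecified.

```python
# Probability of the state sequence warm-warm-warm-warm (3-3-3-3) in the
# weather Markov chain. Only pi_3 = 0.2 and a_33 = 0.6 come from the slide;
# nothing else about the chain is needed for this particular sequence.

pi_3 = 0.2     # initial probability of starting in the warm state (state 3)
a_33 = 0.6     # probability of warm -> warm

p = pi_3 * a_33 ** 3     # one start probability, three self-transitions
print(p)                 # ~0.0432
```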
Hidden Markov Models: assumptions
- Markov assumption: P(q_i | q_1 ... q_{i-1}) = P(q_i | q_{i-1})
- Output-independence assumption: P(o_t | O_1^{t-1}, q_1^t) = P(o_t | q_t)

Eisner task
- Given: ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, ...
- Produce: weather sequence: H, C, H, H, H, C, ...

[Figure: HMM for ice cream]

Different types of HMM structure
- Bakis = left-to-right
- Ergodic = fully-connected

The Three Basic Problems for HMMs (Jack Ferguson at IDA in the 1960s)
- Problem 1 (Evaluation): Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model Φ = (A, B), how do we efficiently compute P(O | Φ), the probability of the observation sequence given the model?
- Problem 2 (Decoding): Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model Φ = (A, B), how do we choose a corresponding state sequence Q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?
- Problem 3 (Learning): How do we adjust the model parameters Φ = (A, B) to maximize P(O | Φ)?

Problem 1: computing the observation likelihood
- Given the following HMM [figure], how likely is the sequence 3 1 3?

How to compute likelihood
- For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
- But for an HMM, we don't know what the states are!
- So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence
  - Suppose we knew the weather and wanted to predict how much ice cream Jason would eat
  - I.e., P(3 1 3 | H H C)

[Figure: Computing likelihood of 3 1 3 given the hidden state sequence]
[Figure: Computing joint probability of observation and state sequence]

Computing total likelihood of 3 1 3
- We would need to sum over
  - Hot hot cold
  - Hot hot hot
  - Hot cold hot
  - ...
- How many possible hidden state sequences are there for this sequence?
- How about in general for an HMM with N hidden states and a sequence of T observations? N^T
- So we can't just do a separate computation for each hidden state sequence

Instead: the Forward algorithm
- A kind of dynamic programming algorithm, just like Minimum Edit Distance
- Uses a table to store intermediate values
- Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do this efficiently by folding all the sequences into a single trellis

The forward algorithm
- The goal of the forward algorithm is to compute P(o_1, o_2, ..., o_T, q_T = q_F | λ)
- We'll do this by recursion

The forward algorithm
- Each cell of the forward algorithm trellis, α_t(j)
  - Represents the probability of being in state j
  - After seeing the first t observations
  - Given the automaton
- Each cell thus expresses the following probability: α_t(j) = P(o_1, o_2, ..., o_t, q_t = j | λ)

[Figure: The Forward Recursion]
[Figure: The Forward Trellis]
[Figure: We update each cell]
[Figure: The Forward Algorithm]
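To make the forward recursion concrete, here is a minimal Python sketch for a two-state ice-cream HMM. The transition and emission probabilities below are hypothetical placeholders (the actual values are in the lecture figures, which are not reproduced in this text); only the hot/cold structure and the 1/2/3 ice-cream observations come from the slides, and the distinguished end state q_F is omitted for simplicity.

```python
# Forward algorithm sketch for a two-state (Hot/Cold) ice-cream HMM.
# pi, A, and B below are made-up placeholder values, not the numbers
# from the lecture figures.

states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}                    # hypothetical initial probabilities
A = {"H": {"H": 0.7, "C": 0.3},              # hypothetical transition probabilities
     "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},          # hypothetical emission probabilities
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}          # P(num_ice_creams | weather)

def forward(observations):
    """Return P(O | lambda): total likelihood, summed over hidden state sequences."""
    # alpha[j] = P(o_1 ... o_t, q_t = j): probability of the observations so far,
    # summed over all state paths that end in state j.
    alpha = {j: pi[j] * B[j][observations[0]] for j in states}
    for o_t in observations[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o_t]
                 for j in states}
    return sum(alpha.values())

print(forward([3, 1, 3]))   # likelihood of the ice-cream sequence 3 1 3
```

The trellis here is just the dictionary `alpha`, overwritten one column at a time; storing the full T-by-N table would reproduce the trellis pictures referenced above.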
Decoding
- Given an observation sequence (3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence
- Given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model Φ = (A, B), how do we choose a corresponding state sequence Q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?

Decoding
- One possibility:
  - For each hidden state sequence Q (HHH, HHC, HCH, ...)
  - Compute P(O | Q)
  - Pick the highest one
- Why not? N^T
- Instead: the Viterbi algorithm
  - Again a dynamic programming algorithm
  - Uses a trellis similar to the Forward algorithm's

Viterbi intuition
- We want to compute the joint probability of the observation sequence together with the best state sequence:
  max_{q_0, q_1, ..., q_T} P(q_0, q_1, ..., q_T, o_1, o_2, ..., o_T, q_T = q_F | λ)

[Figure: Viterbi Recursion]
[Figure: The Viterbi trellis]

Viterbi intuition
- Process the observation sequence left to right
- Filling out the trellis
- Each cell: [figure]

[Figure: Viterbi Algorithm]
[Figure: Viterbi backtrace]
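Below is a minimal sketch of the Viterbi recursion and backtrace, reusing the hypothetical `pi`, `A`, `B`, and `states` from the forward-algorithm sketch above (so again, the numbers are placeholders, not the lecture's).

```python
# Viterbi decoding sketch for the same hypothetical ice-cream HMM as above.

def viterbi(observations):
    """Return (best hidden state sequence, its joint probability with O)."""
    # v[j] = probability of the single best path so far that ends in state j
    v = {j: pi[j] * B[j][observations[0]] for j in states}
    backpointers = []                       # best predecessor of each state, per step
    for o_t in observations[1:]:
        bp, new_v = {}, {}
        for j in states:
            best_prev = max(states, key=lambda i: v[i] * A[i][j])
            bp[j] = best_prev
            new_v[j] = v[best_prev] * A[best_prev][j] * B[j][o_t]
        v = new_v
        backpointers.append(bp)
    # Backtrace: start from the best final state and follow the pointers back.
    last = max(states, key=lambda j: v[j])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    path.reverse()
    return path, v[last]

print(viterbi([3, 1, 3]))   # best weather sequence for the ice creams 3 1 3
```

The only differences from the forward sketch are a max in place of the sum and the backpointers needed to recover the winning path.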
Hidden Markov Models for Part of Speech Tagging

Part of speech tagging
- 8 (ish) traditional parts of speech
  - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
- Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS
- We'll use POS most frequently
- I'll assume that you all know what these are

POS examples
- N    noun         chair, bandwidth, pacing
- V    verb         study, debate, munch
- ADJ  adjective    purple, tall, ridiculous
- ADV  adverb       unfortunately, slowly
- P    preposition  of, by, to
- PRO  pronoun      I, me, mine
- DET  determiner   the, a, that, those

POS tagging example
  WORD    TAG
  the     DET
  koala   N
  put     V
  the     DET
  keys    N
  on      P
  the     DET
  table   N

POS tagging
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word
(These examples from Dekang Lin)

POS tagging as a sequence classification task
- We are given a sentence (an "observation" or "sequence of observations")
  - Secretariat is expected to race tomorrow
  - She promised to back the bill
- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 ... w_n

Getting to HMM
- We want, out of all sequences of n tags t_1 ... t_n, the single tag sequence such that P(t_1 ... t_n | w_1 ... w_n) is highest:
  t̂_1^n = argmax_{t_1^n} P(t_1^n | w_1^n)
- Hat ^ means "our estimate of the best one"
- argmax_x f(x) means "the x such that f(x) is maximized"

Getting to HMM
- This equation is guaranteed to give us the best tag sequence
- But how to make it operational? How to compute this value?
- Intuition of Bayesian classification: use Bayes rule to transform it into a set of other probabilities that are easier to compute

Using Bayes Rule
- t̂_1^n = argmax_{t_1^n} P(w_1^n | t_1^n) P(t_1^n) / P(w_1^n) = argmax_{t_1^n} P(w_1^n | t_1^n) P(t_1^n)

Likelihood and prior
- Likelihood: P(w_1^n | t_1^n) ≈ ∏_i P(w_i | t_i)
- Prior: P(t_1^n) ≈ ∏_i P(t_i | t_{i-1})
- So t̂_1^n ≈ argmax_{t_1^n} ∏_i P(w_i | t_i) P(t_i | t_{i-1})

Two kinds of probabilities (1)
- Tag transition probabilities P(t_i | t_{i-1})
  - Determiners likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high, but P(DT|JJ) to be low
- Compute P(NN|DT) by counting in a labeled corpus:
  P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

Two kinds of probabilities (2)
- Word likelihood probabilities P(w_i | t_i)
  - VBZ (3sg present verb) likely to be "is"
- Compute P(is|VBZ) by counting in a labeled corpus:
  P(w_i | t_i) = C(t_i, w_i) / C(t_i)

[Figure: POS tagging: likelihood and prior]

An example: the verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

Disambiguating "race"
- P(NN|TO) = .00047
- P(VB|TO) = .83
- P(race|NN) = .00057
- P(race|VB) = .00012
- P(NR|VB) = .0027
- P(NR|NN) = .0012
- P(VB|TO) P(NR|VB) P(race|VB) = .00000027
- P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
- So we (correctly) choose the verb reading (a short numeric check of these two products appears at the end of these notes)

[Figure: Transitions between the hidden states of the HMM, showing the A probabilities]
[Figure: B observation likelihoods for the POS HMM]
[Figure: The A matrix for the POS HMM]
[Figure: The B matrix for the POS HMM]

Viterbi intuition: we are looking for the best "path"
[Figure: Viterbi trellis for "promised to back the bill"; candidate tags per word include NNP, VB, TO, DT, NN, JJ, RB. Slide from Dekang Lin]

[Figure: Viterbi example]

Another application of HMMs: Gene Finding

The Central Dogma
- DNA (e.g., CCTGAGCCAACTATTGATGAA) is transcribed into RNA (CCUGAGCCAACUAUUGAUGAA), which is translated into protein (a peptide)
(Slide from Serafim Batzoglou)

Gene structure
- exon1, intron1, exon2, intron2, exon3: transcription, then splicing, then translation
- exon = protein-coding, intron = non-coding
- Codon: a triplet of nucleotides that is converted to one amino acid
(Slide from Serafim Batzoglou)

Finding Genes in Yeast
- A transcript runs from the 5' intergenic region to the 3' intergenic region
- Start codon: ATG; stop codons: TAG/TGA/TAA
- Mean coding length about 1500 bp (500 codons)
(Slide from Serafim Batzoglou)

Introns: The Bane of ORF Scanning
- Between the 5' and 3' intergenic regions, the transcript alternates exons and introns, separated by splice sites; the start codon is ATG and the stop codon is TAG/TGA/TAA
(Slide from Serafim Batzoglou)

[Figure: Needles in a Haystack (Slide from Serafim Batzoglou)]

Hidden Markov Models for Gene Finding
- States include an intergene state, a first-exon state, and an intron state
- The observed nucleotide sequence (e.g., GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA) is labeled intergenic, exon, intron, exon, intron, exon, intergenic
(Slide from Serafim Batzoglou)

Outline
- Markov Chains
- Hidden Markov Models
- Three Algorithms for HMMs
  - The Forward Algorithm
  - The Viterbi Algorithm
  - The Baum-Welch (EM) Algorithm
- Applications:
  - The Ice Cream Task
  - Part of Speech Tagging
  - Biology: Gene Finding
- Next time: Named Entity Tagging
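As a coda to the "race" example above, the two tag-sequence scores can be checked with a few lines of Python, using only the probabilities quoted on the slide.

```python
# Numeric check of the "race" disambiguation, using the slide's probabilities.

p_vb = 0.83    * 0.0027 * 0.00012   # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"verb reading: {p_vb:.2e}")  # ~2.7e-07
print(f"noun reading: {p_nn:.2e}")  # ~3.2e-10
# The verb reading is roughly 800 times more probable, so the tagger
# correctly tags "race" as VB in "to race tomorrow".
```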