CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 6: Hidden Markov Models

Outline
Markov Chains
Hidden Markov Models
Three Algorithms for HMMs: The Forward Algorithm, The Viterbi Algorithm, The Baum-Welch (EM) Algorithm
Applications: The Ice Cream Task, Part-of-Speech Tagging, Biology: Gene Finding

Definitions
A weighted finite-state automaton: an FSA with probabilities on the arcs. The probabilities leaving any state must sum to one.
A Markov chain (or observable Markov Model): a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through. Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.
Markov chain for weather
Markov chain for words

Markov chain = “First-order observable Markov Model”
A set of states Q = q1, q2, …, qN; the state at time t is qt.
Transition probabilities: a set of probabilities A = a01 a02 … an1 … ann. Each aij represents the probability of transitioning from state i to state j. The set of these is the transition probability matrix A:
aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
Σj=1..N aij = 1, 1 ≤ i ≤ N
Distinguished start and end states.

Markov chain = “First-order observable Markov Model”
The current state depends only on the previous state.
Markov Assumption: P(qi | q1 … qi−1) = P(qi | qi−1)

Another representation for start state
Instead of a start state, use a special initial probability vector π, an initial distribution over start states:
πi = P(q1 = i), 1 ≤ i ≤ N
Constraint: Σj=1..N πj = 1
The weather figure using π
The weather figure: specific example

Markov chain for weather
What is the probability of 4 consecutive warm days? The sequence is warm warm warm warm, i.e., the state sequence is 3 3 3 3.
P(3, 3, 3, 3) = π3 a33 a33 a33 = 0.2 × (0.6)^3 = 0.0432
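This computation can be sketched directly in code: a chain probability is the initial probability times a product of transition probabilities. A minimal sketch, where the 0.2 start probability and 0.6 self-loop for state 3 are the numbers used above; a full π and A would come from the weather figure:

```python
def chain_probability(states, pi, a):
    """P(q1..qT) = pi[q1] * product of transition probs a[q_{t-1}][q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[prev][cur]
    return p

# Only the entries needed for the warm-warm-warm-warm example:
pi = {3: 0.2}        # P(start in state 3), from the figure
a = {3: {3: 0.6}}    # P(3 -> 3) self-loop

print(chain_probability([3, 3, 3, 3], pi, a))  # 0.2 * 0.6**3 = 0.0432
```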
How about: hot hot hot hot? Cold hot cold hot? What does the difference in these probabilities tell you about the real-world weather info encoded in the figure?

HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2008, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer. Our job: figure out how hot it was.

Hidden Markov Model
For Markov chains, the output symbols are the same as the states: see hot weather, and we're in state hot. But in named-entity or part-of-speech tagging (and speech recognition and other things), the output symbols are words, while the hidden states are something else: part-of-speech tags, named-entity tags. So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don't know which state we are in.

Hidden Markov Models Assumptions
Markov assumption: P(qi | q1 … qi−1) = P(qi | qi−1)
Output independence assumption: P(ot | o1 … ot−1, q1 … qt) = P(ot | qt)

Eisner task
Given an ice cream observation sequence: 1,2,3,2,2,2,3…
Produce a weather sequence: H,C,H,H,H,C…
HMM for ice cream
Different types of HMM structure: Bakis = left-to-right; Ergodic = fully-connected

The Three Basic Problems for HMMs
Jack Ferguson at IDA in the 1960s
Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 … oT) and an HMM model Φ = (A, B), how do we efficiently compute P(O | Φ), the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence O = (o1 o2 … oT) and an HMM model Φ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters Φ = (A, B) to maximize P(O | Φ)?

Problem 1: computing the observation likelihood
Given the following HMM: how likely is the sequence 3 1 3?

How to compute likelihood
For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities. But for an HMM, we don't know what the states are! So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence. Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e., P(3 1 3 | H H C).
Computing likelihood of 3 1 3 given hidden state sequence
Computing joint probability of observation and state sequence
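Both quantities can be sketched in Python. The parameter values below are the ones commonly used for the Eisner ice-cream HMM; treat them as illustrative assumptions, since the lecture's figure is not reproduced in this text:

```python
# Assumed ice-cream HMM parameters (from the lecture figure, not this text):
pi = {'H': 0.8, 'C': 0.2}                                   # initial probs
a = {'H': {'H': 0.6, 'C': 0.4}, 'C': {'H': 0.5, 'C': 0.5}}  # transitions
b = {'H': {1: 0.2, 2: 0.4, 3: 0.4},                         # emissions
     'C': {1: 0.5, 2: 0.4, 3: 0.1}}

def likelihood_given_states(obs, states):
    """P(O | Q): multiply one emission probability per time step."""
    p = 1.0
    for o, q in zip(obs, states):
        p *= b[q][o]
    return p

def joint_probability(obs, states):
    """P(O, Q): initial prob * transition probs * emission probs."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[prev][cur]
    return p * likelihood_given_states(obs, states)

print(likelihood_given_states([3, 1, 3], ['H', 'H', 'C']))  # P(3|H)P(1|H)P(3|C)
print(joint_probability([3, 1, 3], ['H', 'H', 'C']))
```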
Computing total likelihood of 3 1 3
We would need to sum over: hot hot cold, hot hot hot, hot cold hot, …. How many possible hidden state sequences are there for this sequence? How about in general, for an HMM with N hidden states and a sequence of T observations? N^T. So we can't just do a separate computation for each hidden state sequence.

Instead: the Forward algorithm
A kind of dynamic programming algorithm, just like Minimum Edit Distance: it uses a table to store intermediate values. Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do this efficiently by folding all the sequences into a single trellis.
The goal of the forward algorithm is to compute P(o1, o2, …, oT, qT = qF | λ). We'll do this by recursion.

The forward algorithm
Each cell of the forward algorithm trellis, αt(j), represents the probability of being in state j after seeing the first t observations, given the automaton. Each cell thus expresses the probability αt(j) = P(o1, o2, …, ot, qt = j | λ).
The Forward Recursion
The Forward Trellis
We update each cell
The Forward Algorithm
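The recursion above can be sketched as follows; for simplicity this version sums over final states rather than using an explicit end state qF, and the ice-cream parameter values are assumed (they come from the lecture figure, not this text):

```python
# Forward trellis: alpha[t][j] = P(o_1 .. o_t, q_t = j | lambda).
# Base case:  alpha[1][j] = pi[j] * b[j][o_1]
# Recursion:  alpha[t][j] = (sum_i alpha[t-1][i] * a[i][j]) * b[j][o_t]
pi = {'H': 0.8, 'C': 0.2}                                   # initial probs
a = {'H': {'H': 0.6, 'C': 0.4}, 'C': {'H': 0.5, 'C': 0.5}}  # transitions
b = {'H': {1: 0.2, 2: 0.4, 3: 0.4},                         # emissions
     'C': {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    states = list(pi)
    alpha = [{j: pi[j] * b[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * a[i][j] for i in states) * b[j][o]
                      for j in states})
    return sum(alpha[-1].values())  # P(O | lambda), summed over final states

print(forward([3, 1, 3]))
```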
Decoding
Given an observation sequence 3 1 3 and an HMM, the task of the decoder is to find the best hidden state sequence. Given the observation sequence O = (o1 o2 … oT) and an HMM model Φ = (A, B), how do we choose a corresponding state sequence Q = (q1 q2 … qT) that is optimal in some sense (i.e., best explains the observations)?

Decoding
One possibility: for each hidden state sequence Q (HHH, HHC, HCH, …), compute P(O | Q) and pick the highest one. Why not? There are N^T of them.
Instead: the Viterbi algorithm. It is again a dynamic programming algorithm, and uses a trellis similar to the Forward algorithm's.
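A minimal Viterbi sketch in the same style as the forward algorithm; again the ice-cream parameter values are assumptions taken from the commonly used version of the lecture figure:

```python
# Viterbi: same trellis as forward, with max in place of sum, plus
# backpointers so the best state sequence can be recovered.
pi = {'H': 0.8, 'C': 0.2}                                   # initial probs
a = {'H': {'H': 0.6, 'C': 0.4}, 'C': {'H': 0.5, 'C': 0.5}}  # transitions
b = {'H': {1: 0.2, 2: 0.4, 3: 0.4},                         # emissions
     'C': {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    states = list(pi)
    v = [{j: pi[j] * b[j][obs[0]] for j in states}]
    back = []
    for o in obs[1:]:
        col, ptrs = {}, {}
        for j in states:
            best = max(states, key=lambda i: v[-1][i] * a[i][j])
            ptrs[j] = best
            col[j] = v[-1][best] * a[best][j] * b[j][o]
        v.append(col)
        back.append(ptrs)
    last = max(states, key=lambda j: v[-1][j])   # best final state
    path = [last]
    for ptrs in reversed(back):                  # follow backpointers
        path.append(ptrs[path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3]))
```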
Viterbi intuition
We want to compute the joint probability of the observation sequence together with the best state sequence:
max over q0, q1, …, qT of P(q0, q1, …, qT, o1, o2, …, oT, qT = qF | λ)
Viterbi Recursion
The Viterbi trellis

Viterbi intuition
Process the observation sequence left to right, filling out the trellis. Each cell:
Viterbi Algorithm
Viterbi backtrace

Hidden Markov Models for Part-of-Speech Tagging

Part-of-speech tagging
8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.). Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, POS. We'll use POS most frequently. I'll assume that you all know what these are.

POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adj          purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those

POS Tagging example
WORD:  the  koala  put  the  keys  on  the  table
TAG:   DET  N      V    DET  N     P   DET  N

POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word. (These examples from Dekang Lin.)

POS tagging as a sequence classification task
We are given a sentence (an “observation” or “sequence of observations”): Secretariat is expected to race tomorrow; She promised to back the bill. What is the best sequence of tags that corresponds to this sequence of observations? Probabilistic view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Getting to HMM
We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest:
t̂1…t̂n = argmax over t1…tn of P(t1…tn | w1…wn)
The hat ^ means “our estimate of the best one”; argmaxx f(x) means “the x such that f(x) is maximized”.

Getting to HMM
This equation is guaranteed to give us the best tag sequence, but how do we make it operational? How do we compute this value? Intuition of Bayesian classification: use Bayes' rule to transform it into a set of other probabilities that are easier to compute.
Using Bayes Rule: P(t1…tn | w1…wn) = P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn); the denominator is the same for every tag sequence, so it can be dropped inside the argmax.
Likelihood and prior: we maximize P(w1…wn | t1…tn) (the likelihood) times P(t1…tn) (the prior).

Two kinds of probabilities (1)
Tag transition probabilities p(ti | ti−1): determiners are likely to precede adjectives and nouns (That/DT flight/NN; The/DT yellow/JJ hat/NN), so we expect P(NN | DT) and P(JJ | DT) to be high, but P(DT | JJ) to be low. Compute P(NN | DT) by counting in a labeled corpus:
P(NN | DT) = C(DT, NN) / C(DT)
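The counting can be sketched as below; the tiny tagged "corpus" is made up purely for illustration:

```python
from collections import Counter

# A made-up labeled corpus of (word, tag) pairs, for illustration only:
tagged = [("the", "DT"), ("flight", "NN"), ("the", "DT"), ("yellow", "JJ"),
          ("hat", "NN"), ("a", "DT"), ("plane", "NN")]

tags = [t for _, t in tagged]
bigrams = Counter(zip(tags, tags[1:]))   # C(t_{i-1}, t_i)
unigrams = Counter(tags[:-1])            # C(t_{i-1}) as a left context

def p_transition(t2, t1):
    """Maximum-likelihood estimate P(t2 | t1) = C(t1, t2) / C(t1)."""
    return bigrams[(t1, t2)] / unigrams[t1]

print(p_transition("NN", "DT"))  # C(DT,NN) / C(DT)
```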
Two kinds of probabilities (2)
Word likelihood probabilities p(wi | ti): VBZ (3sg pres verb) is likely to be “is”. Compute P(is | VBZ) by counting in a labeled corpus:
P(is | VBZ) = C(VBZ, is) / C(VBZ)
POS tagging: likelihood and prior
1/5/07

An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
So we (correctly) choose the verb reading.

Transitions between the hidden states of the HMM, showing the A probs
B observation likelihoods for the POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
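The two products from the “race” slide can be checked directly:

```python
# Scores for the two readings of "race" after "to" and before "tomorrow"/NR,
# using the probabilities quoted on the slide:
vb_score = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
nn_score = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)

print(vb_score)  # about 2.7e-07, matching the slide's .00000027
print(nn_score)  # about 3.2e-10, matching the slide's .00000000032
assert vb_score > nn_score  # the verb reading wins
```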
Viterbi intuition: we are looking for the best ‘path’
[Trellis figure: columns S1 to S5 of candidate tags (JJ, DT, VB, NNP, NN, RB, TO, …) over the words “promised to back the bill”]
Slide from Dekang Lin

Viterbi example
Another Application of HMMs
Gene Finding

The Central Dogma
DNA (CCTGAGCCAACTATTGATGAA) → transcription → RNA (CCUGAGCCAACUAUUGAUGAA) → translation → Protein (PEPTIDE)
Slide from Serafim Batzoglou

Gene structure
exon1 intron1 exon2 intron2 exon3 → transcription → splicing → translation
exon = protein-coding; intron = non-coding
Slide from Serafim Batzoglou

Codon: a triplet of nucleotides that is converted to one amino acid.

Finding Genes in Yeast
[Figure: 5′ Intergenic → Start codon (ATG) → Coding → Stop codon (TAG/TGA/TAA) → Intergenic 3′; mean coding length about 1500 bp (500 codons); Transcript]
Slide from Serafim Batzoglou

Introns: The Bane of ORF Scanning
[Figure: 5′ Intergenic → Start codon (ATG) → Exon → splice sites → Intron → Exon → Intron → Exon → Stop codon (TAG/TGA/TAA) → Intergenic 3′; Transcript]
Slide from Serafim Batzoglou

Needles in a Haystack
Slide from Serafim Batzoglou

Hidden Markov Models for Gene Finding
[Figure: Intergene State, First Exon State, and Intron State over the sequence GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA, segmented Intergenic → Exon → Intron → Exon → Intron → Exon → Intergenic]
Slide from Serafim Batzoglou

Outline
Markov Chains
Hidden Markov Models
Three Algorithms for HMMs: The Forward Algorithm, The Viterbi Algorithm, The Baum-Welch (EM) Algorithm
Applications: The Ice Cream Task, Part-of-Speech Tagging, Biology: Gene Finding
Next time: Named Entity Tagging