HMM Tagging problem 5/14/08 10:50 PM

HMM Tagging Problem: Part I
http://www-rohan.sdsu.edu/~gawron/compling/course_core/assignments/hmm_tagging_problem.htm

Complexity issues have reared their ugly heads again, and with the IPO date of your new comp ling startup fast approaching, you have discovered that if your hot new system is going to parse sentences as long as 4 words, you had better limit yourself to a 3-word vocabulary. Consider the following HMM tagger for tagging texts constructed with a 3-word vocabulary:

1. station
2. ground
3. control

and a tagset of 2 tags:

1. V [Verb]
2. N [Noun]

Here is a partial state table for the HMM:

States: Start, V = Verb, N = Noun

    State   Observation   Transition    Transition Probability
    Start   ground        Start => N    Pr(N | Start) * Pr(ground | N) = .5 * .4 = .20
    Start   ground        Start => V    Pr(V | Start) * Pr(ground | V) = .5 * .3 = .15
    Start   control       Start => N    to be filled in
    Start   control       Start => V    to be filled in
    Start   station       Start => N    to be filled in
    Start   station       Start => V    to be filled in
    V       ground        V => N        Pr(N | V) * Pr(ground | N) = .9 * .4 = .36
    V       ground        V => V        Pr(V | V) * Pr(ground | V) = .1 * .3 = .03
    V       control       V => N        to be filled in
    V       control       V => V        to be filled in
    V       station       V => N        to be filled in
    V       station       V => V        to be filled in
    N       ground        N => N        Pr(N | N) * Pr(ground | N) = .5 * .4 = .20
    N       ground        N => V        Pr(V | N) * Pr(ground | V) = .5 * .3 = .15
    N       control       N => N        to be filled in
    N       control       N => V        to be filled in
    N       station       N => N        to be filled in
    N       station       N => V        to be filled in

The table is incomplete. The states are all present, and all the transitions with non-zero probabilities are present, but the transition probabilities for control and station have been left out.

Our HMM tagger always starts in the Start state. Since states always correspond to tags in our model, this corresponds to the assumption that the previous tag at the start of input is always Start.

You can use this tagger to assign the most likely tag sequence to any sequence of words taken from our small vocabulary. Consider the following input to our tagger:

    ground ground

There are 4 different state sequences that will accept this input:

1. Start V V
2. Start V N
3. Start N N
4. Start N V

These correspond to 4 different assignments of part-of-speech tags to the two input words (ground ground). Here is each word aligned above its transition and transition probability, for sequence 2:

                 ground         ground
    Start   =>   V         =>   N
                 (.3 * .5)      (.4 * .9)

So this corresponds to the path in which the first occurrence of ground is labeled a verb, and the second a noun.

Let's review where these transition probabilities come from. The probability model being used is the following:

$$P(w_{1,n}, t_{1,n}) = \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})$$

That is, the joint probability of a word sequence n words long and a tag sequence n + 1 tags long is equal to the product of the probabilities of each word given its tag times the probability of each tag given the previous tag. There is one extra tag because we take the tag at t=0 to be Start. For each $w_i$, then, we get a factor:

$$P(w_i \mid t_i) \, P(t_i \mid t_{i-1})$$

For our example, according to this probability model we calculate the joint probability to be:

    ground                          ground
    Pr(ground|V) * Pr(V | Start)  * Pr(ground|N) * Pr(N|V)
    .3           * .5             * .4           * .9        = .054

This is the product of the transition probabilities for this path through the HMM. There are three others. To find the most likely assignment of tags we need to find the most probable path through the HMM. This is what the Viterbi algorithm is for.

Problem 3 Proper
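Before turning to the parts of the problem, the worked example for the input ground ground can be checked mechanically. The following is a minimal sketch, not the course's reference code: the names `EMIT`, `TRANS`, `path_probability`, `best_path_brute_force`, and `viterbi` are made up for illustration, and only the probabilities quoted in the partial state table for ground are used. It scores all 4 paths by brute force and then finds the same best path with a small Viterbi-style pass that keeps only the best path into each state.

```python
from itertools import product

# Probabilities quoted in the partial state table (entries for "ground" only).
EMIT = {("ground", "V"): 0.3, ("ground", "N"): 0.4}
TRANS = {
    ("Start", "V"): 0.5, ("Start", "N"): 0.5,
    ("V", "V"): 0.1, ("V", "N"): 0.9,
    ("N", "V"): 0.5, ("N", "N"): 0.5,
}
TAGS = ("V", "N")


def path_probability(words, tags):
    """Joint probability: product of P(w_i | t_i) * P(t_i | t_{i-1}), with t_0 = Start."""
    prob, prev = 1.0, "Start"
    for word, tag in zip(words, tags):
        prob *= EMIT[(word, tag)] * TRANS[(prev, tag)]
        prev = tag
    return prob


def best_path_brute_force(words):
    """Score every possible tag sequence and return (probability, tags) of the best."""
    candidates = (
        (path_probability(words, tags), tags)
        for tags in product(TAGS, repeat=len(words))
    )
    return max(candidates, key=lambda pair: pair[0])


def viterbi(words):
    """Return (probability, tags) of the best path, keeping one best path per state."""
    # best[state] = (probability of the best path ending in state, tags on that path)
    best = {"Start": (1.0, ())}
    for word in words:
        step = {}
        for tag in TAGS:
            step[tag] = max(
                (
                    (p * TRANS[(prev, tag)] * EMIT[(word, tag)], path + (tag,))
                    for prev, (p, path) in best.items()
                ),
                key=lambda pair: pair[0],
            )
        best = step
    return max(best.values(), key=lambda pair: pair[0])
```

Called on `["ground", "ground"]`, both functions pick the tag sequence V N with a probability of about .054, in agreement with the hand calculation in the worked example.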
Part A

The above HMM was given with only a partial probability model. Here is the entire probability model:

Pr(w_i | t_i)

    w                ground   control   station
    Pr(w | N)        .4       .3        .3
    Pr(w | V)        .3       .3        .4
    Pr(w | Start)    0        0         0

Pr(t_i | t_i-1)

    t_i-1            N             V             Start
    Pr(V | t_i-1)    Pr(V | N)     Pr(V | V)     Pr(V | Start)
                     .5            .1            .5
    Pr(N | t_i-1)    Pr(N | N)     Pr(N | V)     Pr(N | Start)
                     .5            .9            .5

The first part of the problem is to use this probability model to complete the transition table for the above HMM tagger by filling in the transition probabilities for control and station.

Part B

The second part of the tagging problem is to tag the following input:

    ground control station

This can be done by computing the products of the transition probabilities (called the path probabilities) for all 8 paths through the HMM (2 tags for each of the 3 words) and choosing the most probable path. But the assigned problem is to choose the most probable path by using the Viterbi algorithm. To help you get started, here is the partial Viterbi matrix for our HMM and the given input:

             t=0     t=1      t=2       t=3
                     ground   control   station
    V
    N
    Start    1.0

Note that the Viterbi values for t=0 have already been filled in. Continue the matrix and fill in the values for t=1, t=2, and t=3. Show your calculations. Use the Viterbi homework assignment and Viterbi lecture as your model of what to show.

Part C

Using the results of your Viterbi calculation, give the most probable state sequence through the HMM:

HMM Tagging Problem: Part II
Write a program that produces a probability model for an HMM bigram tagger using a tagged corpus. NOTE: You are NOT being asked to write a tagger, just a program that produces the probability model such a tagger uses. To help you out, here are some models to modify:

1. Baseline tagger (Perl, Python)
2. New! New! New! Interactive Python session executing relevant bits of Python code!

This tagger executes the "baseline" strategy. For each word it assigns the most frequent tag for that word. Here I am training and testing the baseline tagger:
    [tagger]$ tagger data/train.tag data/test.txt > tr_test1.tag
    Reading data/train.tag
    Finding most common tags
    Reading data/test.txt

train.tag is a file with tagged data in it. The first few lines look like this:

    FACTSHEET_NN1 WHAT_DTQ IS_VBZ AIDS_NN1 ?_?
    AIDS_NN1 (_( Acquired_NP0 Immune_AJ0 Deficiency_NN1 Syndrome_NP0 )_) is_VBZ a_AT0 condition_NN1 caused_VVN by_PRP a_AT0 virus_NN1 called_VVD HIV_NP0
    How_AVQ is_VBZ infection_NN1 transmitted_VVD ?_?
    through_PRP unprotected_AJ0 sexual_AJ0 intercourse_NN1 with_PRP an_AT0 infected_AJ0 partner_NN1 ._.
    through_PRP infected_AJ0 blood_NN1 or_CJC blood_NN1 products_NN2 ._.
    from_PRP an_AT0 infected_AJ0 mother_NN1 to_PRP her_DPS baby_NN1 ._.

Each word is connected to its tag by an underscore ("_"), so you need to separate these two, keep count of how many times each word and tag co-occur, and keep track of tag "bigrams" as well. The data and code you need can be found on bulba under:
    /home/ling581/hmm_tagger

Here's a description of the DATA:

    File                          Type          Description
    data/train.tag                training      tagged training data; train on this!
    data/really_tiny_train.tag    training      very small subset of tagged training data; use this only for debugging the training phase!
    data/test.txt                 test          untagged test data; run your baby on this!
    data/test.tag                 test          tagged test data; gold standard for test.txt; evaluate your baby's performance with this!
    data/train.txt                test          the data of train.tag untagged: run your tagger on this and do real well! also: for running your tagger without unknown words
    data/tiny_train.txt           test          tiny untagged subset of training data, the size of the test data files; faster max performance test
    data/tiny_train.tag           test          gold standard for tiny_train.txt
    data/tiny_test.txt            test          subset of test.txt, but a tiny amount, for debugging
    data/tiny_test.tag            test          gold standard for tiny_test.txt
    data/valid.tag                development   tagged development test data
    data/valid.txt                development   untagged development test data

The corpora are all line-by-line corpora. This means that, as much as possible, each line contains a complete sentence or a complete fragment. It also means adjacent lines are not guaranteed to be meaningfully related. This is good. It means that in training and testing you can process these corpora on a line-by-line basis, which is easier in many programming languages, including Perl and Python. The large training file is also here.

Hint about example code: For this assignment all you need to pay attention to is the first part of the code, the training step. That part of the code ends here in the Python code:

    fsock_train.close()

You need to hand in a proper HMM probability model for the corpus train.tag. That will consist of the following:

1. Word-tag model: For each word-tag pair, the probability of the word given the tag.
2. Tag-tag model: For each tag-tag pair (t1,t2), the probability of t2 given t1.

Your probability model should be output to a file which you will hand in.
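The two models just listed can be estimated from counts over the tagged lines. The sketch below is one way to do it under stated assumptions, not the required solution: it uses plain relative frequencies with no smoothing, it restarts from a `"Start"` pseudo-tag on every line (in the spirit of Part I and of the line-by-line corpora), and the function names `count_tagged_lines` and `relative_frequencies` are invented for illustration. It also assumes every token contains at least one underscore.

```python
from collections import Counter


def count_tagged_lines(lines):
    """Count word/tag pairs and tag bigrams in lines of word_TAG tokens."""
    word_tag = Counter()    # (word, tag) -> count
    tag_tag = Counter()     # (previous tag, tag) -> count
    tag_total = Counter()   # tag -> times it tags a word
    prev_total = Counter()  # tag -> times it is the previous tag (includes Start)
    for line in lines:
        prev = "Start"  # assumption: each line starts from the Start pseudo-tag
        for token in line.split():
            # Split on the LAST underscore so tokens such as "?_?" or "(_(" work.
            word, tag = token.rsplit("_", 1)
            word_tag[(word, tag)] += 1
            tag_tag[(prev, tag)] += 1
            tag_total[tag] += 1
            prev_total[prev] += 1
            prev = tag
    return word_tag, tag_tag, tag_total, prev_total


def relative_frequencies(word_tag, tag_tag, tag_total, prev_total):
    """Turn the counts into P(word | tag) and P(tag | previous tag)."""
    p_word = {(w, t): c / tag_total[t] for (w, t), c in word_tag.items()}
    p_tag = {(t1, t2): c / prev_total[t1] for (t1, t2), c in tag_tag.items()}
    return p_word, p_tag
```

A typical use would read the training file line by line, pass the lines to `count_tagged_lines`, and feed the four counters to `relative_frequencies`; whether unsmoothed estimates are good enough for the assignment is a separate question this sketch does not settle.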
The format is the following. For the word-tag model:

    word tag prob

Each line contains just the word, the tag, and the probability, in that order, separated by nothing but white space. For the tag-tag model the format is:

    tag tag prob

The two models should come in the order given above, word-tag model followed by tag-tag model, and they should be separated by a line containing the following:

    ***END WORD TAG MODEL***

Here is some Python code illustrating how to output stuff to a file:

    import sys

    # out_file_s (the output file name) and word_count (a dictionary)
    # are assumed to be defined earlier in the program.
    print('Writing to %s' % out_file_s, file=sys.stderr)  # Message to stderr
    with open(out_file_s, 'w') as fsock_out:              # Open file for writing
        # Walk through the (key, value) pairs of the dictionary
        for word, count in word_count.items():
            # Print each pair to the file, separated by a tab
            print('%s\t%s' % (word, count), file=fsock_out)
    # Leaving the with block closes the file handle (good citizenship!)
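Putting the format rules above together, a model file could be produced roughly as follows. This is a hedged sketch: `model_lines` and `write_model` are hypothetical helper names, the two dictionaries are assumed to map `(word, tag)` and `(tag, tag)` pairs to probabilities, and sorting the entries is only a convenience, since the assignment does not prescribe an order within each model.

```python
SEPARATOR = "***END WORD TAG MODEL***"


def model_lines(p_word, p_tag):
    """Yield the lines of the model file: word-tag entries, separator, tag-tag entries."""
    for (word, tag), prob in sorted(p_word.items()):
        yield f"{word} {tag} {prob}"
    yield SEPARATOR
    for (t1, t2), prob in sorted(p_tag.items()):
        yield f"{t1} {t2} {prob}"


def write_model(path, p_word, p_tag):
    """Write the model to the file at path, one entry per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in model_lines(p_word, p_tag):
            out.write(line + "\n")
```

Keeping the line formatting in a generator separate from the file writing makes it easy to inspect the output in an interactive session before anything touches the disk.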