jurafsky&martin_3rdEd_17 (1).pdf


There are some minor differences in the tagsets used by the corpora. For example, in the WSJ and Brown corpora, the single Penn tag TO is used for both the infinitive to (I like to race) and the preposition to (go to the store), while in the Switchboard corpus the tag TO is reserved for the infinitive use of to, and the preposition use is tagged IN:

    Well/UH ,/, I/PRP ,/, I/PRP want/VBP to/TO go/VB to/IN a/DT restaurant/NN

Finally, there are some idiosyncrasies inherent in any tagset. For example, because the Penn 45 tags were collapsed from a larger 87-tag tagset, the original Brown tagset, some potentially useful distinctions were lost. The Penn tagset was designed for a treebank in which sentences were parsed, and so it leaves off syntactic information recoverable from the parse tree. Thus, for example, the Penn tag IN is used for both subordinating conjunctions like if, when, unless, after:

    after/IN spending/VBG a/DT day/NN at/IN the/DT beach/NN

and prepositions like in, on, after:

    after/IN sunrise/NN

Tagging algorithms assume that words have been tokenized before tagging. The Penn Treebank and the British National Corpus split contractions and the 's-genitive from their stems:

    would/MD n't/RB
    children/NNS 's/POS

Indeed, the special Treebank tag POS is used only for the morpheme 's, which must be segmented off during tokenization.

Another tokenization issue concerns multipart words. The Treebank tagset assumes that tokenization of words like New York is done at whitespace. The phrase a New York City firm is tagged in Treebank notation as five separate words: a/DT New/NNP York/NNP City/NNP firm/NN. The C5 tagset for the British National Corpus, by contrast, allows prepositions like "in terms of" to be treated as a single word by adding numbers to each tag, as in in/II31 terms/II32 of/II33.
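The word/TAG notation and the clitic splitting described above can be illustrated with a short Python sketch. The function names parse_tagged and split_clitics are hypothetical, chosen only for this example, and the splitting rule is a deliberate simplification rather than the actual Treebank tokenizer: it handles only n't and 's, and it cannot tell the genitive 's from the contracted verb in it's, which a tagger would have to distinguish later.

```python
import re


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    rpartition splits on the last slash, so a token such as ',/,' still
    yields the pair (',', ',').
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def split_clitics(word):
    """Simplified split of n't and 's from the stem (illustrative only)."""
    if re.fullmatch(r".+n't", word):
        return [word[:-3], word[-3:]]
    if re.fullmatch(r".+'s", word):
        return [word[:-2], word[-2:]]
    return [word]


# The Switchboard example from the text: "to" receives two different tags.
sentence = ("Well/UH ,/, I/PRP ,/, I/PRP want/VBP to/TO go/VB "
            "to/IN a/DT restaurant/NN")
to_tags = [tag for word, tag in parse_tagged(sentence) if word == "to"]
print(to_tags)                      # ['TO', 'IN']

print(split_clitics("wouldn't"))    # ['would', "n't"]
print(split_clitics("children's"))  # ['children', "'s"]
```

Splitting on the last slash rather than the first keeps punctuation tokens such as ,/, intact, which matters because the tagsets assign tags to punctuation as well as to words.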
10.3 Part-of-Speech Tagging

Part-of-speech tagging (tagging for short) is the process of assigning a part-of-speech marker to each word in an input text. Because tags are generally also applied to punctuation, tokenization is usually performed before, or as part of, the tagging process: separating commas, quotation marks, etc., from words and disambiguating end-of-sentence punctuation (period, question mark, etc.) from part-of-word punctuation (such as in abbreviations like e.g. and etc.).

The input to a tagging algorithm is a sequence of words and a tagset, and the output is a sequence of tags, a single best tag for each word, as shown in the examples on the previous pages.

Tagging is a disambiguation task; words are ambiguous (they have more than one possible part-of-speech), and the goal is to find the correct tag for the situation. For example, the word book can be a verb (book that flight) or a noun (as in hand me that book).

That can be a determiner (Does that flight serve dinner) or a complementizer (I thought that your flight was earlier). The problem of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context. Part-of-speech tagging is
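To make the ambiguity concrete, the following Python sketch enumerates the tag sequences a tagger would have to choose among for book that flight. The lexicon CANDIDATE_TAGS is a hypothetical toy covering only the readings mentioned in the text (with the complementizer reading of that written as IN, assuming Penn conventions), and the function candidate_sequences is likewise invented for illustration. It does no disambiguation; it only shows the size of the choice.

```python
from itertools import product

# Hypothetical toy lexicon: only the readings discussed in the text.
CANDIDATE_TAGS = {
    "book": ["VB", "NN"],    # verb (book that flight) or noun (that book)
    "that": ["DT", "IN"],    # determiner or complementizer
    "flight": ["NN"],
}


def candidate_sequences(words):
    """Enumerate every tag sequence the toy lexicon allows for the words."""
    options = [CANDIDATE_TAGS[w.lower()] for w in words]
    return [list(seq) for seq in product(*options)]


words = ["book", "that", "flight"]
sequences = candidate_sequences(words)
print(len(sequences))  # 4
for seq in sequences:
    print(" ".join(f"{w}/{t}" for w, t in zip(words, seq)))
```

Only one of the four sequences, book/VB that/DT flight/NN, fits the imperative reading of the sentence. The number of candidates is the product of the per-word counts, so it grows quickly with sentence length, which is why taggers rely on context rather than enumeration.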