jurafsky&martin_3rdEd_17 (1).pdf


and internal (I) parts of each chunk, as well as those elements of the input that are outside (O) any chunk. Under this scheme, the size of the tagset is (2n + 1), where n is the number of categories to be classified. The following example shows the bracketing notation of (12.4) on page 206 reframed as a tagging task:

(12.7) The B_NP morning I_NP flight I_NP from B_PP Denver B_NP has B_VP arrived I_VP

The same sentence with only the base-NPs tagged illustrates the role of the O tags.

(12.8) The B_NP morning I_NP flight I_NP from O Denver B_NP has O arrived. O

Notice that there is no explicit encoding of the end of a chunk in this scheme; the end of any chunk is implicit in any transition from an I or B to a B or O tag. This encoding reflects the notion that when sequentially labeling words, it is generally easier (at least in English) to detect the beginning of a new chunk than it is to know when a chunk has ended. Not surprisingly, a variety of other tagging schemes represent chunks in subtly different ways, including some that explicitly mark the end of constituents. Tjong Kim Sang and Veenstra (1999) describe three variations on this basic tagging scheme and investigate their performance on a variety of chunking tasks.

Given such a scheme, building a chunker consists of training a classifier to label each word of an input sentence with one of the IOB tags from the tagset. Of course, training requires training data consisting of the phrases of interest delimited and marked with the appropriate category. The direct approach is to annotate a representative corpus. Unfortunately, annotation efforts can be both expensive and time consuming. It turns out that the best place to find such data for chunking is in an existing treebank such as the Penn Treebank described in Chapter 11. Such treebanks provide a complete parse for each corpus sentence, allowing base syntactic phrases to be extracted from the parse constituents.
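The IOB encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the function names (`build_tagset`, `to_iob`, `extract_chunks`), the `B_NP`-style tag strings, and the use of `None` as the label for material outside any chunk are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# A segment is (category, words); category None means "outside any chunk".
Segment = Tuple[Optional[str], List[str]]


def build_tagset(categories: List[str]) -> List[str]:
    """One B tag and one I tag per category, plus a single O tag: 2n + 1 tags."""
    tags = []
    for cat in categories:
        tags.append(f"B_{cat}")
        tags.append(f"I_{cat}")
    tags.append("O")
    return tags


def to_iob(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Turn bracketed segments into (word, tag) pairs."""
    tagged = []
    for label, words in segments:
        for i, word in enumerate(words):
            if label is None:
                tag = "O"
            elif i == 0:
                tag = f"B_{label}"  # first word of a chunk
            else:
                tag = f"I_{label}"  # remaining words of the chunk
            tagged.append((word, tag))
    return tagged


def extract_chunks(tagged: List[Tuple[str, str]]) -> List[Segment]:
    """Recover the chunks from (word, tag) pairs, dropping O words.

    No tag marks the end of a chunk: a chunk is closed when the next
    tag is a B tag or an O tag, or when the sentence ends.
    Assumption: an I tag that does not continue an open chunk of the
    same category is treated as the start of a new chunk.
    """
    chunks: List[Segment] = []
    label: Optional[str] = None
    words: List[str] = []
    for word, tag in tagged:
        if tag.startswith("I_") and words and tag[2:] == label:
            words.append(word)
            continue
        if words:  # the transition closes the open chunk
            chunks.append((label, words))
            label, words = None, []
        if tag != "O":
            label, words = tag[2:], [word]
    if words:
        chunks.append((label, words))
    return chunks
```

With the three categories used in (12.7), NP, PP, and VP, `build_tagset` returns seven tags, in line with the 2n + 1 count. Calling `to_iob` on the segments of (12.7) and then `extract_chunks` on the result gives back the original chunks, which shows that the implicit end-of-chunk convention loses no information.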
To find the phrases we’re interested in, we just need to know the appropriate non-terminal names in the corpus. Finding chunk boundaries requires finding the head and then including the material to the left of the head, ignoring the text to the right. This is somewhat error-prone since it relies on the accuracy of the head-finding rules described in Chapter 11.

Having extracted a training corpus from a treebank, we must now cast the training data into a form that’s useful for training classifiers. In this case, each input can be represented as a set of features extracted from a context window that surrounds the word to be classified. Using a window that extends two words before and two words after the word being classified seems to provide reasonable performance. Features extracted from this window include the words themselves, their parts-of-speech, and the chunk tags of the preceding inputs in the window.
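The window-based feature extraction described above might look like the following sketch. The function name `window_features`, the feature-name format, and the `<PAD>` placeholder for positions beyond the sentence boundary are assumptions made for illustration; the text specifies only the window size and the three kinds of features.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def window_features(
    words: List[str],
    pos_tags: List[str],
    prev_chunk_tags: List[str],
    i: int,
    size: int = 2,
) -> Dict[str, str]:
    """Features for classifying word i from a window of +/- `size` words.

    `prev_chunk_tags` holds the chunk tags already assigned to the words
    before position i; only those inside the window are used.
    """
    features: Dict[str, str] = {}
    # Words and parts-of-speech for the whole window, including word i.
    for offset in range(-size, size + 1):
        j = i + offset
        inside = 0 <= j < len(words)
        features[f"word[{offset:+d}]"] = words[j] if inside else PAD
        features[f"pos[{offset:+d}]"] = pos_tags[j] if inside else PAD
    # Chunk tags exist only for the preceding positions in the window.
    for offset in range(-size, 0):
        j = i + offset
        features[f"chunk[{offset:+d}]"] = prev_chunk_tags[j] if j >= 0 else PAD
    return features


if __name__ == "__main__":
    words = ["The", "morning", "flight", "from", "Denver", "has", "arrived"]
    # Penn Treebank style part-of-speech tags, supplied here for illustration.
    pos = ["DT", "NN", "NN", "IN", "NNP", "VBZ", "VBN"]
    assigned = ["B_NP", "I_NP", "I_NP"]  # tags for the words before "from"
    for name, value in window_features(words, pos, assigned, i=3).items():
        print(name, value)
```

Each call yields one training instance; pairing it with the correct IOB tag for word i gives the classifier its input and label. At prediction time the preceding chunk tags come from the classifier's own earlier decisions, which is why only positions to the left of the current word contribute chunk-tag features.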