jurafsky&martin_3rdEd_17 (1).pdf

Depending on the application, tokenization algorithms may also tokenize multiword expressions like New York or rock ’n’ roll as a single token, which requires a multiword expression dictionary of some sort. Tokenization is thus intimately tied up with named entity detection, the task of detecting names, dates, and organizations (Chapter 20).

One commonly used tokenization standard is known as the Penn Treebank tokenization standard, used for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC), the source of many useful datasets. This standard separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words together, and separates out all punctuation:

Input: “The San Francisco-based restaurant,” they said, “doesn’t charge $10”.
Output: The San Francisco-based restaurant , they said , does n’t charge $ 10 .

Tokens can also be normalized, in which a single normalized form is chosen for words with multiple forms like USA and US or uh-huh and uhhuh. This standardization may be valuable, despite the spelling information that is lost in the normalization process. For information retrieval, we might want a query for US to match a document that has USA; for information extraction we might want to extract coherent information that is consistent across differently-spelled instances.

Case folding is another kind of normalization. For tasks like speech recognition and information retrieval, everything is mapped to lower case. For sentiment analysis and other text classification tasks, information extraction, and machine translation, by contrast, case is quite helpful and case folding is generally not done: losing the difference, for example, between US the country and us the pronoun can outweigh the advantage in generality that case folding provides.
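To make the clitic- and punctuation-splitting behavior concrete, here is a minimal sketch of a Treebank-style tokenizer built from a few substitution rules. This is an illustration of the idea, not the official Penn Treebank tokenizer script: the rule set (the `n't` split, the short clitic list, the punctuation class) is deliberately small and invented for this example.

```python
import re

def ptb_tokenize(text):
    """A toy Treebank-style tokenizer: splits clitics and punctuation,
    keeps hyphenated words together. Illustrative only, not the official
    PTB tokenizer (which also rewrites quotes as `` and '')."""
    # Normalize curly quotes to straight quotes for simplicity.
    text = re.sub(r"[\u201c\u201d]", '"', text)
    text = re.sub(r"[\u2018\u2019]", "'", text)
    # Split the n't clitic: doesn't -> does n't
    text = re.sub(r"(\w+)n't\b", r"\1 n't", text)
    # Split other common clitics: they're -> they 're, book's -> book 's
    text = re.sub(r"(\w+)'(s|re|ve|ll|d|m)\b", r"\1 '\2", text)
    # Separate punctuation and $ into their own tokens;
    # hyphens are untouched, so Francisco-based stays one token.
    text = re.sub(r'([",.!?;:$])', r' \1 ', text)
    return text.split()

toks = ptb_tokenize('"The San Francisco-based restaurant," they said, '
                    '"doesn\'t charge $10".')
print(toks)
```

Note how each rule handles one ambiguity of the apostrophe or punctuation; real tokenizers compile many such rules into a single efficient pass, as the next paragraph describes.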
In practice, since tokenization needs to be run before any other language processing, it is important for it to be very fast. The standard method for tokenization/normalization is therefore to use deterministic algorithms based on regular expressions compiled into very efficient finite state automata. Carefully designed deterministic algorithms can deal with the ambiguities that arise, such as the fact that the apostrophe needs to be tokenized differently when used as a genitive marker (as in the book’s cover), a quotative as in ‘The other class’, she said, or in clitics like they’re. We’ll discuss this use of automata in Chapter 3.

2.3.3 Word Segmentation in Chinese: the MaxMatch algorithm

Some languages, including Chinese, Japanese, and Thai, do not use spaces to mark potential word boundaries, and so require alternative segmentation methods. In Chinese, for example, words are composed of characters known as hanzi. Each character generally represents a single morpheme and is pronounceable as a single syllable.
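The MaxMatch algorithm named in this section's title can be sketched as greedy longest-prefix matching against a wordlist: starting at the beginning of the string, take the longest dictionary word that matches a prefix of the remaining input, commit to it, and repeat; if no word matches, take a single character. The sketch below uses an invented English toy dictionary (segmenting real Chinese text would need a hanzi wordlist), which also shows the greedy algorithm's characteristic failure mode.

```python
def max_match(sentence, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word matching a prefix of the remaining input,
    falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest possible substring first, shrinking toward
        # a single character, which is always accepted as a fallback.
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

toy_dict = {"theme", "the", "men", "dine", "here"}
# Greedy matching grabs "theme" first, mis-segmenting the rest:
print(max_match("themendinehere", toy_dict))
# -> ['theme', 'n', 'dine', 'here'], not the intended the/men/dine/here
```

Because the greedy choice of "theme" is locally longest but globally wrong, MaxMatch works much better for Chinese (where words are short) than for space-free English.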
