jurafsky&martin_3rdEd_17 (1).pdf

(words not in the dictionary) or genres that differ a lot from the assumptions made by the dictionary builder. The most accurate Chinese segmentation algorithms generally use statistical sequence models trained via supervised machine learning on hand-segmented training sets; we’ll introduce sequence models in Chapter 10.

2.3.4 Lemmatization and Stemming

Lemmatization is the task of determining that two words have the same root, despite their surface differences. The words am, are, and is have the shared lemma be; the words dinner and dinners both have the lemma dinner. Representing a word by its lemma is important for web search, since we want to find pages mentioning woodchucks if we search for woodchuck. This is especially important in morphologically complex languages like Russian, where for example the word Moscow has different endings in the phrases Moscow, of Moscow, from Moscow, and so on. Lemmatizing each of these forms to the same lemma will let us find all mentions of Moscow. The lemmatized form of a sentence like He is reading detective stories would thus be He be read detective story.

How is lemmatization done? The most sophisticated methods for lemmatization involve complete morphological parsing of the word. Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes. Two broad classes of morphemes can be distinguished: stems (the central morpheme of the word, supplying the main meaning) and affixes (adding “additional” meanings of various kinds). So, for example, the word fox consists of one morpheme (the morpheme fox) and the word cats consists of two: the morpheme cat and the morpheme -s. A morphological parser takes a word like cats and parses it into the two morphemes cat and s, or a Spanish word like amaren (‘if in the future they would love’) into the morphemes amar ‘to love’, 3PL, and future subjunctive. We’ll introduce morphological parsing in Chapter 3.
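
To make the examples above concrete, here is a minimal sketch of lemmatization by table lookup, together with a toy split of a word into stem and affix. It is not the full morphological parsing described in the text: the table LEMMAS, the stem list STEMS, and the functions lemmatize and parse are hypothetical names written only for this illustration, and they cover just the words mentioned in this section.

```python
# Hypothetical, hand-written tables covering only the words used in this section.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "dinners": "dinner",
    "reading": "read",
    "stories": "story",
}
STEMS = {"cat", "fox"}


def lemmatize(sentence):
    """Replace each whitespace-separated token with its lemma, if one is listed."""
    return " ".join(LEMMAS.get(tok.lower(), tok) for tok in sentence.split())


def parse(word):
    """Toy morphological split: a known stem, optionally followed by the affix -s."""
    if word in STEMS:
        return [word]
    if word.endswith("s") and word[:-1] in STEMS:
        return [word[:-1], "-s"]
    return [word]  # unknown words are returned unanalyzed


print(lemmatize("He is reading detective stories"))  # He be read detective story
print(parse("fox"))   # ['fox']
print(parse("cats"))  # ['cat', '-s']
```

A lookup table like this only handles forms that someone has listed in advance, which is one reason the more sophisticated methods rely on morphological parsing instead.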
The Porter Stemmer

While using finite-state transducers to build a full morphological parser is the most general way to deal with morphological variation in word forms, we sometimes make use of simpler but cruder chopping off of affixes. This naive version of morphological analysis is called stemming, and one of the most widely used stemming algorithms is the simple and efficient Porter (1980) algorithm. The Porter stemmer applied to the following paragraph:

This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes.

produces the following stemmed output:

Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note

The algorithm is based on a series of rewrite rules run in series, as a cascade, in ...
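
The idea of a cascade of rewrite rules can be sketched with two passes modeled on steps 1a and 1c of Porter (1980). This is a simplified illustration, not the full algorithm: the function names step_1a, step_1c, and stem are hypothetical, the remaining steps are omitted (so, for example, accurate would not be reduced to accur), and the vowel test assumes that only a, e, i, o, u count as vowels.

```python
def step_1a(word):
    """Plural rules, tried in order; the first matching suffix wins."""
    rules = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))
    for suffix, replacement in rules:
        if word.lower().endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def step_1c(word):
    """Rewrite a final y as i when the rest of the word contains a vowel.

    Simplifying assumption: only a, e, i, o, u are treated as vowels here.
    """
    rest = word[:-1]
    if word.lower().endswith("y") and any(ch in "aeiou" for ch in rest.lower()):
        return rest + "i"
    return word


# The cascade: each pass receives the output of the pass before it.
CASCADE = (step_1a, step_1c)


def stem(word):
    for rule in CASCADE:
        word = rule(word)
    return word


print([stem(w) for w in "This was not the map".split()])
# ['Thi', 'wa', 'not', 'the', 'map']
print(stem("Billy"), stem("copy"), stem("things"), stem("crosses"))
# Billi copi thing cross
```

Because the rules only inspect suffixes, they produce non-words such as Thi and wa, which matches the crude behavior visible in the stemmed output above.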