124.11.lec2

Lecture 2
Tokenization/Segmentation
Minimum Edit Distance
CS 124/LINGUIST 180 From Languages to Information
Thanks to Chris Manning and Serafim
6/1/11

Outline
  Tokenization
    Word Tokenization
    Normalization
    Lemmatization and stemming
    Sentence Tokenization
  Minimum Edit Distance
    Levenshtein distance
    Needleman-Wunsch
    Smith-Waterman

Tokenization
  For
    Information retrieval
    Information extraction (detecting named entities, etc.)
    Spell-checking
  3 tasks
    Segmenting/tokenizing words in running text
    Normalizing word formats
    Segmenting sentences in running text
  Why not just periods and white-space?
    Mr. Sherwood said reaction to Sea Containers’ proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
    “I said, ‘what’re you? Crazy?’ ” said

What’s a word?
  I do uh main- mainly business data processing
    Fragments
    Filled pauses
  Are cat and cats the same word?
  Some terminology
    Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
      Cat and cats = same lemma
    Wordform: the full inflected surface form.

How many words?
  they lay back on the San Francisco grass and looked at the stars
    13 tokens (or 12)
    12 types (or 11)
  The Switchboard corpus of American telephone conversation:
    2.4 million wordform tokens
    ~20,000 wordform types
  Brown et al (1992) large corpus of text
    583 million wordform tokens
    293,181 wordform types
  Shakespeare:
    884,647 wordform tokens
    31,534 wordform types
  Let N = number of tokens, V = vocabulary = number of types
  General wisdom: V > O(sqrt(N))

Issues in Tokenization
  Finland’s capital: Finland? Finlands? Finland’s?
  what’re, I’m, isn’t -> What are, I am, is not
  Hewlett-Packard -> Hewlett and Packard as two tokens?
  state-of-the-art: Break up?
  lowercase, lower-case, lower case?
  San Francisco, New York: one token or two?
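The "Why not just periods and white-space?" and "How many words?" slides can be made concrete with a small sketch. The regular expression below is an illustrative assumption, not the tokenizer used in the course: the abbreviation list is a tiny hypothetical one, and the names `TOKEN_RE`, `tokenize`, `merge_phrase`, and `count_tokens_and_types` are invented for this example. It keeps "Mr." and prices such as "$62.625" together, keeps internal apostrophes and hyphens inside a word, and then counts tokens and types for the San Francisco sentence.

```python
import re

# Illustrative pattern (an assumption, not the lecture's tokenizer).
# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(
    r"""
    \b(?:Mr|Mrs|Ms|Dr)\.        # tiny hypothetical abbreviation list
  | \$?\d+(?:[.,]\d+)*          # numbers and prices: 62.5, $62.625, 6,000
  | \w+(?:[-'’]\w+)*            # words, keeping internal hyphens/apostrophes
  | [^\w\s]                     # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split running text into a list of token strings."""
    return TOKEN_RE.findall(text)


def merge_phrase(tokens, phrase):
    """Join each occurrence of the word sequence `phrase` into one token."""
    out, i, n = [], 0, len(phrase)
    while i < len(tokens):
        if tokens[i:i + n] == phrase:
            out.append(" ".join(phrase))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def count_tokens_and_types(tokens):
    """Return (N, V): number of tokens and number of distinct types."""
    return len(tokens), len(set(tokens))


sentence = "they lay back on the San Francisco grass and looked at the stars"
tokens = tokenize(sentence)
print(count_tokens_and_types(tokens))                                  # (13, 12)
print(count_tokens_and_types(merge_phrase(tokens, ["San", "Francisco"])))  # (12, 11)
print(tokenize("Sea Containers closed at $62.625, up 62.5 cents."))
```

The two counts match the slide: 13 tokens and 12 types when "San" and "Francisco" are separate tokens (only "the" repeats), and 12 tokens and 11 types when "San Francisco" is treated as one token. Splitting on white-space alone would instead leave "$62.625," and "cents." with punctuation attached.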
Issues in Tokenization (continued)
  Words with punctuation
Slide from [attribution lost in extraction]

Tokenization: language issues
  French
    L'ensemble: one token or two?
      L? L’? Le?
      Want l’ensemble to match with un ensemble
  German noun compounds are not segmented
    Lebensversicherungsgesellschaftsangestellter
    ‘life insurance company employee’
    German retrieval systems benefit greatly from a compound splitter module
Slide from [attribution lost in extraction]

Tokenization: language issues
  Chinese and Japanese: no spaces between words
    [Chinese example sentence; characters garbled in extraction]
    Sharapova now lives in US southeastern Florida
  Further complicated in Japanese, with multiple alphabets intermingled
    Dates/amounts in multiple formats
    [Japanese example sentence; characters garbled in extraction. Legible fragments: 500, $500K, 6,000]
    Katakana, Hiragana, Kanji, Romaji
  End-user can express query entirely in hiragana!...
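To illustrate what a compound splitter module might do with the German example above, here is a minimal dictionary-based sketch. It is an assumption for illustration only, not the splitter any particular retrieval system uses: the four-word `LEXICON`, the `LINKERS` tuple (an optional linking "s" between parts), and the function name `split_compound` are all hypothetical.

```python
from functools import lru_cache

# Hypothetical toy lexicon; a real splitter would use a large word list.
LEXICON = frozenset({"leben", "versicherung", "gesellschaft", "angestellter"})
# Allowed linking elements between parts: none, or a linking "s".
LINKERS = ("", "s")


def split_compound(word, lexicon=LEXICON):
    """Split `word` into lexicon entries; return [word] if no split is found."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        """Return a tuple of parts covering word[start:], or None."""
        if start == len(word):
            return ()
        # Try the longest candidate part first.
        for end in range(len(word), start, -1):
            part = word[start:end]
            if part not in lexicon:
                continue
            for linker in LINKERS:
                if not word.startswith(linker, end):
                    continue
                nxt = end + len(linker)
                if linker and nxt == len(word):
                    continue  # a linking element must be followed by a part
                rest = solve(nxt)
                if rest is not None:
                    return (part,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else [word]


print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
# ['leben', 'versicherung', 'gesellschaft', 'angestellter']
```

The four parts line up with the gloss on the slide: life, insurance, company, employee. Indexing the parts alongside the full compound is one way a query for a part could match the compound; words not covered by the lexicon are returned unsplit.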

This document was uploaded on 06/01/2011.

