class03-textrep


Our assumptions so far
• We know what a document is
• We know what a term is
• In reality, it can be complex
• Today, we’ll look at how we define and process the vocabulary of terms in a collection

Recall basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer → Token stream: Friends Romans Countrymen
→ Linguistic modules → Modified tokens: friend roman countryman
→ Indexer → Inverted index: friend, roman, countryman (each term with its postings list of document IDs)

Parsing a document
• What format is it in?
  • pdf/word/excel/html?
• What language is it in?
• What character set is in use?
Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically …

Complications: Format/language
• Documents being indexed can include docs from many different languages
  • A single index may have to contain terms of several languages.
• Sometimes a document or its components can contain multiple languages/formats
  • French email with a German pdf attachment.
• What is a unit document?
  • A file?
  • An email? (Perhaps one of many in an mbox.)
  • An email with 5 attachments?
  • A group of files (PPT or LaTeX in HTML)

Tokenization
• Input: “Friends, Romans and Countrymen”
• Output: Tokens
  • Friends
  • Romans
  • Countrymen
• Each such token is now a candidate for an index entry, after further processing
  • Described below
• But what are valid tokens to emit?

Why tokenization is difficult -- even in English
• Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
• Tokenize this sentence

One word or two? (or several)
• Hewlett-Packard
• State-of-the-art
• co-education
• the hold-him-back-and-drag-him-away maneuver
• data base
• San Francisco
• Los Angeles-based company
• cheap San Francisco-Los Angeles fares
• York University vs. New York University

Numbers
• 3/12/91
• 12/3/91
• Mar 12, 1991
• B-52
• 100.2.86.144
• (800) 234-2333
• 800.234.2333

Chinese: No whitespace

Ambiguous segmentation in Chinese
• Can be treated as one word meaning “monk” or as two words meaning “and” and “still”

Tokenization: Language issues
• Chinese and Japanese have no spaces between words:...
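
The basic indexing pipeline recalled above (tokenizer, linguistic modules, indexer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: the function names `tokenize`, `normalize`, and `build_index` are hypothetical, the tokenizer is a deliberately naive regular expression, and the small `LEMMAS` table is a stand-in for real linguistic modules such as a stemmer or lemmatizer.

```python
import re
from collections import defaultdict

# Hypothetical lookup table standing in for real linguistic modules
# (a stemmer or lemmatizer would normally do this work).
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Naive tokenizer: keep runs of ASCII letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(token):
    """Lowercase the token, then map it through the toy lemma table."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_index(docs):
    """Build an inverted index: term -> sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Assumed toy collection; the second document is made up for illustration.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
print(index["friend"])      # [1, 2]
print(index["countryman"])  # [1]

# The naive tokenizer makes one arbitrary choice per hard case in the
# O'Neill sentence: it drops the period after "Mr" and keeps apostrophes
# inside tokens.
sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."
print(tokenize(sentence))
```

Every choice this regular expression makes is debatable, which is the point of the slide: "O'Neill" stays whole, "aren't" is not split into "are" and "n't", and a hyphenated form such as "Hewlett-Packard" would be broken into two tokens. It also assumes whitespace- and punctuation-delimited ASCII text, so it does nothing useful for Chinese or Japanese.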