class03-textrep

Our assumptions so far
We know what a document is.
We know what a term is.
In reality, it can be complex.
Today, we'll look at how we define and process the vocabulary of terms in a collection.

Recall basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
Tokenizer -> token stream: Friends Romans Countrymen
Linguistic modules -> modified tokens: friend roman countryman
Indexer -> inverted index: friend, roman, countryman (postings as extracted: 2 4 2 13 16 1)

Parsing a document
What format is it in? pdf/word/excel/html?
What language is it in?
What character set is in use?
Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically.

Complications: Format/language
Documents being indexed can include docs from many different languages.
A single index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple languages/formats.
French email with a German pdf attachment.
What is a unit document?
A file?
An email? (Perhaps one of many in an mbox.)
An email with 5 attachments?
A group of files (PPT or LaTeX in HTML)

Tokenization
Input: Friends, Romans and Countrymen
Output: Tokens
Friends
Romans
Countrymen
Each such token is now a candidate for an index entry, after further processing (described below).
But what are valid tokens to emit?

Why tokenization is difficult -- even in English
Example: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
Tokenize this sentence.

One word or two? (or several)
Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
cheap San Francisco-Los Angeles fares
York University vs. New York University

Numbers
3/12/91
12/3/91
Mar 12, 1991
B-52
100.2.86.144
(800) 234-2333
800.234.2333

Chinese: No whitespace

Ambiguous segmentation in Chinese
The same characters can be treated as one word meaning "monk" or as two words meaning "and" and "still".

Tokenization: Language issues
Chinese and Japanese have no spaces between words:...
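
The indexing pipeline recalled above (tokenizer, linguistic modules, indexer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the letters-only token rule, and the toy lookup table standing in for the linguistic modules are all invented here, and the second document is made up so that a postings list has more than one entry.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: a token is a maximal run of ASCII letters; punctuation is dropped.
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    # Stand-in for the "linguistic modules" step: lowercase, then consult a
    # toy lookup table instead of a real stemmer or lemmatizer.
    toy_lemmas = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}
    lowered = token.lower()
    return toy_lemmas.get(lowered, lowered)


def build_inverted_index(docs):
    # docs maps docID -> raw text; the result maps term -> sorted list of docIDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Document 1 is the slide's example; document 2 is hypothetical.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_inverted_index(docs)
print(index["friend"])      # [1, 2]
print(index["countryman"])  # [1]
```

Each stage is a separate function so that a different tokenizer or normalizer can be swapped in without touching the indexer.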
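
To make the "difficult even in English" point concrete, the sketch below runs three naive tokenizers over the O'Neill sentence, written here with its apostrophes since that is where the difficulty lies. The three rules and their function names are assumptions chosen for illustration; none of them is presented as the right answer, and each handles O'Neill, boys', Chile's and aren't differently.

```python
import re

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."


def whitespace_tokens(text):
    # Split on whitespace only; punctuation stays glued to words ("Mr.", "amusing.").
    return text.split()


def alnum_tokens(text):
    # Treat every non-alphanumeric character as a separator ("O", "Neill", "aren", "t").
    return re.findall(r"[A-Za-z0-9]+", text)


def apostrophe_tokens(text):
    # Keep an apostrophe only when letters follow it inside a word
    # ("O'Neill", "Chile's", "aren't"), so the trailing one in "boys'" is dropped.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*", text)


for tokenizer in (whitespace_tokens, alnum_tokens, apostrophe_tokens):
    print(tokenizer.__name__, tokenizer(sentence))
```

Whichever rule is chosen, the same rule has to be applied to queries, otherwise a query for O'Neill will not match the indexed tokens.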
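
The segmentation ambiguity can also be shown mechanically: given a lexicon, enumerate every way of splitting an unspaced string into lexicon words. The sketch below is a minimal illustration, not a real segmenter. The string 和尚 ("monk", or 和 "and" followed by 尚 "still") is an assumption about which example the slide showed, since the characters do not appear in the text above, and the three-entry lexicon is a toy.

```python
def segmentations(text, lexicon):
    # Return every way to split text into consecutive words found in lexicon.
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (assumed example): one two-character word and two one-character words.
toy_lexicon = {"和尚": "monk", "和": "and", "尚": "still"}

for words in segmentations("和尚", toy_lexicon):
    print(words, [toy_lexicon[w] for w in words])
```

Both segmentations are valid against the lexicon, so a tokenizer needs context or statistics beyond dictionary lookup to choose between them.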