5 - Tokenization.pdf - Information Retrieval, Riphah Institute for Computing and Applied Sciences, Dr. Ayesha Kashif


Information Retrieval
Riphah Institute for Computing and Applied Sciences
Dr. Ayesha Kashif
3/31/2019

Topics
  • Document delineation
  • Tokenization
  • Terms: the things indexed in an IR system
  • Stemming and lemmatization
  • Faster postings merges
  • Phrase queries
  • Biword indexes
  • Positional indexes
Recall the basic indexing pipeline
  • Documents to be indexed: Friends, Romans, countrymen.
  • Tokenizer produces the token stream: Friends  Romans  Countrymen
  • Linguistic modules produce the modified tokens: friend  roman  countryman
  • Indexer produces the inverted index:
      friend      2, 4
      roman       1, 2
      countryman  13, 16

Parsing a document (Sec. 2.1)
  • What format is it in? PDF, Word, Excel, HTML?
  • What language is it in?
  • What character set is in use? (CP1252, UTF-8, …)
  • Documents being indexed can include docs from many different languages
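
The pipeline above can be sketched in a few lines of Python. This is only a toy illustration, not the method used in the course: the function names, the tiny IRREGULAR lookup table, and the two sample documents are all assumptions made for the example, and the "linguistic module" here is a crude stand-in for real stemming or lemmatization.

```python
import re
from collections import defaultdict

# Hypothetical lookup for irregular plurals; a real system would use a
# proper stemmer or lemmatizer instead.
IRREGULAR = {"countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: emit maximal runs of ASCII letters (deliberately naive)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Toy linguistic module: lowercase, then undo simple plurals."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def build_index(docs):
    """Indexer: map each normalized term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Made-up sample collection (document IDs are arbitrary).
docs = {1: "Friends, Romans, countrymen.", 2: "Roman friends."}
index = build_index(docs)
```

Each stage mirrors one box of the pipeline: tokenize yields the token stream, normalize yields the modified tokens, and build_index collects the postings lists.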
Tokenization (Sec. 2.2.1)
  • Input: “Friends, Romans and Countrymen”
  • Output: Tokens
      • Friends
      • Romans
      • Countrymen
  • A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
  • Each such token is now a candidate for an index entry, after further processing (described below).
  • But what are valid tokens to emit?

Tokenization (Sec. 2.2.1)
  • Issues in tokenization:
      • Finland’s capital: Finland AND s? Finlands? Finland’s?
      • Hewlett-Packard: Hewlett and Packard as two tokens?
          • state-of-the-art: break up hyphenated sequence.
          • co-education
          • lowercase, lower-case, lower case?
          • It can be effective to get the user to put in possible hyphens
      • San Francisco: one token or two?
          • How do you decide it is one token?
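
The choices listed above can be made concrete with a small configurable tokenizer. This is a minimal sketch under stated assumptions: the function name and its two options (split_hyphens and apostrophe) are hypothetical, invented here only to show how each policy changes the emitted tokens.

```python
def tokenize(text, split_hyphens=False, apostrophe="keep"):
    """Whitespace tokenizer with explicit policies (illustrative only).

    apostrophe: "keep"  -> Finland's
                "split" -> Finland, s
                "drop"  -> Finlands
    split_hyphens: if True, Hewlett-Packard -> Hewlett, Packard
    """
    tokens = []
    for raw in text.split():
        # Treat the typographic apostrophe like the ASCII one.
        word = raw.replace("\u2019", "'").strip(".,;:!?\"()")
        if not word:
            continue
        if apostrophe == "drop":
            parts = [word.replace("'", "")]
        elif apostrophe == "split":
            parts = word.split("'")
        else:
            parts = [word]
        for part in parts:
            if split_hyphens:
                tokens.extend(p for p in part.split("-") if p)
            elif part:
                tokens.append(part)
    return tokens

print(tokenize("Finland's capital"))
print(tokenize("Finland's capital", apostrophe="split"))
print(tokenize("Finland's capital", apostrophe="drop"))
print(tokenize("Hewlett-Packard", split_hyphens=True))
```

Note that "San Francisco" always comes out as two tokens here. Treating it as one would need extra knowledge, such as a list of known multiword names, which this sketch does not attempt.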
Numbers
  • 3/20/91    Mar. 12, 1991    20/3/91
  • 55 B.C.
  • B-52
  • My PGP key is 324a3df234cb23e
  • (800) 234-2333
      • Often have embedded spaces
      • Older IR systems may not index numbers
          • But often very useful: think about things like looking up error codes/stack traces on the web
          • (One answer is using n-grams: short subsequences of characters)
      • Will often index “metadata” separately
          • Creation date, format, etc.
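
The n-gram idea mentioned above can be illustrated as follows. This is a hedged sketch, not a prescribed design: the helper name char_ngrams and the choice of n = 3 are assumptions. The point is that a fragment of an identifier shares all of its character n-grams with the full string, so it can be matched even when the full string was never indexed as a single token.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, lowercased."""
    text = text.lower()
    if len(text) < n:
        # Too short to form an n-gram: keep the whole string, if any.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

key_grams = set(char_ngrams("324a3df234cb23e"))
query_grams = set(char_ngrams("3df234"))

# Every trigram of the query fragment also occurs in the full key,
# so an index over trigrams can retrieve the key from the fragment.
print(query_grams <= key_grams)
print(char_ngrams("B-52"))
```

The trade-off is index size: every string of length k contributes k - n + 1 n-grams rather than a single term.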

  • Spring '19
  • Dr. Ayesha Kashif
