s10-indexing

s10-indexing - Indexing & Tolerant Dictionaries...

Indexing & Tolerant Dictionaries

The pdf image slides are from Hinrich Schütze's slides.
[Image: L'Homme qui marche, Alberto Giacometti (sold for $104M)]

Sanity Check/Review

What is the vector similarity of a pair of documents that have Jaccard similarity 0?
What happens to the cosine (vector) similarity of a document d w.r.t. a query q if
    d is doubled in size?
    the first half of d is doubled in size?
    the words in d are randomly permuted?
Which of the measures, tf or idf, is local to the document, and which is a global property of the corpus?

Efficient Retrieval

Document-term matrix

        t1    t2    ...   tj    ...   tm    nf
d1      w11   w12   ...   w1j   ...   w1m   1/|d1|
d2      w21   w22   ...   w2j   ...   w2m   1/|d2|
...
di      wi1   wi2   ...   wij   ...   wim   1/|di|
...
dn      wn1   wn2   ...   wnj   ...   wnm   1/|dn|

wij is the weight of term tj in document di.
Most wij's will be zero.

Naïve retrieval

Consider query q = (q1, q2, ..., qj, ..., qm), nf = 1/|q|.
How to evaluate q (i.e., compute the similarity between q and every document)?

Method 1: Compare q with every document directly.

document data structure:
    di : ((t1, wi1), (t2, wi2), ..., (tj, wij), ..., (tm, wim), 1/|di|)
    Only terms with positive weights are kept.
    Terms are in alphabetic order.
query data structure:
    q : ((t1, q1), (t2, q2), ..., (tj, qj), ..., (tm, qm), 1/|q|)

Naïve retrieval

Method 1: Compare q with documents directly (cont.)

Algorithm
    initialize all sim(q, di) = 0;
    for each document di (i = 1, ..., n) {
        for each term tj (j = 1, ..., m)
            if tj appears in both q and di
                sim(q, di) += qj * wij;
        sim(q, di) = sim(q, di) * (1/|q|) * (1/|di|);
    }
    sort documents in descending similarities and display the top k to the user;

Observation

Method 1 is not efficient: it needs to access most non-zero entries in the document-term matrix.

Solution: Inverted Index

A data structure to permit fast searching.
Like an index in the back of a textbook.
Key words -> page numbers.
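Method 1 above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the names `norm_factor` and `naive_retrieval`, the dictionary representation of documents, and the toy weights are all assumptions made for the example. Each document and the query are stored sparsely as {term: weight}, which mirrors "only terms with positive weights are kept".

```python
import math

def norm_factor(weights):
    """Return the normalization factor 1/|v| for a sparse vector {term: weight}."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return 1.0 / length if length > 0 else 0.0

def naive_retrieval(query, docs, k):
    """Method 1: compare the query with every document, return the top k."""
    q_nf = norm_factor(query)
    scores = []
    for doc_id, weights in docs.items():
        # Accumulate qj * wij over the terms that appear in both q and di.
        dot = sum(q_w * weights[t] for t, q_w in query.items() if t in weights)
        # Apply both normalization factors to obtain the cosine similarity.
        scores.append((doc_id, dot * q_nf * norm_factor(weights)))
    # Sort documents in descending order of similarity.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Hypothetical toy collection (weights chosen arbitrarily for illustration).
docs = {
    "d1": {"index": 2.0, "file": 1.0},
    "d2": {"query": 3.0},
    "d3": {"index": 1.0, "query": 1.0},
}
query = {"index": 1.0}
top = naive_retrieval(query, docs, k=2)
print(top)
```

Note that the outer loop visits every document, including `d2`, which shares no term with the query. That wasted work is what the inverted index is meant to avoid.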
E.g., the index entry
    precision    40, 55, 60-63, 89, 220
    (lexicon)    (occurrences)

Search Processing (Overview)

Lexicon search
    E.g., looking in the index to find the entry
Retrieval of occurrences
    Seeing where the term occurs
Manipulation of occurrences
    Going to the right page

Inverted Files

FILE (POS is the position of the first word on each line):

POS   FILE
1     A file is a list of words by position
10    First entry is the word in position 1 (first word)
20    Entry 4562 is the word in position 4562 (4562nd word)
30    Last entry is the last word
36    An inverted file is a list of positions by word!

INVERTED FILE:

a           (1, 4, 40)
entry       (11, 20, 31)
file        (2, 38)
list        (5, 41)
position    (9, 16, 26)
positions   (43)
word        (14, 19, 24, 29, 35, 45)
words       (7)
4562        (21, 27)

Inverted Files for Multiple Documents

Lexicon:

WORD         NDOCS   PTR
jezebel      20
jezer        3
jezerit      1
jeziah       1
jeziel       1
jezliah      1
jezoar       1
jezrahliah   1
jezreel      39

Occurrence records shown in the figure:

107   4   322 354 381 405
232   6   15 195 248 1897 1951 2192
677   1   481
713   3   42 312 802

34    6   1 118 2087 3922 3981 5002
44    3   215 2291 3010
56...
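The inverted-file idea above ("a list of positions by word") can be sketched as follows. This is a minimal illustration under stated assumptions, not the slides' implementation: the function names `invert_file`, `invert_collection`, and `lexicon` are hypothetical, words are assumed to be already tokenized and lowercased, and the nested dictionary stands in for the on-disk lexicon and pointer structure.

```python
from collections import defaultdict

def invert_file(words):
    """Map each word to the 1-based positions where it occurs in one file."""
    inverted = defaultdict(list)
    for position, word in enumerate(words, start=1):
        inverted[word].append(position)
    return dict(inverted)

def invert_collection(docs):
    """Build {word: {doc_id: [positions]}} from a collection {doc_id: [words]}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for word, positions in invert_file(words).items():
            index[word][doc_id] = positions
    return dict(index)

def lexicon(index):
    """Return alphabetically sorted (word, ndocs) pairs, like WORD / NDOCS."""
    return [(word, len(postings)) for word, postings in sorted(index.items())]

# Hypothetical two-document collection, tokenized by whitespace.
docs = {
    1: "a file is a list of words".split(),
    2: "an inverted file is a list of positions".split(),
}
index = invert_collection(docs)

print(invert_file(docs[1])["a"])   # positions of "a" in document 1
print(index["file"])               # documents and positions for "file"
print(lexicon(index))              # (word, ndocs) pairs
```

With this layout, answering a query term means one lexicon lookup followed by a scan of only that term's occurrences, rather than a pass over every document.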