lecture7-vectorspace-handout-6-per

Introduction to Information Retrieval

Sec. 7.1.6: Why use random sampling
  - Fast
  - Leaders reflect data distribution

Sec. 7.1.6: General variants
  - Have each follower attached to b1 = 3 (say) nearest leaders.
  - From query, find b2 = 4 (say) nearest leaders and their followers.
  - Can recurse on leader/follower construction.

Sec. 7.1.6: Exercises
  - To find the nearest leader in step 1, how many cosine computations do we do?
  - Why did we have √N in the first place?
  - What is the effect of the constants b1, b2 on the previous slide?
  - Devise an example where this is likely to fail, i.e., we miss one of the K nearest docs.
      - Likely under random sampling.

Sec. 6.1: Parametric and zone indexes
  - Thus far, a doc has been a sequence of terms
  - In fact documents have multiple parts, some with special semantics:
      - Author
      - Title
      - Date of publication
      - Language
      - Format
      - etc.
  - These constitute the metadata about a document

Sec. 6.1: Fields
  - We sometimes wish to search by these metadata
      - E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
  - Year = 1601 is an example of a field
  - Also, author last name = shakespeare, etc.
  - Field or parametric index: postings for each field value
      - Sometimes build range trees (e.g., for dates)
  - Field query typically treated as conjunction
      - (doc must be authored by shakespeare)

Sec. 6.1: Zone
  - A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
      - Title
      - Abstract
      - References …
  - Build inverted indexes on zones as well to permit querying
  - E.g., find docs with merchant in the title zone and matching the query gentle rain
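The leader/follower variant with constants b1 and b2 can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes documents are plain numeric vectors, picks roughly √N leaders by random sampling (as the exercises suggest), and uses hypothetical function names (cosine, build_clusters, search) and a fixed seed chosen only for repeatability.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_clusters(docs, b1=3, seed=0):
    """Sample about sqrt(N) leaders at random; attach each follower to its b1 nearest leaders."""
    rng = random.Random(seed)  # assumption: fixed seed so runs are repeatable
    n = len(docs)
    leaders = rng.sample(range(n), max(1, round(math.sqrt(n))))
    leader_set = set(leaders)
    followers = {leader: [] for leader in leaders}
    for d in range(n):
        if d in leader_set:
            continue
        nearest = sorted(leaders, key=lambda l: cosine(docs[d], docs[l]), reverse=True)
        for leader in nearest[:b1]:
            followers[leader].append(d)
    return followers


def search(docs, followers, query, k=2, b2=4):
    """Score only the b2 nearest leaders and their followers, then return the top k doc ids."""
    nearest = sorted(followers, key=lambda l: cosine(query, docs[l]), reverse=True)[:b2]
    candidates = set(nearest)
    for leader in nearest:
        candidates.update(followers[leader])
    ranked = sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]


# Toy usage: 16 unit vectors at distinct angles, queried near the fourth one.
docs = [(math.cos(0.1 * i), math.sin(0.1 * i)) for i in range(16)]
followers = build_clusters(docs, b1=3)
print(search(docs, followers, (math.cos(0.33), math.sin(0.33)), k=2, b2=4))
```

Larger b1 and b2 enlarge the candidate set, so fewer of the true K nearest docs are missed at the cost of more cosine computations. In this toy run b2 equals the number of leaders, so the search happens to be exhaustive; with a smaller b2 a true neighbor can be pruned away.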

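The field (parametric) and zone indexes described above can be sketched the same way. This is a rough illustration under stated assumptions: postings are kept as Python sets keyed by (field, value) or (zone, term), every query condition is treated as a conjunction, and the free-text query is assumed to be matched term by term against a "body" zone. The function names and the toy records (including their years) are hypothetical, and range trees for dates are left out.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build a field index {(field, value): doc ids} and a zone index {(zone, term): doc ids}."""
    field_index = defaultdict(set)
    zone_index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc["fields"].items():
            field_index[(field, value)].add(doc_id)
        for zone, text in doc["zones"].items():
            for term in text.lower().split():
                zone_index[(zone, term)].add(doc_id)
    return field_index, zone_index


def query(field_index, zone_index, fields=None, zone_terms=None):
    """Return doc ids satisfying every field condition and every (zone, term) condition."""
    postings = []
    for field, value in (fields or {}).items():
        postings.append(field_index.get((field, value), set()))
    for zone, term in (zone_terms or []):
        postings.append(zone_index.get((zone, term.lower()), set()))
    if not postings:
        return set()
    # Conjunction: intersect all postings lists.
    return set.intersection(*postings)


# Toy records, for illustration only.
docs = {
    1: {"fields": {"author": "shakespeare", "year": 1601},
        "zones": {"title": "hamlet", "body": "alas poor yorick"}},
    2: {"fields": {"author": "shakespeare", "year": 1596},
        "zones": {"title": "the merchant of venice", "body": "the gentle rain from heaven"}},
    3: {"fields": {"author": "anonymous", "year": 1601},
        "zones": {"title": "a merchant ledger", "body": "heavy rain"}},
}
field_index, zone_index = build_indexes(docs)

# Docs authored by shakespeare in 1601 containing "alas poor yorick".
print(query(field_index, zone_index,
            fields={"author": "shakespeare", "year": 1601},
            zone_terms=[("body", "alas"), ("body", "poor"), ("body", "yorick")]))

# Docs with "merchant" in the title zone that also match "gentle rain".
print(query(field_index, zone_index,
            zone_terms=[("title", "merchant"), ("body", "gentle"), ("body", "rain")]))
```

Keying postings by (zone, term) keeps one dictionary per index type. An alternative design stores a single postings list per term and records the zone inside each posting, which trades a smaller dictionary for slightly more work at query time.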