Introduction to Information Retrieval, Lecture 12: Clustering (handout, 6 per page)



…aren't length normalized
  Need a notion of similarity/distance
  How many clusters?
    Fixed a priori?
    Completely data driven?
  Example: yippy.com – grouping search results
  Avoid trivial clusters – too large or small
    If a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
  Ideal: semantic similarity.
  Practical: term-statistical similarity
    We will use cosine similarity (a sketch follows these slides).
    Docs as vectors.
  For many algorithms, easier to think in terms of a distance (rather than a similarity) between docs.
  We will mostly speak of Euclidean distance
    But real implementations use cosine similarity

Clustering Algorithms
  Flat algorithms
    Usually start with a random (partial) partitioning
    Refine it iteratively
    K-means clustering (sketch below)
    (Model-based clustering)
  Hierarchical algorithms
    Bottom-up, agglomerative (sketch below)
    (Top-down, divisive)
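Since cosine similarity carries the rest of the lecture, a minimal sketch may help; the toy vocabulary and term counts below are hypothetical, not from the handout. It also connects the two distance bullets: for length-normalized vectors, ||u - v||^2 = 2(1 - cos(u, v)), so ranking docs by Euclidean distance and by cosine similarity agree.

import math

def cosine_similarity(u, v):
    """Cosine of the angle between term vectors u and v (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term counts over a toy vocabulary [cluster, search, engine, result]
doc1 = [3, 1, 0, 2]
doc2 = [1, 0, 0, 1]
print(cosine_similarity(doc1, doc2))  # ~0.94: the two docs point in similar directions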
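The flat-algorithm recipe on the last slide (start from a random partial partitioning, refine iteratively) is exactly K-means. A minimal sketch, assuming plain Euclidean points; the function and variable names and the toy data are illustrative, not from the handout.

import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial seeds
    for _ in range(iters):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        for j, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)  # two centroids, one near each group of points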
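And a minimal bottom-up (agglomerative) sketch: every document starts in its own cluster, and the two closest clusters are merged repeatedly. The single-link cluster distance and the one-dimensional toy "documents" are assumptions for brevity; the slide does not fix a linkage criterion.

def single_link(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(docs, target_k):
    clusters = [[d] for d in docs]  # every doc starts in its own cluster
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-link distance
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

print(agglomerate([1.0, 1.1, 4.0, 4.2, 9.0], target_k=2))
# [[1.0, 1.1, 4.0, 4.2], [9.0]]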
