class14 - Today Introduction Clustering in IR K-means...

Today
  Introduction
  Clustering in IR
  K-means
  Evaluation
  How many clusters (K)?

What is clustering?
  Clustering is the process of grouping a set of documents into clusters of similar documents.
  Documents within a cluster should be similar.
  Documents from different clusters should be dissimilar.
  Clustering is the most common form of unsupervised learning.
  Unsupervised: there are no labeled or annotated data.
Data set with clear cluster structure
  How would you design an algorithm for finding these three clusters?

Clustering vs. Classification
  Clustering: unsupervised learning
  Classification: supervised learning
  Classification: Classes are human-defined and input to the learning algorithm.
  Clustering: Clusters are inferred from the data without human input.
  However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...
Clustering in IR

Result set clustering for better navigation
  Example: Clusty
Global clustering for improved navigation
  Example: Google News

Visualizing a document collection
Clustering for improving recall
  Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs.
  Therefore, to improve search recall:
    Cluster docs in the corpus a priori.
    When a query matches a doc d, also return other docs in the cluster containing d.
  The hope is that the query “car” will then also return docs containing “automobile”,
  because clustering grouped together docs containing “car” with those containing “automobile”. Why?

Issues for clustering
  Representation for clustering
    Document representation: vector space? Normalization?
    Need a notion of similarity/distance
  How many clusters?
    Fixed a priori?
    Completely data driven?
    Avoid “trivial” clusters that are too large or too small.
    In an application, if a cluster is too large, then for navigation purposes you have wasted an extra user click without whittling down the set of documents much.
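
The recall idea from the cluster hypothesis slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name expand_with_clusters, the toy documents, and the precomputed doc_to_cluster mapping are all hypothetical, and query matching is reduced to an exact term lookup.

```python
from collections import defaultdict


def expand_with_clusters(matched_docs, doc_to_cluster):
    """Return the matched docs plus every doc sharing a cluster with one of them."""
    # Invert the doc -> cluster mapping so each cluster lists its members.
    cluster_to_docs = defaultdict(set)
    for doc, cluster in doc_to_cluster.items():
        cluster_to_docs[cluster].add(doc)

    expanded = set(matched_docs)
    for doc in matched_docs:
        expanded |= cluster_to_docs[doc_to_cluster[doc]]
    return expanded


# Hypothetical toy corpus; d1 and d2 are assumed to have been clustered together a priori.
docs = {
    "d1": "car repair tips",
    "d2": "automobile maintenance guide",
    "d3": "banana bread recipe",
}
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1}

# Naive matching: a doc matches if it contains the query term exactly.
matched = {doc for doc, text in docs.items() if "car" in text.split()}
result = expand_with_clusters(matched, doc_to_cluster)
```

Here the query term "car" matches only d1, but d2 is returned as well because it shares a cluster with d1. Whether this helps in practice depends on the clustering actually grouping such documents together, which is the question the slide leaves open.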
Flat vs. Hierarchical Clustering
  Flat algorithms
    Usually start with a random (partial) partitioning of docs into groups
    Refine iteratively
    Main algorithm: K-means
  Hierarchical algorithms
    Create a hierarchy
    Bottom-up, agglomerative
    Top-down, divisive

Hard vs. Soft Clustering
  Hard clustering: Each document belongs to exactly one cluster.
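
The "start from a random choice, then refine iteratively" pattern of flat algorithms might look like the following sketch of standard K-means. The slides name K-means but do not describe it in this excerpt, so the details here are assumptions: seeding with K randomly sampled points, squared Euclidean distance, and the names kmeans, squared_distance and mean. It produces a hard clustering, with each point assigned to exactly one cluster.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumed seeding: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the refinement has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignment, centroids


# Hypothetical 2-D data with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
```

The cluster labels themselves are arbitrary and depend on the seeds; what matters is which points end up sharing a label. Real document clustering would use high-dimensional term vectors, and often a different similarity measure, rather than 2-D points.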