*This preview shows
pages
1–12. Sign up
to
view the full content.*

This
** preview**
has intentionally

**sections.**

*blurred***to view the full version.**

*Sign up*This
** preview**
has intentionally

**sections.**

*blurred***to view the full version.**

*Sign up*This
** preview**
has intentionally

**sections.**

*blurred***to view the full version.**

*Sign up*This
** preview**
has intentionally

**sections.**

*blurred***to view the full version.**

*Sign up*This
** preview**
has intentionally

**sections.**

*blurred***to view the full version.**

*Sign up*This
** preview**
has intentionally

**sections.**

*blurred***to view the full version.**

*Sign up*
**Unformatted text preview: **1 Clustering Algorithms Applications Hierarchical Clustering k -Means Algorithms CURE Algorithm 2 The Problem of Clustering ◆ Given a set of points, with a notion of distance between points, group the points into some number of clusters , so that members of a cluster are in some sense as close to each other as possible. 3 Example x x x x x x x x x x x x x x x x xx x x x x x x x x x x x x x x x x x x x x x x x 4 Problems With Clustering ◆ Clustering in two dimensions looks easy. ◆ Clustering small amounts of data looks easy. ◆ And in most cases, looks are not deceiving. 5 The Curse of Dimensionality ◆ Many applications involve not 2, but 10 or 10,000 dimensions. ◆ High-dimensional spaces look different: almost all pairs of points are at about the same distance. 6 Example : Curse of Dimensionality ◆ Assume random points within a bounding box, e.g., values between 0 and 1 in each dimension. ◆ In 2 dimensions: a variety of distances between 0 and 1.41. ◆ In 10,000 dimensions, the difference in any one dimension is distributed as a triangle. 7 Example – Continued ◆ The law of large numbers applies. ◆ Actual distance between two random points is the sqrt of the sum of squares of essentially the same set of differences. 8 Example High-Dimension Application: SkyCat ◆ A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands). ◆ Problem : cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc. ◆ Sloan Sky Survey is a newer, better version. 9 Example : Clustering CD’s (Collaborative Filtering) ◆ Intuitively : music divides into categories, and customers prefer a few categories. ◗ But what are categories really? ◆ Represent a CD by the customers who bought it. ◆ Similar CD’s have similar sets of customers, and vice-versa. 10 The Space of CD’s ◆ Think of a space with one dimension for each customer. ◗ Values in a dimension may be 0 or 1 only. ◆ A CD’s point in this space is ( x 1 , x 2 ,…, x k ), where x i = 1 iff the i th customer bought the CD. ◗ Compare with boolean matrix: rows = customers; cols. = CD’s. 11 Space of CD’s – (2) ◆ For Amazon, the dimension count is tens of millions. ◆ An alternative : use minhashing/LSH to get Jaccard similarity between “close” CD’s....

View
Full
Document