Clustering: Preliminaries, Applications, Euclidean/Non-Euclidean Spaces, Distance Measures

The Problem of Clustering
- Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.

Example
[Figure: scatter plot of points in the plane]

Problems With Clustering
- Clustering in two dimensions looks easy.
- Clustering small amounts of data looks easy.
- And in most cases, looks are not deceiving.

The Curse of Dimensionality
- Many applications involve not 2, but 10 or 10,000 dimensions.
- High-dimensional spaces look different: almost all pairs of points are at about the same distance.
  - Example: assume random points within a bounding box, e.g., values between 0 and 1 in each dimension.
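
The bounding-box example can be tried numerically. The following is a minimal sketch, not part of the original slides: the function name `pairwise_distance_spread`, the sample size of 50 points, and the fixed seed are all assumptions chosen for illustration. It draws random points with coordinates between 0 and 1 and reports how far the smallest and largest pairwise distances sit from the mean.

```python
import math
import random


def pairwise_distance_spread(num_points, dims, seed=0):
    """Return (min, mean, max) Euclidean distance over all pairs of random points.

    Points are drawn uniformly from the unit box [0, 1]^dims.
    The name, the default seed, and the sampling scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(num_points)]
    dists = [
        math.dist(points[i], points[j])  # math.dist requires Python 3.8+
        for i in range(num_points)
        for j in range(i + 1, num_points)
    ]
    return min(dists), sum(dists) / len(dists), max(dists)


if __name__ == "__main__":
    for dims in (2, 10, 1000):
        lo, mean, hi = pairwise_distance_spread(50, dims)
        # Ratios near 1.0 mean that all pairs are at about the same distance.
        print(f"dims={dims:5d}  min/mean={lo / mean:.2f}  max/mean={hi / mean:.2f}")
```

As the number of dimensions grows, both ratios should move toward 1, which is the sense in which "almost all pairs of points are at about the same distance."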

Example: SkyCat
- A catalog of 2 billion "sky objects" represents objects by their radiation in 9 dimensions (frequency bands).
- Problem: cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
- Sloan Sky Survey is a newer, better version.

Example: Clustering CD's (Collaborative Filtering)
- Intuitively: music divides into categories, and customers prefer a few categories.
- Represent a CD by the customers who bought it.
- Similar CD's have similar sets of customers, and vice versa.
But what are categories really?

The Space of CD's
- Think of a space with one dimension for each customer.
- A CD's point in this space is (x1, x2, ..., xk), where xi = 1 iff the i-th customer bought the CD.
  - Compare with the "shingle/signature" matrix: rows = customers; cols. = CD's.
Values in a dimension may be 0 or 1 only.

Space of CD's (2)
- For Amazon, the dimension count is tens of millions.
- An option: use minhashing/LSH to get Jaccard similarity between "close" CD's.
- 1 minus Jaccard similarity can serve as a (non-Euclidean) distance.

Example: Clustering Documents
- Represent a document by a vector (x1, x2, ..., xk), where xi = 1 iff the i-th word (in some order) appears in the document.
- Documents with similar sets of words may be about the same topic.
It actually doesn't matter if k is infinite; i.e., we don't limit the set of words.

Example: Gene Sequences
- Objects are sequences of {C,A,T,G}.
- Distance between sequences is edit distance, the minimum number of inserts and deletes needed to turn one into the other.
- Note there is a "distance," but no convenient space in which points "live."

Distance Measures
- Each clustering problem is based on some kind of "distance" between points.
- Two major classes of distance measure:
  1. Euclidean
  2. Non-Euclidean

Euclidean Vs. Non-Euclidean
- A Euclidean space has some number of real-valued dimensions and "dense" points.
  - There is a notion of "average" of two points.
  - A Euclidean distance is based on the locations of points in such a space.
- A Non-Euclidean distance is based on properties of points, but not their "location" in a space.

Axioms of a Distance Measure
- d is a distance measure if it is a function from pairs of points to real numbers such that:
  1. d(x,y) ≥ 0.
  2. d(x,y) = 0 iff x = y.
  3. d(x,y) = d(y,x).
  4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).

Some Euclidean Distances
- L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.
- L1 norm: sum of the absolute differences in each dimension.
  - Manhattan distance = distance if you had to travel along coordinates only.
The L2 norm is the most common notion of "distance."

Examples of Euclidean Distances
x = (5,5); y = (9,8); the differences are 4 and 3.
L2 norm: dist(x,y) = √(4² + 3²) = 5
L1 norm: dist(x,y) = 4 + 3 = 7

Another Euclidean Distance
- L∞ norm: d(x,y) = the maximum of the absolute differences between x and y in any dimension.
- Note: the maximum is the limit as n goes to ∞ of what you get by taking the n-th power of the differences, summing, and taking the n-th root.
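
The three Euclidean norms can be written in a few lines. This is a minimal sketch; the function names `l1`, `l2`, `l_inf`, and `l_n` are assumptions for illustration and do not come from the slides. It reuses the example points x = (5,5) and y = (9,8), and `l_n` shows the n-th-power, n-th-root quantity approaching the maximum as n grows.

```python
def l1(x, y):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2(x, y):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def l_inf(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def l_n(x, y, n):
    """General Ln distance: n-th root of the sum of n-th powers of the differences."""
    return sum(abs(a - b) ** n for a, b in zip(x, y)) ** (1.0 / n)


if __name__ == "__main__":
    x, y = (5, 5), (9, 8)
    print(l1(x, y))     # 7
    print(l2(x, y))     # 5.0
    print(l_inf(x, y))  # 4
    for n in (1, 2, 10, 50):
        # The value shrinks toward the maximum difference, 4, as n increases.
        print(n, l_n(x, y, n))
```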

Non-Euclidean Distances
- Jaccard distance for sets = 1 minus ratio of sizes of intersection and union.
- Cosine distance = angle between vectors from the origin to the points in question.
- Edit distance = number of inserts and deletes to change one string into another.

Jaccard Distance for Bit-Vectors
- Example: p1 = 10111; p2 = 10011.
  - Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
- Need to make a distance function satisfying triangle inequality and other laws.
- d(x,y) = 1 - (Jaccard similarity) works.
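
A minimal sketch of this distance on bit-vectors given as strings of "0" and "1". The function names and the convention that two all-zero vectors are at distance 0 are assumptions made for illustration; the slides do not address the empty case. The second function brute-forces the triangle inequality over all short bit-vectors, which is a sanity check rather than a proof.

```python
from itertools import product


def jaccard_distance(p1, p2):
    """Return 1 - |intersection| / |union| for two equal-length bit-vectors."""
    if len(p1) != len(p2):
        raise ValueError("bit-vectors must have the same length")
    intersection = sum(1 for a, b in zip(p1, p2) if a == "1" and b == "1")
    union = sum(1 for a, b in zip(p1, p2) if a == "1" or b == "1")
    if union == 0:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - intersection / union


def triangle_inequality_holds(num_bits):
    """Exhaustively check d(x,y) <= d(x,z) + d(z,y) for all vectors of num_bits bits."""
    vectors = list(product("01", repeat=num_bits))
    return all(
        jaccard_distance(x, y) <= jaccard_distance(x, z) + jaccard_distance(z, y) + 1e-12
        for x in vectors
        for y in vectors
        for z in vectors
    )


if __name__ == "__main__":
    print(jaccard_distance("10111", "10011"))  # 0.25
    print(triangle_inequality_holds(4))        # True
```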

Why J.D. Is a Distance Measure
- d(x,x) = 0 because x∩x = x∪x.
- d(x,y) = d(y,x) because union and intersection are symmetric.
- d(x,y) ≥ 0 because |x∩y| ≤ |x∪y|.
- d(x,y) ≤ d(x,z) + d(z,y): trickier; see next slide.

Triangle Inequality for J.D.
(1 - |x∩z|/|x∪z|) + (1 - |y∩z|/|y∪z|) ≥ 1 - |x∩y|/|x∪y|
- Remember: |a∩b|/|a∪b| = probability that minhash(a) = minhash(b).
- Thus, 1 - |a∩b|/|a∪b| = probability that minhash(a) ≠ minhash(b).

Triangle Inequality (2)
- Observe that prob[minhash(x) ≠ minhash(y)] ≤ prob[minhash(x) ≠ minhash(z)] + prob[minhash(z) ≠ minhash(y)].
- Clincher: whenever minhash(x) ≠ minhash(y), at least one of minhash(x) ≠ minhash(z) and minhash(z) ≠ minhash(y) must be true.

Cosine Distance
- Think of a point as a vector from the origin (0,0,...,0) to its location.
- Two points' vectors make an angle, whose cosine is the normalized dot product of the vectors: p1.p2/|p2||p1|.
  - Example: p1 = 00111; p2 = 10011.
  - p1.p2 = 2; |p1| = |p2| = √3.
  - cos(θ) = 2/3; θ is about 48 degrees.
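
The example can be reproduced with a short function. This is a sketch under stated assumptions: the name `cosine_distance`, the choice to return degrees, and the decision to reject zero vectors are illustrative and not taken from the slides. The clamp guards against floating-point rounding pushing the cosine slightly outside [-1, 1].

```python
import math


def cosine_distance(p1, p2):
    """Return the angle in degrees between two vectors from the origin."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: the angle is undefined for the zero vector, so refuse it.
        raise ValueError("cosine distance is undefined for a zero vector")
    # Clamp so rounding error cannot take the value out of acos's domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    p1 = [0, 0, 1, 1, 1]
    p2 = [1, 0, 0, 1, 1]
    print(round(cosine_distance(p1, p2), 2))  # 48.19
```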
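
Cosine-Measure Diagram
[Figure: vectors p1 and p2 from the origin, separated by angle θ; the projection of p1 onto p2 has length p1.p2/|p2|. Why? Next slide.]
dist(p1, p2) = θ = arccos(p1.p2/|p2||p1|)

Why?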
Dot product is invariant under rotation, so pick a convenient coordinate system: p1 = (x1,y1); p2 = (x2,0).
- p1.p2 = x1x2.
- |p2| = x2.
- x1 = x1x2/x2 = p1.p2/|p2|.
- Since x1 = |p1| cos(θ), it follows that cos(θ) = p1.p2/|p2||p1|.

Why C.D. Is a Distance Measure
- d(x,x) = 0 because arccos(1) = 0.
- d(x,y) = d(y,x) by symmetry.
- d(x,y) ≥ 0 because angles are chosen to be in the range 0 to 180 degrees.
- Triangle inequality: physical reasoning. If I rotate an angle from x to z and then from z to y, I can't rotate less than from x to y.

Edit Distance
- The edit distance of two strings is the minimum number of inserts and deletes of characters needed to turn one into the other.
- Equivalently: d(x,y) = |x| + |y| - 2|LCS(x,y)|.
  - LCS = longest common subsequence = longest string obtained both by deleting from x and deleting from y.
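
The LCS formula translates directly into code. Below is a minimal sketch using the standard dynamic-programming table for LCS length, kept to one row at a time; the function names `lcs_length` and `edit_distance` are assumptions for illustration, not names from the slides.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop a character from x or y
        prev = curr
    return prev[-1]


def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2|LCS(x,y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


if __name__ == "__main__":
    print(lcs_length("abcde", "bcduve"))     # 4
    print(edit_distance("abcde", "bcduve"))  # 3
```

Every character outside the LCS must be deleted from x or inserted from y, which is why the two lengths are added and the LCS is subtracted twice.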

Example
- x = abcde; y = bcduve.
- Turn x into y by deleting a, then inserting u and v after d.
- Or, LCS(x,y) = bcde.
- |x| + |y| - 2|LCS(x,y)| = 5 + 6 - 2*4 = 3.
Edit distance = 3.

Why E.D. Is a Distance Measure
- d(x,x) = 0 because 0 edits suffice.
- d(x,y) = d(y,x) because insert/delete are inverses of each other.
- d(x,y) ≥ 0: no notion of negative edits.
- Triangle inequality: changing x to z and then to y is one way to change x to y.

Variant Edit Distance
- Allow insert, delete, and mutate.
  - Mutate: change one character into another.
- Minimum number of inserts, deletes, and mutates also forms a distance measure.
...
This document was uploaded on 03/04/2012.
 Fall '09