cs345-cl2

More Clustering: CURE Algorithm, Non-Euclidean Approaches

The CURE Algorithm
- Problem with BFR/k-means:
  - Assumes clusters are normally distributed in each dimension.
  - And axes are fixed: ellipses at an angle are not OK.
- CURE:
  - Assumes a Euclidean distance.
  - Allows clusters to assume any shape.

Example: Stanford Faculty Salaries
[Figure: scatter plot with axes salary and age; each point is labeled e or h.]

Starting CURE
1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically: group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
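
The hierarchical step of Starting CURE does not say how the distance between two clusters is measured. As a minimal sketch, assuming single-link distance (the smallest point-to-point distance between two clusters) and a fixed target number of clusters, grouping the in-memory sample might look like this. The names `single_link` and `agglomerate` are illustrative and do not come from the slides.

```python
import math
from itertools import combinations


def single_link(c1, c2):
    """Smallest Euclidean distance between any point of c1 and any point of c2.

    Single-link is an assumption; the slides only say "group nearest
    points/clusters".
    """
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerate(points, num_clusters):
    """Merge the two nearest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > num_clusters:
        # Find the pair of clusters (i < j) that are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because j > i
    return clusters


if __name__ == "__main__":
    # Hypothetical (age, salary in thousands) sample, not data from the slides.
    sample = [(30, 60), (32, 62), (31, 65), (55, 120), (57, 118), (60, 125)]
    print(agglomerate(sample, 2))
```

This brute-force version recomputes all pairwise cluster distances on every merge, which is acceptable only because the sample is chosen to fit in main memory.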

Example: Initial Clusters
[Figure: scatter plot with axes salary and age; points labeled e or h, grouped into initial clusters.]

Example: Pick Dispersed Points
[Figure: scatter plot with axes salary and age; points labeled e or h.]
Pick (say) 4 remote points for each cluster.

Example: Pick Dispersed Points
[Figure: scatter plot with axes salary and age; points labeled e or h.]
Move points (say) 20% toward the centroid.
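
Picking dispersed points and then moving them toward the centroid can be sketched as follows. The slides do not specify how "as dispersed as possible" is achieved, so this sketch assumes a greedy farthest-point heuristic; the function names and the defaults `k=4` and `fraction=0.2` mirror the "(say)" values on the slides but are otherwise illustrative.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def pick_dispersed(points, k):
    """Greedy farthest-point selection (an assumed heuristic).

    Start with the point farthest from the centroid, then repeatedly add
    the point whose distance to its nearest already-chosen point is largest.
    """
    c = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, c))]
    while len(chosen) < k:
        remaining = [p for p in points if p not in chosen]
        if not remaining:  # fewer distinct points than k
            break
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def shrink(points, c, fraction=0.2):
    """Move each point the given fraction of the way toward centroid c."""
    return [tuple(x + fraction * (cx - x) for x, cx in zip(p, c)) for p in points]


def representatives(cluster, k=4, fraction=0.2):
    """Dispersed points of one cluster, pulled toward its centroid."""
    return shrink(pick_dispersed(cluster, k), centroid(cluster), fraction)


if __name__ == "__main__":
    # Hypothetical cluster: four corners of a square plus its center.
    cluster = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
    print(representatives(cluster))
```

Shrinking toward the centroid keeps the representatives inside the cluster's outline while still tracing its shape, which is what lets CURE handle clusters that are not axis-aligned ellipses.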

Finishing CURE
- Now, visit each point p in the data set.
- Place it in the “closest cluster.”
  - Normal definition of “closest”: the cluster that has the closest point to p among all the sample points of all the clusters.
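
The assignment pass of Finishing CURE can be sketched as a nearest-neighbor search over the sample points kept for every cluster. This is a minimal illustration under the slides' Euclidean assumption; `assign_to_cluster`, the dictionary layout, and the coordinates are hypothetical.

```python
import math


def assign_to_cluster(p, reps_by_cluster):
    """Return the label of the cluster owning the sample point nearest to p.

    reps_by_cluster maps a cluster label to the list of points kept
    for that cluster.
    """
    best_label, best_dist = None, math.inf
    for label, reps in reps_by_cluster.items():
        for r in reps:
            d = math.dist(p, r)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


if __name__ == "__main__":
    # Hypothetical sample points for two clusters labeled "e" and "h".
    reps = {"e": [(1, 1), (2, 5)], "h": [(8, 8), (9, 3)]}
    for point in [(3, 4), (8, 4)]:
        print(point, assign_to_cluster(point, reps))
```

Each data point is compared only against the small set of stored points, so the full data set can be streamed from disk in one pass.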

Curse of Dimensionality
- One way to look at it: in large-dimension spaces, random vectors are perpendicular. Why?
- Argument #1: Lots of 2-dim subspaces. There must be one where the vectors’ projections are almost perpendicular.
- Argument #2: Expected value of cosine of angle is 0.

Cosine of Angle Between Random Vectors
- Assume vectors emanate from the origin (0,0,…,0).
- Components are random in range [-1,1].
- $(a_1,a_2,\ldots,a_n)\cdot(b_1,b_2,\ldots,b_n)$ has expected value 0 and a standard deviation that grows as $\sqrt{n}$.
- But lengths of both vectors grow as $\sqrt{n}$.
- So the dot product divided by the product of the lengths (the cosine) is around $\sqrt{n}/(\sqrt{n}\cdot\sqrt{n}) = 1/\sqrt{n}$.

Random Vectors, Continued
- Thus, a typical pair of vectors has an angle whose cosine is on the order of $1/\sqrt{n}$.
- As $n \to \infty$, that’s 0; i.e., the angle is about 90°.
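
The claim that the cosine shrinks as the dimension grows can be checked empirically. The following is a rough simulation sketch, assuming components drawn uniformly from [-1,1] as on the slide; the function names, the trial count, and the seed are arbitrary choices for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine of the angle between vectors a and b (both from the origin)."""
    dot = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    return dot / (len_a * len_b)


def mean_abs_cosine(n, trials=200, seed=0):
    """Average |cosine| over random pairs of n-dimensional vectors."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        a = [rng.uniform(-1, 1) for _ in range(n)]
        b = [rng.uniform(-1, 1) for _ in range(n)]
        total += abs(cosine(a, b))
    return total / trials


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        # Print the measured average next to 1/sqrt(n) for comparison.
        print(n, round(mean_abs_cosine(n), 4), round(1 / math.sqrt(n), 4))
```

The measured averages should track 1/sqrt(n) only up to a constant factor, since the argument on the slides is an order-of-magnitude estimate.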

Interesting Consequence
- Suppose “random vectors are perpendicular,” even in non-Euclidean spaces.

This document was uploaded on 01/06/2012.
