Three important questions:
1) How do you represent a cluster of more than one point?
2) How do you determine the "nearness" of clusters?
3) When do you stop combining clusters?
When to stop combining clusters:
1. You may know how many clusters there are in the data
   - You have been told, or have an intuitive sense of, the number of clusters
2. Stop combining when the best combination of existing clusters produces a cluster that is inadequate
   - E.g., the average distance between the centroid and its points should stay below some limit
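The second stopping test above can be sketched in a few lines. This is a minimal illustration (the function names and the 2-D point representation are assumptions, not from the lecture): a candidate merge is rejected once the merged cluster's average centroid-to-point distance exceeds a chosen limit.

```python
import math

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def avg_centroid_distance(points):
    """Average Euclidean distance from a cluster's centroid to its points."""
    c = centroid(points)
    return sum(math.dist(c, p) for p in points) / len(points)

def merge_is_adequate(a, b, limit):
    """Stopping test sketched on the slide: merging clusters a and b is
    acceptable only if the merged cluster's average centroid distance
    stays below `limit` (a user-chosen threshold)."""
    return avg_centroid_distance(a + b) < limit
```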
Dendrogram
- Can obtain k clusters from the result for any desired k
- k can be any value between 1 and n
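A naive agglomerative sketch shows why any k between 1 and n is obtainable: start from n singleton clusters and merge the two nearest until k remain; stopping at different heights of the dendrogram yields different k. This is an illustrative sketch (centroid-distance merging is one of several possible "nearness" choices), not the lecture's exact algorithm.

```python
import math

def agglomerative(points, k):
    """Hierarchical (agglomerative) clustering sketch: begin with one
    singleton cluster per point and repeatedly merge the pair of clusters
    with the nearest centroids until only k clusters remain."""
    clusters = [[p] for p in points]

    def centroid(c):
        return tuple(sum(x) / len(c) for x in zip(*c))

    while len(clusters) > k:
        # find the pair of clusters whose centroids are closest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```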
K-Means Clustering Algorithm
- User specifies a target number of clusters (k)
- Place k cluster centers randomly
- For each datapoint, attach it to the nearest cluster center
- For each center, find the centroid of all the datapoints attached to it
- Turn the centroids into the new cluster centers
- Repeat until the sum of all the datapoint distances to their cluster centers is minimized
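The steps above can be sketched as follows. One simplification relative to the slide: instead of monitoring the total distance directly, this sketch stops when the point-to-center assignment stops changing, which is the point at which the sum of distances can no longer decrease. Function names and the 2-D tuple representation are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """K-means sketch: pick k random centers, attach each point to its
    nearest center, move each center to the centroid of its attached
    points, and repeat until the assignment stabilizes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers from the data
    assignment = None
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        new_assignment = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # update step: move each center to the centroid of its points
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old center if its cluster emptied
                centers[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment
```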
[Figures: K-Means Clustering, iterations (1)-(6)]
Clustering Methods
- Hierarchical clustering
  - Attach datapoints to root points
- K-Means clustering
  - Centroid-based
- Density-based methods
  - Each cluster must contain at least a minimum number of datapoints
- ...
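The density-based idea can be illustrated with a minimal DBSCAN-style sketch (an assumption for illustration; the slide does not name a specific algorithm): a point with at least `min_pts` neighbours within radius `eps` is a core point, clusters grow outward from core points, and unreachable points are labelled noise.

```python
import math

def density_clusters(points, eps, min_pts):
    """DBSCAN-style sketch. Returns one label per point: a cluster id
    (0, 1, ...) or -1 for noise. A point counts as its own neighbour."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may be reclaimed as a border point later)
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:  # j is a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels
```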
Modeling and Simulation - Fall '17
- Machine Learning
  - Data
  - Class labels