CSE 6740 Lecture 9
How Can I Reduce/Relate the Data Points? (Association and Clustering)
Alexander Gray
agray@cc.gatech.edu
Georgia Institute of Technology
CSE 6740 Lecture 9 p. 1/35

Today
1. Clustering
2. Associations
Central tasks for data mining.

Clustering Methods
Show me the subgroups in the data.

Clustering
Why show subgroups in the data?
Sometimes:
  Computational reasons (e.g. use cluster centers instead of the dataset)
  Statistical reasons (e.g. identify/remove outliers)
Mainly:
  Visualization/understanding reasons

Procedural Methods
When we can speak of a true underlying function (as we do in most density estimation, classification, and regression methods), we can discuss error, error bounds, generalization (minimizing error on future data), what happens to the error as we get more data, etc. In other words, we can leverage all the powerful tools of statistics we have discussed.

Procedural Methods
I will call a method which has not been formally related to some function of the underlying density a procedural method. This turns out to be common in clustering and density estimation methods. Though this makes it hard or impossible to say much about these methods analytically, they are nonetheless often still useful in practice.

Mixture of Gaussians
Treat clustering as a density estimation problem, where each Gaussian is a cluster.

Mixture of Gaussians
Again:
  Task: density estimation
  Model class: set of all possible mixtures of Gaussians with K components
  Loss: likelihood
  Optimizer: EM algorithm
  Generalization mechanism: cross-validation
  Evaluation algorithm: N-body

Sum-of-Squares Minimization
Here's a simpler method, which cannot be described as relating to some function of the underlying distribution of the data.
We'll seek a partitioning of the points into K disjoint subsets $C_k$, each containing $N_k$ points, such that the following sum-of-squares objective function is minimized:

$$\sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \mu_k \|^2 \tag{1}$$

where $\mu_k = \frac{1}{N_k} \sum_{i \in C_k} x_i$ is the mean of the points in set $C_k$. $C(x_i) = C_k$ will denote that the class of $x_i$ is $C_k$.

K-means
The K-means method is as follows:
First initialize the means $\mu_k$ somehow, for example by choosing K different points randomly. Then:
1. Assign each point according to $C(x_i) = \arg\min_k \| x_i - \mu_k \|$.
2. Recompute each $\mu_k$ according to the new assignments.
Stop when no assignments change....
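
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names (kmeans, sum_of_squares, squared_distance, mean), the seed and max_iter parameters, and the choice to leave a mean unchanged when its cluster becomes empty are all assumptions made for the example. Points are assumed to be tuples of floats.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Alternate assignment and mean updates until assignments stop changing.

    max_iter is a safety cap added for this sketch (an assumption).
    """
    rng = random.Random(seed)
    # Initialization: choose K different points from the data at random.
    means = rng.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster with the nearest mean.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(x, means[j]))
            for x in points
        ]
        # Stop when no assignments change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: recompute each mean from its newly assigned points.
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old mean
                means[j] = mean(members)
    return means, assignments


def sum_of_squares(points, means, assignments):
    """Objective (1): total squared distance of points to their cluster mean."""
    return sum(squared_distance(x, means[c]) for x, c in zip(points, assignments))
```

Calling kmeans(points, 2) on a small list of 2-D tuples returns the final means and a cluster index per point, and sum_of_squares evaluates objective (1) for that result. Minimizing the squared distance in step 1 gives the same assignment as minimizing the distance itself, since squaring preserves the ordering of non-negative values.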
 Fall '08
 Staff
