Lecture 15 - Data Mining CS57300, Purdue University, November 2, 2010

Descriptive modeling (cont.)

Hierarchical methods
• Construct a hierarchy of nested clusters rather than picking k beforehand
• Approaches:
  • Agglomerative: merge clusters successively
  • Divisive: divide clusters successively
• A dendrogram depicts the sequence of merges or splits; its height indicates distance

Agglomerative
• For i = 1 to n:
  • Let Ci = {x(i)}
• While |C| > 1:
  • Let Ci and Cj be the pair of clusters with minimum D(Ci, Cj)
  • Ci = Ci ∪ Cj
  • Remove Cj
• Complexity: time? space?
• (A code sketch of this procedure, with the linkage options below, appears after the spectral clustering material.)

Distance measures between clusters
• Single link / nearest neighbor: D(Ci, Cj) = min{d(x, y) | x ∈ Ci, y ∈ Cj}; can produce long, thin clusters
• Complete link / furthest neighbor: D(Ci, Cj) = max{d(x, y) | x ∈ Ci, y ∈ Cj}; sensitive to outliers
• Average link: D(Ci, Cj) = avg{d(x, y) | x ∈ Ci, y ∈ Cj}; a compromise between the two

Divisive
• While |C| < n:
  • For each Ci with more than 2 objects:
    • Apply a partition-based clustering method to split Ci into two clusters Cj and Ck
    • C = C - {Ci} ∪ {Cj, Ck}
• Complexity?
• Example partition-based methods: k-means, spectral clustering

Basic spectral clustering
• Idea: cut a weighted graph into a number of disjoint groups (clusters) using a similarity matrix W, such that the intra-cluster weights (similarities) are high and the inter-cluster weights are low
• Popularized for image segmentation (Shi and Malik, 2000)
• Note that the graph representation is just an abstract idea: one can cluster any data given an appropriate similarity measure w_ij

Normalized cut algorithm
The method partitions the graph by minimizing the normalized cut objective function. We outline the algorithm below and describe the similarity metrics in section 3.4.
• Input: G, A, S, m, c
• Algorithm:
  1. Let W be an N × N matrix with W_ij = S(i, j).
  2. Let D be an N × N diagonal matrix with d_i = Σ_{j ∈ V} S(i, j).
  3. Solve the eigensystem (D − W)x = λDx for the eigenvector x1 associated with the second smallest eigenvalue λ1.
  4. Sort x1.
  5. Consider m evenly spaced entries in x1. For each value x1m:
     (a) Bipartition the nodes into (A, B) such that A ∩ B = ∅, A ∪ B = V, and x1a < x1m for all va ∈ A.
     (b) Calculate the normalized cut objective function J(A, B):
         J(A, B) = cut(A, B) / Σ_{i ∈ A} d_i + cut(A, B) / Σ_{j ∈ B} d_j,  where cut(A, B) = Σ_{i ∈ A, j ∈ B} S(i, j)   (1)
  6. Partition the graph into the (A, B) that minimizes J.
  7. Calculate the stability of the current cut; if stability > c, stop recursing.
  8. Otherwise, recursively repartition A and B.
• In general, it takes O(n^3) operations to solve for all eigenvalues of an arbitrary eigensystem. However, if the weight matrix is sparse, the Lanczos algorithm can compute the solution in O(n^1.4) operations [20, 7], and approximate algorithms can compute the solution in O(|E|) operations [14]. Similarity metrics that produce sparse matrices are therefore desirable.
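The agglomerative loop and the three linkage criteria above translate almost directly into code. The following is a minimal sketch in Python, not taken from the lecture: it assumes NumPy and SciPy are available, and the function name agglomerative and the choice to stop once k clusters remain (rather than building the full dendrogram) are my own.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Linkage criteria from the slide: how to aggregate pairwise point distances
LINKAGE = {
    "single":   np.min,    # nearest neighbor: can produce long, thin clusters
    "complete": np.max,    # furthest neighbor: sensitive to outliers
    "average":  np.mean,   # compromise between the two
}

def agglomerative(X, k, linkage="average"):
    """Repeatedly merge the pair of clusters with minimum D(Ci, Cj)
    until only k clusters remain. O(n^3) time, for clarity not speed."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]        # Ci = {x(i)}
    agg = LINKAGE[linkage]
    while len(clusters) > k:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                D = agg(cdist(X[clusters[a]], X[clusters[b]]))
                if D < best[0]:
                    best = (D, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # Ci = Ci ∪ Cj
        del clusters[b]                            # remove Cj
    return clusters                                # list of index lists, one per cluster
```

A single level of the normalized-cut bipartition (steps 1 through 6 above) can be sketched the same way. This is an illustrative sketch rather than the paper's implementation: it assumes a symmetric similarity matrix with strictly positive row sums (so D is positive definite), uses scipy.linalg.eigh for the generalized eigensystem (D − W)x = λDx, and omits the stability test and recursion of steps 7 and 8; the function name ncut_bipartition and the default m = 20 are my own.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(S, m=20):
    """One bipartition of a symmetric similarity matrix S: threshold the
    eigenvector of the second smallest eigenvalue of (D - W)x = lambda*D*x,
    choosing among m evenly spaced thresholds the one minimizing J(A, B)."""
    W = np.asarray(S, dtype=float)
    d = W.sum(axis=1)                         # d_i = sum_{j in V} S(i, j)
    D = np.diag(d)
    _, vecs = eigh(D - W, D)                  # eigenvalues returned in ascending order
    x1 = vecs[:, 1]                           # eigenvector for the second smallest eigenvalue

    best_J, best_A = np.inf, None
    for t in np.linspace(x1.min(), x1.max(), m + 2)[1:-1]:   # m interior thresholds
        A = x1 < t                            # nodes with x1a < x1m go to A
        B = ~A
        if not A.any() or not B.any():
            continue
        cut = W[np.ix_(A, B)].sum()           # cut(A, B) = sum_{i in A, j in B} S(i, j)
        J = cut / d[A].sum() + cut / d[B].sum()
        if J < best_J:
            best_J, best_A = J, A
    return best_A, best_J                     # boolean mask for A, and the value of J(A, B)
```

Applying the function recursively to A and B, guarded by the stability threshold c, would complete steps 7 and 8.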
Spectral clustering
• Strengths:
  • Can find nonspherical clusters
• Weaknesses:
  • O(n^3) to find all eigenvalues of an arbitrary eigensystem; approximate methods are O(|E|)
  • Researchers disagree about:
    • Which eigenvectors to use
    • How to derive clusters from the eigenvectors

Clustering example
• Shows the advantages of spectral clustering over k-means clustering [Ng, Jordan, Weiss, 2001]
• (Figure: left, rows of the Y matrix, where k-means clustering is easy and the point clouds are 90° apart; middle, clustering performed by spectral clustering; right, clustering performed by simple k-means.)

Model-based clustering

Probabilistic model-based clustering
• Assumes a probabilistic model for each underlying cluster (component)
• A mixture model specifies a weighted combination of component distributions (e.g., Gaussian, Poisson, exponential):
  f(x) = Σ_{k=1}^{K} w_k f_k(x; θ_k)

Gaussian mixture models
• Example: a mixture of three Gaussians and the contours of its probability distribution (figures from Christopher M. Bishop, BCS Summer School, Exeter, 2003)

Mixture models (cont.)
• How to learn the model from data?
• We don't know w_1, ..., w_K or the component parameters in
  f(x) = Σ_{k=1}^{K} w_k f_k(x; θ_k)
• Solution:
  • Interpret the mixing coefficients as prior probabilities: p(x) = Σ_{k=1}^{K} p(k) p(x|k)
  • Use Expectation-Maximization (Dempster, Laird, Rubin, 1977)

Generative process
• For each data point:
  • Pick a component Gaussian randomly with probability w_k
  • Draw a sample point from that component
• f(x) = Σ_{k=1}^{K} w_k f_k(x; θ_k),  p(x) = Σ_{k=1}^{K} p(k) p(x|k)
• (Figure: a synthetic data set sampled from the mixture.)

Learning the model
• We want to invert this process: given the data set, find the parameters
  • Mixing coefficients
  • Component means and covariance matrices
• If we knew which component generated each point, the MLE solution would involve fitting each component distribution to the appropriate cluster points
• Problem: the cluster memberships are hidden
• (Figure: the synthetic data set without labels.)

Posterior probabilities
• We can think of the mixing coefficients as prior probabilities for the components
• For a given value of x, we can evaluate the corresponding posterior probabilities of cluster membership with Bayes' theorem:
  γ_k(x) ≡ p(k|x) = p(k) p(x|k) / p(x) = w_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} w_j N(x | μ_j, Σ_j)
• (Figure: posterior probabilities, colour coded.)
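The posterior computation above is easy to vectorize. Here is a minimal sketch in Python, not from the lecture, assuming NumPy and SciPy and known mixture parameters; the function name responsibilities is my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """gamma_k(x) = w_k N(x | mu_k, Sigma_k) / sum_j w_j N(x | mu_j, Sigma_j),
    evaluated for every row of X. Returns an (n, K) matrix whose rows sum to 1."""
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=mu, cov=S)   # w_k N(x_n | mu_k, Sigma_k)
        for w, mu, S in zip(weights, means, covs)
    ])
    return dens / dens.sum(axis=1, keepdims=True)        # normalize by p(x_n)
```

Each row of the result is the K-component vector of membership probabilities that the soft-clustering discussion below relies on.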
MLE for GMM
• The log-likelihood takes the following form:
  log p(D | w, μ, Σ) = Σ_{n=1}^{N} log [ Σ_{k=1}^{K} w_k N(x_n | μ_k, Σ_k) ]
• Note that the sum over components is inside the log
• There is no closed-form solution for the MLE

EM algorithm
• A popular algorithm for parameter estimation in data with hidden/unobserved values
  • Hidden variables = cluster memberships
• Basic idea:
  • Initialize the hidden variables and parameters
  • Predict values for the hidden variables given the current parameters (E-step)
  • Estimate the parameters given the current prediction for the hidden variables (M-step)
  • Repeat

Hidden variables
• If we knew the values of the hidden variables, we could maximize the complete-data log-likelihood:
  log p(x, z | θ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk [ log w_k + log N(x_n | μ_k, Σ_k) ]
• This has a trivial closed-form solution, except that we don't know the values of the hidden variables
• But, for a given set of parameters, we can compute the expected values of the hidden variables

EM for GMM
• Suppose we make a guess θ for the parameter values
• E-step: use θ to evaluate the cluster memberships
  γ_k(x) ≡ p(k|x) = p(k) p(x|k) / p(x) = w_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} w_j N(x | μ_j, Σ_j)
• Compute the expected complete-data log-likelihood using the predicted cluster memberships:
  log p(x, z | θ) = Σ_{n=1}^{N} Σ_{k=1}^{K} γ_k(x_n) [ log w_k + log N(x_n | μ_k, Σ_k) ]
• M-step: use the completed likelihood to determine the MLE for θ
• (A code sketch of EM for a Gaussian mixture appears at the end of these notes.)

More on EM
• Often both the E step and the M step can be solved in closed form
• Neither the E step nor the M step can decrease the log-likelihood
• The algorithm is guaranteed to converge to a local maximum of the likelihood
• Must specify initialization and stopping criteria

Probabilistic clustering
• The model provides a full distributional description for each component
• We may be able to interpret differences in the distributions
• Soft clustering (compared to k-means hard clustering)
• Given the model, each point has a K-component vector of membership probabilities
• Key cost: the assumption of a parametric model

How to choose k?
• Choose k to maximize the likelihood?
• As k increases, the value of the maximum likelihood cannot decrease
• Thus more complex models will always improve the likelihood
• How to compare models with different complexities? ...
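For a Gaussian mixture both steps have closed forms, so the whole EM loop (the sketch promised in the EM for GMM section above) fits in a short function. This is an illustrative sketch rather than a reference implementation: it assumes NumPy and SciPy, and the function name em_gmm, the initialization of the means at randomly chosen data points, the small 1e-6 diagonal term that keeps the covariances well conditioned, and the log-likelihood-based stopping rule are all my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: alternate the E-step (responsibilities)
    and the M-step (weighted MLE updates) until the log-likelihood stalls."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    w = np.full(K, 1.0 / K)                              # mixing coefficients w_k
    mu = X[rng.choice(n, size=K, replace=False)]         # means initialized at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: gamma[n, k] proportional to w_k N(x_n | mu_k, Sigma_k)
        dens = np.column_stack([
            w[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        ll = np.log(dens.sum(axis=1)).sum()              # log p(D | w, mu, Sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form weighted MLE given the responsibilities
        Nk = gamma.sum(axis=0)                           # effective number of points per component
        w = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:                           # stopping criterion
            break
        prev_ll = ll
    return w, mu, Sigma, ll
```

Since EM converges only to a local maximum, in practice one runs it from several initializations and keeps the fit with the highest log-likelihood, consistent with the initialization and stopping-criteria caveats above.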