Data Mining
CS57300
Purdue University
November 2, 2010

Descriptive modeling (cont)

Hierarchical methods
• Construct a hierarchy of nested clusters rather than picking k beforehand
• Approaches:
  • Agglomerative: merge clusters successively
  • Divisive: split clusters successively
• Dendrogram depicts the sequence of merges or splits; height indicates distance

Agglomerative
• For i = 1 to n:
  • Let Ci = {x(i)}
• While |C| > 1 (C is the current set of clusters):
  • Let Ci and Cj be the pair of clusters with minimum D(Ci,Cj)
  • Ci = Ci ∪ Cj
  • Remove Cj
• Complexity: time? space?

Distance measures between clusters
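The agglomerative loop can be written with the cluster distance D left as a parameter, which makes the effect of each measure listed below easy to compare. This is a minimal pure-Python sketch, not an efficient implementation: it assumes 1-D points with d(x, y) = |x − y| by default, and the names `LINKAGES` and `agglomerate` are illustrative only.

```python
from itertools import combinations

# Each linkage aggregates the pairwise distances between two clusters.
LINKAGES = {
    "single": min,                            # nearest neighbor
    "complete": max,                          # furthest neighbor
    "average": lambda ds: sum(ds) / len(ds),  # mean pairwise distance
}

def agglomerate(points, linkage="single", d=lambda x, y: abs(x - y)):
    """Merge clusters bottom-up and return the merges, in order, as
    (cluster_a, cluster_b, distance) tuples."""
    agg = LINKAGES[linkage]
    clusters = [[p] for p in points]          # Ci = {x(i)}
    merges = []
    while len(clusters) > 1:                  # while |C| > 1
        # Find the pair (i, j) with minimum D(Ci, Cj).
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dist = agg([d(x, y) for x in clusters[i] for y in clusters[j]])
            if best is None or dist < best[0]:
                best = (dist, i, j)
        dist, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), dist))
        clusters[i] = clusters[i] + clusters[j]   # Ci = Ci ∪ Cj
        del clusters[j]                           # remove Cj (j > i always)
    return merges

# The merge distances are the heights a dendrogram would show.
for name in LINKAGES:
    print(name, [m[2] for m in agglomerate([0, 1, 3, 10], name)])
```

On the same four points the three choices of D give different merge heights, which is the difference a dendrogram would make visible.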
• Single-link/nearest neighbor: D(Ci,Cj) = min{d(x,y) | x ∈ Ci, y ∈ Cj}
  can produce long, thin clusters
• Complete-link/furthest neighbor: D(Ci,Cj) = max{d(x,y) | x ∈ Ci, y ∈ Cj}
  is sensitive to outliers
• Average link: D(Ci,Cj) = avg{d(x,y) | x ∈ Ci, y ∈ Cj}
  compromise between the two

Divisive
• While |C| < n:
  • For each Ci with more than 1 object:
    • Apply a partition-based clustering method to split Ci into two clusters Cj and Ck
    • C = (C − {Ci}) ∪ {Cj, Ck}
• Complexity?
• Example partition-based methods:
  • K-means, spectral clustering

Spectral clustering
• Cut a weighted graph into a number of disjoint groups using a similarity matrix W
• Popularized for image segmentation (Shi and Malik 2000)

Basic idea
• Cut a weighted graph into a number of disjoint pieces (clusters) such that the intra-cluster weights (similarities) are high and the inter-cluster weights are low.
• Note that the graph representation is just an abstract idea. One can cluster any data if one has an appropriate similarity measure (wij) (see properties later).
Computational Intelligence Seminar E, Summer Term 2007, Gregor Hörzer
Institute for Theoretical Computer Science, Graz University of Technology

Algorithm
... graph, minimizing the normalized cut objective function. We outline the algorithm below and describe the similarity metrics in section 3.4:
• Input: G, A, S, m, c
• Algorithm:
1. Let W be an N × N matrix with $W_{ij} = S(i, j)$.
2. Let D be an N × N diagonal matrix with $d_i = \sum_{j \in V} S(i, j)$.
3. Solve the eigensystem $(D - W)x = \lambda D x$ for the eigenvector $x_1$ associated with the second smallest eigenvalue $\lambda_1$.
4. Sort $x_1$.
5. Consider m evenly spaced entries in $x_1$. For each value $x_{1m}$:
   (a) Bipartition the nodes into (A, B) such that $A \cap B = \emptyset$ and $A \cup B = V$ and $x_{1a} < x_{1m} \;\forall v_a \in A$.
   (b) Calculate the normalized cut objective function J(A, B):
       $$ J(A, B) = \frac{\mathrm{cut}(A, B)}{\sum_{i \in A} d_i} + \frac{\mathrm{cut}(A, B)}{\sum_{j \in B} d_j}, \quad \text{where } \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} S(i, j) \qquad (1) $$
6. Partition the graph into the (A, B) that minimizes J.
7. Calculate the stability of the current cut; if stability > c, stop recursing.
8. Recursively repartition A and B if necessary.
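Steps 1 to 6 can be sketched in pure Python for a small dense similarity matrix. Several things here are assumptions rather than part of the outlined algorithm: the eigenvector is approximated by power iteration on $(I + D^{-1}W)/2$ with the constant vector projected out (a stand-in for a proper eigensolver such as Lanczos), every split point of the sorted eigenvector is tried instead of m evenly spaced ones, and the stability test and recursion of steps 7 and 8 are omitted. The function names and the example matrix `S` are illustrative.

```python
def second_eigenvector(S, iters=500):
    """Approximate the eigenvector of (D - W)x = lambda*D*x belonging to the
    second smallest eigenvalue (power iteration; assumes a connected graph)."""
    n = len(S)
    d = [sum(row) for row in S]                 # d_i = sum_j S(i, j)
    total = sum(d)
    x = [float(i) for i in range(n)]            # arbitrary non-constant start
    for _ in range(iters):
        # Project out the trivial constant eigenvector (lambda = 0),
        # using the D-weighted inner product.
        shift = sum(d[i] * x[i] for i in range(n)) / total
        x = [xi - shift for xi in x]
        # One step of x <- (x + D^-1 W x) / 2.
        x = [(x[i] + sum(S[i][j] * x[j] for j in range(n)) / d[i]) / 2
             for i in range(n)]
        norm = max(abs(xi) for xi in x)
        x = [xi / norm for xi in x]
    return x

def normalized_cut(S, A, B):
    """J(A, B) = cut(A,B)/sum_{i in A} d_i + cut(A,B)/sum_{j in B} d_j."""
    cut = sum(S[i][j] for i in A for j in B)
    degree = lambda nodes: sum(sum(S[i]) for i in nodes)
    return cut / degree(A) + cut / degree(B)

def best_bipartition(S):
    """Sort nodes by the eigenvector and keep the split with the smallest J."""
    x = second_eigenvector(S)
    order = sorted(range(len(S)), key=lambda i: x[i])
    splits = [(order[:k], order[k:]) for k in range(1, len(order))]
    return min(splits, key=lambda ab: normalized_cut(S, ab[0], ab[1]))

# Example: two triangles (weight 1) joined by one weak edge (weight 0.1).
S = [[0.0] * 6 for _ in range(6)]
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                S[i][j] = 1.0
S[2][3] = S[3][2] = 0.1
A, B = best_bipartition(S)
print(sorted(A), sorted(B))
```

Which triangle ends up as A depends on the sign of the eigenvector, so only the partition itself is meaningful.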
In general, it takes O(n^3) operations to solve for all eigenvalues of an arbitrary eigensystem. However, if the weight matrix is sparse, the Lanczos algorithm can be used to compute the solution in O(n^1.4) operations [20, 7], and approximate algorithms can compute the solution in O(E) operations [14]. Similarity metrics that produce sparse matrices ...

Spectral clustering
• Strengths:
  • Can find nonspherical clusters
• Weaknesses:
  • O(n^3) to find all eigenvalues of an arbitrary eigensystem, approximate methods O(E)
• Researchers disagree about:
  • Which eigenvectors to use
  • How to derive clusters from the eigenvectors

Clustering Example
• Shows advantages of spectral clustering over K-means clustering [Ng, Jordan, Weiss, 2001]
• Left: Rows of Y matrix (k-means clustering is easy here), 90° between point clouds
• Middle: Clustering performed by Spectral Clustering
• Right: Clustering performed by simple K-means

Model-based clustering

Probabilistic model-based clustering
• Assumes a probabilistic model for each underlying cluster (component)
• Mixture model specifies a weighted combination of component distributions (e.g., Gaussian, Poisson, Exponential):
$$ f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta) $$

Gaussian mixture models
[Figure: Example: Mixture of 3 Gaussians]
BCS Summer School, Exeter, 2003, Christopher M. Bishop

Contours of probability distribution
[Figure: Contours of Probability Distribution, mixture of three Gaussians]

Mixture models (cont)
• How to learn the model from data?
• Don’t know the weights w1, ..., wK or the component parameters:
$$ f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta) $$
• Solution:
  • Interpret mixing coefficients as prior probabilities:
$$ p(x) = \sum_{k=1}^{K} p(k)\, p(x \mid k) $$
  • Use Expectation-Maximization (Dempster, Laird, Rubin, 1977)

Generative process
• For each data point:
  • Pick a component Gaussian randomly with probability wk
  • Draw a sample point from that component randomly
$$ f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta) $$
$$ p(x) = \sum_{k=1}^{K} p(k)\, p(x \mid k) $$

Sample dataset
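A dataset of this kind can be produced by following the generative process literally: pick a component with probability wk, then draw from it. The sketch below assumes 1-D Gaussian components for brevity; `sample_mixture` and the example parameter values are illustrative and are not the ones behind the slide's figure.

```python
import random

def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n points from a 1-D Gaussian mixture, returning (x, k) pairs
    where k is the index of the component that generated x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Pick a component index k with probability w_k ...
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # ... then draw the point from that component's Gaussian.
        data.append((rng.gauss(means[k], stds[k]), k))
    return data

# Three components; the labels k are exactly what a learner does not get to see.
data = sample_mixture(1000, [0.5, 0.3, 0.2], [-4.0, 0.0, 5.0], [1.0, 1.0, 1.0])
unlabeled = [x for x, _ in data]
```

Dropping the second element of each pair gives the unlabeled version of the same dataset.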
[Figure: Synthetic Data Set]

Learning the model
• We want to invert this process
• Given the dataset, find the parameters
  • Mixing coefficients
  • Component means and covariance matrices
• If we knew which component generated each point, then the MLE solution would involve fitting each component distribution to the appropriate cluster points
• Problem: the cluster memberships are hidden

Unlabeled dataset
[Figure: Synthetic Data Set Without Labels]

Posterior probabilities
• We can think of the mixing coefficients as prior probabilities for the components
• For a given value of x, we can evaluate the corresponding posterior probabilities of cluster membership with Bayes' theorem:
$$ \gamma_k(x) \equiv p(k \mid x) = \frac{p(k)\, p(x \mid k)}{p(x)} = \frac{w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)} $$

Posterior probabilities (colour coded)
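The posterior γk(x) is a direct application of Bayes' theorem and takes only a few lines to compute. This is a minimal sketch assuming 1-D Gaussian components (so each Σk reduces to a standard deviation); `normal_pdf` and `responsibilities` are illustrative names.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """gamma_k(x) = w_k N(x | mu_k) / sum_j w_j N(x | mu_j)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)                  # p(x), the mixture density at x
    return [j / total for j in joint]

# Two equally weighted components centred at -1 and +1.
print(responsibilities(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))  # equidistant point
print(responsibilities(3.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))  # near component 2
```

A point midway between the two means is split evenly, while a point far to one side is assigned almost entirely to the nearer component; colouring each point by these values is what a colour-coded posterior plot shows.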
[Figure: Posterior Probabilities, colour coded by component]

MLE for GMM
• Log likelihood takes the following form:
$$ \log p(D \mid w, \mu, \Sigma) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} w_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} $$
• Note the sum over components is inside the log
• There is no closed form solution for the MLE

EM algorithm
• Popular algorithm for parameter estimation in data with hidden/unobserved values
  • Hidden variables = cluster membership
• Basic idea:
  • Initialize hidden variables and parameters
  • Predict values for hidden variables given current parameters (E step)
  • Estimate parameters given current prediction for hidden variables (M step)
  • Repeat

Hidden variables
• If we know the values of the hidden variables we can maximize the complete data log-likelihood:
$$ \log p(x, z \mid \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \log w_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} $$
• This has a trivial closed form solution, except we don’t know the values of the hidden variables
• But, for a given set of parameters, we can compute the expected values of the hidden variables

EM for GMM
• Suppose we make a guess for the parameter values θ
• Use these to evaluate cluster memberships (E step):
$$ \gamma_k(x) \equiv p(k \mid x) = \frac{p(k)\, p(x \mid k)}{p(x)} = \frac{w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)} $$
• Now compute the log-likelihood using predicted cluster memberships:
$$ \log p(x, z \mid \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_k(x_n) \left\{ \log w_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} $$
• Use completed likelihood to determine MLE for θ (M step)

More on EM
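The E step and M step for a Gaussian mixture can be sketched in a few lines. The sketch below makes several assumptions: the data are 1-D, the initial means are supplied by the caller, weights start uniform and standard deviations start at 1, the loop runs for a fixed number of iterations instead of using a stopping criterion, and a small floor on the standard deviation guards against a component collapsing onto one point. `em_gmm_1d` and `normal_pdf` are illustrative names. The log-likelihood is recorded at every iteration so its behaviour can be inspected.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, means, iters=50):
    """Fit a 1-D Gaussian mixture by EM from the given initial means.
    Returns (weights, means, stds, log_likelihoods)."""
    K, N = len(means), len(xs)
    means = list(means)
    weights = [1.0 / K] * K
    stds = [1.0] * K
    history = []
    for _ in range(iters):
        # E step: gamma[n][k] = w_k N(x_n | mu_k, sigma_k) / p(x_n).
        gamma, loglik = [], 0.0
        for x in xs:
            joint = [w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, stds)]
            px = sum(joint)
            loglik += math.log(px)
            gamma.append([j / px for j in joint])
        history.append(loglik)   # log-likelihood under the current parameters
        # M step: closed-form weighted MLE for each component.
        for k in range(K):
            nk = sum(g[k] for g in gamma)
            weights[k] = nk / N
            means[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / nk
            var = sum(g[k] * (x - means[k]) ** 2
                      for g, x in zip(gamma, xs)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)   # guard against collapse
    return weights, means, stds, history

xs = [-2.2, -2.0, -1.8, -2.1, -1.9, 1.8, 2.0, 2.2, 2.1, 1.9]
w, mu, sd, hist = em_gmm_1d(xs, means=[-1.0, 1.0])
print(w, mu, sd)
```

On this toy data the recorded log-likelihoods never go down from one iteration to the next, and the means move from the initial guess to the two group centres.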
• Often both the E and the M step can be solved in closed form
• Neither the E step nor the M step can decrease the log-likelihood
• Algorithm is guaranteed to converge to a local maximum of the likelihood
• Must specify initialization and stopping criteria

Probabilistic clustering
• Model provides full distributional description for each component
• May be able to interpret differences in the distributions
• Soft clustering (compared to k-means hard clustering)
  • Given the model, each point has a k-component vector of membership probabilities
• Key cost: assumption of parametric model

How to choose k?
• Choose k to maximize likelihood?
  • As k increases, the value of the maximized likelihood cannot decrease
  • Thus, judged by likelihood alone, a more complex model never looks worse
• How to compare models with different complexities? ...
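One way to see why the maximized likelihood cannot decrease with k: a mixture with k+1 components whose last weight is set to zero is exactly a k-component mixture, so the best k-component fit is always available to the larger model. In the notation of the mixture model above (a sketch of the argument, with the maxima taken over all valid weights and component parameters):

```latex
\max_{w,\theta} \sum_{n=1}^{N} \log \sum_{j=1}^{k+1} w_j f_j(x_n;\theta)
\;\ge\;
\max_{w,\theta \,:\, w_{k+1}=0} \sum_{n=1}^{N} \log \sum_{j=1}^{k+1} w_j f_j(x_n;\theta)
\;=\;
\max_{w,\theta} \sum_{n=1}^{N} \log \sum_{j=1}^{k} w_j f_j(x_n;\theta)
```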