Data Mining
CS57300
Purdue University
November 16, 2010

Descriptive modeling: evaluation

Cluster validity
• For prediction tasks there are a variety of external evaluation metrics
• Accuracy, squared loss, area under ROC, etc.
• For cluster analysis, the evaluation should assess the “goodness” of the resulting clusters
• Why do we want external validation?
• To avoid finding patterns in noise
• To compare clustering algorithms
• To compare two sets of clusters

Random data
[Figure: “Clusters found in Random Data.” Four panels show the same random points: the original data, and the clusterings produced by DBSCAN, K-means, and complete link. (Source slides: Data Mining: Concepts and Techniques, October 30, 2007.)]

Evaluation approaches
• Determine the clustering tendency of the data
• Evaluate the clusters using known class labels
• Evaluate how well the clusters “ﬁt” the data
• Determine which of two different clustering results is better
• Determine the “correct” number of clusters

Evaluation measures
• Supervised
• Measures the extent to which clusters match external class label values
• Unsupervised
• Measures goodness of ﬁt without class labels
• Relative
• Compares two clusterings, often using an external or internal index

Unsupervised Correlation
• Compute the correlation between the initial similarity matrix and an “ideal”
cluster matrix
• Entry i,j is 1 if i and j are in the same cluster, 0 otherwise
• High correlation indicates that points in the same cluster are close to each
other
• Assumes that the proximity values are [0,1]
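A minimal sketch of this measure, assuming numpy is available (the similarity matrix and labels below are illustrative):

```python
import numpy as np

def ideal_matrix(labels):
    """Entry (i, j) is 1 if points i and j share a cluster, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def matrix_correlation(sim, labels):
    """Pearson correlation between a similarity matrix and the ideal
    cluster matrix, computed over the strict upper triangle (the
    diagonal and duplicated entries carry no information)."""
    ideal = ideal_matrix(labels)
    iu = np.triu_indices(len(labels), k=1)
    return np.corrcoef(sim[iu], ideal[iu])[0, 1]

# Toy example: two tight groups, similarities in [0, 1]
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
labels = [0, 0, 1, 1]
print(matrix_correlation(sim, labels))  # high when clusters match the similarities
```

A mismatched labeling (e.g. `[0, 1, 0, 1]` here) drives the correlation down, which is what makes this usable as a validity score.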
• When is this a good measure?

Measuring Cluster Validity Via Correlation
Example: correlation of the ideal similarity and proximity matrices for the K-means clusterings of two data sets. [Figure: a data set with three well-separated clusters gives Corr = -0.9235; a random data set gives Corr = -0.5810. (Data Mining: Concepts and Techniques)]

Visual inspection
• Order the proximity matrix with respect to cluster labels
• Inspect visually
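The reordering step can be sketched as follows (numpy; the proximity matrix and labels are illustrative):

```python
import numpy as np

def sort_by_cluster(prox, labels):
    """Reorder a proximity matrix so points in the same cluster are
    adjacent; a good clustering then shows a block-diagonal pattern."""
    order = np.argsort(np.asarray(labels), kind="stable")
    return prox[np.ix_(order, order)], order

# Illustrative: similarities for points whose cluster labels are interleaved
prox = np.array([[1.0, 0.1, 0.9, 0.2],
                 [0.1, 1.0, 0.2, 0.8],
                 [0.9, 0.2, 1.0, 0.1],
                 [0.2, 0.8, 0.1, 1.0]])
labels = [0, 1, 0, 1]
sorted_prox, order = sort_by_cluster(prox, labels)
# After sorting, the large similarities sit inside the diagonal blocks
```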
• Good clusterings exhibit a clear block pattern

Using Similarity Matrix for Cluster Validation
Example 1
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: points in three well-separated clusters (left) and the corresponding sorted similarity matrix (right), which shows a clear block-diagonal pattern.]

Example II
Clusters in random data are not so crisp.
[Figure: random points clustered with K-means (left) and the corresponding sorted similarity matrix (right); the block pattern is much weaker.]

Cohesion and separation
Internal Measures: Cohesion and Separation
A prototype-based view of cohesion and separation:

cohesion(Ci) = Σ_{x ∈ Ci} proximity(x, ci)
separation(Ci, Cj) = proximity(ci, cj)
separation(Ci) = proximity(ci, c)

where ci is the centroid of cluster Ci and c is the overall centroid. If we let proximity be the squared Euclidean distance, then cohesion is the cluster SSE.

Cohesion
• Measures how closely related the objects are within each cluster
• Within cluster sum of squared errors (SSE)
• For each point, the error is the distance to the centroid
• Within cluster pairwise weighting
• Sum distances between all pairs of points in the same cluster

Separation
• Measures how distinct a cluster is from the other clusters
• Between cluster SSE (for cluster C)
• For each cluster C’, the error is the distance from the centroid c to the
other centroid c’
• The error is multiplied by the size of cluster C’
• Between cluster pairwise weighting
• Sum distances between all pairs of points in different clusters

Cohesion and separation
• The sum of the between cluster SSE and within cluster SSE is equal to the
total sum of squared error (distance of each point to overall mean)
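This decomposition can be checked numerically. A minimal sketch, using the common prototype-based form in which the between-cluster term measures each centroid's squared distance to the overall mean, weighted by cluster size (numpy; the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)

overall_mean = X.mean(axis=0)
wss = bss = 0.0
for k in np.unique(labels):
    pts = X[labels == k]
    centroid = pts.mean(axis=0)
    wss += ((pts - centroid) ** 2).sum()                       # within-cluster SSE
    bss += len(pts) * ((centroid - overall_mean) ** 2).sum()   # between-cluster SSE

tss = ((X - overall_mean) ** 2).sum()                          # total SSE
print(wss + bss, tss)  # equal up to floating-point error
```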
• Thus minimizing the within-cluster SSE (cohesion) is equivalent to maximizing the between-cluster SSE (separation)

Silhouette coefficient
• Combines both cohesion and separation
• For an individual point i:
• A = average distance of i to points in same cluster
• B = average distance of i to points in other clusters
• S = (B − A) / max(A, B)
• Can calculate average S for a cluster or clustering
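A sketch of the per-point computation (numpy; it follows the slide's definition of B as the average distance to all points in other clusters — with more than two clusters, the standard definition instead uses the nearest other cluster):

```python
import numpy as np

def silhouette(X, labels):
    """Average silhouette coefficient over all points."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        other = ~same
        same[i] = False            # exclude the point itself
        if not same.any() or not other.any():
            scores.append(0.0)     # singleton cluster: score defined as 0
            continue
        A = D[i, same].mean()      # avg distance within own cluster
        B = D[i, other].mean()     # avg distance to other clusters
        scores.append((B - A) / max(A, B))
    return float(np.mean(scores))

# Two well-separated pairs of points should score close to 1
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]])
print(silhouette(X, [0, 0, 1, 1]))
```

Scrambling the labels (e.g. `[0, 1, 0, 1]`) makes A large and B small, so the score goes negative.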
• Closer to 1 is better

Cophenetic distance
• For hierarchical clustering techniques
• Cophenetic distance between objects
• Similarity level at which an agglomerative clustering technique puts the
objects in the same cluster for the ﬁrst time
• Can define a cophenetic distance matrix between all pairs of objects to describe a hierarchical clustering

Evaluation of Hierarchical Clustering
Example: the cophenetic distance matrix for the single-link clustering of 6 two-dimensional points.
[Tables: the original dissimilarity matrix and the corresponding single-link cophenetic matrix for points p1–p6.]
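A small numeric sketch of a cophenetic matrix and the cophenetic correlation coefficient (CPCC) discussed next. The four collinear points and their single-link merge heights are illustrative, not the slide's six points:

```python
import numpy as np

# Four points on a line: a=0, b=1, c=5, d=6. Single link merges
# {a,b} at height 1, {c,d} at height 1, then both clusters at height 4
# (the closest cross-cluster pair is b-c, distance 4).
# Condensed pair order: (a,b) (a,c) (a,d) (b,c) (b,d) (c,d)
D  = np.array([1.0, 5.0, 6.0, 4.0, 5.0, 1.0])  # original dissimilarities
CD = np.array([1.0, 4.0, 4.0, 4.0, 4.0, 1.0])  # cophenetic distances

# CPCC = Cov(D, CD) / sqrt(Var(D) * Var(CD)), i.e. a Pearson correlation
cpcc = np.corrcoef(D, CD)[0, 1]
print(cpcc)  # close to 1: the dendrogram preserves the distances well
```

In practice `scipy.cluster.hierarchy.cophenet` computes both the cophenetic distances and this coefficient directly from a linkage matrix.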
Cophenetic correlation
• Measure the correlation between the original dissimilarity matrix D and the cophenetic distance matrix CD:

CPCC = Cov(D, CD) / [Var(D) · Var(CD)]^{1/2}

Supervised Class-label evaluation
• If you have class labels why cluster?
• Usually a small hand-labeled dataset for evaluation
• But large dataset to cluster automatically
• May want to assess how closely clusterings correspond to classes but still allow for more variation in the clusters

Classification-oriented
• Purity: another measure of the extent to which a cluster contains objects of a
single class
• The purity of cluster i is p_i = max_j p_ij; the overall purity is

purity = Σ_{i=1}^{K} (N_i / N) p_i

where N_i is the size of cluster i, N the total number of points, and K the number of clusters

• Entropy: the degree to which each cluster consists of objects of a single class
• For each cluster i, compute p_ij, the probability that a member of cluster i belongs to class j; then

e_i = − Σ_{j=1}^{C} p_ij log p_ij

Classification-oriented
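The purity and per-cluster entropy measures can be sketched as follows (numpy; the cluster-by-class contingency counts are illustrative):

```python
import numpy as np

# Contingency counts: rows = clusters, columns = classes (illustrative)
counts = np.array([[9, 1],    # cluster 1: mostly class A
                   [2, 8]])   # cluster 2: mostly class B
N = counts.sum()
Ni = counts.sum(axis=1)            # cluster sizes N_i
pij = counts / Ni[:, None]         # P(class j | cluster i)

# Overall purity: size-weighted max class probability per cluster
purity = (Ni / N * pij.max(axis=1)).sum()

# Per-cluster entropy e_i = -sum_j p_ij log2 p_ij (0 log 0 treated as 0),
# averaged with the same size weights
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(pij > 0, pij * np.log2(pij), 0.0)
entropy = -(Ni / N * terms.sum(axis=1)).sum()

print(purity, entropy)  # high purity and low entropy indicate clean clusters
```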
• Normalized mutual information (NMI):
• Measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are

NMI(C, G) = I(C, G) / (H(C) + H(G))

where I(C, G) = Σ_c Σ_g p(c, g) log [ p(c, g) / (p(c) p(g)) ], H(C) = − Σ_c p(c) log p(c), and H(G) = − Σ_g p(g) log p(g)

Classification-oriented
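A sketch of NMI under this normalization, which divides the mutual information by the sum H(C) + H(G), so a perfect clustering scores 0.5 rather than 1 (numpy; counts illustrative):

```python
import numpy as np

def nmi(counts):
    """NMI(C, G) = I(C, G) / (H(C) + H(G));
    counts[c, g] = # objects with class c and cluster g."""
    p = counts / counts.sum()                 # joint distribution p(c, g)
    pc, pg = p.sum(axis=1), p.sum(axis=0)     # marginals p(c), p(g)
    nz = p > 0                                # skip zero cells (0 log 0 = 0)
    I = (p[nz] * np.log(p[nz] / np.outer(pc, pg)[nz])).sum()
    H = lambda q: -(q[q > 0] * np.log(q[q > 0])).sum()
    return I / (H(pc) + H(pg))

# A perfect clustering: each cluster is exactly one class
print(nmi(np.array([[5, 0], [0, 5]])))  # 0.5 under this normalization
```

A clustering independent of the classes (e.g. uniform counts) gives I = 0 and hence NMI = 0.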
• Precision
• The fraction of a cluster that consists of objects of a speciﬁed class
• Recall
• The extent to which a cluster contains all objects of a speciﬁed class
• Accuracy
• Why is it hard to measure the accuracy of a clustering if you know class labels?

Similarity-oriented
• Based on premise that any pair of objects in the same cluster should have the
same class and vice versa
• Compare the “ideal” cluster similarity matrix to the “ideal” class similarity matrix

Approaches
• Correlation between two ideal matrices
• Measures of binary similarity between two ideal matrices
• f00 = # pairs of objects having diff class and diff cluster
• f01 = # pairs of objects having diff class and same cluster
• f10 = # pairs of objects having same class and diff cluster
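Counting the four pair types, and the Rand and Jaccard indices built from them, can be sketched as follows (the class and cluster labels are illustrative):

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count the four pair types over all pairs of objects."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_clust = clusters[i] == clusters[j]
        if same_class and same_clust:
            f11 += 1
        elif same_class:
            f10 += 1
        elif same_clust:
            f01 += 1
        else:
            f00 += 1
    return f00, f01, f10, f11

classes  = [0, 0, 0, 1, 1]
clusters = [0, 0, 1, 1, 1]
f00, f01, f10, f11 = pair_counts(classes, clusters)
rand    = (f00 + f11) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print(rand, jaccard)
```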
• f11 = # pairs of objects having same class and same cluster

Rand = (f00 + f11) / (f00 + f01 + f10 + f11)

Jaccard = f11 / (f01 + f10 + f11)

Determining the Correct Number of Clusters
Determining k
• Various unsupervised cluster evaluation measures can be used to approximately determine the correct number of clusters
• Approach: evaluate the measure over a range of values of k and look for a peak, dip, or knee
• Example: a data set with 10 natural clusters. [Figure: SSE vs. number of clusters, and average silhouette coefficient vs. number of clusters. (Data Mining: Concepts and Techniques)]

Clustering tendency
• Evaluate whether a dataset has clusters without clustering
• Most common approach (for lowdimensional Euclidean data)
• Use a statistical test for spatial randomness
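One such test, the Hopkins statistic, can be sketched as follows (numpy; under the ratio convention used here, values near 0.5 suggest spatial randomness and clustered data pushes H toward 0 — note that some references invert the ratio so that clustered data scores near 1):

```python
import numpy as np

def hopkins(X, p=20, seed=0):
    """H = sum(w) / (sum(u) + sum(w)), where u_i is the NN distance of a
    uniform random point to the data and w_i is the NN distance of a
    sampled data point to the rest of the data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def nn_dist(q, ref):
        d = np.sqrt(((q[:, None, :] - ref[None, :, :]) ** 2).sum(-1))
        return d.min(axis=1)

    # u: random points in the bounding box, NN distance to the data
    u = nn_dist(rng.uniform(lo, hi, size=(p, X.shape[1])), X)
    # w: sampled data points, NN distance to the remaining data
    idx = rng.choice(len(X), size=p, replace=False)
    w = np.array([nn_dist(X[i:i + 1], np.delete(X, i, axis=0))[0]
                  for i in idx])
    return w.sum() / (u.sum() + w.sum())

# Two tight blobs: strongly clustered, so H should be well below 0.5
blobs = np.vstack([np.random.default_rng(1).normal(0, 0.01, (50, 2)),
                   np.random.default_rng(2).normal(5, 0.01, (50, 2))])
print(hopkins(blobs))
```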
• Hopkins statistic: sample p = 20 points from the dataset and generate p random points in the same space

H = (Σ_{i=1}^p w_i) / (Σ_{i=1}^p u_i + Σ_{i=1}^p w_i)

where u_i is the distance from the i-th random point to its nearest neighbor in the data, and w_i is the distance from the i-th sample point to its nearest neighbor in the data

Assessing significance
• How do we know a score is “good”?
• How do we know that a difference between two algorithms is signiﬁcant?
• This is the same problem we had for predictive models
• Need a sampling distribution to compare to
• Can generate random data in same space and compute empirical
sampling distribution
• Can partition data to get multiple folds for evaluation

Density estimation

Model selection
• Goal 1: Summarize the data
• Describe data as precisely as possible
• General approach based on data compression and information theory
uses score function:
• Score(θ, M) = # bits to describe the model + # bits to describe the exceptions in the data
= − log p(θ, M) − log p(D | θ, M)

Model selection
• Goal 2: Generalize to new data
• Goodness of ﬁt is part of the evaluation
• But since the data is not the entire population, we want to learn a model
that will generalize to other new data instances
• Thus, want to strike a balance between how well the model fits the data and the simplicity of the model

Score functions
• Score(θ,M) = error(M) + penalty(M)
• Penalty may depend on the number of parameters in the model (p) and the
number of data points (n)
• AIC: Akaike information criterion
Score_AIC = −2 log L + 2p
• BIC: Bayesian information criterion
Score_BIC = −2 log L + p log n
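A trivial sketch of the two criteria (the log-likelihoods and parameter counts below are illustrative; lower scores are better):

```python
import numpy as np

def aic(log_likelihood, p):
    """Score_AIC = -2 log L + 2p."""
    return -2 * log_likelihood + 2 * p

def bic(log_likelihood, p, n):
    """Score_BIC = -2 log L + p log n. BIC penalizes parameters more
    heavily than AIC once n > e^2 (about 7.4 data points)."""
    return -2 * log_likelihood + p * np.log(n)

# A richer model fits better (higher log L) but pays a larger penalty
simple = {"logL": -120.0, "p": 2}
rich   = {"logL": -115.0, "p": 10}
n = 100
print(aic(simple["logL"], simple["p"]), aic(rich["logL"], rich["p"]))
print(bic(simple["logL"], simple["p"], n), bic(rich["logL"], rich["p"], n))
```

With these numbers both criteria prefer the simpler model: the 5-unit gain in log-likelihood does not cover the penalty for 8 extra parameters.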
• Other functions: minimum description length, structural risk minimization

Bayesian networks
• Goodness of ﬁt
• AIC, BIC, MDL, BD
• How to assess quality of structure learning?
• Use synthetic data generated from speciﬁed Bayes net, try to recover
structure, compare to the “gold standard”

Next class
• Reading: Chapter 13 PDM
• Topic: Pattern mining