# Lecture 18 - Data Mining CS57300, Purdue University (November 16, 2010)


## Descriptive modeling: evaluation

### Cluster validity

- For prediction tasks there are a variety of external evaluation metrics: accuracy, squared loss, area under the ROC curve, etc.
- For cluster analysis, the external evaluation should assess the "goodness" of the resulting clusters.
- Why do we want external validation?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters

### Random data

[Figure: clusters "found" in uniformly random data by DBSCAN, k-means, and complete link; each algorithm imposes apparent cluster structure even though none exists. Source: Data Mining: Concepts and Techniques slides, October 30, 2007.]

### Evaluation approaches

- Determine the clustering tendency of the data
- Evaluate the clusters using known class labels
- Evaluate how well the clusters "fit" the data
- Determine which of two different clustering results is better
- Determine the "correct" number of clusters

### Evaluation measures

- Supervised: measures the extent to which clusters match external class label values
- Unsupervised: measures goodness of fit without class labels
- Relative: compares two clusterings, often using an external or internal index

## Unsupervised measures

### Correlation

- Compute the correlation between the initial similarity matrix and an "ideal" cluster matrix whose entry (i, j) is 1 if points i and j are in the same cluster and 0 otherwise.
- High correlation indicates that points in the same cluster are close to each other.
- Assumes that the proximity values lie in [0, 1].
- When is this a good measure?
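The correlation index can be sketched with NumPy. This is a minimal illustration, not the slides' exact procedure: the function name `cluster_correlation` and the toy data are mine, and plain Pearson correlation over the upper-triangular pair entries stands in for the matrix correlation.

```python
import numpy as np

def cluster_correlation(X, labels):
    """Pearson correlation between the pairwise-distance matrix and the
    'ideal' cluster matrix (entry (i, j) is 1 if i and j share a cluster)."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    ideal = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(n, k=1)          # use each pair once, skip the diagonal
    return np.corrcoef(dist[iu], ideal[iu])[0, 1]

# Two tight, well-separated clusters: same-cluster pairs have small distances,
# so the correlation with the 0/1 ideal matrix is strongly negative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(cluster_correlation(X, labels))     # strongly negative, near -1
```

Because the matrix here holds distances rather than [0, 1] similarities, a good clustering yields a strongly negative correlation, which matches the sign of the Corr values on the example slide.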
### Example: measuring cluster validity via correlation

Correlation of the ideal similarity and proximity matrices for the k-means clusterings of two data sets:

[Figure: for the data set with three well-separated clusters, Corr = -0.9235; for the random data, Corr = -0.5810.]

### Visual inspection

- Order the proximity matrix with respect to cluster labels and inspect it visually.
- Good clusterings exhibit a clear block pattern.

[Figure: similarity matrix for 100 points forming three well-separated clusters, with rows and columns ordered by cluster label; three bright blocks appear on the diagonal.]

- Clusters in random data are not so crisp: the reordered similarity matrix for a k-means clustering of random data shows only a faint block pattern.
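The reordering step behind visual inspection can be sketched as follows; the helper name `reorder_by_labels` and the toy data are my own, and the final averages quantify the block pattern one would otherwise eyeball in a heatmap.

```python
import numpy as np

def reorder_by_labels(sim, labels):
    """Sort rows and columns of a similarity matrix by cluster label so a
    good clustering appears as bright blocks on the diagonal."""
    order = np.argsort(labels, kind="stable")
    return sim[np.ix_(order, order)]

# Two well-separated clusters, presented in shuffled order.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)
perm = rng.permutation(30)
X, labels = X[perm], labels[perm]

dist = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
sim = 1 - dist / dist.max()               # rescale distances to [0, 1] similarities
ordered = reorder_by_labels(sim, labels)

# After reordering, the two 15x15 diagonal blocks are bright (high similarity)
# while the off-diagonal blocks are dark.
within = (ordered[:15, :15].mean() + ordered[15:, 15:].mean()) / 2
between = ordered[:15, 15:].mean()
print(within, between)
```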
### Internal measures: cohesion and separation

Cohesion of a cluster Ci with centroid ci:

    cohesion(Ci) = sum over x in Ci of proximity(x, ci)

Separation between clusters Ci and Cj, and of a cluster from the overall centroid c:

    separation(Ci, Cj) = proximity(ci, cj)
    separation(Ci) = proximity(ci, c)

If we let proximity be the squared Euclidean distance, then cohesion is the cluster SSE.

### Cohesion

- Measures how closely related the objects are within each cluster.
- Within-cluster sum of squared errors (SSE): for each point, the error is the distance to the centroid.
- Within-cluster pairwise weighting: sum the distances between all pairs of points in the same cluster.

### Separation

- Measures how distinct a cluster is from the other clusters.
- Between-cluster SSE (for cluster C): for each other cluster C', the error is the distance from centroid c to centroid c', multiplied by the cluster size |C'|.
- Between-cluster pairwise weighting: sum the distances between all pairs of points in different clusters.

### Cohesion and separation

- The sum of the between-cluster SSE and the within-cluster SSE equals the total sum of squared errors (the distance of each point to the overall mean).
- Thus minimizing the within-cluster SSE (maximizing cohesion) is equivalent to maximizing the between-cluster SSE (separation).

### Silhouette coefficient

- Combines both cohesion and separation.
- For an individual point i:
  - A = average distance of i to points in the same cluster
  - B = average distance of i to points in other clusters
  - S = (B - A) / max(A, B)
- Can calculate the average S for a cluster or for a whole clustering.
- Closer to 1 is better.

### Cophenetic distance

- Defined for hierarchical clustering techniques.
- The cophenetic distance between two objects is the similarity level at which an agglomerative clustering technique puts the objects in the same cluster for the first time.
- A cophenetic distance matrix over all pairs of objects describes a hierarchical clustering.
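The silhouette coefficient can be sketched from scratch. One assumption to flag: B is taken as the smallest average distance to any other single cluster, which is the usual reading of "average distance to points in other clusters"; the function name is mine.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient S = (B - A) / max(A, B)."""
    dist = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                    # exclude the point itself
        if not same.any():                 # singleton cluster: define S = 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()           # cohesion: avg distance within own cluster
        b = min(dist[i, labels == c].mean()  # separation: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
print(silhouette(X, labels))              # close to 1 for well-separated clusters
```

Assigning labels that mix the two blobs drives the score toward 0 or below, which is how the measure flags a poor clustering.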
### Example: evaluation of hierarchical clustering

The table below shows the cophenetic distance matrix for the single-link clustering of six two-dimensional points.

[Table: the original dissimilarity (Euclidean distance) matrix for points P1-P6, alongside the cophenetic distance matrix produced by single link.]

### Cophenetic correlation

Measure the correlation between the original dissimilarity matrix D and the cophenetic matrix CD:

    CPCC = Cov(D, CD) / [Var(D) * Var(CD)]^(1/2)

## Supervised measures

### Class-label evaluation

- If you have class labels, why cluster?
- Usually there is a small hand-labeled dataset for evaluation but a large dataset to cluster automatically.
- We may want to assess how closely the clusterings correspond to the classes while still allowing for more variation in the clusters.

### Classification-oriented measures

- Purity: a measure of the extent to which a cluster contains objects of a single class.
  - Let p_ij be the fraction of cluster i that belongs to class j. The purity of cluster i is p_i = max_j p_ij, and the overall purity is

        purity = sum_{i=1..K} (N_i / N) * p_i

- Entropy: the degree to which each cluster consists of objects of a single class.
  - For each cluster i, compute the probability p_ij of each class j:

        e_i = - sum_{j=1..C} p_ij log p_ij

- Normalized mutual information: measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are.

      NMI(C, G) = I(C, G) / [H(C) + H(G)]
                = [sum_{c,g} p(c, g) log (p(c, g) / (p(c) p(g)))]
                  / [- sum_c p(c) log p(c) - sum_g p(g) log p(g)]

- Precision: the fraction of a cluster that consists of objects of a specified class.
- Recall: the extent to which a cluster contains all objects of a specified class.
- Accuracy: why is it hard to measure the accuracy of a clustering if you know the class labels?
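Purity and entropy can be sketched directly from their definitions; the tiny class/cluster assignment below is invented for illustration.

```python
import math
from collections import Counter

def purity(classes, clusters):
    """Overall purity = sum_i (N_i / N) * max_j p_ij."""
    N = len(classes)
    hits = 0
    for c in set(clusters):
        members = [classes[i] for i in range(N) if clusters[i] == c]
        hits += Counter(members).most_common(1)[0][1]   # size of the majority class
    return hits / N

def cluster_entropy(classes, clusters):
    """Weighted average over clusters of e_i = -sum_j p_ij log2 p_ij."""
    N = len(classes)
    total = 0.0
    for c in set(clusters):
        members = [classes[i] for i in range(N) if clusters[i] == c]
        for count in Counter(members).values():
            p = count / len(members)
            total += (len(members) / N) * (-p * math.log2(p))
    return total

classes  = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 0, 0, 1, 1]       # one 'b' landed in cluster 0
print(purity(classes, clusters))    # 5/6, since 5 of 6 points sit with their majority class
print(cluster_entropy(classes, clusters))
```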
### Similarity-oriented measures

- Based on the premise that any pair of objects in the same cluster should have the same class, and vice versa.
- Compare the "ideal" cluster similarity matrix to the "ideal" class similarity matrix.
- Approaches:
  - Correlation between the two ideal matrices.
  - Measures of binary similarity between the two ideal matrices, based on pair counts:
    - f00 = number of pairs of objects with a different class and a different cluster
    - f01 = number of pairs of objects with a different class and the same cluster
    - f10 = number of pairs of objects with the same class and a different cluster
    - f11 = number of pairs of objects with the same class and the same cluster

      Rand = (f00 + f11) / (f00 + f01 + f10 + f11)

      Jaccard = f11 / (f01 + f10 + f11)

## Determining the correct number of clusters

- Various unsupervised cluster evaluation measures can be used to approximately determine the correct number of clusters.
- Approach: evaluate the measure over a range of numbers of clusters and look for a peak, dip, or knee in the evaluation measure.
- Example: for a data set with 10 natural clusters, the SSE curve and the average silhouette coefficient both point to k = 10.

[Figure: SSE vs. number of clusters, and average silhouette coefficient vs. number of clusters, for a data set with 10 natural clusters.]

## Clustering tendency

- Evaluate whether a dataset has clusters without actually clustering it.
- Most common approach (for low-dimensional Euclidean data): use a statistical test for spatial randomness.
- Hopkins statistic: sample p points (e.g., 20) from the dataset and generate p random points in the same space; then

      H = (sum_{i=1..p} u_i) / (sum_{i=1..p} u_i + sum_{i=1..p} w_i)

  where u_i is the distance from a random point to its nearest neighbor in the data, and w_i is the distance from a sampled data point to its nearest neighbor in the data.

## Assessing significance

- How do we know a score is "good"?
- How do we know that a difference between two algorithms is significant?
- This is the same problem we had for predictive models: we need a sampling distribution to compare against.
- We can generate random data in the same space and compute an empirical sampling distribution.
- We can partition the data into multiple folds for evaluation.

## Density estimation

### Model selection

- Goal 1: summarize the data.
  - Describe the data as precisely as possible.
  - A general approach based on data compression and information theory uses the score function

        Score(θ, M) = (# bits to describe the model) + (# bits to describe the exceptions in the data)
                    = -log p(θ, M) - log p(D | θ, M)

- Goal 2: generalize to new data.
  - Goodness of fit is part of the evaluation.
  - But since the data is not the entire population, we want to learn a model that will generalize to new data instances.
  - Thus we want to strike a balance between how well the model fits the data and the simplicity of the model.

### Score functions

- Score(θ, M) = error(M) + penalty(M)
- The penalty may depend on the number of parameters in the model (p) and the number of data points (n).
- AIC: Akaike information criterion

      Score_AIC = -2 log L + 2p

- BIC: Bayesian information criterion

      Score_BIC = -2 log L + p log n

- Other score functions: minimum description length, structural risk minimization.

### Bayesian networks

- Goodness of fit: AIC, BIC, MDL, BD.
- How do we assess the quality of structure learning? Use synthetic data generated from a specified Bayes net, try to recover the structure, and compare to the "gold standard".

## Next class

- Reading: Chapter 13 PDM
- Topic: Pattern mining
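To make the AIC/BIC trade-off concrete, here is a tiny numeric sketch; the log-likelihoods and parameter counts are invented for illustration, and lower scores are better under both criteria.

```python
import math

def score_aic(log_l, p):
    """AIC = -2 log L + 2p."""
    return -2 * log_l + 2 * p

def score_bic(log_l, p, n):
    """BIC = -2 log L + p log n."""
    return -2 * log_l + p * math.log(n)

# Model B fits a little better (higher log-likelihood) but spends many more
# parameters; both criteria prefer the simpler model A here, and BIC's
# penalty grows with n, so it punishes the extra parameters more heavily.
n = 100
log_l_a, p_a = -250.0, 3
log_l_b, p_b = -245.0, 12
print(score_aic(log_l_a, p_a), score_aic(log_l_b, p_b))     # 506.0 514.0
print(score_bic(log_l_a, p_a, n), score_bic(log_l_b, p_b, n))
```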

## This note was uploaded on 03/13/2012 for the course CS 573 taught by Professor Staff during the Fall '08 term at Purdue University-West Lafayette.
