class10

# class10 - The k-means algorithm is sensitive to outliers...


## What Is the Problem of the K-Means Method?

- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: scatter plots on axes from 0 to 10.]
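To make the contrast between a mean and a medoid concrete, here is a minimal sketch on one-dimensional data. The data values and the helper names `mean` and `medoid` are assumptions chosen for illustration; they are not taken from the slides.

```python
def mean(values):
    """Arithmetic mean: the reference point k-means uses for a cluster."""
    return sum(values) / len(values)


def medoid(values):
    """The member of `values` with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [100]  # hypothetical extreme value, for illustration

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

With the outlier added, the mean moves to a value far from every object in the cluster, while the medoid is still one of the original central objects.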

## The K-Medoids Clustering Method

- Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.
- PAM works effectively for small data sets, but does not scale well for large data sets.
- Approaches for larger data sets: CLARA, CLARANS, and focusing + spatial data structure (Ester et al., 1995).

## A Typical K-Medoids Algorithm (PAM)

[Figure: flow of PAM with K = 2, illustrated by scatter plots on axes from 0 to 10. One configuration is labeled Total Cost = 20 and another Total Cost = 26.]

1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object, $O_{random}$.
4. Compute the total cost of swapping.
5. Swap O and $O_{random}$ if the quality is improved.
6. Loop until no change.

## PAM (Partitioning Around Medoids) (1987)

- PAM (Kaufman and Rousseeuw, 1987) is built into S-Plus.
- It uses real objects to represent the clusters.

1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost $TC_{ih}$.
3. For each pair of i and h, if $TC_{ih} < 0$, i is replaced by h. Then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
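The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not the S-Plus implementation, and it rests on several assumptions: the first k objects serve as the arbitrary initial medoids, a swap is applied as soon as a negative swapping cost is found, and the swapping cost is obtained by recomputing the total cost from scratch rather than incrementally. The names `pam`, `total_cost`, and `manhattan` are hypothetical.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_cost(points, medoids, dist):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist):
    n = len(points)
    medoids = list(range(k))  # assumption: "arbitrary" choice = first k objects
    improved = True
    while improved:  # repeat until no swap lowers the cost
        improved = False
        for pos in range(k):          # position of selected object i
            for h in range(n):        # candidate non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                # Swapping cost, computed here as a difference of total costs.
                tc = (total_cost(points, candidate, dist)
                      - total_cost(points, medoids, dist))
                if tc < 0:
                    medoids = candidate
                    improved = True
    # Assign each object to its most similar (nearest) representative object.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, labels = pam(points, k=2, dist=manhattan)
print(medoids, labels)
```

Each accepted swap strictly lowers the total cost, so the loop terminates. On this small hypothetical data set the two initial medoids start in the same group and end up one per group.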
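
## PAM Clustering: Total Swapping Cost

$$TC_{ih} = \sum_j C_{jih}$$

Here i is a current medoid, h is the non-selected object considered as its replacement, j is another non-selected object, and t is another current medoid. The figure (four scatter plots on axes from 0 to 10) shows the contribution $C_{jih}$ in each case:

- j stays with its medoid t: $C_{jih} = 0$
- j moves from i to h: $C_{jih} = d(j, h) - d(j, i)$
- j moves from i to t: $C_{jih} = d(j, t) - d(j, i)$
- j moves from t to h: $C_{jih} = d(j, h) - d(j, t)$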
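As an illustration of how the per-object contributions add up to the total swapping cost, the sketch below evaluates the four cases for one candidate swap and compares the result with the directly computed change in total cost. The function names and the data are assumptions made for this example. As a simplification, the sum runs over all objects, including i and h themselves, which the same four formulas handle because the distance from an object to itself is 0.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def swap_cost(points, medoids, i, h, dist):
    """TC_ih: sum of C_jih over all objects j when medoid i is replaced by h."""
    others = [t for t in medoids if t != i]
    tc = 0
    for p in points:
        d_i = dist(p, points[i])
        d_h = dist(p, points[h])
        # Distance to the closest remaining medoid t (infinite if k == 1).
        d_t = min((dist(p, points[t]) for t in others), default=float("inf"))
        if d_i < d_t:            # j currently belongs to i
            if d_h < d_t:
                tc += d_h - d_i  # j moves from i to h
            else:
                tc += d_t - d_i  # j moves from i to t
        elif d_h < d_t:
            tc += d_h - d_t      # j moves from t to h
        # otherwise j stays with t and contributes 0
    return tc


def total_cost(points, medoids, dist):
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids = [0, 1]
tc = swap_cost(points, medoids, i=0, h=3, dist=manhattan)
direct = (total_cost(points, [3, 1], manhattan)
          - total_cost(points, medoids, manhattan))
print(tc, direct)
```

For this data both values are -37, so replacing medoid 0 by object 3 improves the clustering and the swap would be accepted.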

- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works efficiently for small data sets but does not scale well for large data sets. O(k(n-k)

## This note was uploaded on 08/29/2011 for the course CAP 4770 taught by Professor Staff during the Fall '08 term at FIU.
