07-clustering

3. Add Ns, SUMs, SUMSQs. 4. Consider merging compressed...


11/26/2010 Jure Leskovec, Stanford C246: Mining Massive Datasets

Two questions remain:
1. How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
2. How do we decide whether two compressed sets deserve to be combined into one?

We need a way to decide whether to put a new point into a cluster. BFR suggests two ways:
1. The Mahalanobis distance is less than a threshold.
2. Low likelihood of the currently nearest centroid changing.

Mahalanobis distance: the normalized Euclidean distance from the centroid. For a point (x1,…,xd) and centroid (c1,…,cd), where σi is the standard deviation of the cluster's points in dimension i:
1. Normalize in each dimension: yi = (xi - ci)/σi
2. Take the sum of the squares of the yi's.
3. Take the square root.

If clusters are normally distributed in d dimensions, then after this transformation one standard deviation corresponds to a distance of √d: a point that lies exactly one σi from the centroid in every dimension has yi = 1 for all i, so its Mahalanobis distance is √d. That is, roughly 70% of the points of the cluster will have a Mahalanobis distance < √d.

Accept a point for a cluster if its Mahalanobis distance is below some threshold, e.g. 4 standard deviations.

[Figure: cluster illustration with σ and 2σ marked]
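The three normalization steps and the threshold test described above can be sketched in Python as follows. This is only an illustrative sketch, not code from the lecture: the function names `mahalanobis_distance` and `accept_point` and the parameter `num_std` are hypothetical, every σi is assumed to be strictly positive, and "4 standard deviations" is interpreted here as a threshold of 4·√d, which is an assumption based on the statement that one standard deviation corresponds to √d.

```python
import math

def mahalanobis_distance(point, centroid, sigma):
    """Normalized Euclidean distance of `point` from `centroid`.

    `sigma` holds the per-dimension standard deviations of the cluster
    (assumed strictly positive; a zero entry would divide by zero).
    """
    if not (len(point) == len(centroid) == len(sigma)):
        raise ValueError("point, centroid and sigma must have the same length")
    total = 0.0
    for x, c, s in zip(point, centroid, sigma):
        y = (x - c) / s      # step 1: normalize in each dimension
        total += y * y       # step 2: accumulate the sum of squares
    return math.sqrt(total)  # step 3: take the square root

def accept_point(point, centroid, sigma, num_std=4.0):
    """Return True if the point is within `num_std` standard deviations.

    Assumption: one standard deviation is taken to be sqrt(d), so the
    threshold is num_std * sqrt(d), where d is the number of dimensions.
    """
    d = len(point)
    threshold = num_std * math.sqrt(d)
    return mahalanobis_distance(point, centroid, sigma) < threshold

# Hypothetical 2-dimensional cluster used only for illustration.
centroid = [0.0, 0.0]
sigma = [1.0, 2.0]

# A point one sigma away in each dimension: distance equals sqrt(2).
print(mahalanobis_distance([1.0, 2.0], centroid, sigma))
# A point three sigmas away in each dimension is accepted (below 4 * sqrt(2)).
print(accept_point([3.0, 6.0], centroid, sigma))
# A point ten sigmas away along the first dimension is rejected.
print(accept_point([10.0, 0.0], centroid, sigma))
```

In a full BFR implementation σi would presumably be derived from the per-cluster summary statistics rather than passed in directly; the sketch takes it as an input to stay within what the text above defines.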

This document was uploaded on 02/26/2014 for the course CS 246 at Stanford.
