This preview shows pages 1–9. Sign up to view the full content.
This preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full DocumentThis preview has intentionally blurred sections. Sign up to view the full version.
View Full Document
Unformatted text preview: Statistical Data Mining ORIE 474 Spring 2007 Tatiyana V. Apanasovich 08/29/07 Measurement and Data 2.3 Distance Measures A Similarity measures B Distance and Metric C Covariance Matrix D Multivariate Binary Data E Categorical Data A. (Dis)Similarity Measures Can be obtained Directly from the object Surveys: rate pairs of objects according to their similarity Food tasting: state similarities between flavors Indirectly from vectors of measurements or characteristics describing each object Need to state precisely what similar means Involves calculating formal similarity measures Proximity is often used as a general term to denote a measure of similarity or of dissimilarity B. Distance and Metric Distance Dissimilarity measure derived from the objects characteristics, e.g. Euclidean distance Metric 1. d(i,j)&gt;0 for all i, j &amp; d(i,j)=0 if and only if i=j 2. d(i,j)=d(j,i) for all i,j 3. d(i,j) d(i,k)+d(k,j) for all i,j,k (triangle inequality) Euclidean distance : x(i) = (x 1 (i),x 2 (i),,x p (i)), i=1,,n, 2 / 1 1 2 )) ( ) ( ( ) , (  = = p k k k E j x i x j i d Euclidean Distance (contd) Assumes some degree of commensurability between the different variables Ex: 1 st variable: length 2 nd variable: weight No obvious choice of units Altering the choice of units could change which variable was most important For data sets with noncommensurate variables, one way of approach is to standardize the data by dividing each of the variables by its sample standard deviation Sample Mean and Sample Standard Deviation Standard deviation for the k th variable X k where is the mean of X k If is not known, it can be estimated using the sample mean 2 / 1 1 2 ) ) ( ( 1  = = n i k k k i x n k k = = n i k k i x n x 1 ) ( 1 Weighted Euclidean Distance 2 / 1 1 2 )) ( ) ( ( ) , (  = = p k k k k E j x i x w j i d Ex: Euclidean and weighted Euclidean distances are additive, that is, the variables contribute independently If variables are not independent, we need to take into account the covariances between them 2 1 = k k w C. Covariance Matrix Consider 2 variables X and Y, and assume we have n objects, with X taking values x(1),,x(n) and Y taking values y(1),,y(n)....
View
Full
Document
This note was uploaded on 12/23/2009 for the course ORIE 474 at Cornell.
 '07
 APANASOVICH

Click to edit the document details