class_08_29

# class_08_29 - Statistical Data Mining ORIE 474 Spring 2007...


Statistical Data Mining, ORIE 474, Spring 2007. Tatiyana V. Apanasovich. 08/29/07

## Measurement and Data: 2.3 Distance Measures

- A. Similarity measures
- B. Distance and Metric
- C. Covariance Matrix
- D. Multivariate Binary Data
- E. Categorical Data

### A. (Dis)Similarity Measures

Similarity measures can be obtained:

- Directly from the objects
  - Surveys: rate pairs of objects according to their similarity
  - Food tasting: state similarities between flavors
- Indirectly, from vectors of measurements or characteristics describing each object
  - Need to state precisely what "similar" means
  - Involves calculating formal similarity measures

Proximity is often used as a general term to denote a measure of either similarity or dissimilarity.

### B. Distance and Metric

Distance: a dissimilarity measure derived from the objects' characteristics, e.g. Euclidean distance.

Metric: a dissimilarity measure $d$ that satisfies

1. $d(i,j) \ge 0$ for all $i, j$, and $d(i,j) = 0$ if and only if $i = j$
2. $d(i,j) = d(j,i)$ for all $i, j$
3. $d(i,j) \le d(i,k) + d(k,j)$ for all $i, j, k$ (triangle inequality)

Euclidean distance: for objects $x(i) = (x_1(i), x_2(i), \dots, x_p(i))$, $i = 1, \dots, n$,

$$d_E(i,j) = \left( \sum_{k=1}^{p} \bigl(x_k(i) - x_k(j)\bigr)^2 \right)^{1/2}$$

Euclidean Distance (cont'd)

- Assumes some degree of commensurability between the different variables
- Ex: 1st variable: length; 2nd variable: weight
  - No obvious choice of units
  - Altering the choice of units could change which variable is most important
- For data sets with non-commensurate variables, one approach is to standardize the data by dividing each variable by its sample standard deviation

Sample Mean and Sample Standard Deviation

Standard deviation for the $k$th variable $X_k$:

$$\sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} \bigl(x_k(i) - \mu_k\bigr)^2 \right)^{1/2}$$

where $\mu_k$ is the mean of $X_k$. If $\mu_k$ is not known, it can be estimated using the sample mean

$$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$

Weighted Euclidean Distance

$$d_{WE}(i,j) = \left( \sum_{k=1}^{p} w_k \bigl(x_k(i) - x_k(j)\bigr)^2 \right)^{1/2}$$

Ex: $w_k = 1/\sigma_k^2$

Euclidean and weighted Euclidean distances are additive, that is, the variables contribute independently. If the variables are not independent, we need to take into account the covariances between them.

### C. Covariance Matrix

Consider 2 variables $X$ and $Y$, and assume we have $n$ objects, with $X$ taking values $x(1), \dots, x(n)$ and $Y$ taking values $y(1), \dots, y(n)$....
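The point in section B about units and standardization can be illustrated with a small sketch. The data values and the helper names (`euclidean`, `std_devs`, `inverse_variance_weights`, `nearest_to_first`) are made up for illustration and do not come from the lecture. The sketch assumes the 1/n form of the standard deviation and weights equal to the reciprocal of each variable's variance, which amounts to dividing each variable by its standard deviation.

```python
import math


def euclidean(x, y, weights=None):
    """Weighted Euclidean distance; weights=None gives the plain Euclidean distance."""
    if weights is None:
        weights = [1.0] * len(x)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def std_devs(rows):
    """Per-variable standard deviation using the 1/n convention."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(p)]
    return [
        math.sqrt(sum((row[k] - means[k]) ** 2 for row in rows) / n)
        for k in range(p)
    ]


def inverse_variance_weights(rows):
    """Weights 1 / sigma_k^2, equivalent to standardizing each variable."""
    return [1.0 / s ** 2 for s in std_devs(rows)]


def nearest_to_first(rows, weights=None):
    """Index of the object closest to object 0."""
    dists = [euclidean(rows[0], r, weights) for r in rows[1:]]
    return 1 + dists.index(min(dists))


# Hypothetical objects: (length in metres, weight in kilograms)
data_m = [(1.0, 10.0), (1.2, 10.0), (1.0, 10.5), (1.5, 12.0)]
# The same objects with length expressed in centimetres
data_cm = [(100.0 * length, weight) for length, weight in data_m]

for label, rows in (("metres", data_m), ("centimetres", data_cm)):
    plain = nearest_to_first(rows)
    weighted = nearest_to_first(rows, inverse_variance_weights(rows))
    print(f"{label}: nearest (plain) = {plain}, nearest (standardized) = {weighted}")
```

With these made-up numbers, the plain Euclidean nearest neighbour of object 0 is object 1 when length is in metres and object 2 when length is in centimetres. The standardized distance picks object 2 in both cases, because rescaling a variable rescales its standard deviation by the same factor.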

## This note was uploaded on 12/23/2009 for the course ORIE 474 at Cornell.
