hw2sol - DATA MINING Statistics 202 Autumn 2010 Homework 2...

DATA MINING
Statistics 202, Autumn 2010
Homework 2 Solutions: due Friday, October 8th at 5pm

1. Reading: Read about the Mahalanobis distance, in the book or on the internet. In your own words, explain in no more than 20 lines how the Mahalanobis distance can be useful.

If we compute the Euclidean distance between observations in the ordinary way, the distances are dominated by the variables with high variance. Standardizing gives all the variables the same weight, but it does not deal with the problem of correlation between variables. Consider an extreme case in which we have measured acidity and pH on 78 different wines and the correlation between these two variables is close to -0.9. The two variables are then largely redundant, and in some sense we are giving them twice the weight of the other, uncorrelated variables.

The Mahalanobis distance is better adapted than the usual Euclidean distance to settings involving distributions that are not spherically symmetric. The example with two highly correlated variables gives a scatter of points shaped like a very long ellipse (in this case, because the correlation is negative, with its long axis orthogonal to the first diagonal x = y).

The most important use of the Mahalanobis distance is to calculate the distance between a new observation $x$ and the centre of gravity (or mean) $\mu$ of a group of points whose empirical covariance is $\hat{\Sigma}$. In this case the distance is expressed as

$$D^2(x) = (x - \mu)^t \hat{\Sigma}^{-1} (x - \mu)$$

It can also be useful for computing the distance between the means of two distributions with the same known non-singular covariance $\Sigma$:

$$D^2(\mu_1, \mu_2) = (\mu_2 - \mu_1)^t \Sigma^{-1} (\mu_2 - \mu_1)$$

2. Comparing similarities:

(a) Generate 100 vectors of length 10, iid from the uniform distribution on $(0,1)^{10}$.

> n = 100
> rv = matrix(runif(n * 10), ncol = 10, nrow = n)

(b) Recenter every vector of length 10 to have mean 0 using sweep and apply.
> rvc = sweep(rv, 1, STATS = apply(rv, 1, mean), FUN = "-")

(c) Renormalize each vector to have L2 norm 1 using sweep and apply.
> norm = function(vec) {
+     return(sqrt(sum(vec^2)))
+ }
> rvcn = sweep(rvc, 1, STATS = apply(rvc, 1, norm), FUN = "/")

(d) Compute the cosine similarity between all pairs of the 100 vectors.

> # The rows of rvcn have unit norm, so the cosine similarity reduces to a dot product.
> cos.sim = function(p, q) {
+     return(sum(p * q))
+ }
> # Vector version: one entry per pair i < j, stored in the same order as as.dist().
> cossv = rep(0, n * (n - 1)/2)
> for (i in 1:(n - 1)) {
+     for (j in (i + 1):n) {
+         cossv[n * (i - 1) - i * (i - 1)/2 + j - i] = cos.sim(rvcn[i,
+             ], rvcn[j, ])
+     }
+ }
> # Matrix version: full symmetric n x n matrix, diagonal included.
> cossm = matrix(0, n, n)
> for (i in 1:n) {
+     for (j in i:n) {
+         cossm[i, j] = cos.sim(rvcn[i, ], rvcn[j, ])
+         cossm[j, i] = cossm[i, j]
+     }
+ }

(e) Compute the correlation between all pairs of the 100 vectors of length 10.

> corrm = cor(t(rvcn))
> corr.forvec = as.vector(as.dist(corrm))

(f) Show these two similarities against each other in a scatter plot.

> library(ggplot2)
> # Note: diag(corrm) is appended to both vectors; both diagonals equal 1 up to
> # rounding, so the plot is unaffected.
> print(qplot(c(as.vector(as.dist(corrm, diag = TRUE)), diag(corrm)),
+     c(as.vector(as.dist(cossm, diag = TRUE)), diag(corrm)), alpha = I(1/100),
+     main = "Matrix Method Comparison of Cos.sim
+ and Correlation (with diagonal)"))
[Figure: scatter plot titled "Matrix Method Comparison of Cos.sim and Correlation (with diagonal)". x-axis: c(as.vector(as.dist(corrm, diag = TRUE)), diag(corrm)); y-axis: c(as.vector(as.dist(cossm, diag = TRUE)), diag(corrm)); both axes run from -0.5 to 1.0.]

> print(qplot(corr.forvec, cossv, alpha = I(1/100), main = "Vector Method for
+ computing Cos.sim and correlation (without diagonal)"))
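
Both scatter plots compare the same two quantities, so the points should lie on the diagonal: once each vector is centered and scaled to unit L2 norm, its dot product with another such vector is exactly their Pearson correlation. The following is a minimal base-R sketch of a numerical check. The seed, the use of rowMeans, rowSums and tcrossprod in place of the explicit loops, and the names cossm_fast and max_abs_diff are assumptions made for illustration and are not part of the original solution.

```r
# Minimal sketch in base R; the seed and the names cossm_fast and
# max_abs_diff are illustrative assumptions, not part of the original solution.
set.seed(202)
n <- 100
rv <- matrix(runif(n * 10), nrow = n, ncol = 10)

# Center each row to mean 0, then scale each row to unit L2 norm.
rvc <- sweep(rv, 1, STATS = rowMeans(rv), FUN = "-")
rvcn <- sweep(rvc, 1, STATS = sqrt(rowSums(rvc^2)), FUN = "/")

# For unit-norm rows, all pairwise dot products come from one matrix product.
cossm_fast <- tcrossprod(rvcn)

# Pearson correlation between rows (cor() works on columns, hence t()).
corrm <- cor(t(rvcn))

# Largest absolute disagreement between the two similarity matrices.
max_abs_diff <- max(abs(cossm_fast - corrm))
print(max_abs_diff)
```

The printed difference should be at the level of floating-point rounding. The matrix product is only a shortcut for the double loop over cos.sim; it relies on the rows already having unit norm.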
[Figure: scatter plot titled "Vector Method for computing Cos.sim and correlation (without diagonal)".]
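
To illustrate the formula for D^2(x) from question 1, the sketch below simulates two variables with correlation near -0.9 and measures two new observations that sit at the same Euclidean distance from the mean, one along the long axis of the ellipse and one across it. The simulated data, the seed, and all variable names are assumptions made for illustration. The only library call is mahalanobis() from R's stats package, which returns the squared distance.

```r
# Minimal sketch; the simulated data, seed, and variable names are
# illustrative assumptions, not taken from the homework.
set.seed(1)
m <- 500
x1 <- rnorm(m)
x2 <- -0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(m)  # correlation near -0.9
X <- cbind(x1, x2)

mu <- colMeans(X)      # centre of gravity of the group
Sigma_hat <- cov(X)    # empirical covariance

# Two new observations at the same Euclidean distance from the mean:
# one along the long axis of the ellipse, one across it.
new_obs <- rbind(along = mu + c(1, -1), across = mu + c(1, 1))

# Euclidean distance of each new observation to the mean.
euclid <- sqrt(rowSums(sweep(new_obs, 2, mu, "-")^2))

# stats::mahalanobis() returns the squared distance D^2.
d2 <- mahalanobis(new_obs, center = mu, cov = Sigma_hat)

# The same quantity written out from the formula, for the first point.
d <- new_obs["along", ] - mu
d2_manual <- drop(t(d) %*% solve(Sigma_hat) %*% d)

print(euclid)
print(d2)
```

Both points are equally far from the mean in the Euclidean sense, but the point lying across the ellipse gets a much larger Mahalanobis distance, which reflects how unusual it is given the correlation between the two variables.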