hw1sol - Homework 1 Solutions Statistics 202 Autumn, 2010...


Homework 1: due Thursday, September 30th at 5pm

1. Reading: Read appendix B.1 of the book on PCA and SVD. Alternatively, you can read the Wikipedia entry on singular value decomposition. In your own words, explain in no more than 20 lines what the singular value decomposition is useful for.

The SVD is an extremely useful matrix decomposition that allows any N × p data matrix X (with N ≥ p) to be written as X = U Σ V^T, where U is an N × p matrix with orthonormal columns (the left singular vectors), Σ is a p × p diagonal matrix with the singular values of X (σ_1 ≥ ... ≥ σ_p ≥ 0) on the diagonal, and V is a p × p orthogonal matrix whose columns are the right singular vectors.

The SVD forms the basis of principal component analysis, in which we have N measurements on p variables and seek a small set of uncorrelated variables that are linear combinations of the original variables and explain most of the variation in the original data. The idea is to exploit linear structure in the original variables to achieve dimensionality reduction. If X is centered (column means are zero), then the columns of V are the principal component directions (loadings) of X, and the columns of XV = UΣ are the principal components themselves.

The rank q SVD of X (set all but the first q singular values σ_i equal to 0) is also the best rank q approximation of X in the Frobenius norm sense (imagine writing a matrix out as a vector and taking the ℓ2-norm). This is the justification for using the SVD in image compression: instead of sending the original (large) image matrix X, pick a q small enough to give good compression but large enough that the rank q SVD of X still looks essentially like the original image.

2. Implementation:

(a) Create a working directory called Stats202.

(b) Copy from the coursework Data folder the data set health.csv to that directory.
(c) Start R, set your working directory to be Stats202 using the command setwd(DDDDD), where DDDDD is the directory address on your system (Windows and Mac have different addressing conventions).

(d) Read in this csv file using health=read.csv("health.csv").

    health = read.csv("health.csv")

(e) Use the command head to see the first few records. (The abbreviations aeh, pih and ugh stand for the United Arab Emirates, the Philippines and Uganda.)

    > head(health)
      country age  sex height weight    hungry fruit vegetables teeth hands_eating
    1     aeh 16+ Male   1.70     58 Sometimes    <1          1     0       Always
    2     aeh 16+ Male   1.72     85     Never    <1         <1     2       Always
    3     aeh 16+ Male     NA     NA     Never    <1          1     3       Always
    4     aeh 16+ Male   1.65     51     Never    <1          2     1       Always
    5     aeh  15 Male   1.72     52     Never    <1         <1    <1       Always
    6     aeh 16+ Male   1.71     88    Rarely    <1         <1    <1       Always
          hands_toilet       hands_soap sample_weight      bmi
    1           Always        Sometimes       30.3289 20.06920
    2 Most of the time           Always       30.3289 28.73175
    3           Always           Always       30.3289       NA
    4           Always           Always       30.3289 18.73278
    5           Always Most of the time       30.3289 17.57707
    6           Always           Always       30.3289 30.09473

(f) Use the function table() to see how many observations came from each of the three countries.

    > table(health$country)

      aeh   pih   ugh
    15790  5657  3215

(g) Use the function summary() to look at the attributes/variables. What type is each of the variables?

    $ country      : Nominal
    $ age          : Ordinal
    $ sex          : Nominal
    $ height       : Continuous
    $ weight       : Continuous or Ordinal
    $ hungry       : Ordinal
    $ fruit        : Ordinal
    $ vegetables   : Ordinal
    $ teeth        : Ordinal
    $ hands_eating : Ordinal
    $ hands_toilet : Ordinal
    $ hands_soap   : Ordinal
    $ sample_weight: Continuous
    $ bmi          : Continuous

(h) Make a random subset of size 1000 of the data, call it extr1 (Hint: Use the function sample).

    extr1 = health[sample(1:nrow(health), 1000),]

(i) Make side by side boxplots for the variable height. Do the same for the variable weight.

    boxplot(height ~ country, data = extr1)

alternatively,

    library(ggplot2)
    qplot(x = country, y = height, data = extr1, geom = "boxplot")

[Figure: side-by-side boxplots of height (axis range about 1.3 to 1.9) by country (aeh, pih, ugh).]

    boxplot(weight ~ country, data = extr1)

alternatively,

    qplot(x = country, y = weight, data = extr1, geom = "boxplot")

[Figure: side-by-side boxplots of weight (axis range about 40 to 140) by country (aeh, pih, ugh).]

(j) Download and install the package ggplot2.

    install.packages("ggplot2")

(k) Load the package with library(ggplot2).

    library(ggplot2)

(l) What more can you say about these variables looking at this plot: qplot(weight, height, data = extr1,colour=country)

    qplot(weight, height, data = extr1, colour = country)

[Figure: scatterplot of height (about 1.3 to 1.9) against weight (about 40 to 140), points coloured by country (aeh, pih, ugh).]

We have to be careful because in this sample of size 1000, aeh had 639 observations, so the three countries are represented very unequally. It appears that the shortest and the heaviest individuals come from pih.
The tallest individual (by a fair margin) was from aeh. Regardless of country, height and weight are positively correlated. It might be the case that individuals from ugh have the smallest height and weight variance. ...
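
The SVD claims in Problem 1 can be checked numerically. The following is a minimal sketch in base R, assuming a small simulated matrix rather than health.csv (the names N, p, q and the toy data are illustrative and not part of the original solution). It reconstructs X from its SVD, verifies that the Frobenius error of the rank q truncation equals the square root of the sum of the squared discarded singular values, and compares the columns of V with the loadings returned by prcomp(), which agree up to sign.

```r
# Hypothetical toy data: 50 observations on 4 variables (simulated, not from health.csv)
set.seed(1)
N <- 50
p <- 4
X <- matrix(rnorm(N * p), nrow = N, ncol = p)
X <- scale(X, center = TRUE, scale = FALSE)  # center the columns, as PCA assumes

s <- svd(X)  # s$u is N x p, s$d holds the p singular values, s$v is p x p

# Full reconstruction: X = U Sigma V^T
X_hat <- s$u %*% diag(s$d) %*% t(s$v)
recon_err <- max(abs(X - X_hat))

# Rank q approximation: keep only the first q singular values and vectors
q <- 2
X_q <- s$u[, 1:q, drop = FALSE] %*%
  diag(s$d[1:q], nrow = q) %*%
  t(s$v[, 1:q, drop = FALSE])

# Frobenius error of the truncation vs. norm of the discarded singular values
frob_err  <- sqrt(sum((X - X_q)^2))
tail_norm <- sqrt(sum(s$d[(q + 1):p]^2))

# PCA on the already-centered matrix; loadings and scores match V and U Sigma up to sign
pc <- prcomp(X, center = FALSE)
rot_diff   <- max(abs(abs(pc$rotation) - abs(s$v)))
score_diff <- max(abs(abs(pc$x) - abs(s$u %*% diag(s$d))))

print(c(recon_err = recon_err, frob_err = frob_err, tail_norm = tail_norm,
        rot_diff = rot_diff, score_diff = score_diff))
```

Absolute values are compared in the last step because singular vectors are only determined up to a sign flip. The same steps could be applied to the numeric columns of extr1 after removing rows with missing values.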

This note was uploaded on 07/29/2011 for the course STAT 202 at Stanford.
