Least Squares and Principal Component Analysis


Supervised learning (recap): the goal is to infer the properties of the conditional density Pr(Y | X).

Unsupervised Learning

We have a set of N observations (x_1, ..., x_N) of a random p-vector X having joint density Pr(X). The goal is to directly infer the properties of this probability density without the help of a supervisor or teacher providing correct answers or a degree of error for each observation.

High dimension and complicated properties of interest bring the curse of dimensionality: we must settle for estimating rather crude global models, such as Gaussian mixtures, or various simple descriptive statistics that characterize Pr(X).

PCA attempts to identify low-dimensional linear subspaces within the X-space that represent high data density. Factor analysis (FA) aims to find the hidden common structure in the variation of X.

[Figure: a scatterplot of X1 versus X2 marking the largest and smallest principal component directions; and Figure 14.21, "The best rank-two linear approximation to the half-sphere data," labeling a point x_i, the coordinate u_{i1} d_1, and the direction v_1.]

Principal Component Analysis

Find a direction along which the data has the largest variation:

    max_{||v|| = 1} {sample variance of X v}.

Find the best low-dimensional linear approximation to the data: consider the rank-q linear model for representing the p-dimensional data x_1, ..., x_N,

    f(η) = μ + V_q η,

where μ ∈ R^p is a location vector, V_q is a p × q orthogonal matrix, and η ∈ R^q is a vector of parameters. Fitting such a model to the data by least squares amounts to minimizing the reconstruction error

    min_{μ, {η_i}, V_q}  Σ_{i=1}^N || x_i − μ − V_q η_i ||².
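The variance-maximization view above can be sketched numerically: the unit vector v maximizing the sample variance of X v is the top eigenvector of X^T X for centered X. A minimal sketch (synthetic data; all names are illustrative, not from the notes), assuming NumPy:

```python
import numpy as np

# Find the first principal direction v1 = argmax_{||v||=1} var(Xv)
# as the top eigenvector of X^T X (X centered). Synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic cloud
X = X - X.mean(axis=0)                                    # Step 1: center

eigvals, eigvecs = np.linalg.eigh(X.T @ X)                # ascending order
v1 = eigvecs[:, -1]                                       # top eigenvector

# v1 attains at least the sample variance of any random unit direction
var_v1 = np.var(X @ v1)
for _ in range(100):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.var(X @ v) <= var_v1 + 1e-12
```

The loop is a sanity check, not a proof: var(X v) = v^T (X^T X / N) v is bounded by the largest eigenvalue of X^T X / N, attained at v1.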
PCA: Preprocessing

1. Compute the sample mean x̄ = (1/N) Σ_{i=1}^N x_i, then subtract it from each observation: x_i := x_i − x̄.
2. Optional (preferred when the features are on different scales): normalize each feature. Compute ŝ_j with ŝ_j² = Σ_{i=1}^N x_ij², then set x_ij := x_ij / ŝ_j for all 1 ≤ i ≤ N, 1 ≤ j ≤ p.

From now on we always assume Step 1 has been done. The two problems become:

Direction of maximum sample variance:

    max_{||v_1|| = 1} v_1^T X^T X v_1.

Best linear approximation:

    min_{{η_i}, v_1}  Σ_{i=1}^N || x_i − v_1 η_i ||².

Singular Value Decomposition

Orthogonal matrix: let A be an n × m matrix with n ≥ m, and denote its m columns by a_1, ..., a_m. We say A is orthogonal if its columns are orthonormal, i.e.

    a_i^T a_j = 1 if i = j, and 0 if i ≠ j.

The singular value decomposition (SVD) of the N × p matrix X (assume N ≥ p) has the form

    X = U D V^T.

U and V are N × p and p × p orthogonal matrices. The columns of V, denoted v_1, ..., v_p, span the row space of X; the columns of U, denoted u_1, ..., u_p, span the column space of X. D is a p × p diagonal matrix with diagonal entries d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0, called the singular values of X.

v_1, ..., v_p are eigenvectors of X^T X, corresponding to the eigenvalues d_1² ≥ d_2² ≥ ... ≥ d_p².

u_1, ..., u_p are eigenvectors of X X^T, corresponding to the eigenvalues d_1² ≥ d_2² ≥ ... ≥ d_p²; the remaining eigenvalues of X X^T are all zero.

PCA: Solution

With V_q fixed, we must have η̂_i = V_q^T x_i, and the problem reduces to

    min_{V_q}  Σ_{i=1}^N || x_i − V_q V_q^T x_i ||².

Let X be the N × p matrix whose rows are x_1^T, ..., x_N^T, and compute its singular value decomposition X = U D V^T. For each 1 ≤ q ≤ p, the solution V̂_q consists of the first q columns of V.

v_m: the m-th principal direction.
z_m = X v_m = d_m u_m: the m-th principal component.
Entries of v_m: the loadings of the m-th principal component.

Applications

Visualization: clearer patterns in lower dimension.
Compression: remove redundancy.
Computation.
Anomaly detection.
Face recognition and matching.
Microarray analysis.
Web link analysis.

[Figure: scree plots of the principal component variances, Comp.1 through Comp.9, for asset excess returns.]
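The SVD-based solution and the quantities behind a scree plot can be combined in a short sketch (synthetic data, not the asset-return series from the slides; all names are illustrative), assuming NumPy:

```python
import numpy as np

# PCA via the SVD X = U D V^T of the centered data matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)                            # Step 1: center

U, d, Vt = np.linalg.svd(X, full_matrices=False)  # d_1 >= ... >= d_p >= 0

# m-th principal component: z_m = X v_m = d_m u_m
Z = X @ Vt.T
assert np.allclose(Z, U * d)                      # columns are d_m u_m

# Best rank-q least-squares reconstruction keeps the first q columns of V
q = 2
Vq = Vt[:q].T                                     # p x q
X_hat = (X @ Vq) @ Vq.T                           # V_q V_q^T x_i for each row
# the residual sum of squares equals the discarded squared singular values
assert np.isclose(np.sum((X - X_hat) ** 2), np.sum(d[q:] ** 2))

# Proportion of variance explained by each component; a scree plot
# displays these values (or the raw variances d_m^2) against m
var_explained = d ** 2 / np.sum(d ** 2)
```

The last assertion illustrates why truncating the SVD solves the reconstruction-error problem: the error of the rank-q fit is exactly Σ_{m>q} d_m².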

This note was uploaded on 10/01/2013 for the course FSRM 588 taught by Professor Xiao during the Fall '13 term at Rutgers.
