lecture_04

Financial Data Mining FSRM588
Lecture 04: Penalized Least Squares & Principal Component Analysis
Department of Statistics & Biostatistics, Rutgers University
October 01, 2013

Outline
1. Penalized Least Squares
2. Principal Component Analysis

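Lasso

Lasso solves the following optimization problem,

$$\hat{\beta}^{\text{lasso}} = \arg\max_{\beta \in \mathbb{R}^{p+1}} \left\{ -\frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 - \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

for some $\lambda \ge 0$.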

Penalized Least Squares

We can use other penalties on the parameters $\beta_j$ as well, and consider the following general penalized least squares problem

$$\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^{p+1}} \left\{ -\frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 - \sum_{j=1}^{p} p_\lambda(|\beta_j|) \right\},$$

where $p_\lambda(\cdot)$ is the penalty function.

Best subset selection: $p_\lambda(t) = (\lambda^2/2)\, I(t \neq 0)$.
Lasso: $p_\lambda(t) = \lambda |t|$.
Hard thresholding: $p_\lambda(t) = \frac{1}{2}\left[\lambda^2 - (\lambda - t)_+^2\right]$.
Elastic net: $p_\lambda(t) = \lambda\left[a t^2 + (1 - a)|t|\right]$ with $0 \le a \le 1$.
Bridge regression: $p_\lambda(t) = \lambda |t|^q$ for some $0 < q \le 2$.
Ridge regression: $p_\lambda(t) = \lambda |t|^2$.
SCAD: to be introduced.
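The penalties listed above can be written down directly as functions of $t$ and $\lambda$. The following is a minimal sketch in plain Python, not code from the lecture; the function names are hypothetical, and each function takes `abs(t)` so that it can be called on a signed coefficient even though the penalty is defined on $|\beta_j|$. SCAD is omitted because it has not yet been defined.

```python
def best_subset(t, lam):
    """(lam^2 / 2) * I(t != 0)."""
    return (lam ** 2 / 2.0) * (1.0 if t != 0 else 0.0)


def lasso(t, lam):
    """lam * |t|."""
    return lam * abs(t)


def hard_threshold(t, lam):
    """(1/2) * [lam^2 - (lam - |t|)_+^2]; flat at lam^2 / 2 once |t| >= lam."""
    pos = max(lam - abs(t), 0.0)
    return 0.5 * (lam ** 2 - pos ** 2)


def elastic_net(t, lam, a):
    """lam * [a * t^2 + (1 - a) * |t|], with 0 <= a <= 1."""
    return lam * (a * t ** 2 + (1.0 - a) * abs(t))


def bridge(t, lam, q):
    """lam * |t|^q, with 0 < q <= 2."""
    return lam * abs(t) ** q


def ridge(t, lam):
    """lam * |t|^2."""
    return lam * t ** 2
```

Bridge regression contains two of the others as special cases: `bridge(t, lam, 1)` agrees with `lasso(t, lam)` and `bridge(t, lam, 2)` agrees with `ridge(t, lam)`.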
L1 and SCAD

[Figure: two panels over $0 \le t \le 6$. Panel "Penalty Functions" plots the penalty against $t$; panel "Derivatives" plots the derivative against $t$. Curves shown in each panel: L1, SCAD, Hard.]

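Centering

Usually the intercept is not penalized, so we center $y$ and the $x_j$ first and consider the following optimization problem

$$\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^{p}} \left\{ -\frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 - \sum_{j=1}^{p} p_\lambda(|\beta_j|) \right\},$$

with the implicit assumption that $\mathbf{1}^\top x_j = 0$ for $1 \le j \le p$, and $\mathbf{1}^\top y = 0$.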
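As a brief check of why centering lets us drop the intercept (a standard argument, sketched here rather than taken from the slides): the penalty does not involve $\beta_0$, so for any fixed $\beta_1, \dots, \beta_p$ the optimal $\beta_0$ is determined by the first-order condition of the squared-error term alone. Here $\bar{y}$ and $\bar{x}_j$ denote sample means, notation introduced only for this sketch.

```latex
\frac{\partial}{\partial \beta_0}
\left[ -\frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \right]
= \frac{1}{N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big) = 0
\quad\Longrightarrow\quad
\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{p} \bar{x}_j \beta_j .
```

With centered data, $\bar{y} = 0$ and $\bar{x}_j = 0$ for every $j$, so $\hat{\beta}_0 = 0$ whatever the other coefficients are, and the intercept can be removed from the problem.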
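Canonical Model

Consider the special case in which the columns of the scaled input matrix $N^{-1/2} X$ are orthonormal, i.e. $X^\top X / N = I$. Let

$$\hat{\beta}^{\text{ls}}_j = N^{-1} x_j^\top y, \qquad \hat{y}^{\text{ls}} = X \hat{\beta}^{\text{ls}} = N^{-1} X X^\top y .$$

The penalized least squares problem can then be rewritten as

$$\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^{p}} \left\{ -\frac{1}{2N} \big\| y - \hat{y}^{\text{ls}} \big\|^2 + \sum_{j=1}^{p} \left[ -\frac{1}{2} \big( \hat{\beta}^{\text{ls}}_j - \beta_j \big)^2 - p_\lambda(|\beta_j|) \right] \right\}.$$

The first term does not depend on $\beta$, so the problem separates into one maximization for each $\beta_j$:

$$\hat{\beta}_j = \arg\max_{\beta_j \in \mathbb{R}} \left\{ -\frac{1}{2} \big( \hat{\beta}^{\text{ls}}_j - \beta_j \big)^2 - p_\lambda(|\beta_j|) \right\}. \qquad (1)$$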
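The decomposition behind the canonical model can be checked numerically on a tiny example. The sketch below is illustrative only: the $4 \times 2$ design, the response `y`, and the trial coefficient vector `beta` are made-up values chosen so that $X^\top X / N = I$ and both $y$ and the columns of $X$ are centered. It compares $\frac{1}{2N}\|y - X\beta\|^2$ with $\frac{1}{2N}\|y - \hat{y}^{\text{ls}}\|^2 + \frac{1}{2}\sum_j (\hat{\beta}^{\text{ls}}_j - \beta_j)^2$.

```python
# Toy design with N = 4, p = 2: columns are centered, orthogonal, and have
# squared norm N, so X'X / N = I (the canonical model). Values are made up.
X = [[1.0, 1.0],
     [1.0, -1.0],
     [-1.0, 1.0],
     [-1.0, -1.0]]
y = [2.0, 0.5, -1.0, -1.5]  # sums to zero, i.e. already centered


def beta_ls(X, y):
    """Least squares coefficients under X'X / N = I: beta_j = x_j'y / N."""
    n = len(y)
    return [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(len(X[0]))]


def fitted(X, beta):
    """Fitted values X @ beta, computed row by row."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


def half_mse(X, y, beta):
    """(1 / 2N) * ||y - X beta||^2."""
    n = len(y)
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, fitted(X, beta))) / (2.0 * n)


b_ls = beta_ls(X, y)
beta = [0.3, -0.7]  # arbitrary trial coefficients

lhs = half_mse(X, y, beta)
rhs = half_mse(X, y, b_ls) + 0.5 * sum((bl - b) ** 2 for bl, b in zip(b_ls, beta))
print(b_ls, lhs, rhs)
```

The two sides agree up to floating-point rounding for any choice of `beta`, which is why the penalized problem separates coordinate by coordinate.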

Desired Properties

In the canonical model, the lasso estimates are

$$\hat{\beta}^{\text{lasso}}_j = \operatorname{sign}\big(\hat{\beta}^{\text{ls}}_j\big) \big( |\hat{\beta}^{\text{ls}}_j| - \lambda \big)_+ .$$

Consider a simplified case. Assume we have $N$ training points $(x_1, y_1), \dots, (x_N, y_N)$ from the model

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i,$$

with $\beta_1 > 0$ and $\beta_2 = 0$, where the $\epsilon_i$ are i.i.d. with mean zero and variance $\sigma^2$.

Desired properties:
1. Sparsity. As $N$ goes to infinity, $\beta_2$ is estimated as zero with probability approaching one.
2. Unbiasedness. $\beta_1$ can be estimated with small bias. More specifically, we would like the estimate $\hat{\beta}_1$ to satisfy $\sqrt{N}\,(\hat{\beta}_1 - \beta_1) \Rightarrow N(0, \sigma^2)$.
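The soft-thresholding form of the lasso estimate can be checked against a brute-force solution of the scalar problem (1) with $p_\lambda(t) = \lambda |t|$. This is a minimal sketch, not the lecture's code: `soft_threshold` and `grid_argmin` are hypothetical names, and the grid range and resolution are arbitrary choices for illustration.

```python
def soft_threshold(b_ls, lam):
    """sign(b_ls) * (|b_ls| - lam)_+ : shrink toward zero by lam, then clip."""
    mag = max(abs(b_ls) - lam, 0.0)
    return mag if b_ls >= 0 else -mag


def grid_argmin(b_ls, lam, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force minimizer of 0.5 * (b_ls - beta)^2 + lam * |beta| on a grid.

    Minimizing this is the same as maximizing its negative, as in (1).
    """
    best_beta, best_val = lo, float("inf")
    for k in range(steps):
        beta = lo + (hi - lo) * k / (steps - 1)
        val = 0.5 * (b_ls - beta) ** 2 + lam * abs(beta)
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta


for b in (1.25, 0.3, -2.0):
    print(b, soft_threshold(b, 0.5), grid_argmin(b, 0.5))
```

Least squares estimates smaller than $\lambda$ in absolute value are set exactly to zero, and larger ones are moved toward zero by exactly $\lambda$. That constant shift is the source of the bias examined next.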
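Lasso Fails

Rewrite $\lambda_N = N^{-1/2} \eta_N$. From the identity

$$P\big( \hat{\beta}^{\text{lasso}}_2 = 0 \big) = P\big( |\hat{\beta}^{\text{ls}}_2| \le N^{-1/2} \eta_N \big) = P\big( |\sqrt{N}\, \hat{\beta}^{\text{ls}}_2| \le \eta_N \big),$$

we see that for this probability to approach one, the sequence $\eta_N$ must diverge as $N$ grows.

On the other hand, since $\beta_1 > 0$, for large $N$ we have $\hat{\beta}^{\text{ls}}_1 > \lambda_N$ with high probability, so that

$$\sqrt{N}\,\big( \hat{\beta}^{\text{lasso}}_1 - \beta_1 \big) \approx \sqrt{N}\,\big( \hat{\beta}^{\text{ls}}_1 - \lambda_N - \beta_1 \big) = \sqrt{N}\,\big( \hat{\beta}^{\text{ls}}_1 - \beta_1 \big) - \eta_N ,$$

from which we find that for property (2) to hold, the sequence $\eta_N$ must converge to zero.

This simple example tells us that Lasso cannot achieve both properties (1) and (2) at the same time.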
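A small Monte Carlo experiment makes the conflict concrete. The sketch below rests on assumptions beyond the slides: the errors are taken to be Gaussian and the design canonical, so that $\hat{\beta}^{\text{ls}}_1$ and $\hat{\beta}^{\text{ls}}_2$ can be drawn directly as independent normals with variance $\sigma^2 / N$. The function `simulate` and all parameter values are hypothetical. For a given $\eta$, it reports the fraction of replications with $\hat{\beta}^{\text{lasso}}_2 = 0$ and the average of $\sqrt{N}(\hat{\beta}^{\text{lasso}}_1 - \beta_1)$.

```python
import math
import random


def soft_threshold(b_ls, lam):
    """sign(b_ls) * (|b_ls| - lam)_+ ."""
    mag = max(abs(b_ls) - lam, 0.0)
    return math.copysign(mag, b_ls)


def simulate(eta, n=10_000, beta1=1.0, sigma=1.0, reps=2000, seed=0):
    """Return (fraction of beta2 estimates equal to 0, mean scaled error of beta1).

    Assumes Gaussian errors and a canonical design, so each least squares
    coefficient is drawn as N(beta_j, sigma^2 / n), independently.
    """
    rng = random.Random(seed)
    root_n = math.sqrt(n)
    lam = eta / root_n  # lambda_N = N^{-1/2} * eta_N
    zeros = 0
    scaled_err = 0.0
    for _ in range(reps):
        b1_ls = beta1 + rng.gauss(0.0, sigma) / root_n
        b2_ls = rng.gauss(0.0, sigma) / root_n  # true beta2 is 0
        b1 = soft_threshold(b1_ls, lam)
        b2 = soft_threshold(b2_ls, lam)
        zeros += (b2 == 0.0)
        scaled_err += root_n * (b1 - beta1)
    return zeros / reps, scaled_err / reps


for eta in (0.1, 5.0):
    zero_frac, scaled_bias = simulate(eta)
    print(f"eta={eta}: P(beta2_hat = 0) ~ {zero_frac:.3f}, "
          f"mean sqrt(N)*(beta1_hat - beta1) ~ {scaled_bias:.3f}")
```

With a small $\eta$ the scaled bias stays near $-\eta \approx 0$ but $\hat{\beta}_2$ is rarely set to zero; with a large $\eta$ the estimate of $\beta_2$ is almost always zero but the scaled error of $\hat{\beta}_1$ is centered near $-\eta$ rather than zero.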

