lecture_04

On one hand, we must let the sequence √N·λ_N diverge as N grows; on the other hand, asymptotic normality requires that same sequence to converge to zero.

This simple example tells us that the Lasso cannot achieve both (1) and (2) at the same time.

SCAD

The smoothly clipped absolute deviation (SCAD) penalty has the derivative

  p'_λ(t) = λ { I(t ≤ λ) + (aλ − t)_+ / ((a − 1)λ) · I(t > λ) },  t > 0,

with p_λ(0) = 0. Often, a = 3.7 is used. The solution of (1) is given componentwise by

  β̂_j^scad = sign(β̂_j^ls)(|β̂_j^ls| − λ)_+,               when |β̂_j^ls| ≤ 2λ;
             {(a − 1)β̂_j^ls − sign(β̂_j^ls)aλ} / (a − 2),   when 2λ < |β̂_j^ls| ≤ aλ;
             β̂_j^ls,                                       when |β̂_j^ls| > aλ.

[Figure: the SCAD penalty function and its derivative, each plotted against t.]

SCAD Works

Have N training points (x_1, y_1), ..., (x_N, y_N) generated from

  y_i = x_iᵀβ + ε_i.  (2)

- (x_i, ε_i), 1 ≤ i ≤ N, are i.i.d.
- x_i have mean zero and covariance matrix Σ.
- ε_i have mean zero and variance σ².
- x_i and ε_i are independent.
- The joint distribution of (x_i, ε_i) satisfies some regularity conditions.

The true parameter vector β = (β_(1)ᵀ, β_(2)ᵀ)ᵀ consists of a sub-vector β_(1) whose components are all nonzero, and another sub-vector β_(2) = 0.

Theorem. If λ_N → 0 and √N·λ_N → ∞ as N → ∞, then
(1) with probability tending to one, β̂_(2)^scad = 0;
(2) asymptotic normality holds for β̂_(1)^scad:

  √N (β̂_(1)^scad − β_(1)) ⇒ N(0, σ² Σ_(1)^{-1}),

where Σ_(1) is the block of Σ corresponding to β_(1).

Oracle Property

Even if we knew that β_(2) = 0 ahead of time and performed least squares only for β_(1), the asymptotic distribution of that estimate would be the same as that of β̂_(1)^scad given in (2) of the theorem. So the message is that although we do not know the truth, the estimator performs as well as if we knew the truth. This is referred to as the oracle property in the literature.

The penalized least squares with the SCAD penalty is non-concave, so the optimization is in general challenging. It can, however, be handled by a unified algorithm based on the local linear approximation (LLA), which applies to maximizing the penalized likelihood for a broad class of concave penalty functions.

Local Linear Approximation

Suppose we have some initial estimate β̂^(0); for example, we may take β̂^(0) = β̂^ls. The penalty function p_λ(|β_j|) can be locally approximated around β̂^(0) by a linear function,

  p_λ(|β_j|) ≈ p_λ(|β̂_j^(0)|) + p'_λ(|β̂_j^(0)|)(|β_j| − |β̂_j^(0)|),  for β_j ≈ β̂_j^(0).

With this approximation, the penalized least squares becomes

  β̂^(1) = arg max_{β ∈ R^p} { −(1/(2N)) Σ_{i=1}^N (y_i − Σ_{j=1}^p x_ij β_j)² − Σ_{j=1}^p p'_λ(|β̂_j^(0)|) |β_j| },

which can be solved using the Lasso (a code sketch of this step, together with the SCAD thresholding rule, is given at the end of these notes).

[Figure: the local linear approximation to the penalty, plotted against t, and the solution paths of the SCAD, LLA, and Lasso estimates against λ and z.]

Local Linear Approximation Algorithm

Let I_1 = {j : p'_λ(|β̂_j^(0)|) = 0} and I_2 = {j : p'_λ(|β̂_j^(0)|) ≠ 0}. Let X_1 := {x_j : j ∈ I_1} be the sub-matrix of X, and define X_2 similarly. Let P_1 be the projection matrix onto the column space of X_1. LLA Alg...
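To make the componentwise SCAD solution concrete, here is a minimal sketch in Python/NumPy; the language, function names, and defaults are illustrative choices of mine (the lecture does not prescribe an implementation), and it assumes the setting in which the closed-form thresholding formula above applies componentwise to the least-squares estimate β̂^ls.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0, with a = 3.7 by default."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def scad_threshold(beta_ls, lam, a=3.7):
    """Componentwise SCAD solution computed from the least-squares estimate beta_ls."""
    b = np.asarray(beta_ls, dtype=float)
    absb = np.abs(b)
    out = np.empty_like(b)
    small = absb <= 2 * lam                      # soft-thresholding region
    mid = (absb > 2 * lam) & (absb <= a * lam)   # interpolation region
    large = absb > a * lam                       # no shrinkage for large coefficients
    out[small] = np.sign(b[small]) * np.maximum(absb[small] - lam, 0.0)
    out[mid] = ((a - 1) * b[mid] - np.sign(b[mid]) * a * lam) / (a - 2)
    out[large] = b[large]
    return out
```

For |β̂_j^ls| ≤ 2λ this reduces to the Lasso's soft-thresholding, while for |β̂_j^ls| > aλ the estimate is left unshrunk; this unbiasedness for large coefficients is exactly what the Lasso lacks.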
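The notes state that each LLA step "can be solved using the Lasso". The following self-contained sketch shows one such iteration, assuming scikit-learn's Lasso solver (whose objective uses the same 1/(2N) scaling as the display above). The column-rescaling trick for turning the weighted Lasso into a plain one, the small floor eps on exactly-zero weights, and the simulated data at the bottom are illustrative choices of mine, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lla_step(X, y, beta_prev, lam, a=3.7, eps=1e-3):
    """One LLA iteration: a weighted Lasso with weights w_j = p'_lambda(|beta_prev_j|)."""
    t = np.abs(beta_prev)
    # SCAD derivative p'_lambda(t), as in the formula above.
    w = lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))
    # w_j = 0 means coordinate j is unpenalized; a small floor keeps the rescaling finite.
    w = np.maximum(w, eps)
    # Rescaling trick: a plain Lasso on X / w, with the fit mapped back by / w,
    # penalizes sum_j w_j * |beta_j| (scikit-learn's Lasso already includes the 1/(2N) factor).
    fit = Lasso(alpha=1.0, fit_intercept=False, max_iter=10000).fit(X / w, y)
    return fit.coef_ / w

# Illustration with simulated data (dimensions and lambda are arbitrary choices).
rng = np.random.default_rng(0)
N, p = 200, 10
X = rng.standard_normal((N, p))
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(N)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initial estimate: beta_hat^(0) = beta_hat^ls
for _ in range(3):                            # a few LLA iterations
    beta = lla_step(X, y, beta, lam=0.1)
```

Iterating lla_step recomputes the weights p'_λ(|β̂_j|) at the current estimate, so coefficients that are already large become essentially unpenalized, mimicking the SCAD solution.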