converge to zero. This simple example tells us that Lasso cannot achieve
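both (1) and (2) at the same time.

SCAD
The smoothly clipped absolute deviation (SCAD) has the derivative
$$ p'_\lambda(t) = \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda}\, I(t > \lambda) \right\}, $$
with $p_\lambda(0) = 0$. Often, $a = 3.7$ is used. The solution of (1) is given by
$$ \hat\beta_j^{\mathrm{scad}} =
\begin{cases}
\mathrm{sign}(\hat\beta_j^{\mathrm{ls}})\,(|\hat\beta_j^{\mathrm{ls}}| - \lambda)_+, & \text{when } |\hat\beta_j^{\mathrm{ls}}| \le 2\lambda; \\
\{(a - 1)\hat\beta_j^{\mathrm{ls}} - \mathrm{sign}(\hat\beta_j^{\mathrm{ls}})\,a\lambda\}/(a - 2), & \text{when } 2\lambda < |\hat\beta_j^{\mathrm{ls}}| \le a\lambda; \\
\hat\beta_j^{\mathrm{ls}}, & \text{when } |\hat\beta_j^{\mathrm{ls}}| > a\lambda.
\end{cases} $$

[Figure: two panels, "SCAD Penalty Function" (penalty against t) and "Derivative of SCAD" (derivative against t), each for t from 0 to 6.]

SCAD Works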
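Before the theory, the SCAD derivative and the thresholding rule defined above can be sketched numerically. This is a minimal illustration, not code from the slides: the function names `scad_derivative` and `scad_threshold` are hypothetical, and the default `a = 3.7` follows the value quoted above.

```python
import numpy as np


def scad_derivative(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0 (hypothetical helper name)."""
    t = np.asarray(t, dtype=float)
    # For t > lam: lam * (a*lam - t)_+ / ((a - 1) * lam) = (a*lam - t)_+ / (a - 1)
    tail = np.maximum(a * lam - t, 0.0) / (a - 1.0)
    return np.where(t <= lam, lam, tail)


def scad_threshold(b_ls, lam, a=3.7):
    """SCAD estimate as a function of the least-squares estimate b_ls."""
    b = np.asarray(b_ls, dtype=float)
    abs_b = np.abs(b)
    soft = np.sign(b) * np.maximum(abs_b - lam, 0.0)               # |b| <= 2*lam
    middle = ((a - 1.0) * b - np.sign(b) * a * lam) / (a - 2.0)    # 2*lam < |b| <= a*lam
    return np.where(abs_b <= 2 * lam, soft,
                    np.where(abs_b <= a * lam, middle, b))         # |b| > a*lam: unchanged


# Small inputs are set to zero, moderate ones are shrunk, large ones are kept as-is.
print(scad_threshold([0.5, 1.5, 3.0, 5.0], lam=1.0))
```

The three branches agree at the boundaries $2\lambda$ and $a\lambda$, so the rule is continuous in the least-squares estimate, and estimates larger than $a\lambda$ in absolute value are not shrunk at all.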
We have $N$ training points $(x_1, y_1), \ldots, (x_N, y_N)$ generated from
$$ y_i = x_i^\top \beta + \varepsilon_i. \qquad (2) $$
$(x_i, \varepsilon_i)$, $1 \le i \le N$, are i.i.d.
$x_i$ have mean zero and covariance matrix $\Sigma$.
$\varepsilon_i$ have mean zero and variance $\sigma^2$.
$x_i$ and $\varepsilon_i$ are independent.
The joint distribution of $(x_i, \varepsilon_i)$ satisfies some regularity conditions.
The true parameter vector $\beta = (\beta_{(1)}^\top, \beta_{(2)}^\top)^\top$ consists of a subvector $\beta_{(1)}$ whose components are all nonzero, and another subvector $\beta_{(2)} = 0$.
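Theorem
If $\lambda_N \to 0$ and $\sqrt{N}\,\lambda_N \to \infty$ as $N \to \infty$, then
(1) with probability tending to one, $\hat\beta_{(2)}^{\mathrm{scad}} = 0$;
(2) the asymptotic normality holds for $\hat\beta_{(1)}^{\mathrm{scad}}$ as
$$ \sqrt{N}\left(\hat\beta_{(1)}^{\mathrm{scad}} - \beta_{(1)}\right) \Rightarrow N\left(0, \sigma^2 \Sigma_{(1)}^{-1}\right). $$

Oracle Property
Even if we know that $\beta_{(2)} = 0$ ahead of time and perform least squares only for $\beta_{(1)}$, the asymptotic distribution of the estimate will be the same as that for $\hat\beta_{(1)}^{\mathrm{scad}}$ given in (2) of the theorem.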
So the message is that, although we do not know the truth, the estimator
performs as well as if we knew the truth. This is referred to as
the oracle property in the literature.
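As a concrete check of the theorem's rate condition, consider a polynomially decaying tuning sequence. The particular choice $\lambda_N = N^{-1/4}$ is an illustrative assumption, not one taken from the slides.

```latex
\lambda_N = N^{-1/4}: \qquad
\lambda_N \to 0, \qquad
\sqrt{N}\,\lambda_N = N^{1/4} \to \infty \quad (N \to \infty).
% More generally, \lambda_N = c\,N^{-\gamma} with c > 0 and 0 < \gamma < 1/2
% satisfies both conditions, whereas \lambda_N = N^{-1/2} does not,
% since then \sqrt{N}\,\lambda_N = 1 for every N.
```

In words, $\lambda_N$ must vanish, but more slowly than the $N^{-1/2}$ rate at which the least-squares estimate fluctuates.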
The penalized least squares problem with the SCAD penalty is nonconcave, so
the optimization is in general challenging.
It can be well approximated by a unified algorithm based on the local
linear approximation (LLA) for maximizing the penalized likelihood
for a broad class of concave penalty functions.

Local Linear Approximation
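Suppose we have some initial estimate $\hat\beta^{(0)}$; for example, we may take
$\hat\beta^{(0)} = \hat\beta^{\mathrm{ls}}$. The penalty function $p_\lambda(|\beta_j|)$ can be locally approximated
around $\hat\beta^{(0)}$ by a linear function
$$ p_\lambda(|\beta_j|) \approx p_\lambda(|\hat\beta_j^{(0)}|) + p'_\lambda(|\hat\beta_j^{(0)}|)\,(|\beta_j| - |\hat\beta_j^{(0)}|), \qquad \text{for } \beta_j \approx \hat\beta_j^{(0)}. $$
With this approximation, the penalized least squares becomes
$$ \hat\beta^{(1)} = \arg\max_{\beta \in \mathbb{R}^p} \left\{ -\frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 - \sum_{j=1}^{p} p'_\lambda(|\hat\beta_j^{(0)}|)\,|\beta_j| \right\}, $$
which can be solved using Lasso.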
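The one-step update above is a Lasso problem with coordinate-specific weights $p'_\lambda(|\hat\beta_j^{(0)}|)$. The following is a minimal sketch of that step, assuming a plain coordinate-descent solver written for illustration; the function names, the fixed iteration count, and the simulated data are all assumptions, not part of the slides.

```python
import numpy as np


def scad_derivative(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))


def weighted_lasso(X, y, weights, n_iter=200):
    """Coordinate descent minimizing (1/2N)||y - X b||^2 + sum_j weights[j] * |b_j|.

    Minimizing this is the same as maximizing its negative, as in the text.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n      # x_j' x_j / N for each column
    resid = np.array(y, dtype=float)          # residual y - X beta at beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]        # take coordinate j out of the fit
            z = X[:, j] @ resid / n
            # Soft-threshold z at this coordinate's own weight
            beta[j] = np.sign(z) * max(abs(z) - weights[j], 0.0) / col_scale[j]
            resid -= X[:, j] * beta[j]        # put the updated coordinate back
    return beta


def lla_step(X, y, beta0, lam, a=3.7):
    """One LLA step: Lasso with weights given by the SCAD derivative at |beta0|."""
    weights = scad_derivative(np.abs(beta0), lam, a)
    return weighted_lasso(X, y, weights)


# Illustrative data only: the design, coefficients, and noise level are made up.
rng = np.random.default_rng(0)
N, p = 200, 5
X = rng.standard_normal((N, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = X @ beta_true + rng.standard_normal(N)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # initial estimate: least squares
beta_lla = lla_step(X, y, beta_ls, lam=0.5)
print(np.round(beta_ls, 3))
print(np.round(beta_lla, 3))
```

Coordinates whose initial estimate exceeds $a\lambda$ in absolute value receive weight zero and are left unpenalized, while coordinates with a small initial estimate receive the full weight $\lambda$, as in the ordinary Lasso.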
[Figure: "Local Linear Approximation", two panels plotting penalty against t for t from −10 to 10.]
[Figure: "Solution Paths", two panels comparing SCAD, LLA, and Lasso: estimates against λ (0 to 1.5) and estimates against z (0 to 5).]

Local Linear Approximation Algorithm
Let $I_1 = \{j : p'_\lambda(|\hat\beta_j^{(0)}|) = 0\}$ and $I_2 = \{j : p'_\lambda(|\hat\beta_j^{(0)}|) \ne 0\}$.
Let $X_1 := \{x_j : j \in I_1\}$ be the submatrix of $X$, and define $X_2$ similarly.
Let $P_1$ be the projection matrix onto the column space of $X_1$.
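The three definitions above can be sketched as follows. This is only an illustration of the setup, not the remainder of the algorithm, which is truncated in the source. The helper name `split_and_project` is hypothetical, and forming the projection through the pseudoinverse is one choice among several, assumed here because it also handles an empty or rank-deficient $X_1$.

```python
import numpy as np


def scad_derivative(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))


def split_and_project(X, beta0, lam, a=3.7):
    """Index sets I1, I2, submatrices X1, X2, and the projection P1 (hypothetical helper)."""
    d = scad_derivative(np.abs(beta0), lam, a)
    I1 = np.flatnonzero(d == 0)       # derivative is zero: unpenalized coordinates
    I2 = np.flatnonzero(d != 0)       # derivative is nonzero: penalized coordinates
    X1, X2 = X[:, I1], X[:, I2]
    P1 = X1 @ np.linalg.pinv(X1)      # orthogonal projection onto the column space of X1
    return I1, I2, X1, X2, P1


# Illustrative inputs only: with lam = 1 and a = 3.7, coordinates with |beta0_j| > 3.7 go to I1.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
beta0 = np.array([5.0, 0.1, -4.0, 1.0])
I1, I2, X1, X2, P1 = split_and_project(X, beta0, lam=1.0)
print(I1, I2)
```

Because the SCAD derivative is clipped to exactly zero beyond $a\lambda$, the comparison `d == 0` is safe here; for a penalty whose derivative only approaches zero, a tolerance would be needed instead.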
LLA Alg...
 Fall '13
 Xiao
