lecture_03



… (2) at the same time.

## SCAD

The smoothly clipped absolute deviation (SCAD) penalty has the derivative

$$
p_\lambda'(t) = \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \right\},
$$

with $p_\lambda(0) = 0$. Often $a = 3.7$ is used. The solution of (1) is given by

$$
\hat\beta_j^{\mathrm{scad}} =
\begin{cases}
\operatorname{sign}(\hat\beta_j^{\mathrm{ls}})\,\big(|\hat\beta_j^{\mathrm{ls}}| - \lambda\big)_+, & \text{when } |\hat\beta_j^{\mathrm{ls}}| \le 2\lambda;\\[4pt]
\big\{(a-1)\hat\beta_j^{\mathrm{ls}} - \operatorname{sign}(\hat\beta_j^{\mathrm{ls}})\,a\lambda\big\}/(a-2), & \text{when } 2\lambda < |\hat\beta_j^{\mathrm{ls}}| \le a\lambda;\\[4pt]
\hat\beta_j^{\mathrm{ls}}, & \text{when } |\hat\beta_j^{\mathrm{ls}}| > a\lambda.
\end{cases}
$$

[Figure: the SCAD penalty function and its derivative, each plotted over $0 \le t \le 6$.]

## SCAD Works

Have $N$ training points $(x_1, y_1), \ldots, (x_N, y_N)$ generated from

$$
y_i = x_i^{\top}\beta + \epsilon_i. \tag{2}
$$

- $(x_i, \epsilon_i)$, $1 \le i \le N$, are i.i.d.
- $x_i$ have mean zero and covariance matrix $\Sigma$.
- $\epsilon_i$ have mean zero and variance $\sigma^2$.
- $x_i$ and $\epsilon_i$ are independent.
- The joint distribution of $(x_i, \epsilon_i)$ satisfies some regularity conditions.

The true parameter vector $\beta = \begin{pmatrix} \beta_{(1)} \\ \beta_{(2)} \end{pmatrix}$ consists of a sub-vector $\beta_{(1)}$ whose components are all nonzero, and another sub-vector $\beta_{(2)} = 0$.

**Theorem.** If $\lambda_N \to 0$ and $\sqrt{N}\,\lambda_N \to \infty$ as $N \to \infty$, then

1. with probability tending to one, $\hat\beta_{(2)}^{\mathrm{scad}} = 0$;
2. the asymptotic normality holds for $\hat\beta_{(1)}^{\mathrm{scad}}$:
$$
\sqrt{N}\left(\hat\beta_{(1)}^{\mathrm{scad}} - \beta_{(1)}\right) \Rightarrow N\big(0, \sigma^2 \Sigma^{-1}\big).
$$

## Oracle Property

Even if we knew that $\beta_{(2)} = 0$ ahead of time and performed least squares only for $\beta_{(1)}$, the asymptotic distribution of the estimate would be the same as that of $\hat\beta_{(1)}^{\mathrm{scad}}$ given in part (2) of the theorem. The message is that although we do not know the truth, the estimator performs as well as if we did. This is referred to as the oracle property in the literature.

The SCAD-penalized least squares objective is non-convex (equivalently, the penalized likelihood is non-concave), so the optimization is in general challenging. It can, however, be handled by a unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions.

## Local Linear Approximation

Suppose we have some initial estimate $\hat\beta^{(0)}$; for example, we may take $\hat\beta^{(0)} = \hat\beta^{\mathrm{ls}}$.
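The derivative and the three-case thresholding rule above translate directly into code. The following is a minimal Python sketch (the function names `scad_deriv` and `scad_threshold` are my own, not from the lecture):

```python
import math

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lam(t) of the SCAD penalty, for t >= 0."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)

def scad_threshold(beta_ls, lam, a=3.7):
    """Closed-form SCAD update of a single least-squares coefficient."""
    b = abs(beta_ls)
    s = math.copysign(1.0, beta_ls)
    if b <= 2 * lam:                  # soft-thresholding region
        return s * max(b - lam, 0.0)
    if b <= a * lam:                  # linear interpolation region
        return ((a - 1) * beta_ls - s * a * lam) / (a - 2)
    return beta_ls                    # large coefficients left unshrunk
```

For example, with `lam = 1.0` small coefficients are zeroed, moderate ones are shrunk toward zero, and coefficients beyond `a * lam = 3.7` pass through unchanged, which is exactly the bias-reduction property that distinguishes SCAD from the Lasso.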
The penalty function $p_\lambda(|\beta_j|)$ can be locally approximated around $\hat\beta^{(0)}$ by a linear function,

$$
p_\lambda(|\beta_j|) \approx p_\lambda\big(|\hat\beta_j^{(0)}|\big) + p_\lambda'\big(|\hat\beta_j^{(0)}|\big)\left(|\beta_j| - |\hat\beta_j^{(0)}|\right),
$$

for $\beta_j \approx \hat\beta_j^{(0)}$. With this approximation, the penalized least squares becomes

$$
\hat\beta^{(1)} = \arg\max_{\beta \in \mathbb{R}^p} \left\{ -\frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 - \sum_{j=1}^{p} p_\lambda'\big(|\hat\beta_j^{(0)}|\big)\,|\beta_j| \right\},
$$

which can be solved using the Lasso.

[Figure: the SCAD penalty and its local linear approximation, plotted over $-10 \le t \le 10$.]

[Figure: solution paths of the SCAD, LLA, and Lasso estimates against $\lambda$, and the corresponding thresholding functions against $z$.]

## Local Linear Approximation Algorithm

Let $I_1 = \{j : p_\lambda'(|\hat\beta_j^{(0)}|) = 0\}$ and $I_2 = \{j : p_\lambda'(|\hat\beta_j^{(0)}|) \ne 0\}$. Let $X_1 := \{x_j : j \in I_1\}$ be the sub-matrix of $X$, and define $X_2$ similarly. Let $P_1$ be the projection matrix onto the column space of $X_1$.

LLA Algorithm.

1. For $j \in I_2$, let $x_j^* = \dfrac{\lambda}{p_\lambda'(|\hat\beta_j^{(0)}|)} \cdot x_j$, let $y^* = (I - P_1)\,y$, and let $X^*$ be the matrix with columns $\{x_j^* : j \in I_2\}$ an… [text truncated in source]
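A key property behind the LLA iteration is that, because $p_\lambda$ is concave on $[0,\infty)$, the tangent line above sits on or above the penalty everywhere, so each LLA step minimizes an upper bound on the true penalized objective. A small Python sketch checking this numerically (the function names and the closed-form expression for $p_\lambda$, obtained by integrating the derivative given earlier, are my own):

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lam(|t|); closed form from integrating p'_lam."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lam(t), for t >= 0."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)

def lla(t, t0, lam, a=3.7):
    """Local linear approximation of p_lam at |t0|: a tangent line in |t|.
    Concavity of p_lam on [0, inf) implies lla(t, t0) >= p_lam(|t|)."""
    return scad_penalty(t0, lam, a) + scad_deriv(abs(t0), lam, a) * (abs(t) - abs(t0))
```

The majorization can be verified on a grid: `lla(t, t0, lam)` matches `scad_penalty(t, lam)` at `t = t0` and lies above it elsewhere, which is what makes the LLA iteration a majorize-minimize scheme.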

This note was uploaded on 10/01/2013 for the course FSRM 588 taught by Professor Xiao during the Fall '13 term at Rutgers.
