Statistics 305: Solutions to Homework 4

2. (a) LLR, like all kernel methods, makes sense because it allows us to build models which combine locality with globality. The kernel parameter, in this case σ, controls the tradeoff. LLR in particular uses the local linearity of smooth functions to its advantage. Rather than assuming, as standard kernel methods implicitly do, that the function is locally constant, it assumes the function is locally linear. So we get a first-order approximation at each point, rather than a zeroth-order approximation. The resulting estimates will be smooth, since the weighting function is smooth, and the estimates are a smooth function of x0 and the weights, as the formulae derived in (c) and (d) illustrate.

σ, the tuning parameter, sets the "size" of the neighborhood which we consider to be reasonable. The choice of σ should be driven by the bias-variance tradeoff:

- a smaller σ means more locality, hence smaller bias, but less data used, hence bigger variance
- a bigger σ means using more data for fitting at each point, hence smaller variance, but less locality, hence bigger bias

(b) As σ → 0, the fit converges to a piecewise linear, possibly non-continuous fit, interpolating or extrapolating the 2 x-points closest to the current x0. To see why this is so, consider that as σ → 0, we have W_σ(x_i)/W_σ(x_j) → 0 iff |x_i − x0| > |x_j − x0|. So the closest point will have overwhelming weight in determining the fit. However, our fit at x0 has 2 degrees of freedom (intercept + slope), so we have another DF to "give" to the second-closest point x_j, even if its weight is overwhelmingly smaller.

Note that the HTF book has a mistake on this point, since it states that the limit is a linear interpolation. This is only true if the points are equispaced (and hence the two closest points to each x0 are always on two different sides).

As σ → ∞, all the points will get almost equal weight (since W_σ(x_i) → exp{0} = 1). Hence we will converge to the ordinary least squares solution.

(c) We are fitting a weighted least squares at each x0, so the matrix-form solution is:
(a0, b0)' = (X'W0X)^{-1}(X'W0 y)

Since we are dealing with simple regression, we can (and should!) simplify it and put it in the form of ratios of sums:

a0 = (1/Δ)(Σwx² Σwy − Σwx Σwxy),
b0 = (1/Δ)(Σw Σwxy − Σwx Σwy),
Δ = Σw Σwx² − (Σwx)²

where the indices of the sums have been omitted for ease of reading (e.g. Σwx stands for Σ_i w_i x_i), and we denote w_i = W((x_i − x0)/σ).

(d) Using the results from part (c), we get

f̂(x0) = (1, x0)(X'W0X)^{-1}(X'W0 y),

and so setting h = [(1, x0)(X'W0X)^{-1}X'W0]' gives us f̂(x0) = h'y. Simplifying, we get the ith entry of h as:

h_i = (w_i/Δ)[Σwx² − (x_i + x0)Σwx + x0 x_i Σw]

where we have again dropped all the dummy summation variables for clarity, and Δ is as in (c).

(e) If y_i = a + b x_i for i = 1, ..., n, then from (c) we get:

(a0, b0)' = (X'W0X)^{-1}(X'W0 y) = (X'W0X)^{-1}(X'W0(a + bx)) = (X'W0X)^{-1}(X'W0X)(a, b)' = (a, b)',

so f̂(x0) = a + b x0. As we have shown above, f̂(x0) = h'y, and therefore the bias is: f(x0) − E(h'y) = f(x0) − h'E(y) = f(x0) − f(x0) = 0.

(f) The componentwise Taylor expansion of h'f gives us:

h'f = f(x0) Σ h_i + f'(x0) Σ h_i(x_i − x0) + Σ O(h_i (x_i − x0)²)

Note that this Taylor expansion is not necessarily informative if the higher derivatives are not very small, or if h_i does not decrease fast enough as |x_i − x0| increases. To prove that f̂(x0) is unbiased to first order, we need to show that Σ h_i = 1 and Σ h_i(x_i − x0) = 0. We can use the results of part (d) and, with some work, show these two facts. Here is a simpler solution which some people suggested: using the results from part (e), first assume y_i = 1 (i.e. a = 1, b = 0); then we get from (d) and (e): f̂(x0) = h'1 = 1, so Σ h_i = 1. Now assume y_i = x_i (i.e. a = 0, b = 1); then we get:
f̂(x0) = h'x = x0.

And since we know Σ h_i = 1, we get: Σ h_i(x_i − x0) = (Σ h_i x_i) − x0 = x0 − x0 = 0.

(7) An R function with the desired functionality is included below:

llr <- function(x, y, x0, sigma, hat=FALSE) {
  # Fit local linear regression at x_0
  # Define the length of x
  n = length(x)
  # Append x with a vector of ones
  X = cbind(rep(1, n), x)
  # Define the weight matrix
  W = diag(exp(-((abs(x - x0)/sigma)^2)/2))
  # Solve for h
  h = t(W) %*% X %*% solve(t(X) %*% W %*% X) %*% c(1, x0)
  # Define the fitted value
  y_hat = t(h) %*% y
  if (hat == TRUE) {
    return(list(y_hat = y_hat, hat = h))
  }
  if (hat == FALSE) {
    return(list(y_hat = y_hat))
  }
}

(8) The llr function was used on the abalone.305 dataset, using Whole.wt as the predictor and Rings
as the response. The fit was computed at 25 values in the interval [0, 2.5], and is shown in the figure below. The R code for producing this graphic is also included below.

Figure 1: The local linear regression fit with x as Whole.wt and y as Rings.

prob8 <- function(abalone_data, x_min, x_max, points) {
  # Define x and y
  x = abalone_data$Whole.wt
  y = abalone_data$Rings
  # Define the vector on which to calculate x_0
  x0_values = x_min + ((0:(points - 1))/(points - 1))*(x_max - x_min)
  # Define sigma as one third of the range of x
  m1 = min(x)
  m2 = max(x)
  range = m2 - m1
  sigma = range/3
  k = 1
  y_hat = rep(0, points)
  for (x0 in x0_values) {
    # Calculate the local regression fit at x_0
    local_lin = llr(x, y, x0, sigma)
    y_hat[k] = local_lin$y_hat
    k = k + 1
  }
  plot(x, y)
  lines(x0_values, y_hat)
}

(9) The local linear regression problem, as stated in equation (1), has a free parameter σ. Rather than
just setting σ to be one third of the range of Whole.wt, it would be possible to select σ via K-fold cross-validation, as we selected λ for ridge regression in the previous assignment.

(10) As discussed in class, an attractive way to measure the effective degrees of freedom of a linear model is to compute the trace of the hat matrix. This is motivated analytically by the calculation of Cp in problem 3 of the previous assignment, in which Tr(H) was found to be a suitable measure of effective degrees of freedom. This is consistent with the definition of degrees of freedom for OLS, because in
that case Tr(H) = p, as desired.

(11) An R function with the desired functionality is included below:

local_linear <- function(x, y, hat=FALSE) {
  # Evaluate local linear regression at all data points
  n = length(x)
  y_hat = rep(0, n)
  hat_matrix = matrix(rep(0, n^2), n, n)
  # Define sigma as one third of the range of x
  m1 = min(x)
  m2 = max(x)
  range = m2 - m1
  sigma = range/3
  k = 1
  for (x0 in x) {
    # Find the fit at x_0; llr is called with hat=TRUE so that the weight
    # vector is always available for building the hat matrix
    local_lin = llr(x, y, x0, sigma, hat=TRUE)
    y_hat[k] = local_lin$y_hat
    # Construct the kth row of the hat matrix
    hat_matrix[k, ] = local_lin$hat
    k = k + 1
  }
  # Define the effective degrees of freedom
  df = sum(diag(hat_matrix))
  if (hat == TRUE) {
    return(list(y_hat = y_hat, hat = hat_matrix, df = df))
  }
  if (hat == FALSE) {
    return(list(y_hat = y_hat, df = df))
  }
}

(12) In general, an F statistic for comparing two nested models M1 and M2, with p1 < p2 parameters respectively, can be loosely written as

F = [(RSS1 − RSS2)/(p2 − p1)] / [RSS2/(n − p2)]
By assuming that the normal linear regression is "nested" in local linear regression (since the trace of the hat matrix suggests that the effective degrees of freedom for local linear regression is approximately 2.647 > 2), the F statistic for comparing linear regression with local linear regression was computed in this way (code included below) and was found to be F = 21.33363, suggesting that local linear regression does in fact outperform linear regression, because the corresponding threshold for accepting the larger model at a five percent significance level was 19, crudely assuming that p1 = p2 = 2. The local linear regression plot is overlaid with the linear regression fit, as shown in the following figure. It should be noted that the local linear regression fit would respond even more to the local features of the data for a smaller choice of σ.

Figure 2: The local linear regression fit with x as Whole.wt and y as Rings, shown with the standard linear
regression fit (straight line).

prob12 <- function(abalone_data) {
  x = abalone_data$Whole.wt
  y = abalone_data$Rings
  n = length(x)
  # Calculate the local linear regression
  local_lin = local_linear(x, y, hat=T)
  y_hat = local_lin$y_hat
  hat_matrix = local_lin$hat
  ind = sort(x, index.return=T)$ix
  # Calculate the linear regression
  beta = lm(y ~ x)$coefficients
  lin_fit = beta[1] + beta[2]*x
  # Plot both regressions with the data
  plot(x, y)
  lines(x[ind], y_hat[ind])
  lines(x[ind], lin_fit[ind])
  lin_reg_RSS = sum((lin_fit - y)^2)
  loc_reg_RSS = sum((y_hat - y)^2)
  print(c(lin_reg_RSS, loc_reg_RSS))
  # Calculate the effective degrees of freedom
  df = sum(diag(hat_matrix))
  # Calculate the F statistic
  F = ((lin_reg_RSS - loc_reg_RSS)/(df - 2))/(loc_reg_RSS/(n - df))
  print(F)
  # Display the hat matrix, unsorted and sorted
  par(mfrow=c(1,2))
  image(t(hat_matrix[,ind]), col=gray((0:32)/32))
  image(t(hat_matrix[ind,ind]), col=gray((0:64)/64))
}

(13) The hat matrix from part (12) was plotted using the image function, and is shown in the following
figure. Since f̂(x0) is just a linear combination of the y_i's, the entries of each row of the hat matrix reflect which y_i's were weighted the most heavily. Indeed, by sorting the elements of the hat matrix so that the rows are in increasing order with respect to Whole.wt, you can see the "window" shifting from one end of y to the other. The light areas in the plots indicate the portion of y weighted most heavily; as would be expected, this window moves toward the upper end of y as x0 increases. This is an interesting and intuitive way to see the effect that σ has on the window in local regression. The R code for producing this graphic was included in part (12).

Figure 3: The hat matrix, unsorted (left), and with the rows sorted according to increasing Whole.wt (right).
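The algebraic facts established in parts (c), (d), and (f), and the trace-of-the-hat-matrix notion of effective degrees of freedom from (10), can be checked numerically. A small self-contained sketch (in Python rather than R, for verifiability; the function name h_vector and the sample points are made up, not from the assignment):

```python
import math

def h_vector(x, x0, sigma):
    # Local linear regression weight vector at x0, via the ratio-of-sums
    # formulas of parts (c)/(d):
    #   h_i = (w_i / Delta) * [S_wxx - (x_i + x0) * S_wx + x0 * x_i * S_w]
    w = [math.exp(-((xi - x0) / sigma) ** 2 / 2) for xi in x]
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    delta = s_w * s_wxx - s_wx ** 2
    return [wi / delta * (s_wxx - (xi + x0) * s_wx + x0 * xi * s_w)
            for wi, xi in zip(w, x)]

x = [0.1, 0.4, 0.7, 1.0, 1.6, 2.2]  # made-up predictor values
x0 = 0.9
h = h_vector(x, x0, sigma=0.5)

# First-order unbiasedness conditions from part (f):
print(round(sum(h), 10))                                 # -> 1.0
print(round(sum(hi * xi for hi, xi in zip(h, x)), 10))   # -> 0.9, i.e. x0

# Effective degrees of freedom as in (10): the trace of the hat matrix is
# the sum of each point's self-weight h_i when fitting at x0 = x_i.
df = sum(h_vector(x, xi, 0.5)[i] for i, xi in enumerate(x))
print(2 < df < len(x))  # -> True: between OLS (2) and interpolation (n)
```

The two printed identities hold for any data and any σ > 0, which is exactly the first-order unbiasedness argument of part (f); the trace, by contrast, depends on σ, shrinking toward 2 as σ grows.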
Fall '09