Midterm Test for ST4240 Data Mining

(Please answer all the questions for full marks. Please send your answer to )

1. For data A (at ), there are 5 predictors X1, …, X5 and a response Y. A single-index model (SIM) is suggested:

   Y = g(a1*X1 + … + a5*X5) + e

A. Estimate the model, and plot the link function and its confidence band.

   The estimated model is

   Y = g(0.008595073*X1 - 0.740091476*X2 + 0.034182754*X3 - 0.671579756*X4 + 0.001703434*X5)

   The estimated link function and its 95% confidence band are shown in Figure 1.

   Figure 1 (horizontal axis: xalpha; vertical axis: y)

B. Which variables can be removed? Estimate the model again after removing the variables.

   The estimated coefficients have standard errors 0.02222250, 0.01297481, 0.02339194, 0.01451642 and 0.02307927, respectively. Checking the "t-statistics" (each coefficient divided by its standard error), we see that those of X1, X3 and X5 are small in absolute value, so X1, X3 and X5 can be removed.

C. For a new X (X1 = 0, X2 = 0, X3 = 0, X4 = 0, X5 = 0), predict the function value, i.e. E(Y | new X), and calculate its 95% confidence interval.

   The predicted value is 1.023846, and the 95% confidence interval is [0.6610302, 1.386661].
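The t-statistic check in part B can be reproduced directly from the reported coefficients and standard errors. The following is a minimal sketch in R; the cutoff of about 2 in absolute value is an assumed rule of thumb and is not stated in the answer itself.

```r
# Coefficient estimates and standard errors as reported in parts A and B
alpha <- c(X1 = 0.008595073, X2 = -0.740091476, X3 = 0.034182754,
           X4 = -0.671579756, X5 = 0.001703434)
se <- c(0.02222250, 0.01297481, 0.02339194, 0.01451642, 0.02307927)

# t-statistic = estimate / standard error
tstat <- alpha / se
print(round(tstat, 2))

# Assumed rule of thumb: a variable is a candidate for removal if |t| < 2
removable <- names(tstat)[abs(tstat) < 2]
print(removable)
```

Under this assumed cutoff, the candidates for removal are X1, X3 and X5, which agrees with the answer given in part B.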
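CODE

# Read the data: columns 1-5 are the predictors, column 6 is the response
xy = read.table("testdata1.dat")
x = data.matrix(xy[,1:5])
y = data.matrix(xy[,6])

# Fit the single-index model with sim() from sim.R
source("sim.R")
out = sim(x, y)
out$alpha   # estimated index coefficients
out$se      # their standard errors

# Plot y against the estimated index, then add the fitted link and its band
xalpha = x %*% out$alpha
plot(xalpha, y)
I = order(xalpha)   # sort by index value so the lines are drawn left to right
lines(xalpha[I], out$predict[I])
lines(xalpha[I], out$Ln[I])
lines(xalpha[I], out$Un[I])

# Predict at the new point X = (0, 0, 0, 0, 0) and print both band limits
out = sim(x, y, xnew = c(0, 0, 0, 0, 0))
out$predict
out$Ln
out$Un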
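As a quick sanity check on the interval reported in part C, one can verify that it is symmetric around the predicted value and back out the standard error it implies. This is only a sketch: it assumes the interval has the normal-approximation form prediction ± z * se, which may not be how sim.R constructs it, since that file is not shown here.

```r
# Values reported in the answer to part C
pred  <- 1.023846
lower <- 0.6610302
upper <- 1.386661

# The two half-widths agree (up to rounding) if the interval is symmetric
half_lower <- pred - lower
half_upper <- upper - pred
print(c(half_lower, half_upper))

# Assumption: interval = pred +/- qnorm(0.975) * se
se_implied <- (upper - lower) / (2 * qnorm(0.975))
print(se_implied)
```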
This note was uploaded on 10/04/2010 for the course STAT ST4240 taught by Professor Xiayingcun during the Fall '09 term at National University of Singapore.

