Statistics 371 Sample Midterm Solution

In Stat 371, we deal with applications and theory of the linear model $Y = X\beta + R$, where
$X = (\mathbf{1}\; x_1 \ldots x_p)$ is an $n \times (p+1)$ matrix with columns giving the values of the explanatory variates and $R$ is a vector of random variables with independent components
$R_i \sim G(0, \sigma)$. We represent the corresponding data model by $y = X\beta + r$, where $y$ is the vector of observed values of the response variate.

(4 marks) Give two distinct, different uses of this model in business contexts.

o prediction: predict the market value of a building using the selling price (response variate) and various explanatory variates (size, age, ...) from sales of similar buildings
o estimate parameters: estimate the volatility of a share price relative to an index using past closing prices
o look for outliers: identify extreme salaries (response variate) after adjusting for explanatory variates such as experience, age, educational qualifications, ...

For the model described above:

(1 mark) What is the criterion used to produce the least squares estimates of the
parameters $\beta$? We minimize $\|r\|^2 = \|y - X\beta\|^2$, or equivalently we choose $\hat\beta$ so that $\hat r = y - X\hat\beta$ is perpendicular to
$\mathrm{Span}(\mathbf{1}, x_1, \ldots, x_p)$.

(5 marks) We know that the least squares estimate of $\beta$ is $\hat\beta = (X'X)^{-1}X'y$ and the corresponding estimator is $\tilde\beta \sim N(\beta, \sigma^2 (X'X)^{-1})$. Suppose we want to predict the
response variate for a unit with values of the explanatory variates $u' = (1, u_1, \ldots, u_p)$. Derive
a 95% prediction interval. Be sure to explain the derivation.

We know that $\tilde\beta \sim N(\beta, \sigma^2 (X'X)^{-1})$ and hence $u'\tilde\beta \sim N(u'\beta, \sigma^2 u'(X'X)^{-1}u)$. We are predicting $Y$ where $Y \sim N(u'\beta, \sigma^2)$, independently of $\tilde\beta$. Hence we have $Y - u'\tilde\beta \sim N(0, \sigma^2 (1 + u'(X'X)^{-1}u))$.
Standardizing and replacing $\sigma$ by $\tilde\sigma$, we have

$$\frac{Y - u'\tilde\beta}{\tilde\sigma \sqrt{1 + u'(X'X)^{-1}u}} \sim t_{n-(p+1)}.$$

We use this random variable as the basis for our interval. Choosing $c$ so that
$\Pr(|t_{n-(p+1)}| \le c) = 0.95$, we have

$$\Pr\left(-c \le \frac{Y - u'\tilde\beta}{\tilde\sigma \sqrt{1 + u'(X'X)^{-1}u}} \le c\right) = 0.95.$$

Cross-multiplying and re-arranging, we get the probability statement

$$\Pr\left(u'\tilde\beta - c\,\tilde\sigma \sqrt{1 + u'(X'X)^{-1}u} \le Y \le u'\tilde\beta + c\,\tilde\sigma \sqrt{1 + u'(X'X)^{-1}u}\right) = 0.95.$$

We get the 95% prediction interval by replacing the estimators by the corresponding estimates:

$$\left(u'\hat\beta - c\,\hat\sigma \sqrt{1 + u'(X'X)^{-1}u},\; u'\hat\beta + c\,\hat\sigma \sqrt{1 + u'(X'X)^{-1}u}\right)$$

Note: many marks were lost because of confusion among parameters, estimates and
estimators. In a compensation study of the chief executive officer salaries in one state, data were
collected from 91 rural school districts in a given year. The purpose of the investigation
was to determine if the salaries were relatively “equitable”, or if some CEOs were highly
under or overpaid, relative to the others, after adjusting for qualifications.

The variates measured were:

experience: number of years in the current or similar job
size: number of students in the district
education level: BA only, MA or PhD
cost of living (col): relative cost of living in the district
salary: annual salary of CEO

Note that education level is captured by two explanatory variates, ma = 0, 1 and phd = 0, 1, where 1 indicates the presence of the degree. If phd = 1, then the CEO has the
equivalent to both degrees so ma is set to 1. The R output from fitting a linear model is shown below.

Call:
lm(formula = salary ~ experience + size + ma + phd + col)

Residuals:
    Min      1Q  Median      3Q     Max
-2969.7  -944.2   135.4  1059.6  3299.6

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 90717.6700  4093.0329  22.164  < 2e-16 ***
experience    171.1682    36.4134   4.701 9.91e-06 ***
size            3.1024     0.4598   6.747 1.74e-09 ***
ma           5228.7820   334.4497  15.634  < 2e-16 ***
phd          4910.7543   474.3909  10.352  < 2e-16 ***
col         -2917.0956  4152.1441  -0.703    0.484

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1402 on 85 degrees of freedom
Multiple R-Squared: 0.889, Adjusted R-squared: 0.8825
F-statistic: 136.2 on 5 and 85 DF, p-value: < 2.2e-16

Since $E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4\,\text{phd} + \beta_5\,\text{col}$, $\beta_4$ represents the
average change in salary if a CEO gets a PhD, all other explanatory variates held fixed.

b) (1 mark) Suppose we add a product term phd*experience with coefficient $\beta_{14}$ to the
model. Carefully interpret this parameter.

With the new model we have

$$E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4\,\text{phd} + \beta_5\,\text{col} + \beta_{14}\,\text{experience} \times \text{phd}.$$

If phd = 0,

$$E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_5\,\text{col},$$

and if phd = 1,

$$E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4 + \beta_5\,\text{col} + \beta_{14}\,\text{experience} = \beta_0 + (\beta_1 + \beta_{14})\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4 + \beta_5\,\text{col}.$$

Hence $\beta_{14}$ represents the change in the rate at which experience affects average
salary if a CEO has a PhD versus not having a PhD.

Note: This was meant to be difficult and it proved to be so. Good thing it was only one
mark!

c) (2 marks) The Pr(>|t|) for the variable col is 0.484. What does this tell us?

The p-value for the hypothesis $\beta_5 = 0$ is large, so there is no evidence against this hypothesis. That is, there is no evidence that col affects salary, all other explanatory variates being held
fixed.

d) (4 marks) To check the contribution of size and experience to the model, a new model
with terms e2 and s2, the squares of experience and size, was fit. Part of the R summary
output is shown below. Is there any evidence that these quadratic terms are necessary?

Residual standard error: 1377 on 83 degrees of freedom
Multiple R-Squared: 0.8955, Adjusted R-squared: 0.8867
F-statistic: 101.6 on 7 and 83 DF, p-value: < 2.2e-16

The estimate of $\sigma$ under the full model (including the squared terms) is 1377, so the
residual sum of squares is $83 \times 1377^2 = 157{,}378{,}707$.

The estimate of $\sigma$ under the restricted model (without the squared terms) is 1402, so the
residual sum of squares is $85 \times 1402^2 = 167{,}076{,}340$.

The change in the residual sum of squares is $167{,}076{,}340 - 157{,}378{,}707 = 9{,}697{,}633$ with $85 - 83 = 2$ degrees of freedom, so the mean square is $9{,}697{,}633 / 2 = 4{,}848{,}816$.

To test the hypothesis that the coefficients of the squared terms are simultaneously 0, the
discrepancy measure is

$$\frac{4{,}848{,}816}{1377^2} = 2.557$$

and the p-value satisfies $0.05 < \Pr(F_{2,83} \ge 2.557) < 0.10$. There is weak evidence against the hypothesis that the coefficients of the squared terms are 0 and hence weak evidence that they need to be
included in the model. Note that the full model has 83 degrees of freedom for estimating $\sigma$ here.

e) (2 marks) A quantile-quantile (qq) plot of the standardized residuals is shown below.
Explain how to calculate the coordinates of the point in the lower left corner of the plot.

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

We divide the G(0,1) distribution into 91 bins, each with probability 1/91. The x-coordinate is the "center" of the first bin, $q_1$, where $\Pr(Z \le q_1) =$ [value illegible in source]. The y-coordinate is the smallest standardized residual in the set of 91.
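
The x-coordinates described above can be sketched numerically. This is a minimal illustration, not the course's own code: it assumes the "center" of bin i is the point with cumulative probability (i - 0.5)/91, so 1/182 for the first bin (an assumption, since that probability is not legible in the text above). The function names are hypothetical. R's `ppoints` uses the same (i - 0.5)/n positions when n > 10.

```python
from statistics import NormalDist


def qq_theoretical_quantiles(n):
    """G(0,1) quantiles at cumulative probabilities (i - 0.5)/n, i = 1..n.

    Assumption: the 'center' of each of the n equal-probability bins is
    taken to be the point that splits the bin's probability in half.
    """
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def qq_points(std_residuals):
    """Pair each theoretical quantile with the ordered standardized residuals."""
    ordered = sorted(std_residuals)
    return list(zip(qq_theoretical_quantiles(len(ordered)), ordered))


quantiles = qq_theoretical_quantiles(91)
q1 = quantiles[0]  # x-coordinate of the lower-left point, roughly -2.5
print(f"q1 = {q1:.3f}")
```

With the 91 standardized residuals in hand, `qq_points(residuals)[0]` would give the lower-left point: the pair (q1, smallest standardized residual).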
f) (1 mark) What does the qq plot tell us in this case?

Since the points fall close to a straight line, we can be confident that the assumption of
Gaussian residuals is reasonable.

g) (2 marks) How can we detect cases with an outlier in the explanatory variates?

For each case, we look at the leverages $h_{ii}$, the diagonal elements of the hat matrix $H = X(X'X)^{-1}X'$. If a leverage is close to 1 or relatively large, then the corresponding
values of the explanatory variates are an outlier and are possibly influential in the fit of
the model.

h) (2 marks) A plot of the studentized residuals versus the case number is shown below. Assuming that the fit of the model is adequate, use the plot to provide a conclusion to the
investigation.

[Figure: studentized residuals, rstudent(b), versus case Index (0 to 80)]

The purpose of the investigation was to identify outliers in the response variate, the CEO salary, after accounting for the explanatory variates. Looking at the plot of the studentized residuals, we see no very large values (i.e. > 2.5 in absolute value), so it appears that the salaries are
equitable.
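
As a numerical check of the arithmetic in part d), the sketch below recomputes the residual sums of squares, the mean square and the discrepancy measure from the two reported residual standard errors. The function names are hypothetical. The tail probability uses the closed form $\Pr(F_{2,d} \ge f) = (1 + 2f/d)^{-d/2}$, which holds only when the numerator has exactly 2 degrees of freedom, as it does here.

```python
def partial_f_test(sigma_full, df_full, sigma_restricted, df_restricted):
    """Compare nested linear models using their residual standard errors.

    Each residual sum of squares is recovered as df * sigma_hat**2.
    """
    rss_full = df_full * sigma_full ** 2
    rss_restricted = df_restricted * sigma_restricted ** 2
    extra_df = df_restricted - df_full  # number of coefficients being tested
    mean_square = (rss_restricted - rss_full) / extra_df
    f_stat = mean_square / sigma_full ** 2
    return rss_full, rss_restricted, mean_square, f_stat


def f_upper_tail_2df(f_stat, df_den):
    """Pr(F_{2, df_den} >= f_stat); valid only for 2 numerator degrees of freedom."""
    return (1 + 2 * f_stat / df_den) ** (-df_den / 2)


# Values taken from the two R summaries: 1377 on 83 df (full), 1402 on 85 df (restricted).
rss_full, rss_restricted, mean_square, f_stat = partial_f_test(1377, 83, 1402, 85)
p_value = f_upper_tail_2df(f_stat, 83)
print(rss_full, rss_restricted, mean_square)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

The computed p-value is about 0.08, which falls inside the bracket 0.05 < p < 0.10 quoted in the solution.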