STAT_371_F04_Midterm__with_solutions_

Statistics 371 Sample Midterm Solution

In Stat 371, we deal with applications and theory of the linear model $Y = X\beta + R$ where $X = (1 \; x_1 \; \ldots \; x_p)$ is an $n \times (p+1)$ matrix with columns giving the values of the explanatory variates and $R$ is a vector of random variables with independent components $R_i \sim G(0, \sigma)$. We represent the corresponding data model by $y = X\beta + r$ where $y$ is the vector of observed values of the response variate.

(4 marks) Give two distinct, different uses of this model in business contexts.

• prediction: predict the market value of a building using the selling price (response variate) and various explanatory variates (size, age, ...) from sales of similar buildings
• estimate parameters: estimate the volatility of a share price relative to an index using past closing prices
• look for outliers: identify extreme salaries (response variate) after adjusting for explanatory variates such as experience, age, educational qualifications, ...

For the model described above:

(1 mark) What is the criterion used to produce the least squares estimates of the parameters $\beta$?

We minimize $\|r\|^2 = \|y - X\beta\|^2$, or equivalently we choose $\hat\beta$ so that $\hat r = y - X\hat\beta$ is perpendicular to $\mathrm{Span}(1, x_1, \ldots, x_p)$.

(5 marks) We know that the least squares estimate of $\beta$ is $\hat\beta = (X'X)^{-1}X'y$ and the corresponding estimator is $\tilde\beta \sim N(\beta, \sigma^2 (X'X)^{-1})$. Suppose we want to predict the response variate for a unit with values of the explanatory variates $u' = (1, u_1, \ldots, u_p)$. Derive a 95% prediction interval. Be sure to explain the derivation.

We know that $\tilde\beta \sim N(\beta, \sigma^2 (X'X)^{-1})$ and hence $u'\tilde\beta \sim N(u'\beta, \sigma^2 u'(X'X)^{-1}u)$. We are predicting $Y$ where $Y \sim N(u'\beta, \sigma^2)$, independently of $\tilde\beta$. Hence we have $Y - u'\tilde\beta \sim N(0, \sigma^2(1 + u'(X'X)^{-1}u))$. Standardizing and replacing $\sigma$ by $\tilde\sigma$, we have

$$\frac{Y - u'\tilde\beta}{\tilde\sigma\sqrt{1 + u'(X'X)^{-1}u}} \sim t_{n-(p+1)}.$$

We use this random variable as the basis for our interval. Choosing $c$ so that $\Pr(|t_{n-(p+1)}| \le c) = 0.95$, we have

$$\Pr\left(-c \le \frac{Y - u'\tilde\beta}{\tilde\sigma\sqrt{1 + u'(X'X)^{-1}u}} \le c\right) = 0.95.$$

Cross-multiplying and re-arranging we get the probability statement

$$\Pr\left(u'\tilde\beta - c\,\tilde\sigma\sqrt{1 + u'(X'X)^{-1}u} \le Y \le u'\tilde\beta + c\,\tilde\sigma\sqrt{1 + u'(X'X)^{-1}u}\right) = 0.95.$$

We get the 95% prediction interval by replacing the estimators by the corresponding estimates:

$$\left(u'\hat\beta - c\,\hat\sigma\sqrt{1 + u'(X'X)^{-1}u},\; u'\hat\beta + c\,\hat\sigma\sqrt{1 + u'(X'X)^{-1}u}\right)$$

Note: many marks were lost because of confusion among parameters, estimates and estimators.

In a compensation study of the chief executive officer salaries in one state, data were collected from 91 rural school districts in a given year. The purpose of the investigation was to determine if the salaries were relatively "equitable", or if some CEOs were highly under- or over-paid, relative to the others, after adjusting for qualifications. The variates measured were:

experience: number of years in the current or similar job
size: number of students in the district
education level: BA only, MA or PhD
cost of living (col): relative cost of living in the district
salary: annual salary of CEO

Note that education level is captured by two explanatory variates ma = 0, 1 and phd = 0, 1 where 1 indicates the presence of the degree. If phd = 1, then the CEO has the equivalent of both degrees so ma is set to 1. The R output from fitting a linear model follows.

    Call:
    lm(formula = salary ~ experience + size + ma + phd + col)

    Residuals:
        Min      1Q  Median      3Q     Max
    -2969.7  -944.2  -135.4  1059.6  3299.6

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 90717.6700  4093.0329  22.164  < 2e-16 ***
    experience    171.1682    36.4134   4.701 9.91e-06 ***
    size            3.1024     0.4598   6.747 1.74e-09 ***
    ma           5228.7820   334.4497  15.634  < 2e-16 ***
    phd          4910.7543   474.3909  10.352  < 2e-16 ***
    col         -2917.0956  4152.1441  -0.703    0.484
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1402 on 85 degrees of freedom
    Multiple R-Squared: 0.889, Adjusted R-squared: 0.8825
    F-statistic: 136.2 on 5 and 85 DF, p-value: < 2.2e-16

Since $E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4\,\text{phd} + \beta_5\,\text{col}$, $\beta_4$ represents the average change in salary if a CEO gets a PhD, all other explanatory variates held fixed.

b) (1 mark) Suppose we add a product term phd*experience with coefficient $\beta_{14}$ to the model. Carefully interpret this parameter.

With the new model we have

$$E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4\,\text{phd} + \beta_5\,\text{col} + \beta_{14}\,\text{experience} \times \text{phd}.$$

If phd = 0,

$$E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_5\,\text{col}$$

and if phd = 1,

$$E(Y) = \beta_0 + \beta_1\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4 + \beta_5\,\text{col} + \beta_{14}\,\text{experience} = \beta_0 + (\beta_1 + \beta_{14})\,\text{experience} + \beta_2\,\text{size} + \beta_3\,\text{ma} + \beta_4 + \beta_5\,\text{col}.$$

Hence $\beta_{14}$ represents the change in the rate at which experience affects average salary if a CEO has a PhD versus not having a PhD.

Note: This was meant to be difficult and it proved to be so. Good thing it was only one mark!

c) (2 marks) The Pr(>|t|) for the variable col is 0.484. What does this tell us?

The p-value for the hypothesis $\beta_5 = 0$ is large, so there is no evidence against this hypothesis. That is, there is no evidence that col affects salary, all other explanatory variates being held fixed.

d) (4 marks) To check the contribution of size and experience to the model, a new model with terms e2 and s2, the squares of experience and size, was fit. Part of the R summary output is shown below. Is there any evidence that these quadratic terms are necessary?

    Residual standard error: 1377 on 83 degrees of freedom
    Multiple R-Squared: 0.8955, Adjusted R-squared: 0.8867
    F-statistic: 101.6 on 7 and 83 DF, p-value: < 2.2e-16

The estimate of $\sigma$ under the full model (including the squared terms) is 1377 so the residual sum of squares is $83 \times (1377)^2 = 157{,}378{,}707$.

The estimate of $\sigma$ under the restricted model (without the squared terms) is 1402 so the residual sum of squares is $85 \times (1402)^2 = 167{,}076{,}340$.

The change in the residual sum of squares is $167{,}076{,}340 - 157{,}378{,}707 = 9{,}697{,}633$ with $8 - 6 = 2$ degrees of freedom so the mean square is $9{,}697{,}633 / 2 = 4{,}848{,}816.5$.

To test the hypothesis that the coefficients of the squared terms are simultaneously 0, the discrepancy measure is

$$\frac{4{,}848{,}816.5}{(1377)^2} = 2.557$$

and the p-value satisfies $0.05 < \Pr(F_{2,83} \ge 2.557) < 0.10$. There is weak evidence against the hypothesis that the coefficients of the squared terms are 0 and hence weak evidence that they need to be included in the model. Note that the full model has 83 degrees of freedom for estimating $\sigma$ here.

e) (2 marks) A quantile-quantile (qq) plot of the standardized residuals is shown below. Explain how to calculate the coordinates of the point in the lower left corner of the plot.

[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

We divide the G(0,1) distribution into 91 bins each with probability 1/91. The x-coordinate is the "center" of the first bin, $q_1$, where $\Pr(Z \le q_1) = 1/(2 \times 91) = 1/182$. The y-coordinate is the smallest standardized residual in the set of 91.

f) (1 mark) What does the qq plot tell us in this case?

Since the points fall close to a straight line, we can be confident that the assumption of Gaussian residuals is reasonable.

g) (2 marks) How can we detect cases with an outlier in the explanatory variates?

For each case, we look at the leverages $h_{ii}$, the diagonal elements of the hat matrix $H = X(X'X)^{-1}X'$. If a leverage is close to 1 or relatively large, then the corresponding values of the explanatory variates are an outlier and are possibly influential in the fit of the model.

h) (2 marks) A plot of the studentized residuals versus the case number is shown below. Assuming that the fit of the model is adequate, use the plot to provide a conclusion to the investigation.

[Figure: studentized residuals, rstudent(b), plotted against Index (case number)]

The purpose of the investigation was to identify outliers in the response variate, the CEO salary, after accounting for the explanatory variates. Looking at the plot of the studentized residuals we see no very large values (i.e. > 2.5) so it appears that the salaries are equitable.
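
The arithmetic in part (d) can be checked with a short script. The sketch below rebuilds the two residual sums of squares from the residual standard errors quoted in the R summaries and forms the F ratio. The function name `partial_f_test` is hypothetical, and the closed-form p-value is a shortcut that holds only when the numerator has exactly 2 degrees of freedom, as it does here; it is not the general F tail probability.

```python
# Partial F-test from two residual standard errors (standard library only).
# The helper name is hypothetical; the inputs are the figures quoted above.

def partial_f_test(sigma_full, df_full, sigma_restricted, df_restricted):
    """Return (F statistic, p-value) for the terms dropped from the full model.

    The p-value formula is valid only when exactly two terms are dropped,
    i.e. when the numerator degrees of freedom equal 2.
    """
    rss_full = df_full * sigma_full ** 2
    rss_restricted = df_restricted * sigma_restricted ** 2
    num_df = df_restricted - df_full
    if num_df != 2:
        raise ValueError("closed-form p-value assumes 2 numerator df")
    f_stat = ((rss_restricted - rss_full) / num_df) / sigma_full ** 2
    # For F with (2, d) degrees of freedom: Pr(F >= f) = (1 + 2f/d)^(-d/2).
    p_value = (1.0 + 2.0 * f_stat / df_full) ** (-df_full / 2.0)
    return f_stat, p_value


f_stat, p_value = partial_f_test(1377, 83, 1402, 85)
print(round(f_stat, 3), round(p_value, 3))
```

Because the inputs are rounded standard errors, the result agrees with the R fit only to about the precision shown in the summaries.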
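
The Q-Q plot construction in part (e) can be sketched in a few lines. This is an illustration under one stated assumption: the "center" of the i-th bin is taken to sit at cumulative probability (i − 0.5)/91, so the first theoretical quantile solves Pr(Z ≤ q_1) = 0.5/91. The helper `qq_points` is a hypothetical name, and no residuals from the study are available, so only the x-coordinates are computed.

```python
from statistics import NormalDist

n = 91  # number of districts in the study

# Assumption: bin i is centered at cumulative probability (i - 0.5) / n.
probs = [(i - 0.5) / n for i in range(1, n + 1)]
theoretical = [NormalDist().inv_cdf(p) for p in probs]

# x-coordinate of the lower-left point of the plot
q1 = theoretical[0]
print(round(q1, 3))


def qq_points(standardized_residuals):
    """Pair each theoretical quantile with the matching ordered residual."""
    ordered = sorted(standardized_residuals)
    m = len(ordered)
    quantiles = [NormalDist().inv_cdf((i - 0.5) / m) for i in range(1, m + 1)]
    return list(zip(quantiles, ordered))
```

Calling `qq_points` on the 91 standardized residuals would return the plotted pairs, with the first pair being the lower-left point.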
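
To make the leverage idea in part (g) concrete, here is a minimal sketch for the special case of an intercept plus one explanatory variate, where the diagonal of the hat matrix reduces to $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The data are made up for illustration and are not from the salary study. The cutoff of twice the average leverage is a common rule of thumb, assumed here rather than taken from the exam.

```python
def leverages_simple(x):
    """Leverages h_ii for the model with an intercept and one variate x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical data: the last value lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages_simple(x)

# The leverages sum to the number of parameters (2 here), so the
# average leverage is 2 / n.
average = sum(h) / len(h)

# Assumed rule of thumb: flag cases whose leverage exceeds twice the average.
flagged = [i for i, hi in enumerate(h) if hi > 2 * average]
print([round(hi, 3) for hi in h], flagged)
```

In this toy data set only the last case is flagged, and its leverage is close to 1, which matches the description of an outlier in the explanatory variates.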