BES Tutorial Sample Solutions, S1/10
WEEK 12 TUTORIAL EXERCISES (To be discussed in the week starting May 24)

1. Recall the Anzac Garage data (ANZACG.XLS) used in Weeks 3, 8 and 10. In Week 3 we considered the simple linear regression model given by:

\[ price = \beta_0 + \beta_1\, age + u \]

where price = used car price in dollars and age = age of the car in years. The EXCEL results obtained using Ordinary Least Squares are presented below:

Regression Statistics
R Square          0.077
Standard Error    42069
Observations      117

             Coefficients   Standard Error   t Stat   p-value
Intercept    47469          6748             7.035    0.000
Age          -2658          856              -3.106   0.002

(a) Interpret the "t-Stat" and the "p-values" in the EXCEL output. What do you need to assume?

The t-stats and p-values in the EXCEL output are derived from two-tailed tests of the null hypothesis that the associated population parameter equals 0. Hence, larger t-stats (in absolute value) and lower p-values mean we are more confident that the associated population parameter is nonzero. Here, the p-values for both the intercept and the Age coefficient are below 1%, so both estimates are statistically significant and we can be confident that both population parameters are nonzero. We need to assume the disturbances are normal or, because the sample size is large, invoke the CLT.

(b) Calculate a 95% confidence interval for the coefficient on age.

The standard normal critical value is 1.96, hence the 95% confidence interval is:
-2658 ± 1.96×856 = -2658 ± 1678 = (-4336, -980)

(c) Interpret the R² value.

The regression model including age explains 7.7% of the variation in used car prices.

(d) Test whether the estimated coefficient of Age is significantly less than zero at the 5% level of significance.

Unlike in (a) this is a one-tailed test:
H0: β1 = 0; H1: β1 < 0
Decision rule: Reject H0 if b1/se(b1) < -1.645
Test statistic: b1/se(b1) = -3.106 < -1.645 and hence reject H0.

(e) Estimate a 95% confidence interval for the mean price for a second-hand passenger car that is 10 years old and interpret the result. Note: the sample mean of age is 6.44 years.

A 10 year old car is expected to be valued at $47469 - 10×2658 = $20889. The boundaries of the confidence interval for this prediction can be found by:

\[ \hat{y}_g \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} \]

where s = 42069 and se(b1) = 856. Since se(b1) = s / sqrt(Σ(xi - x̄)²), it follows that:

\[ \sum_i (x_i - \bar{x})^2 = \left(\frac{42069}{856}\right)^2 = 2415 \]

Hence:

\[ 20889 \pm 1.98 \times 42069 \sqrt{\frac{1}{117} + \frac{(10 - 6.44)^2}{2415}} = 20889 \pm 9783 \]

We are 95% confident that the mean price of 10 year old cars lies between $11,106 and $30,672. While the impact of age on price is precisely estimated, the CI is quite wide because of the large amount of unexplained variation that is indicated by the very low R² value reported. (Note: use of normal critical values here would be acceptable given the large sample size and would make little practical difference, as the critical value would be 1.96 rather than 1.98.)

Anzac Garage's pricing scheme based on the age of the car is not working out very well. When its second-hand cars are compared with cars of the same age from other dealers, prices often diverge. One of their consultants noted that the value of a second-hand car should depend on both the Odometer reading and the Age of the vehicle. This consultant wanted to estimate the following two simple linear regression models separately:

\[ price = \beta_0 + \beta_1\, age + u \]
\[ price = \alpha_0 + \alpha_1\, odometer + v \]

where Odometer = distance the car has travelled since leaving the factory in kilometers. A senior consultant advised use of a multiple linear regression model instead:

\[ price = \beta_0 + \beta_1\, age + \beta_2\, odometer + u \]

(f) Discuss why the simple linear regression methods may not be preferable to the multiple regression method, in general, and in the context of this problem. The resultant OLS estimates for the multiple regression model are given below.

The predictive performance of the model will improve as relevant variables are added to a simple regression model. Also, the assumption that the disturbance is uncorrelated with the explanatory variables is critical for the unbiased estimation of the coefficients of the included variables. In the simple price on age regression it will be violated if variables affecting price and correlated with age have been omitted from the model. This is likely to be the case here with the distance the car has traveled. We see that the R² has improved (approximately doubled) with the addition of odometer, and that the coefficient on age is now much smaller in magnitude and statistically insignificant.
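The arithmetic in parts (b), (d) and (e) of Question 1 can be checked with a short script. This is only a sketch: the variable and function names are illustrative, and the critical values (1.96, 1.645, 1.98) are taken as quoted in the solutions rather than computed from a distribution.

```python
import math

# Figures reported in the EXCEL output for the price-on-age regression.
b0, b1 = 47469.0, -2658.0   # intercept and Age coefficient
se_b1 = 856.0               # standard error of the Age coefficient
s = 42069.0                 # standard error of the regression
n = 117                     # number of observations
mean_age = 6.44             # sample mean of age, given in part (e)

# Critical values as quoted in the solutions (assumed, not computed here).
Z_TWO_SIDED = 1.96
Z_ONE_SIDED = 1.645
T_TWO_SIDED = 1.98


def slope_ci(b, se, crit):
    """Return the interval b +/- crit * se as a (lower, upper) pair."""
    half = crit * se
    return b - half, b + half


def mean_price_ci(age):
    """Return (fit, lower, upper) for the mean price at a given age.

    Uses sum((x - xbar)^2) = (s / se_b1)^2, which follows from
    se(b1) = s / sqrt(sum((x - xbar)^2)).
    """
    sxx = (s / se_b1) ** 2
    fit = b0 + b1 * age
    half = T_TWO_SIDED * s * math.sqrt(1.0 / n + (age - mean_age) ** 2 / sxx)
    return fit, fit - half, fit + half


ci_b = slope_ci(b1, se_b1, Z_TWO_SIDED)    # part (b)
t_stat = b1 / se_b1                        # part (d)
reject_one_sided = t_stat < -Z_ONE_SIDED   # reject H0 in favour of beta1 < 0?
fit10, lo10, hi10 = mean_price_ci(10)      # part (e)

print("95% CI for the Age coefficient:", ci_b)
print("t statistic:", t_stat, "reject H0:", reject_one_sided)
print("Mean price at age 10:", fit10, "interval:", (lo10, hi10))
```

Small differences from the figures in the solutions (for example in the last digit of the t statistic) come from rounding in the reported coefficients and standard errors.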
SUMMARY OUTPUT

Regression Statistics
R Square          0.150
Standard Error    40568
Observations      117

                 Coefficients   Standard Error   t Stat   P-value
Intercept        53867          6825             7.893    0.000
Odometer (km)    -0.270         0.087            -3.110   0.002
Age              -360           1108             -0.325   0.746

2. Computing Exercise #4

Refer to the computing program and answer part 3 on multiple regression. After estimating three import equations, the first two being simple linear regressions and the third being a multiple regression containing GNE and relative prices as explanatory variables, you were asked the following discussion question:

Are the coefficients β1 and β2 statistically different from zero at the 5% level? Of the three regression equations you estimated, which one provides a "better" explanation of the level of imports?

The p-values for β1 and β2 are both < 0.0005, and hence at all conventional significance levels one would reject the null hypotheses that these coefficients are individually equal to zero.

We could interpret "better" in a number of ways. In terms of fit, the third regression is best in terms of adjusted R²: 0.9713 compared to 0.9457 and 0.3167 in the two simple regression models. (Notice that the multiple regression model will always dominate the two simple regression models in terms of R² but may not in terms of adjusted R².) In addition, you could argue that the multiple regression model is better because it guards against the omitted variable bias that is likely in the two simple linear regression models.

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.9867
R Square             0.9736
Adjusted R Square    0.9713
Standard Error       3140.3680
Observations         26

             Coefficients   Standard Error   t Stat   P-value
Intercept    16101.329      10822.442        1.488    0.150
GNE          0.249          0.011            23.406   0.000
Price        38978.894      8255.354         4.722    0.000

3. SIA: Sydney housing prices.

Recall the housing price data for Sydney suburbs used in Question 6 in Week 3. Your statistically naïve friend has been doing some analysis of Sydney housing prices using these data and has asked you for help. In addition to the price data, a number of characteristics associated with each suburb have been collected, and these are likely to explain some of the large variation in housing prices across suburbs that is observed in the data. Your friend was very interested in the impact on housing prices of being located under the flight path. The regression of housing price on the flightpath variable (Model 1) provided a result that he did not expect. On your advice he ran a second regression (Model 2) that included several extra explanatory variables. Results for Model 1 and Model 2 are presented in the table, together with a full description of the variables used in the analysis.

Housing price is the mean of the median price of houses sold in each suburb for two quarters (September and December 2002), measured in thousands of dollars;
Distance to CBD is the distance of the suburb from Sydney's CBD, measured in kilometers;
Distance to Airport is the distance of the suburb from Sydney Airport, measured in kilometers;
Distance to beach is the distance of the suburb from the nearest beach, measured in kilometers;
Flightpath is a dummy variable that equals 1 if the suburb is under the flight path and 0 otherwise.

(a) How would you interpret the regression estimates for the parameters in Model 1, and explain why your friend found the result to be unexpected?

Because the estimate of β1 is positive, houses under the flightpath on average sell for more ($216,200 more) than houses not under the flightpath. This is surprising because you would expect the aircraft noise associated with being under the flightpath to be unattractive and hence lead to lower, not higher, prices.
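The interpretation in (a) rests on a general property of a regression on a single dummy variable: the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. The sketch below illustrates this with a hand-rolled OLS routine. The six prices are made up purely for illustration and are not the tutorial data, and the function name `ols_simple` is hypothetical.

```python
def ols_simple(x, y):
    """Return (intercept, slope) from an OLS regression of y on one regressor x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical prices in $000 for six made-up suburbs (NOT the tutorial data).
flightpath = [0, 0, 0, 0, 1, 1]
price = [500.0, 620.0, 580.0, 540.0, 760.0, 800.0]

b0, b1 = ols_simple(flightpath, price)

# Group means computed directly, for comparison with the OLS estimates.
off_path = [p for d, p in zip(flightpath, price) if d == 0]
on_path = [p for d, p in zip(flightpath, price) if d == 1]
mean_off = sum(off_path) / len(off_path)
mean_on = sum(on_path) / len(on_path)

print("intercept:", b0, "mean of suburbs not under the flightpath:", mean_off)
print("slope:", b1, "difference in group means:", mean_on - mean_off)
```

Read the same way, the Model 1 estimates in the results table imply an average price of 569.9 (in $000) for suburbs not under the flightpath and 569.9 + 216.2 = 786.1 for suburbs under it.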
(b) Explain why the results in Model 1 are unreliable as a basis for determining the impact on housing prices of being located under the flight path. Which of the assumptions associated with simple linear regression has clearly been violated in Model 1?

You would like to make a statement about the impact of being under the flightpath holding other factors constant. This is not possible with Model 1, as it is a simple linear regression and hence there is potential for omitted (confounding) variables that lead to biased estimates of the impact of being situated under the flightpath. For example, proximity to the beach is likely to affect housing prices and to be correlated with being under the flightpath. In Model 1, the variable Distance to beach is in the disturbance term and hence leads to a violation of the assumption that E(u|X) = 0.

(c) Write a brief description of the results for Flightpath in Model 2 in terms of the parameter estimate, its interpretation and its statistical significance.

The estimated parameter indicates a $51,500 premium (much smaller than in Model 1) for suburbs under the flightpath relative to those not under it, holding other factors constant.
For statistical significance: H0: βi = 0 versus H1: βi ≠ 0, where βi is the ith regression coefficient.
Because we have a large sample size we can invoke the CLT and use standard normal critical values when evaluating the test statistics given by bi/se(bi).
If we choose α = 0.05 then the decision rule will be to reject H0 if |bi/se(bi)| > 1.96.
The test statistic for flightpath (51.5/50.2 = 1.03) indicates that this parameter is not statistically different from zero.

(d) Interpret the overall fit of Model 2.

Model 2 produces an R² of 0.372: 37.2% of the variation in Sydney housing prices is explained by the explanatory variables in the regression.

(e) Use Model 2 to predict the average housing price for the suburb of Randwick, which is 5.21 kms from the CBD, 1.78 kms from the beach, 6.62 kms from the airport and is not deemed to be under the flight path.

Prediction = 853.5 + 51.5×0 - 21.5×5.21 + 21.0×6.62 - 13.9×1.78 = 855.763
The predicted average house price for Randwick is $855,763.

Multiple regression results for Sydney housing prices*
Dependent variable: Housing price

Explanatory variables    Model 1    Model 2
Intercept                569.9      853.5
                         (20.6)     (35.5)
Flightpath               216.2      51.5
                         (56.0)     (50.2)
Distance to CBD                     -21.5
                                    (3.4)
Distance to Airport                 21.0
                                    (2.9)
Distance to beach                   -13.9
                                    (2.3)
Observations             503        503
R squared                0.029      0.372

* Numbers in brackets below coefficient estimates are standard errors.
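The Model 2 significance tests in (c) and the Randwick prediction in (e) can be reproduced from the table with a few lines of code. This is a minimal sketch: the dictionary layout and function names are illustrative, and the 1.96 critical value is the standard normal value used in the solutions.

```python
# Model 2 estimates and standard errors from the results table (prices in $000).
model2 = {
    "Intercept": (853.5, 35.5),
    "Flightpath": (51.5, 50.2),
    "Distance to CBD": (-21.5, 3.4),
    "Distance to Airport": (21.0, 2.9),
    "Distance to beach": (-13.9, 2.3),
}
Z_CRIT = 1.96  # two-sided 5% standard normal critical value


def t_stats(model):
    """Return the ratio coefficient / standard error for every term."""
    return {name: b / se for name, (b, se) in model.items()}


def predict(model, values):
    """Return the intercept plus the sum of coefficient * value over `values`."""
    total = model["Intercept"][0]
    for name, x in values.items():
        total += model[name][0] * x
    return total


# Randwick's characteristics as given in part (e).
randwick = {
    "Flightpath": 0,
    "Distance to CBD": 5.21,
    "Distance to Airport": 6.62,
    "Distance to beach": 1.78,
}

tstats = t_stats(model2)
significant = {name: abs(t) > Z_CRIT for name, t in tstats.items()}
prediction = predict(model2, randwick)

for name, t in tstats.items():
    print(name, round(t, 2), "significant at 5%:", significant[name])
print("Predicted Randwick price ($000):", round(prediction, 3))
```

On these figures only the Flightpath coefficient fails the 5% test; the intercept and the three distance coefficients all have |t| well above 1.96.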