Multiple Linear Regression Model

Consider the orthogonal transformation $z = P'y$ and its inverse

$$y = Pz = \sum_{i=1}^{p+1-l} c_i z_i + \sum_{i=p+2-l}^{p+1} c_i z_i + \sum_{i=p+2}^{n} c_i z_i = \hat{\mu}_A + (\hat{\mu} - \hat{\mu}_A) + (y - \hat{\mu}),$$

where $\hat{\mu}_A$ is the projection of $y$ on $L_A(X)$, and $\hat{\mu}$ is the projection of $y$ on $L(X)$. $\hat{\mu} - \hat{\mu}_A$ is in $L(X)$ and perpendicular to $L_A(X)$;

$$\|y - \hat{\mu}\|^2 = \sum_{i=p+2}^{n} z_i^2 = S(\hat{\beta}) \quad \text{and} \quad \|\hat{\mu} - \hat{\mu}_A\|^2 = \sum_{i=p+2-l}^{p+1} z_i^2.$$

Since $y \sim N(\mu, \sigma^2 I)$, it follows that $z = P'y \sim N(P'\mu, \sigma^2 I)$. Under the null hypothesis $A\beta = 0$, the mean vector is

$$P'\mu = \begin{bmatrix} P_1'\mu \\ P_2'\mu \\ P_3'\mu \end{bmatrix} = \begin{bmatrix} P_1'\mu \\ 0 \\ 0 \end{bmatrix}.$$

$P_2'\mu = 0$ since under the null hypothesis $\mu = \mu_A$ is in $L_A(X)$ and the columns of $P_2$ are perpendicular to $L_A(X)$. In addition, $P_3'\mu = 0$ since the columns of $P_3$ are perpendicular to $L(X)$. Hence,

i. $z_1, z_2, \ldots, z_n$ are independent normal random variables with variance $\sigma^2$.
ii. $z_{p+2-l}, \ldots, z_{p+1}$ have zero means under the null hypothesis $A\beta = 0$.
iii. $z_{p+2}, \ldots, z_n$ have zero means under the original model, even if the null hypothesis is false.

Thus,

i. $\|\hat{\mu} - \hat{\mu}_A\|^2 / \sigma^2 = \sum_{i=p+2-l}^{p+1} z_i^2 / \sigma^2$ is the sum of $l$ independent $\chi_1^2$ random variables. It has a $\chi_l^2$ distribution.
ii. $S(\hat{\beta})$ is a function of $z_{p+2}, \ldots, z_n$, whereas $\|\hat{\mu} - \hat{\mu}_A\|^2$ is a function of $z_{p+2-l}, \ldots, z_{p+1}$. Furthermore, $z_1, z_2, \ldots, z_n$ are independent. This shows that $S(\hat{\beta})$ and $\|\hat{\mu} - \hat{\mu}_A\|^2$ are independent.
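The decomposition used in this argument can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a simulated design matrix, and it assumes that the restricted space $L_A(X)$ is spanned by the first $p + 1 - l$ columns of $X$ (that is, the hypothesis sets the last $l$ coefficients to zero). The names `projection`, `X_A`, `extra_ss`, and `resid_ss` are illustrative only.

```python
import numpy as np


def projection(M):
    """Orthogonal projection matrix onto the column space of M."""
    Q, _ = np.linalg.qr(M)  # reduced QR: columns of Q span the columns of M
    return Q @ Q.T


rng = np.random.default_rng(0)
n, p, l = 12, 3, 2  # hypothetical sizes, chosen only for illustration

# Full design matrix with intercept; restricted model drops the last l columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X_A = X[:, : p + 1 - l]
y = rng.normal(size=n)

mu_hat = projection(X) @ y      # projection of y on L(X)
mu_hat_A = projection(X_A) @ y  # projection of y on L_A(X)

extra_ss = np.sum((mu_hat - mu_hat_A) ** 2)   # ||mu_hat - mu_hat_A||^2
resid_ss = np.sum((y - mu_hat) ** 2)          # S(beta_hat)
restricted_ss = np.sum((y - mu_hat_A) ** 2)   # residual SS of the restricted fit

# The two pieces are orthogonal, so their squared lengths add up.
cross = (mu_hat - mu_hat_A) @ (y - mu_hat)
print(restricted_ss, extra_ss + resid_ss, cross)
```

The cross product being numerically zero reflects the orthogonality of $\hat{\mu} - \hat{\mu}_A$ and $y - \hat{\mu}$; the independence result itself additionally relies on normality of $y$.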
EXERCISES

4.1. Consider the regression on time, $y_t = \beta_0 + \beta_1 t + \epsilon_t$, with $t = 1, 2, \ldots, n$. Here, the regressor vector is $x' = (1, 2, \ldots, n)$. Take $n = 10$. Write down the matrices $X'X$, $(X'X)^{-1}$, $V(\hat{\beta})$, and the variances of $\hat{\beta}_0$ and $\hat{\beta}_1$.

4.2. For the regression model $y_t = \beta_0 + \epsilon_t$ with $n = 2$ and $y' = (2, 4)$, draw the data in two-dimensional space. Identify the orthogonal projection of $y$ onto $L(X) = L(1)$. Explain geometrically $\hat{\beta}_0$, $\hat{\mu}$, and $e$.

4.3. Consider the regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, 2, 3$. With

$$x = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 2.2 \\ 3.9 \\ 3.1 \end{bmatrix},$$

draw the data in three-dimensional space and identify the orthogonal projection of $y$ onto $L(X) = L(1, x)$. Explain geometrically $\hat{\beta}$, $\hat{\mu}$, and $e$.
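As a numerical companion to the geometric picture in Exercise 4.3, the following sketch computes the projection with NumPy. It is only an illustration under the assumption that the design matrix has the columns $1$ and $x$; the variable names are not from the text.

```python
import numpy as np

# Data from Exercise 4.3.
x = np.array([1.0, 3.0, 2.0])
y = np.array([2.2, 3.9, 3.1])

# Design matrix whose columns span L(1, x).
X = np.column_stack([np.ones_like(x), x])

# Least squares coefficients, fitted vector (the projection), and residual.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_hat
e = y - mu_hat

print("beta_hat:", beta_hat)
print("mu_hat:  ", mu_hat)
print("e:       ", e)
print("X'e:     ", X.T @ e)  # should be numerically zero
```

The residual vector is perpendicular to both columns of `X`, which is the algebraic statement of the orthogonal projection that the drawing is meant to show.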
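4.4. Consider the regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, 2, 3$. With

$$x = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}, \qquad y = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix},$$

draw the data in three-dimensional space and identify the orthogonal projection of $y$ onto $L(X) = L(1, x)$. Explain geometrically $\hat{\beta}$, $\hat{\mu}$, and $e$.

4.5. After fitting the regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$ on 15 cases, it is found that the mean square error $s^2 = 3$ and

$$(X'X)^{-1} = \begin{bmatrix} 0.5 & 0.3 & 0.2 & 0.6 \\ 0.3 & 6.0 & 0.5 & 0.4 \\ 0.2 & 0.5 & 0.2 & 0.7 \\ 0.6 & 0.4 & 0.7 & 3.0 \end{bmatrix}.$$

Find
a. The estimate of $V(\hat{\beta}_1)$.
b. The estimate of $\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_3)$.
c. The estimate of $\mathrm{Corr}(\hat{\beta}_1, \hat{\beta}_3)$.
d. The estimate of $V(\hat{\beta}_1 - \hat{\beta}_3)$.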
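The quantities in Exercise 4.5 all come from the estimated covariance matrix $s^2 (X'X)^{-1}$. The sketch below shows one way to read them off; it assumes that the rows and columns of the given matrix are ordered as $\beta_0, \beta_1, \beta_2, \beta_3$, which the exercise does not state explicitly.

```python
import numpy as np

s2 = 3.0  # mean square error from the exercise
XtX_inv = np.array([
    [0.5, 0.3, 0.2, 0.6],
    [0.3, 6.0, 0.5, 0.4],
    [0.2, 0.5, 0.2, 0.7],
    [0.6, 0.4, 0.7, 3.0],
])

# Estimated covariance matrix of beta_hat.
# Assumption: index 0 is beta_0, index 1 is beta_1, and so on.
V = s2 * XtX_inv

var_b1 = V[1, 1]
cov_b1_b3 = V[1, 3]
corr_b1_b3 = cov_b1_b3 / np.sqrt(V[1, 1] * V[3, 3])
# Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b)
var_diff = V[1, 1] + V[3, 3] - 2.0 * V[1, 3]

print(var_b1, cov_b1_b3, corr_b1_b3, var_diff)
```

Note that the correlation does not depend on $s^2$, since the factor cancels between numerator and denominator.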
4.6. When fitting the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

to a set of $n = 15$ cases, we obtained the least squares estimates $\hat{\beta}_0 = 10$, $\hat{\beta}_1 = 12$, $\hat{\beta}_2 = 15$, and $s^2 = 2$. It is also known that

$$(X'X)^{-1} = \begin{bmatrix} 1 & 0.25 & 0.25 \\ 0.25 & 0.5 & -0.25 \\ 0.25 & -0.25 & 2 \end{bmatrix}.$$

a. Estimate $V(\hat{\beta}_2)$.
b. Test the hypothesis that $\beta_2 = 0$.
c. Estimate the covariance between $\hat{\beta}_1$ and $\hat{\beta}_2$.
d. Test the hypothesis that $\beta_1 = \beta_2$, using both the t ratio and the 95% confidence interval.
e. The corrected total sum of squares, SST = 120. Construct the ANOVA table and test the hypothesis that $\beta_1 = \beta_2 = 0$. Obtain the percentage of variation in $y$ that is explained by the model.

4.7. Consider a multiple regression model of the
price of houses ($y$) on three explanatory variables: taxes paid ($x_1$), number of bathrooms ($x_2$), and square feet ($x_3$). The incomplete (Minitab) output from a regression on $n = 28$ houses is given as follows:

The regression equation is
price = -10.7 + 0.190 taxes + 81.9 baths + 0.101 sqft

Predictor   Coef      SE Coef   t   p
Constant    -10.65    24.02
taxes       0.18966   0.05623
baths       81.87     47.82
sqft        0.10063   0.03125

Analysis of variance
Source           DF   SS       MS   F   p
Regression       3    504541
Residual Error
Total            27   541119

a. Calculate the coefficient of determination $R^2$.
b. Test the null hypothesis that all three regression coefficients are zero ($H_0$: $\beta_1 = \beta_2 = \beta_3 = 0$). Use significance level 0.05.
c. Obtain a 95% confidence interval of the regression coefficient for "taxes." Can you simplify the model by dropping "taxes"?
d. Obtain a 95% confidence interval of the regression coefficient for "baths." Can you simplify the model by dropping "baths"?

4.8. Continuation of Exercise 4.7. The incomplete
(Minitab) output from a multiple regression of the price of houses on the two explanatory variables, taxes paid and square feet, is given as follows:

The regression equation is
price = 4.9 + 0.242 taxes + 0.134 sqft

Predictor   Coef      SE Coef   t   p
Constant    4.89      23.08
taxes       0.24237   0.04884
sqft        0.13397   0.02537

Analysis of variance
Source           DF   SS       MS       F   p
Regression       2    500074   250037
Residual Error
Total                 541119

a. Calculate the coefficient of determination $R^2$.
b. Test the null hypothesis that both regression coefficients are zero ($H_0$: $\beta_1 = \beta_2 = 0$). Use significance level 0.05.
c. Test whether you can omit the variable "taxes" from the regression model. Use significance level 0.05.
d. Comment on the fact that the regression coefficients for taxes and square feet are different from those shown in Exercise 4.7.

4.9. Fitting the regression $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$ on $n = 30$ cases leads to the following results:

$$X'X = \begin{bmatrix} 30 & 2{,}108 & 5{,}414 \\ 2{,}108 & 152{,}422 & 376{,}562 \\ 5{,}414 & 376{,}562 & 1{,}015{,}780 \end{bmatrix}, \qquad X'y = \begin{bmatrix} 5{,}263 \\ 346{,}867 \\ 921{,}939 \end{bmatrix}, \qquad y'y = 1{,}148{,}317$$

a. Use computer software to find $(X'X)^{-1}$. Obtain the least squares estimates and their standard errors.
b. Compute the t statistics to test the simple hypotheses that each regression coefficient is zero.
c. Determine the coefficient of determination $R^2$.
(The complete data are given in the file abrasion.)

4.10. The following matrices were computed for a
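certain regression problem:

$$X'X = \begin{bmatrix} 15 & 3{,}626 & 44{,}428 \\ 3{,}626 & 1{,}067{,}614 & 11{,}419{,}181 \\ 44{,}428 & 11{,}419{,}181 & 139{,}063{,}428 \end{bmatrix}, \qquad X'y = \begin{bmatrix} 2{,}259 \\ 647{,}107 \\ 7{,}096{,}619 \end{bmatrix}$$

$$(X'X)^{-1} = \begin{bmatrix} 1.2463484 & 2.1296642 \times 10^{-4} & -4.1567125 \times 10^{-4} \\ & 7.7329030 \times 10^{-6} & -7.0302518 \times 10^{-7} \\ & & 1.9771851 \times 10^{-7} \end{bmatrix}$$

$$\hat{\beta} = \begin{bmatrix} 3.452613 \\ 0.496005 \\ 0.009191 \end{bmatrix}, \qquad y'y = 394{,}107$$

a. Write down the estimated regression equation. Obtain the standard errors of the regression coefficients.
b. Compute the t statistics to test the simple hypotheses that each regression coefficient is equal to zero. Carry out these tests. State your conclusions.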
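For exercises such as 4.9 and 4.10, where only the summary matrices are given, the estimates, standard errors, and t ratios can be obtained directly from $X'X$, $X'y$, and $y'y$. The following sketch uses the matrices of Exercise 4.10. It assumes that the first column of $X$ is the intercept (so that $n = 15$ is the top-left entry of $X'X$) and that the model has three coefficients; the variable names are illustrative.

```python
import numpy as np

# Summary matrices from Exercise 4.10.
XtX = np.array([
    [15.0, 3626.0, 44428.0],
    [3626.0, 1067614.0, 11419181.0],
    [44428.0, 11419181.0, 139063428.0],
])
Xty = np.array([2259.0, 647107.0, 7096619.0])
yty = 394107.0

n = int(XtX[0, 0])  # assumption: intercept column, so X'X[0, 0] = n
k = XtX.shape[0]    # number of estimated coefficients

# Least squares estimates: solve (X'X) beta = X'y.
beta_hat = np.linalg.solve(XtX, Xty)

# Residual sum of squares and mean square error.
sse = yty - beta_hat @ Xty
s2 = sse / (n - k)

# Standard errors and t ratios for H0: beta_j = 0.
se = np.sqrt(s2 * np.diag(np.linalg.inv(XtX)))
t_ratios = beta_hat / se

print("beta_hat:", beta_hat)
print("s2:      ", s2)
print("se:      ", se)
print("t ratios:", t_ratios)
```

Each t ratio would be compared with a t distribution on $n - k$ degrees of freedom. Because $X'X$ has entries of very different magnitudes, solving the normal equations is numerically delicate; with the full data, a QR-based routine such as `np.linalg.lstsq` would usually be preferable.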
4.11. A study was conducted to investigate the determinants of survival size of nonprofit U.S. hospitals. Survival size, $y$, was defined to be the largest U.S. hospital (in terms of the number of beds) exhibiting growth in market share. For the investigation, 10 states were selected at random, and the survival size for nonprofit hospitals in each of the selected states was determined for two time periods $t$: 1981-1982 and 1984-1985. Furthermore, the following characteristics were collected on each selected state for each of the two time periods:

$x_1$ = Percentage of beds that are in for-profit hospitals.
$x_2$ = Number of people enrolled in health maintenance organizations as a fraction of the number of people covered by hospital insurance.
$x_3$ = State population in thousands.
$x_4$ = Percentage of state that is urban.

The data are given in the file hospital.

a. Fit the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$.
b. The influence of the percentage of beds in for-profit hospitals was of particular interest to the investigators. What does the analysis tell us?
c. What further investigation might you do with this data set? Give reasons.
d. Rather than selecting 10 states at random, how else might you collect the data on survival size? Would your approach be an improvement over the random selection?

4.12. The amount of water used by the production
facilities of a plant varies. Observations on water usage and other, possibly related, variables were collected for 17 months. The data are given in the file water. The explanatory variables are

TEMP = average monthly temperature (°F)
PROD = amount of production
DAYS = number of operating days in the month
PAYR = number of people on the monthly plant payroll
HOUR = number of hours shut down for maintenance

The response variable is USAGE = monthly water usage (gallons/100).

a. Fit the model containing all five independent variables,
$y = \beta_0 + \beta_1 \mathrm{TEMP} + \beta_2 \mathrm{PROD} + \beta_3 \mathrm{DAYS} + \beta_4 \mathrm{PAYR} + \beta_5 \mathrm{HOUR} + \epsilon$.
Plot residuals against fitted values and residuals against the case index, and comment about model adequacy.
b. Test the hypothesis that $\beta_1 = \beta_3 = \beta_5 = 0$.
c. Which model or set of models would you suggest for predictive purposes? Briefly justify.
d. Which independent variable seems to be the most important one in determining the amount of water used?
e. Write a nontechnical paragraph that
summarizes your conclusions about plant
water usage that is supported by the data.

4.13. Data on last year's sales ($y$, in 100,000s of dollars) in 15 sales districts are given in the file sales. This file also contains promotional expenditures ($x_1$, in thousands of dollars), the number of active accounts ($x_2$), the number of competing brands ($x_3$), and the district potential ($x_4$, coded) for each of the districts.

a. A model with all four regressors is proposed:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$, $\epsilon \sim N(0, \sigma^2)$.
Interpret the parameters $\beta_0$, $\beta_1$, and $\beta_4$.
b. Fit the proposed model in (a) and calculate estimates of $\beta_i$, $i = 0, 1, \ldots, 4$, and $\sigma^2$.
c. Test the following hypotheses: (i) $\beta_4 = 0$; (ii) $\beta_3 = \beta_4 = 0$; (iii) $\beta_2 = \beta_3$; (iv) $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$.
d. Consider the reduced (restricted) model with $\beta_4 = 0$. Estimate its coefficients and give an expression for the expected sales.
e. Using the model in (d), obtain a prediction for the sales in a district where $x_1 = 3.0$, $x_2 = 45$, and $x_3 = 10$. Obtain the corresponding 95% prediction interval.

4.14. The survival rate (in percentage) of bull semen after storage is measured at various combinations of concentrations of three materials (additives) that are thought to increase the chance of survival. The data listed below are given in the file bsemen.

% Survival   % Weight 1   % Weight 2   % Weight 3
(y)          (x1)         (x2)         (x3)
25.5         1.74         5.30         10.80
31.2         6.32         5.42         9.40
25.9         6.22         8.41         7.20
38.4         10.52        4.63         8.50
18.4         1.19         11.60        9.40
26.7         1.22         5.85         9.90
26.4         4.10         6.62         8.00
25.9         6.32         8.72         9.10
32.0         4.08         4.42         8.70
25.2         4.15         7.60         9.20
39.7         10.15        4.83         9.40
35.9         1.72         3.12         7.60
26.5         1.70         5.30         8.20

Assume the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$.

a. Compute $X'X$, $(X'X)^{-1}$, and $X'y$.
b. Plot the response $y$ versus each predictor variable. Comment on these plots.
c. Obtain the least squares estimates of $\beta$ and give the fitted equation.
d. Construct a 90% confidence interval for
   i. the predicted mean value of $y$ when $x_1 = 3$, $x_2 = 8$, and $x_3 = 9$;
   ii. the predicted individual value of $y$ when $x_1 = 3$, $x_2 = 8$, and $x_3 = 9$.
e. Construct the ANOVA table and test for a significant linear relationship between $y$ and the three predictor variables.

4.15. An experiment was conducted to study the
toxic action of a certain chemical on silkworm larvae. The relationship of $\log_{10}$(survival time) to $\log_{10}$(dose) and $\log_{10}$(larvae weight) was investigated. The data, obtained by feeding each larva a precisely measured dose of the chemical in an aqueous solution and recording the survival time until death, are given in the following table. The data are stored in the file silkw.

log10 Survival Time (y)   log10 Dose (x1)   log10 Weight (x2)
2.836                     0.150             0.425
2.966                     0.214             0.439
2.687                     0.487             0.301
2.679                     0.509             0.325
2.827                     0.570             0.371
2.442                     0.590             0.093
2.421                     0.640             0.140
2.602                     0.781             0.406
2.556                     0.739             0.364
2.441                     0.832             0.156
2.420                     0.865             0.247
2.439                     0.904             0.278
2.385                     0.942             0.141
2.452                     1.090             0.289
2.351                     1.194             0.193

Assume the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$.

a. Plot the response $y$ versus each predictor variable. Comment on these plots.
b. Obtain the least squares estimates for $\beta$ and give the fitted equation.
c. Construct the ANOVA table and test for a significant linear relationship between $y$ and the two predictor variables.
d. Which independent variable do you consider to be the better predictor of log(survival time)? What are your reasons?
e. Of the models involving one or both of the independent variables, which do you prefer, and why?

4.16. You are given the following matrices
computed for a regression analysis:

$$X'X = \begin{bmatrix} 9 & 136 & 269 & 260 \\ 136 & 2{,}114 & 4{,}176 & 3{,}583 \\ 269 & 4{,}176 & 8{,}257 & 7{,}104 \\ 260 & 3{,}583 & 7{,}104 & 12{,}276 \end{bmatrix}, \qquad X'y = \begin{bmatrix} 45 \\ 648 \\ 1{,}283 \\ 1{,}821 \end{bmatrix}$$

$$(X'X)^{-1} = \begin{bmatrix} 9.610 & 0.008 & -0.279 & -0.044 \\ 0.008 & 0.509 & -0.258 & 0.001 \\ -0.279 & -0.258 & 0.139 & 0.001 \\ -0.044 & 0.001 & 0.001 & 0.0003 \end{bmatrix}$$

EXERCISES

5.1. Consider the following regression model:

Salary (in \$1,000) = 20 + 2x + 5z + 0.7xz

where x is the number of years of experience, and z is an indicator variable that is 1 if you have obtained an MBA degree and 0 otherwise; xz is the product between years of experience and the indicator variable z. Graph salary (y) against years of experience (x). Do this for both groups (without MBA and with MBA) on the same graph, and comment on the degree of interaction.

5.2. You are interested in the starting salaries of accounting, management information
systems, and economics majors. You consider a model that factors in the GPA of students, obtaining the following regression model:

Salary (in \$1,000) = -15 + (18)GPA + (3)IND_acc + (2.1)IND_mis

IND_acc is an indicator variable that is 1 if the student is in accounting and 0 otherwise. IND_mis is an indicator variable that is 1 if the student is an MIS student and 0 otherwise.

a. Calculate the expected salary difference between an accounting and an economics student with the same GPA.
b. Calculate the expected salary difference between an accounting and an MIS student with the same GPA.

5.3. The data are taken from Mazess, R. B., Peppler, W. W., and Gibbons, M. Total body composition by dual-photon (153Gd)
absorptiometry. American Journal of Clinical Nutrition, 40, 834-839, 1983. The data are given in the file bodyfat. A new method of measuring the body fat percentage is investigated. The body fat, age (between 23 and 61 years), and gender (4 males and 14 females) of 18 normal adults are listed below. Graph body fat against age and gender (you may want to overlay these two on the same graph). Consider a regression model with age and gender as the explanatory variables. Interpret the results, and discuss the effects of age and gender. Is it useful to include an interaction term for age and gender?

y = % Fat   x1 = Age   x2 = Gender
9.5         23         1
27.9        23         0
7.8         27         1
17.8        27         1
31.4        39         0
25.9        41         0
27.4        45         1
25.2        49         0
31.1        50         0
34.7        53         0
42.0        53         0
29.1        54         0
32.5        56         0
30.3        57         0
33.0        58         0
33.8        58         0
41.1        60         0
34.5        61         0

5.4. You are regressing fuel efficiency (y) on three predictor variables, x1, x2, and x3, and
you obtain the following fitted regression model:

$$\hat{\mu} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$$

The coefficient of determination for this regression model is $R^2 = 90\%$.
A regression of $x_1$ on $x_2$, $x_3$ gives you an $R^2$ of 60%;
A regression of $x_2$ on $x_1$, $x_3$ gives you an $R^2$ of 80%; and
A regression of $x_3$ on $x_1$, $x_2$ gives you an $R^2$ of 90%.
Calculate and interpret the variance inflation
factors for the regression coefficients $\beta_1$, $\beta_2$, and $\beta_3$.

Specification Issues in Regression Models

5.5. Which one of the following statements suggests the presence of a multicollinearity problem:
a. High $R^2$ and high t ratios
b. High correlation between explanatory variables and dependent variable
c. Low pairwise correlation among independent variables
d. Low $R^2$ and low t ratios
e. High $R^2$ and mostly insignificant t ratios

5.6. The data are taken from Latter, H. O. The cuckoo's egg. Biometrika, 1, 164-176, 1901. The data are given in the file cuckoo. The female cuckoo lays her eggs into the nest of foster parents. The foster parents are usually deceived, probably because of the similarity in the sizes of the eggs. Latter investigated this possible explanation and measured the lengths of cuckoo eggs (in millimeters) that were found in the nests of the following three species:

Hedge Sparrow: 22.0 23.9 20.9 23.8 25.0 24.0 21.7 23.8 22.8 23.1 23.1 23.5 23.0 23.0
Robin: 21.8 23.0 23.3 22.4 23.0 23.0 23.0 22.4 23.9 22.3 22.0 22.6 22.0 22.1 21.1 23.0
Wren: 19.8 22.1 21.5 20.9 22.0 21.0 22.3 21.0 20.3 20.9 22.0 20.0 20.8 21.2 21.0

Obtain the analysis of variance table and test whether or not the mean lengths of the eggs found in the nests of the three species are different. Display the data graphically, and interpret the results.

5.7. Percentage yields from a chemical reaction for changing temperature (factor 1), reaction time (factor 2), and concentration of a certain ingredient (factor 3) are as follows:

Factor 1: x1   Factor 2: x2   Factor 3: x3   Average yield from 5 experiments
-1             -1             -1             79.7
1              -1             -1             74.3
-1             1              -1             76.7
1              1              -1             70.0
-1             -1             1              [illegible]
1              -1             1              [illegible]
-1             1              1              87.3
1              1              1              73.7

Each listed yield is actually the average of five individual independent experiments. The variance of individual measurements can be estimated from the five replications in each cell. It is found that

$$s^2 = \frac{\sum_{i=1}^{8} \sum_{j=1}^{5} (y_{ij} - \bar{y}_i)^2}{8(5 - 1)} = 40.0$$

a. Estimate the effects of factors 1-3. That is, estimate the coefficients in the regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$. Calculate the standard errors of the coefficients and interpret the results. Comment on the nature of the design matrix.
b. Is it possible to learn something about interactions? Consider the interaction effect between factors 1 and 2. Write out the $X$ matrix of the regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \epsilon$. Estimate the model and comment on this issue.

5.8. In a study on the effect of coffee consumption on blood pressure, 30 patients are selected at random from among the patients of a medical practice. A questionnaire is administered to each patient to get the following information:

$x_1$ = Average number of cups of coffee consumed/day ...

- Fall '15
- Linear Regression