homework 10 key2 - Regression coefficients Estimate...


Regression coefficients

                          | Estimate | Standard Error | t              | Probability
Intercept                 | 10.45    | 3.193          | 3.27273422...  | 0.00205023...
Population of city        | 1.999    | 0.0600         | 33.31666667... | < 0.001
Size of store             | 0.253    | 0.3450         | 0.74732509...  | 0.45045540...
Amount spent on promotion | 0.250    | 0.3262         | 0.76640093...  | 0.4474392...
Distance to city center   | 1.609    | 0.2346         | 6.85848252...  | < 0.001

a) At a level of significance of 0.05, the result of the F test for this model is that the null hypothesis is rejected.

b) Calculate the 95% confidence interval for the slope ($\beta_1$) of the variable population of city. Give your answers to 3 decimal places.

c) Suppose you are going to construct a new model by removing the most insignificant variable. You would first remove:
population of city
size of store
amount spent on promotion
distance to city center

Feedback [2 out of 4]

a) You are correct.
b) This is not correct. $1.878 < \beta_1 < 2.120$
c) You are correct.

Discussion

a) Since the P-value of the F test statistic is less than 0.05, the F test null hypothesis that all the model coefficients are equal to zero is rejected, and you conclude that at least one of the parameters is non-zero.

b) A 95% confidence interval for the slope $\beta_1$ is given by the formula:

$b_1 \pm t^* \times SE_{b_1}$

where
n = sample size = 50
p = number of explanatory variables = 4
$b_1$ = estimate of the slope = 1.999
$t^*$ = two-tailed 95% critical value in the t distribution with $n - p - 1$ (= 45) degrees of freedom = 2.0141
$SE_{b_1}$ = standard error in $b_1$ = 0.0600
$\beta_1$ = population slope = unknown

Therefore the confidence interval is given by:

$1.999 \pm 2.0141 \times 0.0600$

which, when rounded to 3 decimal places, can be expressed as:

$1.878 < \beta_1 < 2.120$

c) In refining a multiple regression model you may want to remove any explanatory variables that do not significantly add to the predictive power of the model. You do this on the basis of the significance of the coefficient estimate of each variable. Any that are not statistically significant are candidates for this process. Supposing that you are following this procedure with this model, you want to identify the most insignificant of the insignificant variables. That is the variable whose regression coefficient has the highest P-value. From the regression coefficients output, you can see that this is the variable size of store.

1 of 1. ID: MSTMKCMJLDDBD

You are part of a team investigating motor vehicle accidents. A multiple regression model is to be constructed to predict the number of motor vehicle accidents in a town per year based upon the population of the town, the number of recorded traffic offences per year and the average annual temperature in the town. Data has been collected on 30 randomly selected towns:

[data table]

a) Find the multiple regression equation using all three explanatory variables. Assume that $x_1$ is population, $x_2$ is number of recorded traffic offences per year and $x_3$ is average annual temperature. Give your answers to 3 decimal places.
$\hat{y}$ = 1 + 2 population + 3 no. traffic offences + 4 average temp

b) At a level of significance of 0.05, the result of the F test for this model is that the null hypothesis is rejected.
c) The explanatory variable that is most correlated with number of motor vehicle accidents per year is:
population
number of traffic offences
average annual temperature

d) The explanatory variable that is least correlated with number of motor vehicle accidents per year is:
population
number of traffic offences
average annual temperature

e) The value of $R^2$ for this model, to 2 decimal places, is equal to 1

f) The value of s for this model, to 3 decimal places, is equal to 2

g) Construct a new multiple regression model by removing the variable average annual temperature. Give your answers to 3 decimal places. The new regression model equation is:
$\hat{y}$ = 3 + 3 population + 3 no. traffic offences

h) In the new model compared to the previous one, the value of $R^2$ (to 2 decimal places) is:
increased
decreased
unchanged

i) In the new model compared to the previous one, the value of s (to 3 decimal places) is:
increased
decreased
unchanged

Feedback [3 out of 14]

a) This is not correct. $\hat{y} = 11.617 + 1.789 \times \text{population} + 3.473 \times \text{no. traffic offences} + 0.093 \times \text{average temp}$
b) You are correct.
c) This is not correct. The explanatory variable that is most correlated with number of motor vehicle accidents per year is population.
d) You are correct.
e) You are correct.
f) This is not correct. The value of s for this model is equal to 25.076.
g) This is not correct. $\hat{y} = 19.081 + 1.789 \times \text{population} + 3.472 \times \text{no. traffic offences}$
h) This is not correct. In the new model compared to the previous one, the value of $R^2$ is unchanged.
i) This is not correct. In the new model compared to the previous one, the value of s is decreased.

Discussion

Entering the data into a suitable software package, you should obtain the following results:

Regression analysis

$R^2$ = 0.99493951...
s = 25.07565...

           | DF | SS                 | MS                    | F            | Probability
Regression | 3  | 3,214,268.866...   | 1,071,422.95542347... | 1,703.948... | < 0.001
Residual   | 26 | 16,348.50040161... | 628.788...            |              |
Total      | 29 | 3,230,617.3666...  |                       |              |

Regression coefficients

                                 | Estimate      | Standard Error | t              | Probability
Intercept                        | 11.6165899... | ≈ 29.8         | 0.38983202...  | ≈ 0.70
Population                       | 1.78873764... | 0.03112565...  | 57.46827765... | < 0.001
No. of recorded traffic offences | 3.4723972...  | 0.083556...    | 41.55770554... | < 0.001
Average annual temperature       | 0.09323003... | 1.60757764...  | 0.0579941...   | 0.95419575...

a) Therefore, the regression equation with all three variables is:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$
$= 11.6165899... + 1.78873764... \times x_1 + 3.4723972... \times x_2 + 0.09323003... \times x_3$
$= 11.617 + 1.789 \times x_1 + 3.473 \times x_2 + 0.093 \times x_3$ (rounded to 3 decimal places)

b) Since the P-value of the F test statistic is less than 0.05, the F test null hypothesis that all the model coefficients are equal to zero is rejected, and you conclude that at least one of the parameters is non-zero.

The correlations of each explanatory variable with the response variable are:

                        | Correlation with no. of motor vehicle accidents/year
Population              | 0.80265297...
No. of traffic offences | 0.59266568...
Average temp            | -0.09224906...

c) Therefore, the variable that has the highest correlation with number of motor vehicle accidents per year is population.

d) The variable that has the lowest correlation with number of motor vehicle accidents per year is average annual temperature.

e) From the regression model analysis, the value of $R^2$ for the model (to 2 decimal places) is equal to 0.99.

f) From the regression model analysis, the value of s for the model (to 3 decimal places) is equal to 25.076.

Performing a new regression analysis with the variable average annual temperature removed, you should have the following results:

Regression analysis

$R^2$ = 0.99493886...
s = 24.6085014...

           | DF | SS                    | MS                   | F              | Probability
Regression | 2  | 3,214,266.7514506...  | 1,607,133.3757253... | 2,653.88186... | < 0.001
Residual   | 27 | 16,350.61521606...    | 605.57834134...      |                |
Total      | 29 | 3,230,617.36666667... |                      |                |

Regression coefficients

                                 | Estimate
Intercept                        | 19.08069091...
Population                       | 1.78880285...
No. of recorded traffic offences | 3.47190367...

g) Therefore, the regression equation with the two variables population and number of recorded traffic offences is:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$
$= 19.08069091... + 1.78880285... \times x_1 + 3.47190367... \times x_2$
$= 19.081 + 1.789 \times x_1 + 3.472 \times x_2$ (rounded to 3 decimal places)

From the new regression analysis, you can see that compared to the previous model:

h) To 2 decimal places, the value of $R^2$ is unchanged.

i) To 3 decimal places, the value of s is decreased.

1 of 3. ID: usr.MR.-rn.oz.oozo

A company that manufactures paper develops a regression model in order to predict the number of sales it will make in a city (y) in terms of three variables: the population of that city ($x_1$), the number of companies in that city ($x_2$) and the amount of competition in the city ($x_3$). A sample of 60 cities is collected. For each city, the population of the city, the number of companies in the city and the amount of competition (in terms of net worth of competing companies) in the city are recorded. Also, the number of sales is recorded. The following regression equation was calculated:

$\hat{y} = 78.62 + 10.83x_1 + [\text{illegible}]\,x_2 - 15.92x_3$

Along with this, the following values were calculated:

SSM = 922.83
SSE = 201.55

An overall F test is to be conducted in order to assess the significance of this model. So the hypotheses are:

$H_0$: All regression coefficients are zero
$H_a$: Not all regression coefficients are zero

Give your answer to part a) to 4 decimal places. Give your answers to part b) as whole numbers.

a) Calculate the test statistic (F) for this test.
F = 0.218

b) This test statistic follows the F distribution with 57 degrees of freedom in the numerator and 1 degrees of freedom in the denominator.

Feedback [0 out of 3]

a) This is not correct. F = 85.4684
b) This is not correct.
This test statistic follows the F distribution with 3 degrees of freedom in the numerator and 56 degrees of freedom in the denominator.

Discussion

a) In order to calculate the test statistic, you must first calculate the regression mean square (MSR) and the error mean square (MSE). These values are calculated in terms of the regression sum of squares (SSM) and error sum of squares (SSE) respectively. Let p denote the number of independent variables in the model and n denote the sample size that helped develop the regression equation. So in this question, p = 3 and n = 60. Then:

$MSR = \dfrac{SSM}{p} = \dfrac{922.83}{3} = 307.61$

and

$MSE = \dfrac{SSE}{n - p - 1} = \dfrac{201.55}{60 - 3 - 1} = 3.59910714...$

The test statistic can be calculated using the following formula:

$F = \dfrac{MSR}{MSE}$

where
MSR = regression mean square = 307.61
MSE = error mean square = 3.59910714...
F = test statistic = unknown

$F = \dfrac{307.61}{3.59910714...} = 85.46841975... = 85.4684$ (rounded as last step)

b) This test statistic follows an F distribution. The numbers of degrees of freedom in the numerator and denominator of this distribution are calculated in terms of the number of independent variables in the model (p) and the size of the sample that is used (n).

The number of degrees of freedom in the numerator is p = 3.
The number of degrees of freedom in the denominator is $n - p - 1 = 60 - 3 - 1 = 56$.

ID: M57.MR.TM.oz.ooao

Muriel has constructed a multiple regression model and has conducted an F test of the model at a level of significance of 0.05. The model has four independent variables. The result of the F test was that the null hypothesis was rejected.
Select the appropriate conclusion that can be drawn:
the dependent variable is not related to any of the independent variables
exactly one of the regression coefficients is non-zero
all of the regression coefficients are zero
at least one of the regression coefficients is non-zero
all of the regression coefficients are non-zero

Feedback [1 out of 1]

You are correct.

Discussion

The significance of a multiple regression model can be tested by using an F test. The F test for a multiple regression model has the following null and alternate hypotheses:

$H_0$: $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_a$: at least one $\beta_i$ is not equal to zero

Therefore, if the null hypothesis is rejected, you conclude that at least one of the regression coefficients is non-zero. That is, the model is significant.

2 of 3. ID: nsr.nn.m.cs.dozob

The following four diagrams depict four residual plots for four different regression models. Select the residual plot that suggests that the assumption of independence of error terms is violated:

[Four residual plots; legible horizontal axis label: time]

Feedback [1 out of 1]

You are correct.

Discussion

Most of the assumptions in multiple regression are assumptions about the error terms. A multiple regression model with two independent variables can be written as:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

The $\varepsilon$ term represents a random variable, and there are several assumptions about that variable. It is assumed to be normally distributed with a mean of 0 and a constant variance. (In this way, for fixed values of $x_1$ and $x_2$, y is a random variable following the normal distribution with mean $\beta_0 + \beta_1 x_1 + \beta_2 x_2$ and a variance that does not depend on the values of $x_1$ and $x_2$.) Also, the error terms are assumed to be independent. That is, there should be no correlation between different error terms.
(This means that the value assumed by y for any given values of $x_1$ and $x_2$ is independent of the value assumed by y for other given values of $x_1$ and $x_2$.)

The assumption that the error terms are independent is tested by plotting the residuals in the order that they were gathered in. In this way, you are testing whether there is any correlation between the data points that are recorded close to one another. It may turn out that an event half-way through the data collection process affected the variables being studied.

For example, consider a regression model that is developed to calculate support for the government in terms of the age and the annual salary of a citizen. A sample of 50 people is collected to develop a regression equation. Now it might turn out that, at some point during the data collection, a political event occurs that lowers overall support for the government. The data is no longer independent: larger values for the dependent variable will tend to correlate with other larger values for the dependent variable (and similarly, smaller values for the dependent variable will tend to correlate with other smaller values of the dependent variable).

In the residual plot:

[Residual plot; horizontal axis: time]

the residuals seem to be positive for the data points collected early on in the study, while they tend to be negative for the data points collected later on in the study. There is therefore discernible correlation between the error terms, and they are not independent.

1 of 3. ID: Ms-r.Mn.‘rM.oa.uom

Consider the multiple regression model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

A sample is drawn and a prediction equation is calculated, as are the residuals. Select the method that is most commonly used to test each assumption of the model:

Plot residuals against values of independent variables
Plot residuals against time
Plot residuals in histogram
Plot residuals against predicted values for y

a) y has a linear relationship with each of $x_1$ and $x_2$.
b) The error terms are independent of one another.
c) The random variable $\varepsilon$ follows a normal distribution.
d) The variance of $\varepsilon$ is constant.

Feedback [0 out of 4]

a) This is not correct. To test this assumption you plot the residuals against values of the independent variables.
b) This is not correct. To test this assumption you plot the residuals against time.
c) This is not correct. To test this assumption you plot the residuals in a histogram.
d) This is not correct. To test this assumption you plot the residuals against predicted values for y.

Discussion

Assumptions about multiple regression models

In general terms, the assumptions about multiple regression models are about the nature of the dependent variable, y. In particular, if there are two independent variables the assumption is that for given values of $x_1$ and $x_2$, y is a normally distributed random variable with expected value $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ and a variance that does not depend upon the values of the independent variables. Also, the value assumed by y at one given set of values of the independent variables is independent of the value assumed by y at another set.

Most of these assumptions can be restated in terms of the error random variable, $\varepsilon$. That is why the tests for the validity of the assumptions are all related to the residuals in the sample.

Linearity

It is assumed that the dependent variable y varies linearly with each of the independent variables. The most common way of testing this is to plot the residuals against the values of each independent variable in the sample. The residuals should be centered about 0 with no overall pattern. Any pattern in the residuals would suggest a non-linear relationship. For example, suppose in plotting the residuals against the variable $x_1$, you find that the residuals tend to be positive for low and high values of $x_1$ but negative for middle values. This would suggest that y possibly has a quadratic relationship with $x_1$.
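The linearity check just described can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration and is not taken from the homework: the data are made up so that y is exactly quadratic in a single predictor, the helper names (`fit_line`, `residuals`, `sign_pattern`) are hypothetical, and a one-variable least-squares line stands in for the multiple regression fit to keep the example short.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(xs, ys):
    """Observed y minus the value predicted by the fitted line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sign_pattern(values):
    """Summarise residuals, ordered by x, as a string of '+' and '-'."""
    return "".join("+" if v > 0 else "-" for v in values)


# Illustrative data (assumed): y is quadratic in x, but a straight line is fitted.
xs = list(range(11))                 # x = 0, 1, ..., 10
ys = [(x - 5) ** 2 + 3 for x in xs]  # curved relationship

res = residuals(xs, ys)
print(sign_pattern(res))  # prints "++-------++"
```

The residuals are positive at the low and high ends of x and negative in the middle, which is the pattern the paragraph above associates with a possible quadratic relationship. In practice you would plot the residuals rather than reduce them to signs; the sign string is only a compact stand-in for the plot.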
Independence

It is assumed that the error terms are independent. The main way that this assumption would be violated is through the existence of some relationship between consecutive measurements in the sample. For this reason, to test independence of the error terms you plot the residuals in the order they were gathered. In other words, you plot residuals against time. Through this you can detect the presence of any autocorrelation in the data. Positive autocorrelation occurs when consecutive error terms have the same sign (positive or negative) more often than would be expected. Negative autocorrelation occurs when consecutive error terms tend to switch signs more often than would be expected.

Normality

It is assumed that $\varepsilon$ follows a normal distribution. To test this assumption, you treat the residuals as sample data points from this random variable and put them into a frequency distribution. There are then several options available to you. You can do a goodness-of-fit test, a normal probability plot, or a simple histogram in order to assess whether $\varepsilon$ is normally distributed.

Equality of variance

There is an assumption that the variance of y at different levels of the independent variables does not depend on the values those variables take. This assumption can be restated as: the variance of $\varepsilon$ is constant.
In simple linear regression, this assumption is tested by plotting the residuals against the values of the independent variable. However, in multiple regression it is most convenient to plot the residuals against $\hat{y}$. The residuals should be centered about 0 with no overall pattern. If, for example, there are extreme (positive or negative) values for the residuals for small values of $\hat{y}$ and low values for the residuals for large values of $\hat{y}$, this would suggest that $\varepsilon$ does not have constant variance.

3 of 3. ID: MST.MR.m.os.co1o

A multiple regression model has been developed in order to predict the number of work injuries that occur at a mechanic workshop in a month (y) based on the amount of money spent on maintaining government-standard safe machinery and equipment ($x_1$) and the number of customers that the workshop gets a month ($x_2$).

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

A sample of 75 workshops is...

  • Fall '13
  • ChristaLSorola
