Boundless Statistics: Correlation and Regression

Multiple Regression

Multiple Regression Models

Multiple regression is used to find an equation that best predicts the Y variable as a linear function of the multiple X variables.

LEARNING OBJECTIVES

Describe how multiple regression can be used to predict an unknown Y value based on a corresponding set of X values or understand functional relationships between the dependent and independent variables.

KEY TAKEAWAYS

Key Points

One use of multiple regression is prediction or estimation of an unknown Y value corresponding to a set of X values.
A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable.
The main null hypothesis of a multiple regression is that there is no relationship between the X variables and the Y variable; i.e., that the fit of the observed Y values to those predicted by the multiple regression equation is no better than what you would expect by chance.

Key Terms

multiple regression: regression model used to find an equation that best predicts the Y variable as a linear function of multiple X variables
null hypothesis: A hypothesis set up to be refuted in order to support an alternative hypothesis; presumed true until statistical evidence in the form of a hypothesis test indicates otherwise.

When To Use Multiple Regression

You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent (Y) variable. The rest of the variables are the independent (X) variables. The purpose of a multiple regression is to find an equation that best predicts the Y variable as a linear function of the X variables.

Multiple Regression For Prediction

One use of multiple regression is prediction or estimation of an unknown Y value corresponding to a set of X values. For example, let's say you're interested in finding a suitable habitat to reintroduce the rare beach tiger beetle, Cicindela dorsalis dorsalis, which lives on sandy beaches on the Atlantic coast of North America. You've gone to a number of beaches that already have the beetles and measured the density of tiger beetles (the dependent variable) and several biotic and abiotic factors, such as wave exposure, sand particle size, beach steepness, density of amphipods and other prey organisms, etc. Multiple regression would give you an equation that would relate the tiger beetle density to a function of all the other variables. Then, if you went to a beach that didn't have tiger beetles and measured all the independent variables (wave exposure, sand particle size, etc.), you could use the multiple regression equation to predict the density of tiger beetles that could live there if you introduced them.
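The sketch below illustrates this prediction workflow in Python with the statsmodels library. The beach measurements, column names, and the new-beach values are all invented for illustration; they are not data from the study described above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical measurements from beaches that already have tiger beetles.
beaches = pd.DataFrame({
    "beetle_density": [12.0, 30.5, 8.2, 25.1, 18.7, 5.3, 22.4, 15.9],
    "wave_exposure":  [0.4, 0.9, 0.2, 0.8, 0.6, 0.1, 0.7, 0.5],
    "sand_size_mm":   [0.30, 0.55, 0.22, 0.48, 0.40, 0.18, 0.46, 0.35],
    "steepness_deg":  [3.1, 6.2, 2.5, 5.0, 4.4, 2.0, 5.3, 3.8],
    "prey_density":   [40, 85, 25, 70, 55, 15, 65, 50],
})

# Fit beetle density as a linear function of the X variables.
fit = smf.ols(
    "beetle_density ~ wave_exposure + sand_size_mm + steepness_deg + prey_density",
    data=beaches,
).fit()
print(fit.params)  # intercept and partial regression coefficients

# Predict the density expected at a beach that currently has no beetles,
# using its measured independent variables.
new_beach = pd.DataFrame({
    "wave_exposure": [0.7],
    "sand_size_mm": [0.45],
    "steepness_deg": [4.8],
    "prey_density": [60],
})
print(fit.predict(new_beach))
```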
Multiple Regression For Understanding Causes

A second use of multiple regression is to try to understand the functional relationships between the dependent and independent variables, to try to see what might be causing the variation in the dependent variable.

For example, if you did a regression of tiger beetle density on sand particle size by itself, you would probably see a significant relationship. If you did a regression of tiger beetle density on wave exposure by itself, you would probably see a significant relationship. However, sand particle size and wave exposure are correlated; beaches with bigger waves tend to have bigger sand particles. Maybe sand particle size is really important, and the correlation between it and wave exposure is the only reason for a significant regression between wave exposure and beetle density. Multiple regression is a statistical way to try to control for this; it can answer questions like, "If sand particle size (and every other measured variable) were the same, would the regression of beetle density on wave exposure be significant?"

Atlantic Beach Tiger Beetle: This is the Atlantic beach tiger beetle (Cicindela dorsalis dorsalis), which is the subject of the multiple regression study in this atom.

Null Hypothesis

The main null hypothesis of a multiple regression is that there is no relationship between the X variables and the Y variable; in other words, that the fit of the observed Y values to those predicted by the multiple regression equation is no better than what you would expect by chance.

As you are doing a multiple regression, there is also a null hypothesis for each X variable, meaning that adding that X variable to the multiple regression does not improve the fit of the multiple regression equation any more than expected by chance.
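Both kinds of null hypothesis are usually examined directly in the regression output. The sketch below, using simulated data and invented variable names, shows where each one appears in a typical Python fit: the overall F-test addresses the main null hypothesis, and the p-value attached to each coefficient addresses the null hypothesis for that X variable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
# Simulate a Y that really depends on x1 and x2 but not on x3.
df["y"] = 2.0 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=1.0, size=n)

fit = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Main null hypothesis: no relationship between the X variables and Y.
print("overall F-test p-value:", fit.f_pvalue)

# One null hypothesis per X variable: adding that variable does not
# improve the fit any more than expected by chance.
print(fit.pvalues)
```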
Estimating and Making Inferences About the Slope

The purpose of a multiple regression is to find an equation that best predicts the Y variable as a linear function of the X variables.

LEARNING OBJECTIVES

Discuss how partial regression coefficients (slopes) allow us to predict the value of Y given measured X values.

KEY TAKEAWAYS

Key Points

Partial regression coefficients (the slopes) and the intercept are found when creating an equation of regression so that they minimize the squared deviations between the expected and observed values of Y.
If you had the partial regression coefficients and measured the X variables, you could plug them into the equation and predict the corresponding value of Y.
The standard partial regression coefficient is the number of standard deviations that Y would change for every one standard deviation change in X₁, if all the other X variables could be kept constant.

Key Terms

standard partial regression coefficient: the number of standard deviations that Y would change for every one standard deviation change in X₁, if all the other X variables could be kept constant
partial regression coefficient: a value indicating the effect of each independent variable on the dependent variable with the influence of all the remaining variables held constant. Each coefficient is the slope between the dependent variable and each of the independent variables
p-value: The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

You use multiple regression when you have three or more measurement variables. One of the measurement variables is the dependent (Y) variable. The rest of the variables are the independent (X) variables. The purpose of a multiple regression is to find an equation that best predicts the Y variable as a linear function of the X variables.

How It Works

The basic idea is that an equation is found like this:

Yexp = a + b₁X₁ + b₂X₂ + b₃X₃ + ⋯

Yexp is the expected value of Y for a given set of X values. b₁ is the estimated slope of a regression of Y on X₁, if all of the other X variables could be kept constant. This concept applies similarly for b₂, b₃, et cetera. a is the intercept. The values of b₁, b₂, et cetera (the "partial regression coefficients") and the intercept are found so that they minimize the squared deviations between the expected and observed values of Y.

How well the equation fits the data is expressed by R², the "coefficient of multiple determination." This can range from 0 (for no relationship between the X and Y variables) to 1 (for a perfect fit, i.e., no difference between the observed and expected Y values). The p-value is a function of R², the number of observations, and the number of X variables.

Importance of Slope (Partial Regression Coefficients)

When the purpose of multiple regression is prediction, the important result is an equation containing partial regression coefficients (slopes). If you had the partial regression coefficients and measured the X variables, you could plug them into the equation and predict the corresponding value of Y. The magnitude of a partial regression coefficient depends on the unit used for each variable, so it does not tell you anything about the relative importance of each variable.

When the purpose of multiple regression is understanding functional relationships, the important result is an equation containing standard partial regression coefficients, like this:

y′exp = a + b′₁x′₁ + b′₂x′₂ + b′₃x′₃ + ⋯

where b′₁ is the standard partial regression coefficient of y on X₁. It is the number of standard deviations that Y would change for every one standard deviation change in X₁, if all the other X variables could be kept constant. The magnitude of the standard partial regression coefficients tells you something about the relative importance of different variables; X variables with bigger standard partial regression coefficients have a stronger relationship with the Y variable.

Linear Regression: A graphical representation of a best fit line for simple linear regression.
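A minimal sketch of both kinds of coefficient, using simulated data and invented variable names: the slopes are estimated by least squares, and each standard partial regression coefficient is then obtained by rescaling the corresponding slope by the ratio of the standard deviation of that X variable to the standard deviation of Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Simulated X variables measured on very different scales.
x1 = rng.normal(50, 10, n)     # e.g. recorded in large units
x2 = rng.normal(0.5, 0.1, n)   # e.g. recorded in small units
y = 3.0 + 0.2 * x1 + 40.0 * x2 + rng.normal(0, 2.0, n)

# Least-squares fit of Yexp = a + b1*x1 + b2*x2.
X = np.column_stack([np.ones(n), x1, x2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print("partial regression coefficients:", b1, b2)

# Standard partial regression coefficients: rescale each slope by
# sd(x_i) / sd(y). These are unit-free, so their magnitudes can be
# compared across variables.
b1_std = b1 * x1.std(ddof=1) / y.std(ddof=1)
b2_std = b2 * x2.std(ddof=1) / y.std(ddof=1)
print("standard partial regression coefficients:", b1_std, b2_std)
```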
Evaluating Model Utility

The results of multiple regression should be viewed with caution.

LEARNING OBJECTIVES

Evaluate the potential drawbacks of multiple regression.

KEY TAKEAWAYS

Key Points

You should examine the linear regression of the dependent variable on each independent variable, one at a time, examine the linear regressions between each pair of independent variables, and consider what you know about the subject matter.
You should probably treat multiple regression as a way of suggesting patterns in your data, rather than rigorous hypothesis testing.
If independent variables A and B are both correlated with Y, and A and B are highly correlated with each other, only one may contribute significantly to the model, but it would be incorrect to blindly conclude that the variable that was dropped from the model has no significance.

Key Terms

independent variable: in an equation, any variable whose value is not dependent on any other in the equation
dependent variable: in an equation, the variable whose value depends on one or more variables in the equation
multiple regression: regression model used to find an equation that best predicts the Y variable as a linear function of multiple X variables

Multiple regression is beneficial in some respects, since it can show the relationships between more than just two variables; however, it should not always be taken at face value. It is easy to throw a big data set at a multiple regression and get an impressive-looking output. But many people are skeptical of the usefulness of multiple regression, especially for variable selection, and you should view the results with caution.

You should examine the linear regression of the dependent variable on each independent variable, one at a time, examine the linear regressions between each pair of independent variables, and consider what you know about the subject matter. You should probably treat multiple regression as a way of suggesting patterns in your data, rather than rigorous hypothesis testing.

If independent variables A and B are both correlated with Y, and A and B are highly correlated with each other, only one may contribute significantly to the model, but it would be incorrect to blindly conclude that the variable that was dropped from the model has no biological importance. For example, let's say you did a multiple regression on vertical leap in children five to twelve years old, with height, weight, age, and score on a reading test as independent variables. All four independent variables are highly correlated in children, since older children are taller, heavier, and more literate, so it's possible that once you've added weight and age to the model, there is so little variation left that the effect of height is not significant. It would be biologically silly to conclude that height had no influence on vertical leap. Because reading ability is correlated with age, it's possible that it would contribute significantly to the model; this might suggest some interesting follow-up experiments on children all of the same age, but it would be unwise to conclude that there was a real effect of reading ability on vertical leap based solely on the multiple regression.

Linear Regression: Random data points and their linear regression.
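One way to act on this advice is sketched below with simulated data and invented names: regress Y on each X by itself, inspect the correlations among the X variables, and only then look at the multiple regression, keeping in mind that a correlated predictor that drops out is not necessarily unimportant.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 80
a = rng.normal(size=n)
b = a + rng.normal(scale=0.2, size=n)    # A and B are highly correlated
y = 1.0 + 2.0 * a + rng.normal(size=n)   # Y is actually driven by A
df = pd.DataFrame({"y": y, "a": a, "b": b})

# Simple regression of Y on each independent variable, one at a time.
for x in ["a", "b"]:
    single = smf.ols(f"y ~ {x}", data=df).fit()
    print(x, "alone: p =", single.pvalues[x])

# Correlation between the independent variables.
print(df[["a", "b"]].corr())

# In the multiple regression, only one of the correlated predictors may
# look "significant"; that alone does not prove the other is unimportant.
both = smf.ols("y ~ a + b", data=df).fit()
print(both.pvalues)
```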
Using the Model for Estimation and Prediction

Standard multiple regression involves several independent variables predicting the dependent variable.

LEARNING OBJECTIVES

Analyze the predictive value of multiple regression in terms of the overall model and how well each independent variable predicts the dependent variable.

KEY TAKEAWAYS

Key Points

In addition to telling us the predictive value of the overall model, standard multiple regression tells us how well each independent variable predicts the dependent variable, controlling for each of the other independent variables.
Significance levels of 0.05 or lower are typically considered significant, and significance levels between 0.05 and 0.10 would be considered marginal.
An independent variable that is a significant predictor of a dependent variable in simple linear regression may not be significant in multiple regression.

Key Terms

significance level: A measure of how likely it is to draw a false conclusion in a statistical test, when the results are really just random variations.
multiple regression: regression model used to find an equation that best predicts the Y variable as a linear function of multiple X variables

Using Multiple Regression for Prediction

Standard multiple regression is the same idea as simple linear regression, except now we have several independent variables predicting the dependent variable. Imagine that we wanted to predict a person's height from the gender of the person and from the weight. We would use standard multiple regression in which gender and weight would be the independent variables and height would be the dependent variable.

The resulting output would tell us a number of things. First, it would tell us how much of the variance in height is accounted for by the joint predictive power of knowing a person's weight and gender. This value is denoted by R². The output would also tell us if the model allows the prediction of a person's height at a rate better than chance. This is denoted by the significance level of the model. Within the social sciences, a significance level of 0.05 is often considered the standard for what is acceptable. Therefore, in our example, if the statistic is 0.05 (or less), then the model is considered significant. In other words, there is only a 5 in 100 chance (or less) that there really is not a relationship between height, weight, and gender. If the significance level is between 0.05 and 0.10, then the model is considered marginal. In other words, the model is fairly good at predicting a person's height, but there is a 5-10% probability that there really is not a relationship between height, weight, and gender.

In addition to telling us the predictive value of the overall model, standard multiple regression tells us how well each independent variable predicts the dependent variable, controlling for each of the other independent variables. In our example, the regression analysis would tell us how well weight predicts a person's height, controlling for gender, as well as how well gender predicts a person's height, controlling for weight. To see if weight is a "significant" predictor of height, we would look at the significance level associated with weight. Again, significance levels of 0.05 or lower would be considered significant, and significance levels between 0.05 and 0.10 would be considered marginal. Once we have determined that weight is a significant predictor of height, we would want to more closely examine the relationship between the two variables: is it positive or negative? In this example, we would expect a positive relationship; that is, the greater a person's weight, the greater the height. (A negative relationship would be present if the greater a person's weight, the shorter the height.) We can determine the direction of the relationship between weight and height by looking at the regression coefficient associated with weight.
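A minimal sketch of this example with made-up data is shown below. Gender is entered as a categorical predictor, and the fitted model reports R², the overall significance level, and a coefficient and p-value for each predictor while controlling for the other; the names and numbers are invented for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical sample: height (cm), weight (kg), gender.
people = pd.DataFrame({
    "height": [178, 165, 172, 158, 181, 169, 175, 162, 185, 160],
    "weight": [82, 61, 70, 55, 90, 66, 78, 59, 95, 57],
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
})

# Height predicted from weight and gender; C() treats gender as categorical.
fit = smf.ols("height ~ weight + C(gender)", data=people).fit()

print(fit.rsquared)   # variance in height jointly accounted for (R squared)
print(fit.f_pvalue)   # significance level of the overall model
print(fit.params)     # direction and size of each relationship
print(fit.pvalues)    # is weight (or gender) significant, controlling for the other?
```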
A similar procedure shows us how well gender predicts height. As with weight, we would check to see if gender is a significant predictor of height, controlling for weight. The difference comes when determining the exact nature of the relationship between gender and height. That is, it does not make sense to talk about the effect on height as gender increases or decreases, since gender is not a continuous variable.

Conclusion

As mentioned, the significance levels given for each independent variable indicate whether that particular independent variable is a significant predictor of the dependent variable, over and above the other independent variables. Because of this, an independent variable that is a significant predictor of a dependent variable in simple linear regression may not be significant in multiple regression (i.e., when other independent variables are added into the equation). This could happen because the covariance that the first independent variable shares with the dependent variable could overlap with the covariance that is shared between the second independent variable and the dependent variable. Consequently, the first independent variable is no longer uniquely predictive and would not be considered significant in multiple regression. Because of this, it is possible to get a highly significant R² but have none of the independent variables be significant.

Multiple Regression: This image shows data points and their linear regression. Multiple regression is the same idea as single regression, except we deal with more than one independent variable predicting the dependent variable.

Interaction Models

In regression analysis, an interaction may arise when considering the relationship among three or more variables.

LEARNING OBJECTIVES

Outline the problems that can arise when the simultaneous influence of two variables on a third is not additive.

KEY TAKEAWAYS

Key Points

If two variables of interest interact, the relationship between each of the interacting variables and a third "dependent variable" depends on the value of the other interacting variable.
In practice, the presence of interacting variables makes it more difficult to predict the consequences of changing the value of a variable, particularly if the variables it interacts with are hard to measure or difficult to control.
The interaction between an explanatory variable and an environmental variable suggests that the effect of the explanatory variable has been moderated or modified by the environmental variable.

Key Terms

interaction variable: A variable constructed from an original set of variables to try to represent either all of the interaction present or some part of it.

In statistics, an interaction may arise when considering the relationship among three or more variables, ...
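One common way to build such an interaction variable, sketched here with simulated data and invented names, is to add a product term to the regression so that the slope of one explanatory variable is allowed to depend on the value of the other.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 120
x1 = rng.normal(size=n)   # explanatory variable
x2 = rng.normal(size=n)   # e.g. an environmental variable
# The effect of x1 on y depends on the level of x2 (a non-additive effect).
y = 1.0 + 0.5 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# "x1 * x2" expands to x1 + x2 + x1:x2, where x1:x2 is the interaction term.
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)    # a sizeable x1:x2 coefficient signals an interaction
print(fit.pvalues)
```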