Regression
Predictions and Probabilistic Models
Regression models are often used to predict a response variable.

Learning Objectives
Explain how to estimate the relationship among variables using regression analysis

Key Takeaways
Key Points
- Regression models predict a value of the y variable, given known values of the x variables. Prediction within the range of values in the data set used for model-fitting is known informally as interpolation.
- Prediction outside this range of the data is known as extrapolation. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.
- There are certain necessary conditions for regression inference: observations must be independent; the mean response has a straight-line relationship with x; the standard deviation of y is the same for all values of x; and the response y varies according to a normal distribution.
Key Terms
- interpolation: the process of estimating the value of a function at a point from its values at nearby points
- extrapolation: a calculation of an estimate of the value of some function outside the range of known values
Regression Analysis
In statistics, regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables: that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables, called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting. It is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or false relationships, so caution is advisable; for example, correlation does not imply causation.
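As a minimal sketch of estimating a regression function, the snippet below fits a least-squares line to a small data set and uses it to estimate the average value of the dependent variable at a fixed value of the independent variable. All numbers are made up for illustration.

```python
# Estimate the conditional mean E[y | x] with a simple least-squares fit.
# The (x, y) pairs below are invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept of the least-squares line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    """Estimated average value of y at a given x (the regression function)."""
    return intercept + slope * x

print(round(predict(3.5), 3))  # 7.015
```

The estimate at x = 3.5 is the model's guess at the average y among observations with that x, which is exactly the conditional-expectation reading described above.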
Making Predictions Using Regression Inference
Regression models predict a value of the y variable given known values of the x variables. It is generally advised that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.
However, this does not cover the full set of modeling errors that may be made; in particular, the assumption of a particular form for the relation between the dependent and independent variables may itself be wrong.
Conditions for Regression Inference
A scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y. Regression inference requires the following conditions:

- Repeated responses y are independent of each other.
- The mean response μy has a straight-line (i.e., "linear") relationship with x: μy = α + βx; the slope β and intercept α are unknown parameters.
- The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
- For any fixed value of x, the response y varies according to a normal distribution.

The importance of data distribution in linear regression inference: A good rule of thumb when using the linear regression method is to look at the scatter plot of the data. This graph is a visual example of why it is important that the data have a linear relationship. Each of these four data sets has the same linear regression line and therefore the same correlation, 0.816. This number may at first seem like a strong correlation—but in reality the four data distributions are very different: the same predictions that might be true for the first data set would likely not be true for the second, even though the regression method would lead you to believe that they were more or less the same. Looking at panels 2, 3, and 4, you can see that a straight line is probably not the best way to represent these three data sets.
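The four data sets described above match Anscombe's classic quartet. Using its published values, a direct computation confirms that all four sets share (to three decimals) the same correlation, even though their shapes are very different:

```python
from math import sqrt

# Anscombe's quartet (published values). Sets I-III share the same x's.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rs = [correlation(xs, ys) for xs, ys in quartet]
print([round(r, 3) for r in rs])  # all four are ~0.816
```

The identical summary statistics are exactly why the scatterplot, not just the correlation, must be inspected before trusting a fitted line.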
A Graph of Averages
A graph of averages and the least-squares regression line are both good ways to summarize the data in a scatterplot.

Learning Objectives
Contrast linear regression and the graph of averages

Key Takeaways
Key Points
- In most cases, a line will not pass through all points in the data. A good line of regression makes the distances from the points to the line as small as possible. The most common method of doing this is called the "least-squares" method.
- Sometimes, a graph of averages is used to show a pattern between the x and y variables. In a graph of averages, the x-axis is divided up into intervals. The averages of the y values in those intervals are plotted against the midpoints of the intervals.
- The graph of averages plots a typical value in each interval: some of the points fall above the least-squares regression line, and some of the points fall below that line.
Key Terms
- interpolation: the process of estimating the value of a function at a point from its values at nearby points
- extrapolation: a calculation of an estimate of the value of some function outside the range of known values
- graph of averages: a plot of the average values of one variable (say y) for small ranges of values of the other variable (say x), against the value of the second variable (x) at the midpoints of the ranges
Linear Regression vs. Graph of Averages
Linear (straight-line) relationships between two quantitative variables are very common in statistics. Often, when we have a scatterplot that shows a linear relationship, we'd like to summarize the overall pattern and make predictions about the data. This can be done by drawing a line through the scatterplot. The regression line drawn through the points describes how the dependent variable y changes as the independent variable x changes.

In most cases, a line will not pass through all points in the data. A good line of regression makes the distances from the points to the line as small as possible. The most common method of doing this is called the "least-squares" method. The least-squares regression line is of the form ŷ = a + bx, with slope b = r(sy/sx) and intercept a = ȳ - b x̄, where r is the correlation between x and y and sx, sy are their standard deviations.
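The least-squares slope can be written either as the ratio of the covariance to the variance of x, or as r times (sy/sx); the two forms are algebraically identical. The sketch below, with made-up numbers, checks that they agree:

```python
from math import sqrt

# Made-up data for illustration.
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [1.5, 3.0, 4.5, 5.5, 8.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)           # sum of squared x-deviations
syy = sum((y - my) ** 2 for y in ys)           # sum of squared y-deviations
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # cross-products

# Slope two ways: covariance formula, and r * (sy / sx).
b_direct = sxy / sxx
r = sxy / sqrt(sxx * syy)
b_via_r = r * sqrt(syy / sxx)

# Intercept: the least-squares line always passes through (x-bar, y-bar).
a = my - b_direct * mx

print(round(b_direct, 6) == round(b_via_r, 6))  # True
```

Because the line passes through the point of averages, knowing r and the two standard deviations is enough to draw it.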
Sometimes, a graph of averages is used to show a pattern between the x and y variables. In a graph of averages, the x-axis is divided up into intervals. The averages of the y values in those intervals are plotted against the midpoints of the intervals.
The points on a graph of averages do not usually line up in a straight line, making it different from the least-squares regression line. The graph of averages plots a typical y value in each interval: some of the points fall above the least-squares regression line, and some of the points fall below that line.
Least Squares Regression Line: Random data points and their linear regression.
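A graph of averages can be built directly. The snippet below (made-up data points and hypothetical interval edges) divides the x-axis into intervals and averages the y values in each:

```python
# Build a graph of averages: divide the x-axis into intervals and pair
# the mean y-value of each interval with the interval midpoint.
# The (x, y) data and the interval edges are invented for illustration.
data = [(1.2, 2.0), (1.8, 2.9), (2.4, 3.1), (2.9, 4.2),
        (3.3, 4.0), (3.8, 5.1), (4.1, 5.5), (4.7, 6.2)]

def graph_of_averages(points, edges):
    """Return one (midpoint, mean y) pair per non-empty interval [lo, hi)."""
    out = []
    for lo, hi in zip(edges, edges[1:]):
        ys = [y for x, y in points if lo <= x < hi]
        if ys:
            out.append(((lo + hi) / 2, sum(ys) / len(ys)))
    return out

ga = graph_of_averages(data, [1, 2, 3, 4, 5])
print(ga)  # one (midpoint, average) pair per interval
```

Plotting these pairs next to the least-squares line shows the contrast described above: the averaged points scatter around the line rather than falling exactly on it.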
The Regression Method
The regression method utilizes the average from known data to make predictions about new data.

Learning Objectives
Contrast interpolation and extrapolation to predict data

Key Takeaways
Key Points
- If we know no information about the x-value, it is best to make predictions about the y-value using the average of the entire data set.
- If we know the independent variable, or x-value, the best prediction of the dependent variable, or y-value, is the average of all the y-values for that specific x-value.
- Generalizations and predictions are often made using the methods of interpolation and extrapolation.
Key Terms
- extrapolation: a calculation of an estimate of the value of some function outside the range of known values
- interpolation: the process of estimating the value of a function at a point from its values at nearby points
The Regression Method
The best way to understand the regression method is to use an example. Let's say we have some data about students' Math SAT scores and their freshman year GPAs in college. The average SAT score is 560, with a standard deviation of 75. The average first-year GPA is 2.8, with a standard deviation of 0.5. Now, we choose a student at random and wish to predict his first-year GPA. With no other information given, it is best to predict using the average. We predict his GPA is 2.8.

Now, let's say we pick another student. However, this time we know her Math SAT score was 680, which is significantly higher than the average. Instead of just predicting 2.8, this time we look at the graph of averages and predict her GPA is whatever the average is of all the students in our sample who also scored a 680 on the SAT. This is likely to be higher than 2.8.
To generalize the regression method:
- If you know no information (you don't know the SAT score), it is best to make predictions using the average.
- If you know the independent variable, or x-value (you know the SAT score), the best prediction of the dependent variable, or y-value (in this case, the GPA), is the average of all the y-values for that specific x-value.
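The two rules above can be sketched in code. The records below are hypothetical (SAT, GPA) pairs invented for illustration; the function returns the conditional average when the SAT score is known, and the overall average otherwise:

```python
from statistics import mean

# Hypothetical (SAT score, GPA) records, invented for illustration.
records = [(560, 2.8), (560, 2.6), (680, 3.4),
           (680, 3.2), (680, 3.6), (480, 2.3)]

def predict_gpa(sat=None):
    """Average GPA of students with the given SAT score; with no score
    (or an unseen one), fall back to the overall average."""
    if sat is not None:
        matches = [g for s, g in records if s == sat]
        if matches:
            return mean(matches)
    return mean(g for _, g in records)

print(round(predict_gpa(680), 2))  # conditional average for SAT = 680
print(round(predict_gpa(), 2))     # overall average, no score known
```

This is exactly the graph-of-averages prediction: conditioning on the known x-value narrows the pool we average over.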
Generalization
In the example above, the college only has experience with students who have been admitted; however, it could also use the regression model for students who have not been admitted. There are some problems with this type of generalization. If the students admitted all had SAT scores within the range of 480 to 780, the regression model may not be a very good estimate for a student who only scored a 350 on the SAT.

Despite this issue, generalization is used quite often in statistics. Sometimes statisticians will use interpolation to predict data points within the range of known data points. For example, if no one before had received an exact SAT score of 650, we would predict his GPA by looking at the GPAs of those who scored 640 and 660 on the SAT.
Extrapolation is also frequently used, in which data points beyond the known range of values are predicted. Let's say the highest SAT score of a student the college admitted was 780. What if we have a student with an SAT score of 800, and we want to predict her GPA? We can do this by extending the regression line. This may or may not be accurate, depending on the subject matter.
Extrapolation: An example of extrapolation, where data outside the known range of values are predicted. The red points are assumed known; the extrapolation problem consists of giving a meaningful value to the blue box beyond them.
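Extending a regression line beyond the observed range can be sketched as follows. The SAT/GPA pairs are made up, and the point is that the prediction at 800 rests on the untested assumption that the straight-line relationship continues past 780:

```python
# Sketch of extrapolation: fit a least-squares line on SAT scores observed
# in the range 480-780 (made-up data), then extend it to 800, which lies
# outside the observed range. The further out we go, the less reliable it is.
sats = [480, 520, 560, 600, 640, 680, 720, 780]
gpas = [2.3, 2.5, 2.8, 2.9, 3.1, 3.3, 3.5, 3.7]

n = len(sats)
mx, my = sum(sats) / n, sum(gpas) / n
b = sum((x - mx) * (y - my) for x, y in zip(sats, gpas)) / \
    sum((x - mx) ** 2 for x in sats)
a = my - b * mx

def predict(sat):
    return a + b * sat

inside = predict(650)   # interpolation: within the observed range
outside = predict(800)  # extrapolation: beyond the observed range
print(round(inside, 2), round(outside, 2))
```

An interpolated value like predict(650) is bracketed by observed data; predict(800) is not, which is why a prediction interval that widens outside the data is recommended above.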
The Regression Fallacy
The regression fallacy fails to account for natural fluctuations and instead ascribes cause where none exists.

Learning Objectives
Illustrate examples of the regression fallacy

Key Takeaways
Key Points
- Things such as golf scores, the earth's temperature, and chronic back pain fluctuate naturally and usually regress towards the mean. The logical flaw is to make predictions that expect exceptional results to continue as if they were average.
- People are most likely to take action when variance is at its peak. Then, after results become more normal, they believe that their action was the cause of the change, when in fact, it was not causal.
- In essence, misapplication of regression to the mean can reduce all events to a "just so" story, without cause or effect. Such misapplication takes as a premise that all events are random, as they must be for the concept of regression to the mean to be validly applied.
Key Terms
- regression fallacy: flawed logic that ascribes cause where none exists
- post hoc fallacy: flawed logic that assumes just because A occurred before B, then A must have caused B to happen
What is the Regression Fallacy?
The regression (or regressive) fallacy is an informal fallacy. It ascribes cause where none exists. The flaw is failing to account for natural fluctuations. It is frequently a special kind of the post hoc fallacy.

Things such as golf scores, the earth's temperature, and chronic back pain fluctuate naturally and usually regress towards the mean. The logical flaw is to make predictions that expect exceptional results to continue as if they were average. People are most likely to take action when variance is at its peak. Then, after results become more normal, they believe that their action was the cause of the change, when in fact, it was not causal.
This use of the word "regression" was coined by Sir Francis Galton in an 1885 study called "Regression Toward Mediocrity in Hereditary Stature." He showed that the height of children from very short or very tall parents would move towards the average. In fact, in any situation where two variables are less than perfectly correlated, an exceptional score on one variable may not be matched by an equally exceptional score on the other variable. The imperfect correlation between parents and children (height is not entirely heritable) means that the distribution of heights of their children will be centered somewhere between the average of the parents and the average of the population as a whole. Thus, any single child can be more extreme than the parents, but the odds are against it.

Francis Galton: A picture of Sir Francis Galton, who coined this use of the word "regression."
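Galton's observation can be sketched deterministically. Assuming, purely for illustration, a population mean height of 68 inches and a parent-child correlation of 0.5 (with equal spreads in the two generations), the expected child height falls between the parent's height and the population mean:

```python
# Deterministic sketch of regression toward the mean in heights.
# The mean (68 inches) and correlation (0.5) are illustrative assumptions,
# as is the simplification that parent and child heights have equal spread.
POP_MEAN = 68.0
R = 0.5  # imperfect parent-child correlation

def expected_child_height(parent):
    # Move only a fraction R of the way from the population mean
    # toward the parent's height; with R < 1, exceptional parents
    # are expected to have less exceptional children.
    return POP_MEAN + R * (parent - POP_MEAN)

print(expected_child_height(74.0))  # 71.0: taller than average, shorter than the parent
```

The same arithmetic explains every example in the next section: whenever R is less than 1, an extreme observation is expected to be followed by a less extreme one, with no cause required.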
Examples of the Regression Fallacy
- When his pain got worse, he went to a doctor, after which the pain subsided a little. Therefore, he benefited from the doctor's treatment. The pain subsiding a little after it has gotten worse is more easily explained by regression towards the mean; assuming the pain relief was caused by the doctor is fallacious.
- The student did exceptionally poorly last semester, so I punished him. He did much better this semester. Clearly, punishment is effective in improving students' grades. Often, exceptional performances are followed by more normal performances, so the change in performance might better be explained by regression towards the mean. Incidentally, some experiments have shown that people may develop a systematic bias for punishment and against reward because of reasoning analogous to this example of the regression fallacy.
- The frequency of accidents on a road fell after a speed camera was installed. Therefore, the speed camera has improved road safety. Speed cameras are often installed after a road incurs an exceptionally high number of accidents, and this value usually falls (regression to mean) immediately afterwards. Many speed camera proponents attribute this fall in accidents to the speed camera, without observing the overall trend.
- Some authors have claimed that the alleged "Sports Illustrated Cover Jinx" is a good example of a regression effect: extremely good performances are likely to be followed by less extreme ones, and athletes are chosen to appear on the cover of Sports Illustrated only after extreme performances. Assuming athletic careers are partly based on random factors, attributing this to a "jinx" rather than regression, as some athletes reportedly believed, would be an example of committing the regression fallacy.
Misapplication of the Regression Fallacy
On the other hand, dismissing valid explanations can lead to a worse situation. For example: After the Western Allies invaded Normandy, creating a second major front, German control of Europe waned. Clearly, the combination of the Western Allies and the USSR drove the Germans back.

The conclusion above is true, but what if instead we came to a fallacious evaluation: "Given that the counterattacks against Germany occurred only after they had conquered the greatest amount of territory under their control, regression to the mean can explain the retreat of German forces from occupied territories as a purely random fluctuation that would have happened without any intervention on the part of the USSR or the Western Allies." This is clearly not the case. The reason is that political power and occupation of territories are not primarily determined by random events, making the concept of regression to the mean inapplicable (on the large scale).
In essence, misapplication of regression to the mean can reduce all events to a "just so" story, without cause or effect. Such misapplication takes as a premise that all events are random, as they must be for the concept of regression to the mean to be validly applied.