Linear Functions and Modeling

Lines of Best Fit

Linear Regression

Linear regression is the process used to find the equation of a line of best fit, the line that most closely approximates the linear relationship between two variables. The correlation coefficient indicates the strength of the linear fit to the data.
Linear regression is a statistical method that calculates a line of best fit for a given set of data points. The line of best fit is also called the regression line for the data in a scatterplot.
For each data point, the vertical distance from the point to the line is squared. Linear regression finds the line for which the sum of these squared distances is as small as possible.
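The least-squares idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `fit_line` and `sum_squared_residuals` and the four sample points are made up for the example, and distances are measured vertically, as in ordinary least-squares regression.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample points, chosen only for illustration.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

slope, intercept = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Nudging the slope or the intercept away from the fitted values
# gives a larger sum of squares for this data.
steeper = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
higher = sum_squared_residuals(xs, ys, slope, intercept + 0.1)
print(slope, intercept, best, steeper, higher)
```

For these sample points the fitted line is approximately y = 1.4x + 0.5, and both nudged lines produce a larger sum of squared distances than the fitted line does.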
A correlation is a relationship between two variables. When the points in a scatterplot are very close to a line of best fit, there is a strong correlation. When they show a general linear pattern, but are not close to the line, there is a weak correlation. Both positive and negative trends may exhibit strong or weak correlations. When the points show no linear pattern at all, there is no linear correlation.

The correlation coefficient, r, of a line of best fit is a value between –1 and 1, inclusive, that indicates the strength and direction of the correlation of the line.

  • An r-value of 1 indicates that the line has a positive slope and all the points lie on the line.
  • A value of r close to 1 indicates a strong positive correlation.
  • A positive r-value closer to zero than to 1 indicates a weak positive correlation.
  • An r-value of zero indicates no linear correlation.
  • A negative r-value closer to zero than to –1 indicates a weak negative correlation.
  • A value of r close to –1 indicates a strong negative correlation.
  • An r-value of –1 indicates that the line has a negative slope and all the points lie on the line.
Scatterplots show how the correlation coefficient indicates the strength and direction of a linear correlation. The sign of r indicates whether the correlation is positive or negative, and the absolute value of r indicates the strength of the correlation. The greater the absolute value, the stronger the correlation.
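As a rough sketch of how the sign and absolute value of r can be read, the following Python computes the correlation coefficient from its standard definition and labels the result. The function names and sample data are hypothetical, and the 0.5 cutoff between "weak" and "strong" is an assumption taken from the "closer to zero than to 1" wording above, not a universal standard.

```python
import math


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe(r):
    """Label r by direction (sign) and strength (absolute value).

    Assumption: 0.5 is used as the cutoff between weak and strong.
    """
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    strength = "strong" if abs(r) >= 0.5 else "weak"
    return f"{strength} {direction}"


# Hypothetical data: the first set lies exactly on a falling line,
# the second shows only a loose upward pattern.
r_perfect = correlation_coefficient([1, 2, 3, 4], [8, 6, 4, 2])
r_loose = correlation_coefficient([1, 2, 3, 4], [2, 3, 1, 4])
print(r_perfect, describe(r_perfect))
print(r_loose, describe(r_loose))
```

The first data set gives r = –1 because every point lies on a line with negative slope. The second gives a positive r closer to zero than to 1, which the sketch labels as a weak positive correlation.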

Interpreting Lines of Best Fit

Technology can be used to generate a line of best fit.
For very small data sets, a line of best fit can be calculated by hand. Most often, technology such as graphing calculators, spreadsheets, or online tools is used to determine the equation of the line.
Step-By-Step Example
Determining the Equation of a Line of Best Fit

Employees at a company start with an average salary of $40,500 at year zero. The table shows the average salaries of the company's employees for selected years of service. Graph and interpret the line of best fit.

Year    Average Salary
0       $40,500
1       $42,000
2       $43,500
3       $45,000
4       $45,500
5       $47,000
9       $52,000
10      $54,000
11      $55,500
12      $56,000
13      $56,500
14      $56,000
15      $56,500
16      $57,000
17      $58,000

Step 1
Create a scatterplot of the data.
Step 2
Calculate the line of best fit by using the linear regression function of a graphing calculator.
The line of best fit is:
\[ y \approx 1{,}061x + 41{,}660 \]
Solution
Graph the line of best fit on the scatterplot.
Next, interpret the line of best fit.
  • The correlation coefficient of r ≈ 0.98 means that there is a very strong positive correlation between years of service and average salary. The longer employees have worked at the company, the higher their average salaries tend to be.
  • The slope of the line is about 1,061, which means that the average salary increases by about $1,061 for each additional year of service.
  • The y-intercept is about 41,660, which indicates that the line predicts an average salary of about $41,660 in year zero. This is slightly more than the actual year-zero average of $40,500, because the line of best fit approximates the data set as a whole rather than passing through every data point.

Making Predictions Using Lines of Best Fit

A line of best fit can be used to predict data values.

The line of best fit can be used to predict values that are not in the data set.

  • Interpolation is predicting a data value between given data points.
  • Extrapolation is predicting a data value outside the set of given data points.

Predictions made with a line of best fit are not always accurate. Extrapolation carries a greater degree of uncertainty than interpolation and is more likely to produce inaccurate results, because the trend in the data may not continue beyond the observed range. By contrast, interpolation is generally more reliable because the predicted value lies between measured values.
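The distinction between the two kinds of prediction can be expressed as a simple range check. The following is a minimal sketch. The helper name `prediction_type` is hypothetical, and it treats any x-value within the observed range, endpoints included, as interpolation.

```python
def prediction_type(x, observed_xs):
    """Classify a prediction at x relative to the observed x-values."""
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"


# Year values that appear in the salary table.
years = [0, 1, 2, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17]

print(prediction_type(7, years))   # 7 lies inside the range 0 to 17
print(prediction_type(25, years))  # 25 lies beyond the last observed year
```

A check like this only flags which predictions deserve extra caution. It does not measure how inaccurate an extrapolated value might be.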

The line of best fit for the average salaries at a company, where y is the average salary in dollars and x is the number of years of service, is:
\[ y = 1{,}061.340641x + 41{,}660.20236 \]
To use interpolation to predict the average salary after 7 years, substitute 7 for x in the equation of the line of best fit.
\[ \begin{aligned} y &= 1{,}061.340641(7) + 41{,}660.20236 \\ &= 49{,}089.58685 \end{aligned} \]
Based on the line of best fit, the average salary of employees with 7 years of service is about $49,100.
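The substitution can be checked with a short Python sketch. The constant and function names are illustrative. The coefficients are the ones given in the equation of the line of best fit.

```python
SLOPE = 1061.340641       # dollars per year, from the line of best fit
INTERCEPT = 41660.20236   # dollars, the y-intercept


def predict_salary(x):
    """Evaluate the line of best fit at x years."""
    return SLOPE * x + INTERCEPT


estimate = predict_salary(7)
rounded = round(estimate, -2)  # round to the nearest hundred dollars

print(f"${estimate:,.2f}")
print(f"${rounded:,.0f}")
```

Rounding to the nearest hundred reflects that the line is only an approximation of the data, so reporting the prediction to the cent would suggest more precision than the model supports.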