Simple Linear Regression
Chapter 13
Simple Regression -- Part 1

Correlation & Regression
We will investigate the relationship between two variables, Y and X.
In general, we want to answer the question: "Does X tell us something about how Y behaves?"

Textbook's Example
At Sunflowers Apparel, how well can you predict sales from the size of the store?
You would naturally think that more stuff on display ==> higher sales.
Sales from 14 stores are in Site.xls.

Exploring the data
We use an X-Y scatter plot to display both the strength and type of relationship.
We use the correlation coefficient to summarize the strength of the relationship.

The example
[Figure: Scatter Plot of Sales versus Store Size. x-axis: Square Feet (1000s); y-axis: Annual Sales (Millions). Correlation = ?]

Computing r
Many of you can do this with the calculators you used in STA 2023.
Our text covers this back in the descriptive statistics chapter (Chapter 3). Page 116:
$r = \frac{\mathrm{Cov}(X, Y)}{S_X S_Y}$

Computations
Store   Square Feet (1000s)   Annual Sales ($ millions)
1       1.7                   3.7
2       1.6                   3.9
3       2.8                   6.7
4       5.6                   9.5
5       1.3                   3.4
6       2.2                   5.6
7       1.3                   3.7
8       1.1                   2.7
9       3.2                   5.5
10      1.5                   2.9
11      5.2                   10.7
12      4.6                   7.6
13      5.8                   11.8
14      3.0                   4.1
Covar      4.523367
SD sales   2.999414
SD SqFt    1.707981
Correl     0.950883

Excel and PHStat notes
PHStat has the Scatter Plot off the descriptive statistics menu.
Excel has functions COVAR and CORREL.
Correlation is not directly in PHStat, but there are templates Covariance.xls and Correlation.xls.

What is covariance?
It measures how much X and Y tend to vary in the same direction.
Positive covariance means ___________.
A negative covariance means _________.
However, it is hard to interpret because there is no "standard".
What does the covariance of 4.52 mean?
If store size were in actual square footage, the covariance would be about 4523.
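The Excel figures above can be reproduced by hand. The sketch below is in Python, which the lecture does not use, and its function and variable names are illustrative only. It takes the 14 stores from the table and computes the covariance two ways: dividing by n, which is what Excel's COVAR does, and dividing by n - 1, the sample version that pairs with the sample standard deviations in r = Cov(X, Y) / (S_X S_Y). It also rescales store size to actual square feet to show why covariance has no "standard" scale while r does.

```python
import math

# Sunflowers Apparel data from the table above:
# store size in 1000s of square feet, annual sales in $ millions.
sq_ft = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

def summarize(x, y):
    """Return (population covariance, sample covariance, r) for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sample standard deviations (divide by n - 1).
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    cov_pop = sxy / n            # divides by n, as Excel's COVAR does
    cov_sample = sxy / (n - 1)   # divides by n - 1, to match the sample SDs
    r = cov_sample / (sd_x * sd_y)
    return cov_pop, cov_sample, r

cov_pop, cov_sample, r = summarize(sq_ft, sales)
print(f"COVAR-style covariance: {cov_pop:.6f}")
print(f"Sample covariance:      {cov_sample:.6f}")
print(f"Correlation r:          {r:.6f}")

# Same stores, but with size measured in actual square feet.
cov_pop_ft, _, r_ft = summarize([1000 * a for a in sq_ft], sales)
print(f"Covariance in sq ft:    {cov_pop_ft:.1f}")
print(f"Correlation in sq ft:   {r_ft:.6f}")
```

Changing the units multiplies the covariance by 1000 but leaves the correlation where it was, which is the sense in which r is a "standardized" covariance.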
Correlation coefficient
A "standardized" covariance:
$\rho = \frac{\text{Population covariance between X and Y}}{\sigma_Y \sigma_X}$
Estimated by:
$r = \frac{\text{Sample covariance between X and Y}}{S_Y S_X}$
Correlation coefficient
The correlation coefficient (r) measures how much Y and X tend to vary in the same direction on a standard scale.
It will always be between -1 and +1.
r = +1 implies a perfect positive relationship
r = -1 implies a perfect negative relationship
r = 0 implies no linear relationship exists!

Correlation patterns

Two other examples
Files are in the simple regression lecture module: __________ and __________.
Might wish to print out the page with the graph.
For later, will also want the regression output.

Strength of the Relationship
Correlation measures the strength of the linear relationship between Y and X.
How large does this measure (r) have to be to show a "useful" linear relationship?
There is a formal hypothesis test on pages 500-501.
For now, here is a quick "rule of thumb".

The quick rule of thumb
Correlation is significant if:
$|r| > \frac{2}{\sqrt{n}}$
This is approximately what would occur in a hypothesis test at $\alpha = .05$ significance.
If you are close to that, you might want to perform the formal hypothesis test.
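The rule of thumb is easy to check numerically. The sketch below is in Python, which the lecture does not use; the function name is made up for illustration, and it assumes the rule reads |r| > 2 / sqrt(n). It is applied to the Sunflowers values n = 14 and r = 0.950883 from the Computations slide.

```python
import math

def passes_rule_of_thumb(r, n):
    """Return True when |r| exceeds the quick cutoff 2 / sqrt(n)."""
    return abs(r) > 2 / math.sqrt(n)

n = 14
r = 0.950883                 # CORREL value from the Computations slide
cutoff = 2 / math.sqrt(n)    # the quick cutoff for this sample size

print(f"cutoff = {cutoff:.4f}")
print("significant by the rule of thumb:", passes_rule_of_thumb(r, n))
```

Because the cutoff shrinks as n grows, a modest correlation can pass the rule in a large sample, while a small sample needs a much stronger one.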
Quick test results
Site selection: n = 14, r = .95088
Next example: ____________ n = ___ and r = _____

Simple Linear Regression
Obtaining the Fit
Sections 12.2 and 12.3

Regression analysis
Correlation tells us how strongly Y and X are related.
Regression analysis is the name of the procedure that estimates the form of this relationship.
We'll begin with simple regression, which assumes the form:
$\hat{Y}_i = b_0 + b_1 X_i$

Regression notation
Y is the variable we want to predict.
We believe X influences how Y behaves.
$\hat{Y}_i$ is the estimated value of Y at $X_i$.
$b_0$ is the Y-intercept in the equation.
$b_1$ is the slope of the regression line.

Example (page 474)
n = 14 Sunflowers Apparel stores
Y = Annual sales in millions of dollars (values from 2.7 to 11.8)
X = Size of the store in units of 1000 square feet (values from 1.1 to 5.6)

Scatter Plot
[Figure: Sunflowers Apparel scatter plot. x-axis: Size of Store (1000 sq feet); y-axis: Annual Sales (Millions).]

Fitting the Regression Line
Our goal: Find the straight line that best fits the data we've collected.
The equation is: $\hat{Y}_i = b_0 + b_1 X_i$
The best equation will be the one that minimizes the error in fit.
The fit error is thus: $e_i = Y_i - \hat{Y}_i$

Obtaining the line to predict sales
[Figure: Sunflowers Apparel scatter plot with a candidate line, with + Errors and - Errors marked. x-axis: Size of Store (1000 sq feet); y-axis: Annual Sales (Millions).]

Balancing out the errors
The fit error for the ith point on the scatter plot diagram is: $e_i = Y_i - \hat{Y}_i$
We would like the sum of the + errors to be the same as the sum of the - errors.
However, there are many lines that can make this happen.

The "Least Squares" Line
So, which of these solutions is the best one?
Select the line with the minimum sum of squared error terms:
$\sum_{i=1}^{n} e_i^2 = ?$
This requires ... (gulp!) ... CALCULUS!

The Least Squares Estimators
POOF!
Slope: $b_1 = r \frac{S_Y}{S_X}$
Intercept: $b_0 = \bar{Y} - b_1 \bar{X}$
There are many equivalent forms (page 478).
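The two estimators can be evaluated directly from the store data. The sketch below is in Python, which the lecture does not use, with illustrative variable names. It computes the slope as r times S_Y over S_X and the intercept as the mean of Y minus b1 times the mean of X, then compares the slope with one well-known equivalent form, the sum of cross-products divided by the sum of squared X deviations. Whether that particular form is among those the textbook lists is an assumption.

```python
import math
import statistics

# Sunflowers Apparel data: size in 1000s of square feet, sales in $ millions.
sq_ft = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

mean_x = statistics.mean(sq_ft)
mean_y = statistics.mean(sales)
s_x = statistics.stdev(sq_ft)   # sample standard deviation of X
s_y = statistics.stdev(sales)   # sample standard deviation of Y

# Sums of squared deviations and of cross-products.
sxx = sum((x - mean_x) ** 2 for x in sq_ft)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sq_ft, sales))
r = sxy / math.sqrt(sxx * syy)

# Least squares estimators in the form given on the slide.
b1 = r * s_y / s_x
b0 = mean_y - b1 * mean_x

# An equivalent form of the slope (assumed, not taken from the slide).
b1_alt = sxy / sxx

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"b1 via Sxy / Sxx = {b1_alt:.4f}")
```

Rounded to four decimals, these agree with the trend-line equation y = 1.6699x + 0.9645 reported for the sales data.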
Regression with sales data
[Figure: PHStat scatter plot of the Sunflowers Apparel data with Excel's Trend Line function and $R^2$. x-axis: Size of Store (1000 sq feet); y-axis: Annual Sales (Millions). Fitted line: y = 1.6699x + 0.9645, $R^2$ = 0.9042.]

Output from PHStat

Interpretation of results
Remember the variables are
Y = Annual sales per store (in millions of dollars)
X = Size of store (1000 square feet)
The estimated slope (b1) tells us:
The estimated intercept (b0) tells us:
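One way to read the fitted equation is to plug store sizes into it. The sketch below is in Python, which the lecture does not use, and the function name is hypothetical. It uses the rounded coefficients from the trend-line equation y = 1.6699x + 0.9645 and shows that two stores one unit (1000 square feet) apart differ in predicted sales by exactly the slope.

```python
B0 = 0.9645   # intercept from the trend-line equation, in $ millions
B1 = 1.6699   # slope, in $ millions per 1000 square feet

def predict_sales(size_1000_sq_ft):
    """Predicted annual sales ($ millions) for a store of the given size."""
    return B0 + B1 * size_1000_sq_ft

for size in (2.0, 3.0, 4.0):
    print(f"{size:.1f} thousand sq ft -> {predict_sales(size):.4f} $ millions")

# The gap between stores one unit apart equals the slope.
gap = predict_sales(3.0) - predict_sales(2.0)
print(f"difference per extra 1000 sq ft: {gap:.4f}")
```

The intercept is the prediction at X = 0, which lies outside the observed store sizes (1.1 to 5.6), so reading it literally is an extrapolation.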
Second example, via workbook
1. Open your data file _______________
2. Open Simple Linear Regression.XLS
3. Copy your data to the SLRData sheet (the X variable goes in column A, Y in column B)

Updating the formulas
The workbook assumed the data was in cells A2 through B15.
On the COMPUTE worksheet, you need to change this to A2 through B25 or B36 or whatever.
1. Select cell range L2:M6
2. In L2, fix the A and B upper limits
3. Hit Ctrl/Shift Enter or Apple-Key Enter
Second example, interpretation
Variables are Y = _____ and X = ______.
Equation is:
The estimated slope (b1) tells us:
The estimated intercept (b0) tells us:

How good is our new model?
There are two standard ways to judge:
1. How much of the variation in the Y values (sales) can be attributed to the different values of X (store size)?
2. In general, how small (or large) are the errors in fit?

1. $R^2$: A universal measure of fit
The Coefficient of Determination:
$R^2 = \frac{\text{The variation in Y explained by the X-Y relationship}}{\text{The variation in Y}}$
The $R^2$ value is:
Always between 0 and 1
Usually interpreted as a percentage
The square of correlation (for simple regression)
Output from PHStat
90.4% of the variation in sales can be attributed to variation in store size.

How is $R^2$ computed?
ANOVA table:
Total variation in the Y values is SST = 116.9543
The amount of unexplained variation is SSE = 11.2067
The difference is thus the variation explained by the regression equation, or SSR = 105.7476
The ratio of explained to total is how we get $R^2$ = 105.7476 / 116.9543 = .9042
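The ANOVA quantities can be rebuilt from the raw data. The sketch below is in Python, which the lecture does not use, with illustrative variable names. It fits the least squares line, then computes SST from the deviations of Y around its mean, SSE from the fit errors, and SSR as their difference.

```python
# Sunflowers Apparel data: size in 1000s of square feet, sales in $ millions.
sq_ft = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

n = len(sq_ft)
mean_x = sum(sq_ft) / n
mean_y = sum(sales) / n

# Least squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in sq_ft)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sq_ft, sales))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * x for x in sq_ft]

sst = sum((y - mean_y) ** 2 for y in sales)               # total variation
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))    # unexplained variation
ssr = sst - sse                                           # explained variation
r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSE = {sse:.4f}, SSR = {ssr:.4f}")
print(f"R^2 = {r_squared:.4f}")
```

Since this is simple regression, the same R^2 should also come from squaring the correlation coefficient.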
Size of the typical error ($S_{YX}$)
For each observation i, its error is given by: $e_i = Y_i - \hat{Y}_i$
To find the "typical error," use this formula:
$S_{YX} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}}$
This is the amount by which the prediction typically misses the actual value.
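The typical-error formula can be applied to the store data in a few lines. The sketch below is in Python, which the lecture does not use, with illustrative variable names. It fits the least squares line, forms each fit error, and takes the square root of the sum of squared errors divided by n - 2.

```python
import math

# Sunflowers Apparel data: size in 1000s of square feet, sales in $ millions.
sq_ft = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

n = len(sq_ft)
mean_x = sum(sq_ft) / n
mean_y = sum(sales) / n

# Least squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in sq_ft)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sq_ft, sales))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Fit errors e_i = Y_i - Yhat_i for each store.
errors = [y - (b0 + b1 * x) for x, y in zip(sq_ft, sales)]

# Standard error of the estimate: divide by n - 2, not n.
s_yx = math.sqrt(sum(e ** 2 for e in errors) / (n - 2))

print(f"S_YX = {s_yx:.4f} ($ millions)")
```

The result is in the units of Y, so it can be compared directly with the spread of annual sales.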
Output from PHStat
$S_{YX}$: our text calls this the standard error of the estimate.

$S_{YX}$ in our example
The typical error (called the standard error of the estimate) for our model is: $S_{YX}$ = .9664
This means that:
That doesn't sound so bad, if you consider that annual sales ranged from _______ to _______.

Our second example
n = ___    Y = _______    X = ________
$R^2$ =
$S_{YX}$ =
- Spring '08
- Regression Analysis, PHStat, Sunflowers Apparel