Multiple Regression
Part 4: Indicator Variables
Section 14.6 Indicator Variables
Suppose some observations have a particular characteristic or attribute, while others do not. We want to include this information in the regression model.

Use an indicator variable
Use a binary variable to "indicate" when the characteristic is present:
Xi = 1 if observation i has the attribute
Xi = 0 if observation i does not have it

Subdivision Prices
The data set OakKnoll.XLS contains prices of 20 homes in either the Oak Knoll or Hidden Hills subdivision. We want to predict price (in $1000s) from the size of the house, but allow for neighborhood differences. The variable OakKnoll indicates which homes are in that subdivision.
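As a small illustration of the definition above, an indicator can be built from a list of subdivision labels. This is only a sketch: the labels are made-up examples rather than rows from OakKnoll.XLS, and the helper name make_indicator is hypothetical.

```python
def make_indicator(labels, attribute):
    """Return 1 where the label matches the attribute, else 0."""
    return [1 if label == attribute else 0 for label in labels]


# Hypothetical subdivision labels (not taken from OakKnoll.XLS).
subdivision = ["Oak Knoll", "Hidden Hills", "Hidden Hills", "Oak Knoll"]

# OakKnoll = 1 for homes in Oak Knoll, 0 for homes in Hidden Hills.
oak_knoll = make_indicator(subdivision, "Oak Knoll")
print(oak_knoll)
```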
The Oak Knoll data
[Figure: Price versus Home Size. Scatterplot of Price against SqFt for Hidden Hills and Oak Knoll homes.]

The two-variable regression
b1 = 0.1987 has the standard interpretation: each additional square foot is worth 0.1987 thousand dollars, or $198.70.
What is the meaning of b2 = 33.5383?
An intercept adjustment
For an indicator variable, the coefficient is an "intercept adjustment". To see this, evaluate the Y-hat equation, Y-hat = b0 + b1 SqFt + b2 OakKnoll, for each subdivision.
For Hidden Hills (OakKnoll = 0): Y-hat = b0 + b1 SqFt
For Oak Knoll (OakKnoll = 1): Y-hat = (b0 + b2) + b1 SqFt

We fit two parallel equations
[Figure: Price versus Home Size. Price against SqFt for Hidden Hills and Oak Knoll.]

Interval interpretation
The formula for a confidence interval is the same, but we interpret it differently.
Interval: 3.2986 to 63.7780
For a house of a given size, we expect a price roughly 3 to 64 thousand dollars higher if it is in Oak Knoll rather than Hidden Hills.

Hypothesis Test
The hypothesis test is about model form.
H0: β2 = 0 (Do not use two intercepts)
H1: β2 ≠ 0 (Use the two-intercept model)
Test as usual with tstat = b2/Sb2 = 2.340, which is significant.
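The reported interval and t statistic can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted on the slides. The 95% confidence level and the 17 degrees of freedom (20 homes minus three estimated coefficients) are assumptions, since the slides do not state them.

```python
# Values reported on the slides.
b2 = 33.5383                      # coefficient on OakKnoll, in $1000s
t_stat = 2.340                    # reported tstat = b2 / Sb2
ci_low, ci_high = 3.2986, 63.7780

# Back out the standard error Sb2 from the t statistic.
se_b2 = b2 / t_stat

# A symmetric interval has the form b2 +/- multiplier * Sb2.
center = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2
multiplier = half_width / se_b2

print(f"standard error of b2: about {se_b2:.2f}")
print(f"interval center:      {center:.4f}")
print(f"implied multiplier:   about {multiplier:.2f}")
```

The center of the interval equals b2, and the implied multiplier is close to 2.11, the value a 95% t interval with 17 degrees of freedom would use (under the assumptions above). Because the interval excludes zero, it agrees with the test result.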
Extensions
How would we handle more than two neighborhoods (more than two categories)?
What would we do if we suspected the price per square foot differed across neighborhoods? This would imply lines with different slopes.

Multiple categories
Suppose there are C categories. Pick one as your "base" category. Create an indicator variable to adjust each of the other C - 1 categories away from the base.

A time series application
Data are collected over time (say quarterly). If we think the Y variable depends on the calendar, we can do a kind of "seasonal adjustment" by adding quarter indicators.
Q1 = 1 if this was a first quarter, Q2 = 1 if a second quarter, Q3 = 1 if a third quarter. Don't use Q4 since that is the "base."

Back to Oak Knoll
The data file also contains homes in Fox Links, which surround a golf course. To handle the three-category situation, we define another binary variable called FoxLinks. Together, the OakKnoll and FoxLinks indicators break the data into three groups.

Coding Scheme
House Location    Value of OakKnoll    Value of FoxLinks
Oak Knoll         1                    0
Fox Links         0                    1
Hidden Hills      0                    0

The 3-category model
The two indicators now allow us to fit three parallel equations.
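The "one indicator per non-base category" rule can be sketched in a few lines. The helper name indicator_columns and the sample labels are hypothetical; Hidden Hills is taken as the base, so it gets no column of its own.

```python
def indicator_columns(labels, base):
    """Build one 0/1 column per non-base category (C - 1 columns in total)."""
    # dict.fromkeys keeps the first-seen order of the distinct categories.
    categories = [c for c in dict.fromkeys(labels) if c != base]
    return {c: [1 if label == c else 0 for label in labels] for c in categories}


# Hypothetical house locations; Hidden Hills is the base category.
location = ["Oak Knoll", "Fox Links", "Hidden Hills", "Fox Links"]
columns = indicator_columns(location, base="Hidden Hills")

for name, values in columns.items():
    print(name, values)
```

A home in the base category has 0 in every column, which matches the Hidden Hills row of the coding scheme. The same helper would produce Q1, Q2, and Q3 from quarterly labels if "Q4" were passed as the base.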
What we actually fit
[Figure: Price versus Home Size. HomePrice against HomeSize for Fox Links, Hidden Hills, and Oak Knoll.]

Lines with different slopes
Suppose you suspect that price per square foot was higher near the golf course. How would you investigate if that is the case?
What I did was a subset analysis and fit a simple regression to each subdivision individually.
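A subset analysis of this kind amounts to splitting the rows by subdivision and fitting a separate least-squares line to each group. The sketch below uses the textbook simple-regression formulas. The data rows are hypothetical, chosen to lie exactly on straight lines so the result is easy to check, and are not the course data.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical rows: (subdivision, square feet, price in $1000s).
homes = [
    ("Fox Links", 2000, 440.0), ("Fox Links", 3000, 660.0), ("Fox Links", 4000, 880.0),
    ("Hidden Hills", 2000, 400.0), ("Hidden Hills", 3000, 600.0), ("Hidden Hills", 4000, 800.0),
]

# Split the rows by subdivision, then fit each subset on its own.
groups = defaultdict(list)
for name, sqft, price in homes:
    groups[name].append((sqft, price))

fits = {
    name: fit_line([s for s, _ in rows], [p for _, p in rows])
    for name, rows in groups.items()
}

for name, (intercept, slope) in fits.items():
    print(f"{name}: intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Each subdivision gets its own intercept and its own slope, which is what distinguishes this approach from the parallel-lines indicator model.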
Three fitted-line plots
[Figure: three fitted-line plots of price against square feet.
Prices in Fox Links Estates (FoxPrice vs FoxSqFt): y = 0.223 x - 2.398, R2 = 0.886
Prices in Hidden Hills (HHPrice vs HHSqFt): y = 0.208 x - 18.012, R2 = 0.871
Prices in Oak Knoll (OakPrice vs OakSqFt): y = 0.195 x + 54.935, R2 = 0.947]

The three (subset) equations
Subdivision     n     R-square (%)    Intercept    Slope
Fox Links       12    88.6            -2.398       0.223
Oak Knoll       8     94.7            54.935       0.195
Hidden Hills    12    87.1            -18.012      0.208

Judging by the slopes, price per square foot is about 10% higher around the golf course (0.223 versus 0.208 and 0.195). Because of small sample sizes, inference in these models is not very precise.
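The "about 10%" figure can be reproduced from the three slopes in the table. The sketch below compares the Fox Links slope with each of the other two and with their simple average. Using an unweighted average is an assumption made here for illustration, not something the slides specify.

```python
# Slopes from the three subset regressions ($1000s per square foot).
slopes = {"Fox Links": 0.223, "Oak Knoll": 0.195, "Hidden Hills": 0.208}

golf = slopes["Fox Links"]
others = {name: s for name, s in slopes.items() if name != "Fox Links"}

# Percentage premium of the Fox Links slope over each other subdivision.
premium_pct = {name: (golf / s - 1) * 100 for name, s in others.items()}

# Premium over the simple (unweighted) average of the other two slopes.
avg_other = sum(others.values()) / len(others)
premium_vs_avg = (golf / avg_other - 1) * 100

for name, pct in premium_pct.items():
    print(f"Fox Links vs {name}: {pct:.1f}% higher")
print(f"Fox Links vs average of the others: {premium_vs_avg:.1f}% higher")
```

The premium is roughly 7% relative to Hidden Hills and roughly 14% relative to Oak Knoll, so "about 10%" is a fair summary of the two comparisons.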
Fitting a line with a different slope
Create a new X variable:
Let FoxSqFt = SqFt * FoxLinks
            = SqFt when FoxLinks = 1
            = 0 when FoxLinks = 0
This allows us to estimate an "interaction" between size and location.

Estimation results
The coefficient on FoxSqFt is the per-square-foot price premium for living on the golf course. However, it is not significant, so we should probably go back to the previous model.
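The construction of FoxSqFt, and the way its coefficient changes the slope, can be sketched as follows. The homes and both coefficient values are hypothetical; the slides do not report the estimates from this model, and the function names are made up for illustration.

```python
def interaction(sqft, fox_links):
    """FoxSqFt = SqFt * FoxLinks: SqFt for Fox Links homes, 0 otherwise."""
    return [s * f for s, f in zip(sqft, fox_links)]


def slope_per_sqft(fox_links, b_sqft, b_foxsqft):
    """Slope on SqFt implied by the interaction model for one home."""
    return b_sqft + b_foxsqft * fox_links


# Hypothetical homes: sizes and the FoxLinks indicator.
sqft = [2400, 3100, 2800]
fox_links = [1, 0, 1]
fox_sqft = interaction(sqft, fox_links)  # [2400, 0, 2800]

# Hypothetical coefficients on SqFt and FoxSqFt ($1000s per square foot).
b_sqft, b_foxsqft = 0.20, 0.02

print(fox_sqft)
print("slope on the golf course: ", slope_per_sqft(1, b_sqft, b_foxsqft))
print("slope off the golf course:", slope_per_sqft(0, b_sqft, b_foxsqft))
```

The two slopes differ by exactly the FoxSqFt coefficient, so when that coefficient is not significant, a common slope (the parallel-lines model) is the simpler choice.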
What is coming up
The next lectures are about analyzing data collected over time. In some of the models, we will model seasonality in the data by using indicator variables for the months or quarters.