Why is Least Squares the Most Popular?
Xiao-Li Meng
TFs: Casey Wolos and Paul Baines
Statistics, Harvard
April 28, 2009
But first, why are statisticians less popular than computer scientists? Because statisticians cannot afford Cows and Guns, so we have to settle for ...

What are the key differences between Statisticians and Computer Scientists?
- Statisticians are better dressed (but not necessarily better looking)
- To statisticians nothing is impossible (but nothing is sure either)
- Statisticians model online dating (but never date online models)

OK, let's see how statisticians model Least Squares!
Carl Friedrich Gauss (1777-1855)
Courtesy http://math.hope.edu/newsletter/200607/gauss.jpg

Francis Galton (1822-1911)
Courtesy http://en.wikipedia.org/wiki/File:Francis_Galton_1850s.jpg

Courtesy http://en.wikipedia.org/wiki/Moore's_law

GOAL: Find the line that best fits the data. And we mean a straight line, not just any curve!
[Scatter plot: observations vs. TIME]

Common Assumptions underlying Linear Regression
- The conditional mean E(Y|X) is linear in X
- The conditional distributions P(Y|X) are the same, other than their changing means
- All y_i's are independent given the x_i's and the model
Plot taken from The Statistical Sleuth, by Ramsey and Schafer, Duxbury 2002

Simple Linear Regression: Best Fit
What makes a line the "best fit"? We want to collectively minimize the residuals. Suppose that ŷ_i is the value predicted by the line. So should we minimize

  Σ_{i=1}^n (y_i − ŷ_i) ?
  Σ_{i=1}^n |y_i − ŷ_i| ?
  Σ_{i=1}^n (y_i − ŷ_i)² ?
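As a quick numeric illustration (the data and candidate lines here are made up, not from the slides), we can compute all three criteria for a small dataset. Note why the first criterion is a poor choice: positive and negative residuals cancel, so a bad line can look "best":

```python
# Compare the three candidate "total residual" criteria for a toy dataset.
# Hypothetical data and lines, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 4.0]

def criteria(alpha, beta):
    """Return (sum of residuals, sum of |residuals|, sum of squared residuals)."""
    r = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    return sum(r), sum(abs(e) for e in r), sum(e * e for e in r)

# A reasonable line, and a deliberately bad flat line through the mean:
good = criteria(1.2, 0.8)   # follows the trend of the data
flat = criteria(2.5, 0.0)   # ignores x entirely

# The flat line's raw residuals cancel to zero, so the first criterion
# cannot tell the lines apart; the absolute and squared criteria can.
print(good, flat)
```

This is why the deck moves to the absolute and squared versions; the squared one turns out to have the nicest mathematics.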
So how do we find the "best" line?

[Scatter plot: observations vs. TIME]

Simple Linear Regression: Least Squares
[Scatter plot with fitted line: STOCK_PRICE = 1.74·TIME + 32.1; r² = 0.92]

Simple Linear Regression: Least Squares (LS)

[Same scatter plot and fitted line, annotated: Sum of squares = 27.99]

If you were using least squares to fit a regression line and someone asked why, what would you say?
1. Mathematically easy (32%)
2. Computationally efficient (22%)
3. Geometrically interpretable (17%)
4. Everybody else does it (10%)
5. There are some other sophisticated reasons (7%)
6. I have no idea (5%)
7. I was told to do so (7%)

Simple Linear Regression
Regression Model: y_i = α + β·x_i + ε_i (intercept α, slope β, error ε_i)
Fitted Values: ŷ_i = α̂ + β̂·x_i (estimated intercept and slope)
Residuals: ε̂_i = y_i − ŷ_i (we want these to be small!)

LS minimizes Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − α̂ − β̂·x_i)²
A Statistically Sound Reason for Least Squares

But what is the most common distribution for these errors? The least-squares method is closely related to the normal distribution, and we will see how.
Plot taken from The Statistical Sleuth, by Ramsey and Schafer, Duxbury 2002

Modeling the Residuals
The most common assumption for linear regression is that the residuals
- are independent
- follow a normal (Gaussian) distribution
- have the same variance σ²

  ε_i ~ N(0, σ²),  i = 1, ..., n

and since y_i = α + β·x_i + ε_i,

  y_i ~ N(α + β·x_i, σ²)

This is the normal model with homogeneous variance.

Distribution of Y_i given X_i
[Normal density curve centered at μ_i, with axis "value of Y_i" marked at μ_i − 3σ, μ_i − 2σ, μ_i − σ, μ_i, μ_i + σ, μ_i + 2σ, μ_i + 3σ, where μ_i = E{Y_i | X_i} = α + β·X_i and Y_i = μ_i + ε_i]

Normal Density
In general, if Y is normal with mean μ and variance σ², the density at Y = y is

  f_Y(y | μ, σ) = (1/√(2πσ²)) · exp(−(y − μ)²/(2σ²))

Then,

  P(Y ≤ y) = ∫_{−∞}^y f_Y(t) dt = Φ((y − μ)/σ)

Fitting the Model
How do we fit this model?

  y_i ~ N(α + β·x_i, σ²)

We typically use Maximum Likelihood Estimation, selecting the values of α and β (and σ) that maximize the probability (or density) that the data would be observed.
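A minimal sketch of this idea (with made-up toy data lying exactly on the line y = 1 + x): evaluate the normal log-likelihood at two candidate parameter settings and keep the one with the higher value. The true intercept and slope beat a perturbed slope:

```python
import math

# Toy data lying exactly on the line y = 1 + x (hypothetical, for illustration).
xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]

def loglik(alpha, beta, sigma):
    """Normal log-likelihood: sum over i of log N(y_i | alpha + beta*x_i, sigma^2)."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = y - (alpha + beta * x)
        total += -0.5 * math.log(2 * math.pi * sigma ** 2) - r * r / (2 * sigma ** 2)
    return total

ll_true = loglik(1.0, 1.0, 1.0)       # residuals are all zero
ll_perturbed = loglik(1.0, 2.0, 1.0)  # residuals 0, -1, -2

print(ll_true, ll_perturbed)
```

With σ fixed, the two log-likelihoods differ by exactly half the difference in summed squared residuals, which is the connection the next slides make precise.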
Formally, what is a likelihood?

Let P(Y | θ) be the probability (or density) of an event Y given the parameter value θ. The likelihood function L(θ | Y) is defined as

  L(θ | Y) ∝ P(Y | θ)
http://www.malevole.com/mv/misc/killerquiz/

Killer v. Programming Language Inventor
              Inventor   Killer
No Glasses        1         4
Glasses           4         1

  P(Glasses | Inventor) = 4/5 = 0.8
  P(Glasses | Killer) = 1/5 = 0.2

Say that we discover an additional photograph, and the person is wearing glasses. Is he more likely to be a killer or an inventor?

Maximum Likelihood Estimate (MLE)

In our example, the event is Y = {Wearing Glasses} and the parameter θ takes two values:
  θ = 1 if serial killer
  θ = 0 if programming language inventor
The MLE is the value of the parameter that maximizes the likelihood.

  L(θ = 0 | Y) = 0.8 > 0.2 = L(θ = 1 | Y)

In this case, the MLE is θ = 0; our MLE estimate is that the person in the new picture is a programming language inventor, if there's no additional information.

Warning!!
In general,

  P(A | B) ≠ P(B | A)

This is often called the "Prosecutor's Fallacy":

  P(Evidence | Innocent) ≠ P(Innocent | Evidence)
Why should we use the MLE?

1. Because it is most efficient asymptotically (24%)
2. Surely one should use the most likely value! (22%)
3. It sounds cool (20%)
4. Because of the likelihood principle (15%)
5. Because it is the best way to summarize data (10%)
6. Because it uses all the information (5%)
7. Because it minimizes uncertainty (5%)

MLE for the Normal Model
We can apply the same principle to the normal model for linear regression with homogeneous variance:

  y_i ~ N(α + β·x_i, σ²)

Let θ = {α, β, σ}. What values of the parameters maximize the likelihood of observing the data Y?

  (α̂_MLE, β̂_MLE, σ̂_MLE) = argmax_{(α, β, σ)} L(α, β, σ | Y)
MLE for the Normal Model with Equal Variances

  y_i ~ N(α + β·x_i, σ²)

  L(α, β, σ² | y) = Π_{i=1}^n (1/√(2πσ²)) · exp(−(y_i − α − β·x_i)²/(2σ²))

  l(α, β, σ² | y) = −(n/2)·log(2πσ²) − (1/(2σ²)) · Σ_{i=1}^n (y_i − α − β·x_i)²

The MLE is found by maximizing the likelihood, which is the same as minimizing the sum of squared residuals!

  β̂_MLE = Σ_{i=1}^n y_i(x_i − x̄) / Σ_{i=1}^n (x_i − x̄)²    Gauss (1809)
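The closed-form slope can be checked directly in code (a sketch with made-up data on the line y = 1 + x, so the formula should recover slope 1 and intercept 1):

```python
# Closed-form least-squares / MLE slope for simple linear regression:
#   beta_hat = sum_i y_i * (x_i - xbar) / sum_i (x_i - xbar)^2
# Toy data (hypothetical): y = 1 + x exactly.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

beta_hat = sum(y * (x - xbar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
alpha_hat = ybar - beta_hat * xbar  # the usual companion estimate for the intercept

print(alpha_hat, beta_hat)  # recovers intercept 1 and slope 1
```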
An Alternative Model for the Residuals
So far, we assumed that the residuals
- are independent
- follow a normal distribution
- have the same variance σ²

But it may be more realistic to assume that the residuals
- are independent
- follow a normal distribution
- have different variances σ_i²

Maximum Likelihood for Normal Model with Unequal Variances
  y_i ~ N(α + β·x_i, σ_i²)

  L(α, β, σ_1², ..., σ_n² | y) = Π_{i=1}^n (1/√(2πσ_i²)) · exp(−(y_i − α − β·x_i)²/(2σ_i²))

  l(α, β, σ_1², ..., σ_n² | y) = −(1/2)·Σ_{i=1}^n log(2πσ_i²) − (1/2)·Σ_{i=1}^n (y_i − α − β·x_i)²/σ_i²

The MLE is again found by maximizing the likelihood, which is the same as minimizing the sum of WEIGHTED squared residuals!
Weighted Least Squares
When the residuals have unequal variances, we find the MLE for the linear regression by minimizing the sum of WEIGHTED squared residuals, if the σ_i are known:

  β̂_MLE^(w) = Σ_{i=1}^n y_i(x_i − x̄_w)/σ_i² / Σ_{i=1}^n (x_i − x̄_w)²/σ_i²

where x̄_w = (Σ_i x_i/σ_i²)/(Σ_i 1/σ_i²) is the precision-weighted average of the x_i. Note that if the variances are equal, they cancel (and x̄_w reduces to x̄), and we obtain the earlier MLE for homogeneous variance.
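A small sketch of the weighted slope (data and known variances made up for illustration). Because the toy data lie exactly on a line, any choice of variances should recover the same slope, and equal variances reduce to the ordinary formula:

```python
# Weighted least-squares slope with known variances sigma_i^2.
# Toy data (hypothetical): y = 1 + x exactly.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]

def wls_slope(sigma2s):
    """Slope minimizing sum_i (y_i - a - b*x_i)^2 / sigma_i^2."""
    w = [1.0 / s2 for s2 in sigma2s]                       # precisions as weights
    xbar_w = sum(wi * x for wi, x in zip(w, xs)) / sum(w)  # weighted mean of x
    num = sum(wi * y * (x - xbar_w) for wi, x, y in zip(w, xs, ys))
    den = sum(wi * (x - xbar_w) ** 2 for wi, x in zip(w, xs))
    return num / den

equal = wls_slope([4.0, 4.0, 4.0])    # equal variances: they cancel
unequal = wls_slope([1.0, 1.0, 9.0])  # last point much noisier

print(equal, unequal)  # both recover slope 1 on perfectly linear data
```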
Fitting the Model
- We have been assuming that the σ_i² are known
- When the σ_i²'s are unequal and unknown, estimating these values may not be straightforward
- If every y_i has a different unknown σ_i², it will be impossible to estimate all of these variances!
- If we model the unknown variances, we may need iteratively reweighted least squares
Dealing with Outliers
When some of the residuals are larger than usual, one of two approaches may be appropriate, depending on the context:
- model the residuals with different variances (heteroskedasticity) and use weighted least squares, as we discussed, or
- assume that the residuals share the same distribution, but that the distribution has heavier tails than the normal

In particular, we'll consider the t distribution as a model for the residuals.
Another Model for the Residuals
We've been assuming that

  y_i = α + β·x_i + ε_i,   ε_i ~ N(0, σ²),   i.e.   ε_i/σ ~ N(0, 1)

Instead of assuming that the standardized residuals follow a normal distribution, we may want to assume that they follow a t distribution:

  ε_i/σ ~ t_d

A t distribution has heavier tails than a normal distribution, but as the degrees of freedom increase, the t increasingly resembles the normal.

[Density plot: standard normal vs. t with df = 30, 10, 1; P(−2 < Y < 2) is 0.954 for the standard normal, 0.945 for t with df = 30, 0.927 for df = 10, and 0.705 for df = 1]
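The P(−2 < Y < 2) values in the figure can be reproduced by numerically integrating the normal and t densities (a sketch using the standard density formulas; the trapezoid grid size is an arbitrary choice):

```python
import math

def normal_pdf(y):
    """Standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

def t_pdf(y, d):
    """Student-t density with d degrees of freedom."""
    c = math.gamma((d + 1) / 2) / (math.sqrt(d * math.pi) * math.gamma(d / 2))
    return c * (1 + y * y / d) ** (-(d + 1) / 2)

def prob_between(pdf, lo=-2.0, hi=2.0, n=20000):
    """Integrate pdf over (lo, hi) with a simple trapezoid rule."""
    h = (hi - lo) / n
    total = 0.5 * (pdf(lo) + pdf(hi))
    for k in range(1, n):
        total += pdf(lo + k * h)
    return total * h

print(round(prob_between(normal_pdf), 3))              # 0.954
print(round(prob_between(lambda y: t_pdf(y, 30)), 3))  # 0.945
print(round(prob_between(lambda y: t_pdf(y, 10)), 3))  # 0.927
print(round(prob_between(lambda y: t_pdf(y, 1)), 3))   # 0.705
```

As the figure suggests, the df = 30 curve is already almost indistinguishable from the normal, while df = 1 (the Cauchy) puts far more mass in the tails.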
Where does the t-distribution come from?
To understand where this distribution comes from, let's begin with a standard normal random variable:

  Z ~ N(0, 1)

From Normal to Chi-square with 1 df

If you square it, you get a chi-square distribution with one degree of freedom:

  Z² ~ χ²_(1)

From Chi-square df=1 to df=d

Add up d independent 1-df chi-squared variables and you get a chi-square variable with d degrees of freedom:

  χ²_(1),1 + χ²_(1),2 + ... + χ²_(1),d = χ²_(d)

From Chi-square to t

Finally, if Z ~ N(0, 1) and χ²_d is independent of Z, then the t distribution with d degrees of freedom is

  t_d = Z / √(χ²_d / d)
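This construction can be sanity-checked by simulation (a sketch; the sample size and seed are arbitrary choices): build t draws from a normal draw divided by the square root of an independent chi-square over its df, then compare the empirical P(−2 < T < 2) with the values quoted in the earlier figure.

```python
import random

random.seed(0)

def chi2(d):
    """Chi-square with d df: sum of d squared independent standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d))

def t_draw(d):
    """t with d df: Z / sqrt(chi2_d / d), with Z independent of the chi-square."""
    z = random.gauss(0.0, 1.0)
    return z / (chi2(d) / d) ** 0.5

n = 50_000
for d, target in [(10, 0.927), (1, 0.705)]:
    frac = sum(1 for _ in range(n) if -2 < t_draw(d) < 2) / n
    print(d, round(frac, 3))  # close to the target probability
```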
t-distribution
Formally, when (Y − μ)/σ has a t-distribution with d degrees of freedom, the density of Y at y is

  f_Y(y | μ, σ) ∝ (1 + (1/d)·((y − μ)/σ)²)^(−(d+1)/2)

How do we estimate the parameters for this model?

As before, we write down the likelihood:

  L(α, β, σ | y) ∝ σ^(−n) · Π_{i=1}^n (1 + (1/d)·((y_i − α − β·x_i)/σ)²)^(−(d+1)/2)

  l(α, β, σ | y) = −n·log(σ) − ((d+1)/2) · Σ_{i=1}^n log(d + ((y_i − α − β·x_i)/σ)²) + constant

Again as before, we look for the values that maximize the likelihood, which is the same as minimizing the sum-of-logs term above.

Re-expressing the t-regression
The t-regression model is given by:

  y_i = α + β·x_i + ε_i

where:

  ε_i = σ·Z_i / √(q_i),   Z_i ~ N(0, 1),   q_i ~ χ²_d / d,   Z_i ⊥ q_i

The conditional distribution of y given q is:

  y_i | q_i ~ N(α + β·x_i, σ²/q_i)

Same as normal regression with σ_i² = σ²/q_i.
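The scale-mixture representation above can also be checked by simulation (a sketch; the df, σ, sample size, and seed are arbitrary): draws of σ·Z/√q should behave like a scaled t_d, whose variance is σ²·d/(d − 2) for d > 2.

```python
import random

random.seed(0)

d, sigma, n = 10, 2.0, 100_000

def mixture_draw():
    """One residual from the scale mixture: sigma * Z / sqrt(q), q ~ chi2_d / d."""
    z = random.gauss(0.0, 1.0)
    q = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d)) / d
    return sigma * z / q ** 0.5

draws = [mixture_draw() for _ in range(n)]
mean = sum(draws) / n
var = sum((e - mean) ** 2 for e in draws) / n

# A scaled t_d has mean 0 and variance sigma^2 * d / (d - 2) = 4 * 10 / 8 = 5 here.
print(mean, var)
```

Conditioning on q makes each draw normal with variance σ²/q, which is exactly why the EM machinery on the next slides reduces to reweighted normal regressions.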
The EM Algorithm via Data Augmentation
By treating the (q_1, ..., q_n) as missing data, the Expectation-Maximization (EM) algorithm works as follows:
- The E-Step: Fill in the "missing data" with its conditional expectation given the observed data and the parameter estimate from the previous iteration.
- The M-Step: Maximize the "imputed" complete-data log-likelihood function with respect to the parameters.

EM: Iteratively Reweighted Least Squares (IRLS)

Pick starting values: θ^(0) = (α^(0), β^(0), (σ²)^(0))

E-Step:

  w_i^(t+1) = E(q_i | (y_i, x_i), θ^(t)) = (d + 1) / (d + (y_i − ŷ_i^(t))²/(σ²)^(t)),   where   ŷ_i^(t) = α^(t) + β^(t)·x_i

M-Step (with x̄ and ȳ denoting the w^(t+1)-weighted averages of the x_i and y_i):

  β^(t+1) = Σ_{i=1}^n w_i^(t+1)·y_i·(x_i − x̄) / Σ_{i=1}^n w_i^(t+1)·(x_i − x̄)²

  α^(t+1) = ȳ − β^(t+1)·x̄

  (σ²)^(t+1) = (1/n) · Σ_{i=1}^n w_i^(t+1)·(y_i − ŷ_i^(t+1))²

An Even Better EM
In fact, by considering a rescaled "missing data", we can construct a better EM with no extra computation. The resulting algorithm is identical to IRLS except that it replaces

  (σ²)^(t+1) = (1/n) · Σ_{i=1}^n w_i^(t+1)·(y_i − ŷ_i^(t+1))²

by

  (σ²)^(t+1) = Σ_{i=1}^n w_i^(t+1)·(y_i − ŷ_i^(t+1))² / Σ_{i=1}^n w_i^(t+1)

Let's see how it works...
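The whole algorithm fits in a short script. This is a sketch, not the course code: the data, degrees of freedom d, starting values, and iteration count are all arbitrary choices, and the `better_em` flag switches between the plain variance update (divide by n) and the rescaled one (divide by the sum of weights) from the last slide.

```python
def irls_t_regression(xs, ys, d=4, iters=200, better_em=True):
    """EM / IRLS for simple linear regression with t_d errors.

    E-step: w_i = (d + 1) / (d + (y_i - yhat_i)^2 / sigma2)
    M-step: weighted least squares with weights w_i, then update sigma2.
    """
    n = len(xs)
    # Starting values: ordinary least squares.
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    beta = sum(y * (x - xbar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    alpha = ybar - beta * xbar
    sigma2 = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys)) / n

    w = [1.0] * n
    for _ in range(iters):
        # E-step: expected precision factor q_i under the current parameters.
        w = [(d + 1) / (d + (y - alpha - beta * x) ** 2 / sigma2)
             for x, y in zip(xs, ys)]
        # M-step: weighted least squares with the current weights.
        sw = sum(w)
        xw = sum(wi * x for wi, x in zip(w, xs)) / sw
        yw = sum(wi * y for wi, y in zip(w, ys)) / sw
        beta = (sum(wi * y * (x - xw) for wi, x, y in zip(w, xs, ys))
                / sum(wi * (x - xw) ** 2 for wi, x in zip(w, xs)))
        alpha = yw - beta * xw
        rss = sum(wi * (y - alpha - beta * x) ** 2 for wi, x, y in zip(w, xs, ys))
        sigma2 = rss / sw if better_em else rss / n
    return alpha, beta, sigma2, w

# Hypothetical data: roughly y = 1 + 2x, with one gross outlier at the end.
noise = [0.05, -0.12, 0.08, -0.03, 0.1, -0.07, 0.02, -0.09, 0.06, 0.0]
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]
ys[-1] += 20.0

alpha, beta, sigma2, w = irls_t_regression(xs, ys)
print(alpha, beta)
print(min(w) == w[-1])  # the outlier gets by far the smallest weight
```

Because the t model downweights large residuals, the fitted slope stays near 2 despite the outlier, whereas a plain least-squares fit would be pulled well above it. Both variance updates share the same fixed point; the rescaled version typically gets there in fewer iterations.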
Algorithms/Procedures

  Model                                            Algorithm
  Normal regression with homogeneous variances     Least Squares
  Normal regression with heterogeneous variances   Reweighted Least Squares
  t-regression                                     Iteratively Reweighted Least Squares

STAT 105 Real-Life Statistics: Your Chance for Happiness (or Misery)?
Spring 2010: Empirical and Mathematical Reasoning 16
Hope to see you there!
Stat 105 Grand Finale Guest Speaker
The Netflix Prize: The Quest for $1,000,000
Dr. Robert Bell, AT&T Labs
Wednesday, April 29, 2009, 12:30 pm
Science Center A
This note was uploaded on 07/26/2009 for the course COMPUTERSC CS51, taught by Professor Greg Morrisett during the Spring '09 term at Harvard.