Lecture 22: Why is Least-Squares the Most Popular?
Xiao-Li Meng (TFs: Casey Wolos and Paul Baines), Statistics, Harvard, April 28, 2009

But first, why are statisticians less popular than computer scientists? Because statisticians cannot afford Cows and Guns, so we have to settle for ...

What are the key differences between Statisticians and Computer Scientists?
- Statisticians are better dressed (but not necessarily better looking).
- To statisticians nothing is impossible (but nothing is sure either).
- Statisticians model on-line dating (but never date on-line models).
OK, let's see how statisticians model Least Squares!

Carl Friedrich Gauss (1777-1855)
[Portrait courtesy http://math.hope.edu/newsletter/2006-07/gauss.jpg]

Francis Galton (1822-1911)
[Portrait courtesy http://en.wikipedia.org/wiki/File:Francis_Galton_1850s.jpg]

GOAL: Find the line that best fits the data.
[Transistor-count plot courtesy http://en.wikipedia.org/wiki/Moore's_law]

We mean a straight line, not just any curve!
[Scatter plot of the data against TIME, with a line overlaid.]

Common Assumptions underlying Linear Regression
- The conditional mean E(Y|X) is linear in X.
- The conditional distributions P(Y|X) are the same, other than the changing means.
- All Y_i's are independent given the X_i's and the model.
[Plot taken from The Statistical Sleuth, by Ramsey and Schafer, Duxbury 2002]

Simple Linear Regression: Best Fit
What makes a line the "best fit"? We want to collectively minimize the residuals. Suppose that \hat{y}_i is the value predicted by the line. So should we minimize
    \sum_{i=1}^{n} (y_i - \hat{y}_i) ?
    \sum_{i=1}^{n} |y_i - \hat{y}_i| ?
    \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ?

So how do we find the "best" line?

Simple Linear Regression: Least Squares (LS)
[Scatter plot with the fitted line STOCK_PRICE = 1.74*TIME + 32.1; r^2 = 0.92; sum of squares = 27.99.]

If you were using least squares to fit a regression line and someone asked why, what would you say? (Clicker poll results)
1. Mathematically easy: 32%
2. Computationally efficient: 22%
3. Geometrically interpretable: 17%
4. Everybody else does it: 10%
5. There are some other sophisticated reasons: 7%
6. I have no idea: 5%
7. I was told to do so: 7%

Simple Linear Regression
Regression model, with intercept \alpha, slope \beta, and error \epsilon_i:
    y_i = \alpha + \beta x_i + \epsilon_i
Fitted values, with estimated intercept \hat{\alpha} and slope \hat{\beta}:
    \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i
Residuals (we want these to be small!):
    \hat{\epsilon}_i = y_i - \hat{y}_i
LS minimizes
    \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2.
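To make the least-squares criterion concrete, here is a minimal Python sketch (not part of the original slides; it assumes NumPy and uses made-up toy numbers rather than the data plotted above). It fits a line by minimizing the sum of squared residuals and reports that sum.

import numpy as np

# Toy data, for illustration only (not the lecture's data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([33.5, 36.1, 37.0, 39.8, 41.2, 42.9, 45.0, 46.3])

# Least squares: choose intercept and slope to minimize sum((y - a - b*x)^2).
# A degree-1 polynomial fit with np.polyfit does exactly this.
b_hat, a_hat = np.polyfit(x, y, 1)   # returns [slope, intercept]

y_hat = a_hat + b_hat * x            # fitted values
residuals = y - y_hat                # estimated residuals
ss_res = np.sum(residuals ** 2)      # the quantity LS minimizes

print(f"fitted line: y = {a_hat:.2f} + {b_hat:.2f} * x")
print(f"sum of squared residuals: {ss_res:.3f}")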
A Statistically Sound Reason for Least Squares
But what is the most common distribution for these errors? The least-squares method is closely related to the normal distribution, and we will see how.
[Plot taken from The Statistical Sleuth, by Ramsey and Schafer, Duxbury 2002]

Modeling the Residuals
The most common assumption for linear regression is that the residuals
- are independent,
- follow a normal (Gaussian) distribution,
- have the same variance \sigma^2:
    \epsilon_i \sim N(0, \sigma^2),  i = 1, ..., n,
and since y_i = \alpha + \beta x_i + \epsilon_i,
    y_i \sim N(\alpha + \beta x_i, \sigma^2).
This is the normal model with homogeneous variance.

Distribution of Y_i given X_i
    \mu_i = E(Y_i | X_i) = \alpha + \beta X_i,    Y_i = \mu_i + \epsilon_i
[Figure: normal density of Y_i centered at \mu_i, with tick marks at \mu_i \pm \sigma, \mu_i \pm 2\sigma, and \mu_i \pm 3\sigma.]

Normal Density
In general, if Y is normal with mean \mu and variance \sigma^2, the density at Y = y is
    f_Y(y | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y - \mu)^2}.
Then
    P(Y \le y) = \int_{-\infty}^{y} f_Y(t) \, dt = \Phi\left(\frac{y - \mu}{\sigma}\right).

Fitting the Model
How do we fit the model y_i \sim N(\alpha + \beta x_i, \sigma^2)? We typically use Maximum Likelihood Estimation, selecting the values of \alpha and \beta (and \sigma) that maximize the probability (or density) that the data would be observed.

Formally, what is a likelihood?
Let P(Y | \theta) be the probability (or density) of an event Y given the parameter value \theta. The likelihood function L(\theta | Y) is defined as
    L(\theta | Y) \propto P(Y | \theta).

Killer v. Programming Language Inventor
(Quiz: http://www.malevole.com/mv/misc/killerquiz/)
                                  No Glasses   Glasses
    Killer                            4           1
    Programming Language Inventor     1           4
    P(Glasses | Inventor) = 4/5 = 0.8
    P(Glasses | Killer) = 1/5 = 0.2
Say that we discover an additional photograph, and the person is wearing glasses. Is he more likely to be a killer or an inventor?

Maximum Likelihood Estimate (MLE)
In our example, the event is Y = {Wearing Glasses} and the parameter \theta takes two values: \theta = 1 if serial killer, \theta = 0 if programming language inventor. The MLE is the value of the parameter that maximizes the likelihood:
    L(\theta = 0 | Y) = 0.8 > 0.2 = L(\theta = 1 | Y).
In this case, the MLE is \theta = 0; our MLE estimate is that the person in the new picture is a programming language inventor, if there's no additional information.

Warning!!
In general, P(A | B) \ne P(B | A). This is often called the "Prosecutor's Fallacy":
    P(Evidence | Innocent) \ne P(Innocent | Evidence).

Why should we use the MLE? (Clicker poll results)
1. Because it is most efficient asymptotically: 24%
2. Surely one should use the most likely value!: 22%
3. It sounds cool: 20%
4. Because of the likelihood principle: 15%
5. Because it is the best way to summarize data: 10%
6. Because it uses all the information: 5%
7. Because it minimizes uncertainty: 5%

MLE for the Normal Model
We can apply the same principle to the normal model for linear regression with homogeneous variance:
    y_i \sim N(\alpha + \beta x_i, \sigma^2).
Let \theta = \{\alpha, \beta, \sigma\}. What values of the parameters maximize the likelihood of observing the data Y?
    (\hat{\alpha}_{MLE}, \hat{\beta}_{MLE}, \hat{\sigma}_{MLE}) = \arg\max_{(\alpha, \beta, \sigma)} L(\alpha, \beta, \sigma | Y)

MLE for the Normal Model with Equal Variances
With y_i \sim N(\alpha + \beta x_i, \sigma^2),
    L(\alpha, \beta, \sigma^2 | y) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2\right\},
    l(\alpha, \beta, \sigma^2 | y) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.
The MLE is found by maximizing the likelihood, which is the same as minimizing the sum of squared residuals! In particular,
    \hat{\beta}_{MLE} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (Gauss, 1809).
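As a sanity check on the claim that the MLE under this normal model coincides with least squares, here is a small Python sketch (not from the slides; it assumes NumPy and SciPy and reuses the same made-up toy data as before). It maximizes the normal log-likelihood numerically and compares the answer with the closed-form least-squares slope.

import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([33.5, 36.1, 37.0, 39.8, 41.2, 42.9, 45.0, 46.3])

def neg_log_lik(theta):
    """Negative log-likelihood of the model y_i ~ N(a + b*x_i, s^2)."""
    a, b, log_s = theta                    # sigma parametrized on the log scale
    s2 = np.exp(2.0 * log_s)
    resid = y - a - b * x
    n = len(y)
    return 0.5 * n * np.log(2.0 * np.pi * s2) + 0.5 * np.sum(resid ** 2) / s2

fit = minimize(neg_log_lik, x0=np.array([0.0, 1.0, 0.0]))
a_mle, b_mle = fit.x[0], fit.x[1]

# Closed-form least-squares estimates (the Gauss 1809 slope formula above)
b_ls = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
a_ls = y.mean() - b_ls * x.mean()

print("MLE:          ", a_mle, b_mle)
print("Least squares:", a_ls, b_ls)

Up to numerical tolerance the two pairs of estimates agree, which is exactly the point of the slide.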
An Alternative Model for the Residuals
So far, we assumed that the residuals
- are independent,
- follow a normal distribution,
- have the same variance \sigma^2.
But it may be more realistic to assume that the residuals
- are independent,
- follow a normal distribution,
- have different variances \sigma_i^2.

Maximum Likelihood for the Normal Model with Unequal Variances
With y_i \sim N(\alpha + \beta x_i, \sigma_i^2),
    L(\alpha, \beta, \sigma_1^2, ..., \sigma_n^2 | y) = \left[\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}}\right] \exp\left\{-\frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - \alpha - \beta x_i)^2}{\sigma_i^2}\right\},
    l(\alpha, \beta, \sigma_1^2, ..., \sigma_n^2 | y) = -\frac{1}{2} \sum_{i=1}^{n} \log(2\pi\sigma_i^2) - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - \alpha - \beta x_i)^2}{\sigma_i^2}.
The MLE is again found by maximizing the likelihood, which is the same as minimizing the sum of WEIGHTED squared residuals!

Weighted Least Squares
When the residuals have unequal variances, we find the MLE for the linear regression by minimizing the sum of WEIGHTED squared residuals, if the \sigma_i^2 are known:
    \hat{\beta}_{MLE}^{(w)} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x}) / \sigma_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 / \sigma_i^2}.
Note that if the variances are equal, they cancel, and we obtain the earlier MLE for homogeneous variance.

Fitting the Model
We have been assuming that the \sigma_i^2 are known. When the \sigma_i^2's are unequal and unknown, estimating these values may not be straightforward. If every y_i has a different unknown \sigma_i^2, it will be impossible to estimate all of these variances! If we model the unknown variances, we may need iteratively reweighted least squares.

Dealing with Outliers
When some of the residuals are larger than usual, one of two approaches may be appropriate, depending on the context:
- model the residuals with different variances (heteroskedasticity) and use weighted least squares, as we discussed; or
- assume that the residuals share the same distribution, but that the distribution has heavier tails than the normal.
In particular, we'll consider the t distribution as a model for the residuals.

Another Model for the Residuals
We've been assuming that \epsilon_i \sim N(0, \sigma^2), i.e., \epsilon_i / \sigma \sim N(0, 1). Instead of assuming that the residuals follow a normal distribution, we may want to assume that they follow a t distribution:
    \epsilon_i / \sigma \sim t_d,    y_i = \alpha + \beta x_i + \epsilon_i.
A t distribution has heavier tails than a normal distribution, but as the degrees of freedom increase, the t increasingly resembles the normal.

[Figure: densities of the standard normal and of t distributions with df = 30, 10, and 1. The central probability P(-2 < Y < 2) is 0.954 for the standard normal, 0.945 for the t with df = 30, 0.927 for the t with df = 10, and 0.705 for the t with df = 1.]
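The central probabilities quoted in the figure are easy to reproduce; the following sketch (not from the slides; it assumes scipy.stats) computes P(-2 < Y < 2) under the standard normal and under t distributions with 30, 10, and 1 degrees of freedom.

from scipy.stats import norm, t

# P(-2 < Y < 2) under the standard normal: about 0.954
print("standard normal:", norm.cdf(2) - norm.cdf(-2))

# The same central probability under heavier-tailed t distributions:
# about 0.945 (df = 30), 0.927 (df = 10), and 0.705 (df = 1, the Cauchy case)
for df in (30, 10, 1):
    print(f"t, df = {df}:", t.cdf(2, df) - t.cdf(-2, df))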
Where does the t-distribution come from?
To understand where this distribution comes from, let's begin with a standard normal random variable:
    Z \sim N(0, 1).

From Normal to Chi-square with 1 df
If you square it, you get a chi-square distribution with one degree of freedom:
    Z^2 \sim \chi^2_{(1)}.

From Chi-square with df = 1 to df = d
Add up d independent 1-df chi-square variables and you get a chi-square variable with d degrees of freedom:
    \chi^2_{1,(1)} + \chi^2_{2,(1)} + \cdots + \chi^2_{d,(1)} = \chi^2_{(d)}.

From Chi-square to t
Finally, if Z \sim N(0, 1) and \chi^2_d is independent of Z, then the t distribution with d degrees of freedom is
    t_d = \frac{Z}{\sqrt{\chi^2_d / d}}.

t-distribution
Formally, when (Y - \mu)/\sigma has a t-distribution with d degrees of freedom, the density of Y at y is
    f_Y(y | \mu, \sigma) \propto \frac{1}{\sigma} \left[ d + \left(\frac{y - \mu}{\sigma}\right)^2 \right]^{-\frac{d+1}{2}}.

How do we estimate the parameters for this model?
As before, we write down the likelihood:
    L(\alpha, \beta, \sigma | y) \propto \prod_{i=1}^{n} \frac{1}{\sigma} \left[ d + \left(\frac{y_i - \alpha - \beta x_i}{\sigma}\right)^2 \right]^{-\frac{d+1}{2}},
    l(\alpha, \beta, \sigma | y) = -n \log(\sigma) - \frac{d+1}{2} \sum_{i=1}^{n} \log\left[ d + \left(\frac{y_i - \alpha - \beta x_i}{\sigma}\right)^2 \right] + \text{constant}.
Again as before, we look for the values that maximize the likelihood, which is the same as minimizing the sum of log terms above (the expression circled on the slide).

Re-expressing the t-regression
The t-regression model is given by
    y_i = \alpha + \beta x_i + \epsilon_i,    where \epsilon_i = \sigma Z_i / \sqrt{q_i},
    Z_i \sim N(0, 1),    q_i \sim \chi^2_d / d,    Z_i independent of q_i.
The conditional distribution of y given q is
    y_i | q_i \sim N(\alpha + \beta x_i, \sigma^2 / q_i),
which is the same as normal regression with \sigma_i^2 = \sigma^2 / q_i.

The EM Algorithm via Data Augmentation
By treating (q_1, ..., q_n) as missing data, the Expectation-Maximization (EM) algorithm works as follows:
- The E-Step: Fill in the "missing data" with its conditional expectation given the observed data and the parameter estimate from the previous iteration.
- The M-Step: Maximize the "imputed" complete-data log-likelihood function with respect to the parameters.

EM: Iteratively Reweighted Least Squares (IRLS)
Pick starting values \theta^{(0)} = (\alpha^{(0)}, \beta^{(0)}, (\sigma^2)^{(0)}).
E-Step:
    w_i^{(t+1)} = E(q_i | (y_i, x_i), \theta^{(t)}) = \frac{d + 1}{d + (y_i - \hat{y}_i^{(t)})^2 / (\sigma^2)^{(t)}},    where \hat{y}_i^{(t)} = \alpha^{(t)} + \beta^{(t)} x_i.
M-Step:
    \beta^{(t+1)} = \frac{\sum_{i=1}^{n} w_i^{(t+1)} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} w_i^{(t+1)} (x_i - \bar{x})^2},    \alpha^{(t+1)} = \bar{y} - \beta^{(t+1)} \bar{x},
    (\sigma^2)^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} w_i^{(t+1)} (y_i - \hat{y}_i^{(t+1)})^2.

An Even Better EM
In fact, by considering a re-scaled "missing data", we can construct a better EM with no extra computation. The resulting algorithm is identical to IRLS except that it replaces (\sigma^2)^{(t+1)} above by
    (\sigma^2)^{(t+1)} = \frac{\sum_{i=1}^{n} w_i^{(t+1)} (y_i - \hat{y}_i^{(t+1)})^2}{\sum_{i=1}^{n} w_i^{(t+1)}}.
Let's see how it works...
[Demonstration plots comparing the two EM algorithms.]
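Here is a minimal Python sketch of the E- and M-steps above (not from the slides; it assumes NumPy, treats the degrees of freedom d as known, uses weighted means in the weighted least-squares update, and the toy data and d = 4 are arbitrary illustrative choices).

import numpy as np

def t_regression_irls(x, y, d, n_iter=50):
    """EM/IRLS for t-regression, y_i = a + b*x_i + sigma * t_d error, with d known."""
    n = len(y)
    # Starting values: ordinary least squares and its residual variance
    b, a = np.polyfit(x, y, 1)
    s2 = np.mean((y - a - b * x) ** 2)
    for _ in range(n_iter):
        # E-step: expected weights given the current parameter values
        r = y - a - b * x
        w = (d + 1.0) / (d + r ** 2 / s2)
        # M-step: weighted least squares for (a, b), using weighted means
        xw = np.sum(w * x) / np.sum(w)
        yw = np.sum(w * y) / np.sum(w)
        b = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
        a = yw - b * xw
        r = y - a - b * x
        # Variance update from the IRLS slide; the "even better EM" would
        # divide by np.sum(w) instead of n.
        s2 = np.sum(w * r ** 2) / n
    return a, b, s2

# Toy data with one gross outlier (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=x.size)
y[5] += 8.0                                   # an outlier that would distort plain LS
print(t_regression_irls(x, y, d=4))

Compared with ordinary least squares on the same data, the downweighting of the outlier through w_i is what makes the t-regression fit robust.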
Models and the Algorithms/Procedures That Fit Them
- Normal regression with homogeneous variances: Least Squares
- Normal regression with heterogeneous variances: Reweighted Least Squares
- t-regression: Iteratively Reweighted Least Squares

STAT 105, Real-Life Statistics: Your Chance for Happiness (or Misery)?
Spring 2010 (Empirical and Mathematical Reasoning 16). Hope to see you there!

Stat 105 Grand Finale Guest Speaker
The Netflix Prize: The Quest for $1,000,000
Dr. Robert Bell, AT&T Labs
Wednesday, April 29, 2009, 1-2:30 pm, Science Center A
