lecture20

COMPUTER SCIENCE 51
Spring 2009
cs51.seas.harvard.edu
Prof. Greg Morrisett
Prof. Ramin Zabih
4/21/2009

Least Median Squares
  Find the “thinnest ruler” that covers more than half the data
  [Figure: plot with axes labeled MPT and Speed]

Code for LS vs LMedS

    ;; sample data set
    (define data
      (list '(15.85 235) '(15.69 280) '(15.38 360) '(15.3 442) '(14.84 528)))

    ;; helper not shown on the slide
    (define (square x) (* x x))

    ;; squared vertical residuals of the line y = m*x + b
    (define (squared-errors m b data)
      (let* ([predict (lambda (x) (+ (* m x) b))]
             [residual (lambda (pt) (- (predict (car pt)) (cadr pt)))])
        (map square (map residual data))))

    ;; middle element of the sorted list (upper middle for even lengths)
    (define (median l)
      (list-ref (sort l <) (quotient (length l) 2)))

    (define (sum-list l) (foldr + 0 l))

    ;; LS criterion: sum of the squared errors
    (define (data-err-ssem m b data)
      (sum-list (squared-errors m b data)))

    ;; LMedS criterion: median of the squared errors
    (define (data-err-med m b data)
      (median (squared-errors m b data)))

LMedS example
  With 5 points, minimize the 3rd largest (and 3rd smallest) squared error
  Equivalently: find the thinnest ruler that covers 3 of the 5 points
    Where “thin” is measured vertically
    Our line is the center of the ruler
  Obvious questions:
    Is this a good idea?
    Is it easy to find the best line?

Minimal example
  [Figure: plot with axes labeled MPT and Speed; legend: LS fit, LMedS fit]

Application: baseball
  Model fitting: is the Yankees defense getting worse?
  Extrapolation: how many runs will the Yankees give up next inning?
  [Figure: plot with axes labeled Runs and Inning]
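Returning to the LS vs LMedS code and the question of whether the best line is easy to find: the sketch below tries one simple heuristic, not taken from the slides. It considers only the lines that pass through two of the data points and keeps the one with the smallest median squared error. This is an assumption made for illustration and is not guaranteed to find the true LMedS minimizer. The helpers `line-through`, `candidate-lines`, and `best-lmeds-candidate` are hypothetical names; the error functions are restated (with `list-ref` and an explicit `square`) so the block runs on its own.

```racket
#lang racket
;; Sample data from the lecture: (x y) pairs.
(define data
  (list '(15.85 235) '(15.69 280) '(15.38 360) '(15.3 442) '(14.84 528)))

(define (square x) (* x x))

;; Squared vertical residuals of the line y = m*x + b.
(define (squared-errors m b data)
  (map (lambda (pt) (square (- (+ (* m (car pt)) b) (cadr pt)))) data))

;; Middle element of the sorted list (upper middle for even lengths).
(define (median l)
  (list-ref (sort l <) (quotient (length l) 2)))

(define (data-err-med m b data)
  (median (squared-errors m b data)))

;; Hypothetical helper: the line through points p and q, as (m b).
;; Assumes p and q have different x coordinates.
(define (line-through p q)
  (let* ([m (/ (- (cadr q) (cadr p)) (- (car q) (car p)))]
         [b (- (cadr p) (* m (car p)))])
    (list m b)))

;; One candidate line per pair of points with distinct x values.
(define (candidate-lines data)
  (for*/list ([p data] [q data] #:when (< (car p) (car q)))
    (line-through p q)))

;; Heuristic: the candidate with the smallest median squared error.
(define (best-lmeds-candidate data)
  (argmin (lambda (line) (data-err-med (car line) (cadr line) data))
          (candidate-lines data)))

(best-lmeds-candidate data)
```

With 5 points there are 10 candidate lines, so exhaustive comparison is cheap here. The number of pairs grows quadratically with the size of the data set, which is one reason the "is it easy?" question is worth asking.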

Effect of outliers
  http://newyork.yankees.mlb.com/media/video.jsp?content_id=4208413
  [Figure: plot with axes labeled Runs and Inning]

Two intermixed lines
  What about cases like this?
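To make the effect of outliers concrete, here is a minimal sketch on made-up data (not the baseball data from the slides). Five points lie exactly on y = x, and then one of them is corrupted. The function `ls-fit` uses the standard closed-form least-squares formulas for slope and intercept; its name and the two data sets are assumptions introduced for this illustration.

```racket
#lang racket
(define (square x) (* x x))
(define (sum-list l) (foldr + 0 l))

;; Squared vertical residuals of the line y = m*x + b.
(define (squared-errors m b data)
  (map (lambda (pt) (square (- (+ (* m (car pt)) b) (cadr pt)))) data))

;; Middle element of the sorted list (upper middle for even lengths).
(define (median l)
  (list-ref (sort l <) (quotient (length l) 2)))

;; Closed-form least-squares line for (x y) pairs, returned as (m b).
(define (ls-fit data)
  (let* ([n (length data)]
         [sx (sum-list (map car data))]
         [sy (sum-list (map cadr data))]
         [sxy (sum-list (map (lambda (pt) (* (car pt) (cadr pt))) data))]
         [sxx (sum-list (map (lambda (pt) (square (car pt))) data))]
         [m (/ (- (* n sxy) (* sx sy)) (- (* n sxx) (square sx)))]
         [b (/ (- sy (* m sx)) n)])
    (list m b)))

;; Made-up data: points on y = x, then the last point turned into an outlier.
(define clean '((1 1) (2 2) (3 3) (4 4) (5 5)))
(define with-outlier '((1 1) (2 2) (3 3) (4 4) (5 15)))

(ls-fit clean)                                   ; => '(1 0)
(ls-fit with-outlier)                            ; => '(3 -4)
(sum-list (squared-errors 1 0 with-outlier))     ; => 100
(sum-list (squared-errors 3 -4 with-outlier))    ; => 40
(median (squared-errors 1 0 with-outlier))       ; => 0
(median (squared-errors 3 -4 with-outlier))      ; => 4
```

A single corrupted point moves the LS line from y = x to y = 3x - 4, because that line has the smaller sum of squared errors (40 versus 100). Under the median criterion the original line y = x still scores better (0 versus 4), since it fits four of the five points exactly.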
