lecture12-annotated - Machine Learning 10-701/15-781, Fall 2008

Machine Learning 10-701/15-781, Fall 2008
Overfitting and Model Selection
Eric Xing
Lecture 12, October 15, 2008
Reading: Chap. 1 & 2, CB & Chap. 5, 6, TM
© Eric Xing @ CMU, 2006-2008

Outline
- Overfitting
  - kNN
  - Regression
- Bias-variance decomposition
- The battle against overfitting: each learning algorithm has some "free knobs" that one can "tune" (i.e., hack) to make the algorithm generalize better to test data. But is there a more principled way?
  - Cross validation
  - Regularization
  - Feature selection
  - Model selection --- Occam's razor
  - Model averaging

Overfitting: kNN
(figures omitted in this text version)

Another example: Regression
(figures omitted in this text version)
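
The slides illustrate kNN overfitting with figures only. As a rough, hypothetical sketch (not part of the lecture, synthetic two-blob data of my own choosing), the following compares training and test error of a brute-force kNN classifier as k varies: small k memorizes the training set, while larger k trades some training accuracy for better generalization.

```python
# Hypothetical sketch: how the choice of k in kNN controls overfitting.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # two noisy Gaussian blobs with labels 0 / 1
    X0 = rng.normal(loc=-1.0, scale=1.5, size=(n // 2, 2))
    X1 = rng.normal(loc=+1.0, scale=1.5, size=(n // 2, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

def knn_predict(Xtrain, ytrain, Xquery, k):
    # brute-force k-nearest-neighbour majority vote
    d = np.linalg.norm(Xquery[:, None, :] - Xtrain[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return (ytrain[nn].mean(axis=1) > 0.5).astype(int)

Xtr, ytr = make_data(200)
Xte, yte = make_data(2000)
for k in (1, 5, 25, 75):
    tr_err = np.mean(knn_predict(Xtr, ytr, Xtr, k) != ytr)
    te_err = np.mean(knn_predict(Xtr, ytr, Xte, k) != yte)
    print(f"k={k:3d}  train error={tr_err:.3f}  test error={te_err:.3f}")
```

With k = 1 the training error is exactly zero (each point is its own nearest neighbour), yet the test error is typically worse than with a moderate k; k is the "free knob" here.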

Overfitting, cont'd
- The models:
- Test errors:
(figures omitted in this text version)

What is a good model?
(figure: a model built on known data is evaluated against new data; legend contrasts a low-robustness model with a low-quality / high-robustness model)
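
As a hypothetical regression analogue (not from the slides; synthetic data, sine target and polynomial degrees are my own choices), a high-degree polynomial can drive training error toward zero while the error on fresh data grows:

```python
# Hypothetical sketch: overfitting in regression as polynomial degree grows.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)  # noisy targets
    return x, t

xtr, ttr = sample(15)
xte, tte = sample(1000)
for deg in (1, 3, 9):
    w = np.polyfit(xtr, ttr, deg)                 # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(w, xtr) - ttr) ** 2)
    mse_te = np.mean((np.polyval(w, xte) - tte) ** 2)
    print(f"degree={deg:2d}  train MSE={mse_tr:.3f}  test MSE={mse_te:.3f}")
```

In the language of the slide, the low-degree fit is the more robust model on new data even though the high-degree fit looks better on the known (training) data.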

Bias-variance decomposition
- Now let's look more closely into the two sources of error in a functional approximator: bias and variance.
- In the following we show the bias-variance decomposition using linear regression as an example.

Loss functions for regression
- Let t be the true (target) output and y(x) be our estimate. The expected squared loss is

    E[L] = \int\int L(t, y(x)) \, p(x, t) \, dx \, dt = \int\int (y(x) - t)^2 \, p(x, t) \, dx \, dt

- Our goal is to choose y(x) to minimize E[L].
- Calculus of variations: setting the functional derivative to zero,

    \frac{\delta E[L]}{\delta y(x)} = 2 \int (y(x) - t) \, p(x, t) \, dt = 0

  so

    y(x) \int p(x, t) \, dt = \int t \, p(x, t) \, dt
    \quad\Rightarrow\quad
    y^*(x) = \frac{\int t \, p(x, t) \, dt}{p(x)} = \int t \, p(t \mid x) \, dt = E[t \mid x]
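
A hypothetical numerical check of this result (not in the slides; the data, noise level, and shifted comparison predictor are my own choices): among all predictors, the conditional mean E[t | x] attains the smallest expected squared loss, and any other predictor pays an extra penalty.

```python
# Hypothetical check: the conditional mean minimizes expected squared loss.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.uniform(0, 1, n)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.4, size=n)  # so E[t | x] = sin(2*pi*x)

h = np.sin(2 * np.pi * x)   # conditional-mean predictor
g = h + 0.2                 # any other predictor incurs extra loss

print("loss of E[t|x]           :", np.mean((t - h) ** 2))  # ~ 0.16 (noise variance)
print("loss of shifted predictor:", np.mean((t - g) ** 2))  # ~ 0.16 + 0.04
```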

Expected loss
- Let h(x) = E[t | x] be the optimal predictor, and y(x) our actual predictor, which incurs the following expected loss:

    E[(y(x) - t)^2] = \int\int \big( y(x) - h(x) + h(x) - t \big)^2 \, p(x, t) \, dx \, dt
                    = \int\int \Big[ (y(x) - h(x))^2 + 2 (y(x) - h(x))(h(x) - t) + (h(x) - t)^2 \Big] p(x, t) \, dx \, dt
                    = \int (y(x) - h(x))^2 \, p(x) \, dx + \int\int (h(x) - t)^2 \, p(x, t) \, dx \, dt

  (the cross term vanishes because \int (h(x) - t) \, p(t \mid x) \, dt = 0).
- The term \int\int (h(x) - t)^2 \, p(x, t) \, dx \, dt is intrinsic noise; we can do no better than this, so it is a lower bound on the expected loss.
- The other part of the error comes from \int (y(x) - h(x))^2 \, p(x) \, dx; let's take a closer look at it.
- We will assume y(x) = y(x | w) is a parametric model whose parameters w are fit to a training set D (thus we write y(x; D)).
- (Annotation from the slides: there is an error on p. 47.)

Bias-variance decomposition
- For one training set D and one test point x, adding and subtracting E_D[y(x; D)] (the same trick as above):

    (y(x; D) - h(x))^2 = \big( y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x) \big)^2
                       = (y(x; D) - E_D[y(x; D)])^2 + (E_D[y(x; D)] - h(x))^2
                         + 2 \, (y(x; D) - E_D[y(x; D)]) (E_D[y(x; D)] - h(x))

- Since the predictor y depends on the training data D, this error term depends on the training set, so we take an expectation over datasets; the cross term vanishes and

    E_D[(y(x; D) - h(x))^2] = \underbrace{(E_D[y(x; D)] - h(x))^2}_{(\text{bias})^2}
                              + \underbrace{E_D\big[ (y(x; D) - E_D[y(x; D)])^2 \big]}_{\text{variance}}

- Putting things together:  expected loss = (bias)^2 + variance + noise
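
A hypothetical empirical illustration of the decomposition (not from the slides; the sine target, noise level, sample sizes, and polynomial degrees are my own choices): fit a polynomial to many independently drawn training sets D and estimate the three terms directly.

```python
# Hypothetical sketch: estimate bias^2, variance, and noise empirically,
# and check that they sum to the expected loss.
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3
xgrid = np.linspace(0, 1, 50)          # fixed test inputs x
h = np.sin(2 * np.pi * xgrid)          # true regression function h(x) = E[t | x]

def fit_one(deg, n=25):
    # draw one training set D and return y(x; D) evaluated on the grid
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=sigma, size=n)
    w = np.polyfit(x, t, deg)
    return np.polyval(w, xgrid)

for deg in (1, 3, 9):
    Y = np.array([fit_one(deg) for _ in range(500)])   # one row per dataset D
    ybar = Y.mean(axis=0)                              # E_D[y(x; D)]
    bias2 = np.mean((ybar - h) ** 2)
    variance = np.mean(np.var(Y, axis=0))
    noise = sigma ** 2
    print(f"degree={deg}: bias^2={bias2:.3f}  variance={variance:.3f}  "
          f"noise={noise:.3f}  sum={bias2 + variance + noise:.3f}")
```

Low-degree fits show high bias and low variance; high-degree fits show the reverse, while the noise term stays fixed at sigma^2.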

Recall: Structural Risk Minimization
- Which hypothesis space should we choose?
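
One practical answer, sketched hypothetically below (not from the slides; the data and the polynomial-degree hypothesis spaces are my own choices, and this uses a simple held-out split rather than full structural risk minimization): pick the complexity level whose validation error is lowest, not the one with the lowest training error.

```python
# Hypothetical sketch: choosing among nested hypothesis spaces
# (polynomial degrees) by held-out validation error.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 60)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=60)

xtr, ttr = x[:40], t[:40]     # training split
xva, tva = x[40:], t[40:]     # validation split

val_err = {}
for deg in range(1, 10):
    w = np.polyfit(xtr, ttr, deg)
    val_err[deg] = np.mean((np.polyval(w, xva) - tva) ** 2)

best = min(val_err, key=val_err.get)
print("validation MSE by degree:", {d: round(e, 3) for d, e in val_err.items()})
print("selected degree:", best)
```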