The Learning Problem and Regularization
9.520 Class 02, 13 February 2006
Tomaso Poggio

Plan

Learning as function approximation
Empirical Risk Minimization
Generalization and well-posedness
Regularization
Appendix: sample and approximation error

About This Class

Theme: We introduce the learning problem as the problem of function approximation from sparse data. We define the key ideas of loss functions, empirical error and generalization error. We then introduce the Empirical Risk Minimization approach and the two key requirements on algorithms using it: well-posedness and consistency. We then describe a key algorithm, Tikhonov regularization, that satisfies these requirements.

Math required: familiarity with basic ideas in probability theory.

Data Generated By A Probability Distribution

We assume that X and Y are two sets of random variables. We are given a training set S consisting of n samples drawn i.i.d. from the probability distribution μ(z) on Z = X × Y:

    (x_1, y_1), ..., (x_n, y_n),    that is,    z_1, ..., z_n

We will make frequent use of the conditional probability of y given x, written p(y|x):

    μ(z) = p(x, y) = p(y|x) p(x)

It is crucial to note that we view p(x, y) as fixed but unknown.

[Figure: the probabilistic setting — X, Y, P(x), P(y|x)]

Hypothesis Space

The hypothesis space H is the space of functions that we allow our algorithm to provide. For many algorithms (such as optimization algorithms) it is the space the algorithm is allowed to search. As we will see, it is often important to choose the hypothesis space as a function of the amount of data available.

Learning As Function Approximation From Samples: Regression and Classification

The basic goal of supervised learning is to use the training set S to learn a function f_S that looks at a new x value x_new and predicts the associated value of y:

    y_pred = f_S(x_new)

If y is a real-valued random variable, we have regression.
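To make the setup concrete, the sketch below draws a small training set S and builds a predictor f_S with a one-nearest-neighbor rule. The sine target, the sample size, and the choice of 1-NN as the learning algorithm are all illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set S = (x_1, y_1), ..., (x_n, y_n), drawn i.i.d.
# (here the inputs are uniform on [-1, 1] and y = sin(pi * x),
# an assumed target relation for illustration)
x_train = rng.uniform(-1.0, 1.0, size=20)
y_train = np.sin(np.pi * x_train)

def f_S(x_new):
    """Predict y_pred = f_S(x_new): return the label of the
    training point closest to x_new (1-nearest-neighbor rule)."""
    i = np.argmin(np.abs(x_train - x_new))
    return y_train[i]

y_pred = f_S(0.1)  # regression: y is real-valued
```

On a training input the 1-NN rule reproduces the stored label exactly, which makes it a simple example of an algorithm whose hypothesis space depends on the data S itself.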
If y takes values from an unordered finite set, we have pattern classification. In two-class pattern classification problems, we assign one class a y value of 1, and the other class a y value of -1.

Loss Functions

In order to measure the goodness of our function, we need a loss function V. In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Common Loss Functions For Regression

For regression, the most common loss function is the square loss or L2 loss:

    V(f(x), y) = (f(x) - y)^2

We could also use the absolute value, or L1 loss:

    V(f(x), y) = |f(x) - y|

Vapnik's more general ε-insensitive loss function is:

    V(f(x), y) = (|f(x) - y| - ε)_+

Common Loss Functions For Classification

For binary classification, the most intuitive loss is the 0-1 loss:

    V(f(x), y) = Θ(-y f(x))

where Θ is the step function. For tractability and other reasons,...
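The four losses above translate directly into code. This is a minimal NumPy sketch; the function names and the example value of ε are my own choices, not notation from the notes.

```python
import numpy as np

def square_loss(fx, y):
    """L2 loss: V(f(x), y) = (f(x) - y)^2."""
    return (fx - y) ** 2

def absolute_loss(fx, y):
    """L1 loss: V(f(x), y) = |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, eps=0.5):
    """Vapnik's eps-insensitive loss: (|f(x) - y| - eps)_+.
    Errors smaller than eps cost nothing."""
    return np.maximum(np.abs(fx - y) - eps, 0.0)

def zero_one_loss(fx, y):
    """0-1 loss for binary classification with y in {-1, +1}:
    Theta(-y f(x)) is 1 when f(x) disagrees in sign with y, else 0.
    (Convention: y f(x) = 0 counts as an error, i.e. Theta(0) = 1.)"""
    return np.where(y * fx <= 0, 1.0, 0.0)
```

Note how the ε-insensitive loss interpolates between ignoring small residuals and behaving like the L1 loss for large ones, which is what makes it useful in support vector regression.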
This note was uploaded on 11/11/2011 for the course BIO 9.07, taught by Professor Ruth Rosenholtz during the Spring '04 term at MIT.