9.520: Class 20
Bayesian Interpretations
Tomaso Poggio and Sayan Mukherjee

Plan
• Bayesian interpretation of regularization
• Bayesian interpretation of the regularizer
• Bayesian interpretation of quadratic loss
• Bayesian interpretation of SVM loss
• Consistency check of MAP and mean solutions for quadratic loss
• Synthesizing kernels from data: Bayesian foundations
• Selection (called “alignment”) as a special case of kernel synthesis

Bayesian Interpretation of RN, SVM, and BPD in Regression

Consider

$$\min_{f \in \mathcal{H}} \; \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2 .$$

We will show that there is a Bayesian interpretation of RN in which the data term (the term containing the loss function) is a model of the noise, and the stabilizer is a prior on the hypothesis space of functions $f$.

Definitions

1. $D_\ell = \{(x_i, y_i)\}$ for $i = 1, \dots, \ell$ is the set of training examples.
2. $P[f \mid D_\ell]$ is the conditional probability of the function $f$ given the examples $D_\ell$.
3. $P[D_\ell \mid f]$ is the conditional probability of $D_\ell$ given $f$, i.e. a model of the noise.
4. $P[f]$ is the a priori probability of the random field $f$.

Posterior Probability

The posterior distribution $P[f \mid D_\ell]$ can be computed by applying Bayes' rule:

$$P[f \mid D_\ell] = \frac{P[D_\ell \mid f] \, P[f]}{P(D_\ell)} .$$

If the noise is normally distributed with variance $\sigma^2$, then the probability $P[D_\ell \mid f]$ is

$$P[D_\ell \mid f] = \frac{1}{Z_L} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 \right),$$

where $Z_L$ is a normalization constant.

Posterior Probability

Informally (we will make it precise later), if

$$P[f] = \frac{1}{Z_r} \exp\left( -\|f\|_K^2 \right),$$

where $Z_r$ is another normalization constant, then

$$P[f \mid D_\ell] = \frac{1}{Z_D Z_L Z_r} \exp\left( -\left( \frac{1}{2\sigma^2} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \|f\|_K^2 \right) \right),$$

with $Z_D = P(D_\ell)$ the denominator from Bayes' rule.

MAP Estimate

One of the several possible estimates of $f$ from $P[f \mid D_\ell]$ is the so-called MAP (maximum a posteriori) estimate. Taking the negative logarithm of the posterior and multiplying by $2\sigma^2$ gives

$$\arg\max_f P[f \mid D_\ell] = \arg\min_f \; \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + 2\sigma^2 \|f\|_K^2 ,$$

which is the same as the regularization functional if $\lambda = 2\sigma^2 / \ell$.
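The correspondence $\lambda = 2\sigma^2/\ell$ can be checked numerically. The following is a minimal sketch, not part of the original slides. It assumes that $f$ is expanded on the training points as $f(x) = \sum_i c_i K(x_i, x)$, so that $\|f\|_K^2 = c^\top K c$ and the regularization functional is minimized by $c = (K + \lambda \ell I)^{-1} y$. The Gaussian kernel, its width, the toy data, and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Gaussian kernel matrix between 1-D point sets (illustrative choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2 * width ** 2))

def neg_log_posterior(c, K, y, sigma2):
    """-log P[f | D] up to an additive constant, for f = sum_i c_i K(x_i, .)."""
    resid = y - K @ c
    return resid @ resid / (2 * sigma2) + c @ K @ c

def map_coefficients(K, y, sigma2):
    """Minimizer of the regularization functional with lambda = 2 sigma^2 / ell."""
    ell = len(y)
    lam = 2 * sigma2 / ell
    return np.linalg.solve(K + lam * ell * np.eye(ell), y)

# Toy data (assumption): a sine curve observed with Gaussian noise of variance sigma2.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
sigma2 = 0.05
y = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(sigma2), x.size)

K = gaussian_kernel(x, x)
c_map = map_coefficients(K, y, sigma2)

# Gradient of the negative log posterior with respect to c, evaluated at c_map.
grad = -K @ (y - K @ c_map) / sigma2 + 2 * K @ c_map
```

If the correspondence holds, `grad` vanishes up to round-off, so the regularization solution is also a stationary point of the posterior. Solving the linear system directly, rather than running an optimizer on `neg_log_posterior`, keeps the check exact up to floating-point error.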
Bayesian Interpretation of the Data Term (quadratic loss)

As we just showed, the quadratic loss (the standard RN case) corresponds in the Bayesian interpretation to assuming that the data $y_i$ are affected by additive independent Gaussian noise processes, i.e. $y_i = f(x_i) + \epsilon_i$ with $E[\epsilon_i \epsilon_j] = \sigma^2 \delta_{i,j}$, so that

$$P(y \mid f) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_i (y_i - f(x_i))^2 \right).$$

Bayesian Interpretation of the Data Term (nonquadratic loss)

To find the Bayesian interpretation of the SVM loss, we now assume a more general form of noise. We assume that the data are affected by additive independent noise sampled from a continuous mixture of Gaussian distributions with mean $\mu$ and width parameter $\beta$ (which plays the role of $1/(2\sigma^2)$, i.e. an inverse variance), according to

$$P(y \mid f) \propto \int_0^{\infty} d\beta \int_{-\infty}^{\infty} d\mu \; \sqrt{\beta} \, e^{-\beta (y - f(x) - \mu)^2} \, P(\beta, \mu) .$$

The previous case of quadratic loss corresponds to

$$P(\beta, \mu) = \delta\left( \beta - \frac{1}{2\sigma^2} \right) \delta(\mu) .$$

Bayesian Interpretation of the Data Term (absolute loss)

To find $P(\beta, \mu)$ that yields a given loss function...
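The quadratic special case of the mixture noise model can be verified with a short numerical sketch. This is an illustration under stated assumptions, not the slides' own construction: the continuous mixture is replaced by a discrete one with a single component at $\beta = 1/(2\sigma^2)$, $\mu = 0$, which is what the delta-function choice of $P(\beta, \mu)$ selects. The function names and the residual grid are hypothetical.

```python
import numpy as np

def mixture_likelihood(r, betas, mus, weights):
    """Discretized mixture: sum_k w_k * sqrt(beta_k) * exp(-beta_k * (r - mu_k)^2).

    r is the residual y - f(x); betas, mus, weights are 1-D arrays of equal length.
    """
    r = np.asarray(r, dtype=float)[:, None]
    components = np.sqrt(betas) * np.exp(-betas * (r - mus) ** 2)
    return components @ weights

def gaussian_pdf(r, sigma2):
    """Zero-mean Gaussian density with variance sigma2."""
    return np.exp(-r ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

sigma2 = 0.25
r = np.linspace(-3.0, 3.0, 61)

# Single component at beta = 1 / (2 sigma^2), mu = 0: the delta-function case.
quad = mixture_likelihood(
    r,
    betas=np.array([1.0 / (2 * sigma2)]),
    mus=np.array([0.0]),
    weights=np.array([1.0]),
)

# The ratio to the Gaussian density should be constant in r.
ratio = quad / gaussian_pdf(r, sigma2)

# The induced loss, -log of the likelihood, shifted so its minimum is zero.
loss = -np.log(quad)
loss_shifted = loss - loss.min()
```

Here `ratio` is constant in `r`, so the mixture is proportional to the Gaussian noise model, and `loss_shifted` reproduces the quadratic loss $r^2 / (2\sigma^2)$. Passing several components with different `betas` and `weights` gives a non-Gaussian noise model and hence a nonquadratic induced loss.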
This note was uploaded on 11/11/2011 for the course BIO 9.07 taught by Professor Ruthrosenholtz during the Spring '04 term at MIT.