CSC 411 / CSC D11 Bayesian Methods

11 Bayesian Methods

So far, we have considered statistical methods which select a single "best" model given the data. This approach can have problems, such as overfitting when there is not enough data to fully constrain the model fit. In contrast, in the "pure" Bayesian approach, as much as possible we compute only distributions over unknowns; we never maximize anything.

For example, consider a model parameterized by some weight vector $\mathbf{w}$, and some training data $\mathcal{D}$ that comprises input-output pairs $(x_i, y_i)$, for $i = 1 \dots N$. The posterior probability distribution over the parameters, conditioned on the data, is given by Bayes' rule:

\[ p(\mathbf{w} \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})} \tag{1} \]

The reason we want to fit the model in the first place is to allow us to make predictions with future test data. That is, given some future input $x_{new}$, we want to use the model to predict $y_{new}$. To accomplish this task through estimation in previous chapters, we used optimization to find ML or MAP estimates of $\mathbf{w}$, e.g., by maximizing (1). In a Bayesian approach, rather than estimating a single best value for $\mathbf{w}$, we compute (or approximate) the entire posterior distribution $p(\mathbf{w} \,|\, \mathcal{D})$. Given the entire distribution, we can still make predictions with the following integral:

\[ p(y_{new} \,|\, \mathcal{D}, x_{new}) = \int p(y_{new}, \mathbf{w} \,|\, \mathcal{D}, x_{new})\, d\mathbf{w} = \int p(y_{new} \,|\, \mathbf{w}, \mathcal{D}, x_{new})\, p(\mathbf{w} \,|\, \mathcal{D}, x_{new})\, d\mathbf{w} \tag{2} \]

The first step in this equality follows from the Sum Rule; the second follows from the Product Rule. Additionally, the output $y_{new}$ and the training data $\mathcal{D}$ are independent conditioned on $\mathbf{w}$, so $p(y_{new} \,|\, \mathbf{w}, \mathcal{D}) = p(y_{new} \,|\, \mathbf{w})$. That is, given $\mathbf{w}$, we have all the information about making predictions that we could possibly get from the training data $\mathcal{D}$ (according to the model). Finally, given $\mathcal{D}$, it is safe to assume that $x_{new}$, in itself, provides no information about $\mathbf{w}$.
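As a concrete illustration of Equation (1), the sketch below computes a posterior over a discrete grid of candidate weights for a toy 1D linear model. The model, grid, noise level, and prior are all illustrative assumptions, not part of the notes; the point is that the posterior is just likelihood times prior, normalized so it sums to one.

```python
import numpy as np

# Hypothetical toy model: y = w*x + Gaussian noise, with w restricted to a
# discrete grid so the posterior of Eq. (1) can be computed exactly.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)   # data generated with w = 2

w_grid = np.linspace(-5.0, 5.0, 1001)              # candidate weights (assumed)
sigma = 0.1                                        # assumed noise std dev

# Gaussian log-likelihood of the data under each candidate w
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / sigma**2 for w in w_grid])
log_prior = -0.5 * w_grid**2                       # Gaussian prior, unit variance

# Posterior p(w | D) ∝ p(D | w) p(w); normalizing plays the role of p(D)
log_post = log_lik + log_prior
log_post -= log_post.max()                         # subtract max for stability
post = np.exp(log_post)
post /= post.sum()

w_map = w_grid[np.argmax(post)]                    # MAP estimate, for comparison
```

Working in log space and subtracting the maximum before exponentiating avoids underflow when the likelihood is sharply peaked.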
With these assumptions we have the following expression for our predictions:

\[ p(y_{new} \,|\, \mathcal{D}, x_{new}) = \int p(y_{new} \,|\, \mathbf{w}, x_{new})\, p(\mathbf{w} \,|\, \mathcal{D})\, d\mathbf{w} \tag{3} \]

In the case of discrete parameters $\mathbf{w}$, the integral becomes a summation. The predictive distribution $p(y_{new} \,|\, \mathcal{D}, x_{new})$ tells us everything there is to know about our beliefs about the new value $y_{new}$. There are many things we can do with this distribution. For example, we could pick the most likely prediction, i.e., $\arg\max_{y_{new}} p(y_{new} \,|\, \mathcal{D}, x_{new})$, or we could compute the variance of this distribution to get a sense of how much confidence we have in the prediction. We could also sample from this distribution in order to visualize the range of models that are plausible for this data.
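For discrete parameters, Equation (3) can be evaluated exactly as a posterior-weighted sum. The sketch below does this for an assumed toy coin-flip model (not from the notes), where $w$ is the probability of heads and the "prediction" is the next flip: the predictive probability is a mixture of each candidate $w$, weighted by its posterior.

```python
import numpy as np

# Toy model (an illustrative assumption): y ∈ {0, 1} is a coin flip with
# p(y = 1 | w) = w, and w lives on a discrete grid, so Eq. (3)'s integral
# becomes the summation  p(y_new = 1 | D) = Σ_w p(y_new = 1 | w) p(w | D).
w_grid = np.linspace(0.01, 0.99, 99)
prior = np.full_like(w_grid, 1.0 / len(w_grid))    # uniform prior over the grid

# Training data D: 7 heads out of 10 flips
heads, flips = 7, 10
lik = w_grid**heads * (1.0 - w_grid)**(flips - heads)

post = lik * prior
post /= post.sum()                                 # posterior p(w | D), Eq. (1)

# Predictive distribution, Eq. (3) as a summation
p_heads = np.sum(w_grid * post)                    # p(y_new = 1 | D)
```

Note that the Bayesian prediction here is the posterior mean of $w$, which differs from the ML estimate $7/10$; averaging over all plausible parameters pulls the prediction toward the prior.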
This note was uploaded on 11/09/2010 for the course CS CSCD11 taught by Professor Davidfleet during the Spring '10 term at University of Toronto.