MIT15_097S12_lec15 - 15.097 Probabilistic Modeling and Bayesian Analysis

15.097: Probabilistic Modeling and Bayesian Analysis
Ben Letham and Cynthia Rudin
Credits: Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin

1 Introduction and Notation

Up to this point, most of the machine learning tools we discussed (SVM, Boosting, Decision Trees, ...) do not make any assumption about how the data were generated. For the remainder of the course, we will make distributional assumptions: we will assume that the underlying distribution belongs to a specified set. Given data, our goal then becomes to determine which probability distribution generated the data.

We are given m data points y_1, ..., y_m, each of arbitrary dimension. Let y = {y_1, ..., y_m} denote the full set of data. Thus y is a random variable, whose probability density function would in probability theory typically be denoted f_y({y_1, ..., y_m}). We will use a shorthand notation that is standard in Bayesian analysis, and denote the probability density function of the random variable y simply as p(y).

We will assume that the data were generated from a probability distribution described by some parameters θ (not necessarily scalar). We treat θ as a random variable. We will use the shorthand notation p(y|θ) to represent the family of conditional density functions over y, parameterized by the random variable θ. We call this family p(y|θ) a likelihood function or likelihood model for the data y, as it tells us how likely the data y are given the model specified by any value of θ.

We specify a prior distribution over θ, denoted p(θ). This distribution represents any knowledge we have about how the data are generated prior to
observing them. Our end goal is the conditional density function over θ given the observed data, which we denote p(θ|y). We call this the posterior distribution, and it informs us which parameters are likely given the observed data.

We, the modelers, specify the likelihood function (as a function of y and θ) and the prior (which we specify completely) using our knowledge of the system at hand. We then use these quantities, together with the data, to compute the posterior. The likelihood, prior, and posterior are all related via Bayes' rule:

    p(θ|y) = p(y|θ) p(θ) / p(y) = p(y|θ) p(θ) / ∫ p(y|θ') p(θ') dθ',    (1)

where the second step uses the law of total probability. Unfortunately the integral in the denominator, called the partition function, is often intractable. This is what makes Bayesian analysis difficult, and the remainder of these notes will essentially cover methods for avoiding that integral.

Coin Flip Example, Part 1. Suppose we have been given data from a series of m coin flips, and we are not sure if the coin is fair. We might assume that the data were generated by a sequence of independent draws from a Bernoulli distribution, parameterized by θ, the probability of flipping Heads.
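The coin-flip setup can be sketched numerically. Below is a minimal, illustrative example (the flip counts, grid size, and uniform prior are assumptions, not part of the notes): it evaluates the Bernoulli likelihood p(y|θ) on a grid of θ values, multiplies by the prior, and normalizes by a Riemann-sum approximation of the integral in Eq. (1), i.e., the partition function that is often intractable in higher dimensions.

```python
# Hypothetical data: m = 12 flips with 9 Heads (made-up counts for illustration).
m, heads = 12, 9

# Grid of candidate theta values in (0, 1), using midpoints of n equal bins.
n = 1000
grid = [(i + 0.5) / n for i in range(n)]

# Likelihood p(y | theta) for i.i.d. Bernoulli flips; since the flips are
# independent, only the counts of Heads and Tails enter the product.
def likelihood(theta):
    return theta**heads * (1 - theta)**(m - heads)

# Uniform prior p(theta) = 1 on (0, 1). Posterior = likelihood * prior / p(y),
# with p(y) approximated by a Riemann sum in place of the integral in Eq. (1).
unnorm = [likelihood(t) * 1.0 for t in grid]
p_y = sum(unnorm) / n                   # approximates the partition function
posterior = [u / p_y for u in unnorm]   # posterior density values on the grid

# Posterior mean of theta, as a sanity check on the approximation.
post_mean = sum(t * p for t, p in zip(grid, posterior)) / n
print(round(post_mean, 3))
```

This brute-force grid works only because θ is one-dimensional here; with many parameters the same sum becomes exponentially expensive, which is exactly why the normalization integral is the central difficulty these notes address.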
