15.097: Probabilistic Modeling and Bayesian Analysis
Ben Letham and Cynthia Rudin

Credits: Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin

1 Introduction and Notation

Up to this point, most of the machine learning tools we discussed (SVM, Boosting, Decision Trees, ...) do not make any assumption about how the data were generated. For the remainder of the course, we will make distributional assumptions, namely that the underlying distribution is one of a set. Given data, our goal then becomes to determine which probability distribution generated the data.

We are given m data points y_1, ..., y_m, each of arbitrary dimension. Let y = {y_1, ..., y_m} denote the full set of data. Thus y is a random variable, whose probability density function would in probability theory typically be denoted as f_y({y_1, ..., y_m}). We will use a standard (in Bayesian analysis) shorthand notation for probability density functions, and denote the probability density function of the random variable y as simply p(y).

We will assume that the data were generated from a probability distribution that is described by some parameters θ (not necessarily scalar). We treat θ as a random variable. We will use the shorthand notation p(y|θ) to represent the family of conditional density functions over y, parameterized by the random variable θ.

We call this family p(y|θ) a likelihood function or likelihood model for the data y, as it tells us how likely the data y are given the model specified by any value of θ.

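A standard special case, used in the coin flip example below, is when the m data points are modeled as independent and identically distributed draws given θ; the likelihood then factors into a product of per-point densities:

    p(y \mid \theta) = \prod_{i=1}^{m} p(y_i \mid \theta).
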
We specify a prior distribution over θ, denoted p(θ). This distribution represents any knowledge we have about how the data are generated prior to observing them.

Our end goal is the conditional density function over θ, given the observed data, which we denote as p(θ|y). We call this the posterior distribution, and it informs us which parameters are likely given the observed data.

We, the modeler, specify the likelihood function (as a function of y and θ) and the prior (we completely specify this) using our knowledge of the system at hand. We then use these quantities, together with the data, to compute the posterior.

The likelihood, prior, and posterior are all related via Bayes' rule:

    p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'},    (1)

where the second step uses the law of total probability. Unfortunately the integral in the denominator, called the partition function, is often intractable. This is what makes Bayesian analysis difficult, and the remainder of the notes will essentially be methods for avoiding that integral.

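To make the role of the partition function concrete, here is a minimal numerical sketch, not part of the original notes, for the special case where θ is a scalar: the integral in the denominator of (1) is approximated by a Riemann sum over a grid of θ values. The function grid_posterior and its interface are illustrative choices rather than any library's API, and this brute-force approach is only feasible in very low dimension, which is exactly why more sophisticated methods are needed in general.

import numpy as np

def grid_posterior(likelihood, prior, theta_grid):
    """Approximate the posterior p(theta | y) on an evenly spaced grid.

    likelihood: function mapping theta to p(y | theta) for the observed data y
    prior:      function mapping theta to the prior density p(theta)
    """
    # Numerator of Bayes' rule (1), evaluated at each grid point.
    unnormalized = np.array([likelihood(t) * prior(t) for t in theta_grid])
    # Riemann-sum approximation of the partition function p(y).
    partition = np.sum(unnormalized) * (theta_grid[1] - theta_grid[0])
    return unnormalized / partition  # approximate posterior density
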
Coin Flip Example Part 1. Suppose we have been given data from a series of m coin flips, and we are not sure if the coin is fair or not. We might assume that the data were generated by a sequence of independent draws from a Bernoulli distribution, parameterized by θ, which is the probability of flipping Heads.

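Encoding Heads as y_i = 1 and Tails as y_i = 0 (an encoding chosen here for concreteness), the independence of the draws lets this likelihood factor as in the i.i.d. case above:

    p(y \mid \theta) = \prod_{i=1}^{m} \theta^{y_i} (1 - \theta)^{1 - y_i} = \theta^{\sum_i y_i} (1 - \theta)^{m - \sum_i y_i}.

As a sanity check, this likelihood can be plugged into the grid_posterior sketch above; the data and the uniform prior here are hypothetical choices made purely for illustration.

theta_grid = np.linspace(0.0, 1.0, 1000)       # grid over the parameter space [0, 1]
y = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])   # hypothetical data: 7 Heads in m = 10 flips
heads, m = y.sum(), len(y)
posterior = grid_posterior(
    likelihood=lambda t: t**heads * (1 - t)**(m - heads),  # Bernoulli likelihood
    prior=lambda t: 1.0,                                   # assumed uniform prior on [0, 1]
    theta_grid=theta_grid,
)
print(theta_grid[np.argmax(posterior)])        # posterior mode, approximately 0.7
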