as

P(X = x) = lim_{ε→0} ∫_{x−ε}^{x+ε} p(x′) dx′ ;   (2.20, 2.21)

then P(X = 1) = 1/6, as will the probability of the other events. Finally we can show that the entropy of X is negative infinity,

h(X) = −∫_{−∞}^{∞} p(x′) log p(x′) dx′ = −Σ_{i=1}^{6} (1/6) log ∞ = −∞ ;   (2.22)

yet X is pretty clearly random.
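One way to see how Equation 2.22 arrives at negative infinity is to approximate each delta function by a narrow rectangle and watch the differential entropy diverge as the rectangles shrink. The sketch below is an illustrative assumption, not part of the text: each of the six spikes is a rectangle of width w carrying mass 1/6, so the density inside a spike is 1/(6w) and the entropy works out to log(6w), which tends to −∞ as w → 0, even though which spike an outcome falls in remains perfectly random.

```python
import math

# Approximate p(x) = sum of six deltas by six rectangles of width w,
# each of mass 1/6.  The density inside a spike is 1/(6w), so
# h = -6 * w * (1/(6w)) * log(1/(6w)) = log(6w),
# which diverges to -infinity as w -> 0, matching Equation 2.22.
def spike_entropy(w):
    density = 1.0 / (6.0 * w)                     # p(x) inside each spike
    return -6.0 * w * density * math.log(density)  # = log(6w)

for w in [1.0, 1e-2, 1e-4, 1e-8]:
    print(f"w = {w:g}  h = {spike_entropy(w):.3f}")
```

At w = 1 the entropy equals log 6, the ordinary discrete entropy of a fair die; every narrowing of the spikes subtracts more, without bound.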
Though differential entropy does not provide an absolute measure of randomness or code length, it does provide a relative measure of these properties. A random variable X is less predictable than Y whenever h(X) > h(Y). Similarly, an event from X requires more bits on average to encode than an event from Y.

2.3 Samples versus Distributions
A random variable is a mathematical structure that is intended to model the behavior of a
physical process. In some cases there are excellent physical reasons to believe that an RV
is an accurate model of a process. In many other cases the properties of a random physical
process may well be unknown. In these cases we may still wish to use the theory of probability to analyze a system. The first step is to find an accurate RV model of our data.

Paul A. Viola    CHAPTER 2. PROBABILITY AND ENTROPY
In order to ensure that our probabilistic inferences will be correct, our model must be as accurate as possible. While it is possible to model every coin as a fair coin, we do so at our peril. It makes much more sense to perform a large number of experiments intended to test the hypothesis, "Is the coin fair?" There are two important intuitions behind finding an accurate model of a random process. First and foremost, you want a model that seems to explain the data well. It might not make sense to guess that a coin is fair if after 500 flips heads has come up 400 times. But it is also important that your model be plausible. If after a lifetime of experience you realize that most coins are pretty fair, then perhaps it makes more sense to assume that 400 heads is unusual but not sufficiently unusual to conclude that the coin is biased.

2.3.1 Model Selection, Likelihood and Cross Entropy
The field of statistics provides many tools for testing the validity of random models. A lot of this work shares a particular form, called maximum likelihood model selection. The goal is to select the most probable model given a large sample of measurements. Maximum likelihood selection proceeds in steps: (1) guess the definition of a random variable that might model the process; (2) evaluate the "goodness" of the model by computing the probability that the data observed could have been generated by the model; (3) after evaluating many models, retain the model that makes the data most probable. The probability of a sample a under the RV X is the probability of the co-occurrence of the trials in a,

ℓ(a) = P_X(a) = P_X(x_1 = x_{a_1}, x_2 = x_{a_2}, …) .   (2.23)

The probability of a sample is usually called its likelihood and is denoted ℓ(a).
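The three selection steps can be sketched numerically for the coin example from the previous section. In this sketch the candidate models are Bernoulli random variables; the particular grid of candidate head probabilities is an illustrative assumption, while the data, 400 heads in 500 flips, echo the text.

```python
import math

# Maximum likelihood model selection for a coin, following steps (1)-(3).
heads, flips = 400, 500
candidates = [0.25, 0.5, 0.75, 0.8, 0.9]   # step 1: guessed models (assumed grid)

def log_likelihood(p_heads):
    # Log of the sample probability (Equation 2.23) under independent flips.
    return heads * math.log(p_heads) + (flips - heads) * math.log(1.0 - p_heads)

# Step 2: score every candidate; step 3: keep the one that makes the
# observed data most probable.
best = max(candidates, key=log_likelihood)
print(best)   # the empirical frequency 400/500 = 0.8 wins
```

Among these candidates the winner is the empirical head frequency, which is what maximum likelihood estimation of a Bernoulli parameter always selects.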
Justification for maximum likelihood model selection is based on Bayes' law. The likelihood of a sample is really a conditional probability, P(a | X). Bayes' law allows us to turn the conditional around and find the most likely model given the sample,

P(X | a) = P(a | X) P(X) / P(a) .   (2.24)

2.3. SAMPLES VERSUS DISTRIBUTIONS    AITR 1548

In order to compute the model likelihood one must multiply the sample likelihood by the correcting factor P(X)/P(a). The unconditioned probability of the sample, P(a), could well be arbitrary, because the sample is the same for all of the models we will evaluate. The prior probability of the model, P(X), poses more problems. Maximum likelihood model selection assumes that all of the models that are to be evaluated are equally likely to have occurred, i.e. P(X) is constant. As a result P(X)/P(a) is the same for all models, and the most probable model is the one that makes the data most probable.
When reliable information about the prior probability of a model is available we can use
Bayes' law directly. This technique is known as maximum a posteriori model selection. For
instance, over a wide variety of experiments, we may have observed that fair coins are far
more common than unfair coins. It is very implausible that any particular coin would be
unfair, but not impossible. While our prior knowledge should bias us toward the conclusion
that a new coin is fair, it does not determine our conclusion. The likelihood of a model
together with its prior probability can be used to determine which model has the highest
probability of explaining the data. We want a model that both explains the data well and is
plausible.
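This trade-off between prior plausibility and likelihood can be made concrete. The sketch below assumes two hypothetical coin models, "fair" (p = 0.5) and "biased" (p = 0.8), and a prior heavily favoring fair coins; all of the specific numbers are illustrative assumptions, not values from the text.

```python
import math

# A sketch of maximum a posteriori model selection over two coin models.
models = {"fair": 0.5, "biased": 0.8}              # assumed head probabilities
log_prior = {"fair": math.log(0.999),              # assumed prior: fair coins
             "biased": math.log(0.001)}            # are far more common

def log_posterior(model, heads, flips):
    p = models[model]
    log_lik = heads * math.log(p) + (flips - heads) * math.log(1.0 - p)
    # log P(a|X) + log P(X); the shared term P(a) is omitted (Equation 2.24).
    return log_lik + log_prior[model]

def map_choice(heads, flips):
    return max(models, key=lambda m: log_posterior(m, heads, flips))

# Mild evidence leaves the prior in charge; overwhelming evidence
# overrides it, as argued in the text.
print(map_choice(12, 20))     # fair
print(map_choice(400, 500))   # biased
```

With 12 heads in 20 flips the prior dominates and the fair model wins; with 400 heads in 500 flips the likelihood overwhelms even a 999:1 prior, so the data, not the prior alone, determines the conclusion.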
In general, evaluating joint probability over a large number of random variables is intractable. In practice most maximum likelihood schemes assume that the different trials of X are independent. The probability of co-occurrence is then the product over the independent trials,

ℓ(a) = Π_{x_a ∈ a} P_X(x_a) .

Maximizing ℓ(a) is still a daunting process. Significant simplification can be obtained by maximizing the logarithm of ℓ,

log ℓ(a) = Σ_{x_a ∈ a} log P_X(x_a) .

Log likelihood has the same maximum as ℓ(a), but has a much simpler derivative.
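Beyond the simpler derivative, the logarithm also matters numerically: the direct product over trials underflows double-precision floating point for even moderately large samples, while the sum of logs stays finite. A minimal illustration, using 500 rolls of a fair die (each trial probability 1/6) as an assumed sample:

```python
import math

# Direct product vs. sum of logs for a sample of 500 die rolls.
probs = [1.0 / 6.0] * 500

product = 1.0
for p in probs:
    product *= p            # direct product: (1/6)^500 underflows to 0.0

log_sum = sum(math.log(p) for p in probs)   # log likelihood: stays finite
print(product, log_sum)     # 0.0  and roughly -895.9
```

The true value (1/6)^500 is about e^(−895.9), far below the smallest representable double (around 10^(−308)), so the product collapses to exactly zero while the log-likelihood remains a perfectly usable number.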
There is an interesting parallel between...