... as

$$P(X = x) = \lim_{\epsilon \to 0} \int_{x-\epsilon}^{x+\epsilon} p(x')\,dx', \tag{2.21}$$

then $P(X = 1) = \frac{1}{6}$, as will the probability of the other events. Finally we can show that the entropy of $X$ is negative infinity,

$$h(X) = -\int_{-\infty}^{\infty} p(x') \log p(x')\,dx' = -\sum_{i=1}^{6} \frac{1}{6} \log(\infty) = -\infty, \tag{2.22}$$

yet $X$ is pretty clearly random. Though differential entropy does not provide an absolute measure of randomness or code length, it does provide a relative measure of these properties. A random variable $X$ is less predictable than $Y$ whenever $h(X) > h(Y)$. Similarly, an event from $X$ requires more bits on average to encode than an event from $Y$.

2.3 Samples versus Distributions

A random variable is a mathematical structure that is intended to model the behavior of a physical process. In some cases there are excellent physical reasons to believe that an RV is an accurate model of a process. In many other cases the properties of a random physical process may well be unknown. In these cases we may still wish to use the theory of probability to analyze a system. The first step is to find an accurate RV model of our data. In order to ensure that our probabilistic inferences will be correct, our model must be as accurate as possible. While it is possible to model every coin as a fair coin, we do so at our peril. It makes much more sense to perform a large number of experiments intended to test the hypothesis, "Is the coin fair?"

There are two important intuitions behind finding an accurate model of a random process. First and foremost, you want a model that seems to explain the data well. It might not make sense to guess that a coin is fair if after 500 flips heads has come up 400 times. But it is also important that your model be plausible. If after a lifetime of experience you realize that most coins are pretty fair, then perhaps it makes more sense to assume that 400 heads is unusual but not sufficiently unusual to assume that the coin is biased.
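To get a feel for how poorly a fair-coin model explains 400 heads in 500 flips, the tail probability can be computed exactly. The following is only a sketch: it assumes the flips are independent, and the function name `tail_probability` and its parameters are illustrative choices, not part of the original text.

```python
from fractions import Fraction
from math import comb


def tail_probability(n_flips, min_heads, p_heads=Fraction(1, 2)):
    """Exact probability of at least `min_heads` heads in `n_flips` flips.

    Assumes independent flips that each land heads with probability
    `p_heads` (a binomial model; this independence is an assumption).
    """
    p_tails = 1 - p_heads
    return sum(
        comb(n_flips, k) * p_heads**k * p_tails ** (n_flips - k)
        for k in range(min_heads, n_flips + 1)
    )


# Probability that a fair coin produces 400 or more heads in 500 flips.
p_tail = tail_probability(500, 400)
print(float(p_tail))
```

The result is extremely small, which is the quantitative content of the first intuition: a fair coin explains such data badly. Whether that is enough to reject the fair-coin model also depends on how plausible a biased coin was to begin with.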
2.3.1 Model Selection, Likelihood and Cross Entropy

The field of statistics provides many tools for testing the validity of random models. A lot of this work shares a particular form, called maximum likelihood model selection. The goal is to select the most probable model given a large sample of measurements. Maximum likelihood selection proceeds in steps: (1) guess the definition of a random variable that might model the process; (2) evaluate the "goodness" of the model by computing the probability that the data observed could have been generated by the model; (3) after evaluating many models, retain the model that makes the data most probable. The probability of a sample $a$ under the RV $X$ is the probability of the co-occurrence of the trials in $a$,

$$\ell(a) = P_X(a) = P_X(x_1 = x_{a_1}, x_2 = x_{a_2}, \ldots). \tag{2.23}$$

The probability of a sample is usually called its likelihood and is denoted $\ell(a)$. Justification for maximum likelihood model selection is based on Bayes' law. The likelihood of a sample is really a conditional probability, $P(a \mid X)$. Bayes' law allows us to turn the conditional around and find the most likely model given the sample,

$$P(X \mid a) = \frac{P(a \mid X)\,P(X)}{P(a)}. \tag{2.24}$$

In order to compute the probability of the model given the sample, one must multiply the sample likelihood by the correcting factor $P(X)/P(a)$. The unconditioned probability of the sample, $P(a)$, could well be arbitrary, because the sample is the same for all of the models we will evaluate. The prior probability of the model, $P(X)$, poses more problems. Maximum likelihood model selection assumes that all of the models that are to be evaluated are equally likely to have occurred, i.e. $P(X)$ is constant. As a result $P(X)/P(a)$ is the same for all models, and the most probable model is the one that makes the data most probable. When reliable information about the prior probability of a model is available we can use Bayes' law directly. This technique is known as maximum a posteriori model selection.
For instance, over a wide variety of experiments, we may have observed that fair coins are far more common than unfair coins. It is very implausible that any particular coin would be unfair, but not impossible. While our prior knowledge should bias us toward the conclusion that a new coin is fair, it does not determine our conclusion. The likelihood of a model together with its prior probability can be used to determine which model has the highest probability of explaining the data. We want a model that both explains the data well and is plausible.

In general, evaluating joint probability over a large number of random variables is intractable. In practice most maximum likelihood schemes assume that the different trials of $X$ are independent. The probability of co-occurrence is then the product of the probabilities of the individual trials,

$$\ell(a) = \prod_{x_a \in a} P_X(x_a).$$

Maximizing $\ell(a)$ is still a daunting process. Significant simplification can be obtained by maximizing the logarithm of $\ell$,

$$\log \ell(a) = \sum_{x_a \in a} \log P_X(x_a).$$

Log likelihood has the same maximum as $\ell(a)$, but has a much simpler derivative. There is an interesting parallel between...
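The difference between maximum likelihood and maximum a posteriori selection, and the use of the log likelihood under the independence assumption, can be sketched with two candidate coin models. Everything specific here is an assumption made for illustration: the ten-flip sample, the two candidate models, and the prior values are hypothetical and do not come from the text.

```python
import math


def log_likelihood(sample, p_heads):
    """Sum of log-probabilities of independent flips (1 = heads, 0 = tails)."""
    return sum(
        math.log(p_heads if flip == 1 else 1.0 - p_heads) for flip in sample
    )


# Hypothetical sample: 8 heads and 2 tails.
sample = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

# Hypothetical candidate models (probability of heads) and their priors.
models = {"fair": 0.5, "biased": 0.8}
priors = {"fair": 0.99, "biased": 0.01}

# Maximum likelihood: keep the model that makes the data most probable.
log_lik = {name: log_likelihood(sample, p) for name, p in models.items()}
ml_choice = max(log_lik, key=log_lik.get)

# Maximum a posteriori: add the log prior. P(a) is the same for every
# model, so it can be dropped without changing which model wins.
log_post = {name: log_lik[name] + math.log(priors[name]) for name in models}
map_choice = max(log_post, key=log_post.get)

# The log likelihood is the log of the product form of the likelihood.
lik_product = math.prod(0.8 if flip == 1 else 0.2 for flip in sample)

print(ml_choice, map_choice)
```

With these illustrative numbers the biased model has the higher likelihood, while the strong prior toward fair coins makes the fair model the maximum a posteriori choice. This mirrors the point that prior knowledge biases the conclusion without determining it: a longer run of heads would eventually outweigh the prior.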