
CHAPTER 2. PROBABILITY AND ENTROPY (Paul A. Viola, AI-TR 1548)

When one RV is a function of another RV, as in Y = F(X), Y and X are said to be dependent. Measuring X allows us to predict Y. It is also possible that two RV's are related but not directly predictable from each other. An example is a noisy voltage source that is powering a noisy current source. Actually measuring voltage tells you something about current, but it doesn't tell you everything. There is still unpredictability that arises from the current source itself. Finally, it is possible that two RV's are completely independent. For example, two different rolls of a fair die are considered independent.

Dependency can be formalized by examining the joint distribution of two RV's, P(X, Y). The joint distribution tells us about the co-occurrence of events from the RVs X and Y. It is a complete description of the random behavior of both X and Y. From the joint distribution one can compute the marginal distributions:

    P(X) = \sum_{y \in Y} P(X, Y = y)

    P(Y) = \sum_{x \in X} P(X = x, Y) .

Two variables are independent if

    P(X, Y) = P(X) P(Y) .                                          (2.5)

They are considered dependent when the joint distribution is not the product of the marginal distributions.

A closely related distribution, the conditional distribution P(Y | X), is the probability of Y if we knew X. It is defined as:

    P(Y | X) = P(X, Y) / P(X) .

Complete, functional dependence can be determined from conditional probability when it is the case that, for all x \in X,

    P(Y = F(x) | X = x) = 1 .

What is known as Bayes' Law can be concluded from the following equation:

    P(X | Y) = P(X, Y) / P(Y) = [P(X, Y) / P(X)] [P(X) / P(Y)] = P(Y | X) P(X) / P(Y) .

Bayes' Law inverts conditional probabilities. It is quite useful in situations where one would like to conclude the distribution of X from a measurement of Y, but in principle all that is known is P(Y | X).

2.2 Entropy

Entropy is a statistic that summarizes randomness.
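These identities are easy to check numerically. The following sketch is an illustration, not code from the thesis; the joint table values are made up. It computes marginal and conditional distributions from a small joint distribution and verifies Bayes' Law:

```python
# Illustrative joint distribution P(X = x, Y = y) over two binary RVs.
joint = {
    ("x0", "y0"): 0.30, ("x0", "y1"): 0.10,
    ("x1", "y0"): 0.20, ("x1", "y1"): 0.40,
}

def marginal_x(x):
    # P(X = x) = sum over y of P(X = x, Y = y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    # P(Y = y) = sum over x of P(X = x, Y = y)
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_y_given_x(y, x):
    # P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
    return joint[(x, y)] / marginal_x(x)

def cond_x_given_y(x, y):
    return joint[(x, y)] / marginal_y(y)

# Bayes' Law: P(X | Y) = P(Y | X) P(X) / P(Y)
x, y = "x0", "y1"
lhs = cond_x_given_y(x, y)
rhs = cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)
assert abs(lhs - rhs) < 1e-9
```

In this table the joint is not the product of the marginals, so X and Y are dependent, yet neither is a function of the other: exactly the "voltage source powering a current source" situation described above.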
The definition of a random variable makes no mention of how random the variable is. Is a lottery number more or less random than the roll of a die? Entropy helps us answer this question. As we will see, the more random a variable is, the more entropy it will have. Much additional material on entropy can be found in the excellent textbook by Cover and Thomas (Cover and Thomas, 1991).

Entropy in one form or another is a very old concept. Its origins clearly date back to the first work on thermodynamics in the last century. Nonetheless, most of the credit for defining entropy and promoting its use in data analysis and engineering falls to Shannon (Shannon, 1948). The most straightforward definition of entropy is as an expectation:

    H(X) = -E_X[\log P(X)] = -\sum_{x_i \in X} \log(P(X = x_i)) P(X = x_i) ,

where we define 0 \log 0 = 0 here and elsewhere in the thesis. The classical definition of entropy applies only to discrete random variables. We will present the definition of continuous entropy, known as differential entropy, later. Entropy is typically defined in terms of the logarithm base 2. In that case entropy is given in units of bits.

Entropy is Code Length

One way of measuring randomness is to compose the shortest message that describes either one or a number of trials of an RV. A trial of a fair coin takes one bit of information to encode: a 1 for heads and a 0 for tails. There is no more efficient technique for encoding a single trial[2]. This restriction does not apply to a message that describes a sample of many trials. If the coin in question comes up tails only one time in a thousand, there are a number of straightforward schemes for encoding a sample that require less than one bit per trial. For instance, one could send the length of the sample, N_a, and then the positions of the zeros. It would take \log N_a bits to send the length and \log N_a bits to send the position of each zero.
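As a concrete check of this definition, here is a short sketch (an illustration, not from the thesis) that computes H(X) in bits, using the convention 0 log 0 = 0:

```python
import math

def entropy(probs):
    # H(X) = -sum_i p_i log2 p_i, skipping zero-probability events
    # to implement the convention 0 log 0 = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # 1 bit per trial
certain   = entropy([1.0])           # 0 bits: no randomness at all
biased    = entropy([0.999, 0.001])  # about 0.011 bits per trial
```

The heavily biased coin (tails one time in a thousand) carries far less than one bit per trial, which is exactly why the coding schemes discussed next can beat one bit per flip.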
The length of a message describing N_a flips of our coin would on average be

    E[length_a] = \log N_a + P(X = 0) N_a \log N_a .

In this coding scheme a message describing one thousand trials will on average take about 20 bits. The number of bits it takes to encode a sample depends both on the number of events and on the distribution of the random variable. A discussion and comparison of coding schemes could take up quite a lot of space. Luckily, using the Kraft inequality it can be proven that on average one needs to send at least H(X) bits to communicate a trial of the random variable X. Furthermore, Shannon showed that it is possible to construct a code that will take at most H(X) + 1 bits on average. A simple algorithm discovered by Huffman can construct the shortest possible codes for any random variable. Because entropy is a bound on the code length that is required to transmit a trial, entropy is often called information.

[2] Provided the coin doesn't have two heads.

Conditional Entropy, Joint Entropy, and Mutual Information

The concept of mutual information plays a critical role in this thesis. One...
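The arithmetic behind the "about 20 bits" claim, and its comparison with the entropy bound N_a H(X), can be sketched as follows (illustrative code, not from the thesis):

```python
import math

N = 1000        # number of trials; tails comes up once per thousand on average
p_zero = 1 / N  # P(X = 0)

# Naive scheme from the text: send the sample length (log2 N bits),
# then the position of each zero (log2 N bits apiece, one zero expected).
expected_len = math.log2(N) + p_zero * N * math.log2(N)  # about 19.9 bits

# Entropy lower bound on the average total message length: N * H(X).
h = -(p_zero * math.log2(p_zero) + (1 - p_zero) * math.log2(1 - p_zero))
entropy_bound = N * h                                    # about 11.4 bits
```

The naive position-coding scheme already needs only ~0.02 bits per trial instead of 1, though it is still roughly a factor of two away from the entropy bound that Huffman-style codes can approach.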