When one RV is a function of another RV, as in Y = F(X), Y and X are said to be dependent. Measuring X allows us to predict Y. It is also possible that two RV's are related but not directly predictable from each other. An example is a noisy voltage source that is powering a noisy current source. Measuring voltage tells you something about current, but it doesn't tell you everything. There is still unpredictability that arises from the current source itself. Finally, it is possible that two RV's are completely independent. For example, two different rolls of a fair die are considered independent.
Dependency can be formalized by examining the joint distribution of two RV's, P(X, Y). The joint distribution tells us about the co-occurrence of events from the RVs X and Y. It is a complete description of the random behavior of both X and Y. From the joint distribution one can compute the marginal distributions:

    P(X = x) = \sum_{y \in Y} P(X = x, Y = y)

    P(Y = y) = \sum_{x \in X} P(X = x, Y = y)

Paul A. Viola    CHAPTER 2. PROBABILITY AND ENTROPY

Two variables are independent if

    P(X, Y) = P(X) P(Y) .    (2.5)

They are considered dependent when the joint distribution is not the product of the marginal distributions. A closely related distribution, the conditional distribution, P(Y | X), is the probability of Y if we knew X. It is defined as:

    P(Y | X) = \frac{P(X, Y)}{P(X)} .
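The relationships above can be sketched numerically. The following is a minimal illustration, not from the thesis; the joint distribution is invented for the example:

```python
import numpy as np

# A toy joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1, 2}.
# The numbers are invented for illustration.
P_XY = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

# Marginals: sum the joint over the other variable.
P_X = P_XY.sum(axis=1)   # P(X = x) = sum_y P(X = x, Y = y)
P_Y = P_XY.sum(axis=0)   # P(Y = y) = sum_x P(X = x, Y = y)

# Independence holds when P(X, Y) = P(X) P(Y) for every pair (x, y).
independent = np.allclose(P_XY, np.outer(P_X, P_Y))

# Conditional distribution: P(Y | X) = P(X, Y) / P(X).
# Each row is a distribution over Y and therefore sums to one.
P_Y_given_X = P_XY / P_X[:, None]

print(P_X, P_Y, independent)
```

For this particular joint the product of the marginals does not reproduce P(X, Y), so the two variables are dependent.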
Complete, functional dependence can be determined from conditional probability: it holds when, for all x \in X,

    P(Y = F(x) | X = x) = 1 .
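This condition can be verified directly for a deterministic relationship. Below is a small sketch; the function F and the distribution over X are invented for illustration:

```python
# When Y = F(X) deterministically, the conditional P(Y | X = x) puts all
# of its mass on F(x). Here F(x) = x**2 over X in {0, 1, 2}.
F = lambda x: x ** 2
xs = [0, 1, 2]
P_X = {0: 0.2, 1: 0.5, 2: 0.3}

# Build the joint P(X = x, Y = y): mass only where y == F(x).
ys = sorted({F(x) for x in xs})
P_XY = {(x, y): (P_X[x] if y == F(x) else 0.0) for x in xs for y in ys}

# P(Y = F(x) | X = x) = P(x, F(x)) / P(x) = 1 for every x.
for x in xs:
    assert P_XY[(x, F(x))] / P_X[x] == 1.0
```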
What is known as Bayes' Law can be concluded from the following equation:

    P(X | Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(Y | X) P(X)}{P(Y)} .
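This inversion can be checked numerically on a small diagnosis-style example (all numbers here are invented for illustration): X is a binary state and Y a binary measurement, with P(X) and P(Y | X) given.

```python
# Prior over the state X, and the conditional P(Y = y | X = x).
P_X = {0: 0.99, 1: 0.01}
P_Y_given_X = {(0, 0): 0.95, (0, 1): 0.05,
               (1, 0): 0.10, (1, 1): 0.90}

# P(Y = 1) by marginalizing the joint P(X, Y) = P(Y | X) P(X).
P_Y1 = sum(P_Y_given_X[(x, 1)] * P_X[x] for x in P_X)

# Bayes' Law: P(X = 1 | Y = 1) = P(Y = 1 | X = 1) P(X = 1) / P(Y = 1).
P_X1_given_Y1 = P_Y_given_X[(1, 1)] * P_X[1] / P_Y1

print(round(P_X1_given_Y1, 3))  # prints 0.154
```

Even though the measurement is quite reliable, the low prior on X = 1 keeps the posterior small, which is exactly the kind of inversion Bayes' Law makes explicit.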
Bayes' Law inverts conditional probabilities. It is quite useful in situations where one would
like to conclude the distribution of X from a measurement of Y , but in principle all that is
known is P(Y | X).

2.2 Entropy
Entropy is a statistic that summarizes randomness. The definition of a random variable makes no mention of how random the variable is. Is a lottery number more or less random than the roll of a die? Entropy helps us answer this question. As we will see, the more random a variable is, the more entropy it will have. Much additional material on entropy can be found in the excellent textbook by Cover and Thomas (Cover and Thomas, 1991).
2.2. ENTROPY    AITR 1548

Entropy in one form or another is a very old concept. Its origins clearly date back to the first work on thermodynamics in the last century. Nonetheless, most of the credit for defining entropy and promoting its use in data analysis and engineering falls to Shannon (Shannon, 1948). The most straightforward definition of entropy is as an expectation:
    H(X) = -E_X[\log P(X)] = -\sum_{x_i \in X} P(X = x_i) \log P(X = x_i) ,

where we define 0 \log 0 = 0 here and elsewhere in the thesis. The classical definition of entropy applies only to discrete random variables. We will present the definition of continuous entropy, known as differential entropy, later. Entropy is typically defined in terms of the logarithm base 2. In that case entropy is given in units of bits.

Entropy is Code Length
One way of measuring randomness is to compose the shortest message that describes either one or a number of trials of an RV. A trial of a fair coin takes one bit of information to encode: a 1 for heads and a 0 for tails. There is no more efficient technique for encoding a single trial². This restriction does not apply to a message that describes a sample of many trials. If the coin in question comes up tails only one time in a thousand, there are a number of straightforward schemes for encoding a sample that require less than one bit per trial. For instance one could send the length of the sample, N_a, and then the positions of the zeros. It would take \log N_a bits to send the length and \log N_a bits to send the position of each zero. The length of a message describing N_a flips of our coin would on average be

    E[length_a] = \log N_a + P(X = 0) N_a \log N_a .
In this coding scheme a message describing one thousand trials will on average take about
20 bits. The number of bits it takes to encode a sample is dependent both on the number of
events and the distribution of the random variable.
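The numbers above can be reproduced with a short calculation. The sketch below computes the entropy of the biased coin from the text (tails with probability 1/1000) and the expected length of the simple positions-of-zeros scheme for one thousand trials:

```python
import math

# The biased coin from the text: tails with probability 1/1000.
p_tails = 1.0 / 1000.0
p_heads = 1.0 - p_tails

def H(ps):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

h = H([p_heads, p_tails])  # bits per trial

N = 1000
# E[length] = log N + P(X = 0) * N * log N, as in the text.
scheme_bits = math.log2(N) + p_tails * N * math.log2(N)

print(round(h, 4))            # bits per trial, about 0.0114
print(round(h * N, 1))        # about 11.4 bits for the whole sample
print(round(scheme_bits, 1))  # about 19.9 bits for the simple scheme
```

The simple scheme's roughly 20 bits agrees with the text, while the entropy bound shows that a better code could describe the same thousand trials in about 11 bits.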
A discussion and comparison of coding schemes could take up quite a lot of space. Luckily, using the Kraft inequality it can be proven that on average one needs to send H(X) bits to communicate a trial of the random variable X. Furthermore, Shannon showed that it is possible to construct a code that will take at most H(X) + 1 bits on average. A simple algorithm discovered by Huffman can construct the shortest possible codes for any random variable.

² Provided the coin doesn't have two heads.
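Huffman's algorithm repeatedly merges the two least probable nodes until one tree remains, then reads codewords off the merge tree. The sketch below is a minimal version of that idea, not the thesis's code; the distribution is invented for illustration:

```python
import heapq

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    # Heap entries carry a tie-breaking counter so dicts are never compared.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Prepend a bit distinguishing the two merged subtrees.
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(avg_len)  # prints 1.75
```

For this dyadic distribution the average code length, 1.75 bits, exactly equals H(X), which illustrates why entropy is the right yardstick for code length.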
Because entropy is a bound on the code length that is required to transmit a trial, entropy is often called information.

Conditional Entropy, Joint Entropy, and Mutual Information
The concept of mutual information plays a critical role in this thesis. One...