One of the key problems that we will need to solve is, "How likely is it that the random variable Y is functionally dependent on X?" In Section 2.1 we saw that two RV's were independent if and only if their joint density was the product of their marginal densities (see Equation 2.5). Entropy will allow us to quantify the extent to which two RV's are dependent.
Quantifying dependence is very much like quantifying randomness. Total dependence implies that a measurement of one RV completely determines the other, i.e. knowledge of X removes any randomness from Y. Independence is just the opposite, i.e. knowledge of X does not help you predict Y. Just as joint and conditional distributions relate the co-occurrences of two RV's, entropy can be used to relate the predictability of two RV's.
Conditional entropy and joint entropy are defined as:

    H(Y | X) ≡ E_X[ E_Y[ -log P(Y | X) ] ]

and

    H(Y, X) ≡ E_X[ E_Y[ -log P(Y, X) ] ] .
Conditional entropy is a measure of the randomness of Y given knowledge of X. Note that it is an expectation over the different events of X, so it measures on average just how random Y is given X. H(Y | X = x) is the randomness one expects from Y if X takes on a particular value. Random variables are considered independent when

    H(Y | X) = H(Y)

or, equivalently,

    H(X, Y) = H(X) + H(Y) .
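For concreteness, these relationships can be checked numerically. The following Python sketch uses a hypothetical two-by-two joint distribution, chosen for illustration so that X and Y are independent, and computes H(Y | X) directly as an expectation over the events of X:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p*log(p) in nats, skipping zero cells."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Hypothetical joint distribution P(X, Y), built so X and Y are independent:
# P(x, y) = P(x) * P(y) with P(X) = (0.3, 0.7) and P(Y) = (0.5, 0.5).
joint = np.outer([0.3, 0.7], [0.5, 0.5])
p_x = joint.sum(axis=1)  # marginal P(X)
p_y = joint.sum(axis=0)  # marginal P(Y)

# H(Y | X) = E_X[H(Y | X = x)]: average the entropy of each conditional row.
h_y_given_x = sum(p_x[i] * entropy(joint[i] / p_x[i]) for i in range(len(p_x)))

# Independence: H(Y | X) = H(Y) and H(X, Y) = H(X) + H(Y).
print(np.isclose(h_y_given_x, entropy(p_y)))                    # True
print(np.isclose(entropy(joint), entropy(p_x) + entropy(p_y)))  # True
```

Replacing the outer product by any joint table that does not factor makes both checks fail, which is exactly the dependence that entropy lets us quantify.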
2.2. ENTROPY                                                        AITR 1548

As Y becomes more dependent on X, H(Y | X) gets smaller. However, conditional entropy by itself is not a measure of dependency. A small value for H(Y | X) may not imply dependence; it may only imply that H(Y) is small. The mutual information (MI) between two random variables is given by

    I(X; Y) = H(Y) - H(Y | X) .                                    (2.6)

I(X; Y) is a measure of the reduction in the entropy of Y given X.
A number of simple logarithm equalities can be used to prove relations between conditional
and joint entropy. For instance, conditional entropy can be expressed in terms of marginal
and joint entropies:
    H(Y | X) = H(X, Y) - H(X) .
This allows us to provide three equivalent expressions for mutual information and a useful identity:

    I(X; Y) = H(Y) - H(Y | X)                                      (2.7)
            = H(X) + H(Y) - H(X, Y)                                (2.8)
            = H(X) - H(X | Y)                                      (2.9)
            = I(Y; X) .                                            (2.10)
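These equivalent expressions can be verified numerically. The sketch below uses a hypothetical dependent joint distribution; the table and variable names are illustrative choices, not from the text:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, skipping zero cells."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Hypothetical dependent joint distribution P(X, Y): X and Y tend to agree.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
h_y_given_x = h_xy - h_x  # conditional entropy via H(Y|X) = H(X,Y) - H(X)
h_x_given_y = h_xy - h_y

mi = h_y - h_y_given_x                    # Equation 2.7
print(np.isclose(mi, h_x + h_y - h_xy))   # Equation 2.8: True
print(np.isclose(mi, h_x - h_x_given_y))  # Equation 2.9: True
print(mi > 0)                             # dependent, so I(X; Y) > 0: True
```

The symmetry of Equation 2.8 in X and Y is what gives the identity I(X; Y) = I(Y; X).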
An extremely useful inequality on expectations, known as Jensen's inequality, allows us to prove that for any concave function F,

    E[F(X)] ≤ F(E[X]) .

A function is concave when its second derivative is negative everywhere. Using the fact that the logarithm function is concave, Jensen's inequality allows us to prove the following useful inequalities:

    H(X) ≥ 0                                                       (2.11)
    H(Y) ≥ H(Y | X)                                                (2.12)
    I(X; Y) ≥ 0                                                    (2.13)
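Jensen's inequality for the concave logarithm can be checked empirically; the sample size and range below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=100_000)  # positive samples, so log is defined

# Jensen's inequality for the concave logarithm: E[log X] <= log(E[X]).
lhs = float(np.mean(np.log(x)))
rhs = float(np.log(np.mean(x)))
print(lhs <= rhs)  # True
```

Equality holds only when X is degenerate; for any non-constant positive X the gap is strict, which is the fact behind the three entropy inequalities above.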
Paul A. Viola                           CHAPTER 2. PROBABILITY AND ENTROPY

2.2.1 Differential Entropy
While a number of the main theorems of entropy apply both to continuous and discrete distributions, a number of other theorems change significantly. The continuous version of entropy is called differential entropy, and is defined as:

    h(X) ≡ E_X[-log p(X)] = - ∫_{-∞}^{∞} p(x) log p(x) dx .       (2.14)
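As a sketch of this definition, h(X) can be estimated by Monte Carlo, averaging -log p(X) over samples from the density. The Gaussian and its standard deviation below are illustrative choices; the closed form 0.5 log(2 π e σ²) is the standard differential entropy of a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)

# h(X) = -E[log p(X)], estimated by averaging -log p over the samples.
log_p = -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
h_est = float(np.mean(-log_p))

# Closed form for a Gaussian: h = 0.5 * log(2 * pi * e * sigma^2).
h_true = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)
print(np.isclose(h_est, h_true, atol=0.01))  # True
```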
For the most part differential entropies can be manipulated, and obey the same identities, as entropy. In fact all of the equalities and inequalities of the previous section hold except for (2.11). Throughout the thesis, when entropy is mentioned, it is to be understood as the applicable form of entropy. When the difference matters we will be explicit.
The most perplexing difference between entropy and differential entropy is that there is no longer a direct relationship between h(X) and code length. It is possible to construct examples where differential entropy is negative. This is an implication of the fact that p(X) can take on values greater than 1. Code length, however, is never negative.
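A minimal sketch of such an example, assuming a uniform density on [0, w]: its differential entropy is log(w), which is negative whenever w < 1, precisely because the density 1/w then exceeds 1 everywhere on its support:

```python
import numpy as np

def uniform_diff_entropy(w, n=100_000):
    """Riemann-sum integral of -p(x) log p(x) for the uniform density on [0, w]."""
    dx = w / n
    p = 1.0 / w  # the density is constant, and exceeds 1 whenever w < 1
    return float(np.sum(np.full(n, -p * np.log(p)) * dx))

# For a uniform density on [0, w], h(X) = log(w), negative for w < 1.
h = uniform_diff_entropy(0.1)
print(np.isclose(h, np.log(0.1)))  # True
print(h < 0)                       # True
```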
Differential entropy does not provide an absolute measure of randomness. Discouragingly, it is even the case that a density with a differential entropy of negative infinity may still be unpredictable. Examples of this sort can be constructed by embedding a discrete process into a continuous space. For example one could model the roll of a die as a continuous RV. The density would then be made up of a series of delta functions centered at the points one through six. A delta function, often called a Dirac delta function, can be defined from a box car function,

    b(x; x_low, x_high) = 1/(x_high - x_low)   if x_low ≤ x ≤ x_high       (2.15)
                          0                    otherwise.
The box car function is defined so that it integrates to one. The delta function is a box car function in the limit as it approaches zero width,

    δ(x) = lim_{ε→0} b(x; 0, ε) .                                  (2.16)

The delta function, because it is a box car, integrates to one. It can be shown that

    f(x_0) = ∫_{-∞}^{∞} δ(x_0 - x) f(x) dx .                       (2.17)
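Both properties, unit area and sifting, can be checked with a narrow box car standing in for the delta function; the grid spacing, widths, test function, and tolerances below are arbitrary illustrative choices:

```python
import numpy as np

def boxcar(x, lo, hi):
    """Box car density of Equation 2.15: 1/(hi - lo) on [lo, hi], zero elsewhere."""
    return np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)

xs = np.linspace(-1.0, 1.0, 200_001)  # grid with spacing 1e-5
dx = xs[1] - xs[0]

# The box car integrates to one regardless of its width.
area = float(np.sum(boxcar(xs, 0.0, 0.2)) * dx)
print(np.isclose(area, 1.0, atol=0.01))  # True

# Sifting property (Equation 2.17), approximated with a narrow box car:
# the integral of b(x0 - x; 0, eps) f(x) dx approaches f(x0) as eps shrinks.
f = np.cos
x0, eps = 0.5, 1e-2
approx = float(np.sum(boxcar(x0 - xs, 0.0, eps) * f(xs)) * dx)
print(np.isclose(approx, f(x0), atol=0.01))  # True
```

Shrinking eps further (with a correspondingly finer grid) drives the approximation error toward zero, which is the limit that defines the delta function.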
Furthermore, from the definition of convolution,

    (f ∗ g)(x) ≡ ∫_{-∞}^{∞} f(x - x_0) g(x_0) dx_0 ,              (2.18)

we can see that the delta function is the identity operator,

    (f ∗ δ)(x) = ∫_{-∞}^{∞} δ(x - x_0) f(x_0) dx_0 = f(x) .       (2.19)
x , x0f x0dx0 = f x : 2.19 ,1 ,1 The density of the continuous model of die rolling can be formulated as px = 6
X1
x , i
i=1 6 which will integrate to 1. Furthermore if we de ne the probability of an event...