1995_Viola_thesis_registrationMI

One of the key problems that we will need to solve is, "How likely is it that the random variable Y is functionally dependent on X?" In Section 2.1 we saw that two RV's were independent if and only if their joint density was the product of their marginal densities (see Equation 2.5). Entropy will allow us to quantify the extent to which two RV's are dependent.

Quantifying dependence is very much like quantifying randomness. Total dependence implies that a measurement of one RV completely determines the other, i.e. knowledge of X removes any randomness from Y. Independence is just the opposite, i.e. knowledge of X does not help you predict Y. Just as joint and conditional distributions relate the co-occurrences of two RV's, entropy can be used to relate the predictability of two RV's. Conditional entropy and joint entropy are defined as:

    H(Y | X) \equiv -E_X[ E_Y[ \log P(Y | X) ] ]   and   H(Y, X) \equiv -E_X[ E_Y[ \log P(Y, X) ] ] .

Conditional entropy is a measure of the randomness of Y given knowledge of X. Note that it is an expectation over the different events of X, so it measures on average just how random Y is given X. H(Y | X = x) is the randomness one expects from Y if X takes on a particular value. Random variables are considered independent when

    H(Y | X) = H(Y) ,   or   H(X, Y) = H(X) + H(Y) .

As Y becomes more dependent on X, H(Y | X) gets smaller. However, conditional entropy by itself is not a measure of dependency. A small value for H(Y | X) need not imply dependence; it may only imply that H(Y) is small. The mutual information (MI) between two random variables is given by

    I(X; Y) \equiv H(Y) - H(Y | X) .                                        (2.6)

I(X; Y) is a measure of the reduction in the entropy of Y given X. A number of simple logarithm equalities can be used to prove relations between conditional and joint entropy. For instance, conditional entropy can be expressed in terms of marginal and joint entropies:

    H(Y | X) = H(X, Y) - H(X) .

This allows us to provide three equivalent expressions for mutual information and a useful identity:

    I(X; Y) = H(Y) - H(Y | X)                                               (2.7)
            = H(X) + H(Y) - H(X, Y)                                         (2.8)
            = H(X) - H(X | Y)                                               (2.9)
            = I(Y; X) .                                                     (2.10)

An extremely useful inequality on expectations, known as Jensen's inequality, allows us to prove that for any concave function F,

    E[ F(X) ] \le F( E[X] ) .

A function is concave when its second derivative is negative everywhere. Using the fact that the logarithm function is concave, Jensen's inequality allows us to prove the following useful inequalities:

    H(X) \ge 0                                                              (2.11)
    H(Y) \ge H(Y | X)                                                       (2.12)
    I(X; Y) \ge 0                                                           (2.13)
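The identities and inequalities above are easy to check numerically. Below is a minimal Python sketch (my own illustration, not part of the thesis; the joint table P_xy and the helper entropy are assumed names and values) that computes the marginal, joint, and conditional entropies of a small discrete joint distribution and verifies Equations 2.6 through 2.13.

    import numpy as np

    # An illustrative 2x3 joint distribution P(X, Y); any non-negative
    # table summing to one would do (these numbers are not from the thesis).
    P_xy = np.array([[0.10, 0.25, 0.15],
                     [0.20, 0.05, 0.25]])

    P_x = P_xy.sum(axis=1)   # marginal P(X)
    P_y = P_xy.sum(axis=0)   # marginal P(Y)

    def entropy(p):
        # H = -sum p log p, skipping zero-probability cells
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    H_x  = entropy(P_x)
    H_y  = entropy(P_y)
    H_xy = entropy(P_xy.ravel())

    H_y_given_x = H_xy - H_x     # H(Y | X) = H(X, Y) - H(X)
    I_xy = H_y - H_y_given_x     # Equation 2.6: I(X; Y) = H(Y) - H(Y | X)

    # Equations 2.7-2.10: the alternative expressions agree
    # (note H(X | Y) = H(X, Y) - H(Y))
    assert np.isclose(I_xy, H_x + H_y - H_xy)
    assert np.isclose(I_xy, H_x - (H_xy - H_y))
    # Equations 2.11-2.13: the Jensen-derived inequalities hold
    assert H_x >= 0 and H_y >= H_y_given_x and I_xy >= 0
    print(H_x, H_y, H_xy, H_y_given_x, I_xy)

Since natural logarithms are used, the printed entropies are in nats; substituting np.log2 would give bits.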
2.2.1 Differential Entropy

While a number of the main theorems of entropy apply both to continuous and discrete distributions, a number of other theorems change significantly. The continuous version of entropy is called differential entropy, and is defined as:

    h(X) \equiv -E_X[ \log p(X) ] = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx .          (2.14)

For the most part differential entropies can be manipulated, and obey the same identities, as entropy. In fact all of the equalities and inequalities of the previous section hold except for (2.11). Throughout the thesis, when entropy is mentioned, it is to be understood as the applicable form of entropy. When the difference matters we will be explicit.

The most perplexing difference between entropy and differential entropy is that there is no longer a direct relationship between h(X) and code length. It is possible to construct examples where differential entropy is negative. This is an implication of the fact that p(X) can take on values greater than 1. Code length, however, is never negative. Differential entropy does not provide an absolute measure of randomness. Discouragingly, it is even the case that a density with a differential entropy of negative infinity may still be unpredictable. Examples of this sort can be constructed by embedding a discrete process into a continuous space. For example, one could model the roll of a die as a continuous RV. The density would then be made up of a series of delta functions centered at the points one through six. A delta function, often called a Dirac delta function, can be defined from a box car function,

    b(x, x_{low}, x_{high}) = \begin{cases} \frac{1}{x_{high} - x_{low}} & \text{if } x_{low} \le x \le x_{high} \\ 0 & \text{otherwise} \end{cases}          (2.15)

The box car function is defined so that it integrates to one. The delta function is a box car function in the limit as it approaches zero width,

    \delta(x) = \lim_{\epsilon \to 0} b(x, 0, \epsilon) .                                   (2.16)

The delta function, because it is a box car, integrates to one. It can be shown that

    f(x_0) = \int_{-\infty}^{\infty} \delta(x_0 - x) f(x) \, dx .                           (2.17)

Furthermore, from the definition of convolution,

    (f * g)(x) \equiv \int_{-\infty}^{\infty} f(x - x') g(x') \, dx' ,                      (2.18)

we can see that the delta function is the identity operator,

    (\delta * f)(x) = \int_{-\infty}^{\infty} \delta(x - x') f(x') \, dx' = f(x) .          (2.19)

The density of the continuous model of die rolling can be formulated as

    p(x) = \sum_{i=1}^{6} \frac{1}{6} \delta(x - i) ,

which will integrate to 1. Furthermore, if we define the probability of an event...
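A short numerical illustration of this point (my own sketch, not from the thesis; the function name boxcar_die_entropy, the width parameter eps, and the grid size are all made up for the example) approximates each delta function by a box car of width eps. Each box carries probability 1/6, so the density still integrates to one, yet its differential entropy is log(6 eps), which heads toward negative infinity as eps shrinks even though the outcome of the roll stays exactly as unpredictable as before (discrete entropy log 6).

    import numpy as np

    def boxcar_die_entropy(eps, n_grid=400_000):
        # Differential entropy of p(x) = sum_i (1/6) b(x; i, i + eps),
        # a box-car approximation to six delta functions at x = 1, ..., 6.
        x = np.linspace(0.0, 7.0, n_grid)
        dx = x[1] - x[0]
        p = np.zeros_like(x)
        for i in range(1, 7):
            # each box has height 1/(6 eps) and so integrates to 1/6
            p[(x >= i) & (x < i + eps)] = 1.0 / (6.0 * eps)
        mask = p > 0
        return -np.sum(p[mask] * np.log(p[mask])) * dx   # h(X) = -integral of p log p

    for eps in (1e-1, 1e-2, 1e-3):
        # Analytically h = log(6 * eps), which is unbounded below as eps -> 0,
        # while the discrete entropy of which face shows stays at log 6 (about 1.79 nats).
        print(eps, boxcar_die_entropy(eps), np.log(6.0 * eps))

This is the sense in which differential entropy fails to give an absolute measure of randomness: it can be driven arbitrarily negative without making the underlying process any easier to predict.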