

ECE 5620                    Homework #5: Solutions                    Spring 2011

6. Monotonicity of entropy per element. For a stationary stochastic process X_1, X_2, ..., X_n, show that

(a)
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \qquad (4.51)$$

(b)
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \ge H(X_n \mid X_{n-1}, \ldots, X_1). \qquad (4.52)$$

Solution: Monotonicity of entropy per element.

(a) By the chain rule for entropy,
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} = \frac{\sum_{i=1}^{n} H(X_i \mid X^{i-1})}{n} \qquad (4.53)$$
$$= \frac{H(X_n \mid X^{n-1}) + \sum_{i=1}^{n-1} H(X_i \mid X^{i-1})}{n} \qquad (4.54)$$
$$= \frac{H(X_n \mid X^{n-1}) + H(X_1, X_2, \ldots, X_{n-1})}{n}. \qquad (4.55)$$
From stationarity it follows that for all 1 ≤ i ≤ n,
$$H(X_n \mid X^{n-1}) \le H(X_i \mid X^{i-1}),$$
which further implies, by averaging both sides, that
$$H(X_n \mid X^{n-1}) \le \frac{\sum_{i=1}^{n-1} H(X_i \mid X^{i-1})}{n-1} \qquad (4.56)$$
$$= \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \qquad (4.57)$$
Combining (4.55) and (4.57) yields
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{1}{n}\left[\frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1} + H(X_1, X_2, \ldots, X_{n-1})\right] = \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \qquad (4.58)$$

(b) By stationarity we have, for all 1 ≤ i ≤ n,
$$H(X_n \mid X^{n-1}) \le H(X_i \mid X^{i-1}),$$
which implies that
$$H(X_n \mid X^{n-1}) = \frac{\sum_{i=1}^{n} H(X_n \mid X^{n-1})}{n} \qquad (4.59)$$
$$\le \frac{\sum_{i=1}^{n} H(X_i \mid X^{i-1})}{n} \qquad (4.60)$$
$$= \frac{H(X_1, X_2, \ldots, X_n)}{n}. \qquad (4.61)$$

7. Entropy rates of Markov chains.

(a) Find the entropy rate of the two-state Markov chain with transition matrix
$$P = \begin{pmatrix} 1-p_{01} & p_{01} \\ p_{10} & 1-p_{10} \end{pmatrix}.$$

(b) What values of p_{01}, p_{10} maximize the rate of part (a)?

(c) Find the entropy rate of the two-state Markov chain with transition matrix
$$P = \begin{pmatrix} 1-p & p \\ 1 & 0 \end{pmatrix}.$$

Solution: Entropy rates of Markov chains.

(a) The stationary distribution is easily calculated (see EIT pp. 62–63):
$$\mu_0 = \frac{p_{10}}{p_{01}+p_{10}}, \qquad \mu_1 = \frac{p_{01}}{p_{01}+p_{10}}.$$
Therefore the entropy rate is
$$H(X_2 \mid X_1) = \mu_0 H(p_{01}) + \mu_1 H(p_{10}) = \frac{p_{10} H(p_{01}) + p_{01} H(p_{10})}{p_{01}+p_{10}}.$$

(b) The entropy rate is at most 1 bit because the process has only two states. This rate is achieved if (and only if) p_{01} = p_{10} = 1/2, in which case the process is actually i.i.d. with Pr(X_i = 0) = Pr(X_i = 1) = 1/2.

(c) As a special case of the general two-state Markov chain, the entropy rate is
$$H(X_2 \mid X_1) = \mu_0 H(p) + \mu_1 H(1) = \frac{H(p)}{p+1}.$$

(d) By straightforward calculus, we find that the maximum value of the entropy rate of part (c) occurs for p = (3 − √5)/2 ≈ 0.382. The maximum value is
$$H(p) = H(1-p) = H\!\left(\frac{\sqrt{5}-1}{2}\right) \approx 0.694 \text{ bits}.$$
Note that (√5 − 1)/2 ≈ 0.618 is (the reciprocal of) the Golden Ratio.

(e) The Markov chain of part (c) forbids consecutive ones. Consider any allowable sequence of symbols of length t. If the first symbol is 1, then the next symbol must be 0, and the remaining t − 2 symbols can form any allowable sequence, giving N(t − 2) possibilities. If the first symbol is 0, then the remaining t − 1 symbols can form any allowable sequence, giving N(t − 1) possibilities. So the number of allowable sequences of length t satisfies the recurrence
$$N(t) = N(t-1) + N(t-2), \qquad N(1) = 2, \quad N(2) = 3.$$
(The initial conditions are obtained by observing that for t = 2 only the sequence 11 is not allowed. We could also choose N(0) = 1 as an initial condition, since there is exactly one allowable sequence of length 0, namely, the empty sequence.) The sequence N(t) grows exponentially, that is, N(t) ≈ cλ^t, where λ is the maximum-magnitude solution of the characteristic equation
$$1 = z^{-1} + z^{-2}.$$
Solving the characteristic equation yields λ = (1 + √5)/2, the Golden Ratio. (The sequence {N(t)} is the sequence of Fibonacci numbers.) Therefore
$$H_0 = \lim_{t\to\infty} \frac{1}{t} \log N(t) = \log\frac{1+\sqrt{5}}{2} \approx 0.694 \text{ bits}.$$
Since there are only N(t) possible outcomes for X_1, ..., X_t, an upper bound on H(X_1, ..., X_t) is log N(t), and so the entropy rate of the Markov chain of part (c) is at most H_0. In fact, we saw in part (d) that this upper bound can be achieved.
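The closed-form answers in parts (c)–(e) are easy to sanity-check numerically. The short Python sketch below is not part of the original solution (the helper names H, rate, and N are ours): it evaluates the entropy rate H(p)/(p+1), locates its maximizer by a coarse grid search, and confirms that the Fibonacci-type count N(t) grows at roughly log((1+√5)/2) ≈ 0.694 bits per symbol.

```python
import math

def H(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(p):
    """Entropy rate H(p)/(p+1) of the chain in part (c)."""
    return H(p) / (p + 1)

# Locate the maximizer of H(p)/(p+1) by a coarse grid search.
best_p = max((i / 10_000 for i in range(1, 10_000)), key=rate)
print(best_p, rate(best_p))                # about 0.382 and about 0.694 bits

# Closed-form answers from parts (d) and (e).
print((3 - math.sqrt(5)) / 2)              # 0.3819...
print(math.log2((1 + math.sqrt(5)) / 2))   # 0.6942..., log2 of the Golden Ratio

# Fibonacci-type count of binary sequences with no two consecutive ones:
# N(1) = 2, N(2) = 3, N(t) = N(t-1) + N(t-2), so (1/t) log2 N(t) -> log2(Golden Ratio).
N = {1: 2, 2: 3}
for t in range(3, 101):
    N[t] = N[t - 1] + N[t - 2]
print(math.log2(N[100]) / 100)             # about 0.696, approaching 0.694 from above
```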
Pairwise independence. Let X_1, X_2, ..., X_{n−1} be i.i.d. random variables taking values in {0, 1}, with Pr{X_i = 1} = 1/2. Let X_n = 1 if $\sum_{i=1}^{n-1} X_i$ is odd and X_n = 0 otherwise. Let n ≥ 3.

(a) Show that X_i and X_j are independent, for i ≠ j, i, j ∈ {1, 2, ..., n}.

(b) Find H(X_i, X_j), for i ≠ j.

(c) Find H(X_1, X_2, ..., X_n). Is this equal to nH(X_1)?

Solution: Pairwise independence. X_1, X_2, ..., X_{n−1} are i.i.d. Bernoulli(1/2) random variables. We will first prove that for any k ≤ n − 1, the probability that $\sum_{i=1}^{k} X_i$ is odd is 1/2. We will prove this by induction. Clearly this is true for k = 1. Assume that it is true for k − 1. Let $S_k = \sum_{i=1}^{k} X_i$. Then
$$P(S_k \text{ odd}) = P(S_{k-1} \text{ odd})P(X_k = 0) + P(S_{k-1} \text{ even})P(X_k = 1) \qquad (4.64)$$
$$= \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{1}{2} \qquad (4.65)$$
$$= \frac{1}{2}. \qquad (4.66)$$
Hence for all k ≤ n − 1, the probability that S_k is odd is equal to the probability that it is even. Hence,
$$P(X_n = 1) = P(X_n = 0) = \frac{1}{2}. \qquad (4.67)$$

(a) It is clear that when i and j are both less than n, X_i and X_j are independent. The only possible problem is when j = n. Taking i = 1 without loss of generality,
$$P(X_1 = 1, X_n = 1) = P\Big(X_1 = 1, \; \sum_{i=2}^{n-1} X_i \text{ even}\Big) \qquad (4.68)$$
$$= P(X_1 = 1)\,P\Big(\sum_{i=2}^{n-1} X_i \text{ even}\Big) \qquad (4.69)$$
$$= \frac{1}{2}\cdot\frac{1}{2} \qquad (4.70)$$
$$= P(X_1 = 1)P(X_n = 1), \qquad (4.71)$$
and similarly for the other possible values of the pair (X_1, X_n). Hence X_1 and X_n are independent.

(b) Since X_i and X_j are independent and uniformly distributed on {0, 1},
$$H(X_i, X_j) = H(X_i) + H(X_j) = 1 + 1 = 2 \text{ bits}. \qquad (4.72)$$

(c) By the chain rule and the independence of X_1, X_2, ..., X_{n−1}, we have
$$H(X_1, X_2, \ldots, X_n) = H(X_1, X_2, \ldots, X_{n-1}) + H(X_n \mid X_{n-1}, \ldots, X_1) \qquad (4.73)$$
$$= \sum_{i=1}^{n-1} H(X_i) + 0 \qquad (4.74)$$
$$= n - 1, \qquad (4.75)$$
since X_n is a function of the previous X_i's. The total entropy is not n, which is what would be obtained if the X_i's were all independent. This example illustrates that pairwise independence does not imply complete independence.
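The construction above is small enough to check exhaustively. The Python sketch below is ours, not part of the original solution; for a hypothetical n = 4 it enumerates all equally likely outcomes, confirms that every pair (X_i, X_j) is uniform on {0,1}^2 (hence independent, with H(X_i, X_j) = 2 bits), and confirms that the joint entropy is n − 1 bits.

```python
from itertools import product, combinations
from collections import Counter
import math

n = 4  # any n >= 3 works; 4 keeps the enumeration tiny

# All equally likely outcomes of (X_1, ..., X_{n-1}), with X_n set to the parity
# of the first n-1 bits, exactly as in the construction above.
outcomes = [bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=n - 1)]

def entropy_bits(counts, total):
    """Entropy in bits of the distribution given by the counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# (a)/(b): every pair (X_i, X_j) takes each value in {0,1}^2 equally often,
# so the pair is uniform, the coordinates are independent, and H(X_i, X_j) = 2 bits.
for i, j in combinations(range(n), 2):
    pair_counts = Counter((x[i], x[j]) for x in outcomes)
    assert set(pair_counts.values()) == {len(outcomes) // 4}
    assert entropy_bits(pair_counts, len(outcomes)) == 2.0

# (c): the joint entropy is n - 1 bits, not n, since X_n is determined by the rest.
print(entropy_bits(Counter(outcomes), len(outcomes)))  # prints 3.0 when n = 4
```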
11. Stationary processes. Let ..., X_{−1}, X_0, X_1, ... be a stationary (not necessarily Markov) stochastic process. Which of the following statements are true? Prove or provide a counterexample.

(a) H(X_n | X_0) = H(X_{−n} | X_0).

(b) H(X_n | X_0) ≥ H(X_{n−1} | X_0).

(c) H(X_n | X_1, X_2, ..., X_{n−1}, X_{n+1}) is nonincreasing in n.

(d) H(X_n | X_1, ..., X_{n−1}, X_{n+1}, ..., X_{2n}) is nonincreasing in n.

Solution: Stationary processes.

(a) H(X_n | X_0) = H(X_{−n} | X_0). This statement is true, since
$$H(X_n \mid X_0) = H(X_n, X_0) - H(X_0) \qquad (4.76)$$
$$H(X_{-n} \mid X_0) = H(X_{-n}, X_0) - H(X_0) \qquad (4.77)$$
and H(X_n, X_0) = H(X_{−n}, X_0) by stationarity.

(b) H(X_n | X_0) ≥ H(X_{n−1} | X_0). This statement is not true in general, though it is true for first-order Markov chains. A simple counterexample is a periodic process with period n. Let X_0, X_1, X_2, ..., X_{n−1} be i.i.d. uniformly distributed binary random variables and let X_k = X_{k−n} for k ≥ n. In this case, H(X_n | X_0) = 0 and H(X_{n−1} | X_0) = 1, contradicting the statement H(X_n | X_0) ≥ H(X_{n−1} | X_0).

(c) H(X_n | X_1^{n−1}, X_{n+1}) is nonincreasing in n. This statement is true, since by stationarity
$$H(X_n \mid X_1^{n-1}, X_{n+1}) = H(X_{n+1} \mid X_2^{n}, X_{n+2}) \ge H(X_{n+1} \mid X_1^{n}, X_{n+2}),$$
where the inequality follows from the fact that conditioning reduces entropy.

(d) This is true:
$$H(X_n \mid X_1, \ldots, X_{n-1}, X_{n+1}, \ldots, X_{2n}) = H(X_{n+1} \mid X_2, \ldots, X_n, X_{n+2}, \ldots, X_{2n+1})$$
$$\ge H(X_{n+1} \mid X_1, X_2, \ldots, X_n, X_{n+2}, \ldots, X_{2(n+1)}).$$
The equality follows from stationarity. The inequality follows because conditioning reduces entropy.

12. The entropy rate of a dog looking for a bone. A dog walks on the integers, possibly reversing direction at each step with probability p = 0.1. Let X_0 = 0. The first step is equally likely to be positive or negative. A typical walk might look like this:
$$(X_0, X_1, \ldots) = (0, -1, -2, -3, -4, -3, -2, -1, 0, 1, \ldots).$$

(a) Find H(X_1, X_2, ..., X_n).

(b) Find the entropy rate of this browsing dog.

(c) What is the expected number of steps the dog takes before reversing direction?

Solution: The entropy rate of a dog looking for a bone.

Let X be Bernoulli(p). Then the optimal prefix-free code requires one bit per source symbol, but the entropy can be made arbitrarily small by sending p to zero or one.

Yes–No questions are asked sequentially until X is determined. ("Determined" doesn't mean that a "Yes" answer is received.) Questions cost one unit each. How should the player proceed? What is the expected payoff?

(c) Continuing (b), what if v(x) is fixed, but p(x) can be chosen by the computer (and then announced to the player)? The computer wishes to minimize the player's expected return. What should p(x) be? What is the expected return to the player?

Solution: The game of Hi-Lo.

(a) The first thing to recognize in this problem is that the player cannot cover more than 63 values of X with 6 questions. This can be easily seen by induction. With one question, there is only one value of X that can be covered. With two questions, there is one value of X that can be covered with the first question, and depending on the answer to the first question, there are two possible values of X that can be asked about with the next question. By extending this argument, we see that we can ask at most 63 different questions of the form "Is X = i?" with 6 questions. (The fact that we have narrowed the range at the end is irrelevant if we have not isolated the value of X.) Thus if the player seeks to maximize his return, he should choose the 63 most valuable outcomes for X and play to isolate these values. The probabilities are irrelevant to this procedure. He will choose the 63 most valuable outcomes, and his first question will be "Is X = i?" where i is the median of these 63 numbers. After isolating X to either half, his next question will be "Is X = j?", where j is the median of that half. Proceeding this way, he will win if X is one of the 63 most valuable outcomes, and lose otherwise. This strategy maximizes his expected winnings.

(b) Now if arbitrary questions are allowed, the game reduces to a game of 20 questions to determine the object. The return in this case to the player is $\sum_x p(x)(v(x) - l(x))$, where l(x) is the number of questions required to determine the object. Maximizing the return is equivalent to minimizing the expected number of questions, and thus, as argued in the text, the optimal strategy is to construct a Huffman code for the source and use that to construct a question strategy. His expected return is therefore between $\sum p(x)v(x) - H$ and $\sum p(x)v(x) - H - 1$. (A small numerical sketch of this Huffman question strategy appears below.)
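The following Python sketch is not part of the original solution; the distribution p(x) and the values v(x) are made up purely for illustration. It builds a binary Huffman code (each codeword bit corresponds to one Yes–No question) and checks that the expected number of questions lies within one bit of H, so the expected return lies between Σ p(x)v(x) − H − 1 and Σ p(x)v(x) − H.

```python
import heapq
import math

def huffman_lengths(probs):
    """Return binary Huffman codeword lengths for the given probabilities."""
    # Heap entries: (subtree probability, tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, t, s2 = heapq.heappop(heap)
        for s in s1 + s2:              # every symbol in a merged subtree gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, t, s1 + s2))
    return lengths

# Hypothetical example: probabilities p(x) and values v(x), chosen only for illustration.
p = [0.4, 0.2, 0.2, 0.1, 0.1]
v = [10, 8, 6, 4, 2]

lengths = huffman_lengths(p)
H = -sum(pi * math.log2(pi) for pi in p)
expected_questions = sum(pi * li for pi, li in zip(p, lengths))
expected_return = sum(pi * (vi - li) for pi, vi, li in zip(p, v, lengths))

print(lengths)                    # e.g. [2, 2, 2, 3, 3] for this distribution
print(H, expected_questions)      # H <= E[number of questions] < H + 1
print(sum(pi * vi for pi, vi in zip(p, v)) - H, expected_return)
```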
(c) A computer wishing to minimize the return to the player will want to minimize $\sum_x p(x)v(x) - H(X)$ over choices of p(x). We can write this as a standard minimization problem with constraints. Let
$$J(p) = \sum_i p_i v_i + \sum_i p_i \log p_i + \lambda \sum_i p_i, \qquad (5.24)$$
and differentiating and setting to 0, we obtain
$$v_i + \log p_i + 1 + \lambda = 0, \qquad (5.25)$$
or, after normalizing to ensure that the p_i's form a probability distribution,
$$p_i = \frac{2^{-v_i}}{\sum_j 2^{-v_j}}. \qquad (5.26)$$
To complete the proof, we let $r_i = 2^{-v_i}/\sum_j 2^{-v_j}$ and rewrite the return as
$$\sum_i p_i v_i + \sum_i p_i \log p_i = \sum_i p_i \log p_i - \sum_i p_i \log 2^{-v_i} \qquad (5.27)$$
$$= \sum_i p_i \log \frac{p_i}{r_i} - \log\Big(\sum_j 2^{-v_j}\Big) \qquad (5.28)$$
$$= D(p\|r) - \log\Big(\sum_j 2^{-v_j}\Big), \qquad (5.29)$$
and thus the return is minimized by choosing p_i = r_i. This is the distribution that the computer must choose to minimize the return to the player.

20. Huffman codes with costs. Words like Run!, Help!, and Fire! are short, not because they are frequently used, but perhaps because time is precious in the situations in which these words are required. Suppose that X = i with probability p_i, i = 1, 2, ..., m. Let l_i be the number of binary symbols in the codeword associated with X = i, and let c_i denote the cost per letter of the codeword when X = i. Thus the average cost C of the description of X is $C = \sum_{i=1}^{m} p_i c_i l_i$.

(b) How would you use the Huffman code procedure to minimize C over all uniquely decodable codes? Let C_Huffman denote this minimum.

(c) Can you show that
$$C^* \le C_{\text{Huffman}} \le C^* + \sum_{i=1}^{m} p_i c_i \; ?$$

Solution: Huffman codes with costs.

(a) We wish to minimize $C = \sum_i p_i c_i n_i$ subject to $\sum_i 2^{-n_i} \le 1$. We will assume equality in the constraint, let $r_i = 2^{-n_i}$, and let $Q = \sum_i p_i c_i$. Let $q_i = (p_i c_i)/Q$. Then q also forms a probability distribution and we can write C as
$$C = \sum_i p_i c_i n_i \qquad (5.30)$$
$$= Q \sum_i q_i \log\frac{1}{r_i} \qquad (5.31)$$
$$= Q\Big(\sum_i q_i \log\frac{q_i}{r_i} - \sum_i q_i \log q_i\Big) \qquad (5.32)$$
$$= Q\big(D(q\|r) + H(q)\big). \qquad (5.33)$$
Since the only freedom is in the choice of r_i, we can minimize C by choosing r = q, or
$$n_i^* = -\log\frac{p_i c_i}{\sum_j p_j c_j}, \qquad (5.34)$$
where we have ignored any integer constraints on n_i. The minimum cost C* for this assignment of codewords is
$$C^* = Q\,H(q). \qquad (5.35)$$

(b) If we use q instead of p for the Huffman procedure, we obtain a code minimizing expected cost.

(c) Now we can account for the integer constraints. Let
$$n_i = \lceil -\log q_i \rceil. \qquad (5.36)$$
Then
$$-\log q_i \le n_i < -\log q_i + 1. \qquad (5.37)$$
Multiplying by p_i c_i and summing over i, we get the relationship
$$C^* \le C_{\text{Huffman}} < C^* + Q. \qquad (5.38)$$

Solution: Shannon code.

(a) Since $l_i = \lceil \log\frac{1}{p_i} \rceil$, we have
$$\log\frac{1}{p_i} \le l_i < \log\frac{1}{p_i} + 1, \qquad (5.45)$$
which implies that
$$H(X) \le L = \sum_i p_i l_i < H(X) + 1. \qquad (5.46)$$
The difficult part is to prove that the code is a prefix code. By the choice of l_i, we have
$$2^{-l_i} \le p_i < 2^{-(l_i - 1)}. \qquad (5.47)$$
Thus F_j, j > i, differs from F_i by at least 2^{−l_i}, and will therefore differ from F_i in at least one place in the first l_i bits of the binary expansion of F_i. Thus the codeword for F_j, j > i, which has length l_j ≥ l_i, differs from the codeword for F_i at least once in the first l_i places. Thus no codeword is a prefix of any other codeword.

(b) We build the following table:

Symbol   Probability   F_i (decimal)   F_i (binary)   l_i   Codeword
  1         0.5            0.0             0.0          1      0
  2         0.25           0.5             0.10         2      10
  3         0.125          0.75            0.110        3      110
  4         0.125          0.875           0.111        3      111

The Shannon code in this case achieves the entropy bound (1.75 bits) and is optimal.
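The table in part (b) can be reproduced mechanically. The Python sketch below is ours, not part of the original solution: it sorts the probabilities in decreasing order, forms the cumulative sums F_i, and keeps the first l_i = ⌈log(1/p_i)⌉ bits of the binary expansion of each F_i.

```python
import math

def shannon_code(probs):
    """Shannon code: sort by decreasing probability, then assign to each symbol
    the first ceil(log2(1/p_i)) bits of the binary expansion of
    F_i = sum of the probabilities of all preceding symbols."""
    probs = sorted(probs, reverse=True)
    codewords = []
    F = 0.0
    for p in probs:
        length = math.ceil(-math.log2(p))
        bits, f = "", F
        for _ in range(length):          # binary expansion of F, truncated to l_i bits
            f *= 2
            bits += "1" if f >= 1 else "0"
            f -= int(f)
        codewords.append(bits)
        F += p
    return list(zip(probs, codewords))

# Distribution from the table above.
for p, cw in shannon_code([0.5, 0.25, 0.125, 0.125]):
    print(p, cw)   # 0.5 -> 0, 0.25 -> 10, 0.125 -> 110, 0.125 -> 111
```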
29. Optimal codes for dyadic distributions. For a Huffman code tree, define the probability of a node as the sum of the probabilities of all the leaves under that node. Let the random variable X be drawn from a dyadic distribution, i.e., p(x) = 2^{−i}, for some i, for all x ∈ X. Now consider a binary Huffman code for this distribution.

(a) Argue that for any node in the tree, the probability of the left child is equal to the probability of the right child.

(b) Let X_1, X_2, ..., X_n be drawn i.i.d. ~ p(x). Using the Huffman code for p(x), we map X_1, X_2, ..., X_n to a sequence of bits Y_1, Y_2, ..., Y_{k(X_1, X_2, ..., X_n)}. (The length of this sequence will depend on the outcome X_1, X_2, ..., X_n.) Use part (a) to argue that the sequence Y_1, Y_2, ... forms a sequence of fair coin flips, i.e., that Pr{Y_i = 0} = Pr{Y_i = 1} = 1/2, independent of Y_1, Y_2, ..., Y_{i−1}. Thus the entropy rate of the coded sequence is 1 bit per symbol.

(c) Give a heuristic argument why the encoded sequence of bits for any code that achieves the entropy bound cannot be compressible and therefore should have an entropy rate of 1 bit per symbol.

Solution: Optimal codes for dyadic distributions.

sample the mixture. You proceed, mixing and tasting, stopping when the bad bottle has been determined.

(c) What is the minimum expected number of tastings required to determine the bad wine?

(d) What mixture should be tasted first?

Solution: Bad Wine.

(a) If we taste one bottle at a time, to minimize the expected number of tastings the order of tasting should be from the wine most likely to be bad to the least likely. The expected number of tastings required is
$$\sum_{i=1}^{6} p_i l_i = 1\cdot\frac{8}{23} + 2\cdot\frac{6}{23} + 3\cdot\frac{4}{23} + 4\cdot\frac{2}{23} + 5\cdot\frac{2}{23} + 5\cdot\frac{1}{23} = \frac{55}{23} \approx 2.39.$$

(b) The first bottle to be tasted should be the one with probability 8/23.

(c) The idea is to use Huffman coding. With Huffman coding, we get codeword lengths (2, 2, 2, 3, 4, 4). The expected number of tastings required is
$$\sum_{i=1}^{6} p_i l_i = 2\cdot\frac{8}{23} + 2\cdot\frac{6}{23} + 2\cdot\frac{4}{23} + 3\cdot\frac{2}{23} + 4\cdot\frac{2}{23} + 4\cdot\frac{1}{23} = \frac{54}{23} \approx 2.35.$$

(d) The mixture of the first and second bottles should be tasted first.

33. Huffman vs. Shannon. A random variable X takes on three values with probabilities 0.6, 0.3, and 0.1.

(a) What are the lengths of the binary Huffman codewords for X? What are the lengths of the binary Shannon codewords (l(x) = ⌈log(1/p(x))⌉) for X?

(b) What is the smallest integer D such that the expected Shannon codeword length with a D-ary alphabet equals the expected Huffman codeword length with a D-ary alphabet?

Solution: Huffman vs. Shannon.

(a) It is obvious that a Huffman code for the distribution (0.6, 0.3, 0.1) is (1, 01, 00), with codeword lengths (1, 2, 2). The Shannon code would use lengths ⌈log(1/p)⌉, which gives lengths (1, 2, 4) for the three symbols.

(b) For any D > 2, the Huffman code for the three symbols consists of three codewords of one character each. The Shannon code lengths ⌈log_D(1/p)⌉ would all equal 1 if ⌈log_D(1/0.1)⌉ = 1, i.e., if D ≥ 10. Hence the smallest such D is 10, and for D ≥ 10 the Shannon code is also optimal.

34. Huffman algorithm for tree construction. Consider the following problem: m binary signals S_1, S_2, ..., S_m are available at times T_1 ≤ T_2 ≤ ... ≤ T_m, and we would like to find their sum S_1 ⊕ S_2 ⊕ ··· ⊕ S_m using 2-input gates, each gate with 1 time unit delay, so that the final result is available as quickly as possible. A simple greedy algorithm is to combine the earliest two results, forming the partial result at time max(T_1, T_2) + 1. We now have a new problem with S_1 ⊕ S_2, S_3, ..., S_m, available at times max(T_1, T_2) + 1, T_3, ..., T_m.
We can now sort this list of T's and apply the same merging step again, repeating this until we have the final result.

(a) Argue that the above procedure is optimal, in that it constructs a circuit for which the final result is available as quickly as possible. ...
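The greedy procedure described above (repeatedly combine the two earliest-available signals) maps directly onto a priority queue. The Python sketch below is ours, not part of the original solution, and the example arrival times are hypothetical; note how the structure mirrors the Huffman algorithm, with max(T_1, T_2) + 1 playing the role that the sum of probabilities plays in Huffman coding.

```python
import heapq

def completion_time(arrival_times):
    """Greedy combining from problem 34: repeatedly merge the two earliest
    available results; each 2-input gate adds one time unit of delay."""
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        t1 = heapq.heappop(heap)
        t2 = heapq.heappop(heap)
        heapq.heappush(heap, max(t1, t2) + 1)   # partial sum is ready at max(t1, t2) + 1
    return heap[0]

# Hypothetical arrival times T_1 <= ... <= T_m.
print(completion_time([0, 0, 0, 0]))   # 2: a balanced tree over four simultaneous inputs
print(completion_time([0, 1, 2, 3]))   # 4: later arrivals are folded in as they appear
```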