ECE 5620, Spring 2011
Homework #5: Solutions

1. (Textbook problem 6.) Monotonicity of entropy per element. For a stationary stochastic process $X_1, X_2, \ldots, X_n$, show that
(a)
\[ \frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \tag{4.51} \]
(b)
\[ \frac{H(X_1, X_2, \ldots, X_n)}{n} \ge H(X_n \mid X_{n-1}, \ldots, X_1). \tag{4.52} \]

Solution: Monotonicity of entropy per element.
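(a) By the chain rule for entropy,
\begin{align}
\frac{H(X_1, X_2, \ldots, X_n)}{n} &= \frac{\sum_{i=1}^{n} H(X_i \mid X^{i-1})}{n} \tag{4.53}\\
&= \frac{H(X_n \mid X^{n-1}) + \sum_{i=1}^{n-1} H(X_i \mid X^{i-1})}{n} \tag{4.54}\\
&= \frac{H(X_n \mid X^{n-1}) + H(X_1, X_2, \ldots, X_{n-1})}{n}. \tag{4.55}
\end{align}
From stationarity it follows that for all $1 \le i \le n$,
\[ H(X_n \mid X^{n-1}) \le H(X_i \mid X^{i-1}), \]
because dropping the oldest $n - i$ conditioning variables cannot decrease the entropy, and by stationarity $H(X_n \mid X_{n-i+1}, \ldots, X_{n-1}) = H(X_i \mid X^{i-1})$. Averaging both sides over $i = 1, \ldots, n-1$ further implies that
\begin{align}
H(X_n \mid X^{n-1}) &\le \frac{\sum_{i=1}^{n-1} H(X_i \mid X^{i-1})}{n-1} \tag{4.56}\\
&= \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \tag{4.57}
\end{align}
Combining (4.55) and (4.57) yields
\begin{align}
\frac{H(X_1, X_2, \ldots, X_n)}{n} &\le \frac{1}{n}\left[\frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1} + H(X_1, X_2, \ldots, X_{n-1})\right]\\
&= \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \tag{4.58}
\end{align}
(b) By stationarity we have, for all $1 \le i \le n$, $H(X_n \mid X^{n-1}) \le H(X_i \mid X^{i-1})$, which implies that
\begin{align}
H(X_n \mid X^{n-1}) &= \frac{\sum_{i=1}^{n} H(X_n \mid X^{n-1})}{n} \tag{4.59}\\
&\le \frac{\sum_{i=1}^{n} H(X_i \mid X^{i-1})}{n} \tag{4.60}\\
&= \frac{H(X_1, X_2, \ldots, X_n)}{n}. \tag{4.61}
\end{align}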
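As a numerical sanity check of both inequalities, the sketch below evaluates $H(X_1, \ldots, X_n)/n$ for a stationary two-state Markov chain, for which $H(X_1, \ldots, X_n) = H(X_1) + (n-1) H(X_2 \mid X_1)$. The choice of a Markov chain, the values `p01 = 0.2` and `p10 = 0.7`, and all function names are illustrative assumptions, not part of the problem.

```python
import math

def h2(p):
    """Binary entropy of p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def conditional_entropy(p01, p10):
    """H(X_2 | X_1) for the stationary two-state chain."""
    mu0 = p10 / (p01 + p10)          # stationary probability of state 0
    return mu0 * h2(p01) + (1 - mu0) * h2(p10)

def joint_entropy(n, p01, p10):
    """H(X_1, ..., X_n) = H(X_1) + (n - 1) H(X_2 | X_1) for a stationary start."""
    mu0 = p10 / (p01 + p10)
    return h2(mu0) + (n - 1) * conditional_entropy(p01, p10)

p01, p10 = 0.2, 0.7                  # assumed example values
cond = conditional_entropy(p01, p10)  # equals H(X_n | X^{n-1}) for n >= 2
per_symbol = [joint_entropy(n, p01, p10) / n for n in range(1, 11)]

for n, value in enumerate(per_symbol, start=1):
    print(f"n={n:2d}  H(X_1..X_n)/n = {value:.4f}")
print(f"H(X_2 | X_1) = {cond:.4f}")
```

The printed per-symbol entropies should decrease with $n$ (part (a)) while staying above the conditional entropy (part (b)).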
2. (Textbook problem 7.) Entropy rates of Markov chains.
(a) Find the entropy rate of the two-state Markov chain with transition matrix
\[ P = \begin{bmatrix} 1 - p_{01} & p_{01} \\ p_{10} & 1 - p_{10} \end{bmatrix}. \]
(b) What values of $p_{01}, p_{10}$ maximize the rate of part (a)?
(c) Find the entropy rate of the two-state Markov chain with transition matrix
\[ P = \begin{bmatrix} 1 - p & p \\ 1 & 0 \end{bmatrix}. \]

Solution: Entropy rates of Markov chains.
(a) The stationary distribution is easily calculated. (See EIT pp. 62–63.)
\[ \mu_0 = \frac{p_{10}}{p_{01} + p_{10}}, \qquad \mu_1 = \frac{p_{01}}{p_{01} + p_{10}}. \]
Therefore the entropy rate is
\[ H(X_2 \mid X_1) = \mu_0 H(p_{01}) + \mu_1 H(p_{10}) = \frac{p_{10} H(p_{01}) + p_{01} H(p_{10})}{p_{01} + p_{10}}. \]
(b) The entropy rate is at most 1 bit because the process has only two states. This rate can be achieved if (and only if) $p_{01} = p_{10} = 1/2$, in which case the process is actually i.i.d. with $\Pr(X_i = 0) = \Pr(X_i = 1) = 1/2$.
(c) As a special case of the general two-state Markov chain (with $p_{01} = p$ and $p_{10} = 1$), the entropy rate is
\[ H(X_2 \mid X_1) = \mu_0 H(p) + \mu_1 H(1) = \frac{H(p)}{p + 1}. \]
(d) By straightforward calculus, we find that the maximum value of the entropy rate of part (c) occurs for $p = (3 - \sqrt{5})/2 = 0.382$. Since $1 - p = (\sqrt{5} - 1)/2$ and $H(p) = H(1 - p)$, the maximum value is
\[ \frac{H(p)}{1 + p} = \frac{H\!\left(\frac{\sqrt{5} - 1}{2}\right)}{1 + p} = 0.694 \text{ bits}. \]
Note that $(\sqrt{5} - 1)/2 = 0.618$ is (the reciprocal of) the Golden Ratio.
(e) The Markov chain of part (c) forbids consecutive ones. Consider any allowable sequence of symbols of length $t$. If the first symbol is 1, then the next symbol must be 0; the remaining $t - 2$ symbols can form any of the $N(t-2)$ allowable sequences. If the first symbol is 0, then the remaining $t - 1$ symbols can be any of the $N(t-1)$ allowable sequences. So the number of allowable sequences of length $t$ satisfies the recurrence
\[ N(t) = N(t-1) + N(t-2), \qquad N(1) = 2, \quad N(2) = 3. \]
(The initial conditions are obtained by observing that for $t = 2$ only the sequence 11 is not allowed. We could also choose $N(0) = 1$ as an initial condition, since there is exactly one allowable sequence of length 0, namely, the empty sequence.)
The sequence $N(t)$ grows exponentially, that is, $N(t) \approx c\lambda^t$, where $\lambda$ is the maximum magnitude solution of the characteristic equation
\[ 1 = z^{-1} + z^{-2}. \]
Solving the characteristic equation yields $\lambda = (1 + \sqrt{5})/2$, the Golden Ratio. (The sequence $\{N(t)\}$ is the sequence of Fibonacci numbers.) Therefore
\[ H_0 = \lim_{t \to \infty} \frac{1}{t} \log N(t) = \log \frac{1 + \sqrt{5}}{2} = 0.694 \text{ bits}. \]
Since there are only $N(t)$ possible outcomes for $X_1, \ldots, X_t$, an upper bound on $H(X_1, \ldots, X_t)$ is $\log N(t)$, and so the entropy rate of the Markov chain of part (c) is at most $H_0$. In fact, we saw in part (d) that this upper bound can be achieved.
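The numbers in parts (d) and (e) can be reproduced with a short script. This is a minimal sketch, not part of the original solution: the grid search, its resolution, and all function names are assumptions chosen for illustration.

```python
import math
from itertools import product

def h2(p):
    """Binary entropy of p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_rate(p01, p10):
    """Entropy rate mu0*H(p01) + mu1*H(p10) of the two-state chain."""
    mu0 = p10 / (p01 + p10)
    mu1 = p01 / (p01 + p10)
    return mu0 * h2(p01) + mu1 * h2(p10)

# Part (d): maximize H(p)/(1+p) over a fine grid (p01 = p, p10 = 1).
grid = [k / 100000 for k in range(1, 100000)]
p_star = max(grid, key=lambda p: entropy_rate(p, 1.0))
max_rate = entropy_rate(p_star, 1.0)

# Part (e): count binary strings of length t with no two consecutive ones.
def count_brute_force(t):
    return sum(1 for s in product("01", repeat=t) if "11" not in "".join(s))

def count_recurrence(t):
    a, b = 1, 2                      # N(0), N(1)
    for _ in range(t):
        a, b = b, a + b              # N(t) = N(t-1) + N(t-2)
    return a

golden = (1 + math.sqrt(5)) / 2
print(p_star, max_rate, math.log2(golden))
print([count_recurrence(t) for t in range(1, 8)])
print(math.log2(count_recurrence(500)) / 500)
```

The maximizing `p_star` should land near 0.382, and both `max_rate` and the growth rate of $N(t)$ should approach $\log_2$ of the Golden Ratio.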
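3. Pairwise independence. Let $X_1, X_2, \ldots, X_{n-1}$ be i.i.d. random variables taking values in $\{0, 1\}$, with $\Pr\{X_i = 1\} = \frac{1}{2}$. Let $X_n = 1$ if $\sum_{i=1}^{n-1} X_i$ is odd and $X_n = 0$ otherwise. Let $n \ge 3$.
(a) Show that $X_i$ and $X_j$ are independent, for $i \ne j$, $i, j \in \{1, 2, \ldots, n\}$.
(b) Find $H(X_i, X_j)$, for $i \ne j$.
(c) Find $H(X_1, X_2, \ldots, X_n)$. Is this equal to $nH(X_1)$?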
Solution: (Pairwise Independence) $X_1, X_2, \ldots, X_{n-1}$ are i.i.d. Bernoulli(1/2) random variables. We will first prove that for any $k \le n - 1$, the probability that $\sum_{i=1}^{k} X_i$ is odd is $1/2$. We will prove this by induction. Clearly this is true for $k = 1$. Assume that it is true for $k - 1$. Let $S_k = \sum_{i=1}^{k} X_i$. Then
\begin{align}
P(S_k \text{ odd}) &= P(S_{k-1} \text{ odd}) P(X_k = 0) + P(S_{k-1} \text{ even}) P(X_k = 1) \tag{4.64}\\
&= \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} \tag{4.65}\\
&= \frac{1}{2}. \tag{4.66}
\end{align}
Hence for all $k \le n - 1$, the probability that $S_k$ is odd is equal to the probability that it is even. Hence,
\[ P(X_n = 1) = P(X_n = 0) = \frac{1}{2}. \tag{4.67} \]
(a) It is clear that when $i$ and $j$ are both less than $n$, $X_i$ and $X_j$ are independent. The only possible problem is when $j = n$. Taking $i = 1$ without loss of generality,
\begin{align}
P(X_1 = 1, X_n = 1) &= P\Big(X_1 = 1, \sum_{i=2}^{n-1} X_i \text{ even}\Big) \tag{4.68}\\
&= P(X_1 = 1)\, P\Big(\sum_{i=2}^{n-1} X_i \text{ even}\Big) \tag{4.69}\\
&= \frac{1}{2} \cdot \frac{1}{2} \tag{4.70}\\
&= P(X_1 = 1) P(X_n = 1) \tag{4.71}
\end{align}
and similarly for other possible values of the pair $(X_1, X_n)$. Hence $X_1$ and $X_n$ are independent.
(b) Since $X_i$ and $X_j$ are independent and uniformly distributed on $\{0, 1\}$,
\[ H(X_i, X_j) = H(X_i) + H(X_j) = 1 + 1 = 2 \text{ bits}. \tag{4.72} \]
(c) By the chain rule and the independence of $X_1, X_2, \ldots, X_{n-1}$, we have
\begin{align}
H(X_1, X_2, \ldots, X_n) &= H(X_1, X_2, \ldots, X_{n-1}) + H(X_n \mid X_{n-1}, \ldots, X_1) \tag{4.73}\\
&= \sum_{i=1}^{n-1} H(X_i) + 0 \tag{4.74}\\
&= n - 1, \tag{4.75}
\end{align}
since $X_n$ is a function of the previous $X_i$'s. The total entropy is not $n$, which is what would be obtained if the $X_i$'s were all independent. This example illustrates that pairwise independence does not imply complete independence.
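The claims in parts (b) and (c) can be checked by brute-force enumeration for a small $n$. The sketch below is illustrative only: the value `n = 5` and the helper names are assumptions, and the joint distribution is built directly from the parity construction in the problem.

```python
import math
from collections import Counter
from itertools import product

def joint_pmf(n):
    """Joint pmf of (X_1, ..., X_n): n-1 fair bits followed by their parity."""
    pmf = Counter()
    for bits in product((0, 1), repeat=n - 1):
        outcome = bits + (sum(bits) % 2,)
        pmf[outcome] += 1 / 2 ** (n - 1)
    return pmf

def marginal(pmf, indices):
    """Marginal pmf of the coordinates listed in indices (0-based)."""
    out = Counter()
    for outcome, prob in pmf.items():
        out[tuple(outcome[i] for i in indices)] += prob
    return out

def entropy(pmf):
    """Entropy in bits of a pmf given as a mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

n = 5                                    # assumed small example size
pmf = joint_pmf(n)
pair_entropies = [entropy(marginal(pmf, (i, j)))
                  for i in range(n) for j in range(i + 1, n)]

print(pair_entropies)                    # every pair: 2 bits
print(entropy(pmf), n * entropy(marginal(pmf, (0,))))
```

Every pair has joint entropy 2 bits, yet the full joint entropy is $n - 1$ rather than $n$.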
4. (Textbook problem 11.) Stationary processes. Let $\ldots, X_{-1}, X_0, X_1, \ldots$ be a stationary (not necessarily Markov) stochastic process. Which of the following statements are true? Prove or provide a counterexample.
(a) $H(X_n \mid X_0) = H(X_{-n} \mid X_0)$.
(b) $H(X_n \mid X_0) \ge H(X_{n-1} \mid X_0)$.
(c) $H(X_n \mid X_1, X_2, \ldots, X_{n-1}, X_{n+1})$ is nonincreasing in $n$.
(d) $H(X_n \mid X_1, \ldots, X_{n-1}, X_{n+1}, \ldots, X_{2n})$ is nonincreasing in $n$.
Solution: Stationary processes.
(a) $H(X_n \mid X_0) = H(X_{-n} \mid X_0)$.
This statement is true, since
\begin{align}
H(X_n \mid X_0) &= H(X_n, X_0) - H(X_0) \tag{4.76}\\
H(X_{-n} \mid X_0) &= H(X_{-n}, X_0) - H(X_0) \tag{4.77}
\end{align}
and $H(X_n, X_0) = H(X_{-n}, X_0)$ by stationarity.
(b) $H(X_n \mid X_0) \ge H(X_{n-1} \mid X_0)$.
This statement is not true in general, though it is true for first-order Markov chains. A simple counterexample is a periodic process with period $n$. Let $X_0, X_1, X_2, \ldots, X_{n-1}$ be i.i.d. uniformly distributed binary random variables and let $X_k = X_{k-n}$ for $k \ge n$. In this case, $H(X_n \mid X_0) = 0$ and $H(X_{n-1} \mid X_0) = 1$, contradicting the statement $H(X_n \mid X_0) \ge H(X_{n-1} \mid X_0)$.
(c) $H(X_n \mid X_1^{n-1}, X_{n+1})$ is nonincreasing in $n$.
This statement is true, since by stationarity
\[ H(X_n \mid X_1^{n-1}, X_{n+1}) = H(X_{n+1} \mid X_2^{n}, X_{n+2}) \ge H(X_{n+1} \mid X_1^{n}, X_{n+2}), \]
where the inequality follows from the fact that conditioning reduces entropy.
(d) This is true:
\begin{align}
H(X_n \mid X_1, \ldots, X_{n-1}, X_{n+1}, \ldots, X_{2n}) &= H(X_{n+1} \mid X_2, \ldots, X_n, X_{n+2}, \ldots, X_{2n+1})\\
&\ge H(X_{n+1} \mid X_1, X_2, \ldots, X_n, X_{n+2}, \ldots, X_{2(n+1)}).
\end{align}
The equality follows from stationarity. The inequality follows because conditioning reduces entropy.

(Textbook problem 12.) The entropy rate of a dog looking for a bone. A dog walks on the integers, possibly reversing direction at each step with probability $p = .1$. Let $X_0 = 0$. The first step is equally likely to be positive or negative. A typical walk might look like this:
\[ (X_0, X_1, \ldots) = (0, -1, -2, -3, -4, -3, -2, -1, 0, 1, \ldots). \]
(a) Find $H(X_1, X_2, \ldots, X_n)$.
(b) Find the entropy rate of this browsing dog.
(c) What is the expected number of steps the dog takes before reversing direction?
Solution: The entropy rate of a dog looking for a bone.

5. Let $X$ be a Bernoulli random variable. Then the optimal prefix-free code requires one bit per source symbol, but the entropy can be made arbitrarily small by sending the Bernoulli parameter to zero or one.

6. ... Yes-No questions are asked sequentially until $X$ is determined. ("Determined" doesn't mean that a "Yes" answer is received.) Questions cost one unit each. How should the player proceed? What is the expected payoff?
(c) Continuing (b), what if $v(x)$ is fixed, but $p(x)$ can be chosen by the computer (and then announced to the player)? The computer wishes to minimize the player's expected return. What should $p(x)$ be? What is the expected return to the player?
Solution: The game of Hi-Lo.
(a) The first thing to recognize in this problem is that the player cannot cover more than 63 values of $X$ with 6 questions. This can be easily seen by induction. With one question, there is only one value of $X$ that can be covered. With two questions, there is one value of $X$ that can be covered with the first question, and depending on the answer to the first question, there are two possible values of $X$ that can be asked in the next question. By extending this argument, we see that we can ask at most $1 + 2 + 4 + 8 + 16 + 32 = 63$ different questions of the form "Is $X = i$?" with 6 questions. (The fact that we have narrowed the range at the end is irrelevant, if we have not isolated the value of $X$.)
Thus if the player seeks to maximize his return, he should choose the 63 most valuable outcomes for $X$, and play to isolate these values. The probabilities are irrelevant to this procedure. He will choose the 63 most valuable outcomes, and his first question will be "Is $X = i$?" where $i$ is the median of these 63 numbers. After isolating to either half, his next question will be "Is $X = j$?", where $j$ is the median of that half. Proceeding this way, he will win if $X$ is one of the 63 most valuable outcomes, and lose otherwise. This strategy maximizes his expected winnings.
(b) Now if arbitrary questions are allowed, the game reduces to a game of 20 questions to determine the object. The return in this case to the player is $\sum_x p(x)(v(x) - l(x))$, where $l(x)$ is the number of questions required to determine the object. Maximizing the return is equivalent to minimizing the expected number of questions, and thus, as argued in the text, the optimal strategy is to construct a Huffman code for the source and use that to construct a question strategy. His expected return is therefore between $\sum p(x)v(x) - H$ and $\sum p(x)v(x) - H - 1$.
(c) A computer wishing to minimize the return to the player will want to minimize $\sum p(x)v(x) - H(X)$ over choices of $p(x)$. We can write this as a standard minimization problem with constraints. Let
\[ J(p) = \sum p_i v_i + \sum p_i \log p_i + \lambda \sum p_i \tag{5.24} \]
and differentiating and setting to 0, we obtain
\[ v_i + \log p_i + 1 + \lambda = 0 \tag{5.25} \]
or after normalizing to ensure that the $p_i$'s form a probability distribution,
\[ p_i = \frac{2^{-v_i}}{\sum_j 2^{-v_j}}. \tag{5.26} \]
To complete the proof, we let $r_i = \frac{2^{-v_i}}{\sum_j 2^{-v_j}}$, and rewrite the return as
\begin{align}
\sum p_i v_i + \sum p_i \log p_i &= \sum p_i \log p_i - \sum p_i \log 2^{-v_i} \tag{5.27}\\
&= \sum p_i \log p_i - \sum p_i \log r_i - \log\Big(\sum_j 2^{-v_j}\Big) \tag{5.28}\\
&= D(p \| r) - \log\Big(\sum_j 2^{-v_j}\Big), \tag{5.29}
\end{align}
and thus the return is minimized by choosing $p_i = r_i$. This is the distribution that the computer must choose to minimize the return to the player.
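The conclusion of part (c) can be spot-checked numerically. In the sketch below the payoff vector `v` is an arbitrary assumed example and the function names are hypothetical; the script compares the value of $\sum p_i v_i - H(p)$ at $p = r$ with its value at randomly drawn distributions.

```python
import math
import random

def objective(p, v):
    """sum_i p_i v_i - H(p) in bits, the quantity the computer minimizes."""
    value = sum(pi * vi for pi, vi in zip(p, v))
    neg_entropy = sum(pi * math.log2(pi) for pi in p if pi > 0)
    return value + neg_entropy

def minimizing_distribution(v):
    """Return r_i = 2^{-v_i} / sum_j 2^{-v_j} and the minimum -log2(sum_j 2^{-v_j})."""
    weights = [2.0 ** (-vi) for vi in v]
    total = sum(weights)
    return [w / total for w in weights], -math.log2(total)

v = [1.0, 2.0, 3.0, 5.0]                 # assumed example payoffs
r, minimum = minimizing_distribution(v)

random.seed(0)
trials = []
for _ in range(1000):
    raw = [random.random() for _ in v]
    p = [x / sum(raw) for x in raw]      # a random distribution on the outcomes
    trials.append(objective(p, v))

print(r)
print(objective(r, v), minimum, min(trials))
```

No random trial should fall below `minimum`, consistent with $D(p \| r) \ge 0$.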
7. (Textbook problem 20.) Huffman codes with costs. Words like Run! Help! and Fire! are short, not because they are frequently used, but perhaps because time is precious in the situations in which these words are required. Suppose that $X = i$ with probability $p_i$, $i = 1, 2, \ldots, m$. Let $l_i$ be the number of binary symbols in the codeword associated with $X = i$, and let $c_i$ denote the cost per letter of the codeword when $X = i$. Thus the average cost $C$ of the description of $X$ is $C = \sum_{i=1}^{m} p_i c_i l_i$.
(b) How would you use the Huffman code procedure to minimize $C$ over all uniquely decodable codes? Let $C_{\text{Huffman}}$ denote this minimum.
(c) Can you show that
\[ C^* \le C_{\text{Huffman}} \le C^* + \sum_{i=1}^{m} p_i c_i \, ? \]

Solution: Huffman codes with costs.
(a) We wish to minimize $C = \sum p_i c_i n_i$ subject to $\sum 2^{-n_i} \le 1$. We will assume equality in the constraint and let $r_i = 2^{-n_i}$ and let $Q = \sum_i p_i c_i$. Let $q_i = (p_i c_i)/Q$. Then $q$ also forms a probability distribution and we can write $C$ as
\begin{align}
C &= \sum p_i c_i n_i \tag{5.30}\\
&= Q \sum q_i \log \frac{1}{r_i} \tag{5.31}\\
&= Q \left( \sum q_i \log \frac{q_i}{r_i} - \sum q_i \log q_i \right) \tag{5.32}\\
&= Q \left( D(q \| r) + H(q) \right). \tag{5.33}
\end{align}
Since the only freedom is in the choice of $r_i$, we can minimize $C$ by choosing $r = q$ or
\[ n_i^* = -\log \frac{p_i c_i}{\sum_j p_j c_j}, \tag{5.34} \]
where we have ignored any integer constraints on $n_i$. The minimum cost $C^*$ for this assignment of codewords is
\[ C^* = Q H(q). \tag{5.35} \]
(b) If we use $q$ instead of $p$ for the Huffman procedure, we obtain a code minimizing expected cost.
(c) Now we can account for the integer constraints. Let
\[ n_i = \lceil -\log q_i \rceil. \tag{5.36} \]
Then
\[ -\log q_i \le n_i < -\log q_i + 1. \tag{5.37} \]
Multiplying by $p_i c_i$ and summing over $i$, and noting that the Huffman code for $q$ costs no more than the code with lengths $n_i$, we get the relationship
\[ C^* \le C_{\text{Huffman}} < C^* + Q. \tag{5.38} \]
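The bound (5.38) can be illustrated on a small example. This is a sketch under stated assumptions: the probabilities `p` and per-letter costs `c` are made-up values, and `huffman_lengths` is a hypothetical helper that runs the standard binary Huffman merge on the reweighted distribution $q_i = p_i c_i / Q$.

```python
import heapq
import math
from itertools import count

def huffman_lengths(weights):
    """Codeword lengths of a binary Huffman code for the given positive weights."""
    tie = count()                        # tie-breaker so symbol lists are never compared
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                # every merged symbol moves one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), s1 + s2))
    return lengths

p = [0.5, 0.25, 0.15, 0.10]              # assumed example probabilities
c = [1.0, 4.0, 2.0, 8.0]                 # assumed example per-letter costs

Q = sum(pi * ci for pi, ci in zip(p, c))
q = [pi * ci / Q for pi, ci in zip(p, c)]
C_star = Q * -sum(qi * math.log2(qi) for qi in q)   # Q * H(q), ignoring integrality

lengths = huffman_lengths(q)             # Huffman procedure applied to q, not p
C_huffman = sum(pi * ci * li for pi, ci, li in zip(p, c, lengths))

print(lengths, C_star, C_huffman, C_star + Q)
```

In this example the symbol with the largest $p_i c_i$, not the most probable symbol, receives the shortest codeword.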
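8.
(a) Since $l_i = \left\lceil \log \frac{1}{p_i} \right\rceil$, we have
\[ \log \frac{1}{p_i} \le l_i < \log \frac{1}{p_i} + 1, \tag{5.45} \]
which implies that
\[ H(X) \le L = \sum p_i l_i < H(X) + 1. \tag{5.46} \]
The difficult part is to prove that the code is a prefix code. (As in the table of part (b), $F_i = \sum_{k < i} p_k$ is the cumulative probability of the symbols before $i$, with the probabilities in decreasing order, and the codeword for symbol $i$ is the first $l_i$ bits of the binary expansion of $F_i$.) By the choice of $l_i$, we have
\[ 2^{-l_i} \le p_i < 2^{-(l_i - 1)}. \tag{5.47} \]
Thus $F_j$, $j > i$, differs from $F_i$ by at least $2^{-l_i}$, and will therefore differ from $F_i$ in at least one place in the first $l_i$ bits of the binary expansion of $F_i$. Thus the codeword for $F_j$, $j > i$, which has length $l_j \ge l_i$, differs from the codeword for $F_i$ at least once in the first $l_i$ places. Thus no codeword is a prefix of any other codeword.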
(b) We build the following table:

Symbol   Probability   F_i in decimal   F_i in binary   l_i   Codeword
1        0.5           0.0              0.0             1     0
2        0.25          0.5              0.10            2     10
3        0.125         0.75             0.110           3     110
4        0.125         0.875            0.111           3     111

The Shannon code in this case achieves the entropy bound (1.75 bits) and is optimal.
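The table can be regenerated mechanically. The following sketch assumes the construction used in the table (codeword $i$ is the first $l_i$ bits of the binary expansion of the cumulative probability $F_i$, with probabilities sorted in decreasing order); the function name `shannon_code` is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def shannon_code(probs):
    """Codeword i = first l_i bits of F_i = sum_{k<i} p_k, with l_i = ceil(log2(1/p_i)).
    Assumes probs are Fractions sorted in decreasing order."""
    code, F = [], Fraction(0)
    for p in probs:
        l = 0
        while Fraction(1, 2 ** l) > p:   # smallest l with 2^-l <= p
            l += 1
        bits, frac = "", F
        for _ in range(l):               # binary expansion of F, one bit at a time
            frac *= 2
            bit = int(frac >= 1)
            bits += str(bit)
            frac -= bit
        code.append(bits)
        F += p
    return code

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
code = shannon_code(probs)
expected_length = sum(p * len(w) for p, w in zip(probs, code))

print(code)
print(float(expected_length))
```

For this dyadic source the expected length equals the entropy, 1.75 bits.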
(Textbook problem 29.) Optimal codes for dyadic distributions. For a Huffman code tree, define the probability of a node as the sum of the probabilities of all the leaves under that node. Let the random variable $X$ be drawn from a dyadic distribution, i.e., $p(x) = 2^{-i}$, for some $i$, for all $x \in \mathcal{X}$. Now consider a binary Huffman code for this distribution.
(a) Argue that for any node in the tree, the probability of the left child is equal to the probability of the right child.
(b) Let $X_1, X_2, \ldots, X_n$ be drawn i.i.d. $\sim p(x)$. Using the Huffman code for $p(x)$, we map $X_1, X_2, \ldots, X_n$ to a sequence of bits $Y_1, Y_2, \ldots, Y_{k(X_1, X_2, \ldots, X_n)}$. (The length of this sequence will depend on the outcome $X_1, X_2, \ldots, X_n$.) Use part (a) to argue that the sequence $Y_1, Y_2, \ldots$ forms a sequence of fair coin flips, i.e., that $\Pr\{Y_i = 0\} = \Pr\{Y_i = 1\} = \frac{1}{2}$, independent of $Y_1, Y_2, \ldots, Y_{i-1}$. Thus the entropy rate of the coded sequence is 1 bit per symbol.
(c) Give a heuristic argument why the encoded sequence of bits for any code that achieves the entropy bound cannot be compressible and therefore should have an entropy rate of 1 bit per symbol.
Solution: Optimal codes for dyadic distributions.
9. ... sample the mixture. You proceed, mixing and tasting, stopping when the bad bottle has been determined.
(c) What is the minimum expected number of tastings required to determine the bad wine?
(d) What mixture should be tasted first?
Solution: Bad Wine
(a) If we taste one bottle at a time, to minimize the expected number of tastings the order of tasting should be from the most likely wine to be bad to the least. The expected number of tastings required is
\[ \sum_{i=1}^{6} p_i l_i = 1 \times \frac{8}{23} + 2 \times \frac{6}{23} + 3 \times \frac{4}{23} + 4 \times \frac{2}{23} + 5 \times \frac{2}{23} + 5 \times \frac{1}{23} = \frac{55}{23} = 2.39. \]
(b) The first bottle to be tasted should be the one with probability $\frac{8}{23}$.
(c) The idea is to use Huffman coding. With Huffman coding, we get codeword lengths as $(2, 2, 2, 3, 4, 4)$. The expected number of tastings required is
\[ \sum_{i=1}^{6} p_i l_i = 2 \times \frac{8}{23} + 2 \times \frac{6}{23} + 2 \times \frac{4}{23} + 3 \times \frac{2}{23} + 4 \times \frac{2}{23} + 4 \times \frac{1}{23} = \frac{54}{23} = 2.35. \]
(d) The mixture of the first and second bottles should be tasted first.
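Both expected values can be recomputed exactly. In this sketch the probabilities $(8, 6, 4, 2, 2, 1)/23$ are read off the sums above, `huffman_lengths` is a hypothetical helper implementing the standard binary Huffman merge, and the one-at-a-time lengths use the fact that the last bottle never needs to be tasted.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_lengths(weights):
    """Codeword lengths of a binary Huffman code for the given positive weights."""
    tie = count()                        # tie-breaker so symbol lists are never compared
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), s1 + s2))
    return lengths

probs = [Fraction(k, 23) for k in (8, 6, 4, 2, 2, 1)]
m = len(probs)

# One bottle at a time, most likely first; after m-1 good tastings the last is known.
sequential_lengths = [min(i + 1, m - 1) for i in range(m)]
sequential = sum(p * l for p, l in zip(probs, sequential_lengths))

# Tasting mixtures corresponds to a binary Huffman question tree.
mixture_lengths = huffman_lengths(probs)
mixtures = sum(p * l for p, l in zip(probs, mixture_lengths))

print(sequential, float(sequential))
print(sorted(mixture_lengths), mixtures, float(mixtures))
```

Ties between equal weights may permute which of the two bottles of probability 2/23 gets length 3, but the expected number of tastings is unaffected.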
(Textbook problem 33.) Huffman vs. Shannon. A random variable $X$ takes on three values with probabilities 0.6, 0.3, and 0.1.
(a) What are the lengths of the binary Huffman codewords for $X$? What are the lengths of the binary Shannon codewords ($l(x) = \left\lceil \log \frac{1}{p(x)} \right\rceil$) for $X$?
(b) What is the smallest integer $D$ such that the expected Shannon codeword length with a $D$-ary alphabet equals the expected Huffman codeword length with a $D$-ary alphabet?
Solution: Huffman vs. Shannon
(a) It is obvious that a Huffman code for the distribution (0.6, 0.3, 0.1) is (1, 01, 00), with codeword lengths (1, 2, 2). The Shannon code would use lengths $\left\lceil \log \frac{1}{p} \right\rceil$, which gives lengths (1, 2, 4) for the three symbols.
(b) For any $D > 2$, the Huffman codewords for the three symbols are all one character. The Shannon code length $\left\lceil \log_D \frac{1}{p} \right\rceil$ would be equal to 1 for all symbols if $\log_D \frac{1}{0.1} = 1$, i.e., if $D = 10$. Hence for $D \ge 10$, the Shannon code is also optimal.
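A short exact computation confirms both parts. This is an illustrative sketch: `shannon_length` is a hypothetical helper that finds $\lceil \log_D (1/p) \rceil$ by integer search rather than floating-point logarithms, and the search range for $D$ is an arbitrary assumption.

```python
from fractions import Fraction

def shannon_length(p, D):
    """Smallest integer l with D**l >= 1/p, i.e. ceil(log_D(1/p)), in exact arithmetic."""
    l = 0
    while D ** l * p < 1:
        l += 1
    return l

probs = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]

# Part (a): binary Shannon codeword lengths.
binary_lengths = [shannon_length(p, 2) for p in probs]

# Part (b): for D >= 3 the Huffman code uses one character per symbol, so look
# for the smallest D at which every Shannon length is also 1.
smallest_D = next(D for D in range(3, 100)
                  if all(shannon_length(p, D) == 1 for p in probs))

print(binary_lengths, smallest_D)
```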
(Textbook problem 34.) Huffman algorithm for tree construction. Consider the following problem: $m$ binary signals $S_1, S_2, \ldots, S_m$ are available at times $T_1 \le T_2 \le \ldots \le T_m$, and we would like to find their sum $S_1 \oplus S_2 \oplus \cdots \oplus S_m$ using 2-input gates, each gate with 1 time unit delay, so that the final result is available as quickly as possible. A simple greedy algorithm is to combine the earliest two results, forming the partial result at time $\max(T_1, T_2) + 1$. We now have a new problem with $S_1 \oplus S_2, S_3, \ldots, S_m$, available at times $\max(T_1, T_2) + 1, T_3, \ldots, T_m$. We can now sort this list of $T$'s, and apply the same merging step again, repeating this until we have the final result.
(a) Argue that the above procedure is optimal, in that it constructs a circuit for which the final result is available as quickly as possible. ...