LECTURE NOTES ON INFORMATION THEORY
Preface
“There is a whole book of readymade, long and convincing, lavishly composed telegrams for all occasions. Sending such a
telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but
simply the number under which it is listed in the book, and the
signature of the sender. This is quite a funny thing, reminiscent
of Drugstore Breakfast #2. Everything is served up in a ready
form, and the customer is totally freed from the unpleasant
necessity to think, and to spend money on top of it.”
Little Golden America. Travelogue by I. Ilf and E. Petrov, 1937.
[Pre-Shannon encoding, courtesy of M. Raginsky]

These notes are a graduate-level introduction to the mathematics of Information Theory.
They were created by Yury Polyanskiy and Yihong Wu, who used them to teach at MIT
(2012, 2013 and 2016) and UIUC (2013, 2014). The core structure and flow of material are largely due to Prof. Sergio Verdú, whose wonderful class at Princeton University [Ver07] shaped our own perception of the subject. Specifically, we follow Prof. Verdú's style in
relying on single-shot results, Feinstein’s lemma and information spectrum methods. We
have added a number of technical refinements and new topics, which correspond to our own
interests (e.g., modern aspects of finite blocklength results and applications of information
theoretic methods to statistical decision theory).
Compared to the more popular “typicality” and “method of types” approaches (as
in Cover-Thomas [CT06] and Csiszár-Körner [CK81]), these notes prepare the reader to
consider delay-constraints (“non-asymptotics”) and to simultaneously treat continuous and
discrete sources/channels.
We are especially thankful to Dr. O. Ordentlich, who contributed a lecture on lattice
codes. The initial version was typed by Qingqing Huang and Austin Collins, who also created many graphics. Rachel Cohen has also edited various parts. Aolin Xu, Pengkun Yang and
Ganesh Ajjanagadde have contributed suggestions and corrections to the content.
We are indebted to all of them.
Y. Polyanskiy
Y. Wu
27 Feb 2015

Contents

Part I: Information measures

1 Information measures: entropy and divergence
  1.1 Entropy
  1.2 Divergence
  1.3 Differential entropy

2 Information measures: mutual information
  2.1 Divergence: main inequality
  2.2 Conditional divergence
  2.3 Mutual information
  2.4 Conditional mutual information and conditional independence
  2.5 Strong data-processing inequalities
  2.6* How to avoid measurability problems?

3 Sufficient statistic. Continuity of divergence and mutual information
  3.1 Sufficient statistics and data-processing
  3.2 Geometric interpretation of mutual information
  3.3 Variational characterizations of divergence: Donsker-Varadhan
  3.4 Variational characterizations of divergence: Gelfand-Yaglom-Perez
  3.5 Continuity of divergence. Dependence on σ-algebra.
  3.6 Variational characterizations and continuity of mutual information

4 Extremization of mutual information: capacity saddle point
  4.1 Convexity of information measures
  4.2* Local behavior of divergence
  4.3* Local behavior of divergence and Fisher information
  4.4 Extremization of mutual information
  4.5 Capacity = information radius
  4.6 Existence of caod (general case)
  4.7 Gaussian saddle point

5 Single-letterization. Probability of error. Entropy rate.
  5.1 Extremization of mutual information for memoryless sources and channels
  5.2* Gaussian capacity via orthogonal symmetry
  5.3 Information measures and probability of error
  5.4 Fano, LeCam and minimax risks
  5.5 Entropy rate
  5.6 Entropy and symbol (bit) error rate
  5.7 Mutual information rate
  5.8* Toeplitz matrices and Szegő's theorem

Part II: Lossless data compression

6 Variable-length Lossless Compression
  6.1 Variable-length, lossless, optimal compressor
  6.2 Uniquely decodable codes, prefix codes and Huffman codes

7 Fixed-length (almost lossless) compression. Slepian-Wolf problem.
  7.1 Fixed-length code, almost lossless
  7.2 Linear Compression
  7.3 Compression with Side Information at both compressor and decompressor
  7.4 Slepian-Wolf (Compression with Side Information at Decompressor only)
  7.5 Multi-terminal Slepian Wolf
  7.6* Source-coding with a helper (Ahlswede-Körner-Wyner)

8 Compressing stationary ergodic sources
  8.1 Bits of ergodic theory
  8.2 Proof of Shannon-McMillan
  8.3* Proof of Birkhoff-Khintchine
  8.4* Sinai's generator theorem

9 Universal compression
  9.1 Arithmetic coding
  9.2 Combinatorial construction of Fitingof
  9.3 Optimal compressors for a class of sources. Redundancy.
  9.4* Approximate minimax solution: Jeffreys prior
  9.5 Sequential probability assignment: Krichevsky-Trofimov
  9.6 Lempel-Ziv compressor

Part III: Binary hypothesis testing

10 Binary hypothesis testing
  10.1 Binary Hypothesis Testing
  10.2 Neyman-Pearson formulation
  10.3 Likelihood ratio tests
  10.4 Converse bounds on R(P, Q)
  10.5 Achievability bounds on R(P, Q)
  10.6 Asymptotics

11 Hypothesis testing asymptotics I
  11.1 Stein's regime
  11.2 Chernoff regime
  11.3 Basics of Large deviation theory

12 Information projection and Large deviation
  12.1 Large-deviation exponents
  12.2 Information Projection
  12.3 Interpretation of Information Projection
  12.4 Generalization: Sanov's theorem

13 Hypothesis testing asymptotics II
  13.1 (E0, E1)-Tradeoff
  13.2 Equivalent forms of Theorem 13.1
  13.3* Sequential Hypothesis Testing

Part IV: Channel coding

14 Channel coding
  14.1 Channel Coding
  14.2 Basic Results
  14.3 General (Weak) Converse Bounds
  14.4 General achievability bounds: Preview

15 Channel coding: achievability bounds
  15.1 Information density
  15.2 Shannon's achievability bound
  15.3 Dependence-testing bound
  15.4 Feinstein's Lemma

16 Linear codes. Channel capacity
  16.1 Linear coding
  16.2 Channels and channel capacity
  16.3 Bounds on C; Capacity of Stationary Memoryless Channels
  16.4 Examples of DMC
  16.5* Information Stability

17 Channels with input constraints. Gaussian channels.
  17.1 Channel coding with input constraints
  17.2 Capacity under input constraint C(P) = Ci(P)
  17.3 Applications
  17.4* Non-stationary AWGN
  17.5* Stationary Additive Colored Gaussian noise channel
  17.6* Additive White Gaussian Noise channel with Intersymbol Interference
  17.7* Gaussian channels with amplitude constraints
  17.8* Gaussian channels with fading

18 Lattice codes (by O. Ordentlich)
  18.1 Lattice Definitions
  18.2 First Attempt at AWGN Capacity
  18.3 Nested Lattice Codes/Voronoi Constellations
  18.4 Dirty Paper Coding
  18.5 Construction of Good Nested Lattice Pairs

19 Channel coding: energy-per-bit, continuous-time channels
  19.1 Energy per bit
  19.2 What is N0?
  19.3 Capacity of the continuous-time band-limited AWGN channel
  19.4 Capacity of the continuous-time band-unlimited AWGN channel
  19.5 Capacity per unit cost

20 Advanced channel coding. Source-Channel separation.
  20.1 Strong Converse
  20.2 Stationary memoryless channel without strong converse
  20.3 Channel Dispersion
  20.4 Normalized Rate
  20.5 Joint Source Channel Coding

21 Channel coding with feedback
  21.1 Feedback does not increase capacity for stationary memoryless channels
  21.2* Alternative proof of Theorem 21.1 and Massey's directed information
  21.3 When is feedback really useful?

22 Capacity-achieving codes via Forney concatenation
  22.1 Error exponents
  22.2 Achieving polynomially small error probability
  22.3 Concatenated codes
  22.4 Achieving exponentially small error probability

Part V: Lossy data compression

23 Rate-distortion theory
  23.1 Scalar quantization
  23.2 Information-theoretic vector quantization
  23.3* Converting excess distortion to average

24 Rate distortion: achievability bounds
  24.1 Recap
  24.2 Shannon's rate-distortion theorem
  24.3* Covering lemma

25 Evaluating R(D). Lossy Source-Channel separation.
  25.1 Evaluation of R(D)
  25.2* Analog of saddle-point property in rate-distortion
  25.3 Lossy joint source-channel coding
  25.4 What is lacking in classical lossy compression?

Part VI: Advanced topics

26 Multiple-access channel
  26.1 Problem motivation and main results
  26.2 MAC achievability bound
  26.3 MAC capacity region proof

27 Examples of MACs. Maximal Pe and zero-error capacity.
  27.1 Recap
  27.2 Orthogonal MAC
  27.3 BSC MAC
  27.4 Adder MAC
  27.5 Multiplier MAC
  27.6 Contraction MAC
  27.7 Gaussian MAC
  27.8 MAC Peculiarities

28 Random number generators
  28.1 Setup
  28.2 Converse
  28.3 Elias' construction of RNG from lossless compressors
  28.4 Peres' iterated von Neumann's scheme
  28.5 Bernoulli factory
  28.6 Related problems

Bibliography

Part I: Information measures

§ 1. Information measures: entropy and divergence

Review: Random variables
• Two methods to describe a random variable (R.V.) X:
1. a function X ∶ Ω → X from the probability space (Ω, F , P) to a target space X .
2. a distribution PX on some measurable space (X , F ).
• Convention: capital letter – RV (e.g. X); small letter – realization (e.g. x0 ).
• X — discrete if there exists a countable set X = {xj, j = 1, . . .} such that ∑_{j=1}^{∞} PX(xj) = 1. The set X is called the alphabet of X, the x ∈ X are atoms, and PX(xj) is the probability mass function (pmf).
• For a discrete RV, the support is supp PX = {x ∶ PX(x) > 0}.
• Vector RVs: X_1^n ≜ (X1, . . . , Xn); also denoted simply X^n.
• For a vector RV X^n and S ⊂ {1, . . . , n} we denote XS = {Xi, i ∈ S}.

1.1 Entropy

Definition 1.1 (Entropy). For a discrete R.V. X with distribution PX:
\[
H(X) = E\Big[\log \frac{1}{P_X(X)}\Big] = \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)}.
\]

Definition 1.2 (Joint entropy). X^n = (X1, X2, . . . , Xn) – a random vector with n components.
\[
H(X^n) = H(X_1, \dots, X_n) = E\Big[\log \frac{1}{P_{X_1,\dots,X_n}(X_1, \dots, X_n)}\Big].
\]

Definition 1.3 (Conditional entropy).
\[
H(X \mid Y) = E_{y \sim P_Y}\big[H(P_{X \mid Y = y})\big] = E\Big[\log \frac{1}{P_{X \mid Y}(X \mid Y)}\Big],
\]
i.e., the entropy of P_{X|Y=y} averaged over PY.

Note:

• Q: Why such a definition, why log, why "entropy"?
The name comes from thermodynamics. The definition is justified by theorems in this course (e.g. operationally by compression), but also by a number of experiments. For example, we can measure the time it takes for scout ants to describe the location of food to worker ants. It was found that when the nest is placed at the root of a full binary tree of depth d and the food at one of the leaves, the time was proportional to log 2^d = d, the entropy of the random variable describing the food location. It was estimated that ants communicate at about 0.7-1 bit/min. Furthermore, communication time decreases if there are regularities in the path description (e.g., paths like "left, right, left, right, left, right" were described faster). See [RZ86] for more.
• We agree that 0 log (1/0) = 0 (by continuity of x ↦ x log (1/x)).
• Also write H(PX ) instead of H(X) (abuse of notation, as customary in information theory).
• Base of the log determines the units:
  log_2 ↔ bits
  log_e ↔ nats
  log_256 ↔ bytes
  log ↔ arbitrary units, base always matches exp
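To make Definitions 1.1-1.3 concrete, here is a minimal numerical sketch (not part of the original notes): it computes H(X), H(X, Y) and H(X|Y) in bits from a small joint pmf. The joint distribution P_XY and all function names below are made up for illustration.

import math

# Toy joint pmf P_{X,Y} on a 2x2 alphabet (values chosen arbitrarily).
P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(pmf, base=2):
    """Entropy of a pmf given as {atom: probability}, with 0 log(1/0) := 0."""
    return sum(p * math.log(1 / p, base) for p in pmf.values() if p > 0)

# Marginals P_X and P_Y obtained by summing out the other coordinate.
P_X, P_Y = {}, {}
for (x, y), p in P_XY.items():
    P_X[x] = P_X.get(x, 0.0) + p
    P_Y[y] = P_Y.get(y, 0.0) + p

# Conditional entropy directly from Definition 1.3:
# H(X|Y) = sum_y P_Y(y) * H(P_{X|Y=y}).
H_X_given_Y = sum(
    py * H({x: P_XY[(x, y)] / py for x in P_X if (x, y) in P_XY})
    for y, py in P_Y.items()
)

print(H(P_X), H(P_Y), H(P_XY), H_X_given_Y)

On this toy example one can also check numerically that H(X, Y) = H(Y) + H(X|Y), a chain rule established later in the course.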
Example (Bernoulli): X ∈ {0, 1}, P[X = 1] = PX(1) ≜ p. Then
\[
H(X) = h(p) \triangleq p \log\frac{1}{p} + \bar{p} \log\frac{1}{\bar{p}}, \qquad \bar{p} \triangleq 1 - p,
\]
where h(⋅) is called the binary entropy function.
Proposition 1.1. h(⋅) is continuous, concave on [0, 1], and
\[
h'(p) = \log\frac{\bar{p}}{p},
\]
with infinite slope at 0 and 1.
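For completeness, a short derivation of the derivative claimed in Proposition 1.1 (added here; not part of the original text):
\[
h'(p) = \frac{d}{dp}\Big[-p\log p - \bar{p}\log\bar{p}\Big]
      = -\log p - \log e + \log\bar{p} + \log e
      = \log\frac{\bar{p}}{p},
\qquad
h''(p) = -\frac{\log e}{p} - \frac{\log e}{\bar{p}} = -\frac{\log e}{p\,\bar{p}} < 0,
\]
so h is strictly concave, and h'(p) → +∞ as p → 0 and h'(p) → −∞ as p → 1, which is the infinite slope at the endpoints.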
Example (Geometric): X ∈ {0, 1, 2, . . .}, P[X = i] = PX(i) = p ⋅ \bar{p}^i. Then
\[
H(X) = \sum_{i=0}^{\infty} p\,\bar{p}^{\,i}\Big(i \log\frac{1}{\bar{p}} + \log\frac{1}{p}\Big) \dots
\]
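A small numerical sanity check (not from the notes): it evaluates the series above by truncation and compares it with h(p)/p, the closed form one expects for this geometric distribution (stated here as an assumption, since the text above is cut short); the function names are made up.

import math

def h(p, base=2):
    """Binary entropy function h(p) = p log(1/p) + (1-p) log(1/(1-p))."""
    q = 1 - p
    return p * math.log(1 / p, base) + q * math.log(1 / q, base)

def geometric_entropy(p, base=2, terms=10_000):
    """Truncated series sum_i p*q^i * (i*log(1/q) + log(1/p)) for P[X = i] = p*q^i."""
    q = 1 - p
    return sum(p * q**i * (i * math.log(1 / q, base) + math.log(1 / p, base))
               for i in range(terms))

p = 0.3
print(geometric_entropy(p))  # ~2.9376 bits
print(h(p) / p)              # agrees up to truncation error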