LECTURE NOTES ON INFORMATION THEORY

Preface

“There is a whole book of readymade, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.”

Little Golden America. Travelogue by I. Ilf and E. Petrov, 1937.
[Pre-Shannon encoding, courtesy of M. Raginsky]

These notes are a graduate-level introduction to the mathematics of Information Theory. They were created by Yury Polyanskiy and Yihong Wu, who used them to teach at MIT (2012, 2013 and 2016) and UIUC (2013, 2014). The core structure and flow of material are largely due to Prof. Sergio Verdú, whose wonderful class at Princeton University [Ver07] shaped our own perception of the subject. Specifically, we follow Prof. Verdú's style in relying on single-shot results, Feinstein's lemma and information-spectrum methods. We have added a number of technical refinements and new topics, which correspond to our own interests (e.g., modern aspects of finite-blocklength results and applications of information-theoretic methods to statistical decision theory). Compared to the more popular “typicality” and “method of types” approaches (as in Cover-Thomas [CT06] and Csiszár-Körner [CK81]), these notes prepare the reader to consider delay constraints (“non-asymptotics”) and to treat continuous and discrete sources/channels simultaneously.

We are especially thankful to Dr. O. Ordentlich, who contributed a lecture on lattice codes. The initial version was typed by Qingqing Huang and Austin Collins, who also created many graphics. Rachel Cohen has also edited various parts. Aolin Xu, Pengkun Yang and Ganesh Ajjanagadde have contributed suggestions and corrections to the content. We are indebted to all of them.

Y. Polyanskiy
Y. Wu
27 Feb 2015

Contents

Part I: Information measures

1 Information measures: entropy and divergence
  1.1 Entropy
  1.2 Divergence
  1.3 Differential entropy

2 Information measures: mutual information
  2.1 Divergence: main inequality
  2.2 Conditional divergence
  2.3 Mutual information
  2.4 Conditional mutual information and conditional independence
  2.5 Strong data-processing inequalities
  2.6* How to avoid measurability problems?

3 Sufficient statistic. Continuity of divergence and mutual information
  3.1 Sufficient statistics and data-processing
  3.2 Geometric interpretation of mutual information
  3.3 Variational characterizations of divergence: Donsker-Varadhan
  3.4 Variational characterizations of divergence: Gelfand-Yaglom-Perez
  3.5 Continuity of divergence. Dependence on σ-algebra.
  3.6 Variational characterizations and continuity of mutual information

4 Extremization of mutual information: capacity saddle point
  4.1 Convexity of information measures
  4.2* Local behavior of divergence
  4.3* Local behavior of divergence and Fisher information
  4.4 Extremization of mutual information
  4.5 Capacity = information radius
  4.6 Existence of caod (general case)
  4.7 Gaussian saddle point

5 Single-letterization. Probability of error. Entropy rate.
  5.1 Extremization of mutual information for memoryless sources and channels
  5.2* Gaussian capacity via orthogonal symmetry
  5.3 Information measures and probability of error
  5.4 Fano, LeCam and minimax risks
  5.5 Entropy rate
  5.6 Entropy and symbol (bit) error rate
  5.7 Mutual information rate
  5.8* Toeplitz matrices and Szegő's theorem

Part II: Lossless data compression

6 Variable-length Lossless Compression
  6.1 Variable-length, lossless, optimal compressor
  6.2 Uniquely decodable codes, prefix codes and Huffman codes

7 Fixed-length (almost lossless) compression. Slepian-Wolf problem.
  7.1 Fixed-length code, almost lossless
  7.2 Linear Compression
  7.3 Compression with Side Information at both compressor and decompressor
  7.4 Slepian-Wolf (Compression with Side Information at Decompressor only)
  7.5 Multi-terminal Slepian-Wolf
  7.6* Source-coding with a helper (Ahlswede-Körner-Wyner)

8 Compressing stationary ergodic sources
  8.1 Bits of ergodic theory
  8.2 Proof of Shannon-McMillan
  8.3* Proof of Birkhoff-Khintchine
  8.4* Sinai's generator theorem

9 Universal compression
  9.1 Arithmetic coding
  9.2 Combinatorial construction of Fitingof
  9.3 Optimal compressors for a class of sources. Redundancy.
  9.4* Approximate minimax solution: Jeffreys prior
  9.5 Sequential probability assignment: Krichevsky-Trofimov
  9.6 Lempel-Ziv compressor

Part III: Binary hypothesis testing

10 Binary hypothesis testing
  10.1 Binary Hypothesis Testing
  10.2 Neyman-Pearson formulation
  10.3 Likelihood ratio tests
  10.4 Converse bounds on R(P, Q)
  10.5 Achievability bounds on R(P, Q)
  10.6 Asymptotics

11 Hypothesis testing asymptotics I
  11.1 Stein's regime
  11.2 Chernoff regime
  11.3 Basics of Large deviation theory

12 Information projection and Large deviation
  12.1 Large-deviation exponents
  12.2 Information Projection
  12.3 Interpretation of Information Projection
  12.4 Generalization: Sanov's theorem

13 Hypothesis testing asymptotics II
  13.1 (E0, E1)-Tradeoff
  13.2 Equivalent forms of Theorem 13.1
  13.3* Sequential Hypothesis Testing

Part IV: Channel coding

14 Channel coding
  14.1 Channel Coding
  14.2 Basic Results
  14.3 General (Weak) Converse Bounds
  14.4 General achievability bounds: Preview

15 Channel coding: achievability bounds
  15.1 Information density
  15.2 Shannon's achievability bound
  15.3 Dependence-testing bound
  15.4 Feinstein's Lemma

16 Linear codes. Channel capacity
  16.1 Linear coding
  16.2 Channels and channel capacity
  16.3 Bounds on C; Capacity of Stationary Memoryless Channels
  16.4 Examples of DMC
  16.5* Information Stability

17 Channels with input constraints. Gaussian channels.
  17.1 Channel coding with input constraints
  17.2 Capacity under input constraint: C(P) =? Ci(P)
  17.3 Applications
  17.4* Non-stationary AWGN
  17.5* Stationary Additive Colored Gaussian noise channel
  17.6* Additive White Gaussian Noise channel with Intersymbol Interference
  17.7* Gaussian channels with amplitude constraints
  17.8* Gaussian channels with fading

18 Lattice codes (by O. Ordentlich)
  18.1 Lattice Definitions
  18.2 First Attempt at AWGN Capacity
  18.3 Nested Lattice Codes/Voronoi Constellations
  18.4 Dirty Paper Coding
  18.5 Construction of Good Nested Lattice Pairs

19 Channel coding: energy-per-bit, continuous-time channels
  19.1 Energy per bit
  19.2 What is N0?
  19.3 Capacity of the continuous-time band-limited AWGN channel
  19.4 Capacity of the continuous-time band-unlimited AWGN channel
  19.5 Capacity per unit cost

20 Advanced channel coding. Source-Channel separation.
  20.1 Strong Converse
  20.2 Stationary memoryless channel without strong converse
  20.3 Channel Dispersion
  20.4 Normalized Rate
  20.5 Joint Source Channel Coding

21 Channel coding with feedback
  21.1 Feedback does not increase capacity for stationary memoryless channels
  21.2* Alternative proof of Theorem 21.1 and Massey's directed information
  21.3 When is feedback really useful?

22 Capacity-achieving codes via Forney concatenation
  22.1 Error exponents
  22.2 Achieving polynomially small error probability
  22.3 Concatenated codes
  22.4 Achieving exponentially small error probability
Part V: Lossy data compression

23 Rate-distortion theory
  23.1 Scalar quantization
  23.2 Information-theoretic vector quantization
  23.3* Converting excess distortion to average

24 Rate distortion: achievability bounds
  24.1 Recap
  24.2 Shannon's rate-distortion theorem
  24.3* Covering lemma

25 Evaluating R(D). Lossy Source-Channel separation.
  25.1 Evaluation of R(D)
  25.2* Analog of saddle-point property in rate-distortion
  25.3 Lossy joint source-channel coding
  25.4 What is lacking in classical lossy compression?

Part VI: Advanced topics

26 Multiple-access channel
  26.1 Problem motivation and main results
  26.2 MAC achievability bound
  26.3 MAC capacity region proof

27 Examples of MACs. Maximal Pe and zero-error capacity.
  27.1 Recap
  27.2 Orthogonal MAC
  27.3 BSC MAC
  27.4 Adder MAC
  27.5 Multiplier MAC
  27.6 Contraction MAC
  27.7 Gaussian MAC
  27.8 MAC Peculiarities

28 Random number generators
  28.1 Setup
  28.2 Converse
  28.3 Elias' construction of RNG from lossless compressors
  28.4 Peres' iterated von Neumann's scheme
  28.5 Bernoulli factory
  28.6 Related problems

Bibliography

Part I: Information measures

§ 1. Information measures: entropy and divergence

Review: Random variables

• Two methods to describe a random variable (R.V.) X:
  1. a function X: Ω → 𝒳 from the probability space (Ω, F, P) to a target space 𝒳;
  2. a distribution P_X on some measurable space (𝒳, F).
• Convention: capital letters denote random variables (e.g. X); small letters denote realizations (e.g. x0).

• X is discrete if there exists a countable set 𝒳 = {xj, j = 1, ...} such that ∑_{j=1}^∞ P_X(xj) = 1. The set 𝒳 is called the alphabet of X, the points x ∈ 𝒳 its atoms, and P_X(xj) the probability mass function (pmf).

• For a discrete RV, the support is supp P_X = {x : P_X(x) > 0}.

• Vector RVs: X_1^n ≜ (X_1, ..., X_n), also denoted simply X^n.

• For a vector RV X^n and S ⊂ {1, ..., n} we denote X_S = {X_i, i ∈ S}.

1.1 Entropy

Definition 1.1 (Entropy). For a discrete R.V. X with distribution P_X:

    H(X) = E[ log 1/P_X(X) ] = ∑_{x∈𝒳} P_X(x) log 1/P_X(x).

Definition 1.2 (Joint entropy). Let X^n = (X_1, X_2, ..., X_n) be a random vector with n components. Then

    H(X^n) = H(X_1, ..., X_n) = E[ log 1/P_{X_1,...,X_n}(X_1, ..., X_n) ].

Definition 1.3 (Conditional entropy).

    H(X|Y) = E_{y∼P_Y}[ H(P_{X|Y=y}) ] = E[ log 1/P_{X|Y}(X|Y) ],

i.e., the entropy of P_{X|Y=y} averaged over P_Y.

Note:

• Q: Why such a definition, why log, why “entropy”? The name comes from thermodynamics. The definition is justified by the theorems in this course (e.g. operationally by compression), but also by a number of experiments. For example, one can measure the time it takes for ant scouts to describe the location of food to ant workers. It was found that when the nest is placed at the root of a full binary tree of depth d and the food at one of the leaves, the time is proportional to log 2^d = d, the entropy of the random variable describing the food location. It was estimated that ants communicate at about 0.7–1 bit/min. Furthermore, the communication time decreases when there are regularities in the path description (e.g., paths like “left, right, left, right, left, right” were described faster). See [RZ86] for more.

• We agree that 0 log(1/0) = 0 (by continuity of x ↦ x log(1/x)).

• We also write H(P_X) instead of H(X) (an abuse of notation, as customary in information theory).

• The base of the log determines the units:
    log_2 ↔ bits
    log_e ↔ nats
    log_256 ↔ bytes
    log ↔ arbitrary units; the base always matches that of exp.

Example (Bernoulli): X ∈ {0, 1}, P[X = 1] = P_X(1) ≜ p, with p̄ ≜ 1 − p. Then

    H(X) = h(p) ≜ p log 1/p + p̄ log 1/p̄,

where h(·) is called the binary entropy function.

Proposition 1.1. h(·) is continuous, concave on [0, 1], and

    h′(p) = log (p̄/p),

with infinite slope at 0 and 1.

[Figure: plot of the binary entropy function h(p), 0 ≤ p ≤ 1.]

Example (Geometric): X ∈ {0, 1, 2, ...}, P[X = i] = P_X(i) = p · p̄^i. Then

    H(X) = ∑_{i=0}^∞ p p̄^i ( i log 1/p̄ + log 1/p ) = ...
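The definitions in this section are easy to sanity-check numerically. The following is a minimal sketch, not part of the original notes: it computes H(X) directly from a pmf (using the convention 0 log(1/0) = 0), evaluates the binary entropy function h(p), and sums the geometric-example series whose computation is cut off above. For P_X(i) = p·p̄^i that series evaluates to h(p)/p, a standard fact used here only as a numerical cross-check. The function names and the use of numpy are illustrative choices, not anything prescribed by the notes.

```python
import numpy as np

def entropy(pmf, base=2.0):
    """H(X) = sum_x P_X(x) log 1/P_X(x), with the convention 0 log(1/0) = 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # drop zero-probability atoms
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

def binary_entropy(p, base=2.0):
    """h(p) = p log(1/p) + (1-p) log(1/(1-p))."""
    return entropy([p, 1.0 - p], base)

# Bernoulli example: H(X) = h(p)
p = 0.11
assert abs(entropy([p, 1 - p]) - binary_entropy(p)) < 1e-12

# Geometric example: P_X(i) = p * (1-p)^i, i = 0, 1, 2, ...
# Summing far enough out, the direct sum matches the closed form h(p)/p.
i = np.arange(0, 2000)
geom = p * (1 - p) ** i
print(entropy(geom))            # ≈ 4.545 bits
print(binary_entropy(p) / p)    # same value: h(p)/p
```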
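Definition 1.3 gives two expressions for H(X|Y): the average of H(P_{X|Y=y}) over y ∼ P_Y, and E[log 1/P_{X|Y}(X|Y)]. The sketch below, again my own illustration with an arbitrary made-up joint pmf rather than an example taken from the notes, checks that the two forms agree.

```python
import numpy as np

def entropy_bits(pmf):
    """H of a pmf in bits, with the convention 0 log(1/0) = 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# A small, arbitrary joint pmf P_{X,Y} on a 2x3 alphabet (rows: x, columns: y).
P_XY = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.15, 0.20]])
P_Y = P_XY.sum(axis=0)                      # marginal of Y

# First form: H(X|Y) = E_{y ~ P_Y}[ H(P_{X|Y=y}) ]
H1 = sum(P_Y[y] * entropy_bits(P_XY[:, y] / P_Y[y]) for y in range(P_XY.shape[1]))

# Second form: H(X|Y) = E[ log 1/P_{X|Y}(X|Y) ], expectation taken over P_{X,Y}
P_X_given_Y = P_XY / P_Y                    # column y holds P_{X|Y=y}
H2 = float(np.sum(P_XY * np.log2(1.0 / P_X_given_Y)))

print(H1, H2)                               # both ≈ 0.85 bits
```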