Entropy and Information Theory
July 16, 2009

Robert M. Gray
Information Systems Laboratory
Electrical Engineering Department
Stanford University

Springer-Verlag, New York

© 1990 by Springer-Verlag. Revised 2000, 2007, 2008, 2009 by Robert M. Gray.

To Tim, Lori, Julia, Peter, Gus, Amy Elizabeth, and Alice,
and in memory of Tino.

Contents
Prologue

1  Information Sources
   1.1  Introduction
   1.2  Probability Spaces and Random Variables
   1.3  Random Processes and Dynamical Systems
   1.4  Distributions
   1.5  Standard Alphabets
   1.6  Expectation
   1.7  Asymptotic Mean Stationarity
   1.8  Ergodic Properties

2  Entropy and Information
   2.1  Introduction
   2.2  Entropy and Entropy Rate
   2.3  Basic Properties of Entropy
   2.4  Entropy Rate
   2.5  Conditional Entropy and Information
   2.6  Entropy Rate Revisited
   2.7  Relative Entropy Densities

3  The Entropy Ergodic Theorem
   3.1  Introduction
   3.2  Stationary Ergodic Sources
   3.3  Stationary Nonergodic Sources
   3.4  AMS Sources
   3.5  The Asymptotic Equipartition Property

4  Information Rates I
   4.1  Introduction
   4.2  Stationary Codes and Approximation
   4.3  Information Rate of Finite Alphabet Processes

5  Relative Entropy
   5.1  Introduction
   5.2  Divergence
   5.3  Conditional Relative Entropy
   5.4  Limiting Entropy Densities
   5.5  Information for General Alphabets
   5.6  Some Convergence Results

6  Information Rates II
   6.1  Introduction
   6.2  Information Rates for General Alphabets
   6.3  A Mean Ergodic Theorem for Densities
   6.4  Information Rates of Stationary Processes

7  Relative Entropy Rates
   7.1  Introduction
   7.2  Relative Entropy Densities and Rates
   7.3  Markov Dominating Measures
   7.4  Stationary Processes
   7.5  Mean Ergodic Theorems

8  Ergodic Theorems for Densities
   8.1  Introduction
   8.2  Stationary Ergodic Sources
   8.3  Stationary Nonergodic Sources
   8.4  AMS Sources
   8.5  Ergodic Theorems for Information Densities

9  Channels and Codes
   9.1  Introduction
   9.2  Channels
   9.3  Stationarity Properties of Channels
   9.4  Examples of Channels
   9.5  The Rohlin-Kakutani Theorem

10  Distortion
   10.1  Introduction
   10.2  Distortion and Fidelity Criteria
   10.3  Performance
   10.4  The rho-bar Distortion
   10.5  d-bar Continuous Channels
   10.6  The Distortion-Rate Function

11  Source Coding Theorems
   11.1  Source Coding and Channel Coding
   11.2  Block Source Codes for AMS Sources
   11.3  Block Coding Stationary Sources
   11.4  Block Coding AMS Ergodic Sources
   11.5  Subadditive Fidelity Criteria
   11.6  Asynchronous Block Codes
   11.7  Sliding Block Source Codes
   11.8  A Geometric Interpretation of Operational DRFs

12  Coding for Noisy Channels
   12.1  Noisy Channels
   12.2  Feinstein's Lemma
   12.3  Feinstein's Theorem
   12.4  Channel Capacity
   12.5  Robust Block Codes
   12.6  Block Coding Theorems for Noisy Channels
   12.7  Joint Source and Channel Block Codes
   12.8  Synchronizing Block Channel Codes
   12.9  Sliding Block Source and Channel Coding

Bibliography

Index

Prologue
This book is devoted to the theory of probabilistic information measures and
their application to coding theorems for information sources and noisy channels.
The eventual goal is a general development of Shannon’s mathematical theory
of communication, but much of the space is devoted to the tools and methods
required to prove the Shannon coding theorems. These tools form an area common to ergodic theory and information theory and comprise several quantitative
notions of the information in random variables, random processes, and dynamical systems. Examples are entropy, mutual information, conditional entropy,
conditional information, and relative entropy (discrimination, Kullback-Leibler
information), along with the limiting normalized versions of these quantities
such as entropy rate and information rate. When considering multiple random
objects, in addition to information we will be concerned with the distance or
distortion between the random objects, that is, the accuracy of the representation of one random object by another. Much of the book is concerned with the
properties of these quantities, especially the long term asymptotic behavior of
average information and distortion, where both sample averages and probabilistic averages are of interest.
The book has been strongly inﬂuenced by M. S. Pinsker’s classic Information
and Information Stability of Random Variables and Processes and by the seminal
work of A. N. Kolmogorov, I. M. Gelfand, A. M. Yaglom, and R. L. Dobrushin on
information measures for abstract alphabets and their convergence properties.
Many of the results herein are extensions of their generalizations of Shannon’s
original results. The mathematical models of this treatment are more general
than traditional treatments in that nonstationary and nonergodic information
processes are treated. The models are somewhat less general than those of the
Soviet school of information theory in the sense that standard alphabets rather
than completely abstract alphabets are considered. This restriction, however,
permits many stronger results as well as the extension to nonergodic processes.
In addition, the assumption of standard spaces simplifies many proofs, and such
spaces include virtually all examples of engineering interest.
The information convergence results are combined with ergodic theorems
to prove general Shannon coding theorems for sources and channels. The results are not the most general known and the converses are not the strongest
available, but they are sufficiently general to cover most systems encountered
in applications and they provide an introduction to recent extensions requiring
significant additional mathematical machinery. Several of the generalizations
have not previously been treated in book form. Examples of novel topics for an
information theory text include asymptotic mean stationary sources, one-sided
sources as well as two-sided sources, nonergodic sources, d̄-continuous channels,
and sliding block or stationary codes. Another novel aspect is the use of recent
proofs of general Shannon-McMillan-Breiman theorems which do not use martingale theory — a coding proof of Ornstein and Weiss [118] is used to prove
the almost everywhere convergence of sample entropy for discrete alphabet processes and a variation on the sandwich approach of Algoet and Cover [7] is used
to prove the convergence of relative entropy densities for general standard alphabet processes. Both results are proved for asymptotically mean stationary
processes which need not be ergodic.
This material can be considered as a sequel to my book Probability, Random
Processes, and Ergodic Properties [51] wherein the prerequisite results on probability, standard spaces, and ordinary ergodic properties may be found. This
book is self-contained with the exception of common (and a few less common)
results which may be found in the ﬁrst book.
It is my hope that the book will interest engineers in some of the mathematical aspects and general models of the theory and mathematicians in some of
the important engineering applications of performance bounds and code design
for communication systems.
Information theory, the mathematical theory of communication, has two
primary goals: The ﬁrst is the development of the fundamental theoretical limits on the achievable performance when communicating a given information
source over a given communications channel using coding schemes from within
a prescribed class. The second goal is the development of coding schemes that
provide performance that is reasonably good in comparison with the optimal
performance given by the theory. Information theory was born in a surprisingly rich state in the classic papers of Claude E. Shannon [131] [132] which
contained the basic results for simple memoryless sources and channels and introduced more general communication systems models, including ﬁnite state
sources and channels. The key tools used to prove the original results and many
of those that followed were special cases of the ergodic theorem and a new variation of the ergodic theorem which considered sample averages of a measure of
the entropy or self information in a process.
Information theory can be viewed as simply a branch of applied probability
theory. Because of its dependence on ergodic theorems, however, it can also be
viewed as a branch of ergodic theory, the theory of invariant transformations
and transformations related to invariant transformations. In order to develop
the ergodic theory example of principal interest to information theory, suppose
that one has a random process, which for the moment we consider as a sample space or ensemble of possible output sequences together with a probability
measure on events composed of collections of such sequences. The shift is the
transformation on this space of sequences that takes a sequence and produces a
new sequence by shifting the first sequence a single time unit to the left. In other words, the shift transformation is a mathematical model for the effect of time
on a data sequence. If the probability of any sequence event is unchanged by
shifting the event, that is, by shifting all of the sequences in the event, then the
shift transformation is said to be invariant and the random process is said to be
stationary. Thus the theory of stationary random processes can be considered as
a subset of ergodic theory. Transformations that are not actually invariant (random processes which are not actually stationary) can be considered using similar
techniques by studying transformations which are almost invariant, which are
invariant in an asymptotic sense, or which are dominated or asymptotically
dominated in some sense by an invariant transformation. This generality can
be important as many real processes are not well modeled as being stationary.
Examples are processes with transients, processes that have been parsed into
blocks and coded, processes that have been encoded using variable-length codes
or ﬁnite state codes and channels with arbitrary starting states.
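The shift and its invariance can be checked numerically. The following sketch (an illustration invented for this discussion, not part of the text's development) estimates the probability of a two-symbol event on an iid binary sequence and on its shift; for a stationary source the two estimates agree up to sampling error:

```python
import random

# Illustration of the left shift T on sequences, (T x)_n = x_{n+1}.
# For a stationary source the probability of a sequence event is
# unchanged when the event is shifted.

def shift(x):
    """Shift a (finite window of a) sequence one time unit to the left."""
    return x[1:]

def empirical_prob_11(seq):
    """Relative frequency of the event {x_i = 1, x_{i+1} = 1}."""
    pairs = len(seq) - 1
    hits = sum(1 for i in range(pairs) if seq[i] == 1 and seq[i + 1] == 1)
    return hits / pairs

random.seed(0)
x = [random.randint(0, 1) for _ in range(200_000)]  # iid fair bits: stationary

p_original = empirical_prob_11(x)
p_shifted = empirical_prob_11(shift(x))
# Both estimates are near 1/4 and near each other for this stationary source.
assert abs(p_original - 0.25) < 0.01
assert abs(p_original - p_shifted) < 0.01
```

A nonstationary source, say one with a transient at time zero, would generally fail such a comparison for events involving the initial samples.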
Ergodic theory was originally developed for the study of statistical mechanics
as a means of quantifying the trajectories of physical or dynamical systems.
Hence, in the language of random processes, the early focus was on ergodic
theorems: theorems relating the time or sample average behavior of a random
process to its ensemble or expected behavior. The work of Hopf [65], von
Neumann [148] and others culminated in the pointwise or almost everywhere
ergodic theorem of Birkhoﬀ [16].
In the 1940’s and 1950’s Shannon made use of the ergodic theorem in the
simple special case of memoryless processes to characterize the optimal performance theoretically achievable when communicating information sources over
constrained random media called channels. The ergodic theorem was applied
in a direct fashion to study the asymptotic behavior of error frequency and
time average distortion in a communication system, but a new variation was
introduced by deﬁning a mathematical measure of the entropy or information
in a random process and characterizing its asymptotic behavior. These results
are known as coding theorems. Results describing performance that is actually
achievable, at least in the limit of unbounded complexity and time, are known as
positive coding theorems. Results providing unbeatable bounds on performance
are known as converse coding theorems or negative coding theorems. When the
same quantity is given by both positive and negative coding theorems, one has
exactly the optimal performance theoretically achievable by the given communication systems model.
While mathematical notions of information had existed before, it was Shannon who coupled the notion with the ergodic theorem and an ingenious idea
known as “random coding” in order to develop the coding theorems and to
thereby give operational signiﬁcance to such information measures. The name
“random coding” is a bit misleading since it refers to the random selection of
a deterministic code and not a coding system that operates in a random or
stochastic manner. The basic approach to proving positive coding theorems
was to analyze the average performance over a random selection of codes. If
the average is good, then there must be at least one code in the ensemble of
codes with performance as good as the average. The ergodic theorem is crucial to this argument for determining such average behavior. Unfortunately,
such proofs promise the existence of good codes but give little insight into their
construction.
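The existence step of the argument is simply the fact that a minimum never exceeds an average; a toy numerical sketch (the "scores" below are invented stand-ins for code performance):

```python
import random

# Toy sketch of the random coding existence argument: average a performance
# measure over an ensemble of (deterministic) codes; since the best code
# does at least as well as the average, a good average guarantees that at
# least one good code exists in the ensemble.
random.seed(1)

# Hypothetical stand-in: each code's performance is a distortion in [0, 1].
ensemble_scores = [random.random() for _ in range(1000)]
average_score = sum(ensemble_scores) / len(ensemble_scores)
best_score = min(ensemble_scores)

assert best_score <= average_score  # min <= mean, always
```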
Shannon’s original work focused on memoryless sources whose probability
distribution did not change with time and whose outputs were drawn from a finite alphabet or the real line. In this simple case the well-known ergodic theorem
immediately provided the required result concerning the asymptotic behavior of
information. He observed that the basic ideas extended in a relatively straightforward manner to more complicated Markov sources. Even this generalization,
however, was a far cry from the general stationary sources considered in the
ergodic theorem.
To continue the story requires a few additional words about measures of
information. Shannon really made use of two diﬀerent but related measures.
The ﬁrst was entropy, an idea inherited from thermodynamics and previously
proposed as a measure of the information in a random signal by Hartley [64].
Shannon defined the entropy of a discrete time discrete alphabet random process {Xn}, which we denote by H(X) while deferring its definition, and made
rigorous the idea that the entropy of a process is the amount of information in the process. He did this by proving a coding theorem showing that
if one wishes to code the given process into a sequence of binary symbols so
that a receiver viewing the binary sequence can reconstruct the original process
perfectly (or nearly so), then one needs at least H(X) binary symbols or bits
(converse theorem) and one can accomplish the task with very close to H(X)
bits (positive theorem). This coding theorem is known as the noiseless source
coding theorem.
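For a memoryless source the theorem can be checked numerically. The sketch below (an illustration, not the book's development; the distribution is invented) compares H(X) with the average length L of a binary Huffman code, which satisfies H(X) <= L < H(X) + 1:

```python
import heapq
from math import log2

def entropy_bits(p):
    """H(X) in bits for a finite distribution p: symbol -> probability."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def huffman_lengths(p):
    """Codeword lengths of a binary Huffman code for distribution p."""
    length = {s: 0 for s in p}
    heap = [(q, i, [s]) for i, (s, q) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        q1, _, group1 = heapq.heappop(heap)
        q2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:  # every symbol in a merged subtree gains a bit
            length[s] += 1
        heapq.heappush(heap, (q1 + q2, counter, group1 + group2))
        counter += 1
    return length

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # dyadic, so L = H exactly
H = entropy_bits(p)
lengths = huffman_lengths(p)
L = sum(p[s] * lengths[s] for s in p)
assert H <= L < H + 1
assert abs(H - 1.75) < 1e-12 and abs(L - 1.75) < 1e-12
```

Coding blocks of n source symbols at a time drives the per-symbol overhead below 1/n, which is how the positive theorem approaches H(X).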
The second notion of information used by Shannon was mutual information.
Entropy is really a notion of self information–the information provided by a
random process about itself. Mutual information is a measure of the information
contained in one process about another process. While entropy is suﬃcient to
study the reproduction of a single process through a noiseless environment, more
often one has two or more distinct random processes, e.g., one random process
representing an information source and another representing the output of a
communication medium wherein the coded source has been corrupted by another
random process called noise. In such cases observations are made on one process
in order to make decisions on another. Suppose that {Xn, Yn} is a random
process with a discrete alphabet, that is, taking on values in a discrete set. The
coordinate random processes {Xn} and {Yn} might correspond, for example,
to the input and output of a communication system. Shannon introduced the
notion of the average mutual information between the two processes:

    I(X, Y) = H(X) + H(Y) − H(X, Y),                         (1)

the sum of the two self entropies minus the entropy of the pair. This proved to
be the relevant quantity in coding theorems involving more than one distinct
random process: the channel coding theorem describing reliable communication
through a noisy channel, and the general source coding theorem describing the coding of a source for a user subject to a fidelity criterion. The first theorem
focuses on error detection and correction and the second on analogtodigital
conversion and data compression. Special cases of both of these coding theorems
were given in Shannon’s original work.
Average mutual information can also be defined in terms of conditional entropy (or equivocation) H(X|Y) = H(X, Y) − H(Y) and hence

    I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).                 (2)

In this form the mutual information can be interpreted as the information contained in one process minus the information contained in the process when the
other process is known. While elementary texts on information theory abound
with such intuitive descriptions of information measures, we will minimize such
discussion because of the potential pitfall of using the interpretations to apply
such measures to problems where they are not appropriate. (See, e.g., P. Elias’
“Information theory, photosynthesis, and religion” in his “Two famous papers”
[36].) Information measures are important because coding theorems exist imbuing them with operational signiﬁcance and not because of intuitively pleasing
aspects of their deﬁnitions.
We focus on the deﬁnition (1) of mutual information since it does not require
any explanation of what conditional entropy means and since it has a more
symmetric form than the conditional definitions. It turns out that H(X, X) =
H(X) (the entropy of a random variable is not changed by repeating it) and
hence from (1)

    I(X, X) = H(X)                                           (3)

so that entropy can be considered as a special case of average mutual information.
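Identities (1)-(3) are easy to verify numerically for a toy joint distribution (the numbers below are invented for illustration):

```python
from math import log2

def H(p):
    """Entropy in bits of a finite distribution: outcome -> probability."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

# Invented joint distribution for the pair (X, Y).
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px, py = {}, {}
for (x, y), q in pxy.items():
    px[x] = px.get(x, 0.0) + q  # marginal of X
    py[y] = py.get(y, 0.0) + q  # marginal of Y

I = H(px) + H(py) - H(pxy)      # definition (1)
H_X_given_Y = H(pxy) - H(py)    # conditional entropy H(X|Y)
H_Y_given_X = H(pxy) - H(px)    # conditional entropy H(Y|X)
assert abs(I - (H(px) - H_X_given_Y)) < 1e-12   # identity (2)
assert abs(I - (H(py) - H_Y_given_X)) < 1e-12   # identity (2)

# Pairing X with itself: H(X, X) = H(X), hence I(X; X) = H(X).
pxx = {(x, x): q for x, q in px.items()}
assert abs((H(px) + H(px) - H(pxx)) - H(px)) < 1e-12  # identity (3)
```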
To return to the story, Shannon’s work spawned the new ﬁeld of information
theory and also had a profound eﬀect on the older ﬁeld of ergodic theory.
Information theorists, both mathematicians and engineers, extended Shannon’s basic approach to ever more general models of information sources, coding
structures, and performance measures. The fundamental ergodic theorem for
entropy was extended to the same generality as the ordinary ergodic theorems by
McMillan [104] and Breiman [19] and the result is now known as the Shannon-McMillan-Breiman theorem. (Other names are the asymptotic equipartition
theorem or AEP, the ergodic theorem of information theory, and the entropy
theorem.) A variety of detailed proofs of the basic coding theorems and stronger
versions of the theorems for memoryless, Markov, and other special cases of random processes were developed, notable examples being the work of Feinstein [38]
[39] and Wolfowitz (see, e.g., Wolfowitz [153].) The ideas of measures of information, channels, codes, and communications systems were rigorously extended
to more general random processes with abstract alphabets and discrete and
continuous time by Khinchine [73], [74] and by Kolmogorov and his colleagues,
especially Gelfand, Yaglom, Dobrushin, and Pinsker [45], [91], [88], [32], [126].
(See, for example, “Kolmogorov’s contributions to information theory and algorithmic complexity” [23].) In almost all of the early Soviet work, it was average mutual information that played the fundamental role. It was the more natural quantity when more than one process was being considered. In addition,
the notion of entropy was not useful when dealing with processes with continuous alphabets since it is generally inﬁnite in such cases. A generalization of
the idea of entropy called discrimination was developed by Kullback (see, e.g.,
Kullback [93]) and was further studied by the Soviet school. This form of information measure is now more commonly referred to as relative entropy or cross
entropy (or Kullback-Leibler number) and it is better interpreted as a measure
of similarity between probability distributions than as a measure of information
between random variables. Many results for mutual information and entropy
can be viewed as special cases of results for relative entropy and the formula for
relative entropy arises naturally in some proofs.
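In the finite alphabet case the quantity has a simple closed form; a small sketch with invented distributions illustrates its basic properties:

```python
from math import log2

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)) over a common finite alphabet."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

# Invented distributions on a two-letter alphabet.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}

assert relative_entropy(p, p) == 0.0   # zero when the measures agree
assert relative_entropy(p, q) > 0.0    # nonnegative (divergence inequality)
assert relative_entropy(p, q) != relative_entropy(q, p)  # not symmetric
```

The asymmetry is one reason relative entropy behaves as a measure of dissimilarity between distributions rather than as a distance in the metric sense.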
It is the mathematical aspects of information theory and hence the descendants of the above results that are the focus of this book, but the developments
in the engineering community have had as signiﬁcant an impact on the foundations of information theory as they have had on applications. Simpler proofs of
the basic coding theorems were developed for special cases and, as a natural oﬀshoot, the rate of convergence to the optimal performance bounds characterized
in a variety of important cases. See, e.g., the texts by Gallager [43], Berger [11],
and Csiszár and Körner [26]. Numerous practicable coding techniques were developed which provided performance reasonably close to the optimum in many
cases: from the simple linear error correcting and detecting codes of Slepian
[139] to the huge variety of algebraic codes currently being implemented (see,
e.g., [13], [150],[96], [98], [18]) and the various forms of convolutional, tree, and
trellis codes for error correction and data compression (see, e.g., [147], [69]).
Clustering techniques have been used to develop good nonlinear codes (called
“vector quantizers”) for data compression applications such as speech and image
coding [49], [46], [100], [69], [119]. These clustering and trellis search techniques
have been combined to form single codes that combine the data compression
and reliable communication operations into a single coding system [8].
The engineering side of information theory through the middle 1970’s has
been well chronicled by two IEEE collections: Key Papers in the Development
of Information Theory, edited by D. Slepian [140], and Key Papers in the Development of Coding Theory, edited by E. Berlekamp [14]. In addition there have
been several survey papers describing the history of information theory during
each decade of its existence published in the IEEE Transactions on Information
Theory.
The inﬂuence on ergodic theory of Shannon’s work was equally great but in
a diﬀerent direction. After the development of quite general ergodic theorems,
one of the principal issues of ergodic theory was the isomorphism problem, the
characterization of conditions under which two dynamical systems are really the
same in the sense that each could be obtained from the other in an invertible
way by coding. Here, however, the coding was not of the variety considered by
Shannon — Shannon considered block codes, codes that parsed the data into
nonoverlapping blocks or windows of finite length and separately mapped each input block into an output block. The more natural construct in ergodic theory can be called a sliding block code — here the encoder views a block of possibly infinite length and produces a single symbol of the output sequence using some
mapping (or code or ﬁlter). The input sequence is then shifted one time unit to
the left, and the same mapping applied to produce the next output symbol, and
so on. This is a smoother operation than the block coding structure since the
outputs are produced based on overlapping windows of data instead of on a completely diﬀerent set of data each time. Unlike the Shannon codes, these codes
will produce stationary output processes if given stationary input processes. It
should be mentioned that examples of such sliding block codes often occurred
in the information theory literature: time-invariant convolutional codes or, simply, time-invariant linear filters are sliding block codes. It is perhaps odd that
virtually all of the theory for such codes in the information theory literature
was developed by eﬀectively considering the sliding block codes as very long
block codes. Recently sliding block codes have proved a useful structure for the
design of noiseless codes for constrained alphabet channels such as magnetic
recording devices, and techniques from symbolic dynamics have been applied to
the design of such codes. See, for example [3], [101].
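The structural difference between the two code classes can be sketched in a few lines (the alphabet and the mappings below are invented for illustration; a real sliding block code may use a two-sided, possibly infinite window):

```python
def block_encode(x, n, f):
    """Parse x into nonoverlapping length-n blocks; map each block with f."""
    out = []
    for i in range(0, len(x) - len(x) % n, n):
        out.extend(f(tuple(x[i:i + n])))
    return out

def sliding_block_encode(x, window, g):
    """Slide a length-`window` view along x; emit one symbol per shift."""
    return [g(tuple(x[i:i + window])) for i in range(len(x) - window + 1)]

x = [0, 1, 1, 0, 1, 0, 0, 1]
# Toy mappings: the block code inverts each 2-block; the sliding code
# outputs the parity of a 3-symbol window (a time-invariant filter).
inv = lambda b: [1 - s for s in b]
parity = lambda w: sum(w) % 2

assert block_encode(x, 2, inv) == [1, 0, 0, 1, 0, 1, 1, 0]
assert sliding_block_encode(x, 3, parity) == [0, 0, 0, 1, 1, 1]
```

The sliding code produces each output from an overlapping window that moves one time unit per symbol, so shifting the input merely shifts the output; the block code restarts on each nonoverlapping block, which is what breaks stationarity.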
Shannon’s noiseless source coding theorem suggested a solution to the isomorphism problem: If we assume for the moment that one of the two processes
is binary, then perfect coding of a process into a binary process and back into
the original process requires that the original process and the binary process
have the same entropy. Thus a natural conjecture is that two processes are isomorphic if and only if they have the same entropy. A major diﬃculty was the
fact that two diﬀerent kinds of coding were being considered: stationary sliding
block codes with zero error by the ergodic theorists and either ﬁxed length block
codes with small error or variable length (and hence nonstationary) block codes
with zero error by the Shannon theorists. While it was plausible that the former
codes might be developed as some sort of limit of the latter, this proved to be
an extremely diﬃcult problem. It was Kolmogorov [89], [90] who ﬁrst reasoned
along these lines and proved that in fact equal entropy (appropriately deﬁned)
was a necessary condition for isomorphism.
Kolmogorov’s seminal work initiated a new branch of ergodic theory devoted
to the study of entropy of dynamical systems and its application to the isomorphism problem. Most of the original work was done by Soviet mathematicians;
notable papers are those by Sinai [136] [137] (in ergodic theory entropy is also
known as the Kolmogorov-Sinai invariant), Pinsker [126], and Rohlin and Sinai
[128]. An actual construction of a perfectly noiseless sliding block code for a special case was provided by Meshalkin [105]. While much insight was gained into
the behavior of entropy and progress was made on several simpliﬁed versions of
the isomorphism problem, it was several years before Ornstein [115] proved a
result that has since come to be known as the Ornstein isomorphism theorem or
the Kolmogorov-Ornstein or Kolmogorov-Sinai-Ornstein isomorphism theorem.
Ornstein showed that if one focused on a class of random processes which
we shall call B processes, then two processes are indeed isomorphic if and only
if they have the same entropy. B processes are also called Bernoulli processes
in the ergodic theory literature, but this is potentially confusing because of the usage of “Bernoulli process” as a synonym for an independent identically
distributed (iid) process in information theory and random process theory. B processes have several equivalent definitions; perhaps the simplest is that they
are processes which can be obtained by encoding a memoryless process using a
sliding block code. This class remains the most general class known for which
the isomorphism conjecture holds. In the course of his proof, Ornstein developed
intricate connections between block coding and sliding block coding. He used
Shannon-like techniques on the block codes, then imbedded the block codes
into sliding block codes, and then used the stationary structure of the sliding
block codes to advantage in limiting arguments to obtain the required zero error
codes. Several other useful techniques and results were introduced in the proof:
notions of the distance between processes and relations between the goodness of
approximation and the diﬀerence of entropy. Ornstein expanded these results
into a book [117] and gave a tutorial discussion in the premier issue of the Annals
of Probability [116]. Several correspondence items by other ergodic theorists
discussing the paper accompanied the article.
The origins of this book lie in the tools developed by Ornstein for the proof
of the isomorphism theorem rather than with the result itself. During the early
1970’s I first became interested in ergodic theory because of joint work with Lee
D. Davisson on source coding theorems for stationary nonergodic processes. The
ergodic decomposition theorem discussed in Ornstein [116] provided a needed
missing link and led to an intense campaign on my part to learn the fundamentals of ergodic theory and perhaps ﬁnd other useful tools. This eﬀort was
greatly eased by Paul Shields’ book The Theory of Bernoulli Shifts [133] and by
discussions with Paul on topics in both ergodic theory and information theory.
This in turn led to a variety of other applications of ergodic theoretic techniques
and results to information theory, mostly in the area of source coding theory:
proving source coding theorems for sliding block codes and using process distance measures to prove universal source coding theorems and to provide new
characterizations of Shannon distortion-rate functions. The work was done with
Dave Neuhoﬀ, like me then an apprentice ergodic theorist, and Paul Shields.
With the departure of Dave and Paul from Stanford, my increasing interest led me to discussions with Don Ornstein on possible applications of his
techniques to channel coding problems. The interchange often consisted of my
describing a problem, his generation of possible avenues of solution, and then
my going oﬀ to work for a few weeks to understand his suggestions and work
them through.
One problem resisted our best efforts: how to synchronize block codes over
channels with memory, a prerequisite for constructing sliding block codes for
such channels. In 1975 I had the good fortune to meet and talk with Roland Dobrushin at the 1975 IEEE/USSR Workshop on Information Theory in Moscow.
He observed that some of his techniques for handling synchronization in memoryless channels should immediately generalize to our case and therefore should
provide the missing link. The key elements were all there, but it took seven
years for the paper by Ornstein, Dobrushin and me to evolve and appear [59].
Early in the course of the channel coding paper, I decided that having the solution to the sliding block channel coding result in sight was sufficient excuse
to write a book on the overlap of ergodic theory and information theory. The
intent was to develop the tools of ergodic theory of potential use to information
theory and to demonstrate their use by proving Shannon coding theorems for
the most general known information sources, channels, and code structures.
Progress on the book was disappointingly slow, however, for a number of reasons.
As delays mounted, I saw many of the general coding theorems extended and
improved by others (often by J. C. Kieﬀer) and new applications of ergodic
theory to information theory developed, such as the channel modeling work
of Neuhoﬀ and Shields [111], [114], [113], [112] and design methods for sliding
block codes for input-restricted noiseless channels by Adler, Coppersmith, and
Hassner [3] and Marcus [101]. Although I continued to work in some aspects of
the area, especially with nonstationary and nonergodic processes and processes
with standard alphabets, the area remained for me a relatively minor one and
I had little time to write. Work and writing came in bursts during sabbaticals
and occasional advanced topic seminars. I abandoned the idea of providing the
most general possible coding theorems and decided instead to settle for coding
theorems that were suﬃciently general to cover most applications and which
possessed proofs I liked and could understand.
Only one third of this book is actually devoted to Shannon source and channel coding theorems; the remainder can be viewed as a monograph on information and distortion measures and their properties, especially their ergodic
properties.
Because of delays in the original project, the book was split into two smaller
books and the ﬁrst, Probability, Random Processes, and Ergodic Properties,
was published by Springer-Verlag in 1988 [50] and is currently available online
at http://ee.stanford.edu/~gray/arp.html. It treats advanced probability
and random processes with an emphasis on processes with standard alphabets,
on nonergodic and nonstationary processes, and on necessary and suﬃcient
conditions for the convergence of long term sample averages. Asymptotically
mean stationary sources and the ergodic decomposition are there treated in
depth and recent simpliﬁed proofs of the ergodic theorem due to Ornstein and
Weiss [118] and others were incorporated. That book provides the background
material and introduction to this book, the split naturally falling before the
introduction of entropy. The ﬁrst chapter of this book reviews some of the basic
notation of the ﬁrst one in information theoretic terms, but results are often
simply quoted as needed from the ﬁrst book without any attempt to derive
them. The two books together are self-contained in that all supporting results
from probability theory and ergodic theory needed here may be found in the
first book. This book is self-contained as far as its information theory content,
but it should be considered an advanced text on the subject and not an
introductory treatise for the reader wishing only an intuitive overview.
Here the Shannon-McMillan-Breiman theorem is proved using the coding
approach of Ornstein and Weiss [118] (see also Shields' tutorial paper [134]),
and hence the treatments of ordinary ergodic theorems in the first book and the ergodic theorems for information measures in this book are consistent. The extension of the Shannon-McMillan-Breiman theorem to densities is proved using
the “sandwich” approach of Algoet and Cover [7], which depends strongly on
the usual pointwise or Birkhoﬀ ergodic theorem: sample entropy is asymptotically sandwiched between two functions whose limits can be determined from
the ergodic theorem. These results are the most general yet published in book
form and diﬀer from traditional developments in that martingale theory is not
required in the proofs.
A few words are in order regarding topics that are not contained in this
book. I have not included multiuser information theory for two reasons: First,
after including the material that I wanted most, there was no room left. Second,
my experience in the area is slight and I believe this topic can be better handled
by others. Results as general as the single user systems described here have not
yet been developed. Good surveys of the multiuser area may be found in El
Gamal and Cover [44], van der Meulen [144], and Berger [12].
Traditional noiseless coding theorems and actual codes such as the Huﬀman
codes are not considered in depth because quite good treatments exist in the
literature, e.g., [43], [1], [103]. The corresponding ergodic theory result, the Ornstein
isomorphism theorem, is also not proved, because its proof is difficult and the
result is not needed for the Shannon coding theorems. Many techniques used
in its proof, however, are used here for similar and other purposes.
The actual computation of channel capacity and distortion rate functions
has not been included because existing treatments [43], [17], [11], [52] are quite
adequate.
This book does not treat code design techniques. Algebraic coding is well
developed in existing texts on the subject [13], [150], [96], [18]. Allen Gersho
and I wrote a book on the theory and design of nonlinear coding techniques
such as vector quantizers and trellis codes for analogtodigital conversion and
for source coding (data compression) and combined source and channel coding
applications [47].
Universal codes, codes which work well for an unknown source, and variable
rate codes, codes producing a variable number of bits for each input vector, are
not considered. The interested reader is referred to [110], [97], [78], [79], [28] and
the references therein.
An active research area that has made good use of the ideas of relative entropy to characterize exponential growth is that of large deviations theory [145], [31].
These techniques have been used to provide new proofs of the basic source coding theorems [22]. These topics are not treated here.
Lastly, J. C. Kieﬀer developed a powerful new ergodic theorem that can be
used to prove both traditional ergodic theorems and the extended Shannon-McMillan-Breiman theorem [84]. He has used this theorem to prove new strong
(almost everywhere) versions of the source coding theorem and its converse, that
is, results showing that sample average distortion is with probability one no
smaller than the distortion-rate function and that there exist codes with sample average distortion arbitrarily close to the distortion-rate function [85], [83]. These results should have a profound impact on the future development of the
theoretical tools and results of information theory. Their imminent publication
provides a strong motivation for the completion of this monograph, which is
devoted to the traditional methods. Tradition has its place, however, and the
methods and results treated here should retain much of their role at the core of
the theory of entropy and information. It is hoped that this collection of topics
and methods will ﬁnd a niche in the literature.
19 November 2000 Revision The original edition went out of print in
2000. Hence I took the opportunity to ﬁx more typos which have been brought
to my attention (thanks in particular to Yariv Ephraim) and to prepare the book
for Web posting. This is done with the permission of the original publisher and
copyrightholder, SpringerVerlag. I hope someday to do some more serious
revising, but for the moment I am content to ﬁx the known errors and make the
manuscript available.
20 August 2008 Revision In the summer of 2008 numerous minor
tweaks and corrections were made to the manuscript while reviewing it and
considering a possible second edition.
16 July 2009 Revision Some typos corrected. This summer I will begin a
major revision for a Second Edition, to be published by Springer. The current
form will be ﬁxed as the ﬁnal version of the First Edition (but I will continue
to fix any typos found by me or readers).

Acknowledgments

The research in information theory that yielded many of the results and some
of the new proofs for old results in this book was supported by the National
Science Foundation. Portions of the research and much of the early writing were
supported by a fellowship from the John Simon Guggenheim Memorial Foundation. The book was originally written using the eqn and troff utilities on several
UNIX systems and was subsequently translated into LaTeX on both UNIX and
Apple Macintosh systems. All of these computer systems were supported by
the Industrial Affiliates Program of the Stanford University Information Systems Laboratory. Much helpful advice on the mysteries of LaTeX was provided
by Richard Roy and Marc Goldburg.
Recent research and writing on some of these topics has been aided by gifts
from Hewlett Packard, Inc.
The book benefited greatly from comments from numerous students and colleagues over many years, including Paul Shields, Paul Algoet, Ender Ayanoglu,
Lee Davisson, John Kieﬀer, Dave Neuhoﬀ, Don Ornstein, Bob Fontana, Jim
Dunham, Farivar Saadat, Michael Sabin, Andrew Barron, Phil Chou, Tom
Lookabaugh, Andrew Nobel, Bradley Dickinson, Ricardo Blasco Serrano, and
Christopher Ellison.

Robert M. Gray
Rockport, Massachusetts
July 2009

Chapter 1

Information Sources

1.1 Introduction

An information source or source is a mathematical model for a physical entity
that produces a succession of symbols called “outputs” in a random manner.
The symbols produced may be real numbers such as voltage measurements from
a transducer, binary numbers as in computer data, two dimensional intensity
ﬁelds as in a sequence of images, continuous or discontinuous waveforms, and
so on. The space containing all of the possible output symbols is called the
alphabet of the source and a source is essentially an assignment of a probability
measure to events consisting of sets of sequences of symbols from the alphabet.
It is useful, however, to explicitly treat the notion of time as a transformation
of sequences produced by the source. Thus in addition to the common random
process model we shall also consider modeling sources by dynamical systems as
considered in ergodic theory.
The material in this chapter is a distillation of [50] and is intended to establish notation.

1.2 Probability Spaces and Random Variables

A measurable space (Ω, B) is a pair consisting of a sample space Ω together with
a σ-field B of subsets of Ω (also called the event space). A σ-field or σ-algebra
B is a nonempty collection of subsets of Ω with the following properties:

Ω ∈ B. (1.1)

If F ∈ B, then F^c = {ω : ω ∉ F} ∈ B. (1.2)

If F_i ∈ B, i = 1, 2, . . ., then ∪_i F_i ∈ B. (1.3)
From de Morgan's “laws” of elementary set theory it follows that also

∩_{i=1}^∞ F_i = (∪_{i=1}^∞ F_i^c)^c ∈ B.
virtue of belonging to the event space) such that any countable sequence of set
theoretic operations (union, intersection, complementation) on events produces
other events. Note that there are two extremes: the largest possible σ ﬁeld of
Ω is the collection of all subsets of Ω (sometimes called the power set) and the
smallest possible σ ﬁeld is {Ω, ∅}, the entire space together with the null set
∅ = Ωc (called the trivial space).
If instead of the closure under countable unions required by (1.3), we only
require that the collection of subsets be closed under ﬁnite unions, then we say
that the collection of subsets is a ﬁeld.
While the concept of a ﬁeld is simpler to work with, a σ ﬁeld possesses the
additional important property that it contains all of the limits of sequences of
sets in the collection. That is, if F_n, n = 1, 2, · · · is an increasing sequence of
sets in a σ-field, that is, if F_{n−1} ⊂ F_n and if F = ∪_{n=1}^∞ F_n (in which case we
write F_n ↑ F or lim_{n→∞} F_n = F), then F is also contained in the σ-field. In
a similar fashion we can define decreasing sequences of sets: if F_n decreases to
F in the sense that F_{n+1} ⊂ F_n and F = ∩_{n=1}^∞ F_n, then we write F_n ↓ F. If
F_n ∈ B for all n, then F ∈ B.
A probability space (Ω, B, P) is a triple consisting of a sample space Ω, a σ-field B of subsets of Ω, and a probability measure P which assigns a real number
P(F) to every member F of the σ-field B so that the following conditions are
satisfied:

• Nonnegativity: P(F) ≥ 0, all F ∈ B; (1.4)

• Normalization: P(Ω) = 1; (1.5)

• Countable Additivity: If F_i ∈ B, i = 1, 2, · · · are disjoint, then
P(∪_{i=1}^∞ F_i) = Σ_{i=1}^∞ P(F_i). (1.6)

A set function P satisfying only (1.4) and (1.6) but not necessarily (1.5) is
called a measure and the triple (Ω, B , P ) is called a measure space. Since the
probability measure is deﬁned on a σ ﬁeld, such countable unions of subsets of
Ω in the σ ﬁeld are also events in the σ ﬁeld.
A standard result of basic probability theory is that if G_n ↓ ∅ (the empty or
null set), that is, if G_{n+1} ⊂ G_n for all n and ∩_{n=1}^∞ G_n = ∅, then we have

• Continuity at ∅: lim_{n→∞} P(G_n) = 0. (1.7)

Similarly it follows that we have

• Continuity from Below: If F_n ↑ F, then lim_{n→∞} P(F_n) = P(F), (1.8)

and

• Continuity from Above: If F_n ↓ F, then lim_{n→∞} P(F_n) = P(F). (1.9)
generate B and we write σ (G ) = B if B is the smallest σ ﬁeld that contains G ;
that is, if a σ ﬁeld contains all of the members of G , then it must also contain all
of the members of B . The following is a fundamental approximation theorem of
probability theory. A proof may be found in Corollary 1.5.3 of [50]. The result
is most easily stated in terms of the symmetric difference ∆ defined by

F ∆ G ≡ (F ∩ G^c) ∪ (F^c ∩ G).

Theorem 1.2.1 Given a probability space (Ω, B, P) and a generating field F,
that is, F is a field and B = σ(F), then given F ∈ B and ε > 0, there exists an
F_0 ∈ F such that P(F ∆ F_0) ≤ ε.
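The approximation in Theorem 1.2.1 is easy to visualize on a toy space. The sketch below (a hypothetical finite example, not from the text) computes the mass P(F ∆ F_0) of the disagreement between an event and an approximating set.

```python
# A toy illustration (hypothetical, not from the text) of the symmetric
# difference used in Theorem 1.2.1: P(F Δ F0) measures how well F0
# approximates F, where F Δ F0 = (F ∩ F0^c) ∪ (F^c ∩ F0).
from fractions import Fraction

omega = set(range(8))
P = {w: Fraction(1, 8) for w in omega}  # uniform probability measure

def prob(F):
    """P(F) by (here finite) additivity over singletons."""
    return sum(P[w] for w in F)

def sym_diff(F, F0):
    """F Δ F0 = (F minus F0) ∪ (F0 minus F)."""
    return (F - F0) | (F0 - F)

F = {0, 1, 2, 3}
F0 = {1, 2, 3, 4}  # a hypothetical approximating set from a generating field

print(prob(sym_diff(F, F0)))  # disagreement on {0, 4}: mass 2/8 = 1/4
```

The smaller P(F ∆ F_0), the better F_0 serves as a stand-in for F in probability computations, which is exactly the sense of approximation the theorem provides.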
Let (A, BA ) denote another measurable space. A random variable or measurable function deﬁned on (Ω, B ) and taking values in (A,BA ) is a mapping or
function f : Ω → A with the property that
if F ∈ BA , then f −1 (F ) = {ω : f (ω ) ∈ F } ∈ B . (1.10) The name “random variable” is commonly associated with the special case where
A is the real line and B the Borel ﬁeld, the smallest σ ﬁeld containing all the
intervals. Occasionally a more general sounding name such as “random object”
is used for a measurable function to implicitly include random variables (A the
real line), random vectors (A a Euclidean space), and random processes (A a
sequence or waveform space). We will use the terms “random variable” in the
more general sense.
A random variable is just a function or mapping with the property that
inverse images of “output events” determined by the random variable are events
in the original measurable space. This simple property ensures that the output
of the random variable will inherit its own probability measure. For example,
with the probability measure Pf deﬁned by
Pf(B) = P(f^{−1}(B)) = P(ω : f(ω) ∈ B); B ∈ BA,

(A, BA, Pf) becomes a probability space since measurability of f and elementary set theory ensure that Pf is indeed a probability measure. The induced
probability measure Pf is called the distribution of the random variable f . The
measurable space (A,BA ) or, simply, the sample space A, is called the alphabet
of the random variable f . We shall occasionally also use the notation P f −1
which is a mnemonic for the relation P f −1 (F ) = P (f −1 (F )) and which is less
awkward when f itself is a function with a complicated name, e.g., ΠI→M .
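The induced distribution is easy to mimic computationally. The following sketch (a hypothetical finite example, not from the text) forms the pushforward Pf(B) = P(f^{−1}(B)) of a uniform measure on a six-point sample space and checks that the alphabet carries total mass one.

```python
# A minimal sketch (hypothetical example, not from the text) of the
# pushforward construction: a random variable f carries the measure P on
# the sample space forward to its distribution Pf(B) = P(f^{-1}(B)).
from fractions import Fraction

omega = range(6)
P = {w: Fraction(1, 6) for w in omega}  # uniform measure on {0,...,5}

def f(w):
    """A random variable f: Omega -> A = {0, 1}, here the parity of w."""
    return w % 2

def pushforward(P, f, B):
    """Pf(B) = P({w : f(w) in B}), the induced distribution of f."""
    return sum(p for w, p in P.items() if f(w) in B)

print(pushforward(P, f, {0}))     # P(f is even) = 1/2
print(pushforward(P, f, {0, 1}))  # the whole alphabet carries mass 1
```

Nonnegativity and countable additivity of Pf are inherited directly from P, which is why (A, BA, Pf) is itself a probability space.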
If the alphabet A of a random variable f is not clear from context, then we
shall refer to f as an A-valued random variable. If f is a measurable function
from (Ω, B) to (A, BA), we will say that f is B/BA-measurable if the σ-fields
might not be clear from context.
Given a probability space (Ω, B , P ), a collection of subsets G is a subσ ﬁeld
if it is a σ ﬁeld and all its members are in B . A random variable f : Ω → A
is said to be measurable with respect to a subσ ﬁeld G if f −1 (H ) ∈ G for all
H ∈ BA .
Given a probability space (Ω, B, P) and a sub-σ-field G, for any event H ∈ B
the conditional probability m(H|G) is defined as any function, say g, which
satisfies the two properties

g is measurable with respect to G; (1.11)

∫_G g dP = m(G ∩ H), all G ∈ G. (1.12)

An important special case of conditional probability occurs when studying
the distributions of random variables deﬁned on an underlying probability space.
Suppose that X : Ω → AX and Y : Ω → AY are two random variables deﬁned
on (Ω, B , P ) with alphabets AX and AY and σ ﬁelds BAX and BAY , respectively.
Let PXY denote the induced distribution on (AX × AY, BAX × BAY), that is,
PXY(F × G) = P(X ∈ F, Y ∈ G) = P(X^{−1}(F) ∩ Y^{−1}(G)). Let σ(Y) denote
the sub-σ-field of B generated by Y, that is, Y^{−1}(BAY). Since the conditional
probability P(F|σ(Y)) is real-valued and measurable with respect to σ(Y), it
can be written as g(Y(ω)), ω ∈ Ω, for some function g(y). (See, for example,
Lemma 5.2.1 of [50].) Define P(F|y) = g(y). For a fixed F ∈ BAX define the
conditional distribution of F given Y = y by

PX|Y(F|y) = P(X^{−1}(F)|y); y ∈ AY.

From the properties of conditional probability,

PXY(F × G) = ∫_G PX|Y(F|y) dPY(y); F ∈ BAX, G ∈ BAY. (1.13)

It is tempting to think that for a fixed y, the set function defined by
PX|Y(F|y); F ∈ BAX, is actually a probability measure. This is not the case in
general. When it does hold for a conditional probability measure, the conditional probability measure is said to be regular. As will be emphasized later, this
text will focus on standard alphabets for which regular conditional probabilities
always exist.

1.3 Random Processes and Dynamical Systems

We now consider two mathematical models for a source: a random process
and a dynamical system. The ﬁrst is the familiar one in elementary courses, a
source is just a random process or sequence of random variables. The second
model is possibly less familiar; a random process can also be constructed from
an abstract dynamical system consisting of a probability space together with a
transformation on the space. The two models are connected by considering a
time shift to be a transformation.
A discrete time random process or for our purposes simply a random process
is a sequence of random variables {Xn }n∈T or {Xn ; n ∈ T }, where T is an
index set, deﬁned on a common probability space (Ω, B , P ). We deﬁne a source
as a random process, although we could also use the alternative deﬁnition of
a dynamical system to be introduced shortly. We usually assume that all of
the random variables share a common alphabet, say A. The two most common
index sets of interest are the set of all integers Z = {· · · , −2, −1, 0, 1, 2, · · · },
in which case the random process is referred to as a two-sided random process,
and the set of all nonnegative integers Z_+ = {0, 1, 2, · · · }, in which case the
random process is said to be one-sided. One-sided random processes will often
prove to be far more difficult in theory, but they provide better models for
physical random processes that must be “turned on” at some time or which
have transient behavior.
Observe that since the alphabet A is general, we could also model continuous
time random processes in the above fashion by letting A consist of a family of
waveforms deﬁned on an interval, e.g., the random variable Xn could in fact be
a continuous time waveform X (t) for t ∈ [nT, (n + 1)T ), where T is some ﬁxed
positive real number.
The above deﬁnition does not specify any structural properties of the index
set T . In particular, it does not exclude the possibility that T be a ﬁnite set, in
which case “random vector” would be a better name than “random process.” In
fact, the two cases of T = Z and T = Z+ will be the only important examples
for our purposes. Nonetheless, the general notation of T will be retained in
order to avoid having to state separate results for these two cases.
An abstract dynamical system consists of a probability space (Ω, B , P ) together with a measurable transformation T : Ω → Ω of Ω into itself. Measurability means that if F ∈ B , then also T −1 F = {ω : T ω ∈ F }∈ B . The quadruple
(Ω,B ,P ,T ) is called a dynamical system in ergodic theory. The interested reader
can ﬁnd excellent introductions to classical ergodic theory and dynamical system
theory in the books of Halmos [62] and Sinai [138]. More complete treatments
may be found in [15], [133], [125], [30], [149], [117], [42]. The term “dynamical
systems” comes from the focus of the theory on the long term “dynamics” or
“dynamical behavior” of repeated applications of the transformation T on the
underlying measure space.
An alternative to modeling a random process as a sequence or family of
random variables defined on a common probability space is to consider a single random variable together with a transformation defined on the underlying probability space. The outputs of the random process will then be values of the
random variable taken on transformed points in the original space. The transformation will usually be related to shifting in time and hence this viewpoint will
focus on the action of time itself. Suppose now that T is a measurable mapping
of points of the sample space Ω into itself. It is easy to see that the cascade or
composition of measurable functions is also measurable. Hence the transformation T^n, defined by T^2ω = T(Tω) and so on (T^nω = T(T^{n−1}ω)), is a measurable
function for all positive integers n. If f is an Avalued random variable deﬁned
on (Ω, B ), then the functions f T n : Ω → A deﬁned by f T n (ω ) = f (T n ω ) for
ω ∈ Ω will also be random variables for all n in Z+ . Thus a dynamical system
together with a random variable or measurable function f deﬁnes a onesided
random process {Xn }n∈Z+ by Xn (ω ) = f (T n ω ). If it should be true that T is
invertible, that is, T is onetoone and its inverse T −1 is measurable, then one
can deﬁne a twosided random process by Xn (ω ) = f (T n ω ), all n in Z .
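As a concrete (and entirely hypothetical, not from the text) instance of this construction, the sketch below takes T to be the irrational rotation Tω = (ω + α) mod 1 on [0, 1) and f a binary quantizer, so that X_n(ω) = f(T^n ω) defines a one-sided binary process.

```python
# A sketch (assumed example, not from the text) of building a one-sided
# process X_n(w) = f(T^n w) from a dynamical system: T is the rotation
# T(w) = (w + alpha) mod 1 on [0, 1) and f quantizes a point to a bit.
import math

ALPHA = math.sqrt(2) - 1  # an irrational rotation angle

def T(w):
    """The measurable transformation of the dynamical system."""
    return (w + ALPHA) % 1.0

def f(w):
    """A binary-valued random variable on the unit interval."""
    return 0 if w < 0.5 else 1

def process(w, n):
    """The first n outputs X_0, ..., X_{n-1}, where X_k(w) = f(T^k w)."""
    outputs = []
    for _ in range(n):
        outputs.append(f(w))
        w = T(w)
    return outputs

print(process(0.0, 5))  # → [0, 0, 1, 0, 1]
```

Here a single random variable f and repeated applications of T generate the entire output sequence, which is exactly the dynamical-system view of a source.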
The most common dynamical system for modeling random processes is that
consisting of a sequence space Ω containing all one or twosided Avalued sequences together with the shift transformation T , that is, the transformation
that maps a sequence {xn } into the sequence {xn+1 } wherein each coordinate
has been shifted to the left by one time unit. Thus, for example, let Ω = A^{Z_+}
= {all x = (x_0, x_1, · · · ) with x_i ∈ A for all i} and define T : Ω → Ω by
T(x_0, x_1, x_2, · · · ) = (x_1, x_2, x_3, · · · ). T is called the shift or left shift transformation on the one-sided sequence space. The shift for two-sided spaces is defined
similarly.
The diﬀerent models provide equivalent models for a given process: one
emphasizing the sequence of outputs and the other emphasizing the action of a
transformation on the underlying space in producing these outputs. In order to
demonstrate in what sense the models are equivalent for given random processes,
we next turn to the notion of the distribution of a random process.

1.4 Distributions

While in principle all probabilistic quantities associated with a random process
can be determined from the underlying probability space, it is often more convenient to deal with the induced probability measures or distributions on the
space of possible outputs of the random process. In particular, this allows us to
compare diﬀerent random processes without regard to the underlying probability spaces and thereby permits us to reasonably equate two random processes
if their outputs have the same probabilistic structure, even if the underlying
probability spaces are quite diﬀerent.
We have already seen that each random variable Xn of the random process
{Xn } inherits a distribution because it is measurable. To describe a process,
however, we need more than simply probability measures on output values of
separate single random variables; we require probability measures on collections
of random variables, that is, on sequences of outputs. In order to place probability measures on sequences of outputs of a random process, we first must construct the appropriate measurable spaces. A convenient technique for accomplishing this is to consider product spaces, spaces for sequences formed by
concatenating spaces for individual outputs.
Let T denote any ﬁnite or inﬁnite set of integers. In particular, T = Z (n) =
{0, 1, 2, · · · , n − 1}, T = Z , or T = Z+ . Deﬁne xT = {xi }i∈T . For example,
xZ = (· · · , x−1 , x0 , x1 , · · · ) is a twosided inﬁnite sequence. When T = Z (n) we
abbreviate xZ (n) to simply xn . Given alphabets Ai , i ∈ T , deﬁne the cartesian
product space
×_{i∈T} A_i = {all x^T : x_i ∈ A_i, all i in T}.

In most cases all of the A_i will be replicas of a single alphabet A and the above
product will be denoted simply by AT . Thus, for example, A{m,m+1,··· ,n} is
the space of all possible outputs of the process from time m to time n; AZ
is the sequence space of all possible outputs of a twosided process. We shall
abbreviate the notation for the space AZ (n) , the space of all n dimensional
vectors with coordinates in A, by An .
To obtain useful σ ﬁelds of the above product spaces, we introduce the idea of
a rectangle in a product space. A rectangle in AT taking values in the coordinate
σ ﬁelds Bi , i ∈ J , is deﬁned as any set of the form
B = {xT ∈ AT : xi ∈ Bi ; all i in J }, (1.14) where J is a ﬁnite subset of the index set T and Bi ∈ Bi for all i ∈ J .
(Hence rectangles are sometimes referred to as ﬁnite dimensional rectangles.) A
rectangle as in (1.14) can be written as a ﬁnite intersection of onedimensional
rectangles as
B = ∩_{i∈J} {x^T ∈ A^T : x_i ∈ B_i} = ∩_{i∈J} X_i^{−1}(B_i), (1.15)

where here we consider X_i as the coordinate functions X_i : A^T → A defined by
Xi (xT ) = xi .
As rectangles in AT are clearly fundamental events, they should be members
of any useful σ ﬁeld of subsets of AT . Deﬁne the product σ ﬁeld BA T as the
smallest σ ﬁeld containing all of the rectangles, that is, the collection of sets that
contains the clearly important class of rectangles and the minimum amount of
other stuﬀ required to make the collection a σ ﬁeld. To be more precise, given
an index set T of integers, let RECT (Bi , i ∈ T ) denote the set of all rectangles
in AT taking coordinate values in sets in Bi , i ∈ T . We then deﬁne the product
σ ﬁeld of AT by
B_A^T = σ(RECT(B_i, i ∈ T)). (1.16)
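The rectangle decomposition (1.15) can be checked mechanically on a toy finite product space. The sketch below (hypothetical alphabets, not from the text) compares the rectangle of (1.14) with the intersection of its coordinate preimages.

```python
# A toy check (hypothetical finite example, not from the text) that a
# rectangle B = {x : x_i in B_i, i in J} equals the intersection of the
# one-dimensional rectangles X_i^{-1}(B_i), with X_i(x) = x_i, as in (1.15).
from itertools import product

A = (0, 1)                      # common alphabet
AT = set(product(A, repeat=3))  # product space A^T with T = {0, 1, 2}

J = (0, 2)                      # finite subset of coordinates
Bsets = {0: {1}, 2: {0, 1}}     # the sets B_i for i in J

# The rectangle written directly from its definition (1.14).
rect = {x for x in AT if all(x[i] in Bsets[i] for i in J)}

# The same set as an intersection of coordinate preimages, as in (1.15).
preimages = [{x for x in AT if x[i] in Bsets[i]} for i in J]
intersection = set.intersection(*preimages)

print(rect == intersection)  # → True
print(len(rect))             # coordinate 1 is unconstrained: 1*2*2 = 4
```

Because only finitely many coordinates are constrained, rectangles remain simple objects even when the product space itself is an infinite sequence space.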
Consider an index set T and an Avalued random process {Xn }n∈T deﬁned
on an underlying probability space (Ω, B , P ). Given any index set J ⊂ T ,
measurability of the individual random variables Xn implies that of the random
vectors X J = {Xn ; n ∈ J }. Thus the measurable space (AJ , BA J ) inherits a
probability measure from the underlying space through the random variables X J . Thus in particular the measurable space (AT , BA T ) inherits a probability
measure from the underlying probability space and thereby determines a new
probability space (AT , BA T , PX T ), where the induced probability measure is
deﬁned by
P_{X^T}(F) = P((X^T)^{−1}(F)) = P(ω : X^T(ω) ∈ F); F ∈ B_A^T. (1.17)

Such probability measures induced on the outputs of random variables are referred to as distributions for the random variables, exactly as in the simpler case
first treated. When T = {m, m + 1, · · · , m + n − 1}, e.g., when we are treating
X_m^n = (X_m, · · · , X_{m+n−1}) taking values in A^n, the distribution is referred to
as an ndimensional or nth order distribution and it describes the behavior of
an ndimensional random variable. If T is the entire process index set, e.g., if
T = Z for a twosided process or T = Z+ for a onesided process, then the
induced probability measure is deﬁned to be the distribution of the process.
Thus, for example, a probability space (Ω, B , P ) together with a doubly inﬁnite sequence of random variables {Xn }n∈Z induces a new probability space
(AZ , BA Z , PX Z ) and PX Z is the distribution of the process. For simplicity, let
us now denote the process distribution simply by m. We shall call the probability space (AT , BA T , m) induced in this way by a random process {Xn }n∈Z
the output space or sequence space of the random process.
Since the sequence space (AT , BA T , m) of a random process {Xn }n∈Z is a
probability space, we can deﬁne random variables and hence also random processes on this space. One simple and useful such deﬁnition is that of a sampling
or coordinate or projection function deﬁned as follows: Given a product space
AT , deﬁne the sampling functions Πn : AT → A by
Πn (xT ) = xn , xT ∈ AT ; n ∈ T . (1.18) The sampling function is named Π since it is also a projection. Observe that the
distribution of the random process {Πn }n∈T deﬁned on the probability space
(AT , BA T , m) is exactly the same as the distribution of the random process
{Xn }n∈T deﬁned on the probability space (Ω, B , P ). In fact, so far they are the
same process since the {Πn } simply read oﬀ the values of the {Xn }.
What happens, however, if we no longer build the Πn on the Xn , that is, we
no longer ﬁrst select ω from Ω according to P , then form the sequence xT =
X T (ω ) = {Xn (ω )}n∈T , and then deﬁne Πn (xT ) = Xn (ω )? Instead we directly
choose an x in AT using the probability measure m and then view the sequence
of coordinate values. In other words, we are considering two completely separate
experiments, one described by the probability space (Ω, B , P ) and the random
variables {Xn } and the other described by the probability space (AT , BA T , m)
and the random variables {Πn }. In these two separate experiments, the actual
sequences selected may be completely diﬀerent. Yet intuitively the processes
should be the “same” in the sense that their statistical structures are identical,
that is, they have the same distribution. We make this intuition formal by
deﬁning two processes to be equivalent if their process distributions are identical,
that is, if the probability measures on the output sequence spaces are the same, 1.4. DISTRIBUTIONS 9 regardless of the functional form of the random variables of the underlying
probability spaces. In the same way, we consider two random variables to be
equivalent if their distributions are identical.
We have described above two equivalent processes or two equivalent models
for the same random process, one deﬁned as a sequence of random variables
on a perhaps very complicated underlying probability space, the other deﬁned
as a probability measure directly on the measurable space of possible output
sequences. The second model will be referred to as a directly given random
process or the Kolmogorov model for the random process.
Which model is “better” depends on the application. For example, a directly
given model for a random process may focus on the random process itself and not
its origin and hence may be simpler to deal with. If the random process is then
coded or measurements are taken on the random process, then it may be better
to model the encoded random process in terms of random variables deﬁned on
the original random process and not as a directly given random process. This
model will then focus on the input process and the coding operation. We shall
let convenience determine the most appropriate model.
We can now describe yet another model for the above random process, that
is, another means of describing a random process with the same distribution.
This time the model is in terms of a dynamical system. Given the probability
space (AT , BA T , m), deﬁne the (left) shift transformation T : AT → AT by
T (xT ) = T ({xn }n∈T ) = y T = {yn }n∈T ,
where
yn = xn+1 , n ∈ T .
Thus the nth coordinate of y T is simply the (n + 1)st coordinate of xT . (We
assume that the index set T is closed under addition and hence if n and 1 are in T , then so
is (n + 1).) If the alphabet of such a shift is not clear from context, we will
occasionally denote the shift by TA or TAT . The shift can easily be shown to
be measurable.
Consider next the dynamical system (AT , BA T , m, T ) and the random process formed by combining the dynamical system with the zero time sampling
function Π0 (we assume that 0 is a member of T ). If we deﬁne Yn (x) = Π0 (T n x)
for x = xT ∈ AT , or, in abbreviated form, Yn = Π0 T n , then the random process {Yn }n∈T is equivalent to the processes developed above. Thus we have
developed three diﬀerent, but equivalent, means of producing the same random
process. Each will be seen to have its uses.
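The relation among these models can be sketched in a few lines of Python. The truncated tuple standing in for a point of the sequence space, and the names `shift`, `sample0`, and `Y`, are our illustrative choices, not the book's notation.

```python
# Sketch of the directly given model: a point of the sequence space, the left
# shift T, and the zero-time sampling function Pi_0. A truncated tuple stands
# in for an infinite sequence, so only finitely many shifts are legal.

def shift(x):
    """(Left) shift transformation: T({x_n}) = {y_n} with y_n = x_{n+1}."""
    return x[1:]

def sample0(x):
    """Zero-time sampling (coordinate) function Pi_0."""
    return x[0]

def Y(n, x):
    """Y_n = Pi_0 T^n: shift n times, then sample coordinate zero."""
    for _ in range(n):
        x = shift(x)
    return sample0(x)

x = (3, 1, 4, 1, 5, 9)                 # a (truncated) point of the sequence space
samples = [Y(n, x) for n in range(4)]  # [3, 1, 4, 1]
```

Here Y(n, x) simply reads off the nth coordinate, which is exactly the sense in which the dynamical-system model reproduces the directly given process.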
The above development shows that a dynamical system is a more fundamental entity than a random process since we can always construct an equivalent
model for a random process in terms of a dynamical system: use the directly
given representation, the shift transformation, and the zero time sampling function.
The shift transformation on a sequence space introduced above is the most
important transformation that we shall encounter. It is not, however, the only
important transformation. When dealing with transformations we will usually
use the notation T to reflect the fact that it is often related to the action of a simple left shift of a sequence, yet it should be kept in mind that occasionally
other operators will be considered and the theory to be developed will remain
valid, even if T is not required to be a simple time shift. For example, we will
also consider block shifts.
Most texts on ergodic theory deal with the case of an invertible transformation, that is, where T is a one-to-one transformation and the inverse mapping
T −1 is measurable. This is the case for the shift on AZ , the two-sided shift. It is
not the case, however, for the one-sided shift defined on AZ+ and hence we will
avoid use of this assumption. We will, however, often point out in the discussion
what simpliﬁcations or special properties arise for invertible transformations.
Since random processes are considered equivalent if their distributions are
the same, we shall adopt the notation [A, m, X ] for a random process {Xn ; n ∈
T } with alphabet A and process distribution m, the index set T usually being
clear from context. We will occasionally abbreviate this to the more common
notation [A, m], but it is often convenient to note the name of the output random variables as there may be several, e.g., a random process may have an
input X and output Y . By “the associated probability space” of a random
process [A, m, X ] we shall mean the sequence probability space (AT , BA T , m).
It will often be convenient to consider the random process as a directly given
random process, that is, to view Xn as the coordinate functions Πn on the sequence space AT rather than as being deﬁned on some other abstract space.
This will not always be the case, however, as often processes will be formed by
coding or communicating other random processes. Context should render such
bookkeeping details clear.

1.5 Standard Alphabets

A measurable space (A, BA ) is a standard space if there exists a sequence of
ﬁnite ﬁelds Fn ; n = 1, 2, · · · with the following properties:
(1) Fn ⊂ Fn+1 (the ﬁelds are increasing).
(2) BA is the smallest σ-field containing all of the Fn (the Fn generate BA , or BA = σ(∪_{n=1}^{∞} Fn )).
(3) An event Gn ∈ Fn is called an atom of the field if it is nonempty and its only subsets which are also field members are itself and the empty set. If Gn ∈ Fn ; n = 1, 2, · · · are atoms and Gn+1 ⊂ Gn for all n, then
∩_{n=1}^{∞} Gn ≠ ∅.
Standard spaces are important for several reasons: First, they are a general class
of spaces for which two of the key results of probability hold: (1) the Kolmogorov
extension theorem showing that a random process is completely described by its
finite order distributions, and (2) the existence of regular conditional probability measures. Thus, in particular, the conditional probability measure P_{X|Y}(F |y)
of (1.13) is regular if the alphabets AX and AY are standard and hence for each
fixed y ∈ AY the set function P_{X|Y}(F |y); F ∈ B_{AX} is a probability measure.
In this case we can interpret P_{X|Y}(F |y) as P (X ∈ F |Y = y). Second, the
ergodic decomposition theorem of ergodic theory holds for such spaces. Third,
the class is suﬃciently general to include virtually all examples arising in applications, e.g., discrete spaces, the real line, Euclidean vector spaces, Polish
spaces (complete separable metric spaces), etc. The reader is referred to [50]
and the references cited therein for a detailed development of these properties
and examples of standard spaces.
Standard spaces are not the most general space for which the Kolmogorov
extension theorem, the existence of conditional probability, and the ergodic
decomposition theorem hold. These results also hold for perfect spaces which
include standard spaces as a special case. (See, e.g., [130], [141], [127], [99].) We
limit discussion to standard spaces, however, as they are easier to characterize
and work with and they are suﬃciently general to handle most cases encountered
in applications. Although standard spaces are not the most general for which the
required probability theory results hold, they are the most general for which all
ﬁnitely additive normalized measures extend to countably additive probability
measures, a property which greatly eases the proof of many of the desired results.
Throughout this book we shall assume that the alphabet A of the information
source is a standard space.

1.6 Expectation

Let (Ω, B , m) be a probability space, e.g., the probability space of a directly
given random process with alphabet A, (AT , BA T , m). A real-valued random
variable f : Ω → R will also be called a measurement since it is often formed
by taking a mapping or function of some other set of more general random
variables, e.g., the outputs of some random process which might not have real-valued outputs. Measurements made on such processes, however, will always be
assumed to be real.
Suppose next we have a measurement f whose range space or alphabet
f (Ω) ⊂ R of possible values is ﬁnite. Then f is called a discrete random
variable or discrete measurement or digital measurement or, in the common
mathematical terminology, a simple function.
Given a discrete measurement f , suppose that its range space is f (Ω) =
{bi , i = 1, · · · , N }, where the bi are distinct. Deﬁne the sets Fi = f −1 (bi ) =
{x : f (x) = bi }, i = 1, · · · , N . Since f is measurable, the Fi are all members
of B . Since the bi are distinct, the Fi are disjoint. Since every input point in
Ω must map into some bi , the union of the Fi equals Ω. Thus the collection
{Fi ; i = 1, 2, · · · , N } forms a partition of Ω. We have therefore shown that any discrete measurement f can be expressed in the form
f (x) = Σ_{i=1}^{M} bi 1Fi (x), (1.19)
where bi ∈ R, the Fi ∈ B form a partition of Ω, and 1Fi is the indicator function
of Fi , i = 1, · · · , M . Every simple function has a unique representation in this
form with distinct bi and {Fi } a partition.
The expectation or ensemble average or probabilistic average or mean of a
discrete measurement f : Ω → R as in (1.19) with respect to a probability
measure m is deﬁned by
Em f = Σ_{i=1}^{M} bi m(Fi ). (1.20)
An immediate consequence of the definition of expectation is the simple but
useful fact that for any event F in the original probability space,
Em 1F = m(F ),
that is, probabilities can be found from expectations of indicator functions.
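As a quick sketch, the definition (1.20) and the identity Em 1F = m(F ) can be checked directly; the four-point space, measure, and function names below are our toy choices.

```python
from fractions import Fraction

# Toy probability space {0,1,2,3} with measure m, and discrete measurements.
m = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}

def expectation(f, m):
    """E_m f = sum_i b_i m(F_i): group the points of the space by the value
    that f takes on them, as in (1.20)."""
    values = {f(w) for w in m}
    return sum(b * sum(mw for w, mw in m.items() if f(w) == b) for b in values)

f = lambda w: w % 2                                # a simple function
indicator_F = lambda w: 1 if w in (0, 2) else 0    # 1_F for F = {0, 2}

Ef = expectation(f, m)            # 1 * (m(1) + m(3)) = 3/8
mF = expectation(indicator_F, m)  # E_m 1_F = m(F) = 5/8
```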
Again let (Ω, B , m) be a probability space and f : Ω → R a measurement,
that is, a real-valued random variable or measurable real-valued function. Define
the sequence of quantizers qn : R → R, n = 1, 2, · · · , as follows:

qn (r) =
  n,               r ≥ n
  (k − 1)2^{−n},   (k − 1)2^{−n} ≤ r < k2^{−n}, k = 1, 2, · · · , n2^n
  −(k − 1)2^{−n},  −k2^{−n} ≤ r < −(k − 1)2^{−n}, k = 1, 2, · · · , n2^n
  −n,              r < −n.

We now define expectation for general measurements in two steps. If f ≥ 0,
then deﬁne
Em f = lim_{n→∞} Em (qn (f )). (1.21)
Since the qn are discrete measurements on f , the qn (f ) are discrete measurements on Ω (qn (f )(x) = qn (f (x)) is a simple function) and hence the individual
expectations are well deﬁned. Since the qn (f ) are nondecreasing, so are the
Em (qn (f )) and this sequence must either converge to a ﬁnite limit or grow
without bound, in which case we say it converges to ∞. In both cases the
expectation Em f is well deﬁned, although it may be inﬁnite.
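A minimal Python rendering of the quantizer sequence, assuming the case-by-case reconstruction displayed above; the toy measure m and measurement f are our own examples, used to exhibit the nondecreasing expectations Em (qn (f )).

```python
import math

def q(n, r):
    """The quantizer q_n as reconstructed above: clamp to [-n, n] and round
    the magnitude down to the grid of multiples of 2**-n (negative values are
    rounded toward zero, matching the half-open intervals in the display)."""
    if r >= n:
        return float(n)
    if r < -n:
        return float(-n)
    if r >= 0:
        return math.floor(r * 2 ** n) / 2 ** n
    return -(math.ceil(-r * 2 ** n) - 1) / 2 ** n

# Monotone approximation of an expectation on a toy four-point space:
m = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
f = {0: 0.3, 1: 1.7, 2: 2.25, 3: 0.9}   # a nonnegative measurement
approx = [sum(q(n, f[w]) * m[w] for w in m) for n in (1, 2, 3, 8)]
exact = sum(f[w] * m[w] for w in m)     # the E_m(q_n(f)) increase toward this
```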
If f is an arbitrary real random variable, deﬁne its positive and negative parts
f + (x) = max(f (x), 0) and f − (x) = − min(f (x), 0) so that f (x) = f + (x)−f − (x)
and set
Em f = Em f^+ − Em f^− (1.22)
provided this does not have the form +∞ − ∞, in which case the expectation
does not exist. It can be shown that the expectation can also be evaluated for nonnegative measurements by the formula
Em f = sup_{discrete g: g ≤ f} Em g.
The expectation is also called an integral and is denoted by any of the following:
Em f = ∫ f dm = ∫ f (x) dm(x) = ∫ f (x) m(dx).
The subscript m denoting the measure with respect to which the expectation is
taken will occasionally be omitted if it is clear from context.
A measurement f is said to be integrable or m-integrable if Em f exists and
is finite. A function is integrable if and only if its absolute value is integrable.
Define L1 (m) to be the space of all m-integrable functions. Given any m-integrable f and an event B , define
∫_B f dm = ∫ f (x) 1B (x) dm(x).
Two random variables f and g are said to be equal m-almost-everywhere
or equal m-a.e. or equal with m-probability one if m(f = g) = m({x : f (x) = g (x)}) = 1. The m is dropped if it is clear from context.
Given a probability space (Ω, B , m), suppose that G is a sub-σ-field of B ,
that is, it is a σ-field of subsets of Ω and all those subsets are in B (G ⊂ B ).
Let f : Ω → R be an integrable measurement. Then the conditional expectation
E(f |G ) is described as any function, say h(ω), that satisfies the following two
properties:
h(ω) is measurable with respect to G (1.23)
∫_G h dm = ∫_G f dm; all G ∈ G . (1.24)
If a regular conditional probability distribution given G exists, e.g., if the
space is standard, then one has a constructive definition of conditional expectation: E(f |G )(ω) is simply the expectation of f with respect to the conditional
probability measure m(·|G )(ω). Applying this to the example of two random
variables X and Y with standard alphabets described in Section 1.2 we have
from (1.24) that for integrable f : AX × AY → R
E(f ) = ∫ f (x, y) dP_{XY}(x, y) = ∫ ( ∫ f (x, y) dP_{X|Y}(x|y) ) dP_Y(y). (1.25)
In particular, for fixed y, f (x, y) is an integrable (and measurable) function of
x.
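On a finite space a sub-σ-field is generated by a partition, and the defining property (1.24) forces E(f |G ) to be the m-weighted average of f over each atom. A sketch, with a toy space, measure, measurement, and partition of our own choosing:

```python
from fractions import Fraction

# Toy space {0,1,2,3}, uniform measure, measurement f, and a two-atom partition.
m = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4)}
f = {0: 1, 1: 3, 2: 5, 3: 7}
atoms = [{0, 1}, {2, 3}]        # the partition generating the sub-sigma-field G

def cond_exp(f, m, atoms):
    """h = E(f|G): constant on each atom, equal to the atom's m-average of f."""
    h = {}
    for atom in atoms:
        mass = sum(m[w] for w in atom)
        avg = sum(f[w] * m[w] for w in atom) / mass
        for w in atom:
            h[w] = avg
    return h

h = cond_exp(f, m, atoms)
# Property (1.24): the integrals of h and f agree on every G-measurable set.
lhs = sum(h[w] * m[w] for w in atoms[0])
rhs = sum(f[w] * m[w] for w in atoms[0])
```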
Equation (1.25) provides a generalization of (1.13) from rectangles to arbitrary events. For an arbitrary F ∈ B_{AX ×AY} we have that
P_{XY}(F ) = ∫ ( ∫ 1F (x, y) dP_{X|Y}(x|y) ) dP_Y(y) = ∫ P_{X|Y}(F_y |y) dP_Y(y), (1.26)
where F_y = {x : (x, y) ∈ F } is called the section of F at y. If F is measurable,
then so is Fy for all y . Alternatively, since 1F (x, y ) is measurable with respect
to x for each fixed y, F_y ∈ B_{AX}. The inner integral is just
∫_{x:(x,y)∈F} dP_{X|Y}(x|y) = P_{X|Y}(F_y |y).

1.7 Asymptotic Mean Stationarity

A dynamical system (or the associated source) (Ω, B , P, T ) is said to be stationary if
P (T −1 G) = P (G)
for all G ∈ B . It is said to be asymptotically mean stationary or, simply, AMS
if the limit
P̄(G) = lim_{n→∞} (1/n) Σ_{k=0}^{n−1} P (T^{−k}G) (1.27)
exists for all G ∈ B . The following theorems summarize several important
properties of AMS sources. Details may be found in Chapter 6 of [50].
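Before stating them, the defining limit (1.27) can be checked numerically on a toy system: a rotation of four points, which is AMS (indeed periodic), carrying a nonstationary measure. The example and the function names are ours.

```python
from fractions import Fraction

# A four-point rotation T(w) = (w + 1) % 4 with a nonuniform measure P.
# The Cesaro averages of (1.27) converge to the uniform (stationary) measure.
P = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
T = lambda w: (w + 1) % 4

def preimage(G, k):
    """T^{-k} G for a subset G of {0, 1, 2, 3}."""
    for _ in range(k):
        G = {w for w in range(4) if T(w) in G}
    return G

def cesaro(G, n):
    """(1/n) sum_{k=0}^{n-1} P(T^{-k} G), the average whose limit is (1.27)."""
    return sum(sum(P[w] for w in preimage(G, k)) for k in range(n)) / n

# Over any whole number of periods the average equals the stationary mean:
Pbar_of_0 = cesaro({0}, 4)        # 1/4, although P({0}) = 1/2
```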
Theorem 1.7.1 If a dynamical system (Ω, B , P, T ) is AMS, then P̄ defined in
(1.27) is a probability measure and (Ω, B , P̄, T ) is stationary. (P̄ is called the
stationary mean of P .) If an event G is invariant in the sense that T^{−1}G = G,
then
P̄(G) = P (G).
If a random variable g is invariant in the sense that g(T x) = g(x) with P
probability 1, then
EP g = E_{P̄} g.
The stationary mean P̄ asymptotically dominates P in the sense that if P̄(G) = 0, then
lim sup_{n→∞} P (T^{−n}G) = 0.
Theorem 1.7.2 Given an AMS source {Xn } let σ(Xn , Xn+1 , · · · ) denote the
σ ﬁeld generated by the random variables Xn , · · · , that is, the smallest σ ﬁeld
with respect to which all these random variables are measurable. Deﬁne the tail
σ-field F∞ by
F∞ = ∩_{n=0}^{∞} σ(Xn , Xn+1 , · · · ).
If G ∈ F∞ and P̄(G) = 0, then also P (G) = 0.
The tail σ ﬁeld can be thought of as events that are determinable by looking
only at samples of the sequence in the arbitrarily distant future. The theorem
states that the stationary mean dominates the original measure on such tail
events in the sense that zero probability under the stationary mean implies zero
probability under the original source.

1.8 Ergodic Properties

Two of the basic results of ergodic theory that will be called upon extensively
are the pointwise or almost-everywhere ergodic theorem and the ergodic decomposition theorem. We quote these results along with some relevant notation for
reference. Detailed developments may be found in Chapters 6-8 of [50]. The
ergodic theorem states that AMS dynamical systems (and hence also sources)
have convergent sample averages, and it characterizes the limits.
Theorem 1.8.1 If a dynamical system (Ω, B , m, T ) is AMS with stationary
mean m̄ and if f ∈ L1 (m̄), then with probability one under m and m̄
lim_{n→∞} (1/n) Σ_{i=0}^{n−1} f T^i = E_{m̄}(f |I ),
where I is the sub-σ-field of invariant events, that is, events G for which T^{−1}G = G.
The basic idea of the ergodic decomposition is that any stationary source
which is not ergodic can be represented as a mixture of stationary ergodic components or subsources.
Theorem 1.8.2 Given the standard sequence space (Ω, B ) with shift T as previously, there exists a family of stationary ergodic measures {px ; x ∈ Ω}, called
the ergodic decomposition, with the following properties:
(a) pT x = px .
(b) For any stationary measure m,
m(G) = ∫ px (G) dm(x); all G ∈ B .
(c) For any g ∈ L1 (m),
∫ g dm = ∫ ( ∫ g dpx ) dm(x).
It is important to note that the same collection of stationary ergodic components
works for any stationary measure m. This is the strong form of the ergodic
decomposition.
The ﬁnal result of this section is a variation on the ergodic decomposition
that will be useful. To describe the result, we need to digress brieﬂy to introduce
a metric on spaces of probability measures. A thorough development can be
found in Chapter 8 of [50]. We have a standard sequence measurable space
(Ω, B ) and hence we can generate the σ-field B by a countable field F = {Fn ;
n = 1, 2, · · · }. Given such a countable generating field, a distributional distance
between two probability measures p and m on (Ω, B ) is defined by
d(p, m) = Σ_{n=1}^{∞} 2^{−n} |p(Fn ) − m(Fn )|.
Any choice of a countable generating field yields a distributional distance. Such
a distance or metric yields a measurable space of probability measures as follows:
Let Λ denote the space of all probability measures on the original measurable
space (Ω, B ). Let B (Λ) denote the σ ﬁeld of subsets of Λ generated by all
open spheres using the distributional distance, that is, all sets of the form {p :
d(p, m) < ε} for some m ∈ Λ and some ε > 0. We can now consider properties of
functions that carry sequences in our original space into probability measures.
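A sketch of the distributional distance for two i.i.d. Bernoulli process measures, truncating the sum and using one particular enumeration of thin cylinders as the generating field; the enumeration (and hence the resulting metric) is an arbitrary choice, as the text notes, and the truncation depth is ours.

```python
from itertools import product

def cylinder_prob(p, pattern):
    """Probability of the thin cylinder {x : (x_0, ..., x_{k-1}) = pattern}
    under an i.i.d. Bernoulli(p) process measure."""
    out = 1.0
    for bit in pattern:
        out *= p if bit == 1 else 1.0 - p
    return out

def distributional_distance(p, m, depth=12):
    """Truncated d(p, m) = sum_n 2^{-n} |p(F_n) - m(F_n)| over the first
    `depth` cylinders, enumerated by increasing length; the tail is dropped."""
    events, length = [], 1
    while len(events) < depth:
        events.extend(product((0, 1), repeat=length))
        length += 1
    events = events[:depth]
    return sum(2.0 ** -(n + 1) * abs(cylinder_prob(p, F) - cylinder_prob(m, F))
               for n, F in enumerate(events))
```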
The following is Theorem 8.5.1 of [50].
Theorem 1.8.3 Fix a standard measurable space (Ω, B ) and a transformation
T : Ω → Ω. Then there are a standard measurable space (Λ, L), a family of
stationary ergodic measures {mλ ; λ ∈ Λ} on (Ω, B ), and a measurable mapping
ψ : Ω → Λ such that
(a) ψ is invariant (ψ (T x) = ψ (x) all x);
(b) if m is a stationary measure on (Ω, B ) and Pψ is the induced distribution;
that is, Pψ (G) = m(ψ −1 (G)) for G ∈ Λ (which is well deﬁned from (a)),
then
m(F ) = ∫ dm(x) m_{ψ(x)}(F ) = ∫ dP_ψ(λ) m_λ(F ), all F ∈ B ,
and if f ∈ L1 (m), then f ∈ L1 (m_λ ) P_ψ -a.e. and
Em f = ∫ dm(x) E_{m_{ψ(x)}} f = ∫ dP_ψ(λ) E_{m_λ} f.
Finally, for any event F , m_ψ(F ) = m(F |ψ), that is, given the ergodic
decomposition and a stationary measure m, the ergodic component λ is a
version of the conditional probability under m given ψ = λ.
The following corollary to the ergodic decomposition is Lemma 8.6.2 of [50].
It states that the conditional probability of a future event given the entire past
is unchanged by knowing the ergodic component in eﬀect. This is because the
inﬁnite past determines the ergodic component in eﬀect.
Corollary 1.8.1 Suppose that {Xn } is a two-sided stationary process with distribution m and that {mλ ; λ ∈ Λ} is the ergodic decomposition and ψ the ergodic component function. Then the mapping ψ is measurable with respect to
σ(X−1 , X−2 , · · · ) and
m((X0 , X1 , · · · ) ∈ F |X−1 , X−2 , · · · ) = m_ψ((X0 , X1 , · · · ) ∈ F |X−1 , X−2 , · · · ), m-a.e.

Chapter 2

Entropy and Information

2.1 Introduction

The development of the idea of entropy of random variables and processes by
Claude Shannon provided the beginnings of information theory and of the modern age of ergodic theory. We shall see that entropy and related information
measures provide useful descriptions of the long term behavior of random processes and that this behavior is a key factor in developing the coding theorems
of information theory. We now introduce the various notions of entropy for random variables, vectors, processes, and dynamical systems and we develop many
of the fundamental properties of entropy.
In this chapter we emphasize the case of ﬁnite alphabet random processes
for simplicity, reﬂecting the historical development of the subject. Occasionally
we consider more general cases when it will ease later developments.

2.2 Entropy and Entropy Rate

There are several ways to introduce the notion of entropy and entropy rate. We
take some care at the beginning in order to avoid redeﬁning things later. We also
try to use deﬁnitions resembling the usual deﬁnitions of elementary information
theory where possible. Let (Ω, B , P, T ) be a dynamical system. Let f be a finite
alphabet measurement (a simple function) defined on Ω and define the one-sided random process fn = f T^n ; n = 0, 1, 2, . . .. This process can be viewed
as a coding of the original space, that is, one produces successive coded values
by transforming (e.g., shifting) the points of the space, each time producing
an output symbol using the same rule or mapping. In the usual way we can
construct an equivalent directly given or Kolmogorov model of this process. Let
A = {a1 , a2 , . . . , a_{‖A‖} } denote the finite alphabet of f and let (A^{Z+} , B_A^{Z+} ) be the
resulting one-sided sequence space, where BA is the power set. We abbreviate
the notation for this sequence space to (A^∞ , B_A^∞ ). Let TA denote the shift
on this space and let X denote the time zero sampling or coordinate function
and define Xn (x) = X(T_A^n x) = xn . Let m denote the process distribution
induced by the original space and the f T^n , i.e., m = P_{f̄} = P f̄^{−1} where
f̄(ω) = (f (ω), f (T ω), f (T^2 ω), . . .).
Observe that by construction, shifting the input point yields an output sequence that is also shifted, that is,
f̄(T ω) = T_A f̄(ω).
Sequence-valued measurements of this form are called stationary or invariant
codings (or time invariant or shift invariant codings in the case of the shift)
since the coding commutes with the transformations.
The entropy and entropy rates of a ﬁnite alphabet measurement depend
only on the process distributions and hence are usually more easily stated in
terms of the induced directly given model and the process distribution. For the
moment, however, we point out that the deﬁnition can be stated in terms of
either system. Later we will see that the entropy of the underlying system is
deﬁned as a supremum of the entropy rates of all ﬁnite alphabet codings of the
system.
The entropy of a discrete alphabet random variable f deﬁned on the probability space (Ω, B , P ) is deﬁned by
HP (f ) = − Σ_{a∈A} P (f = a) ln P (f = a). (2.1)
We define 0 ln 0 to be 0 in the above formula. We shall often use logarithms
to the base 2 instead of natural logarithms. The units for entropy are “nats”
when the natural logarithm is used and “bits” for base 2 logarithms. The
natural logarithms are usually more convenient for mathematics while the base 2
logarithms provide more intuitive descriptions. The subscript P can be omitted
if the measure is clear from context. Be forewarned that the measure will
often not be clear from context since more than one measure may be under
consideration and hence the subscripts will be required. A discrete alphabet
random variable f has a probability mass function (pmf), say pf , deﬁned by
pf (a) = P (f = a) = P ({ω : f (ω ) = a}) and hence we can also write
H(f ) = − Σ_{a∈A} pf (a) ln pf (a).
It is often convenient to consider the entropy not as a function of the particular outputs of f but as a function of the partition that f induces on Ω. In
particular, suppose that the alphabet of f is A = {a1 , a2 , . . . , a_{‖A‖} } and define
the partition Q = {Qi ; i = 1, 2, . . . , ‖A‖} by Qi = {ω : f (ω) = ai } = f^{−1}({ai }).
In other words, Q consists of disjoint sets which group the points in Ω together
according to what output the measurement f produces. We can consider the
entropy as a function of the partition and write
HP (Q) = − Σ_{i=1}^{‖A‖} P (Qi ) ln P (Qi ). (2.2)
Clearly different mappings with different alphabets can have the same entropy
if they induce the same partition. Both notations will be used according to the
desired emphasis. We have not yet deﬁned entropy for random variables that
do not have discrete alphabets; we shall do that later.
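The definition (2.1), with the 0 ln 0 = 0 convention and the nats/bits distinction, is a one-liner in code; the example pmf below is ours.

```python
import math

def entropy(pmf, base=math.e):
    """H(f) = -sum_a p(a) log p(a), with the 0 log 0 = 0 convention; natural
    logarithms (nats) by default, base=2 for bits, as in (2.1)."""
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # our example pmf
H_bits = entropy(p, base=2)            # 1.5 bits
H_nats = entropy(p)                    # the same quantity in nats (1.5 ln 2)
```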
Return to the notation emphasizing the mapping f rather than the partition.
Deﬁning the random variable P (f ) by P (f )(ω ) = P (λ : f (λ) = f (ω )) we can
also write the entropy as
HP (f ) = EP (− ln P (f )).
Using the equivalent directly given model we have immediately that
HP (f ) = HP (Q) = Hm (X0 ) = Em (− ln m(X0 )). (2.3) At this point one might ask why we are carrying the baggage of notations
for entropy in both the original space and in the sequence space. If we were
dealing with only one measurement f (or Xn ), we could conﬁne interest to the
simpler directly given form. More generally, however, we will be interested in
diﬀerent measurements or codings on a common system. In this case we will
require the notation using the original system. Hence for the moment we keep
both forms, but we shall often focus on the second where possible and the ﬁrst
only when necessary.
The nth order entropy of a discrete alphabet measurement f with respect to
T is deﬁned as
H_P^{(n)}(f ) = n^{−1} HP (f^n )
where f n = (f, f T, f T 2 , . . . , f T n−1 ) or, equivalently, we deﬁne the discrete
alphabet random process Xn (ω ) = f (T n ω ), then
f^n = X^n = (X0 , X1 , . . . , Xn−1 ).
As previously, this is given by
H_m^{(n)}(X) = n^{−1} Hm (X^n ) = n^{−1} Em (− ln m(X^n )).
This is also called the entropy (per-coordinate or per-sample) of the random
vector f n or X n . We can also use the partition notation here. The partition
corresponding to f n has a particular form: Suppose that we have two partitions,
Q = {Qi } and P = {Pi }. Define their join Q ∨ P as the partition containing
all nonempty intersection sets of the form Qi ∩ Pj . Define also T^{−1}Q as the
partition containing the atoms T^{−1}Qi . Then f^n induces the partition
∨_{i=0}^{n−1} T^{−i}Q
and we can write
H_P^{(n)}(f ) = H_P^{(n)}(Q) = n^{−1} HP (∨_{i=0}^{n−1} T^{−i}Q).
As before, which notation is preferable depends on whether we wish to emphasize
the mapping f or the partition Q.
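As an illustration of how the per-sample entropy n^{−1} H(X^n ) behaves as n grows, here is a brute-force computation for a binary Markov source; the initial and transition probabilities are our example, and exhaustive enumeration is only feasible for small n.

```python
import math
from itertools import product

# A binary Markov source (our example parameters).
init = {0: 0.5, 1: 0.5}
trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

def seq_prob(xs):
    """Probability of the length-n sample sequence xs under the chain."""
    pr = init[xs[0]]
    for a, b in zip(xs, xs[1:]):
        pr *= trans[a][b]
    return pr

def nth_order_entropy(n):
    """H^(n)(X) = n^{-1} H(X^n), in nats, by summing over all 2**n sequences."""
    total = 0.0
    for xs in product((0, 1), repeat=n):
        pr = seq_prob(xs)
        if pr > 0:
            total -= pr * math.log(pr)
    return total / n

rates = [nth_order_entropy(n) for n in (1, 2, 4, 8)]  # decreasing for this source
```

For this source the sequence starts at H(X0 ) = ln 2 and settles toward the per-step conditional entropy as the memory takes effect.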
The entropy rate or mean entropy of a discrete alphabet measurement f with
respect to the transformation T is deﬁned by
H̄P (f ) = lim sup_{n→∞} H_P^{(n)}(f ) = H̄P (Q) = lim sup_{n→∞} H_P^{(n)}(Q) = H̄m (X) = lim sup_{n→∞} H_m^{(n)}(X).
Given a dynamical system (Ω, B , P, T ), the entropy H(P, T ) of the system
(or of the measure with respect to the transformation) is deﬁned by
H(P, T ) = sup_f H̄P (f ) = sup_Q H̄P (Q),
where the supremum is over all finite alphabet measurements (or codings) or,
equivalently, over all ﬁnite measurable partitions of Ω. (We emphasize that
this means alphabets of size M for all ﬁnite values of M .) The entropy of a
system is also called the Kolmogorov-Sinai invariant of the system because of
the generalization by Kolmogorov [89] and Sinai [136] of Shannon’s entropy rate
concept to dynamical systems and the demonstration that equal entropy was a
necessary condition for two dynamical systems to be isomorphic.
Suppose that we have a dynamical system corresponding to a ﬁnite alphabet
random process {Xn }, then one possible ﬁnite alphabet measurement on the
process is f (x) = x0 , that is, the time 0 output. In this case clearly H̄P (f ) = H̄P (X) and hence, since the system entropy is defined as the supremum over
all simple measurements,
H(P, T ) ≥ H̄P (X). (2.4)
We shall later see that (2.4) holds with equality for ﬁnite alphabet random
processes and provides a generalization of entropy rate for processes that do not
have finite alphabets.

2.3 Basic Properties of Entropy

For simplicity we focus on the entropy rate of a directly given finite alphabet
random process {Xn }. We also will emphasize stationary measures, but we will
try to clarify those results that require stationarity and those that are more
general.
Let A be a finite set. Let Ω = A^{Z+} and let B be the σ-field of subsets of
Ω generated by the rectangles. Since A is ﬁnite, (A, BA ) is standard, where BA is
the power set of A. Thus (Ω, B ) is also standard by Lemma 2.4.1 of [50]. In fact,
from the proof that cartesian products of standard spaces are standard, we can
take as a basis for B the ﬁelds Fn generated by the ﬁnite dimensional rectangles
having the form {x : X^n(x) = x^n = a^n } for all a^n ∈ A^n and all positive integers
n. (Members of this class of rectangles are called thin cylinders.) The union of
all such ﬁelds, say F , is then a generating ﬁeld.
Many of the basic properties of entropy follow from the following simple
inequality.
Lemma 2.3.1 Given two probability mass functions {pi } and {qi }, that is, two
countable or ﬁnite sequences of nonnegative numbers that sum to one, then
Σ_i pi ln(pi /qi ) ≥ 0
with equality if and only if qi = pi , all i.
Proof: The lemma follows easily from the elementary inequality for real numbers
ln x ≤ x − 1 (2.5) (with equality if and only if x = 1) since
Σ_i pi ln(qi /pi ) ≤ Σ_i pi (qi /pi − 1) = Σ_i qi − Σ_i pi = 0
with equality if and only if qi /pi = 1, all i. Alternatively, the inequality follows
from Jensen’s inequality [63] since ln is a concave (convex ∩) function:
Σ_i pi ln(qi /pi ) ≤ ln( Σ_i pi (qi /pi ) ) = 0
with equality if and only if qi /pi = 1, all i. □
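The inequality of Lemma 2.3.1 is easy to spot-check numerically on random pmf pairs; the function names and the convention of returning infinity when absolute continuity fails are ours (that convention is introduced for relative entropy just below in the text).

```python
import math
import random

def divergence(p, q):
    """sum_i p_i ln(p_i / q_i) for two pmfs given as lists; returns infinity
    when p puts mass where q has none (our convention here)."""
    if any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_pmf(k, rng):
    """A random pmf on k symbols."""
    raw = [rng.random() for _ in range(k)]
    s = sum(raw)
    return [x / s for x in raw]

rng = random.Random(0)
checks = [divergence(random_pmf(5, rng), random_pmf(5, rng)) for _ in range(100)]
```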
The quantity used in the lemma is of such fundamental importance that we
pause to introduce another notion of information and to recast the inequality
in terms of it. As with entropy, the deﬁnition for the moment is only for ﬁnite
alphabet random variables. Also as with entropy, there are a variety of ways
to deﬁne it. Suppose that we have an underlying measurable space (Ω, B ) and
two measures on this space, say P and M , and we have a random variable f
with ﬁnite alphabet A deﬁned on the space and that Q is the induced partition
{f −1 (a); a ∈ A}. Let Pf and Mf be the induced distributions and let p and m be
the corresponding probability mass functions, e.g., p(a) = Pf ({a}) = P (f = a).
Deﬁne the relative entropy of a measurement f with measure P with respect to
the measure M by
H_{P‖M}(f ) = H_{P‖M}(Q) = Σ_{a∈A} p(a) ln( p(a)/m(a) ) = Σ_{i=1}^{‖A‖} P (Qi ) ln( P (Qi )/M (Qi ) ).
Observe that this only makes sense if p(a) is 0 whenever m(a) is, that is, if Pf is
absolutely continuous with respect to Mf , or Mf ≫ Pf . Define H_{P‖M}(f ) = ∞
if Pf is not absolutely continuous with respect to Mf . The measure M is referred to as the reference measure. Relative entropies will play an increasingly important role as general alphabets are considered. In the early chapters the
emphasis will be on ordinary entropy with similar properties for relative entropies following almost as an afterthought. When considering more abstract
(nonfinite) alphabets later on, relative entropies will prove indispensable.
Analogous to entropy, given a random process {Xn } described by two process
distributions p and m, if it is true that
m_{X^n} ≫ p_{X^n} ; n = 1, 2, . . . ,
then we can define for each n the nth order relative entropy n^{−1} H_{p‖m}(X^n ) and
the relative entropy rate
H̄_{p‖m}(X) ≡ lim sup_{n→∞} (1/n) H_{p‖m}(X^n ).
When dealing with relative entropies it is often the measures that are important and not the random variable or partition. We introduce a special notation
which emphasizes this fact. Given a probability space (Ω, B , P ), with Ω a ﬁnite
space, and another measure M on the same space, we deﬁne the divergence of
P with respect to M as the relative entropy of the identity mapping with respect
to the two measures:
D(P ‖M ) = Σ_{ω∈Ω} P (ω) ln( P (ω)/M (ω) ).
Thus, for example, given a finite alphabet measurement f on an arbitrary probability space (Ω, B , P ), if M is another measure on (Ω, B ) then
H_{P‖M}(f ) = D(Pf ‖Mf ). Similarly,
H_{p‖m}(X^n ) = D(P_{X^n} ‖M_{X^n} ),
where P_{X^n} and M_{X^n} are the distributions for X^n induced by process measures p
and m, respectively. The theory and properties of relative entropy are therefore
determined by those for divergence.
There are many names and notations for relative entropy and divergence
throughout the literature. The idea was introduced by Kullback for applications
of information theory to statistics (see, e.g., Kullback [93] and the references
therein) and was used to develop information theoretic results by Perez [121]
[123] [122], Dobrushin [32], and Pinsker [126]. Various names in common use for
this quantity are discrimination, discrimination information, KullbackLeibler
number, directed divergence, and cross entropy.
The lemma can be summarized simply in terms of divergence as in the
following theorem, which is commonly referred to as the divergence inequality.
Theorem 2.3.1 Given any two probability measures P and M on a common
ﬁnite alphabet probability space, then
D(P ‖M ) ≥ 0 (2.6)
with equality if and only if P = M .
In this form the result is known as the divergence inequality. The fact that
the divergence of one probability measure with respect to another is nonnegative
and zero only when the two measures are the same suggests the interpretation
of divergence as a “distance” between the two probability measures, that is, a
measure of how diﬀerent the two measures are. It is not a true distance or metric
in the usual sense since it is not a symmetric function of the two measures and
it does not satisfy the triangle inequality. The interpretation is, however, quite
useful for adding insight into results characterizing the behavior of divergence
and it will later be seen to have implications for ordinary distance measures
between probability measures.
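The failure of symmetry is easy to exhibit numerically; with the illustrative pmfs below (values assumed purely for the example) the two directed divergences differ:

```python
import math

def D(p, m):
    # divergence in nats between two strictly positive pmfs on the same set
    return sum(pw * math.log(pw / m[w]) for w, pw in p.items() if pw > 0)

P = {0: 0.5, 1: 0.5}
M = {0: 0.8, 1: 0.2}
print(D(P, M), D(M, P))  # the two directions give different positive numbers
```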
The divergence plays a basic role in the family of information measures: all of the information measures that we will encounter (entropy, relative entropy, mutual information, and the conditional forms of these information measures) can be expressed as a divergence.
There are three ways to view entropy as a special case of divergence. The
ﬁrst is to permit M to be a general measure instead of requiring it to be a
probability measure and have total mass 1. In this case entropy is minus the
divergence if M is the counting measure, i.e., assigns measure 1 to every point
in the discrete alphabet. If M is not a probability measure, then the divergence
inequality (2.6) need not hold. Second, if the alphabet of f is $A_f$ and has $|A_f|$ elements, then letting M be the uniform pmf assigning probability $1/|A_f|$ to all symbols in $A_f$ yields
$$D(P\|M) = \ln |A_f| - H_P(f) \ge 0$$
and hence the entropy is the log of the alphabet size minus the divergence with
respect to the uniform distribution. Third, we can also consider entropy a special
case of divergence while still requiring that M be a probability measure by using
product measures and a bit of a trick. Say we have two measures P and Q on
a common probability space (Ω, B ). Deﬁne two measures on the product space
(Ω×Ω, B (Ω×Ω)) as follows: Let P ×Q denote the usual product measure, that is,
the measure speciﬁed by its values on rectangles as P × Q(F × G) = P (F )Q(G).
Thus, for example, if P and Q are discrete distributions with pmf’s p and q ,
then the pmf for P × Q is just p(a)q(b). Let $\tilde P$ denote the "diagonal" measure defined by its values on rectangles as $\tilde P(F \times G) = P(F \cap G)$. In the discrete case $\tilde P$ has pmf $\tilde p(a, b) = p(a)$ if a = b and 0 otherwise. Then
$$H_P(f) = D(\tilde P\|P \times P).$$
Note that if we let X and Y be the coordinate random variables on our product space, then both $\tilde P$ and P × P give the same marginal probabilities to X and Y, that is, $P_X = P_Y = P$. $\tilde P$ is an extreme distribution on (X, Y) in the sense that with probability one X = Y; the two coordinates are deterministically dependent on one another. P × P, however, is the opposite extreme in that it makes the two random variables X and Y independent of one another. Thus the entropy of a distribution P can be viewed as the relative entropy between these two extreme joint distributions having marginals P.
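This third interpretation can be verified numerically for a small pmf (values are illustrative assumptions): the divergence of the diagonal measure from the product measure equals the entropy.

```python
import math

p = {"a": 0.25, "b": 0.75}

# entropy of p in nats
H = -sum(v * math.log(v) for v in p.values())

# diagonal measure and product measure on the two-fold product space
diag = {(x, y): (p[x] if x == y else 0.0) for x in p for y in p}
prod = {(x, y): p[x] * p[y] for x in p for y in p}

D = sum(v * math.log(v / prod[k]) for k, v in diag.items() if v > 0)
print(abs(D - H) < 1e-12)  # True: D(diag || product) recovers the entropy
```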
We now return to the general development for entropy. For the moment ﬁx
a probability measure m on a measurable space (Ω, B ) and let X and Y be two
ﬁnite alphabet random variables deﬁned on that space. Let AX and AY denote
the corresponding alphabets. Let PXY , PX , and PY denote the distributions of
(X, Y ), X , and Y , respectively.
First observe that since $P_X(a) \le 1$ for all a, $-\ln P_X(a)$ is nonnegative and hence
$$H(X) = -\sum_{a\in A_X} P_X(a) \ln P_X(a) \ge 0. \qquad (2.7)$$
From (2.6) with M uniform as in the second interpretation of entropy above, if X is a random variable with alphabet $A_X$, then
$$H(X) \le \ln |A_X|.$$
Since for any a ∈ AX and b ∈ AY we have that PX (a) ≥ PXY (a, b), it follows
that
$$H(X, Y) = -\sum_{a,b} P_{XY}(a, b) \ln P_{XY}(a, b) \ge -\sum_{a,b} P_{XY}(a, b) \ln P_X(a) = H(X).$$
Using Lemma 2.3.1 we have that since $P_{XY}$ and $P_X P_Y$ are probability mass functions,
$$H(X, Y) - (H(X) + H(Y)) = \sum_{a,b} P_{XY}(a, b) \ln \frac{P_X(a) P_Y(b)}{P_{XY}(a, b)} \le 0.$$
This proves the following result.
Lemma 2.3.2 Given two discrete alphabet random variables X and Y defined on a common probability space, we have
$$0 \le H(X) \qquad (2.8)$$
and
$$\max(H(X), H(Y)) \le H(X, Y) \le H(X) + H(Y), \qquad (2.9)$$
where the right hand inequality holds with equality if and only if X and Y are independent. If the alphabet of X has $|A_X|$ symbols, then
$$H(X) \le \ln |A_X|. \qquad (2.10)$$
There is another proof of the left hand inequality in (2.9) that uses an
inequality for relative entropy that will be useful later when considering codes.
The following lemma gives the inequality. First we introduce a deﬁnition. A
partition R is said to refine a partition Q if every atom in Q is a union of atoms of R, in which case we write Q < R.
Lemma 2.3.3 Suppose that P and M are two measures defined on a common measurable space (Ω, B) and that we are given finite partitions Q < R. Then
$$H_{P\|M}(Q) \le H_{P\|M}(R)$$
and
$$H_P(Q) \le H_P(R).$$
Comments: The lemma can also be stated in terms of random variables and
mappings in an intuitive way: Suppose that U is a random variable with ﬁnite
alphabet A and f : A → B is a mapping from A into another ﬁnite alphabet
B . Then the composite random variable f (U ) deﬁned by f (U )(ω ) = f (U (ω )) is
also a ﬁnite random variable. If U induces a partition R and f (U ) a partition
Q, then Q < R (since knowing the value of U implies the value of f (U )). Thus
the lemma immediately gives the following corollary.
Corollary 2.3.1 If M and P are two measures describing a random variable U with alphabet A and if f : A → B, then
$$H_{P\|M}(f(U)) \le H_{P\|M}(U)$$
and
$$H_P(f(U)) \le H_P(U).$$
Since $D(P_f\|M_f) = H_{P\|M}(f)$, we also have the following corollary, which we state for future reference.
Corollary 2.3.2 Suppose that P and M are two probability measures on a discrete space and that f is a random variable defined on that space. Then
$$D(P_f\|M_f) \le D(P\|M).$$
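Corollary 2.3.2 can be illustrated numerically: pushing two pmfs through a measurement f (here a symbol-merging map; all values and the map are illustrative assumptions) can only decrease the divergence between them.

```python
import math

def D(p, m):
    return sum(v * math.log(v / m[k]) for k, v in p.items() if v > 0)

P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
M = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

f = lambda x: x % 2  # a finite alphabet measurement merging symbols

def push(p):
    # distribution of f under p
    out = {}
    for k, v in p.items():
        out[f(k)] = out.get(f(k), 0.0) + v
    return out

print(D(push(P), push(M)) <= D(P, M) + 1e-12)  # True
```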
The lemma, discussion, and corollaries can all be interpreted as saying that
taking a measurement on a ﬁnite alphabet random variable lowers the entropy
and the relative entropy of that random variable. By choosing U as (X, Y ) and
f (X, Y ) = X or Y , the lemma yields the promised inequality of the previous
lemma.
Proof of Lemma: If $H_{P\|M}(R) = +\infty$, the result is immediate. If $H_{P\|M}(Q) = +\infty$, that is, if there exists at least one $Q_j$ such that $M(Q_j) = 0$ but $P(Q_j) > 0$, then there exists an $R_i \subset Q_j$ such that $M(R_i) = 0$ and $P(R_i) > 0$ and hence $H_{P\|M}(R) = +\infty$. Lastly assume that both $H_{P\|M}(R)$ and $H_{P\|M}(Q)$ are finite and consider the difference
$$H_{P\|M}(R) - H_{P\|M}(Q) = \sum_i P(R_i) \ln \frac{P(R_i)}{M(R_i)} - \sum_j P(Q_j) \ln \frac{P(Q_j)}{M(Q_j)}$$
$$= \sum_j \left[ \sum_{i: R_i \subset Q_j} P(R_i) \ln \frac{P(R_i)}{M(R_i)} - P(Q_j) \ln \frac{P(Q_j)}{M(Q_j)} \right].$$
We shall show that each of the bracketed terms is nonnegative, which will prove
the ﬁrst inequality. Fix j . If P (Qj ) is 0 we are done since then also P (Ri ) is 0
for all i in the inner sum since these Ri all belong to Qj . If P (Qj ) is not 0, we
can divide by it to rewrite the bracketed term as
$$P(Q_j) \sum_{i: R_i \subset Q_j} \frac{P(R_i)}{P(Q_j)} \ln \frac{P(R_i)/P(Q_j)}{M(R_i)/M(Q_j)},$$
where we also used the fact that $M(Q_j)$ cannot be 0, since then $P(Q_j)$ would also have to be zero. Since $R_i \subset Q_j$, $P(R_i)/P(Q_j) = P(R_i \cap Q_j)/P(Q_j) = P(R_i|Q_j)$ is an elementary conditional probability. Applying a similar argument to M and dividing by $P(Q_j)$, the above expression becomes
$$\sum_{i: R_i \subset Q_j} P(R_i|Q_j) \ln \frac{P(R_i|Q_j)}{M(R_i|Q_j)},$$
M (Ri Qj ) which is nonnegative from Lemma 2.3.1, which proves the ﬁrst inequality. The
second inequality follows similarly. Consider the diﬀerence
$$H_P(R) - H_P(Q) = \sum_j \left[ \sum_{i: R_i \subset Q_j} P(R_i) \ln \frac{P(Q_j)}{P(R_i)} \right] = \sum_j P(Q_j) \left[ -\sum_{i: R_i \subset Q_j} P(R_i|Q_j) \ln P(R_i|Q_j) \right]$$
and the result follows since the bracketed term is nonnegative, being an entropy for each value of j (Lemma 2.3.2). □
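The bounds of Lemma 2.3.2 are easy to spot-check; the joint pmf below is an arbitrary illustrative assumption.

```python
import math

def H(pmf):
    # entropy in nats of a pmf given as a dict of probabilities
    return -sum(v * math.log(v) for v in pmf.values() if v > 0)

def marginals(pxy):
    px, py = {}, {}
    for (x, y), v in pxy.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return px, py

# a dependent pair (illustrative joint pmf)
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px, py = marginals(pxy)

print(max(H(px), H(py)) <= H(pxy) + 1e-12)  # left half of (2.9): True
print(H(pxy) <= H(px) + H(py) + 1e-12)      # right half of (2.9): True

# for an independent pair the right-hand bound holds with equality
ind = {(x, y): px[x] * py[y] for x in px for y in py}
print(abs(H(ind) - (H(px) + H(py))) < 1e-12)  # True
```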
The next result provides useful inequalities for entropy considered as a function of the underlying distribution. In particular, it shows that entropy is a concave (convex ∩) function of the underlying distribution. Define the binary entropy function (the entropy of a binary random variable with probability mass function (λ, 1 − λ)) by
$$h_2(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda).$$
Lemma 2.3.4 Let m and p denote two distributions for a discrete alphabet random variable X. Then for any λ ∈ (0, 1)
$$\lambda H_m(X) + (1 - \lambda) H_p(X) \le H_{\lambda m + (1-\lambda)p}(X) \le \lambda H_m(X) + (1 - \lambda) H_p(X) + h_2(\lambda). \qquad (2.11)$$
Proof: We do a little extra here to save work in a later result. Define the
quantities
$$I = -\sum_x m(x) \ln(\lambda m(x) + (1 - \lambda)p(x))$$
and
$$J = H_{\lambda m + (1-\lambda)p}(X) = -\lambda \sum_x m(x) \ln(\lambda m(x) + (1 - \lambda)p(x)) - (1 - \lambda) \sum_x p(x) \ln(\lambda m(x) + (1 - \lambda)p(x)).$$
First observe that
$$\lambda m(x) + (1 - \lambda)p(x) \ge \lambda m(x)$$
and therefore, applying this bound (and its analog with $(1-\lambda)p(x)$ on the right) to the terms involving m and p respectively,
$$I \le -\ln \lambda - \sum_x m(x) \ln m(x) = -\ln \lambda + H_m(X),$$
$$J \le -\lambda \sum_x m(x) \ln m(x) - (1 - \lambda) \sum_x p(x) \ln p(x) + h_2(\lambda) = \lambda H_m(X) + (1 - \lambda) H_p(X) + h_2(\lambda). \qquad (2.12)$$
To obtain the lower bounds of the lemma observe that
$$I = -\sum_x m(x) \ln\left[ m(x)\left(\lambda + (1 - \lambda)\frac{p(x)}{m(x)}\right) \right] = -\sum_x m(x) \ln m(x) - \sum_x m(x) \ln\left(\lambda + (1 - \lambda)\frac{p(x)}{m(x)}\right).$$
Using (2.5), the rightmost term is bounded below by
$$-\sum_x m(x)\left(\lambda + (1 - \lambda)\frac{p(x)}{m(x)} - 1\right) = -\lambda - (1 - \lambda)\sum_x p(x) + 1 = 0.$$
Thus
$$I \ge -\sum_x m(x) \ln m(x) = H_m(X) \qquad (2.13)$$
and hence also
$$J \ge -\lambda \sum_x m(x) \ln m(x) - (1 - \lambda) \sum_x p(x) \ln p(x) = \lambda H_m(X) + (1 - \lambda) H_p(X). \qquad □$$
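A numerical spot-check of the sandwich bound (2.11), with two illustrative pmfs and an arbitrary λ:

```python
import math

def H(pmf):
    # entropy in nats of a pmf given as a list of probabilities
    return -sum(v * math.log(v) for v in pmf if v > 0)

def h2(lam):
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

m = [0.7, 0.2, 0.1]
p = [0.1, 0.1, 0.8]
lam = 0.3
mix = [lam * a + (1 - lam) * b for a, b in zip(m, p)]

lower = lam * H(m) + (1 - lam) * H(p)
print(lower <= H(mix) <= lower + h2(lam))  # True: both halves of (2.11)
```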
The next result presents an interesting connection between combinatorics
and binomial sums with a particular entropy. We require the familiar deﬁnition
of the binomial coeﬃcient:
$$\binom{n}{k} = \frac{n!}{k!(n-k)!}.$$
Lemma 2.3.5 Given δ ∈ (0, 1/2] and a positive integer M, we have
$$\sum_{i \le \delta M} \binom{M}{i} \le e^{M h_2(\delta)}. \qquad (2.14)$$
If 0 < δ ≤ p ≤ 1, then
$$\sum_{i \le \delta M} \binom{M}{i} p^i (1-p)^{M-i} \le e^{-M h_2(\delta\|p)}, \qquad (2.15)$$
where
$$h_2(\delta\|p) = \delta \ln \frac{\delta}{p} + (1 - \delta) \ln \frac{1-\delta}{1-p}.$$
Proof: We have after some simple algebra that
If δ < 1/2, then δ k (1 − δ )M −k increases as k decreases (since we are having more
large terms and fewer small terms in the product) and hence if i ≤ M δ ,
δ δM (1 − δ )(1−δ)M ≤ δ i (1 − δ )M −i .
Thus we have the inequalities
M 1 =
i=0 Mi
δ (1 − δ )M −i ≥
i ≥ e−h2 (δ)M
i≤δM i≤δM Mi
δ (1 − δ )M −i
i M
i which completes the proof of (2.14). In a similar fashion we have that
eM h2 (δ p) 1 − δ (1−δ)M
δ
= ( )δM (
)
.
p
1−p Since δ ≤ p, we have as in the ﬁrst argument that for i ≤ M δ
δ
1 − δ (1−δ)M
δ 1 − δ M −i
( )δM (
)
≤ ( )i (
)
p
1−p
p 1−p
and therefore after some algebra we have that if i ≤ M δ then
pi (1 − p)M −i ≤ δ i (1 − δ )M −i e−M h2 (δ p) and hence i≤δM Mi
p (1 − p)M −i ≤ e−M h2 (δ
i p)
i≤δM Mi
δ (1 − δ )M −i
i 2.3. BASIC PROPERTIES OF ENTROPY
M ≤ e−nh2 (δ 29 Mi
δ (1 − δ )M −i = e−M h2 (δ
i p)
i=0 p) , which proves (2.15).
2
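Both bounds of Lemma 2.3.5 can be spot-checked for a small case (the values M = 20, δ = 0.3, p = 0.5 and the helper name `h2_rel` are assumptions made for this sketch):

```python
import math
from math import comb

def h2(d):
    return -d * math.log(d) - (1 - d) * math.log(1 - d)

def h2_rel(d, p):
    # h2(delta || p), the binary relative entropy
    return d * math.log(d / p) + (1 - d) * math.log((1 - d) / (1 - p))

M, delta, p = 20, 0.3, 0.5
tail = range(int(delta * M) + 1)

count = sum(comb(M, i) for i in tail)
print(count <= math.exp(M * h2(delta)))          # (2.14): True

prob = sum(comb(M, i) * p**i * (1 - p)**(M - i) for i in tail)
print(prob <= math.exp(-M * h2_rel(delta, p)))   # (2.15): True
```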
The following is a technical but useful property of sample entropies. The
proof follows Billingsley [15].
Lemma 2.3.6 Given a finite alphabet process {X_n} (not necessarily stationary) with distribution m, let $X_k^n = (X_k, X_{k+1}, \ldots, X_{k+n-1})$ denote the random vectors giving a block of samples of dimension n starting at time k. Then the random variables $-n^{-1} \ln m(X_k^n)$ are m-uniformly integrable (uniformly in k and n).
Proof: For each nonnegative integer r define the sets
$$E_r(k, n) = \left\{x : -\frac{1}{n} \ln m(x_k^n) \in [r, r+1)\right\}$$
and hence if $x \in E_r(k, n)$ then
$$r \le -\frac{1}{n} \ln m(x_k^n) < r + 1$$
or
$$e^{-nr} \ge m(x_k^n) > e^{-n(r+1)}.$$
Thus for any r
$$\int_{E_r(k,n)} \left(-\frac{1}{n} \ln m(X_k^n)\right)\, dm < (r+1)\, m(E_r(k, n)) = (r+1) \sum_{x_k^n \in E_r(k,n)} m(x_k^n) \le (r+1) e^{-nr} |A|^n,$$
where the final step follows since there are at most $|A|^n$ possible n-tuples corresponding to thin cylinders in $E_r(k, n)$ and by construction each has probability at most $e^{-nr}$.
To prove uniform integrability we must show uniform convergence to 0 as r → ∞ of the integral
$$\gamma_r(k, n) = \int_{\{x :\, -\frac{1}{n}\ln m(x_k^n) \ge r\}} \left(-\frac{1}{n} \ln m(X_k^n)\right) dm = \sum_{i=0}^{\infty} \int_{E_{r+i}(k,n)} \left(-\frac{1}{n} \ln m(X_k^n)\right) dm$$
$$\le \sum_{i=0}^{\infty} (r + i + 1) e^{-n(r+i)} |A|^n = \sum_{i=0}^{\infty} (r + i + 1) e^{-n(r+i-\ln|A|)}.$$
Taking r large enough so that $r > \ln|A|$, the exponential term is bounded above by the special case n = 1 and we have the bound
$$\gamma_r(k, n) \le \sum_{i=0}^{\infty} (r + i + 1) e^{-(r+i-\ln|A|)},$$
a bound which is finite and independent of k and n. The sum can easily be shown to go to zero as r → ∞ using standard summation formulas. (The exponential terms shrink faster than the linear terms grow.) □

Variational Description of Divergence
Divergence has a variational characterization that is a fundamental property
for its applications to large deviations theory [145] [31]. Although this theory
will not be treated here, the basic result of this section provides an alternative
description of divergence and hence of relative entropy that has intrinsic interest.
The basic result is originally due to Donsker and Varadhan [34].
Suppose now that P and M are two probability measures on a common
discrete probability space, say (Ω, B). Given any real-valued random variable Φ defined on the probability space, we will be interested in the quantity
$$E_M e^{\Phi}, \qquad (2.16)$$
which is called the cumulant generating function of Φ with respect to M and
is related to the characteristic function of the random variable Φ as well as to
the moment generating function and the operational transform of the random
variable. The following theorem provides a variational description of divergence
in terms of the cumulant generating function.
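This quantity, and the variational description that follows, can be spot-checked numerically: for finite pmfs, any Φ makes $E_P\Phi - \ln E_M e^\Phi$ a lower bound on the divergence, and the choice Φ = ln(P/M) attains it. The pmfs and the suboptimal Φ below are illustrative assumptions:

```python
import math

P = {0: 0.3, 1: 0.7}
M = {0: 0.6, 1: 0.4}

def D(p, m):
    return sum(v * math.log(v / m[k]) for k, v in p.items() if v > 0)

def lower_bound(phi):
    # E_P(Phi) - ln E_M(e^Phi) for a real-valued Phi on the common alphabet
    ep = sum(P[w] * phi[w] for w in P)
    cgf = math.log(sum(M[w] * math.exp(phi[w]) for w in M))
    return ep - cgf

best = {w: math.log(P[w] / M[w]) for w in P}    # the maximizing choice
print(abs(lower_bound(best) - D(P, M)) < 1e-12)       # True: attains D(P||M)
print(lower_bound({0: 1.0, 1: -2.0}) <= D(P, M) + 1e-12)  # True: never exceeds it
```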
Theorem 2.3.2
$$D(P\|M) = \sup_{\Phi} \left( E_P \Phi - \ln(E_M(e^{\Phi})) \right). \qquad (2.17)$$
Proof: First consider the random variable Φ defined by
$$\Phi(\omega) = \ln(P(\omega)/M(\omega))$$
and observe that
$$E_P \Phi - \ln(E_M(e^{\Phi})) = \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M(\omega)} - \ln\left(\sum_{\omega} M(\omega) \frac{P(\omega)}{M(\omega)}\right) = D(P\|M) - \ln 1 = D(P\|M).$$
This proves that the supremum over all Φ is no smaller than the divergence.
To prove the other half observe that for any bounded random variable Φ,
$$E_P \Phi - \ln E_M(e^{\Phi}) = E_P \ln \frac{e^{\Phi}}{E_M(e^{\Phi})} = \sum_{\omega} P(\omega) \ln \frac{M^{\Phi}(\omega)}{M(\omega)},$$
where the probability measure $M^{\Phi}$ is defined by
$$M^{\Phi}(\omega) = \frac{M(\omega) e^{\Phi(\omega)}}{\sum_x M(x) e^{\Phi(x)}}.$$
We now have for any Φ that
$$D(P\|M) - \left( E_P \Phi - \ln(E_M(e^{\Phi})) \right) = \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M(\omega)} - \sum_{\omega} P(\omega) \ln \frac{M^{\Phi}(\omega)}{M(\omega)} = \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{M^{\Phi}(\omega)} \ge 0$$
using the divergence inequality. Since this is true for any Φ, it is also true for the supremum over Φ and the theorem is proved. □

2.4 Entropy Rate

Again let {X_n; n = 0, 1, . . .} denote a finite alphabet random process and apply Lemma 2.3.2 to vectors to obtain
$$H(X_0, X_1, \ldots, X_{n-1}) \le H(X_0, X_1, \ldots, X_{m-1}) + H(X_m, X_{m+1}, \ldots, X_{n-1}); \quad 0 < m < n. \qquad (2.18)$$
Define as usual the random vectors $X_k^n = (X_k, X_{k+1}, \ldots, X_{k+n-1})$; that is, $X_k^n$ is a vector of dimension n consisting of the samples of X from k to k + n − 1. If the underlying measure is stationary, then the distributions of the random vectors $X_k^n$ do not depend on k. Hence if we define the sequence $h(n) = H(X^n) = H(X_0, \ldots, X_{n-1})$, then the above equation becomes
$$h(k + n) \le h(k) + h(n); \quad \text{all } k, n > 0.$$
Thus h(n) is a subadditive sequence as treated in Section 7.5 of [50]. A basic
property of subadditive sequences is that the limit of h(n)/n as n → ∞ exists and
equals the inﬁmum of h(n)/n over n. (See, e.g., Lemma 7.5.1 of [50].) This
immediately yields the following result.
Lemma 2.4.1 If the distribution m of a ﬁnite alphabet random process {Xn }
is stationary, then
1
$$\bar{H}_m(X) \equiv \lim_{n \to \infty} \frac{1}{n} H_m(X^n) = \inf_{n \ge 1} \frac{1}{n} H_m(X^n).$$
The next two properties of entropy rate are primarily of interest because
they imply a third property, the ergodic decomposition of entropy rate, which
will be described in Theorem 2.4.1. They are also of some independent interest. 32 CHAPTER 2. ENTROPY AND INFORMATION The ﬁrst result is a continuity result for entropy rate when considered as a function or functional on the underlying process distribution. The second property
demonstrates that entropy rate is actually an aﬃne functional (both convex
and convex ) of the underlying distribution, even though ﬁnite order entropy
was only convex
and not aﬃne.
We apply the distributional distance described in Section 1.8 to the standard
Z
sequence measurable space (Ω, B ) = (AZ+ , BA+ ) with a σ ﬁeld generated by the
countable ﬁeld F = {Fn ; n = 1, 2, . . .} generated by all thin rectangles.
¯
Corollary 2.4.1 The entropy rate Hm (X ) of a discrete alphabet random process considered as a functional of stationary measures is upper semicontinuous;
that is, if probability measures m and mn , n = 1, 2, . . . have the property that
d(m, mn ) → 0 as n → ∞, then
¯
¯
Hm (X ) ≥ lim sup Hmn (X ).
n→∞ Proof: For each ﬁxed n
Hm (X n ) = − m(X n = an ) ln m(X n = an )
an ∈An is a continuous function of m since for the distance to go to zero, the probabilities
of all thin rectangles must go to zero and the entropy is the sum of continuous
realvalued functions of the probabilities of thin rectangles. Thus we have from
Lemma 2.4.1 that if d(mk , m) → 0, then
¯
Hm ( X ) 1
1
Hm (X n ) = inf
lim Hmk (X n )
n n k→∞
n
1
¯
≥ lim sup inf Hmk (X n ) = lim sup Hmk (X ).
nn
k→∞
k→∞
= inf
n 2
The next lemma uses Lemma 2.3.4 to show that entropy rates are aﬃne
functions of the underlying probability measures.
Lemma 2.4.2 Let m and p denote two distributions for a discrete alphabet random process {X_n}. Then for any λ ∈ (0, 1),
$$\lambda H_m(X^n) + (1 - \lambda) H_p(X^n) \le H_{\lambda m + (1-\lambda)p}(X^n) \le \lambda H_m(X^n) + (1 - \lambda) H_p(X^n) + h_2(\lambda), \qquad (2.19)$$
and
$$\limsup_{n \to \infty} \left( -\int dm(x)\, \frac{1}{n} \ln(\lambda m(X^n(x)) + (1 - \lambda)p(X^n(x))) \right) = \limsup_{n \to \infty} \left( -\int dm(x)\, \frac{1}{n} \ln m(X^n(x)) \right) = \bar{H}_m(X). \qquad (2.20)$$
If m and p are stationary then
$$\bar{H}_{\lambda m + (1-\lambda)p}(X) = \lambda \bar{H}_m(X) + (1 - \lambda) \bar{H}_p(X) \qquad (2.21)$$
and hence the entropy rate of a stationary discrete alphabet random process is an affine function of the process distribution.
Comment: Eq. (2.19) is simply Lemma 2.3.4 applied to the random vectors X n
stated in terms of the process distributions. Eq. (2.20) states that if we look
at the limit of the normalized log of a mixture of a pair of measures when one
of the measures governs the process, then the limit of the expectation does not
depend on the other measure at all and is simply the entropy rate of the driving
source. Thus in a sense the sequences produced by a measure are able to select
the true measure from a mixture.
Proof: Eq. (2.19) is just Lemma 2.3.4. Dividing by n and taking the limit as
n → ∞ proves that entropy rate is aﬃne. Similarly, take the limit supremum
in expressions (2.12) and (2.13) and the lemma is proved.
□
We are now prepared to prove one of the fundamental properties of entropy
rate, the fact that it has an ergodic decomposition formula similar to property
(c) of Theorem 1.8.2 when it is considered as a functional on the underlying
distribution. In other words, the entropy rate of a stationary source is given by
an integral of the entropy rates of the stationary ergodic components. This is a
far more complicated result than property (c) of the ordinary ergodic decomposition because the entropy rate depends on the distribution; it is not a simple
function of the underlying sequence. The result is due to Jacobs [68].
Theorem 2.4.1 The Ergodic Decomposition of Entropy Rate
Let $(A^{Z_+}, B(A)^{Z_+}, m, T)$ be a stationary dynamical system corresponding to a stationary finite alphabet source {X_n}. Let {p_x} denote the ergodic decomposition of m. If $\bar{H}_{p_x}(X)$ is m-integrable, then
$$\bar{H}_m(X) = \int dm(x)\, \bar{H}_{p_x}(X).$$
Proof: The theorem follows immediately from Corollary 2.4.1 and Lemma 2.4.2 and the ergodic decomposition of semicontinuous affine functionals as in Theorem 8.9.1 of [50]. □

Relative Entropy Rate
The properties of relative entropy rate are more diﬃcult to demonstrate. In
particular, the obvious analog to (2.18) does not hold for relative entropy rate
without the requirement that the reference measure be memoryless, and hence
one cannot immediately infer that the relative entropy rate is given by a limit
for stationary sources. The following lemma provides a condition under which
the relative entropy rate is given by a limit. The condition, that the dominating
measure be a k-th order (or k-step) Markov source, will occur repeatedly when dealing with relative entropy rates. A source is k-th order Markov or k-step Markov (or simply Markov if k is clear from context) if for any n and any N ≥ k
$$P(X_n = x_n | X_{n-1} = x_{n-1}, \ldots, X_{n-N} = x_{n-N}) = P(X_n = x_n | X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k});$$
that is, conditional probabilities given the infinite past depend only on the most recent k symbols. A 0-step Markov source is a memoryless source. A Markov source is said to have stationary transitions if the above conditional probabilities do not depend on n, that is, if for any n
$$P(X_n = x_n | X_{n-1} = x_{n-1}, \ldots, X_{n-N} = x_{n-N}) = P(X_k = x_n | X_{k-1} = x_{n-1}, \ldots, X_0 = x_{n-k}).$$
Lemma 2.4.3 If p is a stationary process and m is a k step Markov process
with stationary transitions, then
$$\bar{H}_{p\|m}(X) = \lim_{n \to \infty} \frac{1}{n} H_{p\|m}(X^n) = -\bar{H}_p(X) - E_p[\ln m(X_k | X^k)],$$
where $E_p[\ln m(X_k | X^k)]$ is an abbreviation for
$$E_p[\ln m(X_k | X^k)] = \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k | X^k}(x_k | x^k).$$
Proof: If for any n it is not true that $m_{X^n} \gg p_{X^n}$, then $H_{p\|m}(X^n) = \infty$ for
that and all larger n and both sides of the formula are inﬁnite, hence we assume
that all of the ﬁnite dimensional distributions satisfy the absolute continuity
relation. Since m is Markov,
$$m_{X^n}(x^n) = m_{X^k}(x^k) \prod_{l=k}^{n-1} m_{X_l | X^l}(x_l | x^l).$$
Thus
$$\frac{1}{n} H_{p\|m}(X^n) = -\frac{1}{n} H_p(X^n) - \frac{1}{n} \sum_{x^n} p_{X^n}(x^n) \ln m_{X^n}(x^n)$$
$$= -\frac{1}{n} H_p(X^n) - \frac{1}{n} \sum_{x^k} p_{X^k}(x^k) \ln m_{X^k}(x^k) - \frac{n-k}{n} \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k | X^k}(x_k | x^k).$$
Taking limits then yields
$$\bar{H}_{p\|m}(X) = -\bar{H}_p(X) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1}) \ln m_{X_k | X^k}(x_k | x^k),$$
where the sum is well defined because if $m_{X_k | X^k}(x_k | x^k) = 0$, then so must $p_{X^{k+1}}(x^{k+1}) = 0$ from absolute continuity. □
Combining the previous lemma with the ergodic decomposition of entropy
rate yields the following corollary.
Corollary 2.4.2 The Ergodic Decomposition of Relative Entropy Rate
Let $(A^{Z_+}, B(A)^{Z_+}, p, T)$ be a stationary dynamical system corresponding to a stationary finite alphabet source {X_n}. Let m be a k-th order Markov process for which $m_{X^n} \gg p_{X^n}$ for all n. Let {p_x} denote the ergodic decomposition of p. If $\bar{H}_{p_x\|m}(X)$ is p-integrable, then
$$\bar{H}_{p\|m}(X) = \int dp(x)\, \bar{H}_{p_x\|m}(X).$$

2.5 Conditional Entropy and Information

We now turn to other notions of information. While we could do without these
if we conﬁned interest to ﬁnite alphabet processes, they will be essential for
later generalizations and provide additional intuition and results even in the
ﬁnite alphabet case. We begin by adding a second ﬁnite alphabet measurement
to the setup of the previous sections. To conform more to information theory
tradition, we consider the measurements as ﬁnite alphabet random variables X
and Y rather than f and g . This has the advantage of releasing f and g for use
as functions deﬁned on the random variables: f (X ) and g (Y ). Let (Ω, B , P, T )
be a dynamical system. Let X and Y be ﬁnite alphabet measurements deﬁned
on Ω with alphabets AX and AY . Deﬁne the conditional entropy of X given Y
by
H (X Y ) ≡ H (X, Y ) − H (Y ).
The name conditional entropy comes from the fact that
H ( X Y ) =− P (X = a, Y = b) ln P (X = aY = b)
x,y =− pX,Y (x, y ) ln pX Y (xy ),
x,y where pX,Y (x, y ) is the joint pmf for (X, Y ) and pX Y (xy ) = pX,Y (x, y )/pY (y )
is the conditional pmf. Deﬁning
H (X Y = y ) = − pX Y (xy ) ln pX Y (xy )
x we can also write
H ( X Y ) = pY (y )H (X Y = y ).
y 36 CHAPTER 2. ENTROPY AND INFORMATION Thus conditional entropy is an average of entropies with respect to conditional
pmf’s. We have immediately from Lemma 2.3.2 and the deﬁnition of conditional
entropy that
0 ≤ H (X Y ) ≤ H (X ).
(2.22)
The inequalities could also be written in terms of the partitions induced by X
and Y . Recall that according to Lemma 2.3.2 the right hand inequality will be
an equality if and only if X and Y are independent.
Deﬁne the average mutual information between X and Y by
= H (X ) + H (Y ) − H (X, Y ) = I (X ; Y ) H (X ) − H (X Y ) = H (Y ) − H (Y X ). In terms of distributions and pmf’s we have that
I (X ; Y ) = P (X = x, Y = y ) ln
x,y = pX,Y (x, y ) ln
x,y =
x,y P (X = x, Y = y )
P (X = x)P (Y = y ) pX,Y (x, y )
=
pX (x)pY (y ) pX,Y (x, y ) ln
x,y pX Y (xy )
p X ( x) pY X (y x)
pX,Y (x, y ) ln
.
pY (y ) Note also that mutual information can be expressed as a divergence by
I (X ; Y ) = D(PX,Y PX × PY ),
where PX × PY is the product measure on X, Y , that is, a probability measure
which gives X and Y the same marginal distributions as PXY , but under which
X and Y are independent. Entropy is a special case of mutual information since
H (X ) = I (X ; X ).
We can collect several of the properties of entropy and relative entropy and
produce corresponding properties of mutual information. We state these in the
form using measurements, but they can equally well be expressed in terms of
partitions.
Lemma 2.5.1 Suppose that X and Y are two ﬁnite alphabet random variables
deﬁned on a common probability space. Then
0 ≤ I (X ; Y ) ≤ min(H (X ), H (Y )).
Suppose that f : AX → A and g : AY → B are two measurements. Then
I (f (X ); g (Y )) ≤ I (X ; Y ). 2.5. CONDITIONAL ENTROPY AND INFORMATION 37 Proof: The ﬁrst result follows immediately from the properties of entropy. The
second follows from Lemma 2.3.3 applied to the measurement (f, g ) since mutual
information is a special case of relative entropy.
2
The next lemma collects some additional, similar properties.
Lemma 2.5.2 Given the assumptions of the previous lemma,
H (f (X )X ) = H (X, f (X )) = H (X ), H (X ) 0, = H (f (X )) + H (X f (X ), I (X ; f (X )) = H (f (X )), H (X g (Y ))
I (f (X ); g (Y )) ≥ H (X Y ),
≤ I (X ; Y ), H (X Y ) = H (X, f (X, Y ))Y ), and, if Z is a third ﬁnite alphabet random variable deﬁned on the same probability space,
H (X Y ) ≥ H (X Y, Z ).
Comments: The ﬁrst relation has the interpretation that given a random variable, there is no additional information in a measurement made on the random
variable. The second and third relationships follow from the ﬁrst and the definitions. The third relation is a form of chain rule and it implies that given a
measurement on a random variable, the entropy of the random variable is given
by that of the measurement plus the conditional entropy of the random variable
given the measurement. This provides an alternative proof of the second result
of Lemma 2.3.3. The ﬁfth relation says that conditioning on a measurement of
a random variable is less informative than conditioning on the random variable
itself. The sixth relation states that coding reduces mutual information as well
as entropy. The seventh relation is a conditional extension of the second. The
eighth relation says that conditional entropy is nonincreasing when conditioning
on more information.
Proof: Since f(X) is a deterministic function of X, the conditional pmf is trivial (a Kronecker delta) and hence H(f(X)|X = x) is 0 for all x, so the first relation holds. The second and third relations follow from the first and the
deﬁnition of conditional entropy. The fourth relation follows from the ﬁrst since
I (X ; Y ) = H (Y ) − H (Y X ). The ﬁfth relation follows from the previous lemma
since
$$H(X) - H(X|g(Y)) = I(X; g(Y)) \le I(X; Y) = H(X) - H(X|Y).$$
The sixth relation follows from Corollary 2.3.2 and the fact that
$$I(X; Y) = D(P_{X,Y}\|P_X \times P_Y).$$
The seventh relation follows since
$$H((X, f(X, Y))|Y) = H((X, f(X, Y)), Y) - H(Y) = H(X, Y) - H(Y) = H(X|Y).$$
The final relation follows from the fifth by replacing Y by Y, Z and setting g(Y, Z) = Y. □
In a similar fashion we can consider conditional relative entropies. Suppose
now that M and P are two probability measures on a common space, that X
and Y are two random variables defined on that space, and that $M_{XY} \gg P_{XY}$ (and hence also $M_X \gg P_X$ and $M_Y \gg P_Y$). Analogous to the definition of the conditional entropy we can define
$$H_{P\|M}(X|Y) \equiv H_{P\|M}(X, Y) - H_{P\|M}(Y).$$
Some algebra shows that this is equivalent to
$$H_{P\|M}(X|Y) = \sum_{x,y} p_{X,Y}(x, y) \ln \frac{p_{X|Y}(x|y)}{m_{X|Y}(x|y)}. \qquad (2.23)$$
This can be written as
$$H_{P\|M}(X|Y) = \sum_y p_Y(y)\, D(p_{X|Y}(\cdot|y)\|m_{X|Y}(\cdot|y)),$$
an average of divergences of conditional pmf's, each of which is well defined
because of the original absolute continuity of the joint measure. Manipulations
similar to those for entropy can now be used to prove the following properties
of conditional relative entropies.
Lemma 2.5.3 Given two probability measures M and P on a common space,
and two random variables X and Y deﬁned on that space with the property that
$M_{XY} \gg P_{XY}$, then the following properties hold:
$$H_{P\|M}(f(X)|X) = 0,$$
$$H_{P\|M}(X, f(X)) = H_{P\|M}(X),$$
$$H_{P\|M}(X) = H_{P\|M}(f(X)) + H_{P\|M}(X|f(X)). \qquad (2.24)$$
If $M_{XY} = M_X \times M_Y$ (that is, if the pmf's satisfy $m_{X,Y}(x, y) = m_X(x) m_Y(y)$), then
$$H_{P\|M}(X, Y) \ge H_{P\|M}(X) + H_{P\|M}(Y)$$
and
$$H_{P\|M}(X|Y) \ge H_{P\|M}(X).$$
immediate proof of Lemma 2.3.3. The ﬁnal two inequalities resemble inequalities
for entropy (with a sign reversal), but they do not hold for all reference measures.
The above lemmas along with Lemma 2.3.3 show that all of the information measures thus far considered are reduced by taking measurements or by
coding. This property is the key to generalizing these quantities to nondiscrete
alphabets.
We saw in Lemma 2.3.4 that entropy is a concave (convex ∩) function of the underlying distribution. The following lemma provides similar properties of mutual
information considered as a function of either a marginal or a conditional distribution.
Lemma 2.5.4 Let µ denote a pmf on a discrete space $A_X$, µ(x) = Pr(X = x), and let q be a conditional pmf, q(y|x) = Pr(Y = y|X = x). Let µq denote the resulting joint pmf µq(x, y) = µ(x)q(y|x). Let $I_{\mu q} = I_{\mu q}(X; Y)$ be the average mutual information. Then $I_{\mu q}$ is a convex ∪ function of q; that is, given two conditional pmf's $q_1$ and $q_2$, a λ ∈ [0, 1], and $\bar{q} = \lambda q_1 + (1 - \lambda)q_2$, then
$$I_{\mu\bar{q}} \le \lambda I_{\mu q_1} + (1 - \lambda) I_{\mu q_2},$$
and $I_{\mu q}$ is a convex ∩ function of µ; that is, given two pmf's $\mu_1$ and $\mu_2$, λ ∈ [0, 1], and $\bar{\mu} = \lambda \mu_1 + (1 - \lambda)\mu_2$,
$$I_{\bar{\mu} q} \ge \lambda I_{\mu_1 q} + (1 - \lambda) I_{\mu_2 q}.$$
Proof: Let r (respectively $r_1$, $r_2$, $\bar{r}$) denote the pmf for Y resulting from q (respectively $q_1$, $q_2$, $\bar{q}$); that is, $r(y) = \Pr(Y = y) = \sum_x \mu(x) q(y|x)$. Then
$$I_{\mu\bar{q}} = \lambda \sum_{x,y} \mu(x) q_1(y|x) \ln\left[ \frac{q_1(y|x)}{r_1(y)} \cdot \frac{\bar{q}(y|x)\, r_1(y)}{q_1(y|x)\, \bar{r}(y)} \right] + (1 - \lambda) \sum_{x,y} \mu(x) q_2(y|x) \ln\left[ \frac{q_2(y|x)}{r_2(y)} \cdot \frac{\bar{q}(y|x)\, r_2(y)}{q_2(y|x)\, \bar{r}(y)} \right]$$
$$\le \lambda I_{\mu q_1} + \lambda \sum_{x,y} \mu(x) q_1(y|x) \left[ \frac{\bar{q}(y|x)\, r_1(y)}{q_1(y|x)\, \bar{r}(y)} - 1 \right] + (1 - \lambda) I_{\mu q_2} + (1 - \lambda) \sum_{x,y} \mu(x) q_2(y|x) \left[ \frac{\bar{q}(y|x)\, r_2(y)}{q_2(y|x)\, \bar{r}(y)} - 1 \right]$$
from (2.5). But
$$\sum_{x,y} \mu(x) q_1(y|x) \left[ \frac{\bar{q}(y|x)\, r_1(y)}{q_1(y|x)\, \bar{r}(y)} - 1 \right] = \sum_y \frac{r_1(y)}{\bar{r}(y)} \sum_x \mu(x)\bar{q}(y|x) - 1 = \sum_y r_1(y) - 1 = 0,$$
and similarly for the term involving $q_2$, so that
$$I_{\mu\bar{q}} \le \lambda I_{\mu q_1} + (1 - \lambda) I_{\mu q_2}.$$
Similarly, let $\bar{\mu} = \lambda \mu_1 + (1 - \lambda)\mu_2$ and let $r_1$, $r_2$, and $\bar{r}$ denote the induced output pmf's. Then
$$I_{\bar{\mu} q} = \lambda \sum_{x,y} \mu_1(x) q(y|x) \ln\left[ \frac{q(y|x)}{r_1(y)} \cdot \frac{r_1(y)}{\bar{r}(y)} \right] + (1 - \lambda) \sum_{x,y} \mu_2(x) q(y|x) \ln\left[ \frac{q(y|x)}{r_2(y)} \cdot \frac{r_2(y)}{\bar{r}(y)} \right]$$
$$= \lambda I_{\mu_1 q} + (1 - \lambda) I_{\mu_2 q} - \lambda \sum_{x,y} \mu_1(x) q(y|x) \ln \frac{\bar{r}(y)}{r_1(y)} - (1 - \lambda) \sum_{x,y} \mu_2(x) q(y|x) \ln \frac{\bar{r}(y)}{r_2(y)}$$
$$\ge \lambda I_{\mu_1 q} + (1 - \lambda) I_{\mu_2 q}$$
from another application of (2.5), since, for example, $\sum_{x,y} \mu_1(x) q(y|x) \ln(\bar{r}(y)/r_1(y)) = \sum_y r_1(y) \ln(\bar{r}(y)/r_1(y)) \le \sum_y r_1(y)(\bar{r}(y)/r_1(y) - 1) = 0$. □
We consider one other notion of information: Given three ﬁnite alphabet
random variables X, Y, Z , deﬁne the conditional mutual information between X
and Y given Z by
$$I(X; Y|Z) = D(P_{XYZ}\|P_{X \times Y|Z}) \qquad (2.25)$$
where $P_{X \times Y|Z}$ is the distribution defined by its values on rectangles as
$$P_{X \times Y|Z}(F \times G \times D) = \sum_{z \in D} P(X \in F|Z = z)\, P(Y \in G|Z = z)\, P(Z = z). \qquad (2.26)$$
$P_{X \times Y|Z}$ has the same conditional distributions for X given Z and for Y given Z as does $P_{XYZ}$, but now X and Y are conditionally independent given Z. Alternatively, the conditional distribution for X, Y given Z under the distribution $P_{X \times Y|Z}$ is the product distribution $P_{X|Z} \times P_{Y|Z}$. Thus
$$I(X; Y|Z) = \sum_{x,y,z} p_{XYZ}(x, y, z) \ln \frac{p_{XYZ}(x, y, z)}{p_{X|Z}(x|z)\, p_{Y|Z}(y|z)\, p_Z(z)} \qquad (2.27)$$
$$= \sum_{x,y,z} p_{XYZ}(x, y, z) \ln \frac{p_{XY|Z}(x, y|z)}{p_{X|Z}(x|z)\, p_{Y|Z}(y|z)}. \qquad (2.28)$$
Since
$$\frac{p_{XYZ}}{p_{X|Z}\, p_{Y|Z}\, p_Z} = \frac{p_{XYZ}}{p_X\, p_{YZ}} \times \frac{p_X}{p_{X|Z}} = \frac{p_{XYZ}}{p_{XZ}\, p_Y} \times \frac{p_Y}{p_{Y|Z}},$$
we have the first statement in the following lemma.
Lemma 2.5.5
$$I(X; Y|Z) + I(Y; Z) = I(Y; (X, Z)), \qquad (2.29)$$
$$I(X; Y|Z) \ge 0, \qquad (2.30)$$
with equality if and only if X and Y are conditionally independent given Z, that is, $p_{XY|Z} = p_{X|Z}\, p_{Y|Z}$. Given finite valued measurements f and g,
$$I(f(X); g(Y)|Z) \le I(X; Y|Z).$$
Proof: The second inequality follows from the divergence inequality (2.6) with $P = P_{XYZ}$ and $M = P_{X\times Y|Z}$, i.e., the pmf's $p_{XYZ}$ and $p_{X|Z}\,p_{Y|Z}\,p_Z$. The third inequality follows from Lemma 2.3.3 or its corollary applied to the same measures. □
Comments: Eq. (2.29) is called Kolmogorov's formula. If $X$ and $Y$ are conditionally independent given $Z$ in the above sense, then we also have that $p_{X|YZ} = p_{XYZ}/p_{YZ} = p_{X|Z}$, in which case we say that $Y\to Z\to X$ is a Markov chain and note that given $Z$, $X$ does not depend on $Y$. (Note that if $Y\to Z\to X$ is a Markov chain, then so is $X\to Z\to Y$.) Thus the conditional mutual information is 0 if and only if the variables form a Markov chain with the conditioning variable in the middle. One might be tempted to infer from Lemma 2.3.3 that given finite valued measurements $f$, $g$, and $r$,
$$
I(f(X);g(Y)|r(Z)) \overset{?}{\le} I(X;Y|Z).
$$
This does not follow, however, since it is not true that if $Q$ is the partition corresponding to the three quantizers, then $D(P_{f(X),g(Y),r(Z)}\|P_{f(X)\times g(Y)|r(Z)})$ is $H_{P_{XYZ}\|P_{X\times Y|Z}}(f(X),g(Y),r(Z))$, because of the way that $P_{X\times Y|Z}$ is constructed; e.g., the fact that $X$ and $Y$ are conditionally independent given $Z$ implies that $f(X)$ and $g(Y)$ are conditionally independent given $Z$, but it does not imply that $f(X)$ and $g(Y)$ are conditionally independent given $r(Z)$. Alternatively, if $M$ is $P_{X\times Y|Z}$, then it is not true that $P_{f(X)\times g(Y)|r(Z)}$ equals $M(fgr)^{-1}$. Note that if this inequality were true, choosing $r(z)$ to be trivial (say 1 for all $z$) would result in $I(X;Y|Z) \ge I(X;Y|r(Z)) = I(X;Y)$. This cannot be true in general since, for example, choosing $Z$ as $(X,Y)$ would give $I(X;Y|Z) = 0$. Thus one must be careful when applying Lemma 2.3.3 if the measures and random variables are related as they are in the case of conditional mutual information.
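As a quick numerical sketch of definitions (2.25)–(2.28) (a sketch only: the joint pmf below is an arbitrary illustration, and NumPy is assumed), the following computes $I(X;Y|Z)$ via (2.27), verifies Kolmogorov's formula (2.29), and checks that $I(X;Y|Z)=0$ when $Y\to Z\to X$ is a Markov chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint pmf p[x, y, z] over small finite alphabets (illustration only).
p = rng.random((2, 3, 2))
p /= p.sum()

def I(pxy):
    # Mutual information of a 2-D joint pmf, in nats.
    px = pxy.sum(1, keepdims=True)
    py = pxy.sum(0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

def I_cond(p):
    # I(X;Y|Z) via (2.27): sum p(x,y,z) ln[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ].
    pz = p.sum((0, 1))
    pxz = p.sum(1)          # p(x, z)
    pyz = p.sum(0)          # p(y, z)
    total = 0.0
    for x in range(p.shape[0]):
        for y in range(p.shape[1]):
            for z in range(p.shape[2]):
                if p[x, y, z] > 0:
                    total += p[x, y, z] * np.log(
                        p[x, y, z] * pz[z] / (pxz[x, z] * pyz[y, z]))
    return total

# Kolmogorov's formula (2.29): I(X;Y|Z) + I(Y;Z) = I(Y;(X,Z)).
iy_z = I(p.sum(0))                                 # I(Y;Z) from p(y,z)
py_xz = p.transpose(1, 0, 2).reshape(p.shape[1], -1)  # flatten (x,z) into one variable
assert abs(I_cond(p) + iy_z - I(py_xz)) < 1e-10

# If X and Y are conditionally independent given Z, then I(X;Y|Z) = 0.
pz, pxz, pyz = p.sum((0, 1)), p.sum(1), p.sum(0)
p_markov = np.einsum('xz,yz->xyz', pxz, pyz) / pz  # p(x|z) p(y|z) p(z)
assert abs(I_cond(p_markov)) < 1e-10
```

The same enumeration idea extends to checking the chain rules of the corollary below.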
We close this section with an easy corollary of the previous lemma and of the
deﬁnition of conditional entropy. Results of this type are referred to as chain
rules for information and entropy.
Corollary 2.5.1 Given finite alphabet random variables $Y, X_1, X_2, \ldots, X_n$,
$$
H(X_1,X_2,\dots,X_n) = \sum_{i=1}^n H(X_i|X_1,\dots,X_{i-1}),
$$
$$
H_{p\|m}(X_1,X_2,\dots,X_n) = \sum_{i=1}^n H_{p\|m}(X_i|X_1,\dots,X_{i-1}),
$$
$$
I(Y;(X_1,X_2,\dots,X_n)) = \sum_{i=1}^n I(Y;X_i|X_1,\dots,X_{i-1}).
$$
2.6 Entropy Rate Revisited

The chain rule of Corollary 2.5.1 provides a means of computing entropy rates
for stationary processes. We have that
$$
\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=0}^{n-1}H(X_i|X^i).
$$
First suppose that the source is a stationary $k$th order Markov process, that is, for any $n > k$,
$$
\Pr(X_n = x_n|X_i = x_i;\ i = 0,1,\dots,n-1) = \Pr(X_n = x_n|X_i = x_i;\ i = n-k,\dots,n-1).
$$
For such a process we have for all $n \ge k$ that
$$
H(X_n|X^n) = H(X_n|X_{n-k}^k) = H(X_k|X^k),
$$
where $X_i^m = X_i,\dots,X_{i+m-1}$. Thus, taking the limit as $n\to\infty$ of the $n$th order entropy, all but a finite number of terms in the sum are identical and hence the Cesàro (or arithmetic) mean converges to the conditional entropy $H(X_k|X^k)$. We have therefore proved the following lemma.

Lemma 2.6.1 If $\{X_n\}$ is a stationary $k$th order Markov source, then
$$
\bar H(X) = H(X_k|X^k).
$$
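A small numerical sketch of the lemma for $k = 1$ (the binary transition matrix below is an arbitrary illustration; NumPy assumed): the $n$th order entropies $\frac{1}{n}H(X^n)$ decrease monotonically toward the conditional entropy $H(X_1|X_0)$:

```python
import numpy as np

# Illustrative stationary first-order (k = 1) binary Markov source.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution pi (left eigenvector of P for eigenvalue 1).
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Conditional entropy H(X_1|X_0) = -sum_x pi(x) sum_y P(x,y) ln P(x,y), in nats.
h_rate = -sum(pi[a] * P[a, b] * np.log(P[a, b])
              for a in range(2) for b in range(2))

def nth_order_entropy(n):
    # (1/n) H(X^n) by enumerating all 2^n n-tuples (feasible only for tiny n).
    total = 0.0
    for idx in range(2 ** n):
        xs = [(idx >> i) & 1 for i in range(n)]
        prob = pi[xs[0]]
        for a, b in zip(xs, xs[1:]):
            prob *= P[a, b]
        total -= prob * np.log(prob)
    return total / n

rates = [nth_order_entropy(n) for n in (1, 4, 8, 12)]
# The n-th order entropies decrease toward the entropy rate H(X_1|X_0).
assert all(r1 >= r2 > 0 for r1, r2 in zip(rates, rates[1:]))
assert abs(rates[-1] - h_rate) < 0.02
```

Here $\frac{1}{n}H(X^n) = H(X_1|X_0) + \frac{1}{n}(H(X_0) - H(X_1|X_0))$, so the gap vanishes at rate $1/n$, exactly the Cesàro-mean argument above.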
If we have a two-sided stationary process $\{X_n\}$, then all of the previous definitions for entropies of vectors extend in an obvious fashion and a generalization of the Markov result follows if we use stationarity and the chain rule to write
$$
\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=0}^{n-1} H(X_0|X_{-1},\dots,X_{-i}).
$$
Since conditional entropy is nonincreasing with more conditioning variables ((2.22) or Lemma 2.5.2), $H(X_0|X_{-1},\dots,X_{-i})$ has a limit. Again using the fact that a Cesàro mean of terms all converging to a common limit also converges to the same limit, we have the following result.

Lemma 2.6.2 If $\{X_n\}$ is a two-sided stationary source, then
$$
\bar H(X) = \lim_{n\to\infty} H(X_0|X_{-1},\dots,X_{-n}).
$$
It is tempting to identify the above limit as the conditional entropy given
the infinite past, $H(X_0|X_{-1},\dots)$. Since the conditioning variable is a sequence and does not have a finite alphabet, such a conditional entropy is not included in any of the definitions yet introduced. We shall later demonstrate that this interpretation is indeed valid when the notion of conditional entropy has been suitably generalized.
The natural generalization of Lemma 2.6.2 to relative entropy rates unfortunately does not work because conditional relative entropies are not in general
monotonic with increased conditioning and hence the chain rule does not immediately yield a limiting argument analogous to that for entropy. The argument
does work if the reference measure is a k th order Markov, as considered in the
following lemma.
Lemma 2.6.3 If {Xn } is a source described by process distributions p and m
and if p is stationary and m is k th order Markov with stationary transitions,
then for $n \ge k$, $H_{p\|m}(X_0|X_{-1},\dots,X_{-n})$ is nondecreasing in $n$ and
$$
\bar H_{p\|m}(X) = \lim_{n\to\infty} H_{p\|m}(X_0|X_{-1},\dots,X_{-n}) = -\bar H_p(X) - E_p[\ln m(X_k|X^k)].
$$
Proof: For $n \ge k$ we have that
$$
H_{p\|m}(X_0|X_{-1},\dots,X_{-n}) = -H_p(X_0|X_{-1},\dots,X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1})\ln m_{X_k|X^k}(x_k|x^k).
$$
Since the conditional entropy is nonincreasing with $n$ and the remaining term
does not depend on n, the combination is nondecreasing with n. The remainder
of the proof then parallels the entropy rate result.
□
It is important to note that the relative entropy analogs to entropy properties often require $k$th order Markov assumptions on the reference measure (but not on the original measure).

Markov Approximations

Recall that the relative entropy rate $\bar H_{p\|m}(X)$ can be thought of as a distance between the process with distribution $p$ and that with distribution $m$, and that the rate is given by a limit if the reference measure $m$ is Markov. A particular Markov measure relevant to $p$ is the distribution $p^{(k)}$, which is the $k$th order Markov approximation to $p$ in the sense that it is a $k$th order Markov source and it has the same $k$th order transition probabilities as $p$. To be more precise, the process distribution $p^{(k)}$ is specified by its finite dimensional distributions
$$
p^{(k)}_{X^k}(x^k) = p_{X^k}(x^k),
$$
$$
p^{(k)}_{X^n}(x^n) = p_{X^k}(x^k)\prod_{l=k}^{n-1} p_{X_l|X^k_{l-k}}(x_l|x^k_{l-k});\quad n = k, k+1, \ldots,
$$
so that
$$
p^{(k)}_{X_k|X^k} = p_{X_k|X^k}.
$$
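The construction can be sketched numerically (the order-2 binary source below is an arbitrary illustration; NumPy assumed). The order-2 approximation of an order-2 source reproduces it exactly, while the order-1 approximation incurs a nonnegative divergence that grows with block length:

```python
import numpy as np
from itertools import product

# Illustrative binary source of Markov order 2:
# q[a, b, c] = Pr(X_n = c | X_{n-2} = a, X_{n-1} = b).
q = np.zeros((2, 2, 2))
q[0, 0] = [0.9, 0.1]
q[0, 1] = [0.5, 0.5]
q[1, 0] = [0.2, 0.8]
q[1, 1] = [0.6, 0.4]

# Stationary distribution of the pair process (X_{n-1}, X_n).
T = np.zeros((4, 4))                      # pair state (a, b) -> (b, c)
for a, b, c in product(range(2), repeat=3):
    T[2 * a + b, 2 * b + c] = q[a, b, c]
w, v = np.linalg.eig(T.T)
pi2 = np.real(v[:, np.argmax(np.real(w))])
pi2 = (pi2 / pi2.sum()).reshape(2, 2)     # pi2[a, b] = Pr(X_0 = a, X_1 = b)

def p_block(xs, order):
    """n-block probability under the k-th order approximation p^(k), k = order.
    order=2 reproduces the source itself; order=1 uses only pairwise statistics."""
    if order == 2:
        prob = pi2[xs[0], xs[1]]
        for l in range(2, len(xs)):
            prob *= q[xs[l - 2], xs[l - 1], xs[l]]
        return prob
    p1 = pi2.sum(1)                       # marginal Pr(X_0)
    cond = pi2 / p1[:, None]              # Pr(X_1 = b | X_0 = a)
    prob = p1[xs[0]]
    for l in range(1, len(xs)):
        prob *= cond[xs[l - 1], xs[l]]
    return prob

def divergence(n, order):
    # D(p_{X^n} || p^(k)_{X^n}) by enumeration over all 2^n blocks.
    d = 0.0
    for xs in product(range(2), repeat=n):
        p = p_block(xs, 2)
        if p > 0:
            d += p * np.log(p / p_block(xs, order))
    return d

assert divergence(8, 2) < 1e-12           # order-2 approximation is exact here
d1 = [divergence(n, 1) for n in (4, 6, 8, 10)]
assert all(d >= -1e-12 for d in d1)       # divergence is nonnegative
assert d1[-1] > d1[0]                     # and grows with block length
```

This matches the proof of Theorem 2.6.1 below: for this source $D(p_{X^n}\|p^{(1)}_{X^n})$ grows like $(n-2)\,[H(X_1|X_0) - H(X_2|X_0,X_1)]$, so the *rate* tends to a positive constant for $k=1$ but to zero for $k\ge 2$.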
It is natural to ask how good this approximation is, especially in the limit, that is, to study the behavior of the relative entropy rate $\bar H_{p\|p^{(k)}}(X)$ as $k\to\infty$.

Theorem 2.6.1 Given a stationary process $p$, let $p^{(k)}$ denote the $k$th order Markov approximations to $p$. Then
$$
\lim_{k\to\infty} \bar H_{p\|p^{(k)}}(X) = \inf_k \bar H_{p\|p^{(k)}}(X) = 0.
$$
Thus the Markov approximations are asymptotically accurate in the sense that
the relative entropy rate between the source and approximation can be made
arbitrarily small (zero if the original source itself happens to be Markov).
Proof: As in the proof of Lemma 2.6.3 we can write for n ≥ k that
Hp p(k) (X0 X−1 , . . . , X−n ) pX k+1 (xk+1 ) ln pXk X k (xk xk ) = −Hp (X0 X−1 , . . . , X−n ) −
xk+1 = Hp (X0 X−1 , . . . , X−k ) − Hp (X0 X−1 , . . . , X−n ).
(k ) Note that this implies that pX n
pX n for all n since the entropies are ﬁnite.
This automatic domination of the ﬁnite dimensional distributions of a measure
by those of its Markov approximation will not hold in the general case to be
encountered later, it is speciﬁc to the ﬁnite alphabet case. Taking the limit as
n → ∞ gives
¯
Hp p(k) (X ) = lim Hp n→∞ p(k) (X0 X−1 , . . . , X−n ) ¯
= Hp (X0 X−1 , . . . , X−k ) − Hp (X ).
The corollary then follows immediately from Lemma 2.6.2. 2 Markov approximations will play a fundamental role when considering relative entropies for general (nonﬁnite alphabet) processes. The basic result above
will generalize to that case, but the proof will be much more involved. 2.7 Relative Entropy Densities Many of the convergence results to come will be given and stated in terms
of relative entropy densities. In this section we present a simple but important
result describing the asymptotic behavior of relative entropy densities. Although
the result of this section is only for ﬁnite alphabet processes, it is stated and
proved in a manner that will extend naturally to more general processes later
on. The result will play a fundamental role in the basic ergodic theorems to
come.
Throughout this section we will assume that M and P are two process
distributions describing a random process {Xn }. Denote as before the sample
vector X n = (X0 , X1 , . . . , Xn−1 ), that is, the vector beginning at time 0 having
length n. The distributions on X n induced by M and P will be denoted by
Mn and Pn , respectively. The corresponding pmf’s are mX n and pX n . The 2.7. RELATIVE ENTROPY DENSITIES 45 key assumption in this section is that for all n if mX n (xn ) = 0, then also
pX n (xn ) = 0, that is,
Mn
Pn for all n.
(2.31)
If this is the case, we can deﬁne the relative entropy density
hn (x) ≡ ln
where pX n (xn )
= ln fn (x),
mX n (xn ) pX n (xn )
mX n (xn ) if mX n (xn ) = 0 0 fn (x) ≡ (2.32) otherwise (2.33) Observe that the relative entropy is found by integrating the relative entropy
density:
HP M (X n ) = pX n (xn ) ln D(Pn Mn ) =
xn pX n (X n )
ln
dP
mX n (X n ) = pX n (xn )
mX n (xn )
(2.34) Thus, for example, if we assume that
HP M (X n ) < ∞, all n, (2.35) then (2.31) holds.
The following lemma will prove to be useful when comparing the asymptotic
behavior of relative entropy densities for diﬀerent probability measures. It is the
ﬁrst almost everywhere result for relative entropy densities that we consider. It
is somewhat narrow in the sense that it only compares limiting densities to zero
and not to expectations. We shall later see that essentially the same argument
implies the same result for the general case (Theorem 5.4.1), only the interim
steps involving pmf’s need be dropped. Note that the lemma requires neither
stationarity nor asymptotic mean stationarity.
Lemma 2.7.1 Given a ﬁnite alphabet process {Xn } with process measures P, M
satisfying (2.31), Then n→∞ and
lim inf
n→∞ If in addition M 1
hn ≤ 0, M − a.e.
n (2.36) 1
hn ≥ 0, P − a.e..
n lim sup (2.37) P , then
lim n→∞ 1
hn = 0, P − a.e..
n (2.38) 46 CHAPTER 2. ENTROPY AND INFORMATION Proof: First consider the probability
1
EM (fn )
,
M ( hn ≥ ) = M (fn ≥ en ) ≤
n
en
where the ﬁnal inequality is Markov’s inequality. But
EM (fn ) = mX n (xn ) dM fn =
xn : mX n (xn )=0 pX n (xn )
mX n (xn ) pX n (xn ) ≤ 1 =
xn : mX n (xn )=0 and therefore 1
M ( hn ≥ ) ≤ 2−n
n and hence ∞ ∞ 1
M ( hn > ) ≤
e−n < ∞.
n
n=1
n=1
From the BorelCantelli Lemma (e.g., Lemma 4.6.3 of [50]) this implies that
M (n−1 hn ≥ i.o.) = 0 which implies the ﬁrst equation of the lemma.
Next consider
1
P (− hn > )
n pX n (xn ) =
1
xn :− n ln pX n (xn )/mX n (xn )> pX n (xn ) =
1
xn :− n ln pX n (xn )/mX n (xn )> and mX n (xn )=0 where the last statement follows since if mX n (xn ) = 0, then also pX n (xn ) = 0
and hence nothing would be contributed to the sum. In other words, terms
violating this condition add zero to the sum and hence adding this condition to
the sum does not change the sum’s value. Thus
1
P (− hn > )
n =
1
xn :− n ln pX n (xn )/mX n (xn )> fn <e−n mX n (xn )=0 dM e−n dM fn ≤ =
= and p X n ( xn )
mX n (xn )
mX n (xn ) fn <e−n e−n M (fn < e−n ) ≤ e−n . Thus as before we have that P (n−1 hn > ) ≤ e−n and hence that P (n−1 hn ≤
− i.o.) = 0 which proves the second claim. If also M
P , then the ﬁrst
equation of the lemma is also true P a.e., which when coupled with the second
equation proves the third.
□

Chapter 3

The Entropy Ergodic Theorem

3.1 Introduction

The goal of this chapter is to prove an ergodic theorem for sample entropy of
ﬁnite alphabet random processes. The result is sometimes called the ergodic
theorem of information theory or the asymptotic equipartition theorem, but it is
best known as the Shannon–McMillan–Breiman theorem. It provides a common
foundation to many of the results of both ergodic theory and information theory. Shannon [131] ﬁrst developed the result for convergence in probability for
stationary ergodic Markov sources. McMillan [104] proved L1 convergence for
stationary ergodic sources and Breiman [19] [20] proved almost everywhere convergence for stationary and ergodic sources. Billingsley [15] extended the result
to stationary nonergodic sources. Jacobs [67] [66] extended it to processes dominated by a stationary measure and hence to twosided AMS processes. Gray
and Kieﬀer [54] extended it to processes asymptotically dominated by a stationary measure and hence to all AMS processes. The generalizations to AMS
processes build on the Billingsley theorem for the stationary mean. Following generalizations of the deﬁnitions of entropy and information, corresponding
generalizations of the entropy ergodic theorem will be considered in Chapter 8.
Breiman’s and Billingsley’s approach requires the martingale convergence
theorem and embeds the possibly onesided stationary process into a twosided
process. Ornstein and Weiss [118] recently developed a proof for the stationary
and ergodic case that does not require any martingale theory and considers
only positive time and hence does not require any embedding into twosided
processes. The technique was described for both the ordinary ergodic theorem
and the entropy ergodic theorem by Shields [134]. In addition, it uses a form
of coding argument that is both more direct and more information theoretic in
ﬂavor than the traditional martingale proofs. We here follow the Ornstein and
Weiss approach for the stationary ergodic result. We also use some modiﬁcations
47 48 CHAPTER 3. THE ENTROPY ERGODIC THEOREM similar to those of Katznelson and Weiss for the proof of the ergodic theorem.
We then generalize the result ﬁrst to nonergodic processes using the “sandwich”
technique of Algoet and Cover [7] and then to AMS processes using a variation
on a result of [54].
We next state the theorem to serve as a guide through the various steps. We
also prove the result for the simple special case of a Markov source, for which
the result follows from the usual ergodic theorem.
We consider a directly given finite alphabet source $\{X_n\}$ described by a distribution $m$ on the sequence measurable space $(\Omega,\mathcal B)$. Define as previously $X_k^n = (X_k, X_{k+1},\cdots,X_{k+n-1})$. The subscript is omitted when it is zero. For any random variable $Y$ defined on the sequence space (such as $X_k^n$) we define the random variable $m(Y)$ by $m(Y)(x) = m(Y = Y(x))$.
Theorem 3.1.1 The Entropy Ergodic Theorem
Given a finite alphabet AMS source $\{X_n\}$ with process distribution $m$ and stationary mean $\bar m$, let $\{\bar m_x;\ x\in\Omega\}$ be the ergodic decomposition of the stationary mean $\bar m$. Then
$$
\lim_{n\to\infty} \frac{-\ln m(X^n)}{n} = h;\quad m\text{-a.e. and in } L^1(m), \qquad (3.1)
$$
where $h(x)$ is the invariant function defined by
$$
h(x) = \bar H_{\bar m_x}(X). \qquad (3.2)
$$
Furthermore,
$$
\lim_{n\to\infty} \frac{1}{n}H_m(X^n) = \bar H_m(X) = E_m h; \qquad (3.3)
$$
that is, the entropy rate of an AMS process is given by the limit, and
$$
\bar H_m(X) = \bar H_{\bar m}(X). \qquad (3.4)
$$
Comments: The theorem states that the sample entropy using the AMS
measure m converges to the entropy rate of the underlying ergodic component
of the stationary mean. Thus, for example, if m is itself stationary and ergodic, then the sample entropy converges to the entropy rate of the process
$m$-a.e. and in $L^1(m)$. The $L^1(m)$ convergence follows immediately from the almost everywhere convergence and the fact that sample entropy is uniformly integrable (Lemma 2.3.6). $L^1$ convergence in turn immediately implies the left-hand equality of (3.3). Since the limit exists, it is the entropy rate. The final equality states that the entropy rates of an AMS process and its stationary mean are the same. This result follows from (3.2)–(3.3) by the following argument: We have that $\bar H_m(X) = E_m h$ and $\bar H_{\bar m}(X) = E_{\bar m}h$, but $h$ is invariant and hence the two expectations are equal (see, e.g., Lemma 6.3.1 of [50]). Thus we need
only prove almost everywhere convergence in (3.1) to prove the theorem.
In this section we limit ourselves to the following special case of the theorem that can be proved using the ordinary ergodic theorem without any new techniques.

Lemma 3.1.1 Given a finite alphabet stationary $k$th order Markov source $\{X_n\}$, then there is an invariant function $h$ such that
$$
\lim_{n\to\infty} \frac{-\ln m(X^n)}{n} = h;\quad m\text{-a.e. and in } L^1(m),
$$
where $h$ is defined by
$$
h(x) = -E_{\bar m_x}\ln m(X_k|X^k), \qquad (3.5)
$$
where $\{\bar m_x\}$ is the ergodic decomposition of the stationary mean $\bar m$. Furthermore,
$$
h(x) = \bar H_{\bar m_x}(X) = H_{\bar m_x}(X_k|X^k). \qquad (3.6)
$$
− 1
1
ln m(X n ) = −
n
n n−1 ln m(Xi X i ).
i=0 Since the process is k th order Markov with stationary transition probabilites,
for i > k we have that
m(Xi X i ) = m(Xi Xi−k , · · · , Xi−1 ) = m(Xk X k )T i−k .
The terms − ln m(Xi X i ), i = 0, 1, · · · , k − 1 have ﬁnite expectation and hence
are ﬁnite ma.e. so that the ergodic theorem can be applied to deduce
− ln m(X n )(x)
n = − 1
n
1
n k −1 ln m(Xi X i )(x) −
i=0
k −1 ln m(Xi X i )(x) − = − → 1
n
1
n n−1 ln m(Xk X k )(T i−k x)
i=k
n−k −1 ln m(Xk X k )(T i x) Emx (− ln m(Xk X k )),
¯ n→∞ i=0 i=0 proving the ﬁrst statement of the lemma. It follows from the ergodic decomposition of Markov sources (see Lemma 8.6.3) of [50]) that with probability 1,
mx (Xk X k ) = m(Xk ψ (x), X k ) = m(Xk X k ), where ψ is the ergodic component
¯
function. This completes the proof.
2
We prove the theorem in three steps: The ﬁrst step considers stationary
and ergodic sources and uses the approach of Ornstein and Weiss [118] (see also
Shields [134]). The second step removes the requirement for ergodicity. This
result will later be seen to provide an information theoretic interpretation of
the ergodic decomposition. The third step extends the result to AMS processes
by showing that such processes inherit limiting sample entropies from their
stationary mean. The later extension of these results to more general relative
entropy and information densities will closely parallel the proofs of the second
and third steps for the ﬁnite case. 50 3.2 CHAPTER 3. THE ENTROPY ERGODIC THEOREM Stationary Ergodic Sources This section is devoted to proving the entropy ergodic theorem for the special
case of stationary ergodic sources. The result was originally proved by Breiman
[19]. The original proof ﬁrst used the martingale convergence theorem to infer
the convergence of conditional probabilities of the form m(X0 X−1 , X−2 , · · · , X−k )
to m(X0 X−1 , X−2 , · · · ). This result was combined with an an extended form of
the ergodic theorem stating that if gk → g as k → ∞ and if gk is L1 dominated
n−1
n−1
(supk gk  is in L1 ), then 1/n k=0 gk T k has the same limit as 1/n k=0 gT k .
Combining these facts yields that that
1
1
ln m(X n ) =
n
n n−1 ln m(Xk X k ) =
k=0 1
n n−1
k
ln m(X0 X−k )T k
k=0 has the same limit as
1
n n−1 ln m(X0 X−1 , X−2 , · · · )T k
k=0 which, from the usual ergodic theorem, is the expectation
E (ln m(X0 X− ) ≡ E (ln m(X0 X−1 , X−2 , · · · )).
As suggested at the end of the preceeding chapter, this should be minus the
conditional entropy H (X0 X−1 , X−2 , · · · ) which in turn should be the entropy
¯
rate HX . This approach has three shortcomings: it requires a result from martingale theory which has not been proved here or in the companion volume [50],
it requires an extended ergodic theorem which has similarly not been proved
here, and it requires a more advanced deﬁnition of entropy which has not yet
been introduced. Another approach is the sandwich proof of Algoet and Cover
[7]. They show without using martingale theory or the extended ergodic theon−1
i
rem that 1/n i=0 ln m(X0 X−i )T i is asymptotically sandwiched between the
entropy rate of a k th order Markov approximation:
1
n n−1
k
k
k
ln m(X0 X−k )T i → Em [ln m(X0 X−k )] = −H (X0 X−k )
i=k n→∞ and
1
n n−1 ln m(X0 X−1 , X−2 , · · · )T i → Em [ln m(X0 X1 , · · · )] = i=k −H (X0 X−1 , X−2 , · · · ). n→∞ By showing that these two limits are arbitrarily close as k → ∞, the result is
proved. The drawback of this approach for present purposes is that again the
more advanced notion of conditional entropy given the inﬁnite past is required. 3.2. STATIONARY ERGODIC SOURCES 51 Algoet and Cover’s proof that the above two entropies are asymptotically close
involves martingale theory, but this can be avoided by using Corollary 5.2.4 as
will be seen.
The result can, however, be proved without martingale theory, the extended
ergodic theorem, or advanced notions of entropy using the approach of Ornstein
and Weiss [118], which is the approach we shall take in this chapter. In a later
chapter when the entropy ergodic theorem is generalized to nonﬁnite alphabets
and the convergence of entropy and information densities is proved, the sandwich
approach will be used since the appropriate general deﬁnitions of entropy will
have been developed and the necessary side results will have been proved.
Lemma 3.2.1 Given a ﬁnite alphabet source {Xn } with a stationary ergodic
distribution m, we have that
− ln m(X n )
= h; m − a.e.,
n→∞
n
lim where h(x) is the invariant function deﬁned by
¯
h(x) = Hm (X ).
Proof: Deﬁne
hn (x) = − ln m(X n )(x) = − ln m(xn )
and 1
− ln m(xn )
hn (x) = lim inf
.
n→∞ n
n→∞
n
Since m((x0 , · · · , xn−1 )) ≤ m((x1 , · · · , xn−1 )), we have that
h(x) = lim inf hn (x) ≥ hn−1 (T x).
Dividing by n and taking the limit inﬁmum of both sides shows that h(x) ≥
h(T x). Since the n−1 hn are nonnegative and uniformly integrable (Lemma 2.3.6),
we can use Fatou’s lemma to deduce that h and hence also hT are integrable
with respect to m. Integrating with respect to the stationary measure m yields
dm(x)h(x) = dm(x)h(T x) which can only be true if
h(x) = h(T x); m − a.e.,
that is, if h is an invariant function with mprobability one. If h is invariant
almost everywhere, however, it must be a constant with probability one since
m is ergodic (Lemma 6.7.1 of [50]). Since it has a ﬁnite integral (bounded by
¯
Hm (X )), h must also be ﬁnite. Henceforth we consider h to be a ﬁnite constant.
We now proceed with steps that resemble those of the proof of the ergodic
theorem in Section 7.2 of [50]. Fix > 0. We also choose for later use a δ > 0 52 CHAPTER 3. THE ENTROPY ERGODIC THEOREM small enough to have the following properties: If A is the alphabet of X0 and
A is the ﬁnite cardinality of the alphabet, then
δ ln A < , (3.7) − δ ln δ − (1 − δ ) ln(1 − δ ) ≡ h2 (δ ) < . (3.8) and
The latter property is possible since h2 (δ ) → 0 as δ → 0.
Deﬁne the random variable n(x) to be the smallest integer n for which
n−1 hn (x) ≤ h + . By deﬁnition of the limit inﬁmum there must be inﬁnitely
many n for which this is true and hence n(x) is everywhere ﬁnite. Deﬁne the
set of “bad” sequences by B = {x : n(x) > N } where N is chosen so large
that m(B ) < δ/2. Still mimicking the proof of the ergodic theorem, we deﬁne
a bounded modiﬁcation of n(x) by
n(x) =
˜ n( x) x ∈ B
1
x∈B so that n(x) ≤ N for all x ∈ B c . We now parse the sequence into variablelength
˜
blocks. Iteratively deﬁne nk (x) by
n0 (x) = 0
n1 (x) = n(x)
˜
n2 (x) = n1 (x) + n(T n1 (x) x) = n1 (x) + l1 (x)
˜
.
.
.
nk+1 (x) = nk (x) + n(T nk (x) x) = nk (x) + lk (x),
˜
where lk (x) is the length of the k th block:
lk (x) = n(T nk (x) x).
˜
We have parsed a long sequence xL = (x0 , · · · , xL−1 ), where L
N , into
lk (x)
blocks xnk (x) , · · · , xnk+1 (x)−1 = xnk (x) which begin at time nk (x) and have
length lk (x) for k = 0, 1, · · · . We refer to this parsing as the block decomposition
of a sequence. The k th block, which begins at time nk (x), must either have
sample entropy satisfying
l (x) k
− ln m(xnk (x) ) lk (x) ≤h+ (3.9) or, equivalently, probability at least
l (x) k
m(xnk (x) ) ≥ e−lk (x)(h+ ) , (3.10) 3.2. STATIONARY ERGODIC SOURCES 53 or it must consist of only a single symbol. Blocks having length 1 (lk = 1) could
have the correct sample entropy, that is,
− ln m(x1 k (x) )
n
1 ¯
≤h+ , or they could be bad in the sense that they are the ﬁrst symbol of a sequence
with n > N ; that is,
n(T nk (x) x) > N,
or, equivalently,
T nk (x) x ∈ B.
Except for these bad symbols, each of the blocks by construction will have a
probability which satisﬁes the above bound.
Deﬁne for nonnegative integers n and positive integers l the sets
l
S (n, l) = {x : m(Xn (x)) ≥ e−l(h+ ) }, that is, the collection of inﬁnite sequences for which (3.9) and (3.10) hold for
a block starting at n and having length l. Observe that for such blocks there
cannot be more than el(h+ ) distinct ltuples for which the bound holds (lest
the probabilities sum to something greater than 1). In symbols this is
S (n, l) ≤ el(h+ ) . (3.11) The ergodic theorem will imply that there cannot be too many single symbol
blocks with n(T nk (x) x) > N because the event has small probability. These
facts will be essential to the proof.
Even though we write n(x) as a function of the entire inﬁnite sequence, we
˜
can determine its value by observing only the preﬁx xN of x since either there
is an n ≤ N for which n−1 ln m(xn ) ≤ h + or there is not. Hence there is a
function n(xN ) such that n(x) = n(xN ). Deﬁne the ﬁnite length sequence event
ˆ
˜
ˆ
C = {xN : n(xN ) = 1 and − ln m(x1 ) > h + }, that is, C is the collection of all
ˆ
N tuples xN that are preﬁxes of bad inﬁnite sequences, sequences x for which
n(x) > N . Thus in particular,
x ∈ B if and only if xN ∈ C.
Now recall that we parse sequences of length L
of “good” Ltuples by
GL = {xL : 1
L−N (3.12)
N and deﬁne the set GL L−N −1 1C (xN ) ≤ δ },
i
i=0 that is, GL is the collection of all Ltuples which have fewer than δ (L − N ) ≤ δL
time slots i for which xN is a preﬁx of a bad inﬁnite sequence. From (3.12) and
i 54 CHAPTER 3. THE ENTROPY ERGODIC THEOREM the ergodic theorem for stationary ergodic sources we know that ma.e. we get
an x for which
1
n→∞ n n−1 1
n→∞ n n−1 1C (xN ) = lim
i lim i=0 1B (T i x) = m(B ) ≤
i=0 δ
.
2 (3.13) From the deﬁnition of a limit, this means that with probability 1 we get an x
for which there is an L0 = L0 (x) such that
1
LN L −N −1 1C (xN ) ≤ δ ; for all L > L0 .
i (3.14) i=0 This follows simply because if the limit is less than δ/2, there must be an L0 so
large that for larger L the time average is at least no greater than 2δ/2 = δ . We
can restate (3.14) as follows: with probability 1 we get an x for which xL ∈ GL
for all but a ﬁnite number of L. Stating this in negative fashion, we have one of
the key properties required by the proof: If xL ∈ GL for all but a ﬁnite number
of L, then xL cannot be in the complement Gc inﬁnitely often, that is,
L
m(x : xL ∈ Gc i.o.) = 0.
L (3.15) We now change tack to develop another key result for the proof. For each
L we bounded above the cardinality GL  of the set of good Ltuples. By
construction there are no more than δL bad symbols in an Ltuple in GL and
these can occur in any of at most k≤δL L
k ≤ eh2 (δ)L (3.16) places, where we have used Lemma 2.3.5. Eq. (3.16) provides an upper
bound on the number of ways that a sequence in GL can be parsed by the given
rules. The bad symbols and the ﬁnal N symbols in the Ltuple can take on
any of the A diﬀerent values in the alphabet. Eq. (3.11) bounds the number
of ﬁnite length sequences that can occur in each of the remaining blocks and
hence for any given block decomposition, the number of ways that the remaining
blocks blocks can be ﬁlled is bounded above by
elk (x)(h+ ) = e P k lk (x)(h+ ) = eL(h+ ) , (3.17) k:T nk (x) x∈B regardless of the details of the parsing. Combining these bounds we have that
GL  ≤ eh2 (δ)L × AδL × AN × eL(h+ ) = eh2 (δ)L+(δL+N ) ln A+L(h+
or
GL  ≤ eL(h+ +h2 (δ )+(δ + N ) ln A)
L . ) 3.2. STATIONARY ERGODIC SOURCES 55 Since δ satisﬁes (3.7)–(3.8), we can choose L1 large enough so that N ln A/L1 ≤
and thereby obtain
GL  ≤ eL(h+4 ) ; L ≥ L1 .
(3.18)
This bound provides the second key result in the proof of the lemma. We now
combine (3.18) and (3.15) to complete the proof.
Let BL denote a collection of Ltuples that are bad in the sense of having
too large a sample entropy or, equivalently, too small a probability; that is if
xL ∈ BL , then
m(xL ) ≤ e−L(h+5 )
or, equivalently, for any x with preﬁx xL
hL (x) ≥ h + 5 .
The upper bound on GL  provides a bound on the probability of BL
m(BL e−L(h+5 m(xL ) ≤ GL ) =
xL ∈ B L T xL ∈ G GL −L(h+5 ) ≤ GL e ≤e −L L1 −1 m(BL GL ) = L=1 ) L .
> 0 and for all L ≥ L1 . Recall now that the above bound is true for a ﬁxed
Thus
∞ GL : ∞ m(BL GL ) + L=1 m(BL GL ) L=L1
∞ e− ≤ L1 + L <∞ L=L1 and hence from the BorelCantelli lemma (Lemma 4.6.3 of [50]) m(x : xL ∈
BL GL i.o.) = 0. We also have from (3.15), however, that m(x : xL ∈
Gc i.o. ) = 0 and hence xL ∈ GL for all but a ﬁnite number of L. Thus
L
xL ∈ BL i.o. if and only if xL ∈ BL GL i.o. As this latter event has zero
probability, we have shown that m(x : xL ∈ BL i.o.) = 0 and hence
lim sup hL (x) ≤ h + 5 .
L→∞ Since is arbitrary we have proved that the limit supremum of the sample
entropy −n−1 ln m(X n ) is less than or equal to the limit inﬁmum and therefore
that the limit exists and hence with mprobability 1
− ln m(X n )
(3.19)
= h.
n→∞
n
Since the terms on the left in (3.19) are uniformly integrable from Lemma 2.3.6,
we can integrate to the limit and apply Lemma 2.4.1 to ﬁnd that
lim − ln m(X n (x))
¯
= Hm (X ),
n→∞
n
which completes the proof of the lemma and hence also proves Theorem 3.1.1
for the special case of stationary ergodic measures.
2
h = lim dm(x) 56 3.3 CHAPTER 3. THE ENTROPY ERGODIC THEOREM Stationary Nonergodic Sources Next suppose that a source is stationary with ergodic decomposition {mλ ;
λ ∈ Λ} and ergodic component function ψ as in Theorem 1.8.3. The source
will produce with probability one under m an ergodic component mλ and
Lemma 3.2.1 will hold for this ergodic component. In other words, we should
have that
1
¯
lim − ln mψ (X n ) = Hmψ (X ); m − a.e.,
(3.20)
n→∞
n
that is,
¯
m({x : − lim ln mψ(x) (xn ) = Hmψ(x) (X )}) = 1.
n→∞ This argument is made rigorous in the following lemma.
Lemma 3.3.1 Suppose that {Xn } is a stationary not necessarily ergodic source
with ergodic component function ψ . Then
¯
m({x : − lim ln mψ(x) (xn ) = Hmψ(x) (X )}) = 1; m − a.e..
n→∞ (3.21) Proof: Let
¯
G = {x : − lim ln mψ(x) (xn ) = Hmψ(x) (X )}
n→∞ and let Gλ denote the section of G at λ, that is,
¯
Gλ = {x : − lim ln mλ (xn ) = Hmλ (X )}.
n→∞ From the ergodic decomposition (e.g., Theorem 1.8.3 or [50], Theorem 8.5.1)
and (1.26)
m(G) = dPψ (λ)mλ (G), where
mλ (G) = m(Gψ = λ) = m(G {x : ψ (x) = λ}ψ = λ) = m(Gλ ψ = λ) = mλ (Gλ )
which is 1 for all λ from the stationary ergodic result. Thus
m(G) = dPψ (λ)mλ (Gλ ) = 1. It is straightforward to verify that all of the sets considered are in fact measurable.
2
Unfortunately it is not the sample entropy using the distribution of the
ergodic component that is of interest, rather it is the original sample entropy
for which we wish to prove convergence. The following lemma shows that the
two sample entropies converge to the same limit and hence Lemma 3.3.1 will also
provide the limit of the sample entropy with respect to the stationary measure. 3.3. STATIONARY NONERGODIC SOURCES 57 Lemma 3.3.2 Given a stationary source {Xn }, let {mλ ; λ ∈ Λ} denote the
ergodic decomposition and ψ the ergodic component function of Theorem 1.8.3.
Then
1 mψ (X n )
lim
ln
= 0; m − a.e.
n→∞ n
m(X n )
Proof: First observe that if m(an ) is 0, then from the ergodic decomposition
with probability 1 mψ (an ) will also be 0. One part is easy. For any > 0 we
have from the Markov inequality that
m( 1
m(X n )
m(X n )
m(X n ) −n
ln
> ) = m(
> en ) ≤ Em (
)e
.
n mψ (X n )
mψ (X n )
mψ (X n )
(λ) The expectation, however, can be evaluated as follows: Let An
mλ (an ) > 0}. Then
Em m(X n )
mψ (X n ) = dPψ (λ)
an ∈An m(an )
mλ (an ) =
mλ (an ) = {an : dPψ (λ)m(A(λ) ) ≤ 1,
n where Pψ is the distribution of ψ . Thus
m(
and hence 1
m(X n )
ln
> ) ≤ e−n .
n mψ (X n ) ∞ m(
n=1 m(X n )
1
ln
> )<∞
n mψ (X n ) and hence from the BorelCantelli lemma
m( 1
m(X n )
ln
>
n mψ (X n ) i.o.) = 0 and hence with m probability 1
lim sup
n→∞ Since m(X n )
1
ln
≤.
n mψ (X n ) is arbitrary,
lim sup
n→∞ 1
m(X n )
ln
≤ 0; m − a.e.
n mψ (X n ) (3.22) For later use we restate this as
lim inf
n→∞ 1 mψ (X n )
ln
≥ 0; m − a.e.
n
m(X n ) (3.23) Now turn to the converse inequality. For any positive integer k , we can
construct a stationary k step Markov approximation to m as in Section 2.6,
that is, construct a process m(k) with the conditional probabilities
k
k
m(k) (Xn ∈ F X n ) = m(k) (Xn ∈ F Xn−k ) = m(Xn ∈ F Xn−k ) 58 CHAPTER 3. THE ENTROPY ERGODIC THEOREM and the same k th order distributions m(k) (X k ∈ F ) = m(X k ∈ F ). Consider
the probability
m( 1 m(k) (X n )
m(k) (X n )
m(k) (X n ) −n
ln
≥ ) = m(
≥ en ) ≤ Em (
)e
.
n
m(X n )
m(X n )
m(X n ) The expectation is evaluated as xn m(k) (xn )
m(xn ) = 1
m(xn ) and hence we again have using BorelCantelli that
lim sup
n→∞ 1 m(k) (X n )
ln
≤ 0.
n
m(X n ) Apply the usual ergodic theorem to conclude that with probability 1 under m
lim sup
n→∞ 1
1
1
1
ln
≤ lim
ln (k) n = Emψ [− ln m(Xk X k )].
n)
n→∞ n
n m(X
m (X ) Combining this result with (3.20) and Lemma 2.4.3 yields
lim sup
n→∞ 1 mψ (X n )
¯
¯
ln
≤ −Hmψ (X ) − Emψ [ln m(Xk X k )]. = Hmψ m(k) (X ).
n
m(X n ) This bound holds for any integer k and hence it must also be true that ma.e.
the following holds:
lim sup
n→∞ 1 mψ (X n )
¯
ln
≤ inf Hmψ m(k) (X ) ≡ ζ.
k
n
m(X n ) (3.24) In order to evaluate ζ we apply the ergodic decomposition of relative entropy
rate (Corollary 2.4.2) and the ordinary ergodic decomposition to write
dPψ ζ ¯
dPψ inf Hmψ m(k) (X ) = k ≤ inf
k ¯
¯
dPψ Hmψ m(k) (X ) = inf Hmm(k) (X ).
k From Theorem 2.6.1, the right hand term is 0. If the integral of a nonnegative
function is 0, the integrand must itself be 0 with probability one. Thus (3.24)
becomes
1 mψ (X n )
≤ 0,
lim sup ln
m(X n )
n→∞ n
which with (3.23) completes the proof of the lemma.
We shall later see that the quantity
in (X n ; ψ ) = 1 mψ (X n )
ln
n
m(X n ) 2 3.4. AMS SOURCES 59 is the sample mutual information (in a generalized sense so that it applies to the
usually nondiscrete ψ ) and hence the lemma states that the normalized sample
mutual information between the process outputs and the ergodic component
function goes to 0 as the number of samples goes to inﬁnity.
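The behavior described by Lemmas 3.3.1 and 3.3.2 can be illustrated numerically. The sketch below (not from the text; the component biases and sample size are invented) simulates a stationary nonergodic source that is an equal mixture of two i.i.d. binary ergodic components: the normalized log-ratio of the component likelihood to the mixture likelihood is essentially zero, and the sample entropy converges to the entropy of the randomly selected component rather than to any average over components.

```python
import math
import random

random.seed(1)

def log_m(bias, xs):
    """ln of the probability an i.i.d. binary source with P(1)=bias assigns to xs."""
    ones = sum(xs)
    return ones * math.log(bias) + (len(xs) - ones) * math.log(1.0 - bias)

def entropy(bias):
    return -bias * math.log(bias) - (1 - bias) * math.log(1 - bias)

# Stationary nonergodic source: nature first picks an ergodic component
# (a biased coin) with probability 1/2 each, then emits i.i.d. bits from it.
biases = (0.1, 0.4)
component = random.choice(biases)           # value of psi for this sample path
n = 200_000
xs = [1 if random.random() < component else 0 for _ in range(n)]

lp_psi = log_m(component, xs)               # ln m_psi(x^n)
# ln m(x^n) for the mixture m = (m_b1 + m_b2)/2, via log-sum-exp
lps = [log_m(b, xs) for b in biases]
hi = max(lps)
lp_mix = math.log(0.5) + hi + math.log1p(math.exp(min(lps) - hi))

# Lemma 3.3.2: (1/n) ln[m_psi(X^n)/m(X^n)] -> 0
print((lp_psi - lp_mix) / n)
# Entropy ergodic theorem: -(1/n) ln m(X^n) -> entropy of the chosen component
print(-lp_mix / n, entropy(component))
```

Here the mixture likelihood is dominated by the true component's likelihood, so the log-ratio per sample is on the order of (ln 2)/n.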
The two previous lemmas immediately yield the following result.
Corollary 3.3.1 The conclusions of Theorem 3.1.1 hold for sources that are
stationary.

3.4 AMS Sources

The principal idea required to extend the entropy theorem from stationary
sources to AMS sources is contained in Lemma 3.4.2. It shows that an AMS
source inherits sample entropy properties from an asymptotically dominating
stationary source (just as it inherits ordinary ergodic properties from such a
source). The result is originally due to Gray and Kieﬀer [54], but the proof here
is somewhat diﬀerent. The tough part here is handling the fact that the sample
average being considered depends on a speciﬁc measure. From Theorem 1.7.1,
the stationary mean of an AMS source dominates the original source on tail
events, that is, events in F∞ . We begin by showing that certain important
events can be recast as tail events, that is, they can be determined by looking
at only samples in the arbitrarily distant future. The following result is of this
variety: It implies that sample entropy is unaﬀected by the starting time.
Lemma 3.4.1 Let {X_n} be a finite alphabet source with distribution m. Recall that X_k^n = (X_k, X_{k+1}, ..., X_{k+n−1}) and define the information density

\[
i(X^k; X_k^{n-k}) = \ln\frac{m(X^n)}{m(X^k)\,m(X_k^{n-k})}.
\]

Then

\[
\lim_{n\to\infty} \frac{1}{n}\, i(X^k; X_k^{n-k}) = 0; \quad m\text{-a.e.}
\]

Comment: The lemma states that with probability 1 the per-sample mutual information density between the first k samples and future samples goes to zero in the limit. Equivalently, limits of n^{−1} ln m(X^n) will be the same as limits of n^{−1} ln m(X_k^{n−k}) for any finite k. Note that the result does not even require that the source be AMS. The lemma is a direct consequence of Lemma 2.7.1.
Proof: Define the distribution p = m_{X^k} × m_{X_k, X_{k+1}, ...}, that is, a distribution for which all samples after the first k are independent of the first k samples. Thus, in particular, p(X^n) = m(X^k) m(X_k^{n−k}). We will show that p dominates m (that is, m ≪ p), in which case the lemma will follow from Lemma 2.7.1. Suppose that p(F) = 0. If we denote X_k^+ = (X_k, X_{k+1}, ...), then

\[
0 = p(F) = \sum_{x^k} m(x^k)\, m_{X_k^+}(F_{x^k}),
\]

where F_{x^k} is the section {x^+ : (x^k, x^+) ∈ F}. For the above relation to hold, we must have m_{X_k^+}(F_{x^k}) = 0 for all x^k with m(x^k) ≠ 0. We also have, however, that

\[
m(F) = \sum_{a^k} m(X^k = a^k,\, X_k^+ \in F_{a^k}) = \sum_{a^k} m(X^k = a^k \mid X_k^+ \in F_{a^k})\, m(X_k^+ \in F_{a^k}).
\]

But this sum must be 0 since the rightmost terms are 0 for all a^k for which m(X^k = a^k) is not 0. (Observe that we must have m(X^k = a^k | X_k^+ ∈ F_{a^k}) = 0 if m(X_k^+ ∈ F_{a^k}) ≠ 0, since otherwise m(X^k = a^k) ≥ m(X^k = a^k, X_k^+ ∈ F_{a^k}) > 0, yielding a contradiction.) Thus p dominates m and the lemma is proved. □
For later use we note that we have shown that a joint distribution is dominated by a product of its marginals if one of the marginal distributions is
discrete.
Lemma 3.4.2 Suppose that {X_n} is an AMS source with distribution m and suppose that m̄ is a stationary source that asymptotically dominates m (e.g., m̄ is the stationary mean). If there is an invariant function h such that

\[
\lim_{n\to\infty} -\frac{1}{n}\ln \bar m(X^n) = h; \quad \bar m\text{-a.e.},
\]

then also

\[
\lim_{n\to\infty} -\frac{1}{n}\ln m(X^n) = h; \quad m\text{-a.e.}
\]

Proof: For any k we can write, using the chain rule for densities,

\[
-\frac{1}{n}\ln m(X^n) + \frac{1}{n}\ln m(X_k^{n-k}) = -\frac{1}{n}\ln m(X^k \mid X_k^{n-k}) = -\frac{1}{n}\,i(X^k; X_k^{n-k}) - \frac{1}{n}\ln m(X^k).
\]

From the previous lemma and from the fact that H_m(X^k) = −E_m ln m(X^k) is finite, the right hand terms converge to 0 as n → ∞, and hence for any k

\[
\lim_{n\to\infty} -\frac{1}{n}\ln m(X^k \mid X_k^{n-k}) = \lim_{n\to\infty}\left(-\frac{1}{n}\ln m(X^n) + \frac{1}{n}\ln m(X_k^{n-k})\right) = 0; \quad m\text{-a.e.} \tag{3.25}
\]

This implies that there is a subsequence k(n) → ∞ such that

\[
-\frac{1}{n}\ln m(X^{k(n)} \mid X_{k(n)}^{n-k(n)}) = -\frac{1}{n}\ln m(X^n) + \frac{1}{n}\ln m(X_{k(n)}^{n-k(n)}) \to 0; \quad m\text{-a.e.} \tag{3.26}
\]

To see this, observe that (3.25) ensures that for each k there is an N(k) large enough so that N(k) > N(k−1) and

\[
m\left(-\frac{1}{N(k)}\ln m(X^k \mid X_k^{N(k)-k}) > 2^{-k}\right) \le 2^{-k}. \tag{3.27}
\]

Applying the Borel-Cantelli lemma implies that for any ε,

\[
m\left(-\frac{1}{N(k)}\ln m(X^k \mid X_k^{N(k)-k}) > \epsilon \ \text{i.o.}\right) = 0.
\]

Now let k(n) = k for N(k) ≤ n < N(k+1). Then

\[
m\left(-\frac{1}{n}\ln m(X^{k(n)} \mid X_{k(n)}^{n-k(n)}) > \epsilon \ \text{i.o.}\right) = 0
\]

and therefore

\[
\lim_{n\to\infty}\left(-\frac{1}{n}\ln m(X^n) + \frac{1}{n}\ln m(X_{k(n)}^{n-k(n)})\right) = 0; \quad m\text{-a.e.},
\]

as claimed in (3.26). In a similar manner we can also choose the sequence so that

\[
\lim_{n\to\infty}\left(-\frac{1}{n}\ln \bar m(X^n) + \frac{1}{n}\ln \bar m(X_{k(n)}^{n-k(n)})\right) = 0; \quad \bar m\text{-a.e.};
\]

that is, we can choose N(k) so that (3.27) simultaneously holds for both m and m̄. Invoking the entropy ergodic theorem for the stationary m̄ (Corollary 3.3.1) we have therefore that

\[
\lim_{n\to\infty} -\frac{1}{n}\ln \bar m(X_{k(n)}^{n-k(n)}) = h; \quad \bar m\text{-a.e.} \tag{3.28}
\]

From Markov's inequality (Lemma 4.4.3 of [50])

\[
\bar m\left(-\frac{1}{n}\ln m(X_k^{n-k}) \le -\frac{1}{n}\ln \bar m(X_k^{n-k}) - \epsilon\right) = \bar m\left(\frac{m(X_k^{n-k})}{\bar m(X_k^{n-k})} \ge e^{n\epsilon}\right) \le e^{-n\epsilon}\, E_{\bar m}\left[\frac{m(X_k^{n-k})}{\bar m(X_k^{n-k})}\right] = e^{-n\epsilon} \sum_{x_k^{n-k}:\, \bar m(x_k^{n-k}) \neq 0} m(x_k^{n-k}) \le e^{-n\epsilon}.
\]

Hence taking k = k(n) and again invoking the Borel-Cantelli lemma we have that

\[
\bar m\left(-\frac{1}{n}\ln m(X_{k(n)}^{n-k(n)}) \le -\frac{1}{n}\ln \bar m(X_{k(n)}^{n-k(n)}) - \epsilon \ \text{i.o.}\right) = 0
\]

or, equivalently, that

\[
\liminf_{n\to\infty} -\frac{1}{n}\ln \frac{m(X_{k(n)}^{n-k(n)})}{\bar m(X_{k(n)}^{n-k(n)})} \ge 0; \quad \bar m\text{-a.e.} \tag{3.29}
\]

Therefore from (3.28)

\[
\liminf_{n\to\infty} -\frac{1}{n}\ln m(X_{k(n)}^{n-k(n)}) \ge h; \quad \bar m\text{-a.e.} \tag{3.30}
\]

The above event is in the tail σ-field F_∞ = ∩_n σ(X_n, X_{n+1}, ...), since it can be determined from X_{k(n)}, ... for arbitrarily large n and since h is invariant. Since m̄ dominates m on the tail σ-field (Theorem 1.7.2), we have also

\[
\liminf_{n\to\infty} -\frac{1}{n}\ln m(X_{k(n)}^{n-k(n)}) \ge h; \quad m\text{-a.e.},
\]

and hence by (3.26)

\[
\liminf_{n\to\infty} -\frac{1}{n}\ln m(X^n) \ge h; \quad m\text{-a.e.},
\]

which proves half of the lemma.

Since

\[
\bar m\left(\lim_{n\to\infty} -\frac{1}{n}\ln \bar m(X^n) \neq h\right) = 0
\]

and since m̄ asymptotically dominates m (Theorem 1.7.1), given ε > 0 there is a k such that

\[
m\left(\lim_{n\to\infty} -\frac{1}{n}\ln \bar m(X_k^{n-k}) = h\right) \ge 1 - \epsilon.
\]

Again applying Markov's inequality and the Borel-Cantelli lemma as in the development of (3.29), we have that

\[
\liminf_{n\to\infty} -\frac{1}{n}\ln \frac{\bar m(X_k^{n-k})}{m(X_k^{n-k})} \ge 0; \quad m\text{-a.e.},
\]

which implies that

\[
m\left(\limsup_{n\to\infty} -\frac{1}{n}\ln m(X_k^{n-k}) \le h\right) \ge 1 - \epsilon
\]

and hence also that

\[
m\left(\limsup_{n\to\infty} -\frac{1}{n}\ln m(X^n) \le h\right) \ge 1 - \epsilon.
\]

Since ε can be made arbitrarily small, this proves that m-a.e.

\[
\limsup_{n\to\infty} -n^{-1}\ln m(X^n) \le h,
\]

which completes the proof of the lemma. □

The lemma combined with Corollary 3.3.1 completes the proof of Theorem 3.1.1. □
3.5 The Asymptotic Equipartition Property

Since convergence almost everywhere implies convergence in probability, Theorem 3.1.1 has the following implication: Suppose that {X_n} is an AMS ergodic source with entropy rate H̄. Given ε > 0 there is an N such that for all n > N the set

\[
G_n = \{x^n : |n^{-1} h_n(x) - \bar H| \le \epsilon\} = \{x^n : e^{-n(\bar H+\epsilon)} \le m(x^n) \le e^{-n(\bar H-\epsilon)}\}
\]

has probability greater than 1 − ε. Furthermore, as in the proof of the theorem, there can be no more than e^{n(H̄+ε)} n-tuples in G_n. Thus there are two sets of n-tuples: a "good" set of approximately e^{nH̄} n-tuples having approximately equal probability e^{−nH̄}, and the complement of this set, which has small total probability. The good sequences are often referred to as "typical sequences" in the information theory literature, and in this form the theorem is called the asymptotic equipartition property, or AEP.
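The AEP can be checked numerically for a source simple enough that G_n can be enumerated. The sketch below uses an assumed i.i.d. binary source with invented parameters; it counts the typical sequences and their total probability.

```python
import math
from itertools import product

p1 = 0.2                                           # P(X=1) for an i.i.d. binary source
H = -p1 * math.log(p1) - (1 - p1) * math.log(1 - p1)   # entropy rate in nats

n, eps = 14, 0.25
good = 0
good_prob = 0.0
for xs in product((0, 1), repeat=n):
    ones = sum(xs)
    logm = ones * math.log(p1) + (n - ones) * math.log(1 - p1)
    if abs(-logm / n - H) <= eps:                  # x^n is in G_n
        good += 1
        good_prob += math.exp(logm)

print(good, 2 ** n)            # far fewer typical sequences than 2^n
print(good_prob)               # yet they carry most of the probability
print(math.exp(n * (H + eps))) # the e^{n(H+eps)} bound on |G_n|
```

Even at this small block length, the typical set is a small fraction of all n-tuples while capturing most of the probability mass.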
As a ﬁrst information theoretic application of an ergodic theorem, we consider a simple coding scheme called an “almost noiseless source code.” As we
often do, we consider logarithms to the base 2 when considering speciﬁc coding
applications. Suppose that a random process {X_n} has a finite alphabet A with cardinality ‖A‖ and entropy rate H̄. Suppose that H̄ < log ‖A‖; e.g., A might have 16 symbols, but the entropy rate is slightly less than 2 bits per symbol rather than log 16 = 4. Larger alphabets cost money in either storage or communication applications. For example, to communicate a source with a 16 letter
alphabet sending one letter per second without using any coding and using a
binary communication system we would need to send 4 binary symbols (or four
bits) for each source letter and hence 4 bits per second would be required. If
the alphabet only had 4 letters, we would need to send only 2 bits per second.
The question is the following: Since our source has an alphabet of size 16 but
an entropy rate of less than 2, can we code the original source into a new source
with an alphabet of only 4 letters so as to communicate the source at the smaller
rate and yet have the receiver be able to recover the original source? The AEP
suggests a technique for accomplishing this provided we are willing to tolerate
occasional errors.
We construct a code of the original source by first picking a small ε and a δ small enough so that H̄ + δ < 2. Choose n large enough that the AEP holds, giving a set G_n of good sequences as above with probability greater than 1 − ε. Index this collection of fewer than 2^{n(H̄+δ)} < 2^{2n} sequences using binary 2n-tuples. The source is parsed into blocks of length n as X_{kn}^n = (X_{kn}, X_{kn+1}, ..., X_{(k+1)n−1}) and each block is encoded into a binary 2n-tuple as follows: If the source n-tuple is in G_n, the codeword is its binary 2n-tuple index. Select one of the unused binary 2n-tuples as the error index, and whenever an n-tuple is not in G_n, the error index is the codeword. The receiver or decoder then uses the received index and decodes it as the appropriate n-tuple in G_n. If the error index is received, the decoder can declare an arbitrary source sequence or just declare an error. With probability at least 1 − ε a source n-tuple at a particular time will be in G_n and hence it will be correctly decoded. We can make the error probability as small as desired by taking n large enough, but we
cannot in general make it 0.
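A minimal simulation of such an almost noiseless block code, for a hypothetical i.i.d. binary source (the bias, block length, and tolerance below are invented): typical n-blocks get fixed indices, everything else maps to a single error index, and only the atypical blocks are decoded incorrectly.

```python
import math
import random
from itertools import product

random.seed(2)
p1, n = 0.2, 12
H2 = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))   # entropy rate in bits
eps = 0.5

def in_Gn(xs):
    """Membership in the good set G_n of the AEP."""
    ones = sum(xs)
    logm = ones * math.log2(p1) + (n - ones) * math.log2(1 - p1)
    return abs(-logm / n - H2) <= eps

# Enumerate G_n and assign each member a fixed index (its codeword).
Gn = [xs for xs in product((0, 1), repeat=n) if in_Gn(xs)]
index = {xs: i for i, xs in enumerate(Gn)}
ERROR = len(Gn)                        # one extra index flags "not typical"

def encode(block):
    return index.get(block, ERROR)

def decode(i):
    return Gn[i] if i < ERROR else None    # None = declared error

# Simulate: a block is decoded perfectly unless it falls outside G_n.
blocks = 2000
errors = 0
for _ in range(blocks):
    b = tuple(1 if random.random() < p1 else 0 for _ in range(n))
    if decode(encode(b)) != b:
        errors += 1
print(len(Gn), 2 ** n, errors / blocks)
```

The observed block error rate is small but nonzero, as the text warns.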
The above simple scheme is an example of a block coding scheme. If considered as a mapping from sequences into sequences, the map is not stationary,
but it is block stationary in the sense that shifting an input block by n results
in a corresponding block shift of the encoded sequence by 2n binary symbols.

Chapter 4

Information Rates I

4.1 Introduction

Before proceeding to generalizations of the various measures of information,
entropy, and divergence to nondiscrete alphabets, we consider several properties
of information and entropy rates of ﬁnite alphabet processes. We show that
codes that produce similar outputs with high probability yield similar rates and
that entropy and information rate, like ordinary entropy and information, are
reduced by coding. The discussion introduces a basic tool of ergodic theory, the partition distance, and develops several versions of an early and fundamental result from information theory, Fano's inequality. We obtain an ergodic theorem
for information densities of ﬁnite alphabet processes as a simple application of
the general Shannon-McMillan-Breiman theorem coupled with some definitions.
In Chapter 6 these results easily provide L1 ergodic theorems for information
densities for more general processes.

4.2 Stationary Codes and Approximation

We consider the behavior of entropy when codes or measurements are taken on
the underlying random variables. We have seen that entropy is a continuous
function with respect to the underlying measure. We now wish to ﬁx the measure
and show that entropy is a continuous function with respect to the underlying
measurement.
Say we have two ﬁnite alphabet measurements f and g on a common probability space having a common alphabet A. Suppose that Q and R are the
corresponding partitions. A common metric or distance measure on partitions
in ergodic theory is

\[
|Q - R| = \frac{1}{2}\sum_i P(Q_i \,\Delta\, R_i), \tag{4.1}
\]

which in terms of the measurements (assuming they have distinct values on distinct atoms) is just Pr(f ≠ g). If we consider f and g as two codes on a common space, random variable, or random process (that is, finite alphabet mappings),
then the partition distance can also be considered as a form of distance between
the codes. The following lemma shows that entropy of partitions or measurements is continuous with respect to this distance. The result is originally due
to Fano and is called Fano’s inequality [37].
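The identity |Q − R| = Pr(f ≠ g) can be checked on a toy example; the six-point space and the maps f and g below are invented for illustration.

```python
from fractions import Fraction

# Two measurements f and g on a six-point space with uniform P,
# taking values in the common alphabet A = {'a', 'b', 'c'}.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
f = {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c'}
g = {0: 'a', 1: 'b', 2: 'b', 3: 'b', 4: 'c', 5: 'a'}

A = ('a', 'b', 'c')
def atom(h, a):
    """Partition atom h^{-1}(a)."""
    return {w for w in omega if h[w] == a}

# |Q - R| = (1/2) sum_i P(Q_i Δ R_i), using set symmetric difference (^)
dist = sum((sum(P[w] for w in atom(f, a) ^ atom(g, a)) for a in A), Fraction(0)) / 2
err = sum(P[w] for w in omega if f[w] != g[w])    # Pr(f ≠ g)
print(dist, err)    # the two quantities agree
```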
Lemma 4.2.1 Given two finite alphabet measurements f and g on a common probability space (Ω, B, P) having a common alphabet A or, equivalently, given the corresponding partitions Q = {f^{−1}(a); a ∈ A} and R = {g^{−1}(a); a ∈ A}, define the error probability P_e = |Q − R| = Pr(f ≠ g). Then

\[
H(f \mid g) \le h_2(P_e) + P_e \ln(\|A\| - 1)
\]

and

\[
|H(f) - H(g)| \le h_2(P_e) + P_e \ln(\|A\| - 1),
\]

and hence entropy is continuous with respect to partition distance for a fixed measure.
Proof: Let M = ‖A‖ and define a measurement

\[
r : A \times A \to \{0, 1, \ldots, M-1\}
\]

by r(a, b) = 0 if a = b and r(a, b) = i if a ≠ b and a is the ith letter in the alphabet A_b = A − b. If we know g and we know r(f, g), then clearly we know f, since either f = g (if r(f, g) is 0) or, if not, f is equal to the r(f, g)th letter in the alphabet A with g removed. Since f can be considered a function of g and r(f, g),

\[
H(f \mid g, r(f, g)) = 0
\]

and hence

\[
H(f, g, r(f, g)) = H(f \mid g, r(f, g)) + H(g, r(f, g)) = H(g, r(f, g)).
\]

Similarly

\[
H(f, g, r(f, g)) = H(f, g).
\]

From Lemma 2.3.2

\[
H(f, g) = H(g, r(f, g)) \le H(g) + H(r(f, g))
\]

or

\[
H(f, g) - H(g) = H(f \mid g) \le H(r(f, g)) = -P(r=0)\ln P(r=0) - \sum_{i=1}^{M-1} P(r=i)\ln P(r=i).
\]

Since P(r = 0) = 1 − P_e and since \(\sum_{i \neq 0} P(r = i) = P_e\), this becomes

\[
H(f \mid g) \le -(1-P_e)\ln(1-P_e) - P_e \sum_{i=1}^{M-1} \frac{P(r=i)}{P_e}\ln\frac{P(r=i)}{P_e} - P_e \ln P_e \le h_2(P_e) + P_e \ln(M-1),
\]

since the entropy of a random variable with an alphabet of size M − 1 is no greater than ln(M − 1). This proves the first inequality. Since H(f) ≤ H(f, g) = H(f|g) + H(g), this implies

\[
H(f) - H(g) \le h_2(P_e) + P_e \ln(M-1).
\]

Interchanging the roles of f and g completes the proof. □
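A quick numerical check of the first inequality, on a made-up joint distribution for (f, g). In this particular construction the errors are spread uniformly over the wrong letters, so the Fano bound is met with equality.

```python
import math

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

# Joint distribution of (f, g) on A = {0, 1, 2}: mostly f = g (numbers invented).
A = (0, 1, 2)
joint = {(a, b): (0.3 if a == b else 0.1 / 6) for a in A for b in A}
assert abs(sum(joint.values()) - 1.0) < 1e-12

Pe = sum(p for (a, b), p in joint.items() if a != b)        # error probability
pg = {b: sum(joint[(a, b)] for a in A) for b in A}          # marginal of g
# conditional entropy H(f | g)
Hfg = -sum(p * math.log(p / pg[b]) for (a, b), p in joint.items() if p > 0)

bound = h2(Pe) + Pe * math.log(len(A) - 1)                  # Fano's inequality
print(Hfg, bound)
```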
The lemma can be used to show that related information measures such
as mutual information and conditional mutual information are also continuous
with respect to the partition metric. The following corollary provides useful
extensions. Similar extensions may be found in Csiszár and Körner [26].
Corollary 4.2.1 Given two sequences of measurements {f_n} and {g_n} with finite alphabet A on a common probability space, define

\[
P_e^{(n)} = \frac{1}{n}\sum_{i=0}^{n-1} \Pr(f_i \neq g_i).
\]

Then

\[
\frac{1}{n} H(f^n \mid g^n) \le P_e^{(n)} \ln(\|A\|-1) + h_2(P_e^{(n)})
\]

and

\[
\frac{1}{n}\,|H(f^n) - H(g^n)| \le P_e^{(n)} \ln(\|A\|-1) + h_2(P_e^{(n)}).
\]

If {f_n, g_n} are also AMS and hence the limit

\[
\bar P_e = \lim_{n\to\infty} P_e^{(n)}
\]

exists, then if we define

\[
\bar H(f \mid g) = \lim_{n\to\infty}\frac{1}{n} H(f^n \mid g^n) = \lim_{n\to\infty}\frac{1}{n}\left(H(f^n, g^n) - H(g^n)\right),
\]

where the limits exist since the processes are AMS, then

\[
\bar H(f \mid g) \le \bar P_e \ln(\|A\|-1) + h_2(\bar P_e)
\]
\[
|\bar H(f) - \bar H(g)| \le \bar P_e \ln(\|A\|-1) + h_2(\bar P_e).
\]

Proof: From the chain rule for entropy (Corollary 2.5.1), Lemma 2.5.2, and
Lemma 4.2.1

\[
H(f^n \mid g^n) = \sum_{i=0}^{n-1} H(f_i \mid f^i, g^n) \le \sum_{i=0}^{n-1} H(f_i \mid g_i) \le \sum_{i=0}^{n-1}\left(\Pr(f_i \neq g_i)\ln(\|A\|-1) + h_2(\Pr(f_i \neq g_i))\right)
\]

from the previous lemma. Dividing by n yields the first inequality (using the concavity of h_2), which implies the second as in the proof of the previous lemma. If the processes are jointly AMS, then the limits exist and the entropy rate results follow from the continuity of h_2 by taking the limit. □
The per-symbol probability of error P_e^{(n)} has an alternative form. Recall that the Hamming distance between two vectors is the number of positions in which they differ, i.e.,

\[
d_H^{(1)}(x_0, y_0) = 1 - \delta_{x_0, y_0},
\]

where δ_{a,b} is the Kronecker delta function (1 if a = b and 0 otherwise), and

\[
d_H^{(n)}(x^n, y^n) = \sum_{i=0}^{n-1} d_H^{(1)}(x_i, y_i).
\]

We have then that

\[
P_e^{(n)} = E\left[\frac{1}{n}\, d_H^{(n)}(f^n, g^n)\right],
\]

the normalized average Hamming distance.
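In code the normalized Hamming distance is a one-liner; the two sample sequences below are invented for illustration.

```python
def d_hamming(x, y):
    """Number of positions in which the two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))

fn = [0, 1, 1, 0, 2, 2, 1, 0]
gn = [0, 1, 0, 0, 2, 1, 1, 0]
print(d_hamming(fn, gn) / len(fn))   # fraction of positions that differ
```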
The next lemma and corollary provide a useful tool for approximating complicated codes by simpler ones.
Lemma 4.2.2 Given a probability space (Ω, B, P), suppose that F is a generating field: B = σ(F). Suppose that Q is a B-measurable partition of Ω and ε > 0. Then there is a partition Q′ with atoms in F such that |Q − Q′| ≤ ε.

Proof: Let K denote the number of atoms of Q. From Theorem 1.2.1, given γ > 0 we can find sets R_i ∈ F such that P(Q_i Δ R_i) ≤ γ for i = 1, 2, ..., K−1. The remainder of the proof consists of set theoretic manipulations showing that we can construct the desired partition from the R_i by removing overlapping pieces. The algebra is given for completeness, but it can be skipped. Form a partition from the sets as

\[
Q'_i = R_i - \bigcup_{j=1}^{i-1} R_j, \quad i = 1, 2, \ldots, K-1
\]
\[
Q'_K = \left(\bigcup_{i=1}^{K-1} Q'_i\right)^c.
\]

For i < K

\[
P(Q_i \,\Delta\, Q'_i) = P(Q_i \cup Q'_i) - P(Q_i \cap Q'_i) \le P(Q_i \cup R_i) - P\left(Q_i \cap \left(R_i - \bigcup_{j<i} R_j\right)\right).
\]

The rightmost term can be written as

\[
P\left(Q_i \cap \left(R_i - \bigcup_{j<i} R_j\right)\right) = P\left((Q_i \cap R_i) - \bigcup_{j<i} (Q_i \cap R_i \cap R_j)\right) = P(Q_i \cap R_i) - P\left(\bigcup_{j<i} Q_i \cap R_i \cap R_j\right),
\]

where we have used the fact that a set difference is unchanged if the portion being removed is intersected with the set it is being removed from, and the fact that P(F − G) = P(F) − P(G) if G ⊂ F. Combining the two preceding relations we have that

\[
P(Q_i \,\Delta\, Q'_i) \le P(Q_i \cup R_i) - P(Q_i \cap R_i) + P\left(\bigcup_{j<i} Q_i \cap R_i \cap R_j\right) = P(Q_i \,\Delta\, R_i) + P\left(\bigcup_{j<i} Q_i \cap R_i \cap R_j\right) \le \gamma + \sum_{j<i} P(Q_i \cap R_j).
\]

For j ≠ i, however, we have that

\[
P(Q_i \cap R_j) \le P(R_j \cap Q_j^c) \le P(R_j \,\Delta\, Q_j) \le \gamma,
\]

which with the previous equation implies that

\[
P(Q_i \,\Delta\, Q'_i) \le K\gamma; \quad i = 1, 2, \ldots, K-1.
\]

For the remaining atom:

\[
P(Q_K \,\Delta\, Q'_K) = P\left((Q_K^c \cap Q'_K) \cup (Q_K \cap Q'^{c}_K)\right). \tag{4.2}
\]

We have

\[
Q_K^c \cap Q'_K = Q'_K \cap \left(\bigcup_{j<K} Q_j\right) = Q'_K \cap \left(\bigcup_{j<K} Q_j \cap Q'^{c}_j\right),
\]

where the last equality follows since points in Q_j that are also in Q′_j cannot contribute to the intersection with Q′_K since the Q′_j are disjoint. Since Q_j ∩ Q′^c_j ⊂ Q_j Δ Q′_j we have

\[
Q_K^c \cap Q'_K \subset Q'_K \cap \left(\bigcup_{j<K} Q_j \,\Delta\, Q'_j\right) \subset \bigcup_{j<K} Q_j \,\Delta\, Q'_j.
\]

A similar argument shows that

\[
Q_K \cap Q'^{c}_K \subset \bigcup_{j<K} Q_j \,\Delta\, Q'_j
\]

and hence with (4.2)

\[
P(Q_K \,\Delta\, Q'_K) \le P\left(\bigcup_{j<K} Q_j \,\Delta\, Q'_j\right) \le \sum_{j<K} P(Q_j \,\Delta\, Q'_j) \le K^2\gamma.
\]

To summarize, we have shown that

\[
P(Q_i \,\Delta\, Q'_i) \le K^2\gamma; \quad i = 1, 2, \ldots, K.
\]

If we now choose γ so small that K²γ ≤ ε/K, the lemma is proved. □

Corollary 4.2.2 Let (Ω, B, P) be a probability space and F a generating field.
Let f : Ω → A be a finite alphabet measurement. Given ε > 0 there is a measurement g : Ω → A that is measurable with respect to F (that is, g^{−1}(a) ∈ F for all a ∈ A) for which P(f ≠ g) ≤ ε.

Proof: Follows from the previous lemma by setting Q = {f^{−1}(a); a ∈ A}, choosing Q′ from the lemma, and then assigning to g on atom Q′_i of Q′ the same value that f takes on atom Q_i of Q. Then

\[
P(f \neq g) = \frac{1}{2}\sum_i P(Q_i \,\Delta\, Q'_i) \le \epsilon. \qquad \Box
\]
We now develop applications of the previous results which relate the idea of
the entropy of a dynamical system with the entropy rate of a random process.
The result is not required for later coding theorems, but it provides insight into
the connections between entropy as considered in ergodic theory and entropy as
used in information theory. In addition, the development involves some ideas of
coding and approximation which are useful in proving the ergodic theorems of
information theory used to prove coding theorems.
Let {X_n} be a random process with alphabet A_X. Let A_X^∞ denote the one- or two-sided sequence space. Consider the dynamical system (Ω, B, P, T) defined by (A_X^∞, B(A_X)^∞, P, T), where P is the process distribution and T the shift. Recall from Section 2.2 that a stationary coding or infinite length sliding block coding of {X_n} is a measurable mapping f : A_X^∞ → A_f into a finite alphabet which produces an encoded process {f_n} defined by

\[
f_n(x) = f(T^n x); \quad x \in A_X^\infty.
\]

The entropy H(P, T) of the dynamical system was defined by

\[
H(P, T) = \sup_f \bar H_P(f),
\]

the supremum of the entropy rates of finite alphabet stationary codings of the
original process. We shall soon show that if the original alphabet is ﬁnite, then
the entropy of the dynamical system is exactly the entropy rate of the process.
First, however, we require several preliminary results, some of independent interest.
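As a concrete illustration of the sliding-block form f_n(x) = f(T^n x), here is a toy stationary coding of a binary sequence; the 3-symbol window and the majority rule are invented for illustration, and boundary samples are simply skipped.

```python
# A stationary (sliding-block) coding f of a binary sequence: f looks at a
# 3-symbol window centered at time n and outputs the majority bit, so the
# encoded process is f_n(x) = f(T^n x) with T the shift.
def f(window):
    """window = (x_{n-1}, x_n, x_{n+1}); output the majority bit."""
    return 1 if sum(window) >= 2 else 0

def encode(x):
    # Slide the window across the sequence (interior points only, for simplicity).
    return [f((x[n - 1], x[n], x[n + 1])) for n in range(1, len(x) - 1)]

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
print(encode(x))
```

Shifting the input by one symbol shifts the output by one symbol, which is exactly the stationarity of the code.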
Lemma 4.2.3 If f is a stationary coding of an AMS process, then the process {f_n} is also AMS. If the input process is ergodic, then so is {f_n}.

Proof: Suppose that the input process has alphabet A_X and distribution P and that the measurement f has alphabet A_f. Define the sequence mapping f̄ : A_X^∞ → A_f^∞ by f̄(x) = {f_n(x)}, where f_n(x) = f(T^n x) and T is the shift on the input sequence space A_X^∞. If T also denotes the shift on the output space, then by construction f̄(Tx) = T f̄(x) and hence for any output event F, f̄^{−1}(T^{−1}F) = T^{−1} f̄^{−1}(F). Let m denote the process distribution for the encoded process. Since m(F) = P(f̄^{−1}(F)) for any event F ∈ B(A_f)^∞, we have using the stationarity of the mapping f̄ that

\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} m(T^{-i}F) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} P(\bar f^{-1}(T^{-i}F)) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} P(T^{-i}\bar f^{-1}(F)) = \bar P(\bar f^{-1}(F)),
\]

where P̄ is the stationary mean of P. Thus m is AMS. If G is an invariant output event, then f̄^{−1}(G) is also invariant since T^{−1} f̄^{−1}(G) = f̄^{−1}(T^{−1}G). Hence if input invariant sets can only have probability 1 or 0, the same is true for output invariant sets. □

The lemma and Theorem 3.1.1 immediately yield the following:
Corollary 4.2.3 If f is a stationary coding of an AMS process, then

\[
\bar H(f) = \lim_{n\to\infty} \frac{1}{n} H(f^n),
\]

that is, the limit exists.
For later use the next result considers general standard alphabets. A stationary code f is a scalar quantizer if there is a map q : AX → Af such that
f (x) = q (x0 ). Intuitively, f depends on the input sequence only through the
current symbol. Mathematically, f is measurable with respect to σ (X0 ). Such
codes are eﬀectively the simplest possible and have no memory or dependence
on the future.
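A scalar quantizer is easy to exhibit in code; the 4-level uniform quantizer on [0, 1) below is an invented example of such a memoryless stationary code.

```python
# A scalar quantizer as the simplest stationary code: f(x) = q(x_0),
# here a 4-level uniform quantizer on [0, 1).
def q(u, levels=4):
    return min(int(u * levels), levels - 1)

xs = [0.03, 0.41, 0.77, 0.99, 0.25]
print([q(u) for u in xs])
```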
Lemma 4.2.4 Let {X_n} be an AMS process with standard alphabet A_X and distribution m. Let f be a stationary coding of the process with finite alphabet A_f. Fix ε > 0. If the process is two-sided, then there is a scalar quantizer q : A_X → A_q, an integer N, and a mapping g : A_q^{2N+1} → A_f such that

\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} \Pr\big(f_i \neq g(q(X_{i-N}), q(X_{i-N+1}), \ldots, q(X_{i+N}))\big) \le \epsilon.
\]

If the process is one-sided, then there is a scalar quantizer q : A_X → A_q, an integer N, and a mapping g : A_q^N → A_f such that

\[
\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} \Pr\big(f_i \neq g(q(X_i), q(X_{i+1}), \ldots, q(X_{i+N-1}))\big) \le \epsilon.
\]

Comment: The lemma states that any stationary coding of an AMS process can
be approximated by a code that depends only on a ﬁnite number of quantized
inputs, that is, by a coding of a ﬁnite window of a scalar quantized version of
the original process. In the special case of a ﬁnite alphabet input process, the
lemma states that an arbitrary stationary coding can be well approximated by
a coding depending only on a ﬁnite number of the input symbols.
Proof: Suppose that m̄ is the stationary mean, and hence for any measurements f and g

\[
\bar m(f_0 \neq g_0) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} \Pr(f_i \neq g_i).
\]

Let q_n be a sequence of asymptotically accurate scalar quantizers in the sense that σ(q_n(X_0)) asymptotically generates B(A_X). (Since A_X is standard this exists. If A_X is finite, then take q(a) = a.) Then F_n = σ(q_n(X_i); i = 0, 1, 2, ..., n−1) asymptotically generates B(A_X)^∞ for one-sided processes and F_n = σ(q_n(X_i); i = −n, ..., n) does the same for two-sided processes. Hence from Corollary 4.2.2, given ε we can find a sufficiently large n and a mapping g that is measurable with respect to F_n such that m̄(f ≠ g) ≤ ε. Since g is measurable with respect to F_n, it must depend on only the finite number of quantized samples that generate F_n. (See, e.g., Lemma 5.2.1 of [50].) This proves the lemma. □
Combining the lemma and Corollary 4.2.1 immediately yields the following
corollary, which permits us to study the entropy rate of general stationary codes
by considering codes which depend on only a ﬁnite number of inputs (and hence
for which the ordinary entropy results for random vectors can be applied).
Corollary 4.2.4 Given a stationary coding f of an AMS process, let F_n be defined as above. Then given ε > 0 there exists for sufficiently large n a code g measurable with respect to F_n such that

\[
|\bar H(f) - \bar H(g)| \le \epsilon.
\]
The above corollary can be used to show that entropy rate, like entropy,
is reduced by coding. The general stationary code is approximated by a code
depending on only a ﬁnite number of inputs and then the result that entropy is
reduced by mapping (Lemma 2.3.3) is applied.

Corollary 4.2.5 Given an AMS process {X_n} with finite alphabet A_X and a stationary coding f of the process,

\[
\bar H(X) \ge \bar H(f);
\]

that is, stationary coding reduces entropy rate.

Proof: For integer n define F_n = σ(X_0, X_1, ..., X_n) in the one-sided case and σ(X_{−n}, ..., X_n) in the two-sided case. Then F_n asymptotically generates B(A_X)^∞. Hence given a code f and an ε > 0, using the finite alphabet special case of the previous lemma we can choose a large k and an F_k-measurable code g such that |H̄(f) − H̄(g)| ≤ ε. We shall show that H̄(g) ≤ H̄(X), which will prove the lemma. To see this in the one-sided case, observe that g is a function of X^k and hence g^n depends only on X^{n+k}, so that

\[
H(g^n) \le H(X^{n+k})
\]

and hence

\[
\bar H(g) = \lim_{n\to\infty}\frac{1}{n}H(g^n) \le \lim_{n\to\infty}\frac{n+k}{n}\cdot\frac{1}{n+k}H(X^{n+k}) = \bar H(X).
\]

In the two-sided case g depends on {X_{−k}, ..., X_k}, and hence g_n depends on {X_{−k}, ..., X_{n+k}}, so that

\[
H(g^n) \le H(X_{-k}, \ldots, X_{-1}, X_0, \ldots, X_{n+k}) \le H(X_{-k}, \ldots, X_{-1}) + H(X^{n+k}).
\]

Dividing by n and taking the limit completes the proof as before. □

Theorem 4.2.1 Let {X_n} be a random process with alphabet A_X. Let A_X^∞
denote the one- or two-sided sequence space. Consider the dynamical system (Ω, B, P, T) defined by (A_X^∞, B(A_X)^∞, P, T), where P is the process distribution and T is the shift. Then

\[
H(P, T) = \bar H(X).
\]

Proof: From (2.2.4), H(P, T) ≥ H̄(X). Conversely suppose that f is a code which yields H̄(f) ≥ H(P, T) − ε. Since f is a stationary coding of the process {X_n}, the previous corollary implies that H̄(f) ≤ H̄(X). Since ε is arbitrary, this completes the proof. □

4.3 Information Rate of Finite Alphabet Processes

Let {(X_n, Y_n)} be a one-sided random process with finite alphabet A × B and
let ((A × B )Z+ , B (A × B )Z+ ) be the corresponding onesided sequence space of
outputs of the pair process. We consider Xn and Yn to be the sampling functions
on the sequence spaces A^∞ and B^∞ and (X_n, Y_n) to be the pair sampling function on the product space, that is, for (x, y) ∈ A^∞ × B^∞, (X_n, Y_n)(x, y)
= (Xn (x), Yn (y )) = (xn , yn ). Let p denote the process distribution induced by
the original space on the process {(Xn , Yn )}. Analogous to entropy rate we
can deﬁne the mutual information rate (or simply information rate) of a ﬁnite
alphabet pair process by

\[
\bar I(X, Y) = \limsup_{n\to\infty} \frac{1}{n} I(X^n, Y^n).
\]
The following lemma follows immediately from the properties of entropy rates
of Theorems 2.4.1 and 3.1.1 since for AMS finite alphabet processes

\[
\bar I(X; Y) = \bar H(X) + \bar H(Y) - \bar H(X, Y)
\]
and since from (3.4) the entropy rate of an AMS process is the same as that of its
stationary mean. Analogous to Theorem 3.1.1 we deﬁne the random variables
p(X n , Y n ) by p(X n , Y n )(x, y ) = p(X n = xn , Y n = y n ), p(X n ) by p(X n )(x, y )
= p(X n = xn ), and similarly for p(Y n ).
Lemma 4.3.1 Suppose that {X_n, Y_n} is an AMS finite alphabet random process with distribution p and stationary mean p̄. Then the limits supremum defining information rates are limits and

\[
\bar I_p(X, Y) = \bar I_{\bar p}(X, Y).
\]

\(\bar I_p\) is an affine function of the distribution p. If p̄ has ergodic decomposition p̄_{xy}, then

\[
\bar I_{\bar p}(X, Y) = \int d\bar p(x, y)\, \bar I_{\bar p_{xy}}(X, Y).
\]

If we define the information density

\[
i_n(X^n, Y^n) = \ln \frac{p(X^n, Y^n)}{p(X^n)\, p(Y^n)},
\]

then

\[
\lim_{n\to\infty} \frac{1}{n}\, i_n(X^n, Y^n) = \bar I_{\bar p_{xy}}(X, Y)
\]

almost everywhere with respect to p and p̄ and in L_1(p).

The following lemmas follow either directly from or similarly to the corresponding results for entropy rate of the previous section.
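The convergence of the normalized information density can be illustrated numerically. The sketch below uses an assumed i.i.d. pair process in which Y is a noisy copy of X (the crossover probability is invented), for which the information rate reduces to the single-letter mutual information.

```python
import math
import random

random.seed(3)
# Pair process (X_n, Y_n): i.i.d. pairs, Y a noisy copy of X (crossover 0.1).
px, cross = 0.5, 0.1
pj = {(x, y): (px if x else 1 - px) * (1 - cross if y == x else cross)
      for x in (0, 1) for y in (0, 1)}
px_m = {x: pj[(x, 0)] + pj[(x, 1)] for x in (0, 1)}
py_m = {y: pj[(0, y)] + pj[(1, y)] for y in (0, 1)}

# Single-letter mutual information = information rate for this i.i.d. process
I = sum(p * math.log(p / (px_m[x] * py_m[y])) for (x, y), p in pj.items())

# Sample information density (1/n) i_n(X^n, Y^n) along one long realization
n = 100_000
dens = 0.0
for _ in range(n):
    x = 1 if random.random() < px else 0
    y = x if random.random() > cross else 1 - x
    dens += math.log(pj[(x, y)] / (px_m[x] * py_m[y]))
print(dens / n, I)   # close for large n
```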
Lemma 4.3.2 Suppose that {X_n, Y_n, X′_n, Y′_n} is an AMS process and

\[
\bar P = \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1}\Pr((X_i, Y_i) \neq (X'_i, Y'_i)) \le \epsilon
\]

(the limit exists since the process is AMS). Then

\[
|\bar I(X; Y) - \bar I(X'; Y')| \le 3\left(\epsilon \ln(\|A\| - 1) + h_2(\epsilon)\right).
\]

Proof: The inequality follows from Corollary 4.2.1 since
\[
|\bar I(X; Y) - \bar I(X'; Y')| \le |\bar H(X) - \bar H(X')| + |\bar H(Y) - \bar H(Y')| + |\bar H(X, Y) - \bar H(X', Y')|
\]

and since Pr((X_i, Y_i) ≠ (X′_i, Y′_i)) = Pr(X_i ≠ X′_i or Y_i ≠ Y′_i) is no smaller than Pr(X_i ≠ X′_i) or Pr(Y_i ≠ Y′_i). □
Corollary 4.3.1 Let {X_n, Y_n} be an AMS process and let f and g be stationary measurements on X and Y, respectively. Given ε > 0 there is an N sufficiently large, scalar quantizers q and r, and mappings f′ and g′ which depend only on {q(X_0), ..., q(X_{N−1})} and {r(Y_0), ..., r(Y_{N−1})} in the one-sided case and {q(X_{−N}), ..., q(X_N)} and {r(Y_{−N}), ..., r(Y_N)} in the two-sided case such that

\[
|\bar I(f; g) - \bar I(f'; g')| \le \epsilon.
\]

Proof: Choose the codes f′ and g′ from Lemma 4.2.4 and apply the previous lemma. □
Lemma 4.3.3 If {X_n, Y_n} is an AMS process and f and g are stationary codings of X and Y, respectively, then

\[
\bar I(X; Y) \ge \bar I(f; g).
\]

Proof: This is proved as Corollary 4.2.5 by first approximating f and g by finite-window stationary codes, applying the result for mutual information (Lemma 2.5.2), and then taking the limit. □

Chapter 5

Relative Entropy

5.1 Introduction

A variety of information measures have been introduced for finite alphabet random variables, vectors, and processes: entropy, mutual information, relative entropy, conditional entropy, and conditional mutual information. All of these
can be expressed in terms of divergence and hence the generalization of these
deﬁnitions to inﬁnite alphabets will follow from a general deﬁnition of divergence. Many of the properties of generalized information measures will then
follow from those of generalized divergence.
In this chapter we extend the deﬁnition and develop the basic properties
of divergence, including the formulas for evaluating divergence as expectations
of information densities and as limits of divergences of ﬁnite codings. We also
develop several inequalities for and asymptotic properties of divergence. These
results provide the groundwork needed for generalizing the ergodic theorems of
information theory from ﬁnite to standard alphabets. The general deﬁnitions
of entropy and information measures originated in the pioneering work of Kolmogorov and his colleagues Gelfand, Yaglom, Dobrushin, and Pinsker [45] [91]
[32] [126].

5.2 Divergence

Given a probability space $(\Omega, \mathcal{B}, P)$ (not necessarily with finite alphabet) and another probability measure $M$ on the same space, define the divergence of $P$ with respect to $M$ by
$$D(P\|M) = \sup_{\mathcal{Q}} H_{P\|M}(\mathcal{Q}) = \sup_{f} D(P_f\|M_f), \tag{5.1}$$
where the first supremum is over all finite measurable partitions $\mathcal{Q}$ of $\Omega$ and the second is over all finite alphabet measurements $f$ on $\Omega$. The two forms have the same interpretation: the divergence is the supremum of the relative entropies or divergences obtainable by finite alphabet codings of the sample space. The
partition form is perhaps more common when considering divergence per se,
but the measurement or code form is usually more intuitive when considering
entropy and information. This section is devoted to developing the basic properties of divergence, all of which will yield immediate corollaries for the measures
of information.
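For a finite sample space the supremum in (5.1) is attained by the finest partition, and coarser codings can only lose relative entropy. A minimal numerical sketch (the two measures below are hypothetical, chosen only for illustration):

```python
import math

# Two hypothetical probability measures on a four-point sample space.
P = [0.1, 0.2, 0.3, 0.4]
M = [0.25, 0.25, 0.25, 0.25]

def relative_entropy(partition, p, m):
    """H_{P||M}(Q) = sum over atoms Q of P(Q) ln(P(Q)/M(Q))."""
    total = 0.0
    for atom in partition:
        pq = sum(p[i] for i in atom)
        mq = sum(m[i] for i in atom)
        if pq > 0:
            total += pq * math.log(pq / mq)
    return total

coarse = [[0, 1], [2, 3]]      # a two-atom partition (a coarse coding)
finest = [[0], [1], [2], [3]]  # the finest partition of the space

# Coarsening cannot increase relative entropy, and on a finite space the
# finest partition attains the divergence D(P||M) itself.
assert relative_entropy(coarse, P, M) <= relative_entropy(finest, P, M)
divergence = sum(p * math.log(p / m) for p, m in zip(P, M))
assert abs(relative_entropy(finest, P, M) - divergence) < 1e-12
```
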
The ﬁrst result is a generalization of the divergence inequality that is a trivial
consequence of the deﬁnition and the ﬁnite alphabet special case.
Lemma 5.2.1 (The Divergence Inequality) For any two probability measures $P$ and $M$,
$$D(P\|M) \geq 0,$$
with equality if and only if $P = M$.

Proof: Given any partition $\mathcal{Q}$, Theorem 2.3.1 implies that
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \geq 0$$
with equality if and only if $P(Q) = M(Q)$ for all atoms $Q$ of the partition. Since $D(P\|M)$ is the supremum over all such partitions, it is also nonnegative. It can be 0 only if $P$ and $M$ assign the same probabilities to all atoms in all partitions (the supremum is 0 only if the above sum is 0 for every partition), and hence the divergence is 0 only if the measures are identical. □
As in the ﬁnite alphabet case, Lemma 5.2.1 justiﬁes interpreting divergence
as a form of distance or dissimilarity between two probability measures. It is
not a true distance or metric in the mathematical sense since it is not symmetric
and it does not satisfy the triangle inequality. Since it is nonnegative and equals
zero only if two measures are identical, the divergence is a distortion measure
as considered in information theory [51], which is a generalization of the notion
of distance. This view often provides interpretations of the basic properties of
divergence. We shall develop several relations between the divergence and other
distance measures. The reader is referred to Csiszár [25] for a development of the distance-like properties of divergence.
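The failure of symmetry is easy to exhibit numerically; a small sketch with two hypothetical pmfs:

```python
import math

def divergence(p, m):
    """Discrete divergence D(p||m) in nats."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

P = [0.5, 0.5]
M = [0.9, 0.1]

assert divergence(P, P) == 0.0   # zero only between identical measures
assert divergence(P, M) > 0.0    # nonnegative (divergence inequality)
# Not symmetric, hence not a metric:
assert abs(divergence(P, M) - divergence(M, P)) > 0.1
```
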
The following two lemmas provide means for computing divergences and
studying their behavior. The ﬁrst result shows that the supremum can be conﬁned to partitions with atoms in a generating ﬁeld. This will provide a means
for computing divergences by approximation or limits. The result is due to
Dobrushin and is referred to as Dobrushin’s theorem. The second result shows
that the divergence can be evaluated as the expectation of an entropy density
defined as the logarithm of the Radon-Nikodym derivative of one measure relative to the other. This result is due to Gelfand, Yaglom, and Perez. The proofs
largely follow the translator’s remarks in Chapter 2 of Pinsker [126] (which in
turn follows Dobrushin [32]).

Lemma 5.2.2 Suppose that $(\Omega, \mathcal{B})$ is a measurable space where $\mathcal{B}$ is generated by a field $\mathcal{F}$, $\mathcal{B} = \sigma(\mathcal{F})$. Then if $P$ and $M$ are two probability measures on this space,
$$D(P\|M) = \sup_{\mathcal{Q} \subset \mathcal{F}} H_{P\|M}(\mathcal{Q}).$$

Proof: From the definition of divergence, the right-hand term above is clearly
less than or equal to the divergence. If $P$ is not absolutely continuous with respect to $M$, then we can find a set $F$ such that $M(F) = 0$ but $P(F) \neq 0$, and hence the divergence is infinite. Approximating this event by a field element $F_0$ by applying Theorem 1.2.1 simultaneously to $M$ and $P$ yields a partition $\{F_0, F_0^c\}$ for which the right-hand side of the previous equation is arbitrarily large. Hence the lemma holds in this case. Henceforth assume that $P \ll M$.
Fix $\epsilon > 0$ and suppose that a partition $\mathcal{Q} = \{Q_1, \cdots, Q_K\}$ yields a relative entropy close to the divergence, that is,
$$H_{P\|M}(\mathcal{Q}) = \sum_{i=1}^{K} P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} \geq D(P\|M) - \epsilon/2.$$
We will show that there is a partition, say $\mathcal{Q}'$, with atoms in $\mathcal{F}$ which has almost the same relative entropy, which will prove the lemma. First observe that $P(Q) \ln[P(Q)/M(Q)]$ is a continuous function of $P(Q)$ and $M(Q)$ in the sense that, given $\epsilon/(2K)$, there is a sufficiently small $\delta > 0$ such that if $|P(Q) - P(Q')| \leq \delta$ and $|M(Q) - M(Q')| \leq \delta$, then, provided $M(Q) \neq 0$,
$$\left| P(Q) \ln \frac{P(Q)}{M(Q)} - P(Q') \ln \frac{P(Q')}{M(Q')} \right| \leq \frac{\epsilon}{2K}.$$
If we can find a partition $\mathcal{Q}' = \{Q_1', \cdots, Q_K'\}$ with atoms in $\mathcal{F}$ such that
$$|P(Q_i) - P(Q_i')| \leq \delta, \quad |M(Q_i) - M(Q_i')| \leq \delta, \quad i = 1, \cdots, K, \tag{5.2}$$
then
$$|H_{P\|M}(\mathcal{Q}') - H_{P\|M}(\mathcal{Q})| \leq \sum_i \left| P(Q_i') \ln \frac{P(Q_i')}{M(Q_i')} - P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)} \right| \leq K \frac{\epsilon}{2K} = \frac{\epsilon}{2},$$
and hence
$$H_{P\|M}(\mathcal{Q}') \geq D(P\|M) - \epsilon,$$
which will prove the lemma. To find the partition $\mathcal{Q}'$ satisfying (5.2), let $m$ be the mixture measure $P/2 + M/2$. As in the proof of Lemma 4.2.2, we can find a partition $\mathcal{Q}' \subset \mathcal{F}$ such that $m(Q_i \Delta Q_i') \leq K^2 \gamma$ for $i = 1, 2, \cdots, K$, which implies that
$$P(Q_i \Delta Q_i') \leq 2K^2 \gamma, \quad i = 1, 2, \cdots, K,$$
and
$$M(Q_i \Delta Q_i') \leq 2K^2 \gamma, \quad i = 1, 2, \cdots, K.$$
If we now choose $\gamma$ so small that $2K^2 \gamma \leq \delta$, then (5.2), and hence the lemma, follow from the above and the fact that
$$|P(F) - P(G)| \leq P(F \Delta G). \tag{5.3}$$
□
Lemma 5.2.3 Given two probability measures $P$ and $M$ on a common measurable space $(\Omega, \mathcal{B})$, if $P$ is not absolutely continuous with respect to $M$, then
$$D(P\|M) = \infty.$$
If $P \ll M$ (e.g., if $D(P\|M) < \infty$), then the Radon-Nikodym derivative $f = dP/dM$ exists and
$$D(P\|M) = \int \ln f(\omega)\, dP(\omega) = \int f(\omega) \ln f(\omega)\, dM(\omega).$$
The quantity $\ln f$ (if it exists) is called the entropy density or relative entropy
density of P with respect to M .
Proof: The first statement was shown in the proof of the previous lemma. If $P$ is not absolutely continuous with respect to $M$, then there is a set $Q$ such that $M(Q) = 0$ and $P(Q) > 0$. The relative entropy for the partition $\mathcal{Q} = \{Q, Q^c\}$ is then infinite, and hence so is the divergence.
Assume that $P \ll M$ and let $f = dP/dM$. Suppose that $Q$ is an event for which $M(Q) > 0$ and consider the conditional cumulative distribution function for the real random variable $f$ given that $\omega \in Q$:
$$F_Q(u) = \frac{M(\{f < u\} \cap Q)}{M(Q)}; \quad u \in (-\infty, \infty).$$
Observe that the expectation with respect to this distribution is
$$E_M(f|Q) = \int_0^\infty u\, dF_Q(u) = \frac{1}{M(Q)} \int_Q f(\omega)\, dM(\omega) = \frac{P(Q)}{M(Q)}.$$
We also have that
$$\int_0^\infty u \ln u\, dF_Q(u) = \frac{1}{M(Q)} \int_Q f(\omega) \ln f(\omega)\, dM(\omega),$$
where the existence of the integral is ensured by the fact that $u \ln u \geq -e^{-1}$.
Applying Jensen's inequality to the convex function $u \ln u$ yields the inequality
$$\frac{1}{M(Q)} \int_Q \ln f(\omega)\, dP(\omega) = \frac{1}{M(Q)} \int_Q f(\omega) \ln f(\omega)\, dM(\omega) = \int_0^\infty u \ln u\, dF_Q(u) \geq \left[ \int_0^\infty u\, dF_Q(u) \right] \ln \left[ \int_0^\infty u\, dF_Q(u) \right] = \frac{P(Q)}{M(Q)} \ln \frac{P(Q)}{M(Q)}.$$
We therefore have, for any event $Q$ with $M(Q) > 0$, that
$$\int_Q \ln f(\omega)\, dP(\omega) \geq P(Q) \ln \frac{P(Q)}{M(Q)}. \tag{5.4}$$
Let $\mathcal{Q} = \{Q_i\}$ be a finite partition; then
$$\int \ln f(\omega)\, dP(\omega) = \sum_i \int_{Q_i} \ln f(\omega)\, dP(\omega) = \sum_{i: P(Q_i) \neq 0} \int_{Q_i} \ln f(\omega)\, dP(\omega) \geq \sum_i P(Q_i) \ln \frac{P(Q_i)}{M(Q_i)},$$
where the inequality follows from (5.4) since $P(Q_i) \neq 0$ implies that $M(Q_i) \neq 0$ because $P \ll M$. This proves that
$$D(P\|M) \leq \int \ln f(\omega)\, dP(\omega).$$
To obtain the converse inequality, let $q_n$ denote the asymptotically accurate
quantizers of Section 1.6. From (1.21),
$$\int \ln f(\omega)\, dP(\omega) = \lim_{n \to \infty} \int q_n(\ln f(\omega))\, dP(\omega).$$
For fixed $n$ the quantizer $q_n$ induces a partition $\mathcal{Q}$ of $\Omega$ into $2n2^n + 1$ atoms. In particular, there are $2n2^n - 1$ "good" atoms such that for $\omega, \omega'$ inside the same atom we have $|\ln f(\omega) - \ln f(\omega')| \leq 2^{-(n-1)}$. The remaining two atoms collect the $\omega$ for which $\ln f(\omega) \geq n$ or $\ln f(\omega) < -n$. Defining the shorthand $P(\ln f < -n) = P(\{\omega : \ln f(\omega) < -n\})$, we then have
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} = \sum_{\text{good } Q} P(Q) \ln \frac{P(Q)}{M(Q)} + P(\ln f \geq n) \ln \frac{P(\ln f \geq n)}{M(\ln f \geq n)} + P(\ln f < -n) \ln \frac{P(\ln f < -n)}{M(\ln f < -n)}.$$
The rightmost two terms above are bounded below as
$$P(\ln f \geq n) \ln \frac{P(\ln f \geq n)}{M(\ln f \geq n)} + P(\ln f < -n) \ln \frac{P(\ln f < -n)}{M(\ln f < -n)} \geq P(\ln f \geq n) \ln P(\ln f \geq n) + P(\ln f < -n) \ln P(\ln f < -n).$$
Since $P(\ln f \geq n)$ and $P(\ln f < -n) \to 0$ as $n \to \infty$ and since $x \ln x \to 0$ as $x \to 0$, given $\epsilon$ we can choose $n$ large enough to ensure that the above term is greater than $-\epsilon$. This yields the lower bound
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \geq \sum_{\text{good } Q} P(Q) \ln \frac{P(Q)}{M(Q)} - \epsilon.$$
Fix a good atom $Q$ and define $\bar{h} = \sup_{\omega \in Q} \ln f(\omega)$ and $\underline{h} = \inf_{\omega \in Q} \ln f(\omega)$, and note that by definition of the good atoms
$$\bar{h} - \underline{h} \leq 2^{-(n-1)}.$$
We now have that
$$P(Q)\bar{h} \geq \int_Q \ln f(\omega)\, dP(\omega)$$
and
$$M(Q) e^{\underline{h}} \leq \int_Q f(\omega)\, dM(\omega) = P(Q).$$
Combining these yields
$$P(Q) \ln \frac{P(Q)}{M(Q)} \geq P(Q) \ln \frac{P(Q)}{P(Q) e^{-\underline{h}}} = P(Q)\underline{h} \geq P(Q)(\bar{h} - 2^{-(n-1)}) \geq \int_Q \ln f(\omega)\, dP(\omega) - P(Q) 2^{-(n-1)}.$$
Therefore
$$\sum_{Q \in \mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \geq \sum_{\text{good } Q} P(Q) \ln \frac{P(Q)}{M(Q)} - \epsilon \geq \sum_{\text{good } Q} \left[ \int_Q \ln f(\omega)\, dP(\omega) - P(Q) 2^{-(n-1)} \right] - \epsilon = \int_{\omega : |\ln f(\omega)| \leq n} \ln f(\omega)\, dP(\omega) - 2^{-(n-1)} - \epsilon.$$
Since this is true for arbitrarily large $n$ and arbitrarily small $\epsilon$,
$$D(P\|M) \geq \int \ln f(\omega)\, dP(\omega),$$
completing the proof of the lemma. □

It is worthwhile to point out two examples for the previous lemma. If $P$ and
$M$ are discrete measures with corresponding pmf's $p$ and $m$, then the Radon-Nikodym derivative is simply $dP/dM(\omega) = p(\omega)/m(\omega)$, and the lemma gives the known formula for the discrete case. If $P$ and $M$ are both probability measures on Euclidean space $R^n$ and if both measures are absolutely continuous with respect to Lebesgue measure, then there exists a density $f$, called a probability density function or pdf, such that
$$P(F) = \int_F f(x)\, dx,$$
where $dx$ means $dm(x)$ with $m$ Lebesgue measure. (Lebesgue measure assigns each set its volume.) Similarly, there is a pdf $g$ for $M$. In this case,
$$D(P\|M) = \int_{R^n} f(x) \ln \frac{f(x)}{g(x)}\, dx. \tag{5.5}$$
The following immediate corollary to the previous lemma provides a formula that is occasionally useful for computing divergences.
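Before moving on, formula (5.5) can be sanity-checked numerically for two hypothetical Gaussian pdfs against the well-known closed form $\ln(\sigma_2/\sigma_1) + (\sigma_1^2 + (\mu_1 - \mu_2)^2)/(2\sigma_2^2) - 1/2$ (the parameter values below are arbitrary):

```python
import math

mu1, s1 = 0.0, 1.0   # P = N(mu1, s1^2), a hypothetical choice
mu2, s2 = 1.0, 2.0   # M = N(mu2, s2^2)

def pdf(x, mu, s):
    return math.exp(-((x - mu) ** 2) / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))

# Riemann sum for D(P||M) = \int f ln(f/g) dx over a wide interval.
dx, D = 1e-3, 0.0
x = -20.0
while x < 20.0:
    f, g = pdf(x, mu1, s1), pdf(x, mu2, s2)
    if f > 0.0:
        D += f * math.log(f / g) * dx
    x += dx

closed_form = math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5
assert abs(D - closed_form) < 1e-2
```
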
Corollary 5.2.1 Given three probability distributions $M \gg Q \gg P$, then
$$D(P\|M) = D(P\|Q) + E_P\left( \ln \frac{dQ}{dM} \right).$$

Proof: From the chain rule for Radon-Nikodym derivatives (e.g., Lemma 5.7.3 of [50]),
$$\frac{dP}{dM} = \frac{dP}{dQ}\, \frac{dQ}{dM},$$
and taking expectations using the previous lemma yields the corollary. □

The next result is a technical result that shows that given a mapping on
from the restrictions of the original measures to the subσ ﬁeld induced by
the mapping. As part of the result, the relation between the induced RadonNikodym derivative and the original derivative is made explicit.
Recall that if P is a probability measure on a measurable space (Ω, B ) and
if F is a subσ ﬁeld of B , then the restriction PF of P to F is the probability
measure on the measurable space (Ω, F ) deﬁned by PF (G) = P (G), for all
G ∈ F . In other words, we can use either the probability measures on the new
space or the restrictions of the probability measures on the old space to compute
the divergence. This motivates considering the properties of divergences of
restrictions of measures, a useful generality in that it simpliﬁes proofs. The
following lemma can be viewed as a bookkeeping result relating the divergence
and the RadonNikodym derivatives in the two spaces. 84 CHAPTER 5. RELATIVE ENTROPY Lemma 5.2.4 (a) Suppose that M, P are two probability measures on a space
(Ω, B ) and that X is a measurement mapping this space into (A, A). Let PX
and MX denote the induced distributions (measures on (A, A)) and let Pσ(X )
and Mσ(X ) denote the restrictions of P and M to σ (X ), the subσ ﬁeld of B
generated by X . Then
D(PX MX ) = D(Pσ(X ) Mσ(X ) ).
If the RadonNikodym derivative f = dPX /dMX exists (e.g., the above divergence is ﬁnite), then deﬁne the function f (X ) : Ω → [0, ∞) by
f (X )(ω ) = f (X (ω )) = dPX
(X (ω ));
dMX then with probability 1 under both M and P
f (X ) = dPσ(X )
.
dMσ(X ) M . Then for any subσ ﬁeld F of B , we have that (b) Suppose that P dPF
dP
F ).
= EM (
dMF
dM
Thus the RadonNikodym derivative for the restrictions is just the conditional
expectation of the original RadonNikodym derivative.
Proof: The proof is mostly algebra: D(Pσ(X ) Mσ(X ) ) is the supremum over all
ﬁnite partitions Q with elements in σ (X ) of the relative entropy HPσ(X ) Mσ(X ) (Q).
Each element Q ∈ Q ⊂ σ (X ) corresponds to a unique set Q ∈ A via Q =
X −1 (Q ) and hence to each Q ⊂ σ (X ) there is a corresponding partition Q ⊂ A.
The corresponding relative entropies are equal, however, since
HPX MX (Q ) = Pf (Q ) ln
Q ∈Q PX (Q )
M X (Q ) P (X −1 (Q )) ln =
Q ∈Q = PX (Q) ln
Q∈Q = HPσ(X ) P (X −1 (Q ))
M (X −1 (Q )) PX (Q)
MX (Q) Mσ(X ) (Q). Taking the supremum over the partitions proves that the divergences are equal.
If the derivative is f = dPX /dMX , then f (X ) is measurable since it is a measurable function of a measurable function. In addition, it is measurable with
respect to σ (X ) since it depends on ω only through X (ω ). For any F ∈ σ (X )
there is a G ∈ A such that F = X −1 (G) and
f (X )dMσ(X ) =
F f (X )dM =
F f dMX
G 5.2. DIVERGENCE 85 from the change of variables formula (see, e.g., Lemma 4.4.7 of [50]). Thus
f (X )dMσ(X ) = PX (G) = Pσ(X ) (X −1 (G)) = Pσ(X ) (F ),
F which proves that f (X ) is indeed the claimed derivative with probability 1 under
M and hence also under P .
The variation quoted in part (b) is proved by direct veriﬁcation using iterated
expectation. If G ∈ F , then using iterated expectation we have that
EM (
G dP
F ) dMF =
dM EM (1G dP
F ) dMF
dM Since the argument of the integrand is F measurable (see, e.g., Lemma 5.3.1 of
[50]), invoking iterated expectation (e.g., Corollary 5.9.3 of [50]) yields
EM (
G dP
F ) dMF =
dM EM (1G dP
F ) dM
dM dP
) = P (G) = PF (G),
dM
proving that the conditional expectation is the claimed derivative.
2
Part (b) of the lemma was pointed out to the author by Paul Algoet.
Having argued above that restrictions of measures are useful when ﬁnding
divergences of random variables, we provide a key trick for treating such restrictions.
= E (1G Lemma 5.2.5 Let M
P be two measures on a space (Ω, B ). Suppose that
F is a subσ ﬁeld and that PF and MF are the restrictions of P and M to F
Then there is a measure S such that M
S
P and
dP
dP/dM
=
,
dS
dPF /dMF
dPF
dS
=
,
dM
dMF
and
D(P S ) + D(PF MF ) = D(P M ). (5.6) Proof: If M
P , then clearly MF
PF and hence the appropriate RadonNikodym derivatives exist. Deﬁne the set function S by
S (F ) =
F dPF
dM =
dMF EM (
F dP
F ) dM,
dM using part (b) of the previous lemma. Thus M
S and dS/dM = dPF /dMF .
Observe that for F ∈ F , iterated expectation implies that
S (F ) = EM (EM (1F dP
F ))
dM dP
)
dM
P (F ) = PF (F ); F ∈ F = EM (1F
= 86 CHAPTER 5. RELATIVE ENTROPY and hence in particular that S (Ω) is 1 so that dPF /dMF is integrable and S is
indeed a probability measure on (Ω, B ). (In addition, the restriction of S to F
is just PF .) Deﬁne
dP/dM
g=
.
dPF /dMF
This is well deﬁned since with M probability 1, if the denominator is 0, then
so is the numerator. Given F ∈ B the RadonNikodym theorem (e.g., Theorem
5.6.1 of [50]) implies that
gdS = 1F g F that is, P dS
dM =
dM 1F dP/dM
dPF /dMF dM = P (F ),
dPF /dMF S and
dP/dM
dP
=
,
dS
dPF /dMF proving the ﬁrst part of the lemma. The second part follows by direct veriﬁcation:
D (P M ) = ln dP
dP
dM dPF
dP/dM
dP + ln
dP
dMF
dPF /dMF
dPF
dP
dP
=
ln
dPF + ln
dMF
dS
= D(PF MF ) + D(P S ).
= ln 2
The two previous lemmas and the divergence inequality immediately yield
the following result for M
P . If M does not dominate P , then the result is
trivial.
Corollary 5.2.2 Given two measures M, P on a space (Ω, B ) and a subσ ﬁeld
F of B , then
D(P M ) ≥ D(PF MF ).
If f is a measurement on the given space, then
D(P M ) ≥ D(Pf Mf ).
The result is obvious for ﬁnite ﬁelds F or ﬁnite alphabet measurements f
from the deﬁnition of divergence. The general result for arbitrary measurable
functions could also have been proved by combining the corresponding ﬁnite
alphabet result of Corollary 2.3.1 and an approximation technique. As above,
however, we will occasionally get results comparing the divergences of measures
and their restrictions by combining the trick of Lemma 5.2.5 with a result for a
single divergence.
The following corollary follows immediately from Lemma 5.2.2 since the
union of a sequence of asymptotically generating subσ ﬁelds is a generating
ﬁeld. 5.2. DIVERGENCE 87 Corollary 5.2.3 Suppose that M, P are probability measures on a measurable
space (Ω, B ) and that Fn is an asymptotically generating sequence of subσ ﬁelds
and let Pn and Mn denote the restrictions of P and M to Fn (e.g., Pn = PFn ).
Then
D(Pn Mn ) ↑ D(P M ).
There are two useful special cases of the above corollary which follow immediately by specifying a particular sequence of increasing subσ ﬁelds. The
following two corollaries give these results.
Corollary 5.2.4 Let M, P be two probability measures on a measurable space
(Ω, B ). Suppose that f is an Avalued measurement on the space. Assume that
qn : A → An is a sequence of measurable mappings into ﬁnite sets An with
the property that the sequence of ﬁelds Fn = F (qn (f )) generated by the sets
−
{qn 1 (a); a ∈ An } asymptotically generate σ (f ). (For example, if the original
space is standard let Fn be a basis and let qn map the points in the ith atom of
Fn into i.) Then
D(Pf Mf ) = lim D(Pqn (f ) Mqn (f ) ).
n→∞ The corollary states that the divergence between two distributions of a random variable can be found as a limit of quantized versions of the random variable. Note that the limit could also be written as
lim HPf n→∞ Mf (qn ). In the next corollary we consider increasing sequences of random variables
instead of increasing sequences of quantizers, that is, more random variables
(which need not be ﬁnite alphabet) instead of ever ﬁner quantizers. The corollary follows immediately from Corollary 5.2.3 and Lemma 5.2.4.
Corollary 5.2.5 Suppose that M and P are measures on the sequence space
corresponding to outcomes of a sequence of random variables X0 , X1 , · · · with
alphabet A. Let Fn = σ (X0 , · · · , Xn−1 ), which asymptotically generates the
σ ﬁeld σ (X0 , X1 , · · · ). Then
lim D(PX n MX n ) = D(P M ). n→∞ We now develop two fundamental inequalities involving entropy densities
and divergence. The ﬁrst inequality is from Pinsker [126]. The second is an
improvement of an inequality of Pinsker [126] by Csisz´r [24] and Kullback [92].
a
The second inequality is more useful when the divergence is small. Coupling
these inequalities with the trick of Lemma 5.2.5 provides a simple generalization
of an inequality of [48] and will provide easy proofs of L1 convergence results
for entropy and information densities. A key step in the proof involves a notion
of distance between probability measures and is of interest in its own right. 88 CHAPTER 5. RELATIVE ENTROPY Given two probability measures M, P on a common measurable space (Ω, B ),
the variational distance between them is deﬁned by
d(P, M ) ≡ sup
Q P (Q) − M (Q),
Q∈Q where the supremum is over all ﬁnite measurable partitions. We will proceed by
stating ﬁrst the end goal, the two inequalities involving divergence, as a lemma,
and then state two lemmas giving the basic required properties of the variational
distance. The lemmas will be proved in a diﬀerent order.
Lemma 5.2.6 Let P and M be two measures on a common probability space
(Ω, B ) with P
M . Let f = dP/dM be the RadonNikodym derivative and let
h = ln f be the entropy density. Then
D (P M ) ≤ 2
hdP ≤ D(P M ) + ,
e hdP ≤ D(P M ) + (5.7) 2D(P M ). (5.8) Lemma 5.2.7 Given two probability measures M, P on a common measurable
space (Ω, B ), the variational distance is given by
d(P, M ) = 2 sup P (F ) − M (F ). (5.9) F ∈B Furthermore, if S is a measure for which P
for example), then also
 d(P, M ) = S and M S (S = (P + M )/2, dP
dM
−
 dS
dS
dS (5.10) and the supremum in (5.9) is achieved by the set
F = {ω : dM
dP
(ω ) >
(ω )}.
dS
dS Lemma 5.2.8
d(P, M ) ≤ 2D(P M ). Proof of Lemma 5.2.7: First observe that for any set F we have for the partition
Q = {F, F c } that
d(P, M ) ≥ P (Q) − M (Q) = 2P (F ) − M (F )
Q∈Q and hence
d(P, M ) ≥ 2 sup P (F ) − M (F ).
F ∈B 5.2. DIVERGENCE 89 Conversely, suppose that Q is a partition which approximately yields the variational distance, e.g.,
P (Q) − M (Q) ≥ d(P, M ) −
Q∈Q for > 0. Deﬁne a set F as the union of all of the Q in Q for which P (Q) ≥ M (Q)
and we have that
P (Q) − M (Q) = P (F ) − M (F ) + M (F c ) − P (F c ) = 2(P (F ) − M (F ))
Q∈Q and hence
d(P, M ) − ≤ sup 2P (F ) − M (F ).
F ∈B Since is arbitrary, this proves the ﬁrst statement of the lemma.
Next suppose that a measure S dominating both P and M exists and deﬁne
the set
dM
dP
(ω ) >
(ω )}
F = {ω :
dS
dS
and observe that
 dP
dM
−
 dS
dS
dS dP
dM
dP
dM
−
) dS −
(
−
) dS
dS
dS
F dS
F c dS
= P (F ) − M (F ) − (P (F c ) − M (F c )) = = ( 2(P (F ) − M (F )). From the deﬁnition of F , however,
P (F ) =
F dP
dS ≥
dS dM
dS = M (F )
dS F so that P (F ) − M (F ) = P (F ) − M (F ). Thus we have that
 dP
dM
−
 dS = 2P (F ) − M (F ) ≤ 2 sup P (G) − M (G) = d(P, M ).
dS
dS
G∈B To prove the reverse inequality, assume that Q approximately yields the variational distance, that is, for > 0 we have
$$\sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| \geq d(P, M) - \epsilon.$$
Then
$$d(P, M) - \epsilon \leq \sum_{Q \in \mathcal{Q}} |P(Q) - M(Q)| = \sum_{Q \in \mathcal{Q}} \left| \int_Q \left( \frac{dP}{dS} - \frac{dM}{dS} \right) dS \right| \leq \sum_{Q \in \mathcal{Q}} \int_Q \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS = \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS,$$
which, since $\epsilon$ is arbitrary, proves that
$$d(P, M) \leq \int \left| \frac{dP}{dS} - \frac{dM}{dS} \right| dS.$$
Combining this with the earlier inequality proves (5.10). We have already seen that this upper bound is actually achieved with the given choice of $F$, which completes the proof of the lemma. □
P since the result is trivial otherwise
because the righthand side is inﬁnite. The inequality will follow from the ﬁrst
statement of Lemma 5.2.7 and the following inequality: Given 1 ≥ p, m ≥ 0,
p ln 1−p
p
+ (1 − p) ln
− 2(p − m)2 ≥ 0.
m
1−m (5.11) To see this, suppose the truth of (5.11). Since F can be chosen so that 2(P (F ) −
M (F )) is arbitrarily close to d(P, M ), given > 0 choose a set F such that
[2(P (F ) − M (F ))]2 ≥ d(P, M )2 − 2 . Since {F, F c } is a partition,
D(P M ) −
≥ P (F ) ln d(P, M )2
2 P (F )
1 − P (F )
+ (1 − P (F )) ln
− 2(P (F ) − M (F ))2 − .
M (F )
1 − M (F ) If (5.11) holds, then the righthand side is bounded below by − , which proves
the lemma since is arbitrarily small. To prove (5.11) observe that the lefthand side equals zero for p = m, has a negative derivative with respect to m
for m < p, and has a positive derivative with respect to m for m > p. (The
derivative with respect to m is (m − p)[1 − 4m(1 − m)]/[m(1 − m).) Thus the
left hand side of (5.11) decreases to its minimum value of 0 as m tends to p from
above or below.
2
Proof of Lemma 5.2.6: The magnitude entropy density can be written as
h(ω ) = h(ω ) + 2h(ω )− (5.12) where a− = −min(a, 0). This inequality immediately gives the trivial lefthand
inequality of (5.7). The righthand inequality follows from the fact that
h− dP = f [ln f ]− dM and the elementary inequality a ln a ≥ −1/e.
The second inequality will follow from (5.12) if we can show that
2 h− dP ≤ 2D(P M ). Let F denote the set {h ≤ 0} and we have from (5.4) that
2 h− dP = −2 hdP ≤ −2P (F ) ln
F P (F )
M (F ) 5.2. DIVERGENCE 91 and hence using the inequality ln x ≤ x − 1 and Lemma 5.2.7
2 h− dP ≤ 2P (F ) ln M (F )
≤ 2(M (F ) − P (F ))
P (F ) ≤ d(P, M ) ≤ 2D(P M ), completing the proof.
2
Combining Lemmas 5.2.6 and 5.2.5 yields the following corollary, which generalizes Lemma 2 of [54].
Corollary 5.2.6 Let P and M be two measures on a space (Ω, B ). Suppose
that F is a subσ ﬁeld and that PF and MF are the restrictions of P and M
to F . Assume that M
P . Deﬁne the entropy densities h = ln dP/dM and
h = ln dPF /dMF . Then
2
h − h  dP ≤ D(P M ) − D(PF MF ) + ,
e (5.13) and
h − h  dP ≤ D(P M )−
D ( PF M F ) + 2D(P M ) − 2D(PF MF ). (5.14) Proof: Choose the measure S as in Lemma 5.2.5 and then apply Lemma 5.2.6
with S replacing M .
2 Variational Description of Divergence
As in the discrete case, divergence has a variational characterization that is a
fundamental property for its applications to large deviations theory [145] [31].
We again take a detour to state and prove the property without delving into its
applications.
Suppose now that P and M are two probability measures on a common
probability space, say (Ω, B ), such that M
P and hence the density
f= dP
dM is well deﬁned. Suppose that Φ is a realvalued random variable deﬁned on the
same space, which we explicitly require to be ﬁnitevalued (it cannot assume ∞
as a value) and to have ﬁnite cumulant generating function:
EM (eΦ ) < ∞. 92 CHAPTER 5. RELATIVE ENTROPY Then we can deﬁne a probability measure M φ by
M Φ (F ) =
F eΦ
dM
EM (eΦ ) and observe immediately that by construction M (5.15)
M Φ and dM Φ
eΦ
=
.
dM
EM (eΦ )
The measure M Φ is called a “tilted” distribution. Furthermore, by construction
dM Φ /dM = 0 and hence we can write F f
dQ =
φ /E (eΦ )
e
M and hence P F dM Φ
f
dM =
eφ /EM (eΦ ) dM f dM = P (F )
F M Φ and
f
dP
=φ
.
Φ
dM
e /EM (eΦ ) We are now ready to state and prove the principal result of this section, a
variational characterization of divergence.
Theorem 5.2.1 Suppose that M P . Then D(P M ) = sup EP Φ − ln(EM (eΦ )) , (5.16) Φ where the supremum is over all random variables Φ for which Φ is ﬁnitevalued
and eΦ is M integrable.
Proof: First consider the random variable Φ deﬁned by Φ = ln f and observe
that
EP Φ − ln(EM (eΦ )) = dP ln f − ln( = D(P M ) − ln dM f )
dP = D(P M ). This proves that the supremum over all Φ is no smaller than the divergence. To
prove the other half observe that for any Φ,
H (P M ) − EP Φ − ln EM (eΦ ) = EP ln dP/dM
dP/dM Φ where M Φ is the tilted distribution constructed above. Since M
we have from the chain rule for RadonNikodym derivatives that
H (P M ) − EP Φ − ln EM (eΦ ) = EP ln ,
MΦ P, dP
= D(P M Φ ) ≥ 0
dM Φ from the divergence inequality, which completes the proof. Note that equality
holds and the supremum is achieved if and only if M Φ = P .
2 5.3. CONDITIONAL RELATIVE ENTROPY 5.3 93 Conditional Relative Entropy Lemmas 5.2.4 and 5.2.5 combine with basic properties of conditional probability
in standard spaces to provide an alternative form of Lemma 5.2.5 in terms of
random variables that gives an interesting connection between the densities for
combinations of random variables and those for individual random variables.
The results are collected in Theorem 5.3.1. First, however, several deﬁnitions are
required. Let X and Y be random variables with standard alphabets AX and AY
and σ ﬁelds BAX and BAY , respectively. Let PXY and MXY be two distributions
on (AX × AY , BAX ×AY ) and assume that MXY
PXY . Let MY and PY denote
the induced marginal distributions, e.g., MY (F ) = MXY (AX × F ). Deﬁne the
(nonnegative) densities (RadonNikodym derivatives):
fXY = dPY
dPXY
, fY =
dMXY
dMY so that
PXY (F ) fXY dMXY ; F ∈ BAX ×AY =
F PY (F ) fY dMY ; F ∈ BAY . =
F Note that MXY
PXY implies that MY
PY and hence fY is well deﬁned if
fXY is. Deﬁne also the conditional density
fX Y (xy ) = fXY (x,y )
fY (y ) ; 1; if fY (y ) > 0
otherwise. Suppose now that the entropy density
hY = ln fY
exists and deﬁne the conditional entropy density or conditional relative entropy
density by
hX Y = ln fX Y .
Again suppose that these densities exist, we (tentatively) deﬁne the conditional
relative entropy
HP M (X Y ) =
= E ln fX Y = dPXY (x, y ) ln fX Y (xy ) dMXY (x, y )fXY (x, y ) ln fX Y (xy ). if the expectation exists. Note that unlike unconditional relative entropies,
the above deﬁnition of conditional relative entropy requires the existence of
densities. Although this is suﬃcient in many of the applications and is convenient for the moment, it is not suﬃciently general to handle all the cases 94 CHAPTER 5. RELATIVE ENTROPY we will encounter. In particular, there will be situations where we wish to deﬁne a conditional relative entropy HP M (X Y ) even though it is not true that
MXY
PXY . Hence at the end of this section we will return to this question and provide a general deﬁnition that agrees with the current one when the
appropriate densities exist and that shares those properties not requiring the
existence of densities, e.g., the chain rule for relative entropy. An alternative
approach to a general deﬁnition for conditional relative entropy can be found in
Algoet [6].
The previous construction immediately yields the following lemma providing
chain rules for densities and relative entropies.
Lemma 5.3.1
fXY = fX Y fY hXY = hX Y + hY , and hence
D(PXY MXY ) = HP M (X Y ) + D(PY MY ), (5.17) or, equivalently,
HP M (X, Y ) = HP M (Y ) + HP M ( X Y ), (5.18) a chain rule for relative entropy analogous to that for ordinary entropy. Thus
if HP M (Y ) < ∞ so that the indeterminate form ∞ − ∞ is avoided, then
HP M (X Y ) = HP M (X, Y ) − HP M (Y ). Since the alphabets are standard, there is a regular version of the conditional
probabilities of X given Y under the distribution MXY ; that is, for each y ∈
B there is a probability measure MX Y (F y ); F ∈ BA for ﬁxed F ∈ BAX
MX Y (F y ) is a measurable function of y and such that for all G ∈ BAY
MXY (F × G) = E (1G (Y )MX Y (F Y )) = MX Y (F y )dMY (y ).
G ¯
Lemma 5.3.2 Given the previous deﬁnitions, deﬁne the set B ∈ BB to be the
set of y for which
fX Y (xy )dMX Y (xy ) = 1.
A ¯
Deﬁne PX Y for y ∈ B by
PX Y (F y ) = fX Y (xy )dMX Y (xy ); F ∈ BA
F 5.3. CONDITIONAL RELATIVE ENTROPY 95 and let PX Y (.y ) be an arbitrary ﬁxed probability measure on (A, BA ) for all
¯
¯
y ∈ B . Then MY (B ) = 1, PX Y is a regular conditional probability for X given
Y under the distribution PXY , and
PX Y MX Y ; MY − a.e., that is, MY ({y : PX Y (·y )
MX Y (·y )}) = 1. Thus if PXY
MXY , we
can choose regular conditional probabilities under both distributions so that with
probability one under MY the conditional probabilities under P are dominated
by those under M and
dPX Y (·y )
dPX Y
(xy ) ≡
(x) = fX Y (xy ); x ∈ A.
dMX Y
dMX Y (·y )
Proof: Deﬁne for each y ∈ B the set function
fX Y (xy )dMX Y (xy ); F ∈ BA . Gy (F ) =
F We shall show that Gy (F ), y ∈ B , F ∈ BA is a version of a regular conditional
probability of X given Y under PXY . First observe using iterated expectation and the fact that conditional expectations are expectations with respect to
conditional probability measures ([50], Section 5.9) that for any F ∈ BB
fX Y (xy )dMX Y (xy )] dMY (y ) [
F A = E (1F (Y )E [1A (X )fX Y Y ]) = E (1F (Y )1A (X )
= 1
1{fY >0} fXY dMXY =
fY
1
1
1{fY >0} dPY
dPY ,
fY
fY
F 1A×F =
F A×F fXY
1fY >0 )
fY 1
1{fY >0} dPXY
fY where the last step follows since since the function being integrated depends only
on Y and hence is measurable with respect to σ (Y ) and therefore its expectation
can be computed from the restriction of PXY to σ (Y ) (see, for example, Lemma
5.3.1 of [50]) and since PY (fY > 0) = 1. We can compute this last expectation,
however, using MY as F 1
dPY =
fY F 1
fY dMY =
fY dMY = MY (F )
F which yields ﬁnally that
fX Y (xy ) dMX Y (xy )] dMY (y ) = MY (F ); all F ∈ BB . [
F A If
1dMY (y ), all F ∈ BB , g (y )dMY (y ) =
F F 96 CHAPTER 5. RELATIVE ENTROPY however, it must also be true that g = 1 MY a.e. (See, for example, Corollary
5.3.1 of [50].) Thus we have MY a.e. and hence also PY a.e. that
fX Y (xy )dMX Y (xy )]dMY (y ) = 1;
A ¯
¯
that is, MY (B ) = 1. For y ∈ B , it follows from the basic properties of integration
that Gy is a probability measure on (A, BA ) (see Corollary 4.4.3 of [50]).
¯
By construction, PX Y (·y )
MX Y (·y ) for all y ∈ B and hence this is true
with probability 1 under MY and PY . Furthermore, by construction
dPX Y (·y )
(x) = fX Y (xy ).
dMX Y (·y )
To complete the proof we need only show that PX Y is indeed a version of the
conditional probability of X given Y under PXY . To do this, ﬁx G ∈ BA and
observe for any F ∈ BB that
PX Y (Gy ) dPY (y ) = F = G fX Y (xy ) dMX Y (xy )]fY (y ) dMY (y ) [
F = fX Y (xy )dMX Y (xy )] dPY (y ) [
F G E [1F (Y )fY E [1G (X )fX Y Y ] = EM [1G×F fXY ], again using iterated expectation. This immediately yields
PX Y (Gy ) dPY (y ) = dPXY = PXY (G × F ), fXY dMXY =
G×F F G ×F which proves that PX Y (Gy ) is a version of the conditional probability of X
given Y under PXY , thereby completing the proof.
2
Theorem 5.3.1 Given the previous definitions with $M_{XY} \gg P_{XY}$, define the distribution $S_{XY}$ by
$$S_{XY}(F\times G) = \int_G M_{X|Y}(F|y)\,dP_Y(y); \qquad (5.19)$$
that is, $S_{XY}$ has $P_Y$ as marginal distribution for $Y$ and $M_{X|Y}$ as the conditional distribution of $X$ given $Y$. Then the following statements are true:

1. $M_{XY} \gg S_{XY} \gg P_{XY}$.
2. $dS_{XY}/dM_{XY} = f_Y$ and $dP_{XY}/dS_{XY} = f_{X|Y}$.
3. $D(P_{XY}\|M_{XY}) = D(P_Y\|M_Y) + D(P_{XY}\|S_{XY})$, and hence $D(P_{XY}\|M_{XY})$ exceeds $D(P_Y\|M_Y)$ by an amount $D(P_{XY}\|S_{XY}) = H_{P\|M}(X|Y)$.

Proof: To apply Lemma 5.2.5 define $P = P_{XY}$, $M = M_{XY}$, $\mathcal{F} = \sigma(Y)$, $P' = P_{\sigma(Y)}$, and $M' = M_{\sigma(Y)}$. Define $S$ by
$$S(F\times G) = \int_{F\times G} \frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\,dM_{XY}$$
for $F \in \mathcal{B}_A$ and $G \in \mathcal{B}_B$. We begin by showing that $S = S_{XY}$; all of the properties will then follow from Lemma 5.2.5. For $F \in \mathcal{B}_{A_X}$ and $G \in \mathcal{B}_{A_Y}$,
$$S(F\times G) = \int_{F\times G} \frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\,dM_{XY} = E\Big[1_{F\times G}\,\frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\Big],$$
where the expectation is with respect to $M_{XY}$. Using Lemma 5.2.4 and iterated conditional expectation (cf. Corollary 5.9.3 of [50]) yields
$$E\Big[1_{F\times G}\,\frac{dP_{\sigma(Y)}}{dM_{\sigma(Y)}}\Big] = E\Big[1_F(X)1_G(Y)\,\frac{dP_Y}{dM_Y}(Y)\Big] = E\Big[1_G(Y)\,\frac{dP_Y}{dM_Y}(Y)\,E[1_F(X)|Y]\Big]$$
$$= E\Big[1_G(Y)\,\frac{dP_Y}{dM_Y}(Y)\,M_{X|Y}(F|Y)\Big] = \int_G M_{X|Y}(F|y)\,\frac{dP_Y}{dM_Y}(y)\,dM_Y(y) = \int_G M_{X|Y}(F|y)\,dP_Y(y),$$
proving that $S = S_{XY}$. Thus Lemma 5.2.5 implies that $M_{XY} \gg S_{XY} \gg P_{XY}$, proving the first property.

From Lemma 5.2.4, $dP'/dM' = dP_{\sigma(Y)}/dM_{\sigma(Y)} = dP_Y/dM_Y = f_Y$, proving the first equality of property 2. This fact and the first property imply the second equality of property 2 from the chain rule of Radon-Nikodym derivatives. (See, e.g., Lemma 5.7.3 of [50].) Alternatively, the second equality of the second property follows from Lemma 5.2.5 since
$$\frac{dP_{XY}}{dS_{XY}} = \frac{dP_{XY}/dM_{XY}}{dS_{XY}/dM_{XY}} = \frac{f_{XY}}{f_Y}.$$
Corollary 5.2.1 therefore implies that $D(P_{XY}\|M_{XY}) = D(P_{XY}\|S_{XY}) + D(S_{XY}\|M_{XY})$, which with property 2, Lemma 5.2.3, and the definition of conditional relative entropy implies property 3. □
It should be observed that it is not necessarily true that $D(P_{XY}\|S_{XY}) \ge D(P_X\|M_X)$, and hence it need not be true that $D(P_{XY}\|M_{XY}) \ge D(P_X\|M_X) + D(P_Y\|M_Y)$ as one might expect, since in general $S_X \ne M_X$. These formulas will, however, be true in the special case where $M_{XY} = M_X \times M_Y$.
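As a quick sanity check of the decomposition in Theorem 5.3.1, the identity $D(P_{XY}\|M_{XY}) = D(P_Y\|M_Y) + D(P_{XY}\|S_{XY})$ can be verified numerically on a small finite alphabet. This sketch is not from the text; the particular pmfs are arbitrary illustrative choices.

```python
import math

# Hypothetical joint pmfs on {0,1} x {0,1}; any strictly positive choice works.
P = {(0,0): 0.30, (0,1): 0.20, (1,0): 0.10, (1,1): 0.40}
M = {(0,0): 0.25, (0,1): 0.25, (1,0): 0.25, (1,1): 0.25}

def D(p, q):
    """Divergence D(p||q) = sum p ln(p/q) over a common full support."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)

P_Y = {y: P[(0,y)] + P[(1,y)] for y in (0,1)}
M_Y = {y: M[(0,y)] + M[(1,y)] for y in (0,1)}

# S_XY has the Y-marginal of P and the conditional M_{X|Y} of M (eq. 5.19).
S = {(x,y): (M[(x,y)] / M_Y[y]) * P_Y[y] for (x,y) in P}

lhs = D(P, M)
rhs = D(P_Y, M_Y) + D(P, S)   # D(P_XY||S_XY) = H_{P||M}(X|Y)
assert abs(lhs - rhs) < 1e-12
```

The subtraction form $D(P_{XY}\|S_{XY}) = D(P_{XY}\|M_{XY}) - D(P_Y\|M_Y)$ is how the conditional term is computed in practice for finite alphabets.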
We next turn to an extension and elaboration of the theorem when there are three random variables instead of two. This will be a crucial generalization for our later considerations of processes, when the three random variables will be replaced by the current output, a finite number of previous outputs, and the infinite past.

Suppose that $M_{XYZ} \gg P_{XYZ}$ are two distributions for three standard alphabet random variables $X$, $Y$, and $Z$ taking values in measurable spaces $(A_X, \mathcal{B}_{A_X})$, $(A_Y, \mathcal{B}_{A_Y})$, and $(A_Z, \mathcal{B}_{A_Z})$, respectively. Observe that the absolute continuity implies absolute continuity for the restrictions, e.g., $M_{XY} \gg P_{XY}$ and $M_Y \gg P_Y$. Define the Radon-Nikodym derivatives $f_{XYZ}$, $f_{YZ}$, $f_Y$, etc., in the obvious way; for example,
$$f_{XYZ} = \frac{dP_{XYZ}}{dM_{XYZ}}.$$
Let $h_{XYZ}$, $h_{YZ}$, $h_Y$, etc., denote the corresponding relative entropy densities, e.g., $h_{XYZ} = \ln f_{XYZ}$. Define as previously the conditional densities
$$f_{X|YZ} = \frac{f_{XYZ}}{f_{YZ}}; \quad f_{X|Y} = \frac{f_{XY}}{f_Y},$$
the conditional entropy densities
$$h_{X|YZ} = \ln f_{X|YZ}; \quad h_{X|Y} = \ln f_{X|Y},$$
and the conditional relative entropies
$$H_{P\|M}(X|Y) = E(\ln f_{X|Y}) \quad\text{and}\quad H_{P\|M}(X|Y,Z) = E(\ln f_{X|YZ}).$$
By construction (or by double use of Lemma 5.3.1) we have the following chain rules for conditional relative entropy and its densities.

Lemma 5.3.3
$$f_{XYZ} = f_{X|YZ}\,f_{Y|Z}\,f_Z, \qquad h_{XYZ} = h_{X|YZ} + h_{Y|Z} + h_Z,$$
and hence
$$H_{P\|M}(X,Y,Z) = H_{P\|M}(X|Y,Z) + H_{P\|M}(Y|Z) + H_{P\|M}(Z).$$

Corollary 5.3.1 Given a distribution $P_{XY}$, suppose that there is a product distribution $M_{XY} = M_X \times M_Y \gg P_{XY}$. Then
$$M_{XY} \gg P_X \times P_Y \gg P_{XY},$$
$$\frac{f_{X|Y}}{f_X} = \frac{f_{XY}}{f_X f_Y} = \frac{dP_{XY}}{d(P_X \times P_Y)},$$
$$\frac{d(P_X \times P_Y)}{dM_{XY}} = f_X f_Y,$$
$$H_{P\|M}(X|Y) = D(P_{XY}\|P_X \times P_Y) + H_{P\|M}(X),$$
and
$$D(P_X \times P_Y\|M_{XY}) = H_{P\|M}(X) + H_{P\|M}(Y).$$

Proof: First apply Theorem 5.3.1 with $M_{XY} = M_X \times M_Y$. Since $M_{XY}$ is a product measure, $M_{X|Y} = M_X$ and $M_{XY} \gg S_{XY} = M_X \times P_Y \gg P_{XY}$ from the theorem. Next we again apply Theorem 5.3.1, but this time the roles of $X$ and $Y$ in the theorem are reversed, and we replace $M_{XY}$ in the theorem statement by the current $S_{XY} = M_X \times P_Y$ and replace $S_{XY}$ in the theorem statement by
$$S'_{XY}(F\times G) = \int_F S_{Y|X}(G|x)\,dP_X(x) = P_X(F)\,P_Y(G);$$
that is, $S'_{XY} = P_X \times P_Y$. We then conclude from the theorem that $S_{XY} \gg P_X \times P_Y \gg P_{XY}$, proving the first statement. We now have that
$$M_{XY} = M_X \times M_Y \gg P_X \times P_Y \gg P_{XY},$$
and hence the chain rule for Radon-Nikodym derivatives (e.g., Lemma 5.7.3 of [50]) implies that
$$f_{XY} = \frac{dP_{XY}}{dM_{XY}} = \frac{dP_{XY}}{d(P_X \times P_Y)}\,\frac{d(P_X \times P_Y)}{d(M_X \times M_Y)}.$$
It is straightforward to verify directly that
$$\frac{d(P_X \times P_Y)}{d(M_X \times M_Y)} = \frac{dP_X}{dM_X}\,\frac{dP_Y}{dM_Y} = f_X f_Y,$$
and hence
$$f_{XY} = \frac{dP_{XY}}{d(P_X \times P_Y)}\,f_X f_Y,$$
as claimed. Taking expectations using Lemma 5.2.3 then completes the proof (as in the proof of Corollary 5.2.1). □
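The identities of Corollary 5.3.1 can be checked numerically for a finite alphabet. The sketch below (not from the text; pmfs are arbitrary) verifies $H_{P\|M}(X|Y) = D(P_{XY}\|P_X\times P_Y) + H_{P\|M}(X)$ when the reference measure is a product.

```python
import math

def D(p, q):
    """Divergence sum p ln(p/q); pmfs share full support here."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

# Arbitrary positive joint pmf P_XY and product reference M_X x M_Y.
P  = {(0,0): 0.35, (0,1): 0.15, (1,0): 0.05, (1,1): 0.45}
MX = {0: 0.6, 1: 0.4}
MY = {0: 0.3, 1: 0.7}
M  = {(x,y): MX[x] * MY[y] for (x,y) in P}

PX = {x: P[(x,0)] + P[(x,1)] for x in (0,1)}
PY = {y: P[(0,y)] + P[(1,y)] for y in (0,1)}
PP = {(x,y): PX[x] * PY[y] for (x,y) in P}

# For a product M, Theorem 5.3.1 gives S_XY = M_X x P_Y, so
# H_{P||M}(X|Y) = D(P_XY||M_XY) - D(P_Y||M_Y).
H_X_given_Y = D(P, M) - D(PY, MY)

# Corollary 5.3.1: H(X|Y) = D(P_XY||P_X x P_Y) + H(X), with H(X) = D(P_X||M_X).
assert abs(H_X_given_Y - (D(P, PP) + D(PX, MX))) < 1e-12
```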
The corollary provides an interpretation of the product measure $P_X \times P_Y$. This measure yields independent random variables with the same marginal distributions as $P_{XY}$, which motivates calling $P_X \times P_Y$ the independent approximation or memoryless approximation to $P_{XY}$. The next corollary further enhances this name by showing that $P_X \times P_Y$ is the best such approximation in the sense of yielding the minimum divergence with respect to the original distribution.

Corollary 5.3.2 Given a distribution $P_{XY}$, let $\mathcal{M}$ denote the class of all product distributions for $XY$; that is, if $M_{XY} \in \mathcal{M}$, then $M_{XY} = M_X \times M_Y$. Then
$$\inf_{M_{XY}\in\mathcal{M}} D(P_{XY}\|M_{XY}) = D(P_{XY}\|P_X \times P_Y).$$

Proof: We need only consider those $M_{XY}$ yielding finite divergence (since if there are none, both sides of the formula are infinite and the corollary is trivially true). Then
$$D(P_{XY}\|M_{XY}) = D(P_{XY}\|P_X \times P_Y) + D(P_X \times P_Y\|M_{XY}) \ge D(P_{XY}\|P_X \times P_Y)$$
with equality if and only if $D(P_X \times P_Y\|M_{XY}) = 0$, which it will be if $M_{XY} = P_X \times P_Y$. □
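The minimization over product measures can be illustrated by brute force: sweeping a grid of candidate product measures, none should yield a smaller divergence than the independent approximation. This is an illustrative sketch with arbitrary pmfs, not a proof.

```python
import math

def D(p, q):
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

P  = {(0,0): 0.35, (0,1): 0.15, (1,0): 0.05, (1,1): 0.45}
PX = {x: P[(x,0)] + P[(x,1)] for x in (0,1)}
PY = {y: P[(0,y)] + P[(1,y)] for y in (0,1)}
best = D(P, {(x,y): PX[x] * PY[y] for (x,y) in P})

# Sweep a grid of other product measures M_X x M_Y; by Corollary 5.3.2
# none beats the independent approximation P_X x P_Y.
grid = [i / 20 for i in range(1, 20)]
for mx in grid:
    for my in grid:
        M = {(x,y): (mx if x == 0 else 1 - mx) * (my if y == 0 else 1 - my)
             for (x,y) in P}
        assert D(P, M) >= best - 1e-12
```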
Recall that given random variables $(X,Y,Z)$ with distribution $M_{XYZ}$, $X \to Y \to Z$ is a Markov chain (with respect to $M_{XYZ}$) if for any event $F \in \mathcal{B}_{A_Z}$, with probability one
$$M_{Z|YX}(F|y,x) = M_{Z|Y}(F|y).$$
If this holds, we also say that $X$ and $Z$ are conditionally independent given $Y$. Equivalently, if we define the distribution $M_{X\times Z|Y}$ by
$$M_{X\times Z|Y}(F_X\times F_Z\times F_Y) = \int_{F_Y} M_{X|Y}(F_X|y)\,M_{Z|Y}(F_Z|y)\,dM_Y(y);$$
$$F_X \in \mathcal{B}_{A_X};\ F_Z \in \mathcal{B}_{A_Z};\ F_Y \in \mathcal{B}_{A_Y};$$
then $Z \to Y \to X$ is a Markov chain if $M_{X\times Z|Y} = M_{XYZ}$. (See Section 5.10 of [50].) This construction shows that a Markov chain is symmetric in the sense that $X \to Y \to Z$ if and only if $Z \to Y \to X$. Note that for any measure $M_{XYZ}$, $X \to Y \to Z$ is a Markov chain under $M_{X\times Z|Y}$ by construction.
and relative entropies when the dominating measure is a Markov chain. It will
lead to the idea of a Markov approximation to an arbitrary distribution on
triples extending the independent approximation of the previous corollary.
Corollary 5.3.3 Given a probability space, suppose that MXY Z
PXY Z are
two distributions for a random vector (X, Y, Z ) with the property that Z → Y →
X forms a Markov chain under M . Then
MXY Z PX ×Z Y PXY Z and
fX Y Z
fX  Y (5.20) = fY Z fX Y . dPXY Z
dPX ×Z Y
dPX ×Z Y
dMXY Z (5.21) = Thus
ln dPXY Z
+ hX Y
dPX ×Z Y
dPX ×Z Y
ln
dMXY Z = hX Y Z
= hY Z + hX Y and taking expectations yields
D(PXY Z PX ×Z Y ) + HP M (X Y ) D(PX ×Z Y MXY Z ) = HP M (X Y Z) = D(PY Z MY Z ) + HP M (X Y ). 5.3. CONDITIONAL RELATIVE ENTROPY 101 Furthermore,
PX ×Z Y = PX Y PY Z , (5.22) that is,
PX ×Z Y (FX × FZ × FY ) = PX Y (FX y )dPZY (z, y ). (5.23) FY ×FZ Lastly, if Z → Y → X is a Markov chain under M , then it is also a Markov
chain under P if and only if
hX Y = hX Y Z (5.24) in which case
HP M (X Y ) = HP M (X Y Z ). (5.25) Proof: Deﬁne
g (x, y, z ) = fX Y Z (xy, z )
fXY Z (x, y, z ) fY (y )
=
fX Y (xy )
fY Z (y, z ) fXY (x, y ) and simplify notation by deﬁning the measure Q = PX ×Z Y . Note that Z →
Y → X is a Markov chain with respect to Q. To prove the ﬁrst statement of
the corollary requires proving the following relation:
PXY Z (FX × FY × FZ ) = gdQ;
FX ×FY ×FZ all FX ∈ BAX , FZ ∈ BAZ , FY ∈ BAY .
From iterated expectation with respect to Q (e.g., Section 5.9 of [50])
E (g 1FX (X )1FZ (Z )1FY (Y )) = E (1FY (Y )1FZ (Z )E (g 1FX (X )Y Z ))
= g (x, y, z ) dQX Y Z (xy, z )) dQY Z (y, z ). 1FY (y )1FZ (z )(
FX Since QY Z = PY Z and QX Y Z = PX Y Qa.e. by construction, the previous
formula implies that
g dQ =
FX ×FY ×FZ dPY Z
FY ×FZ gdPX Y .
FX This proves (5.22. Since MXY Z
PXY Z , we also have that MXY
hence application of Theorem 5.3.1 yields
gdQ =
FX ×FY ×FZ dPY Z
FY ×FZ = gfX Y dMX Y
FX dPY Z
FY ×FZ fX Y Z dMX Y .
FX PXY and 102 CHAPTER 5. RELATIVE ENTROPY By assumption, however, MX Y = MX Y Z a.e. and therefore
g dQ = dPY Z FX ×FY ×FZ FY ×FZ = fX Y Z dMX Y Z
FX dPY Z
FY ×FZ dPX Y Z
FX PXY Z (FX × FY × FZ ), = where the ﬁnal step follows from iterated expectation. This proves (5.20) and
that Q
PXY Z .
To prove (5.21) we proceed in a similar manner and replace g by fX Y fZY
ˆ
and replace Q by MXY Z = MX ×Y Z . Also abbreviate PX ×Y Z to P . As in the
proof of (5.20) we have since Z → Y → X is a Markov chain under M that
g dMX Y dMY Z g dQ =
FY ×FZ FX ×FY ×FZ FX fZY dMY Z = fX Y dMX Y FY ×FZ FX dPY Z =
FY ×FZ fX Y dMX Y . FX From Theorem 5.3.1 this is
PX Y (FX y ) dPY Z .
FY ×FZ ˆ
But PY Z = PY Z and
ˆ
ˆ
PX Y (FX y ) = PX Y (FX y ) = PX Y Z (FX yz )
ˆ
ˆ
since P yields a Markov chain. Thus the previous formula is P (FX × FY × FZ ),
proving (5.21) and the corresponding absolute continuity.
If Z → Y → X is a Markov chain under both M and P , then PX ×Z Y =
PXY Z and hence
fX Y Z
dPXY Z
=1=
,
dPX ×Z Y
fX Y
which implies (5.24). Conversely, if (5.24) holds, then fX Y Z = fX Y which with
(5.20) implies that PXY Z = PX ×Z Y , proving that Z → Y → X is a Markov
chain under P .
2
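The expectation identity $H_{P\|M}(X|Y,Z) = D(P_{XYZ}\|P_{X\times Z|Y}) + H_{P\|M}(X|Y)$ from the corollary can be checked on a finite alphabet when the reference measure is built as a Markov chain. The pmfs below are arbitrary illustrative choices, not from the text.

```python
import math
from itertools import product

def D(p, q):
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

def marg(p, keep):
    out = {}
    for k, v in p.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

# P arbitrary positive; M a Markov chain Z -> Y -> X: m(x,y,z) = mY mX|Y mZ|Y.
P = dict(zip(product((0, 1), repeat=3),
             [0.06, 0.09, 0.11, 0.14, 0.13, 0.07, 0.22, 0.18]))
M = {}
for (x, y, z) in P:
    mxy = 0.7 if x == y else 0.3     # M(x|y)
    mzy = 0.6 if z == y else 0.4     # M(z|y)
    M[(x, y, z)] = 0.5 * mxy * mzy   # M(y) = 0.5

PY, PXY, PYZ = marg(P, (1,)), marg(P, (0, 1)), marg(P, (1, 2))
MY, MXY, MYZ = marg(M, (1,)), marg(M, (0, 1)), marg(M, (1, 2))

# Conditional relative entropies via the chain rules of Theorem 5.3.1:
H_X_given_YZ = D(P, M) - D(PYZ, MYZ)
H_X_given_Y  = D(PXY, MXY) - D(PY, MY)

# Markov approximation Phat = P_{X x Z|Y}: p(x|y) p(y,z)  (eq. 5.22)
Phat = {(x, y, z): PXY[(x, y)] / PY[(y,)] * PYZ[(y, z)] for (x, y, z) in P}

assert abs(H_X_given_YZ - (D(P, Phat) + H_X_given_Y)) < 1e-12
```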
The previous corollary and one of the constructions used will prove important later, and hence they are emphasized now with a definition and another corollary giving an interesting interpretation.

Given a distribution $P_{XYZ}$, define the distribution $P_{X\times Z|Y}$ as the Markov approximation to $P_{XYZ}$. Abbreviate $P_{X\times Z|Y}$ to $\hat P$. The definition has two motivations. First, the distribution $\hat P$ makes $Z \to Y \to X$ a Markov chain which has the same initial distribution $\hat P_{ZY} = P_{ZY}$ and the same conditional distribution $\hat P_{X|Y} = P_{X|Y}$; the only difference is that $\hat P$ yields a Markov chain, that is, $\hat P_{X|ZY} = \hat P_{X|Y}$. The second motivation is the following corollary, which shows that of all Markov distributions, $\hat P$ is the closest to $P$ in the sense of minimizing the divergence.
Corollary 5.3.4 Given a distribution $P = P_{XYZ}$, let $\mathcal{M}$ denote the class of all distributions for $XYZ$ for which $Z \to Y \to X$ is a Markov chain under $M_{XYZ}$ (that is, $M_{XYZ} = M_{X\times Z|Y}$). Then
$$\inf_{M_{XYZ}\in\mathcal{M}} D(P_{XYZ}\|M_{XYZ}) = D(P_{XYZ}\|P_{X\times Z|Y});$$
that is, the infimum is a minimum and it is achieved by the Markov approximation.
Proof: If no $M_{XYZ}$ in the constraint set satisfies $M_{XYZ} \gg P_{XYZ}$, then both sides of the above equation are infinite. Hence confine interest to the case $M_{XYZ} \gg P_{XYZ}$. Similarly, if all such $M_{XYZ}$ yield an infinite divergence, we are done; hence we also consider only $M_{XYZ}$ yielding finite divergence. Then the previous corollary implies that $M_{XYZ} \gg P_{X\times Z|Y} \gg P_{XYZ}$ and hence
$$D(P_{XYZ}\|M_{XYZ}) = D(P_{XYZ}\|P_{X\times Z|Y}) + D(P_{X\times Z|Y}\|M_{XYZ}) \ge D(P_{XYZ}\|P_{X\times Z|Y})$$
with equality if and only if
$$D(P_{X\times Z|Y}\|M_{XYZ}) = D(P_{YZ}\|M_{YZ}) + H_{P\|M}(X|Y) = 0.$$
But this will be zero if $M$ is the Markov approximation to $P$, since then $M_{YZ} = P_{YZ}$ and $M_{X|Y} = P_{X|Y}$ by construction. □
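A brute-force check of this minimization is easy for finite alphabets: for any Markov-chain reference $M$, the divergence from $P$ decomposes through the Markov approximation $\hat P$ and is therefore at least $D(P\|\hat P)$. The sketch below uses arbitrary illustrative pmfs and a small parametrized family of Markov chains; it is a demonstration, not a proof.

```python
import math
from itertools import product

def D(p, q):
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

def marg(p, keep):
    out = {}
    for k, v in p.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

# Arbitrary positive pmf for (X, Y, Z) on {0,1}^3.
P = dict(zip(product((0, 1), repeat=3),
             [0.08, 0.05, 0.07, 0.10, 0.12, 0.18, 0.21, 0.19]))

PY, PXY, PYZ = marg(P, (1,)), marg(P, (0, 1)), marg(P, (1, 2))

# Markov approximation: Phat(x,y,z) = P(x|y) P(y,z)   (eq. 5.22/5.23)
Phat = {(x, y, z): PXY[(x, y)] / PY[(y,)] * PYZ[(y, z)] for (x, y, z) in P}
best = D(P, Phat)

# Sweep Markov chains M (Z -> Y -> X); check both the lower bound and the
# exact decomposition D(P||M) = D(P||Phat) + D(Phat||M).
for a in (0.2, 0.5, 0.8):
    for b in (0.3, 0.6):
        M = {}
        for (x, y, z) in P:
            mxy = a if x == y else 1 - a   # M(x|y)
            mzy = b if z == y else 1 - b   # M(z|y)
            M[(x, y, z)] = 0.5 * mxy * mzy
        assert D(P, M) >= best - 1e-12
        assert abs(D(P, M) - (best + D(Phat, M))) < 1e-12
```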
Generalized Conditional Relative Entropy
We now return to the issue of providing a general definition of conditional relative entropy, that is, one which does not require the existence of the densities or, equivalently, the absolute continuity of the underlying measures. We require, however, that the general definition reduce to that considered thus far when the densities exist, so that all of the results of this section will remain valid when applicable. The general definition takes advantage of the basic construction of the early part of this section. Once again let $M_{XY}$ and $P_{XY}$ be two measures, where we no longer assume that $M_{XY} \gg P_{XY}$. Define as in Theorem 5.3.1 the modified measure $S_{XY}$ by
$$S_{XY}(F\times G) = \int_G M_{X|Y}(F|y)\,dP_Y(y); \qquad (5.26)$$
that is, $S_{XY}$ has the same $Y$ marginal as $P_{XY}$ and the same conditional distribution of $X$ given $Y$ as $M_{XY}$. We now replace the previous definition by the following: the conditional relative entropy is defined by
$$H_{P\|M}(X|Y) = D(P_{XY}\|S_{XY}). \qquad (5.27)$$
If $M_{XY} \gg P_{XY}$ as before, then from Theorem 5.3.1 this is the same quantity as the original definition and there is no change. The divergence of (5.27), however, is well defined even if it is not true that $M_{XY} \gg P_{XY}$ and hence the densities used in the original definition do not exist. The key question is whether or not the chain rule
$$H_{P\|M}(Y) + H_{P\|M}(X|Y) = H_{P\|M}(X,Y) \qquad (5.28)$$
remains valid in the more general setting. It has already been proved in the case that $M_{XY} \gg P_{XY}$, hence suppose this does not hold. In this case, if it is also true that $M_Y \gg P_Y$ does not hold, then both the marginal and joint relative entropies will be infinite and (5.28) again must hold since the conditional relative entropy is nonnegative. Thus we need only show that the formula holds for the case where $M_Y \gg P_Y$ but it is not true that $M_{XY} \gg P_{XY}$. By assumption there must be an event $F$ for which
$$M_{XY}(F) = \int M_{X|Y}(F_y|y)\,dM_Y(y) = 0 \quad\text{but}\quad P_{XY}(F) = \int P_{X|Y}(F_y|y)\,dP_Y(y) \ne 0,$$
where $F_y = \{x : (x,y) \in F\}$ is the section of $F$ at $y$. Thus $M_{X|Y}(F_y|y) = 0$ $M_Y$-a.e. and hence also $P_Y$-a.e. since $M_Y \gg P_Y$. Thus
$$S_{XY}(F) = \int M_{X|Y}(F_y|y)\,dP_Y(y) = 0,$$
and hence it is not true that $S_{XY} \gg P_{XY}$; therefore $D(P_{XY}\|S_{XY}) = \infty$, which proves that the chain rule holds in the general case.

It can happen that $P_{XY}$ is not absolutely continuous with respect to $M_{XY}$ and yet $D(P_{XY}\|S_{XY}) < \infty$, in which case $S_{XY} \gg P_{XY}$,
$$H_{P\|M}(X|Y) = \int dP_{XY}\,\ln\frac{dP_{XY}}{dS_{XY}},$$
and it makes sense to define the conditional density
$$f_{X|Y} \equiv \frac{dP_{XY}}{dS_{XY}},$$
so that exactly as in the original tentative definition in terms of densities (5.17) we have that
$$H_{P\|M}(X|Y) = \int dP_{XY}\,\ln f_{X|Y}.$$
Note that this allows us to define a meaningful conditional density even though the joint density $f_{XY}$ does not exist! If the joint density does exist, then the conditional density reduces to the previous definition from Theorem 5.3.1.
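The finite case of this phenomenon can be sketched numerically. Below, a hypothetical $M_Y$ lives only on $y = 0$ while $P_Y$ charges both values, so $M_{XY}$ fails to dominate $P_{XY}$; yet the measure $S_{XY}$ of (5.26), built with an arbitrarily chosen conditional $M_{X|Y}(\cdot|1)$ off the support of $M_Y$, dominates $P_{XY}$ and the conditional relative entropy (5.27) is finite. The pmfs are arbitrary illustrative choices.

```python
import math

# P charges both y-values; M charges only y = 0, so M_XY does not dominate P_XY.
P = {(0,0): 0.30, (1,0): 0.20, (0,1): 0.25, (1,1): 0.25}
M = {(0,0): 0.50, (1,0): 0.50, (0,1): 0.00, (1,1): 0.00}
# Regular conditional M_{X|Y}; off the support of M_Y it is an arbitrary
# fixed pmf (here uniform), as permitted by the construction.
M_cond = {(0,0): 0.5, (1,0): 0.5, (0,1): 0.5, (1,1): 0.5}

assert any(M[k] == 0 and P[k] > 0 for k in P)   # no absolute continuity

PY = {y: P[(0,y)] + P[(1,y)] for y in (0,1)}
S  = {(x,y): M_cond[(x,y)] * PY[y] for (x,y) in P}   # S_XY of (5.26)

# S_XY dominates P_XY, so H_{P||M}(X|Y) = D(P_XY||S_XY) is finite.
H = sum(P[k] * math.log(P[k] / S[k]) for k in P)
assert math.isfinite(H) and H >= 0.0
```

Here $H_{P\|M}(Y) = D(P_Y\|M_Y) = \infty$, so the chain rule (5.28) holds trivially; the point of the example is only that the conditional term itself is finite and well defined.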
We summarize the generalization in the following theorem.
Theorem 5.3.2 The conditional relative entropy defined by (5.27) and (5.26) agrees with the definition (5.17) in terms of densities and satisfies the chain rule (5.28). If the conditional relative entropy is finite, then
$$H_{P\|M}(X|Y) = \int dP_{XY}\,\ln f_{X|Y},$$
where the conditional density is defined by
$$f_{X|Y} \equiv \frac{dP_{XY}}{dS_{XY}}.$$
If $M_{XY} \gg P_{XY}$, then this reduces to the usual definition $f_{X|Y} = f_{XY}/f_Y$. The generalizations can be extended to three or more random variables in the obvious manner.

5.4 Limiting Entropy Densities

We now combine several of the results of the previous section to obtain results characterizing the limits of certain relative entropy densities.
P , let Pn = PFn ,
Mn = MFn and let hn = ln dPn /dMn and h = ln dP/dM denote the entropy
densities. If D(P M ) < ∞, then
lim n→∞ hn − h dP = 0, that is, hn → h in L1 . Thus the entropy densities hn are uniformly integrable.
Proof: Follows from the Corollaries 5.2.3 and 5.2.6.
The following lemma is Lemma 1 of Algoet and Cover [7]. 2 106 CHAPTER 5. RELATIVE ENTROPY Lemma 5.4.2 Given a sequence of nonnegative random variables {fn } deﬁned
on a probability space (Ω, B , P ), suppose that
E (fn ) ≤ 1; all n.
Then
lim sup
n→∞ Proof: Given any
that 1
ln fn ≤ 0.
n > 0 the Markov inequality and the given assumption imply
P (fn > en ) ≤ E (fn )
≤ e−n .
en We therefore have that
P( 1
ln fn ≥ ) ≤ e−n
n and therefore
∞ ∞ P(
n=1 1
1
ln fn ≥ ) ≤
e−n = −1 < ∞,
n
e
n=1 Thus from the BorelCantelli lemma (Lemma 4.6.3 of [50]), P (n−1 hn ≥
= 0. Since is arbitrary, the lemma is proved. i.o.)
2 The lemma easily gives the ﬁrst half of the following result, which is also
due to Algoet and Cover [7], but the proof is diﬀerent here and does not use
martingale theory. The result is the generalization of Lemma 2.7.1.
Theorem 5.4.1 Given a probability space (Ω, B ) and an asymptotically generating sequence of subσ ﬁelds Fn , let M and P be two probability measures with
their restrictions Mn = MFn and Pn = PFn . Suppose that Mn
Pn for all n
and deﬁne fn = dPn /dMn and hn = ln fn . Then
lim sup
n→∞ 1
hn ≤ 0, M − a.e.
n and
lim inf
n→∞ If it is also true that M 1
hn ≥ 0, P − a.e..
n P (e.g., D(P M ) < ∞), then
lim n→∞ 1
hn = 0, P − a.e..
n Proof: Since
EM fn = EMn fn = 1, 5.5. INFORMATION FOR GENERAL ALPHABETS 107 the ﬁrst statement follows from the previous lemma. To prove the second statement consider the probability
P (− 1 dPn
ln
>)
n
Mn = Pn (− 1
ln fn > ) = Pn (fn < e−n )
n = dPn =
fn <e−n < fn dMn
fn <e−n e−n dMn = e−n Mn (fn < e−n ) ≤ e−n .
fn <e−n Thus it has been shown that
1
P ( hn < − ) ≤ e−n
n
and hence again applying the BorelCantelli lemma we have that
P (n−1 hn ≤ − i.o.) = 0
which proves the second claim of the theorem.
If M
P , then the ﬁrst result also holds P a.e., which with the second
result proves the ﬁnal claim.
2
Barron [9] provides an additional property of the sequence hn /n. If M
P,
then the sequence hn /n is dominated by an integrable function. 5.5 Information for General Alphabets We can now use the divergence results of the previous sections to generalize the
deﬁnitions of information and to develop their basic properties. We assume now
that all random variables and processes are deﬁned on a common underlying
probability space (Ω, B , P ). As we have seen how all of the various information
quantities–entropy, mutual information, conditional mutual information–can be
expressed in terms of divergence in the ﬁnite case, we immediately have deﬁnitions for the general case. Given two random variables X and Y , deﬁne the
average mutual information between them by
I (X ; Y ) = D(PXY PX × PY ), (5.29) where PXY is the joint distribution of the random variables X and Y and
PX × PY is the product distribution.
Deﬁne the entropy of a single random variable X by
H (X ) = I (X ; X ). (5.30) From the deﬁnition of divergence this implies that
I (X ; Y ) = sup HPXY
Q PX ×PY (Q). 108 CHAPTER 5. RELATIVE ENTROPY From Dobrushin’s theorem (Lemma 5.2.2), the supremum can be taken over
partitions whose elements are contained in generating ﬁeld. Letting the generating ﬁeld be the ﬁeld of all rectangles of the form F × G, F ∈ BAX and
G ∈ BAY , we have the following lemma which is often used as a deﬁnition for
mutual information.
Lemma 5.5.1
I (X ; Y ) = sup I (q (X ); r(Y )),
q,r where the supremum is over all quantizers q and r of AX and AY . Hence there
exist sequences of increasingly ﬁne quantizers qn : AX → An and rn : AY → Bn
such that
I (X ; Y ) = lim I (qn (X ); rn (Y )).
n→∞ Applying this result to entropy we have that
H (X ) = sup H (q (X )),
q where the supremum is over all quantizers.
By “increasingly ﬁne” quantizers is meant that the corresponding partitions
−
Qn = {qn 1 (a); a ∈ An } are successive reﬁnements, e.g., atoms in Qn are unions
of atoms in Qn+1 . (If this were not so, a new quantizer could be deﬁned for
which it was true.) There is an important drawback to the lemma (which will
shortly be removed in Lemma 5.5.5 for the special case where the alphabets
are standard): the quantizers which approach the suprema may depend on the
underlying measure PXY . In particular, a sequence of quantizers which work
for one measure need not work for another.
Given a third random variable Z , let AX , AY , and AZ denote the alphabets
of X , Y , and Z and deﬁne the conditional average mutual information
I (X ; Y Z ) = D(PXY Z PX ×Y Z ). (5.31) This is the extension of the discrete alphabet deﬁnition of (2.25) and it makes
sense only if the distribution PX ×Y Z exists, which is the case if the alphabets
are standard but may not be the case otherwise. We shall later provide an
alternative deﬁnition due to Wyner [154] that is valid more generally and equal
to the above when the spaces are standard.
Note that I (X ; Y Z ) can be interpreted using Corollary 5.3.4 as the divergence between PXY Z and its Markov approximation.
Combining these deﬁnitions with Lemma 5.2.1 yields the following generalizations of the discrete alphabet results.
Lemma 5.5.2 Given two random variables X and Y , then
I (X ; Y ) ≥ 0 5.5. INFORMATION FOR GENERAL ALPHABETS 109 with equality if and only if X and Y are independent. Given three random
variables X , Y , and Z , then
I (X ; Y Z ) ≥ 0
with equality if and only if Y → Z → X form a Markov chain.
Proof: The ﬁrst statement follow from Lemma 5.2.1 since X and Y are independent if and only if PXY = PX × PY . The second statement follows from (5.31)
and the fact that Y → Z → X is a Markov chain if and only if PXY Z = PX ×Y Z
(see, e.g., Corollary 5.10.1 of [50]).
2
The properties of divergence provide means of computing and approximating
these information measures. From Lemma 5.2.3, if I (X ; Y ) is ﬁnite, then
I (X ; Y ) = ln dPXY
dPXY
d(PX × PY ) (5.32) and if I (X ; Y Z ) is ﬁnite, then
I (X ; Y Z ) = ln dPXY Z
dPXY Z .
dPX ×Y Z (5.33) For example, if X, Y are two random variables whose distribution is absolutely continuous with respect to Lebesgue measure dxdy and hence which have
a pdf fXY (x, y ) = dPXY (xy )/dxdy , then
I (X ; Y ) = dxdyfXY (xy ) ln fXY (x, y )
,
fX (x)fY (y ) where fX and fY are the marginal pdf’s, e.g.,
fX (x) = fXY (x, y ) dy = dPX (x)
.
dx In the cases where these densities exist, we deﬁne the information densities
iX ;Y = ln dPXY
d(PX × PY )
(5.34) iX ;Y Z = ln dPXY Z
.
dPX ×Y Z The results of Section 5.3 can be used to provide conditions under which the
various information densities exist and to relate them to each other. Corollaries 5.3.1 and 5.3.2 combined with the deﬁnition of mutual information immediately yield the following two results. 110 CHAPTER 5. RELATIVE ENTROPY Lemma 5.5.3 Let X and Y be standard alphabet random variables with distribution PXY . Suppose that there exists a product distribution MXY = MX × MY
such that MXY
PXY . Then
P X × PY MXY PXY , iX ;Y = ln(fXY /fX fY ) = ln(fX Y /fX )
and
I (X ; Y ) + HP M (X ) = HP M (X Y ). (5.35) Comment: This generalizes the fact that I (X ; Y ) = H (X ) − H (X Y ) for the
ﬁnite alphabet case. The sign reversal results from the diﬀerence in deﬁnitions
of relative entropy and entropy. Note that this implies that unlike ordinary
entropy, relative entropy is increased by conditioning, at least when the reference
measure is a product measure.
The previous lemma provides an apparently more general test for the existence of a mutual information density than the requirement that PX × PY
PXY , it states that if PXY is dominated by any product measure, then it is also
dominated by the product of its own marginals and hence the densities exist.
The generality is only apparent, however, as the given condition implies from
Corollary 5.3.1 that the distribution is dominated by its independent approximation. Restating Corollary 5.3.1 in terms of mutual information yields the
following.
Corollary 5.5.1 Given a distribution PXY let M denote the collection of all
product distributions MXY = MX × MY . Then
I (X ; Y ) = inf MXY ∈M HP M ( X Y )= inf MXY ∈M D(PXY MXY ). The next result is an extension of Lemma 5.5.3 to conditional information
densities and relative entropy densities when three random variables are considered. It follows immediately from Corollary 5.3.3 and the deﬁnition of conditional information density.
Lemma 5.5.4 (The chain rule for relative entropy densities) Suppose that MXY Z
PXY Z are two distributions for three standard alphabet random variables and
that Z → Y → X is a Markov chain under MXY Z . Let fX Y Z , fX Y , hX Y Z ,
and hX Y be as in Section 5.3. Then PX ×Z Y
PXY Z ,
hX Y Z = iX ;Z Y + hX Y (5.36) and
HP M (X Y, Z ) = I (X ; Z Y ) + HP M (X Y Thus, for example,
HP M (X Y, Z ) ≥ HP M ( X Y ). ). (5.37) 5.5. INFORMATION FOR GENERAL ALPHABETS 111 As with Corollary 5.5.1, the lemma implies a variational description of conditional mutual information. The result is just a restatement of Corollary 5.3.4.
Corollary 5.5.2 Given a distribution PXY Z let M denote the class of all distributions for XY Z under which Z → Y → X is a Markov chain, then
I (X ; Z Y ) = inf MXY Z ∈M HP M (X Y, Z ) = inf MXY Z ∈M D(PXY Z MXY Z ), and the minimum is achieved by MXY Z = PX ×Z Y .
The following corollary relates the information densities of the various information measures and extends Kolmogorov’s equality to standard alphabets.
Corollary 5.5.3 (The chain rule for information densities and Kolmogorov’s
formula.) Suppose that X ,Y , and Z are random variables with standard alphabets and distribution PXY Z . Suppose also that there exists a distribution
MXY Z = MX × MY Z such that MXY Z
PXY Z . (This is true, for example, if
PX × PY Z
PXY Z .) Then the information densities iX ;Z Y , iX ;Y , and iX ;(Y Z )
exist and are related by
iX ;Z Y + iX ;Y = iX ;(Y,Z ) (5.38) I (X ; Z Y ) + I (X ; Y ) = I (X ; (Y, Z )). (5.39) and Proof: If MXY Z = MX × MY Z , then Z → Y → X is trivially a Markov chain
since MX Y Z = MX Y = MX . Thus the previous lemma can be applied to this
MXY Z to conclude that PX ×Z Y
PXY Z and that (5.36) holds. We also have
that MXY = MX × MY
PXY . Thus all of the densities exist. Applying
Lemma 5.5.3 to the product measures MXY = MX × MY and MX (Y Z ) =
MX × MY Z in (5.36) yields
iX ;Z Y = hX Y Z − hX Y = ln fX Y Z − ln fX Y
= ln fX Y Z
fX Y
− ln
= iX ;Y Z − iX ;Y .
fX
fX Taking expectations completes the proof. 2 The previous corollary implies that if PX ×PY Z
PXY Z , then also PX ×Z Y
PXY Z and PX × PY
PXY and hence that the existence of iX ;(Y,Z ) implies that
of iX ;Z Y and iX ;Y . The following result provides a converse to this fact: the
existence of the latter two densities implies that of the ﬁrst. The result is due
to Dobrushin [32]. (See also Theorem 3.6.1 of Pinsker [126] and the translator’s
comments.) 112 CHAPTER 5. RELATIVE ENTROPY Corollary 5.5.4 If PX ×Z Y
PXY Z and PX × PY
PY Z
PXY Z and
dPXY Z
dPXY
=
.
d(PX × PY Z )
d(PX × PY ) PXY , then also PX × Thus the conclusions of Corollary 5.5.3 hold.
Proof: The key to the proof is the demonstration that
dPX ×Z Y
dPXY
=
,
d(PX × PY )
d(PX × PY Z ) (5.40) which implies that PX × PY Z
PX ×Z Y . Since it is assumed that PX ×Z Y
PXY Z , the result then follows from the chain rule for RadonNikodym derivatives.
Eq. (5.40) will be proved if it is shown that for all FX ∈ BAX , FY ∈ BAY ,
and FZ ∈ BAZ ,
PX ×Z Y (FX × FZ × FY ) =
FX ×FZ ×FY dPXY
d(PX × PY Z ).
d(PX × PY ) (5.41) The thrust of the proof is the demonstration that for any measurable nonnegative function f (x, z )
f (x, y ) d(PX × PY Z )(x, y, z ) = f (x, y )PZ Y (FZ y )d(PX × PY )(x, y ). z ∈FZ (5.42)
The lemma will then follow by substituting
f (x, y ) = dPXY
(x, y )1FX (x)1FY (y )
d(PX × PY ) into (5.42) to obtain (5.41).
To prove (5.42) ﬁrst consider indicator functions of rectangles: f (x, y ) =
1FX (x)1FY (y ). Then both sides of (5.42) equal PX (FX )PY Z (FY × FY ) from the
deﬁnitions of conditional probability and product measures. In particular, from
Lemma 5.10.1 of [50] the lefthand side is
1FX (x)1FY (y ) d(PX × PY Z )(x, y, z ) = ( 1FX dPX )( 1FY ×FZ dPY Z ) = PX (F )PY Z (FY × FZ ) z ∈FZ and the righthand side is
1FX (x)1FY (y )PZ Y (FZ y ) d(PX × PY )(x, y ) =
( 1FX (x) dPX (x))( 1FY (y )PZ Y (FZ y ) dPY (y )) = PX (F )PY Z (FY × FZ ), 5.5. INFORMATION FOR GENERAL ALPHABETS 113 as claimed. This implies (5.42) holds also for simple functions and hence also
for positive functions by the usual approximation arguments.
2
Note that Kolmogorov’s formula (5.37) gives a formula for computing conditional mutual information as
I (X ; Z Y ) = I (X ; (Y, Z )) − I (X ; Y ).
The formula is only useful if it is not indeterminate, that is, not of the form ∞−
∞. This will be the case if I (Y ; Z ) (the smaller of the two mutual informations)
is ﬁnite.
Corollary 5.2.5 provides a means of approximating mutual information by
that of ﬁnite alphabet random variables. Assume now that the random variables
X, Y have standard alphabets. For, say, random variable X with alphabet AX
there must then be an asymptotically generating sequence of ﬁnite ﬁelds FX (n)
with atoms AX (n), that is, all of the members of FX (n) can be written as unions
of disjoint sets in AX (n) and FX (n) ↑ BAX ; that is, BAX = σ ( n FX (n)). The
atoms AX (n) form a partition of the alphabet of X .
Consider the divergence result of Corollary 5.2.5. with P = PXY , M =
( n)
( n)
PX ×PY and quantizer q (n) (x, y ) = (qX (x), qY (y )). Consider the limit n → ∞.
Since FX (n) asymptotically generates BAX and FY (n) asymptotically generates
BAY and since the pair σ ﬁeld BAX ×AY is generated by rectangles, the ﬁeld
generated by all sets of the form FX × FY with FX ∈ FX (n), some n, and
FY ∈ FY (m), some m, generates BAX ×AY . Hence Corollary 5.2.5 yields the
ﬁrst result of the following lemma. The second is a special case of the ﬁrst. The
result shows that the quantizers of Lemma 5.5.1 can be chosen in a manner not
depending on the underlying measure if the alphabets are standard.
Lemma 5.5.5 Suppose that X and Y are random variables with standard al( n)
phabets deﬁned on a common probability space. Suppose that qX , n = 1, 2, · · ·
is a sequence of quantizers for AX such that the corresponding partitions asymptotically generate BAX . Deﬁne quantizers for Y similarly. Then for any distribution PXY
(n)
(n)
I (X ; Y ) = lim I (qX (X ); qY (Y ))
n→∞ and
(n) H (X ) = lim H (qX (X ));
n→∞ that is, the same quantizer sequence works for all distributions.
An immediate application of the lemma is the extension of the convexity
properties of Lemma 2.5.4 to standard alphabets.
Corollary 5.5.5 Let µ denote a distribution on a space (AX , BAX ), and let ν
be a regular conditional distribution ν (F x) = Pr(Y ∈ F X = x), x ∈ AX ,
F ∈ BAY . Let µν denote the resulting joint distribution. Let Iµν = Iµν (X ; Y )
be the average mutual information. Then Iµν is a convex
function of ν and
a convex
function of µ. 114 CHAPTER 5. RELATIVE ENTROPY Proof: Follows immediately from Lemma 5.5.5 and the ﬁnite alphabet result
Lemma 2.5.4.
2
Next consider the mutual information I (f (X ), g (Y )) for arbitrary measurable mappings f and g of X and Y . From Lemma 5.5.2 applied to the random
variables f (X ) and g (Y ), this mutual information can be approximated arbitrarily closely by I (q1 (f (X )); q2 (g (Y ))) by an appropriate choice of quantizers
q1 and q2 . Since the composition of q1 and f constitutes a ﬁnite quantization
of X and similarly q2 g is a quantizer for Y , we must have that
I (f (X ); g (Y )) ≈ I (q1 (f (X )); q2 (g (Y )) ≤ I (X ; Y ).
Making this precise yields the following corollary.
Corollary 5.5.6 If f is a measurable function of X and g is a measurable
function of Y , then
I (f (X ), g (Y )) ≤ I (X ; Y ).
The corollary states that mutual information is reduced by any measurable mapping, whether finite or not. For practice we point out another proof of this basic result that directly applies a property of divergence. Let P = P_{XY}, M = P_X × P_Y, and define the mapping r(x, y) = (f(x), g(y)). Then from Corollary 5.2.2 we have

I(X; Y) = D(P‖M) ≥ D(P_r‖M_r) ≥ D(P_{f(X),g(Y)}‖M_{f(X),g(Y)}).

But M_{f(X),g(Y)} = P_{f(X)} × P_{g(Y)} since

M_{f(X),g(Y)}(F_X × F_Y) = M(f^{−1}(F_X) × g^{−1}(F_Y)) = P_X(f^{−1}(F_X)) P_Y(g^{−1}(F_Y)) = P_{f(X)}(F_X) P_{g(Y)}(F_Y).

Thus the previous inequality yields the corollary. □
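A small randomized finite-alphabet check of the corollary (the joint pmf and the maps f, g below are assumptions for the example): arbitrary mappings of each coordinate can only reduce mutual information.

```python
import math, random

# Randomized check (assumed example): for a random joint pmf on {0..7}^2
# and arbitrary maps f, g into {0,1,2}, I(f(X); g(Y)) <= I(X; Y).
random.seed(1)

def mutual_info(p):
    px, py = {}, {}
    for (x, y), pr in p.items():
        px[x] = px.get(x, 0.0) + pr
        py[y] = py.get(y, 0.0) + pr
    return sum(pr * math.log(pr / (px[x] * py[y]))
               for (x, y), pr in p.items() if pr > 0)

w = [[random.random() for _ in range(8)] for _ in range(8)]
tot = sum(map(sum, w))
joint = {(x, y): w[x][y] / tot for x in range(8) for y in range(8)}

f = [random.randrange(3) for _ in range(8)]   # f: {0..7} -> {0,1,2}
g = [random.randrange(3) for _ in range(8)]
mapped = {}
for (x, y), pr in joint.items():
    key = (f[x], g[y])
    mapped[key] = mapped.get(key, 0.0) + pr

I_xy = mutual_info(joint)
I_fg = mutual_info(mapped)
```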
For the remainder of this section we focus on conditional entropy and information. Although we cannot express mutual information as a difference of ordinary entropies in the general case (since the entropies of nondiscrete random variables are generally infinite), we can obtain such a representation in the case where one of the two variables is discrete. Suppose we are given a joint distribution P_{XY} and that X is discrete. We can choose a version of the conditional probability given Y so that p_{X|Y}(x|y) = P(X = x|Y = y) is a valid pmf (considered as a function of x for fixed y) with P_Y probability 1. (This follows from Corollary 5.8.1 of [50] since the alphabet of X is discrete; the alphabet of Y need not even be standard.) Define

H(X|Y = y) = Σ_x p_{X|Y}(x|y) ln (1 / p_{X|Y}(x|y))

and

H(X|Y) = ∫ H(X|Y = y) dP_Y(y).

Note that this agrees with the formula of Section 2.5 in the case that both alphabets are finite. The following result is due to Wyner [154].
Lemma 5.5.6 If X, Y are random variables and X has a finite alphabet, then

I(X; Y) = H(X) − H(X|Y).
Proof: We first claim that p_{X|Y}(x|y)/p_X(x) is a version of dP_{XY}/d(P_X × P_Y). To see this observe that for F ∈ B(A_X × A_Y), letting F_y denote the section {x : (x, y) ∈ F}, we have that

∫_F ( p_{X|Y}(x|y)/p_X(x) ) d(P_X × P_Y) = ∫ ( Σ_{x∈F_y} ( p_{X|Y}(x|y)/p_X(x) ) p_X(x) ) dP_Y(y)
= ∫ ( Σ_{x∈F_y} p_{X|Y}(x|y) ) dP_Y(y) = ∫ P_{X|Y}(F_y|y) dP_Y(y) = P_{XY}(F).

Thus

I(X; Y) = ∫ ln( p_{X|Y}(x|y)/p_X(x) ) dP_{XY} = H(X) + ∫ dP_Y(y) Σ_x p_{X|Y}(x|y) ln p_{X|Y}(x|y) = H(X) − H(X|Y). □
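The identity of Lemma 5.5.6 is easy to verify numerically on a small joint pmf (the pmf below is an assumption for the example), computing I(X; Y), H(X), and H(X|Y) each directly from their definitions, in nats.

```python
import math

# Assumed toy joint pmf on {0,1}^2; check I(X;Y) = H(X) - H(X|Y) in nats.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.15, (1, 1): 0.35}

px, py = {}, {}
for (x, y), pr in joint.items():
    px[x] = px.get(x, 0.0) + pr
    py[y] = py.get(y, 0.0) + pr

H_X = -sum(p * math.log(p) for p in px.values())
# H(X|Y) = -sum_{x,y} p(x,y) ln p(x|y), with p(x|y) = p(x,y)/p(y)
H_X_given_Y = -sum(pr * math.log(pr / py[y]) for (x, y), pr in joint.items())
I_XY = sum(pr * math.log(pr / (px[x] * py[y])) for (x, y), pr in joint.items())
```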
We now wish to study the effects of quantizing on conditional information. As discussed in Section 2.5, it is not true that I(X; Y|Z) is always greater than I(f(X); q(Y)|r(Z)) for quantizers f, q, r, and hence I(X; Y|Z) cannot in general be written as a supremum over all quantizers; thus the definition (5.31) and the formula (5.33) do not have the intuitive counterpart of a limit of informations of quantized values. We now consider an alternative (and more general) definition of conditional mutual information due to Wyner [154]. The definition has the form of a supremum over quantizers and does not require the existence of the conditional probability measure, and hence it makes sense for alphabets that are not standard. Given P_{XYZ} and any finite measurements f and g on X and Y, we can choose a version of the conditional probability given Z = z so that

p_z(a, b) = Pr(f(X) = a, g(Y) = b|Z = z)

is a valid pmf with probability 1 (since the alphabets of f and g are finite and hence standard, a regular conditional probability exists from Corollary 5.8.1 of [50]). For such finite measurements we can define

I(f(X); g(Y)|Z = z) = Σ_{a∈A_f} Σ_{b∈A_g} p_z(a, b) ln [ p_z(a, b) / ( (Σ_{a′} p_z(a′, b)) (Σ_{b′} p_z(a, b′)) ) ],

that is, the ordinary discrete average mutual information with respect to the distribution p_z.
Lemma 5.5.7 Define

Î(X; Y|Z) = sup_{f,g} ∫ dP_Z(z) I(f(X); g(Y)|Z = z),

where the supremum is over all quantizers f of X and g of Y. Then there exist sequences of quantizers q_m, r_m (as in Lemma 5.5.5) such that

Î(X; Y|Z) = lim_{m→∞} I(q_m(X); r_m(Y)|Z).

Î satisfies Kolmogorov's formula, that is,

Î(X; Y|Z) = I((X, Z); Y) − I(Y; Z).

If the alphabets are standard, then

Î(X; Y|Z) = I(X; Y|Z).
Comment: The main point here is that conditional mutual information can be expressed as a supremum or limit over quantizers. The other results simply point out that the two conditional mutual informations bear the same relation to ordinary mutual information and are (therefore) equal when both are defined. The proof follows Wyner [154].
Proof: First observe that for any quantizers q and r of A_f and A_g we have from the usual properties of mutual information that

I(q(f(X)); r(g(Y))|Z = z) ≤ I(f(X); g(Y)|Z = z),

and hence integrating we have that

I(q(f(X)); r(g(Y))|Z) = ∫ I(q(f(X)); r(g(Y))|Z = z) dP_Z(z) ≤ ∫ I(f(X); g(Y)|Z = z) dP_Z(z).   (5.43)

Taking the supremum over all q and r to get Î(f(X); g(Y)|Z) (the identity maps are themselves quantizers here, since f and g have finite alphabets) yields

Î(f(X); g(Y)|Z) = ∫ I(f(X); g(Y)|Z = z) dP_Z(z),   (5.44)

so that (5.43) becomes

I(q(f(X)); r(g(Y))|Z) ≤ Î(f(X); g(Y)|Z)   (5.45)

for any quantizers q and r, and the definition of Î can be expressed as

Î(X; Y|Z) = sup_{f,g} Î(f(X); g(Y)|Z),   (5.46)

where the supremum is over all quantizers f and g. This proves the first part of the lemma since the supremum can be approached by a sequence of quantizers. Next observe that

Î(f(X); g(Y)|Z) = ∫ I(f(X); g(Y)|Z = z) dP_Z(z) = H(g(Y)|Z) − H(g(Y)|f(X), Z).

Since we have from Lemma 5.5.6 that

I(g(Y); Z) = H(g(Y)) − H(g(Y)|Z),

we have by adding these equations and again using Lemma 5.5.6 that

I(g(Y); Z) + Î(f(X); g(Y)|Z) = H(g(Y)) − H(g(Y)|f(X), Z) = I((f(X), Z); g(Y)).

Taking suprema on both sides over all quantizers f and g yields the relation

I(Y; Z) + Î(X; Y|Z) = I((X, Z); Y),

proving Kolmogorov's formula. Lastly, if the spaces are standard, then Kolmogorov's formula for the original definition (which is valid for standard alphabets) combined with the above formula implies that

Î(X; Y|Z) = I((X, Z); Y) − I(Y; Z) = I(X; Y|Z).
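Kolmogorov's formula can be checked numerically on a finite alphabet (the random joint pmf below is an assumption for the example), with each side computed directly from its defining pmfs rather than via entropy identities.

```python
import math, random

# Check of Kolmogorov's formula I((X,Z); Y) = I(Y; Z) + I(X; Y|Z)
# on an assumed random joint pmf for (X, Y, Z); logs in nats.
random.seed(7)
w = {(x, y, z): random.random()
     for x in range(2) for y in range(3) for z in range(2)}
tot = sum(w.values())
p = {k: v / tot for k, v in w.items()}        # joint pmf of (X, Y, Z)

def proj(f):
    """pmf of f(x, y, z) when (x, y, z) ~ p."""
    out = {}
    for (x, y, z), pr in p.items():
        out[f(x, y, z)] = out.get(f(x, y, z), 0.0) + pr
    return out

def mi(fa, fb):
    """I(A; B) for A = fa(X,Y,Z), B = fb(X,Y,Z)."""
    pa, pb = proj(fa), proj(fb)
    pab = proj(lambda x, y, z: (fa(x, y, z), fb(x, y, z)))
    return sum(pr * math.log(pr / (pa[a] * pb[b]))
               for (a, b), pr in pab.items() if pr > 0)

I_YZ = mi(lambda x, y, z: y, lambda x, y, z: z)
I_XZ_Y = mi(lambda x, y, z: (x, z), lambda x, y, z: y)

# I(X; Y|Z) = sum_z P(Z=z) I(X; Y|Z=z), the average of per-z informations
pz = proj(lambda x, y, z: z)
I_XY_gZ = 0.0
for z0, pz0 in pz.items():
    cond = {(x, y): pr / pz0 for (x, y, zz), pr in p.items() if zz == z0}
    cx, cy = {}, {}
    for (x, y), pr in cond.items():
        cx[x] = cx.get(x, 0.0) + pr
        cy[y] = cy.get(y, 0.0) + pr
    I_XY_gZ += pz0 * sum(pr * math.log(pr / (cx[x] * cy[y]))
                         for (x, y), pr in cond.items() if pr > 0)
```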
□

5.6 Some Convergence Results

We now combine the convergence results for divergence with the definitions and properties of information densities to obtain some convergence results for information densities. Unlike the results to come for relative entropy rate and information rate, these are results involving the information between a sequence of random variables and a fixed random variable.

Lemma 5.6.1 Given random variables X and Y_1, Y_2, · · · defined on a common probability space,

lim_{n→∞} I(X; (Y_1, Y_2, · · · , Y_n)) = I(X; (Y_1, Y_2, · · · )).

If in addition I(X; (Y_1, Y_2, · · · )) < ∞ and hence P_X × P_{Y_1,Y_2,···} ≫ P_{X,Y_1,Y_2,···} (that is, the joint distribution is absolutely continuous with respect to the product distribution), then

i_{X;Y_1,Y_2,··· ,Y_n} → i_{X;Y_1,Y_2,···}  in L^1 as n → ∞.

118 CHAPTER 5. RELATIVE ENTROPY

Proof: The first result follows from Corollary 5.2.5 with X, Y_1, Y_2, · · · , Y_{n−1} replacing X^n, P being the distribution P_{X,Y_1,···}, and M being the product distribution P_X × P_{Y_1,Y_2,···}. The density result follows from Lemma 5.4.1. □
common probability space, then
lim I (X ; Y Z1 , Z2 , · · · , Zn ) = I (X ; Y Z1 , Z2 , · · · ). n→∞ If
I ((X, Z1 , · · · ); Y ) < ∞,
( e.g., if Y has a ﬁnite alphabet and hence I ((X, Z1 , · · · ); Y ) ≤ H (Y ) < ∞),
then also
iX ;Y Z1 ,··· ,Zn → iX ;Y Z1 ,···
(5.47)
n→∞ in L1 .
Proof: From Kolmogorov’s formula
I (X ; Y Z1 , Z2 , · · · , Zn ) =
I (X ; (Y, Z1 , Z2 , · · · , Zn )) − I (X ; Z1 , · · · , Zn ) ≥ 0. (5.48)
From the previous lemma, the ﬁrst term on the left converges as n → ∞ to
I (X ; (Y, Z1 , · · · )) and the second term on the right is the negative of a term converging to I (X ; (Z1 , · · · )). If the ﬁrst of these limits is ﬁnite, then the diﬀerence
in (5.6) converges to the diﬀerence of these terms, which gives (5.47). From the
chain rule for information densities, the conditional information density is the
diﬀerence of the information densities:
iX ;Y Z1 ,··· ,Zn = iX ;(Y,Z1 ,··· ,Zn ) − iX ;(Z1 ,··· ,Zn )
which is converging in L1 x to
iX ;Y Z1 ,··· = iX ;(Y,Z1 ,··· ) − iX ;(Z1 ,··· ) ,
again invoking the density chain rule. If I (X ; Y Z1 , · · · ) = ∞ then quantize Y
as q (Y ) and note since q (Y ) has a ﬁnite alphabet that
I (X ; Y Z1 , Z2 , · · · , Zn ) ≥ I (X ; q (Y )Z1 , Z2 , · · · , Zn ) → I (X ; q (Y )Z1 , · · · )
n→∞ and hence
lim inf I (X ; Y Z1 , · · · ) ≥ I (X ; q (Y )Z1 , · · · ).
N →∞ Since the righthand term above can be made arbitrarily large, the remaining
part of the lemma is proved.
2 5.6. SOME CONVERGENCE RESULTS 119 Lemma 5.6.2 If
P_X × P_{Y_1,Y_2,···} ≫ P_{X,Y_1,Y_2,···} (e.g., if I(X; (Y_1, Y_2, · · · )) < ∞), then with probability 1

lim_{n→∞} (1/n) i(X; (Y_1, · · · , Y_n)) = 0.

Proof: This is a corollary of Theorem 5.4.1. Let P denote the distribution of {X, Y_1, Y_2, · · · } and let M denote the distribution P_X × P_{Y_1,···}. By assumption M ≫ P. The information density is

i(X; (Y_1, · · · , Y_n)) = ln (dP_n/dM_n),

where P_n and M_n are the restrictions of P and M to σ(X, Y_1, · · · , Y_n). Theorem 5.4.1 can therefore be applied to conclude that, P-a.e.,

lim_{n→∞} (1/n) ln (dP_n/dM_n) = 0,

which proves the lemma. □
The lemma has the following immediate corollary.
Corollary 5.6.2 If {X_n} is a process with the property that

I(X_0; X_{−1}, X_{−2}, · · · ) < ∞,

that is, there is a finite amount of information between the zero time sample and the infinite past, then

lim_{n→∞} (1/n) i(X_0; X_{−1}, · · · , X_{−n}) = 0.

If the process is stationary, then also

lim_{n→∞} (1/n) i(X_n; X^n) = 0.
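For intuition, the finiteness hypothesis is easy to observe for a Markov source (the binary chain below is an assumed example): by the Markov property I(X_0; X_{−1}, …, X_{−n}) = I(X_0; X_{−1}) exactly for every n, so the per-symbol quantity (1/n) I(X_0; X_{−1}, …, X_{−n}) tends to 0, consistent with the corollary.

```python
import math
from itertools import product

# Assumed example: stationary binary Markov chain with transition matrix P.
# Markov property => I(X_0; X_{-1},...,X_{-n}) = I(X_0; X_{-1}) for all n.
P = [[0.8, 0.2], [0.3, 0.7]]
pi1 = P[0][1] / (P[0][1] + P[1][0])          # stationary P(X = 1) = 0.4
pi = [1.0 - pi1, pi1]

def I_past(n):
    """I(X_0; (X_{-1},...,X_{-n})) by enumerating length-(n+1) paths."""
    joint, ppast, p0 = {}, {}, {}
    for seq in product((0, 1), repeat=n + 1):  # (x_{-n}, ..., x_{-1}, x_0)
        pr = pi[seq[0]]
        for s, t in zip(seq, seq[1:]):
            pr *= P[s][t]
        past, x0 = seq[:-1], seq[-1]
        joint[(past, x0)] = joint.get((past, x0), 0.0) + pr
        ppast[past] = ppast.get(past, 0.0) + pr
        p0[x0] = p0.get(x0, 0.0) + pr
    return sum(pr * math.log(pr / (ppast[a] * p0[b]))
               for (a, b), pr in joint.items() if pr > 0)

vals = [I_past(n) for n in range(1, 8)]            # n = 1,...,7
per_symbol = [v / (i + 1) for i, v in enumerate(vals)]
```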
Chapter 6

Information Rates II
6.1 Introduction

In this chapter we develop general definitions of information rate for processes with standard alphabets and we prove a mean ergodic theorem for information densities. The L^1 results are extensions of the results of Moy [106] and Perez [124] for stationary processes, which in turn extended the Shannon-McMillan theorem from entropies of discrete alphabet processes to information densities. (See also Kieffer [86].) We also relate several different measures of information rate and consider the mutual information between a stationary process and its ergodic component function. In the next chapter we apply the results of Chapter 5 on divergence to the definitions of this chapter for limiting information and entropy rates to obtain a number of results describing the behavior of such rates. In Chapter 8 almost everywhere ergodic theorems for relative entropy and information densities are proved.

6.2 Information Rates for General Alphabets

Suppose that we are given a pair random process {X_n, Y_n} with distribution p.
The most natural definition of the information rate between the two processes is the extension of the definition for the finite alphabet case:

Ī(X; Y) = lim sup_{n→∞} (1/n) I(X^n; Y^n).
This was the first general definition of information rate and it is due to Dobrushin [32]. While this definition has its uses, it also has its problems. Another definition is more in the spirit of the definition of information itself: we formed the general definitions by taking a supremum of the finite alphabet definitions over all finite alphabet codings or quantizers. The above definition takes the limit of such suprema. An alternative definition is to instead reverse the order and take the supremum of the limit, and hence the supremum of the information rate over all finite alphabet codings of the process. This will provide a definition of information rate similar to the definition of the entropy of a dynamical system. There is a question as to what kind of codings we permit, that is, do the quantizers quantize individual outputs or long sequences of outputs? We shall shortly see that it makes no difference. Suppose that we have a pair random process {X_n, Y_n} with standard alphabets A_X and A_Y and suppose that f : A_X^∞ → A_f and g : A_Y^∞ → A_g are stationary codings of the X and Y sequence spaces into finite alphabets. We will call such finite alphabet stationary mappings sliding block codes or stationary codes. Let {f_n, g_n} be the induced output process, that is, if T denotes the shift (on any of the sequence spaces) then f_n(x, y) = f(T^n x) and g_n(x, y) = g(T^n y). Recall that f(T^n(x, y)) = f_n(x, y), that is, shifting the input n times results in the output being shifted n times.
Since the new process {f_n, g_n} has a finite alphabet, its mutual information rate is defined. We now define the information rate for general alphabets as follows:

I*(X; Y) = sup_{sliding block codes f,g} Ī(f; g) = sup_{sliding block codes f,g} lim sup_{n→∞} (1/n) I(f^n; g^n).
We now focus on AMS processes, in which case the information rate for finite alphabet processes (e.g., quantized processes) is given by the limit, that is,

I*(X; Y) = sup_{sliding block codes f,g} Ī(f; g) = sup_{sliding block codes f,g} lim_{n→∞} (1/n) I(f^n; g^n).

The following lemma shows that for AMS sources I* can also be evaluated by constraining the sliding block codes to be scalar quantizers.
Lemma 6.2.1 Given an AMS pair random process {X_n, Y_n} with standard alphabets,

I*(X; Y) = sup_{q,r} Ī(q(X); r(Y)) = sup_{q,r} lim sup_{n→∞} (1/n) I(q(X)^n; r(Y)^n),

where the supremum is over all quantizers q of A_X and r of A_Y and where q(X)^n = (q(X_0), · · · , q(X_{n−1})).
Proof: Clearly the right-hand side above is less than I* since a scalar quantizer is a special case of a stationary code. Conversely, fix ε > 0 and suppose that f and g are sliding block codes such that Ī(f; g) ≥ I*(X; Y) − ε. Then from Corollary 4.3.1 there are quantizers q and r and codes f′ and g′ depending only on the quantized processes q(X_n) and r(Y_n) such that Ī(f′; g′) ≥ Ī(f; g) − ε. From Lemma 4.3.3, however, Ī(q(X); r(Y)) ≥ Ī(f′; g′) since f′ and g′ are stationary codings of the quantized processes. Thus Ī(q(X); r(Y)) ≥ I*(X; Y) − 2ε, which proves the lemma. □
Corollary 6.2.1
¯
I ∗ (X ; Y ) ≤ I (X ; Y ).
If the alphabets are ﬁnite, then the two rates are equal.
Proof: The inequality follows from the lemma and the fact that
I (X n ; Y n ) ≥ I (q (X )n ; r(Y )n )
for any scalar quantizers q and r (where q (X )n is q (X0 ), · · · , q (Xn−1 )). If
the alphabets are ﬁnite, then the identity mappings are quantizers and yield
I (X n ; Y n ) for all n.
2
Pinsker [126] introduced the deﬁnition of information rate as a supremum
over all scalar quantizers and hence we shall refer to this information rate as
the Pinsker rate. The Pinsker deﬁnition has the advantage that we can use the
known properties of information rates for ﬁnite alphabet processes to infer those
for general processes, an attribute the ﬁrst deﬁnition lacks.
Corollary 6.2.2 Given a standard pair process alphabet A_X × A_Y, there is a sequence of scalar quantizers q_m and r_m such that for any AMS pair process {X_n, Y_n} having this alphabet (that is, for any process distribution on the corresponding sequence space)

I(X^n; Y^n) = lim_{m→∞} I(q_m(X)^n; r_m(Y)^n)

and

I*(X; Y) = lim_{m→∞} Ī(q_m(X); r_m(Y)).

Furthermore, the above limits can be taken to be increasing by using finer and finer quantizers.

Comment: It is important to note that the same sequence of quantizers gives both of the limiting results.

Proof: The first result is Lemma 5.5.5. The second follows from the previous lemma. □
Observe that

I*(X; Y) = lim_{m→∞} lim sup_{n→∞} (1/n) I(q_m(X)^n; r_m(Y)^n),

whereas

Ī(X; Y) = lim sup_{n→∞} lim_{m→∞} (1/n) I(q_m(X)^n; r_m(Y)^n).

Thus the two notions of information rate are equal if the two limits can be interchanged. We shall later consider conditions under which this is true, and we shall see that equality of these two rates is important for proving ergodic theorems for information densities.

Lemma 6.2.2 Suppose that {X_n, Y_n} is an AMS standard alphabet random
process with distribution p and stationary mean p̄. Then

I*_p(X; Y) = I*_{p̄}(X; Y).

I*_p is an affine function of the distribution p. If p̄ has ergodic decomposition p_{xy}, then

I*_{p̄}(X; Y) = ∫ dp̄(x, y) I*_{p_{xy}}(X; Y).

If f and g are stationary codings of X and Y, then

I*_p(f; g) = ∫ dp̄(x, y) I*_{p_{xy}}(f; g).
Proof: For any scalar quantizers q and r of X and Y we have that Ī_p(q(X); r(Y)) = Ī_{p̄}(q(X); r(Y)). Taking a limit with ever finer quantizers yields the first equality. The fact that I* is affine follows similarly. Suppose that p̄ has ergodic decomposition p_{xy}. Define the induced distributions of the quantized process by m and m_{xy}, that is, m(F) = p̄(x, y : {q(x_i), r(y_i); i ∈ T} ∈ F) and similarly for m_{xy}. It is easy to show that m is stationary (since it is a stationary coding of a stationary process), that the m_{xy} are stationary and ergodic (since they are stationary codings of stationary ergodic processes), and that the m_{xy} form an ergodic decomposition of m. If we let X′_n, Y′_n denote the coordinate functions on the quantized output sequence space (that is, the processes {q(X_n), r(Y_n)} and {X′_n, Y′_n} are equivalent), then using the ergodic decomposition of mutual information for finite alphabet processes (Lemma 4.3.1) we have that

Ī_{p̄}(q(X); r(Y)) = Ī_m(X′; Y′) = ∫ Ī_{m_{x′y′}}(X′; Y′) dm(x′, y′) = ∫ Ī_{p_{xy}}(q(X); r(Y)) dp̄(x, y).

Replacing the quantizers by the sequence q_m, r_m, the result then follows by taking the limit using the monotone convergence theorem. The result for stationary codings follows similarly, by applying the previous result to the induced distributions and then relating the equation to the original distributions. □
2
¯
The above properties are not known to hold for I in the general case. Thus
¯
although I may appear to be a more natural deﬁnition of mutual information
∗
rate, I is better behaved since it inherits properties from the discrete alphabet
case. It will be of interest to ﬁnd conditions under which the two rates are the
¯
same, since then I will share the properties possessed by I ∗ . The ﬁrst result of
the next section adds to the interest by demonstrating that when the two rates
are equal, a mean ergodic theorem holds for the information densities. 6.3 A Mean Ergodic Theorem for Densities Theorem 6.3.1 Given an AMS pair process {Xn , Yn } with standard alphabets,
assume that for all n
PX n × PY n
PX n Y n 6.3. A MEAN ERGODIC THEOREM FOR DENSITIES 125 and hence that the information densities
iX n ;Y n = ln dPX n ,Y n
d(PX n × PY n ) are well deﬁned. For simplicity we abbreviate iX n ;Y n to in when there is no
possibility of confusion. If the limit
lim n→∞ 1
¯
I (X n ; Y n ) = I (X ; Y )
n exists and
¯
I (X ; Y ) = I ∗ (X ; Y ) < ∞,
then n−1 in (X n ; Y n ) converges in L1 to an invariant function i(X ; Y ). If the
stationary mean of the process has an ergodic decomposition pxy , then the lim¯
iting density is I ∗ pxy (X ; Y ), the information rate of the ergodic component in
¯
eﬀect.
Proof: Let qm and rm be asymptotically accurate quantizers for AX and AY .
ˆ
ˆ
Deﬁne the discrete approximations Xn = qm (Xn ) and Yn = rm (Yn ). Observe
that PX n × PY n
PX n Y n implies that PX n × PY n
PX n Y n and hence we can
ˆ
ˆ
ˆˆ
deﬁne the information densities of the quantized vectors by
ˆn = ln
i dPXn Y n
ˆˆ
.
d(PX n × PY n )
ˆ
ˆ For any m we have that
1
 in (xn ; y n ) − I ∗ pxy (X ; Y ) dp(x, y ) ≤
¯
n
1
1
 in (xn ; y n ) − ˆn (qm (x)n ; rm (y )n ) dp(x, y )+
i
n
n
1
¯¯
 ˆn (qm (x)n ; rm (y )n ) − Ipxy (qm (X ); rm (Y )) dp(x, y )+
i
n
¯¯
Ipxy (qm (X ); rm (Y )) − I ∗ pxy (X ; Y ) dp(x, y ), (6.1)
¯
where
qm (x)n = (qm (x0 ), · · · , qm (xn−1 )),
rm (y )n = (rm (y0 ), · · · , rm (yn−1 )),
¯
and Ip (qm (X ); rm (Y )) denotes the information rate of the process {qm (Xn ), rm (Yn );
n = 0, 1, · · · , } when p is the process measure describing {Xn , Yn }.
Consider ﬁrst the rightmost term of (6.1). Since I ∗ is the supremum over
all quantized versions,
¯¯
Ipxy (qm (X ); rm (Y )) − I ∗ pxy (X ; Y ) dp(x, y ) =
¯
¯¯
(I ∗ pxy (X ; Y ) − Ipxy (qm (X ); rm (Y ))) dp(x, y ).
¯ 126 CHAPTER 6. INFORMATION RATES II ¯
Using the ergodic decomposition of I ∗ (Lemma 6.2.2) and that of I for discrete
alphabet processes (Lemma 4.3.1) this becomes
¯¯
Ipxy (qm (X ); rm (Y )) − I ∗ pxy (X ; Y ) dp(x, y ) =
¯
∗
¯
Ip (X ; Y ) − Ip (qm (X ); rm (Y )). (6.2) For ﬁxed m the middle term of (6.1) can be made arbitrarily small by taking
n large enough from the ﬁnite alphabet result of Lemma 4.3.1. The ﬁrst term on
the right can be bounded above using Corollary 5.2.6 with F = σ (q (X )n ; r(Y )n )
by
1
2
ˆˆ
(I (X n ; Y n ) − I (X n ; Y n ) + )
n
e
¯
¯
which as n → ∞ goes to I (X ; Y ) −I (qm (X ); rm (Y )). Thus we have for any m
that
lim sup
n→∞ 1
 in (xn ; y n ) − I ∗ pxy (X ; Y ) dp(x, y ) ≤
¯
n
¯
¯
¯
I (X ; Y ) − I (qm (X ); rm (Y )) + I ∗ (X ; Y ) − I (qm (X ); rm (Y )) ¯
which as m → ∞ becomes I (X ; Y ) − I ∗ (X ; Y ), which is 0 by assumption. 6.4 2 Information Rates of Stationary Processes In this section we introduce two more deﬁnitions of information rates for the
case of stationary two-sided processes. These rates are useful tools in relating the Dobrushin and Pinsker rates and they provide additional interpretations of mutual information rates in terms of ordinary mutual information. The definitions follow Pinsker [126].

Henceforth assume that {X_n, Y_n} is a stationary two-sided pair process with standard alphabets. Define the sequences y = {y_i; i ∈ T} and Y = {Y_i; i ∈ T}. First define

Ĩ(X; Y) = lim sup_{n→∞} (1/n) I(X^n; Y),

that is, consider the per-letter limiting information between n-tuples of X and the entire sequence from Y. Next define

I^−(X; Y) = I(X_0; Y | X_{−1}, X_{−2}, · · · ),

that is, the average conditional mutual information between one letter from X and the entire Y sequence given the infinite past of the X process. We could define the first rate for one-sided processes, but the second makes sense only when we can consider an infinite past. For brevity we write X^− = (X_{−1}, X_{−2}, · · · ) and hence

I^−(X; Y) = I(X_0; Y | X^−).

Theorem 6.4.1
Ĩ(X; Y) ≥ Ī(X; Y) ≥ I*(X; Y) ≥ I^−(X; Y).

If the alphabet of X is finite, then the above rates are all equal.
Comment: We will later see more general sufficient conditions for the equality of the various rates, but the case where one alphabet is finite is simple and important and points out that the rates are all equal in the finite alphabet case.

Proof: We have already proved the middle inequality. The left inequality follows immediately from the fact that I(X^n; Y) ≥ I(X^n; Y^n) for all n. The remaining inequality is more involved. We prove it in two steps. First we prove the second half of the theorem, that the rates are the same if X has a finite alphabet. We then couple this with an approximation argument to prove the remaining inequality. Suppose now that the alphabet of X is finite. Using the chain rule
and stationarity we have that

(1/n) I(X^n; Y^n) = (1/n) Σ_{i=0}^{n−1} I(X_i; Y^n | X_0, · · · , X_{i−1}) = (1/n) Σ_{i=0}^{n−1} I(X_0; Y^n_{−i} | X_{−1}, · · · , X_{−i}),

where Y^n_{−i} is (Y_{−i}, · · · , Y_{−i+n−1}), that is, the n-vector starting at −i. Since X has a finite alphabet, each term in the sum is bounded. We can show as in Section 5.5 (or using Kolmogorov's formula and Lemma 5.5.1) that each term converges, as i → ∞, n → ∞, and n − i → ∞, to I(X_0; Y | X_{−1}, X_{−2}, · · · ) = I^−(X; Y). These facts, however, imply that the above Cesàro average converges to the same limit, and hence Ī = I^−. We can similarly expand Ĩ via
(1/n) I(X^n; Y) = (1/n) Σ_{i=0}^{n−1} I(X_i; Y | X_0, · · · , X_{i−1}) = (1/n) Σ_{i=0}^{n−1} I(X_0; Y | X_{−1}, · · · , X_{−i}),

which converges to the same limit for the same reasons. Thus Ĩ = Ī = I^− for stationary processes when the alphabet of X is finite. Now suppose that X has a standard alphabet and let q_m be an asymptotically accurate sequence of quantizers. Recall that the corresponding partitions are increasing, that is, each refines the previous partition. Fix ε > 0 and choose m large enough so that the quantizer α(X_0) = q_m(X_0) satisfies

I(α(X_0); Y | X^−) ≥ I(X_0; Y | X^−) − ε.
Observe that so far we have only quantized X_0 and not the past. Since

F_m = σ(α(X_0), Y, q_m(X_{−i}); i = 1, 2, · · · )

asymptotically generates

σ(α(X_0), Y, X_{−i}; i = 1, 2, · · · ),

we can choose for m large enough (larger than before) a quantizer β(x) = q_m(x) such that, if we define β(X^−) to be (β(X_{−1}), β(X_{−2}), · · · ), then

| I(α(X_0); (Y, β(X^−))) − I(α(X_0); (Y, X^−)) | ≤ ε

and

| I(α(X_0); β(X^−)) − I(α(X_0); X^−) | ≤ ε.

Using Kolmogorov's formula this implies that

I(α(X_0); Y | X^−) − I(α(X_0); Y | β(X^−)) ≤ 2ε

and hence that

I(α(X_0); Y | β(X^−)) ≥ I(α(X_0); Y | X^−) − 2ε ≥ I(X_0; Y | X^−) − 3ε.

But the partition corresponding to β refines that of α and hence increases the information; hence

I(β(X_0); Y | β(X^−)) ≥ I(α(X_0); Y | β(X^−)) ≥ I(X_0; Y | X^−) − 3ε.

Since β(X_n) has a finite alphabet, however, from the finite alphabet result the leftmost term above must be Ī(β(X); Y), which can be made arbitrarily close to I*(X; Y). Since ε is arbitrary, this proves the final inequality. □
various information rates. The ﬁrst result is almost a special case of the second,
but it is handled separately as it is simpler, much of the proof applies to the
second case, and it is not an exact special case of the subsequent result since it
does not require the second condition of that result. The result corresponds to
condition (7.4.33) of Pinsker [126], who also provides more general conditions.
The more general condition is also due to Pinsker and strongly resembles that
considered by Barron [9].
Theorem 6.4.2 Given a stationary pair process {Xn , Yn } with standard alphabets, if
I (X0 ; (X−1 , X−2 , · · · )) < ∞,
then
˜
¯
I (X ; Y ) = I (X ; Y ) = I ∗ (X ; Y ) = I − (X ; Y ). (6.3) Proof: We have that
1
1
1
1
I (X n ; Y ) ≤ I (X n ; (Y, X − )) = I (X n ; X − ) + I (X n ; Y X − ),
n
n
n
n (6.4) where, as before, X − = {X−1 , X−2 , · · · }. Consider the ﬁrst term on the right.
Using the chain rule for mutual information
1
I (X n ; X − )
n = = 1
n
1
n n−1 I (Xi ; X − X i ) (6.5) (I (Xi ; (X i , X − )) − I (Xi ; X i )). (6.6) i=0
n−1 i=0 6.4. INFORMATION RATES OF STATIONARY PROCESSES 129 Using stationarity we have that
1
1
I (X n ; X − ) =
n
n n−1 (I (X0 ; X − ) − I (X0 ; (X−1 , · · · , X−i )). (6.7) i=0 The terms I (X0 ; (X−1 , · · · , X−i )) are converging to I (X0 ; X − ), hence the terms
in the sum are converging to 0, i.e.,
lim I (Xi ; X − X i ) = 0. i→∞ (6.8) The Ces`ro mean of (6.6) is converging to the same thing and hence
a
1
I (X n ; X − ) → 0.
n (6.9) Next consider the term I (X n ; Y X − ). For any positive integers n,m we have
m
I (X n+m ; Y X − ) = I (X n ; Y X − ) + I (Xn ; Y X − , X n ), (6.10) m
where Xn = Xn , · · · , Xn+m−1 . From stationarity, however, the rightmost term
is just I (X m ; Y X − ) and hence I (X m+n ; Y X − ) = I (X n ; Y X − ) + I (X m ; Y X − ). (6.11) This is just a linear functional equation of the form f (n + m) = f (n) + f (m)
and the unique solution to such an equation is f (n) = nf (1), that is,
1
I (X n ; Y X − ) = I (X0 ; Y X − ) = I − (X ; Y ).
n (6.12) Taking the limit supremum in (6.4) we have shown that
˜
I (X ; Y ) ≤ I − (X ; Y ), (6.13) which with Theorem 6.4.1 completes the proof.
2
Intuitively, the theorem states that if one of the processes has ﬁnite average
mutual information between one symbol and its inﬁnite past, then the Dobrushin
and Pinsker information rates yield the same value and hence there is an L1
ergodic theorem for the information density.
To generalize the theorem we introduce a condition that will often be useful when studying asymptotic properties of entropy and information. A stationary process {X_n} is said to have the finite-gap information property if there exists an integer K such that

I(X_K; X^− | X^K) < ∞,   (6.14)

where, as usual, X^− = (X_{−1}, X_{−2}, · · · ). When a process has this property for a specific K, we shall say that it has the K-gap information property. Observe that if a process possesses this property, then it follows from Lemma 5.5.4 that

I(X_K; (X_{−1}, · · · , X_{−l}) | X^K) < ∞;  l = 1, 2, · · ·   (6.15)

Since these informations are finite,

P^{(K)}_{X^n} ≫ P_{X^n};  n = 1, 2, . . . ,   (6.16)
Theorem 6.4.3 Given a stationary standard alphabet pair process {Xn , Yn }, if
{Xn } satisﬁes the ﬁnitegap information property (6.14) and if, in addition,
I (X K ; Y ) < ∞, (6.17) then (6.3) holds.
If K = 0 then there is no conditioning and (6.17) is trivial, that is, the
previous theorem is the special case with K = 0.
Comment: This theorem shows that if there is any ﬁnite dimensional future vector (XK , XK +1 , · · · , XK +N −1 ) which has ﬁnite mutual information with respect
to the inﬁnite past X − when conditioned on the intervening gap (X0 , · · · , XK −1 ),
then the various deﬁnitions of mutual information are equivalent provided that
the mutual information betwen the “gap” X K and the sequence Y are ﬁnite.
˜
Note that this latter condition will hold if, for example, I (X ; Y ) is ﬁnite.
Proof: For n > K
1
1
1
n
I (X n ; Y ) = I (X K ; Y ) + I (XK−K ; Y X K ).
n
n
n
By assumption the ﬁrst term on the left will tend to 0 as n → ∞ and hence we
focus on the second, which can be broken up analogous to the previous theorem
with the addition of the conditioning:
1
n
I (XK−K ; Y X K ) ≤
n
= 1
n
I (XK−K ; (Y, X − X K ))
n
1
1
n
n
I (XK−K ; X − X K ) + I (XK−K ; Y X − , X K ).
n
n Consider ﬁrst the term
1
1
n
I (XK−K ; X − X K ) =
n
n n−1 I (Xi ; X − X i ),
i=K which is as (6.6) in the proof of Theorem 6.4.2 except that the ﬁrst K terms
are missing. The same argument then shows that the limit of the sum is 0. The
remaining term is
1
1
n
I (XK−K ; Y X − , X K ) = I (X n ; Y X − )
n
n
exactly as in the proof of Theorem 6.4.2 and the same argument then shows
that the limit is I − (X ; Y ), which completes the proof.
2 6.4. INFORMATION RATES OF STATIONARY PROCESSES 131 One result developed in the proofs of Theorems 6.4.2 and 6.4.3 will be important later in its own right and hence we isolate it as a corollary. The result
is just (6.8), which remains valid under the more general conditions of Theorem 6.4.3, and the fact that the Ces`ro mean of converging terms has the same
a
limit.
Corollary 6.4.1 If a process {Xn } has the ﬁnitegap information property
I (XK ; X − X K ) < ∞
for some K , then
lim I (Xn ; X − X n ) = 0 n→∞ and
lim n→∞ 1
I (X n ; X − ) = 0.
n The corollary can be interpreted as saying that if a process has the the ﬁnite
gap information property, then the mutual information between a single sample
and the inﬁnite past conditioned on the intervening samples goes to zero as the
number of intervening samples goes to inﬁnity. This can be interpreted as a
form of asymptotic independence property of the process.
Corollary 6.4.2 If a onesided stationary source {Xn } is such that for some
K
K , I (Xn ; X n−K Xn−K ) is bounded uniformly in n, then it has the ﬁnitegap
property and hence
¯
I (X ; Y ) = I ∗ (X ; Y ).
Proof: Simply imbed the onesided source into a twosided stationary source
with the same probabilities on all ﬁnitedimensional events. For that source
K
I (Xn ; X n−K Xn−K ) = I (XK ; X−1 , · · · , X−n−K X K ) → I (XK ; X − X K ).
n→∞ Thus if the terms are bounded, the conditions of Theorem 6.4.2 are met for the
twosided source. The onesided equality then follows.
2
The above results have an information-theoretic implication for the ergodic decomposition, which is described in the next theorem.

Theorem 6.4.4 Suppose that {X_n} is a stationary process with the finite-gap property (6.14). Let ψ be the ergodic component function of Theorem 1.8.3 and suppose that for some n

I(X^n; ψ) < ∞.   (6.18)

(This will be the case, for example, if the finite-gap information property holds for 0 gap, that is, I(X_0; X^−) < ∞, since ψ can be determined from X^− and information is decreased by taking a function.) Then

lim_{n→∞} (1/n) I(X^n; ψ) = 0.

Comment: For discrete alphabet processes this theorem is just the ergodic decomposition of entropy rate in disguise (Theorem 2.4.1). It also follows for
finite alphabet processes from Lemma 3.3.1. We shall later prove a corresponding almost-everywhere convergence result for the corresponding densities. All of these results have the interpretation that the per-symbol mutual information between the outputs of the process and the ergodic component decreases with time because the ergodic component can, in effect, be inferred from the process output in the limit of an infinite observation sequence. The finiteness condition on some I(X^n; ψ) is necessary in the nonzero finite-gap case to avoid cases such as that where X_n = ψ for all n and hence

I(X^n; ψ) = I(ψ; ψ) = H(ψ) = ∞,

in which case the theorem does not hold.
Proof: Define ψ_n = ψ for all n. Since ψ is invariant, {X_n, ψ_n} is a stationary process. Since X_n satisfies the given conditions, however, Ī(X; ψ) = I*(X; ψ). But for any scalar quantizer q, Ī(q(X); ψ) is 0 from Lemma 3.3.1. I*(X; ψ) is therefore 0 since it is the supremum of Ī(q(X); ψ) over all quantizers q. Thus

0 = Ī(X; ψ) = lim_{n→∞} (1/n) I(X^n; ψ^n) = lim_{n→∞} (1/n) I(X^n; ψ). □

Chapter 7

Relative Entropy Rates

7.1 Introduction

This chapter extends many of the basic properties of relative entropy to sequences of random variables and to processes. Several limiting properties of
entropy rates are proved and a mean ergodic theorem for relative entropy densities is given. The principal ergodic theorems for relative entropy and information densities in the general case are given in the next chapter.

7.2 Relative Entropy Densities and Rates

Suppose that p and m are two AMS distributions for a random process {X_n}
with a standard alphabet A. For convenience we assume that the random variables {X_n} are coordinate functions of an underlying measurable space (Ω, B), where Ω is a one-sided or two-sided sequence space and B is the corresponding σ-field. Thus x ∈ Ω has the form x = {x_i}, where the index i runs from 0 to ∞ for a one-sided process and from −∞ to +∞ for a two-sided process. The random variables and vectors of principal interest are X_n(x) = x_n, X^n(x) = x^n = (x_0, ···, x_{n−1}), and X^k_l(x) = (x_l, ···, x_{l+k−1}). The process
distributions p and m are both probability measures on the measurable space
(Ω, B ).
For n = 1, 2, . . . let P_{X^n} and M_{X^n} be the vector distributions induced by p and m. We assume throughout this section that M_{X^n} ≫ P_{X^n} and hence that the Radon–Nikodym derivatives f_{X^n} = dP_{X^n}/dM_{X^n} and the entropy densities h_{X^n} = ln f_{X^n} are well defined for all n = 1, 2, . . . Strictly speaking, for each n the random variable f_{X^n} is defined on the measurable space (A^n, B_{A^n}) and hence f_{X^n} is defined on a different space for each n. When considering convergence of relative entropy densities, it is necessary to consider a sequence of random variables defined on a common measurable space, and hence two notational modifications are introduced: The random variables f_{X^n}(X^n) : Ω → [0, ∞) are defined by
f_{X^n}(X^n)(x) ≡ f_{X^n}(X^n(x)) = f_{X^n}(x^n)

for n = 1, 2, . . .. Similarly the entropy densities can be defined on the common space (Ω, B) by

h_{X^n}(X^n) = ln f_{X^n}(X^n).
The reader is warned of the potentially confusing dual use of X^n in this notation: the subscript is the name of the random variable X^n and the argument is the random variable X^n itself. To simplify notation somewhat, we will often abbreviate the previous (unconditional) densities to

f_n = f_{X^n}(X^n);  h_n = h_{X^n}(X^n).
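For a finite alphabet the densities just defined are simple pmf ratios. The following sketch (i.i.d. binary marginals, all numbers arbitrary assumptions) computes f_n and h_n by brute force and checks that the expectation of h_n under p is the divergence D(P_{X^n} ‖ M_{X^n}), which for i.i.d. measures is n times the single-letter divergence.

```python
import itertools
import numpy as np

# Finite-alphabet sketch of the density notation: with i.i.d. marginals p1
# under p and m1 under m, f_{X^n}(x^n) is the pmf ratio, h_n = ln f_n, and
# E_p h_n = D(P_{X^n} || M_{X^n}) = n * D(p1 || m1)  (in nats).

p1 = np.array([0.7, 0.3])   # single-letter pmf under p (assumed)
m1 = np.array([0.5, 0.5])   # single-letter pmf under m (assumed)

def pmf(xn, q):
    """Probability of the tuple xn under an i.i.d. source with marginal q."""
    out = 1.0
    for symbol in xn:
        out *= q[symbol]
    return out

n = 4
relative_entropy = 0.0      # accumulates E_p h_n = D(P_{X^n} || M_{X^n})
for xn in itertools.product(range(2), repeat=n):
    f_n = pmf(xn, p1) / pmf(xn, m1)   # density f_{X^n}(x^n)
    h_n = np.log(f_n)                 # entropy density h_{X^n}(x^n)
    relative_entropy += pmf(xn, p1) * h_n

single_letter = sum(p1[a] * np.log(p1[a] / m1[a]) for a in range(2))
```

In the standard-alphabet setting of the text f_{X^n} is a Radon–Nikodym derivative; the pmf ratio above is its finite-alphabet specialization.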
For n = 1, 2, . . . define the relative entropy by

H_{p‖m}(X^n) = D(P_{X^n} ‖ M_{X^n}) = E_{P_{X^n}} h_{X^n} = E_p h_{X^n}(X^n).

Define the relative entropy rate by

H̄_{p‖m}(X) = lim sup_{n→∞} (1/n) H_{p‖m}(X^n).

Analogous to Dobrushin's definition of information rate, we also define

H*_{p‖m}(X) = sup_q H̄_{p‖m}(q(X)),

where the supremum is over all scalar quantizers q.
Define as in Chapter 5 the conditional densities

f_{X_n|X^n} = dP_{X_n|X^n}/dM_{X_n|X^n} = f_{X^{n+1}} / f_{X^n} = (dP_{X^{n+1}}/dM_{X^{n+1}}) / (dP_{X^n}/dM_{X^n})   (7.1)

provided f_{X^n} ≠ 0, with f_{X_n|X^n} = 1 otherwise. As for unconditional densities we change the notation when we wish to emphasize that the densities can all be defined on a common underlying sequence space. For example, we follow the notation for ordinary conditional probability density functions and define the random variables

f_{X_n|X^n}(X_n|X^n) = f_{X^{n+1}}(X^{n+1}) / f_{X^n}(X^n)

and

h_{X_n|X^n}(X_n|X^n) = ln f_{X_n|X^n}(X_n|X^n)

on (Ω, B). These densities will not have a simple abbreviation as do the unconditional densities.

Define the conditional relative entropy

H_{p‖m}(X_n|X^n) = E_{P_{X^{n+1}}}(ln f_{X_n|X^n}) = ∫ dp ln f_{X_n|X^n}(X_n|X^n).   (7.2)

All of the above definitions are immediate applications of definitions of Chapter
5 to the random variables X_n and X^n. The difference is that these are now defined for all samples of a random process, that is, for all n = 1, 2, . . .. The focus of this chapter is on the interrelations of these entropy measures and on some of their limiting properties for large n.
For convenience define

D_n = H_{p‖m}(X_n|X^n),  n = 1, 2, . . . ,

and D_0 = H_{p‖m}(X_0). From Theorem 5.3.1 this quantity is nonnegative and

D_n + D(P_{X^n} ‖ M_{X^n}) = D(P_{X^{n+1}} ‖ M_{X^{n+1}}).

If D(P_{X^n} ‖ M_{X^n}) < ∞, then also

D_n = D(P_{X^{n+1}} ‖ M_{X^{n+1}}) − D(P_{X^n} ‖ M_{X^n}).

We can write D_n as a single divergence if we define as in Theorem 5.3.1 the distribution S_{X^{n+1}} by

S_{X^{n+1}}(F × G) = ∫_G M_{X_n|X^n}(F|x^n) dP_{X^n}(x^n);  F ∈ B_A;  G ∈ B_{A^n}.   (7.3)

Recall that S_{X^{n+1}} combines the distribution P_{X^n} on X^n with the conditional distribution M_{X_n|X^n} giving the conditional probability under M for X_n given X^n. We shall abbreviate this construction by

S_{X^{n+1}} = M_{X_n|X^n} P_{X^n}.   (7.4)

Then

D_n = D(P_{X^{n+1}} ‖ S_{X^{n+1}}).   (7.5)
Note that S_{X^{n+1}} is not in general a consistent family of measures in the sense of the Kolmogorov extension theorem since its form changes with n, the first n samples being chosen according to p and the final sample being chosen using the conditional distribution induced by m given the first n samples. Thus, in particular, we cannot infer that there is a process distribution s which has S_{X^n}, n = 1, 2, . . . as its vector distributions.
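The construction (7.3)–(7.5) can be checked by brute force in the smallest case, n = 1 on a binary alphabet. In the sketch below (arbitrary strictly positive joint pmfs, an assumption), S_{X^2} glues the P-marginal on X_0 to the M-conditional for X_1 given X_0, and D(P_{X^2} ‖ S_{X^2}) agrees with the divergence difference D(P_{X^2} ‖ M_{X^2}) − D(P_{X_0} ‖ M_{X_0}).

```python
import numpy as np

# Numeric sketch of (7.3)-(7.5) with n = 1: S_{X^2} = M_{X_1|X^1} P_{X^1}, and
# D_1 = D(P_{X^2} || S_{X^2}) = D(P_{X^2} || M_{X^2}) - D(P_{X^1} || M_{X^1}).

P2 = np.array([[0.4, 0.2],    # P_{X^2}[x0, x1], joint pmf under p (assumed)
               [0.1, 0.3]])
M2 = np.array([[0.25, 0.25],  # M_{X^2}[x0, x1], joint pmf under m (assumed)
               [0.25, 0.25]])

def D(a, b):
    """Divergence D(a || b) in nats for pmf arrays of equal shape."""
    return float(np.sum(a * np.log(a / b)))

P1, M1 = P2.sum(axis=1), M2.sum(axis=1)   # marginals of X_0
M_cond = M2 / M1[:, None]                 # M_{X_1|X_0}(x1 | x0)
S2 = P1[:, None] * M_cond                 # S_{X^2}: P-marginal, M-conditional

D1_direct = D(P2, S2)                     # D(P_{X^2} || S_{X^2})
D1_chain = D(P2, M2) - D(P1, M1)          # difference of divergences
```

The algebra behind the agreement is exactly the identity above (7.5): writing S_2 = P_1 · M_cond and M_2 = M_1 · M_cond, the M_cond factors cancel in the log-ratio.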
We immediately have a chain rule for densities

f_{X^n} = ∏_{i=0}^{n−1} f_{X_i|X^i}   (7.6)

and a corresponding chain rule for conditional relative entropies similar to that for ordinary entropies:

D(P_{X^n} ‖ M_{X^n}) = H_{p‖m}(X^n) = ∑_{i=0}^{n−1} H_{p‖m}(X_i|X^i) = ∑_{i=0}^{n−1} D_i.   (7.7)

7.3 Markov Dominating Measures

The evaluation of relative entropy simplifies for certain special cases and reduces to a mutual information when the dominating measure is a Markov approximation of the dominated measure. The following lemma is an extension to
sequences of the results of Corollary 5.5.2 and Lemma 5.5.4.
Theorem 7.3.1 Suppose that p is a process distribution for a standard alphabet random process {X_n} with induced vector distributions P_{X^n}, n = 1, 2, . . .. Suppose also that there exists a process distribution m with induced vector distributions M_{X^n} such that

(a) under m, {X_n} is a k-step Markov source, that is, for all n ≥ k, X^{n−k} → X^k_{n−k} → X_n is a Markov chain or, equivalently,

M_{X_n|X^n} = M_{X_n|X^k_{n−k}},

and

(b) M_{X^n} ≫ P_{X^n}, n = 1, 2, . . ., so that the densities

f_{X^n} = dP_{X^n}/dM_{X^n}

are well defined.

Suppose also that p^(k) is the k-step Markov approximation to p, that is, the source with induced vector distributions P^(k)_{X^n} such that

P^(k)_{X^k} = P_{X^k}

and for all n ≥ k

P^(k)_{X_n|X^n} = P_{X_n|X^k_{n−k}};

that is, p^(k) is a k-step Markov process having the same initial distribution and the same k-th order conditional probabilities as p. Then for all n ≥ k

M_{X^n} ≫ P^(k)_{X^n} ≫ P_{X^n},   (7.8)

dP^(k)_{X^n}/dM_{X^n} = f^(k)_{X^n} ≡ f_{X^k} ∏_{l=k}^{n−1} f_{X_l|X^k_{l−k}},   (7.9)

dP_{X^n}/dP^(k)_{X^n} = f_{X^n} / f^(k)_{X^n}.   (7.10)

Furthermore

h_{X_n|X^n} = h_{X_n|X^k_{n−k}} + i_{X_n;X^{n−k}|X^k_{n−k}}   (7.11)

and hence

D_n = H_{p‖m}(X_n|X^n) = I_p(X_n; X^{n−k}|X^k_{n−k}) + H_{p‖m}(X_n|X^k_{n−k}).

Thus

h_{X^n} = h_{X^k} + ∑_{l=k}^{n−1} ( h_{X_l|X^k_{l−k}} + i_{X_l;X^{l−k}|X^k_{l−k}} )   (7.12)

and hence

D(P_{X^n} ‖ M_{X^n}) = H_{p‖m}(X^k) + ∑_{l=k}^{n−1} ( I_p(X_l; X^{l−k}|X^k_{l−k}) + H_{p‖m}(X_l|X^k_{l−k}) ).   (7.13)

If m = p^(k), then for all n ≥ k we have that h_{X_n|X^k_{n−k}} = 0 and hence

H_{p‖p^(k)}(X_n|X^k_{n−k}) = 0   (7.14)

and

D_n = I_p(X_n; X^{n−k}|X^k_{n−k}),   (7.15)

and hence

D(P_{X^n} ‖ P^(k)_{X^n}) = ∑_{l=k}^{n−1} I_p(X_l; X^{l−k}|X^k_{l−k}).   (7.16)

Proof: If n = k + 1, then the results follow from Corollary 5.3.3 and Lemma 5.5.4
with X = X_n, Z = X^k, and Y = X_k. Now proceed by induction and assume that the results hold for n. Consider the distribution Q_{X^{n+1}} specified by Q_{X^n} = P_{X^n} and Q_{X_n|X^n} = P_{X_n|X^k_{n−k}}. In other words,

Q_{X^{n+1}} = P_{X_n|X^k_{n−k}} P_{X^n}.

Application of Corollary 5.3.1 with Z = X^{n−k}, Y = X^k_{n−k}, and X = X_n implies that M_{X^{n+1}} ≫ Q_{X^{n+1}} ≫ P_{X^{n+1}} and that

dP_{X^{n+1}}/dQ_{X^{n+1}} = f_{X_n|X^n} / f_{X_n|X^k_{n−k}}.

This means that we can write

P_{X^{n+1}}(F) = ∫_F (dP_{X^{n+1}}/dQ_{X^{n+1}}) dQ_{X^{n+1}} = ∫_F (dP_{X^{n+1}}/dQ_{X^{n+1}}) dQ_{X_n|X^n} dQ_{X^n} = ∫_F (dP_{X^{n+1}}/dQ_{X^{n+1}}) dP_{X_n|X^k_{n−k}} dP_{X^n}.

From the induction hypothesis we can express this as

P_{X^{n+1}}(F) = ∫_F (dP_{X^{n+1}}/dQ_{X^{n+1}}) dP_{X_n|X^k_{n−k}} (dP_{X^n}/dP^(k)_{X^n}) dP^(k)_{X^n} = ∫_F (dP_{X^{n+1}}/dQ_{X^{n+1}}) (dP_{X^n}/dP^(k)_{X^n}) dP^(k)_{X^{n+1}},

proving that P^(k)_{X^{n+1}} ≫ P_{X^{n+1}} and that

dP_{X^{n+1}}/dP^(k)_{X^{n+1}} = (dP_{X^{n+1}}/dQ_{X^{n+1}}) (dP_{X^n}/dP^(k)_{X^n}) = (f_{X_n|X^n} / f_{X_n|X^k_{n−k}}) (dP_{X^n}/dP^(k)_{X^n}).

This proves the right-hand part of (7.8) and (7.10).

Next define the distribution P̂_{X^n} by

P̂_{X^n}(F) = ∫_F f^(k)_{X^n} dM_{X^n},

where f^(k)_{X^n} is defined in (7.9). Proving that P̂_{X^n} = P^(k)_{X^n} will prove both the left-hand relation of (7.8) and (7.9). Clearly

dP̂_{X^n}/dM_{X^n} = f^(k)_{X^n},

and from the definition of f^(k) and conditional densities

f̂_{X_n|X^n} = f^(k)_{X_n|X^k_{n−k}}.   (7.17)

From Corollary 5.3.1 it follows that X^{n−k} → X^k_{n−k} → X_n is a Markov chain. Since this is true for any n ≥ k, P̂_{X^n} is the distribution of a k-step Markov process. By construction we also have that

f^(k)_{X_n|X^k_{n−k}} = f_{X_n|X^k_{n−k}}   (7.18)

and hence from Theorem 5.3.1

P̂_{X_n|X^k_{n−k}} = P_{X_n|X^k_{n−k}}.

Since also f̂_{X^k} = f^(k)_{X^k}, P̂_{X^n} = P^(k)_{X^n} as claimed. This completes the proof of (7.8)–(7.10). Eq. (7.11) follows since

f_{X_n|X^n} = f_{X_n|X^k_{n−k}} × ( f_{X_n|X^n} / f_{X_n|X^k_{n−k}} ).

Eq. (7.12) follows from (7.11) and

f_{X^n} = f_{X^k} ∏_{l=k}^{n−1} f_{X_l|X^l},

whence (7.13) follows by taking expectations. If m = p^(k), then the claims follow from (5.24)–(5.25). □
Corollary 7.3.1 Given a stationary source p, suppose that for some K there exists a K-step Markov source m with distributions M_{X^n} ≫ P_{X^n}, n = 1, 2, . . .. Then for all k ≥ K, (7.8)–(7.10) hold.

Proof: If m is a K-step Markov source with the property M_{X^n} ≫ P_{X^n}, n = 1, 2, . . ., then it is also a k-step Markov source with this property for all k ≥ K. The corollary then follows from the theorem. □

Comment: The corollary implies that if any K-step Markov source dominates p on its finite-dimensional distributions, then for all k ≥ K the k-step Markov approximations p^(k) also dominate p on its finite-dimensional distributions.
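Equation (7.16) of Theorem 7.3.1 can be verified numerically in a tiny case. The sketch below takes k = 1 and n = 3 on a binary alphabet with an arbitrary (assumed, strictly positive) joint pmf, builds the first-order Markov approximation, and checks that the divergence from it collapses to the single surviving conditional mutual information term I_p(X_2; X_0 | X_1).

```python
import numpy as np

# Numeric check of (7.16) with k = 1, n = 3: the first-order Markov
# approximation keeps P_{X^2} and the conditional P_{X_2|X_1}, and
# D(P_{X^3} || P^(1)_{X^3}) = I_p(X_2; X_0 | X_1)  (the l = 1 term of the
# sum is zero because there is no past beyond X_0 to condition away).

rng = np.random.default_rng(0)
P3 = rng.random((2, 2, 2)) + 0.1
P3 /= P3.sum()                    # joint pmf P_{X^3}[x0, x1, x2] (assumed)

P01 = P3.sum(axis=2)              # P_{X^2}, pmf of (X0, X1)
P12 = P3.sum(axis=0)              # pmf of (X1, X2)
P1 = P3.sum(axis=(0, 2))          # pmf of X1

# Markov approximation P^(1)(x0, x1, x2) = P(x0, x1) * P(x2 | x1).
P3_markov = P01[:, :, None] * (P12 / P1[:, None])[None, :, :]

div = float(np.sum(P3 * np.log(P3 / P3_markov)))   # D(P_{X^3} || P^(1)_{X^3})

# Conditional mutual information I(X2; X0 | X1).
cmi = float(np.sum(P3 * np.log(P3 * P1[None, :, None]
                               / (P01[:, :, None] * P12[None, :, :]))))
```

A random joint pmf is generically not first-order Markov, so the divergence is strictly positive; it vanishes exactly when the process already satisfies the Markov property, in line with the variational corollary below.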
The following variational corollary follows from Theorem 7.3.1.
Corollary 7.3.2 For a fixed k, let M_k denote the set of all k-step Markov distributions. Then inf_{M∈M_k} D(P_{X^n} ‖ M) is attained by P^(k)_{X^n}, and

inf_{M∈M_k} D(P_{X^n} ‖ M) = D(P_{X^n} ‖ P^(k)_{X^n}) = ∑_{l=k}^{n−1} I_p(X_l; X^{l−k}|X^k_{l−k}).

Since the divergence can be thought of as a distance between probability distributions, the corollary justifies considering the k-step Markov process with the same k-th order distributions as the k-step Markov approximation or model for the original process: It is the minimum-divergence distribution meeting the k-step Markov requirement.

7.4 Stationary Processes

Several of the previous results simplify when the processes m and p are both stationary. We can consider the processes to be two-sided since given a stationary
one-sided process, there is always a stationary two-sided process with the same probabilities on all positive-time events. When both processes are stationary, the densities f_{X^n_m} and f_{X^n} satisfy

f_{X^n_m} = dP_{X^n_m}/dM_{X^n_m} = f_{X^n} T^m,

and have the same expectation for any integer m. Similarly the conditional densities f_{X_n|X^n}, f_{X_k|X^n_{k−n}}, and f_{X_0|X_{−1},X_{−2},···,X_{−n}} satisfy

f_{X_n|X^n} = f_{X_k|X^n_{k−n}} T^{n−k} = f_{X_0|X_{−1},X_{−2},···,X_{−n}} T^n   (7.19)

for any k and have the same expectation. Thus

(1/n) H_{p‖m}(X^n) = (1/n) ∑_{i=0}^{n−1} H_{p‖m}(X_0|X_{−1}, ···, X_{−i}).   (7.20)

Using the construction of Theorem 5.3.1 we have also that

D_i = H_{p‖m}(X_i|X^i) = H_{p‖m}(X_0|X_{−1}, ···, X_{−i}) = D(P_{X_0,X_{−1},···,X_{−i}} ‖ S_{X_0,X_{−1},···,X_{−i}}),

where now

S_{X_0,X_{−1},···,X_{−i}} = M_{X_0|X_{−1},···,X_{−i}} P_{X_{−1},···,X_{−i}};   (7.21)

that is,

S_{X_0,X_{−1},···,X_{−i}}(F × G) = ∫_G M_{X_0|X_{−1},···,X_{−i}}(F|x^i) dP_{X_{−1},···,X_{−i}}(x^i);  F ∈ B_A;  G ∈ B_{A^i}.

As before the S_{X^n} distributions are not in general consistent. For example,
they can yield differing marginal distributions S_{X_0}. As we saw in the finite case, general conclusions about the behavior of the limiting conditional relative entropies cannot be drawn for arbitrary reference measures. If, however, we assume as in the finite case that the reference measures are Markov, then we can proceed.
Suppose now that under m the process is a k-step Markov process. Then for any n ≥ k, (X_{−n}, ···, X_{−k−2}, X_{−k−1}) → X^k_{−k} → X_0 is a Markov chain under m and Lemma 5.5.4 implies that

H_{p‖m}(X_0|X_{−1}, ···, X_{−n}) = H_{p‖m}(X_k|X^k) + I_p(X_k; (X_{−1}, ···, X_{−n})|X^k)   (7.22)

and hence from (7.20)

H̄_{p‖m}(X) = H_{p‖m}(X_k|X^k) + I_p(X_k; X^−|X^k).   (7.23)

We also have, however, that X^− → X^k → X_k is a Markov chain under m and hence a second application of Lemma 5.5.4 implies that

H_{p‖m}(X_0|X^−) = H_{p‖m}(X_k|X^k) + I_p(X_k; X^−|X^k).   (7.24)

Putting these facts together and using (7.2) yields the following lemma.
Lemma 7.4.1 Let {X_n} be a two-sided process with a standard alphabet and let p and m be stationary process distributions such that M_{X^n} ≫ P_{X^n} for all n and m is k-th order Markov. Then the relative entropy rate exists and

H̄_{p‖m}(X) = lim_{n→∞} (1/n) H_{p‖m}(X^n)
= lim_{n→∞} H_{p‖m}(X_0|X_{−1}, ···, X_{−n})
= H_{p‖m}(X_0|X^−)
= H_{p‖m}(X_k|X^k) + I_p(X_k; X^−|X^k)
= E_p[ln f_{X_k|X^k}(X_k|X^k)] + I_p(X_k; X^−|X^k).

Corollary 7.4.1 Given the assumptions of Lemma 7.4.1,

H_{p‖m}(X^N|X^−) = N H_{p‖m}(X_0|X^−).

Proof: From the chain rule for conditional relative entropy (equation (7.7)),

H_{p‖m}(X^N|X^−) = ∑_{l=0}^{N−1} H_{p‖m}(X_l|X^l, X^−).

Stationarity implies that each term in the sum equals H_{p‖m}(X_0|X^−), proving the corollary. □
Corollary 7.4.2 Given k and n ≥ k , let Mk denote the class of all k step
stationary Markov process distributions. Then
¯
inf Hp m∈Mk m (X ) ¯
= Hp p(k) (X ) = Ip (Xk ; X − X k ).
2 Proof: Follows from (7.22) and Theorem 7.3.1. This result gives an interpretation of the ﬁnitegap information property
(6.14): If a process has this property, then there exists a k step Markov process
which is only a ﬁnite “distance” from the given process in terms of limiting
persymbol divergence. If any such process has a ﬁnite distance, then the k step Markov approximation also has a ﬁnite distance. Furthermore, we can
apply Corollary 6.4.1 to obtain the generalization of the ﬁnite alphabet result
of Theorem 2.6.1
.
Corollary 7.4.3 Given a stationary process distribution p which satisﬁes the
ﬁnitegap information property,
¯
inf inf Hp
k m∈Mk m (X ) ¯
= inf Hp
k p(k) (X ) ¯
= lim Hp
k→∞ p( k ) ( X ) = 0. Lemma 7.4.1 also yields the following approximation lemma.
Corollary 7.4.4 Given a process {Xn } with standard alphabet A let p and m
be stationary measures such that PX n
MX n for all n and m is k th order
Markov. Let qk be an asymptotically accurate sequence of quantizers for A.
Then
¯
¯
Hp m (X ) = lim Hp m (qk (X )),
k→∞ that is, the divergence rate can be approximated arbitrarily closely by that of a
quantized version of the process. Thus, in particular,
¯
Hp m (X ) ∗
= Hp m (X ). 142 CHAPTER 7. RELATIVE ENTROPY RATES Proof: This follows from Corollary 5.2.3 by letting the generating σ ﬁelds be
Fn = σ (qn (Xi ); i = 0, −1, . . .) and the representation of conditional relative
entropy as an ordinary divergence.
2
Another interesting property of relative entropy rates for stationary processes is that we can “reverse time” when computing the rate in the sense of
the following lemma.
¯
Lemma 7.4.2 Let {Xn }, p, and m be as in Lemma 7.4.1. If either Hp
−
∞ or HP M (X0 X ) < ∞, then
Hp m (X0 X−1 , · · · , X−n ) = Hp m (X0 X1 , · · · m (X ) < , Xn ) and hence
Hp m (X0 X1 , X2 , · · · ) = Hp m (X0 ¯
X−1 , X−2 , · · · ) = Hp m (X ) < ∞. ¯
Proof: If Hp m (X ) is ﬁnite, then so must be the terms Hp m (X n ) = D(PX n MX n )
(since otherwise all such terms with larger n would also be inﬁnite and hence
¯
H could not be ﬁnite). Thus from stationarity
Hp m (X0 X−1 , · · · , X−n ) m (Xn X n Hp =
n
n
D(PX n+1 MX n+1 ) − D(PX1 MX1 ) = ) D(PX n+1 MX n+1 ) − D(PX n MX n ) = Hp m (X0 X1 , · · · , Xn ) from which the results follow. If on the other hand the conditional relative
entropy is ﬁnite, the results then follow as in the proof of Lemma 7.4.1 using the
fact that the joint relative entropies are arithmetic averages of the conditional
relative entropies and that the conditional relative entropy is deﬁned as the
divergence between the P and S measures (Theorem 5.3.2).
2 7.5 Mean Ergodic Theorems In this section we state and prove some preliminary ergodic theorems for relative
entropy densities analogous to those first developed for entropy densities in Chapter 3 and for information densities in Section 6.3. In particular, we show that an almost everywhere ergodic theorem for finite alphabet processes follows easily from the sample entropy ergodic theorem, and that an approximation argument then yields an L^1 ergodic theorem for stationary sources. The results involve little new and closely parallel those for mutual information densities, and therefore the details are skimpy. The results are given for completeness and because the L^1 results yield the byproduct that relative entropy densities are uniformly integrable, a fact which does not follow as easily for relative entropies as it did for entropies.

Finite Alphabets
Suppose that we now have two process distributions p and m for a random process {X_n} with finite alphabet. Let P_{X^n} and M_{X^n} denote the induced n-th order distributions and p_{X^n} and m_{X^n} the corresponding probability mass functions (pmf's). For example, p_{X^n}(a^n) = P_{X^n}({x^n : x^n = a^n}) = p({x : X^n(x) = a^n}). We assume that P_{X^n} ≪ M_{X^n}. In this case the relative entropy density is given simply by

h_n(x) = h_{X^n}(X^n)(x) = ln [ p_{X^n}(x^n) / m_{X^n}(x^n) ],

where x^n = X^n(x).
The following lemma generalizes Theorem 3.1.1 from entropy densities to relative entropy densities for finite alphabet processes. Relative entropies are of more general interest than ordinary entropies because they generalize to continuous alphabets in a useful way while ordinary entropies do not.

Lemma 7.5.1 Suppose that {X_n} is a finite alphabet process and that p and m are two process distributions with M_{X^n} ≫ P_{X^n} for all n, where p is AMS with stationary mean p̄, m is a k-th order Markov source with stationary transitions, and {p̄_x} is the ergodic decomposition of the stationary mean of p. Assume also that M_{X^n} ≫ P̄_{X^n} for all n. Then

lim_{n→∞} (1/n) h_n = h;  p-a.e. and in L^1(p),

where h(x) is the invariant function defined by

h(x) = −H̄_{p̄_x}(X) − E_{p̄_x} ln m(X_k|X^k)
= lim_{n→∞} (1/n) H_{p̄_x‖m}(X^n)
= H̄_{p̄_x‖m}(X),

where

m(X_k|X^k)(x) ≡ m_{X^{k+1}}(x^{k+1}) / m_{X^k}(x^k) = M_{X_k|X^k}(x_k|x^k).
1
Hp m (X n ),
(7.25)
n→∞ n
that is, the relative entropy rate of an AMS process with respect to a Markov
process with stationary transitions is given by the limit. Lastly,
¯
E p h = Hp ¯
Hp m (X ) m (X ) = lim ¯¯
= Hp m (X ); (7.26) that is, the relative entropy rate of the AMS process with respect to m is the
same as that of its stationary mean with respect to m. 144 CHAPTER 7. RELATIVE ENTROPY RATES Proof: We have that
(1/n) h_n = (1/n) ln p(X^n) − (1/n) ln m(X^k) − (1/n) ∑_{i=k}^{n−1} ln m(X_i|X^k_{i−k})
= (1/n) ln p(X^n) − (1/n) ln m(X^k) − (1/n) ∑_{i=k}^{n−1} ln m(X_k|X^k)T^{i−k},   (7.27)

where T is the shift transformation, p(X^n) is an abbreviation for p_{X^n}(X^n), and m(X_k|X^k) = M_{X_k|X^k}(X_k|X^k). From Theorem 3.1.1 the first term converges to −H̄_{p̄_x}(X) p-a.e. and in L^1(p).

Since M_{X^k} ≫ P_{X^k}, if M_{X^k}(F) = 0, then also P_{X^k}(F) = 0. Thus P_{X^k}, and hence also p, assign zero probability to the event that m_{X^k}(X^k) = 0. Thus with probability one under p, ln m(X^k) is finite and hence the second term above converges to 0 p-a.e. as n → ∞.
Define α as the minimum nonzero value of the conditional probability m(x_k|x^k). Then with probability 1 under M_{X^n}, and hence also under P_{X^n}, we have that

(1/n) ∑_{i=k}^{n−1} ln [ 1 / m(X_i|X^k_{i−k}) ] ≤ ln (1/α),

since otherwise the sequence X^n would have 0 probability under M_{X^n} and hence also under P_{X^n} (0 ln 0 is considered to be 0). Thus the rightmost term of (7.27) is uniformly integrable with respect to p and hence from Theorem 1.8.3 this term converges to E_{p̄_x}(ln m(X_k|X^k)). This proves the first equality in the expression for h(x).
Let p̄_{X^n|x} denote the distribution of X^n under the ergodic component p̄_x. Since M_{X^n} ≫ P̄_{X^n} and P̄_{X^n} = ∫ dp̄(x) p̄_{X^n|x}, if M_{X^n}(F) = 0, then p̄_{X^n|x}(F) = 0 p̄-a.e. Since the alphabet of X_n is finite, we therefore also have with probability one under p̄ that M_{X^n} ≫ p̄_{X^n|x} and hence

H_{p̄_x‖m}(X^n) = ∑_{a^n} p̄_{X^n|x}(a^n) ln [ p̄_{X^n|x}(a^n) / M_{X^n}(a^n) ]

is well defined for p̄-almost all x. This expectation can also be written as

H_{p̄_x‖m}(X^n) = −H_{p̄_x}(X^n) − E_{p̄_x}[ ln m(X^k) + ∑_{i=k}^{n−1} ln m(X_k|X^k)T^{i−k} ]
= −H_{p̄_x}(X^n) − E_{p̄_x}[ln m(X^k)] − (n−k) E_{p̄_x}[ln m(X_k|X^k)],

where we have used the stationarity of the ergodic components. Dividing by n and taking the limit as n → ∞, the middle term goes to zero as previously, and the remaining limits prove the middle equality and hence the rightmost equality in the expression for h(x).
Equation (7.25) follows from the L^1(p) convergence: since n^{−1} h_n → h in L^1(p), we must also have that E_p(n^{−1} h_n) = n^{−1} H_{p‖m}(X^n) converges to E_p h, so the limit in (7.25) exists and E_p h = H̄_{p‖m}(X), the leftmost equality of (7.26). Since p̄_x is invariant (Theorem 1.8.2) and since expectations of invariant functions are the same under an AMS measure and its stationary mean (Lemma 6.3.1 of [50]), application of the previous results of the lemma to both p and p̄ proves that

H̄_{p‖m}(X) = ∫ dp̄(x) H̄_{p̄_x‖m}(X) = H̄_{p̄‖m}(X),

which proves (7.26) and completes the proof of the lemma. □

Corollary 7.5.1 Given p and m as in the Lemma, the relative entropy rate of p with respect to m has an ergodic decomposition, that is,

H̄_{p‖m}(X) = ∫ dp̄(x) H̄_{p̄_x‖m}(X).

Proof: This follows immediately from (7.25) and (7.26). □

Standard Alphabets
We now drop the finite alphabet assumption and suppose that {X_n} is a standard alphabet process with process distributions p and m, where p is stationary, m is k-th order Markov with stationary transitions, and M_{X^n} ≫ P_{X^n} are the induced vector distributions for n = 1, 2, . . . . Define the densities f_n and entropy densities h_n as previously.

As an easy consequence of the development to this point, the ergodic decomposition for divergence rate of finite alphabet processes combined with the definition of H* as a supremum over rates of quantized processes yields an extension of Corollary 6.2.1 to divergences. This yields other useful properties as summarized in the following corollary.

Corollary 7.5.2 Given a standard alphabet process {X_n}, suppose that p and m
are two process distributions such that p is AMS, m is k-th order Markov with stationary transitions, and M_{X^n} ≫ P_{X^n} are the induced vector distributions. Let p̄ denote the stationary mean of p and let {p̄_x} denote the ergodic decomposition of the stationary mean p̄. Then

H*_{p‖m}(X) = ∫ dp̄(x) H*_{p̄_x‖m}(X).   (7.27)

In addition,

H*_{p‖m}(X) = H*_{p̄‖m}(X) = H̄_{p̄‖m}(X) = H̄_{p‖m}(X);   (7.28)

that is, the two definitions of relative entropy rate yield the same values for AMS p and stationary-transition Markov m, and both rates are the same as the corresponding rates for the stationary mean. Thus relative entropy rate has an ergodic decomposition in the sense that

H̄_{p‖m}(X) = ∫ dp̄(x) H̄_{p̄_x‖m}(X).   (7.29)

Comment: Note that the extra technical conditions of Theorem 6.4.2 for equality
of the analogous mutual information rates Ī and I* are not needed here. Note also that only the ergodic decomposition of the stationary mean p̄ of the AMS measure p is considered, and not that of the Markov source m.

Proof: The first statement follows as previously described from the finite alphabet result and the definition of H*. The leftmost and rightmost equalities of (7.28) both follow from the previous lemma. The middle equality of (7.28) follows from Corollary 7.4.2. Eq. (7.29) then follows from (7.27) and (7.28). □
Theorem 7.5.1 Given a standard alphabet process {X_n}, suppose that p and m are two process distributions such that p is AMS, m is k-th order Markov with stationary transitions, and M_{X^n} ≫ P_{X^n} are the induced vector distributions. Let {p̄_x} denote the ergodic decomposition of the stationary mean p̄. If

lim_{n→∞} (1/n) H_{p‖m}(X^n) = H̄_{p‖m}(X) < ∞,

then there is an invariant function h such that n^{−1} h_n → h in L^1(p) as n → ∞. In fact,

h(x) = H̄_{p̄_x‖m}(X),

the relative entropy rate of the ergodic component p̄_x with respect to m. Thus, in particular, under the stated conditions the relative entropy densities h_n are uniformly integrable with respect to p.
Proof: The proof exactly parallels that of Theorem 6.3.1, the mean ergodic theorem for information densities, with the relative entropy densities replacing the mutual information densities. The density is approximated by that of a quantized version and the integral bounded above using the triangle inequality. One term goes to zero from the finite alphabet case. Since H̄ = H* (Corollary 7.5.2), the remaining terms go to zero because the relative entropy rate can be approximated arbitrarily closely by that of a quantized process. □
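A quick simulation consistent with this convergence (and with the almost-everywhere version proved in the next chapter): for two binary first-order Markov chains with p stationary ergodic, the limit function h is the constant H̄_{p‖m}(X), so the per-symbol density along a single sample path should settle near the closed-form rate. All distributions and the seed below are arbitrary assumptions.

```python
import numpy as np

# Monte Carlo sketch: (1/n) h_n = (1/n) ln[p(X^n)/m(X^n)] along one path of p
# should approach the relative entropy rate of the two Markov chains.

rng = np.random.default_rng(1)
Pt = np.array([[0.9, 0.1],      # transition matrix under p (assumed)
               [0.3, 0.7]])
Mt = np.array([[0.5, 0.5],      # transition matrix under m (assumed)
               [0.5, 0.5]])
pi = np.array([0.75, 0.25])     # stationary pmf of Pt

n = 100_000
u = rng.random(n)
x = np.empty(n, dtype=int)
x[0] = int(u[0] < pi[1])        # sample X_0 from pi
for t in range(1, n):           # simulate the chain under p
    x[t] = int(u[t] < Pt[x[t - 1], 1])

# (1/n) h_n, dropping the O(1/n) initial term ln[pi(x_0)/m(x_0)].
steps = np.log(Pt[x[:-1], x[1:]] / Mt[x[:-1], x[1:]])
density_per_symbol = steps.sum() / n

rate = sum(pi[a] * Pt[a, b] * np.log(Pt[a, b] / Mt[a, b])
           for a in range(2) for b in range(2))
```

With 10^5 steps the sample average is many standard errors inside a tolerance of a few hundredths of a nat, so the check is robust to the particular seed chosen.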
It should be emphasized that although Theorem 7.5.1 and Theorem 6.3.1 are similar in appearance, neither result directly implies the other. It is true that mutual information can be considered as a special case of relative entropy, but given a pair process {X_n, Y_n} we cannot in general find a k-th order Markov distribution m for which the mutual information rate Ī(X; Y) equals a relative entropy rate H̄_{p‖m}. We will later consider conditions under which convergence of relative entropy densities does imply convergence of information densities.

Chapter 8

Ergodic Theorems for Densities

8.1 Introduction

This chapter is devoted to developing ergodic theorems first for relative entropy
densities and then for information densities for the general case of AMS processes with standard alphabets. The general results were first developed by Barron [9] using the martingale convergence theorem and a new martingale inequality. The similar results of Algoet and Cover [7] can be proved without direct recourse to martingale theory. They infer the result for the stationary Markov approximation and for the infinite order approximation from the ordinary ergodic theorem. They then demonstrate that the growth rate of the true density is asymptotically sandwiched between that for the k-th order Markov approximation and the infinite order approximation, and that no gap is left between these asymptotic upper and lower bounds in the limit as k → ∞. They use martingale theory to show that the values between which the limiting density is sandwiched are arbitrarily close to each other, but we shall see that this is not necessary and this property follows from the results of Chapter 6.

8.2 Stationary Ergodic Sources

Theorem 8.2.1 Given a standard alphabet process {X_n}, suppose that p and
m are two process distributions such that p is stationary ergodic and m is a K-step Markov source with stationary transition probabilities. Let M_{X^n} ≫ P_{X^n} be the vector distributions induced by m and p. As before let

h_n = ln f_{X^n}(X^n) = ln (dP_{X^n}/dM_{X^n})(X^n).

Then with probability one under p,

lim_{n→∞} (1/n) h_n = H̄_{p‖m}(X).

Proof: Let p^(k) denote the k-step Markov approximation of p as defined in
Theorem 7.3.1, that is, p(k) has the same k th order conditional probabilities
and k dimensional initial distribution. From Corollary 7.3.1, if k ≥ K , then
(7.8)–(7.10) hold. Consider the expectation
(k) (k ) fX n ( X n )
fX n ( X n ) Ep = EPX n (k) fX n
fX n = fX n
fX n dPX n . Deﬁne the set An = {xn : fX n > 0}; then PX n (An ) = 1. Use the fact that
fX n = dPX n /dMX n to write
(k) (k) fX n (X n )
fX n (X n ) EP fX n
fX n =
An fX n dMX n (k ) =
An fX n dMX n . From Theorem 7.3.1,
(k ) (k) fX n = dPX n
dMX n and therefore
(k) Ep fX n (X n )
fX n (X n ) (k) =
An dPX n
(k)
dMX n = PX n (An ) ≤ 1.
dMX n
(k) Thus we can apply Lemma 5.4.2 to the sequence fX n (X n )/fX n (X n ) to conclude that with pprobability 1
(k) 1 fX n (X n )
ln
≤0
n→∞ n
fX n (X n )
lim and hence 1
1
(k)
ln fX n (X n ) ≤ lim inf fX n (X n ).
n→∞ n
n
The lefthand limit is well deﬁned by the usual ergodic theorem:
lim n→∞ 1
1
(k)
ln fX n (X n ) = lim
n→∞ n
n→∞ n n−1 ln fXl Xlk−k (Xl Xlk k ) + lim
− lim l=k n→∞ (8.1) 1
ln fX k (X k ).
n Since 0 < fX k < ∞ with probability 1 under MX k and hence also under PX k ,
then 0 < fX k (X k ) < ∞ under p and therefore n−1 ln fX k (X k ) → 0 as n → ∞
with probability one. Furthermore, from the ergodic theorem for stationary and 8.2. STATIONARY ERGODIC SOURCES 149 ergodic processes (e.g., Theorem 7.2.1 of [50]), since p is stationary ergodic we
have with probability one under p using (7.19) and Lemma 7.4.1 that
1
lim
n→∞ n
= n−1 ln fXl Xlk−k (Xl Xlk k )
−
l=k 1
n→∞ n n−1 ln fX0 X−1 ,··· ,X−k (X0  X−1 , · · · , X−k )T l lim l=k = Ep ln fX0 X−1 ,··· ,X−k (X0 X−1 , · · · , X−k )
¯
= Hp m (X0 X−1 , · · · , X−k ) = Hp(k) m (X ).
Thus with (8.1) we now have that
lim inf
n→∞ 1
ln fX n (X n ) ≥ Hp
n m (X0 X−1 , · · · , X−k ) (8.2) for any positive integer k . Since m is K th order Markov, Lemma 7.4.1 and the
above imply that
lim inf
n→∞ 1
ln fX n (X n ) ≥ Hp
n m (X0 X − ¯
) = Hp m (X ), (8.3) which completes half of the sandwich proof of the theorem.
¯
If Hp m (X ) = ∞, the proof is completed with (8.3). Hence we can suppose
¯
that Hp m (X ) < ∞. From Lemma 7.4.1 using the distribution SX0 ,X−1 ,X− 2,···
constructed there, we have that
D(PX0 ,X−1 ,··· SX0 ,X−1 ,··· ) = Hp
where
fX0 X − = m (X0 X − )= dPX0 ,X − ln fX0 X − dPX0 ,X−1 ,···
.
dSX0 ,X−1 ,··· It should be pointed out that we have not (and will not) prove that fX0 X−1 ,··· ,X− n
→ fX0 X − ; the convergence of conditional probability densities which follows
from the martingale convergence theorem and the result about which most generalized ShannonMcMillanBreiman theorems are built. (See, e.g., Barron [9].)
We have proved, however, that the expectations converge (Lemma 7.4.1), which
is what is needed to make the sandwich argument work.
For the second half of the sandwich proof we construct a measure Q which will be dominated by p on semi-infinite sequences using the above conditional densities given the infinite past. Define the semi-infinite sequence

$$X_n^- = \{\cdots,X_{n-2},X_{n-1}\}$$

for all nonnegative integers n. Let $B_k^n=\sigma(X_k^n)$ and $B_k^-=\sigma(X_k^-)=\sigma(\cdots,X_{k-1})$ be the $\sigma$-fields generated by the finite-dimensional random vector $X_k^n$ and the semi-infinite sequence $X_k^-$, respectively. Let Q be the process distribution having the same restriction to $\sigma(X_k^-)$ as does p and the same restriction to $\sigma(X_0,X_1,\cdots)$ as does p, but which makes $X^-$ and $X_k^n$ conditionally independent given $X^k$ for any n; that is,

$$Q_{X^-}=P_{X^-},\qquad Q_{X_k,X_{k+1},\cdots}=P_{X_k,X_{k+1},\cdots},$$

and $X^- \to X^k \to X_k^n$ is a Markov chain for all positive integers n, so that

$$Q(X_k^n\in F\,|\,X_k^-) = Q(X_k^n\in F\,|\,X^k).$$

The measure Q is a (nonstationary) k-step Markov approximation to P in the sense of Section 5.3 (in contrast to P itself, under which the conditional distribution of $X_k^n$ given the past may depend on all of $X^-$). Observe that $X^-\to X^k\to X_k^n$ is a Markov chain under both Q and m.

By assumption, $H_{p\|m}(X_0|X^-)<\infty$ and hence from Lemma 7.4.1

$$H_{p\|m}(X_k^n|X_k^-) = nH_{p\|m}(X_0|X^-) < \infty,$$

and hence from Theorem 5.3.2 the density $f_{X_k^n|X_k^-}$ is well defined as

$$f_{X_k^n|X_k^-} = \frac{dP_{X^{n+k}}}{dS_{X^{n+k}}} \eqno(8.4)$$

where

$$S_{X^{n+k}} = M_{X_k^n|X^k}\,P_{X_k^-}, \eqno(8.5)$$

and

$$\int dP_{X^{n+k}}\,\ln f_{X_k^n|X_k^-} = D(P_{X^{n+k}}\|S_{X^{n+k}}) = H_{p\|m}(X_k^n|X_k^-) < \infty.$$

Thus, in particular,
$$S_{X^{n+k}} \gg P_{X^{n+k}}.$$

Consider now the sequence of ratios of conditional densities

$$\zeta_n = \frac{f_{X_k^n|X^k}(X^{n+k})}{f_{X_k^n|X_k^-}(X^{n+k})}.$$

We have that

$$\int dp\,\zeta_n = \int_{G_n}\zeta_n\,dp,$$

where

$$G_n = \{x: f_{X_k^n|X_k^-}(x^{n+k})>0\},$$

since $G_n$ has probability 1 under p (or else (8.6) would be violated). Thus

$$\int dp\,\zeta_n = \int \frac{f_{X_k^n|X^k}(X^{n+k})}{f_{X_k^n|X_k^-}}\,1_{\{f_{X_k^n|X_k^-}>0\}}\,dP_{X^{n+k}}
= \int \frac{f_{X_k^n|X^k}(X^{n+k})}{f_{X_k^n|X_k^-}}\,f_{X_k^n|X_k^-}\,1_{\{f_{X_k^n|X_k^-}>0\}}\,dS_{X^{n+k}}$$
$$= \int f_{X_k^n|X^k}(X^{n+k})\,1_{\{f_{X_k^n|X_k^-}>0\}}\,dS_{X^{n+k}}
\le \int dS_{X^{n+k}}\,f_{X_k^n|X^k}(X^{n+k}).$$

Using the definition of the measure S and iterated expectation we have that

$$\int dp\,\zeta_n \le \int dM_{X_k^n|X^k}\,dP_{X_k^-}\,f_{X_k^n|X^k}(X^{n+k}).$$

Since the integrand is now measurable with respect to $\sigma(X^{n+k})$, this reduces to

$$\int dp\,\zeta_n \le \int dM_{X_k^n|X^k}\,dP_{X^k}\,f_{X_k^n|X^k}.$$

Applying Lemma 5.3.2 we have

$$\int dp\,\zeta_n \le \int dM_{X_k^n|X^k}\,dP_{X^k}\,\frac{dP_{X_k^n|X^k}}{dM_{X_k^n|X^k}} = \int dP_{X^k}\,dP_{X_k^n|X^k} = 1.$$

Thus

$$\int dp\,\zeta_n \le 1$$

and we can apply Lemma 5.4.1 to conclude that p-a.e.

$$\limsup_{n\to\infty}\frac{1}{n}\ln\zeta_n = \limsup_{n\to\infty}\frac{1}{n}\ln\frac{f_{X_k^n|X^k}}{f_{X_k^n|X_k^-}} \le 0. \eqno(8.6)$$

Using the chain rule for densities,

$$\frac{f_{X_k^n|X^k}}{f_{X_k^n|X_k^-}} = f_{X^n}\times\frac{1}{f_{X^k}}\times\frac{1}{\prod_{l=k}^{n-1}f_{X_l|X_l^-}}.$$

Thus from (8.6)

$$\limsup_{n\to\infty}\left(\frac{1}{n}\ln f_{X^n} - \frac{1}{n}\ln f_{X^k} - \frac{1}{n}\sum_{l=k}^{n-1}\ln f_{X_l|X_l^-}\right) \le 0.$$

Invoking the ergodic theorem for the rightmost term, together with the fact that the middle term converges to 0 almost everywhere (since $\ln f_{X^k}$ is finite almost everywhere), implies that

$$\limsup_{n\to\infty}\frac{1}{n}\ln f_{X^n} \le E_p(\ln f_{X_k|X_k^-}) = E_p(\ln f_{X_0|X^-}) = \bar H_{p\|m}(X). \eqno(8.7)$$

Combining this with (8.3) completes the sandwich and proves the theorem. □
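The theorem can be illustrated numerically in the simplest possible case: an i.i.d. source p with an i.i.d. reference measure m, where the sample relative entropy density is a sum of i.i.d. terms and its almost-everywhere limit is the single-letter divergence $D(p\|m)$. The sketch below is only an illustration of the statement, not part of the formal development; the Bernoulli parameters are arbitrary assumed values.

```python
import math
import random

# Illustration: (1/n) ln f_{X^n}(X^n) -> D(p||m) p-a.e. for i.i.d. p and m,
# where f_{X^n}(x^n) = prod_i p(x_i)/m(x_i) is the density dP/dM.
p1, m1 = 0.3, 0.5          # assumed Bernoulli parameters (illustrative only)
n = 200_000
rng = random.Random(0)

x = [1 if rng.random() < p1 else 0 for _ in range(n)]

def ln_ratio(sym):
    # log of the single-letter density p(sym)/m(sym)
    p = p1 if sym else 1.0 - p1
    m = m1 if sym else 1.0 - m1
    return math.log(p / m)

sample_density = sum(ln_ratio(s) for s in x) / n
# relative entropy rate: here just the single-letter divergence, in nats
D = p1 * math.log(p1 / m1) + (1 - p1) * math.log((1 - p1) / (1 - m1))
print(sample_density, D)
```

With a run this long the empirical average sits within a few thousandths of $D(p\|m)$; the general theorem replaces the law of large numbers used here by the ergodic theorem and the sandwich argument.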
8.3 Stationary Nonergodic Sources

Next suppose that the source p is stationary with ergodic decomposition $\{p_\lambda;\lambda\in\Lambda\}$ and ergodic component function $\psi$ as in Theorem 1.8.3. We first require some technical details to ensure that the various Radon-Nikodym derivatives are well defined and that the needed chain rules for densities hold.
Lemma 8.3.1 Given a stationary source $\{X_n\}$, let $\{p_\lambda;\lambda\in\Lambda\}$ denote the ergodic decomposition and $\psi$ the ergodic component function of Theorem 1.8.3. Let $P_\psi$ denote the induced distribution of $\psi$. Let $P_{X^n}$ and $P^\lambda_{X^n}$ denote the induced marginal distributions of p and $p_\lambda$. Assume that $\{X_n\}$ has the finite-gap information property of (6.14); that is, there exists a K such that

$$I_p(X_K;X^-|X^K) < \infty, \eqno(8.8)$$

where $X^-=(X_{-1},X_{-2},\cdots)$. We also assume that for some n

$$I(X^n;\psi) < \infty. \eqno(8.9)$$

This will be the case, for example, if (8.8) holds for K = 0. Let m be a K-step Markov process such that $M_{X^n}\gg P_{X^n}$ for all n. (Observe that such a process exists since from (8.8) the K-th order Markov approximation $p^{(K)}$ suffices.) Define $M_{X^n,\psi}=M_{X^n}\times P_\psi$. Then

$$M_{X^n,\psi} \gg P_{X^n}\times P_\psi \gg P_{X^n,\psi}, \eqno(8.10)$$

and with probability 1 under p

$$M_{X^n} \gg P_{X^n} \gg P^\psi_{X^n}.$$

Lastly,

$$f_{X^n|\psi} = \frac{dP^\psi_{X^n}}{dM_{X^n}} = \frac{dP_{X^n,\psi}}{d(M_{X^n}\times P_\psi)}, \eqno(8.11)$$

and therefore

$$\frac{dP^\psi_{X^n}}{dP_{X^n}} = \frac{dP^\psi_{X^n}/dM_{X^n}}{dP_{X^n}/dM_{X^n}} = \frac{f_{X^n|\psi}}{f_{X^n}}. \eqno(8.12)$$

Proof: From Theorem 6.4.4 the given assumptions ensure that

$$\lim_{n\to\infty}\frac{1}{n}E_p\,i(X^n;\psi) = \lim_{n\to\infty}\frac{1}{n}I(X^n;\psi) = 0 \eqno(8.13)$$

and hence $P_{X^n}\times P_\psi \gg P_{X^n,\psi}$ (since otherwise $I(X^n;\psi)$ would be infinite for some n and hence infinite for all larger n, since it is increasing with n). This proves the rightmost absolute continuity relation of (8.10). This in turn implies that $M_{X^n}\times P_\psi \gg P_{X^n,\psi}$. The lemma then follows from Theorem 5.3.1 with $X=X^n$, $Y=\psi$, and the chain rule for Radon-Nikodym derivatives. □
We know that the source will produce with probability one an ergodic component $p_\lambda$, and hence Theorem 8.2.1 will hold for this ergodic component. In other words, we have for all $\lambda$ that

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n|\psi}(X^n|\lambda) = \bar H_{p_\lambda\|m}(X);\quad p_\lambda\text{-a.e.}$$

This implies that

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n|\psi}(X^n|\psi) = \bar H_{p_\psi\|m}(X);\quad p\text{-a.e.} \eqno(8.14)$$

Making this step precise generalizes Lemma 3.3.1.
Lemma 8.3.2 Suppose that $\{X_n\}$ is a stationary, not necessarily ergodic, source with ergodic component function $\psi$. Then (8.14) holds.

Proof: The proof parallels that for Lemma 3.3.1. Observe that if we have two random variables U, V ($U=(X_0,X_1,\cdots)$ and $V=\psi$ above) and a sequence of functions $g_n(U,V)$ ($n^{-1}\ln f_{X^n|\psi}(X^n|\psi)$) and a function $g(V)$ ($\bar H_{p_\psi\|m}(X)$) with the property

$$\lim_{n\to\infty}g_n(U,v) = g(v),\quad P_{U|V=v}\text{-a.e.},$$

then also

$$\lim_{n\to\infty}g_n(U,V) = g(V);\quad P_{UV}\text{-a.e.},$$

since, defining the (measurable) set $G=\{u,v:\lim_{n\to\infty}g_n(u,v)=g(v)\}$ and its section $G_v=\{u:(u,v)\in G\}$, from (1.26)

$$P_{UV}(G) = \int P_{U|V}(G_v|v)\,dP_V(v) = 1$$

if $P_{U|V}(G_v|v)=1$ with probability 1. □

It is not, however, the relative entropy density using the distribution of the
ergodic component that we wish to show converges; it is the original sample density $f_{X^n}$. The following theorem shows that the two sample entropies converge to the same thing. The theorem generalizes Lemma 3.3.1 and is proved by a sandwich argument analogous to that of Theorem 8.2.1. The result can be viewed as an almost everywhere version of (8.13).

Theorem 8.3.1 Given a stationary source $\{X_n\}$, let $\{p_\lambda;\lambda\in\Lambda\}$ denote the ergodic decomposition and $\psi$ the ergodic component function of Theorem 1.8.3. Assume that the finite-gap information property (8.8) is satisfied and that (8.9) holds for some n. Then

$$\lim_{n\to\infty}\frac{1}{n}i(X^n;\psi) = \lim_{n\to\infty}\frac{1}{n}\ln\frac{f_{X^n|\psi}}{f_{X^n}} = 0;\quad p\text{-a.e.}$$

Proof: From Theorem 5.4.1 we have immediately that

$$\liminf_{n\to\infty}\frac{1}{n}i_n(X^n;\psi) \ge 0, \eqno(8.15)$$

which provides half of the sandwich proof.

To develop the other half of the sandwich, for each $k\ge K$ let $p^{(k)}$ denote the k-step Markov approximation of p. Exactly as in the proof of Theorem 8.2.1, it follows that (8.1) holds. Now, however, the Markov approximation relative entropy density converges instead as

$$\lim_{n\to\infty}\frac{1}{n}\ln f^{(k)}_{X^n}(X^n) = \lim_{n\to\infty}\frac{1}{n}\sum_{l=k}^{n-1}\ln f_{X_k|X^k}(X_k|X^k)T^l = E_{p_\psi}\ln f_{X_k|X^k}(X_k|X^k).$$

Combining this with (8.14) we have that

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{f_{X^n|\psi}(X^n|\psi)}{f_{X^n}(X^n)} \le \bar H_{p_\psi\|m}(X) - E_{p_\psi}\ln f_{X_k|X^k}(X_k|X^k).$$

From Lemma 7.4.1, the right-hand side is just $I_{p_\psi}(X_k;X^-|X^k)$, which from Corollary 7.4.2 is just $\bar H_{p_\psi\|p^{(k)}}(X)$. Since the bound holds for all k, we have that

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{f_{X^n|\psi}(X^n|\psi)}{f_{X^n}(X^n)} \le \inf_k \bar H_{p_\psi\|p^{(k)}}(X) \equiv \zeta.$$

Using the ergodic decomposition of relative entropy rate (Corollary 7.5.1) and the fact that Markov approximations are asymptotically accurate (Corollary 7.4.3), we have further that

$$\int dP_\psi\,\zeta = \int dP_\psi\,\inf_k \bar H_{p_\psi\|p^{(k)}}(X) \le \inf_k \int dP_\psi\,\bar H_{p_\psi\|p^{(k)}}(X) = \inf_k \bar H_{p\|p^{(k)}}(X) = 0,$$

and hence $\zeta = 0$ with $P_\psi$ probability 1. Thus

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{f_{X^n|\psi}(X^n|\psi)}{f_{X^n}(X^n)} \le 0, \eqno(8.16)$$

which with (8.15) completes the sandwich proof. □
Simply restating the theorem using (8.14) yields the ergodic theorem for relative entropy densities in the general stationary case.

Corollary 8.3.1: Given the assumptions of Theorem 8.3.1,

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) = \bar H_{p_\psi\|m}(X),\quad p\text{-a.e.}$$

The corollary states that the sample relative entropy density of a process satisfying (8.8) converges to the conditional relative entropy rate with respect to the underlying ergodic component. This is a slight extension and elaboration of Barron's result [9], which made the stronger assumption that $H_{p\|m}(X_0|X^-) = \bar H_{p\|m}(X) < \infty$. From Corollary 7.4.3 this condition is sufficient but not necessary for the finite-gap information property of (8.8). In particular, the finite-gap information property implies that

$$\bar H_{p\|p^{(k)}}(X) = I_p(X_k;X^-|X^k) < \infty,$$

but it need not be true that $\bar H_{p\|m}(X)<\infty$. In addition, Barron [9] and Algoet and Cover [7] do not characterize the limiting density as the entropy rate of the ergodic component; instead they effectively show that the limit is $E_{p_\psi}(\ln f_{X_0|X^-}(X_0|X^-))$. This, however, is equivalent, since it follows from the ergodic decomposition (see specifically Lemma 8.6.2 or [50]) that $f_{X_0|X^-} = f_{X_0|X^-,\psi}$ with probability one, because the ergodic component $\psi$ can be determined from the infinite past $X^-$.

8.4 AMS Sources

The following lemma is a generalization of Lemma 3.4.1. The result is due to
Barron [9], who proved it using martingale inequalities and convergence results.
Lemma 8.4.1 Let $\{X_n\}$ be an AMS source with the property that for every integer k there exists an integer $l=l(k)$ such that

$$I_p(X^k;(X_{k+l},X_{k+l+1},\cdots)|X_k^l) < \infty. \eqno(8.17)$$

Then

$$\lim_{n\to\infty}\frac{1}{n}i(X^k;(X_{k+l},\cdots,X_{n-1})|X_k^l) = 0;\quad p\text{-a.e.}$$

Proof: By assumption

$$I_p(X^k;(X_{k+l},X_{k+l+1},\cdots)|X_k^l) = E_p\ln\frac{f_{X^k|X_k,X_{k+1},\cdots}(X^k|X_k,X_{k+1},\cdots)}{f_{X^k|X_k^l}(X^k|X_k^l)} < \infty.$$

This implies that

$$P_{X^k\times(X_{k+l},\cdots)|X_k^l} \gg P_{X_0,X_1,\cdots}$$

with

$$\frac{dP_{X_0,X_1,\cdots}}{dP_{X^k\times(X_{k+l},\cdots)|X_k^l}} = \frac{f_{X^k|X_k,X_{k+1},\cdots}(X^k|X_k,X_{k+1},\cdots)}{f_{X^k|X_k^l}(X^k|X_k^l)}.$$

Restricting the measures to $X^n$ for $n>k+l$ yields

$$\frac{dP_{X^n}}{dP_{X^k\times(X_{k+l},\cdots,X_{n-1})|X_k^l}} = \frac{f_{X^k|X_k,X_{k+1},\cdots,X_{n-1}}(X^k|X_k,X_{k+1},\cdots,X_{n-1})}{f_{X^k|X_k^l}(X^k|X_k^l)},$$

whose logarithm is the conditional information density $i(X^k;(X_{k+l},\cdots,X_{n-1})|X_k^l)$. With this setup the lemma follows immediately from Theorem 5.4.1. □

The following theorem generalizes Lemma 3.4.2 and will yield the general theorem. It was first proved by Barron [9] using martingale inequalities.
Theorem 8.4.1 Suppose that p and m are distributions of a standard alphabet process $\{X_n\}$ such that p is AMS and m is k-step Markov. Let $\bar p$ be a stationary measure that asymptotically dominates p (e.g., the stationary mean). Suppose that $P_{X^n}$, $\bar P_{X^n}$, and $M_{X^n}$ are the distributions induced by p, $\bar p$, and m, that $M_{X^n}$ dominates both $P_{X^n}$ and $\bar P_{X^n}$ for all n, and that $f_{X^n}$ and $\bar f_{X^n}$ are the corresponding densities. If there is an invariant function h such that

$$\lim_{n\to\infty}\frac{1}{n}\ln\bar f_{X^n}(X^n) = h;\quad \bar p\text{-a.e.},$$

then also

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) = h;\quad p\text{-a.e.}$$

Proof: For any k and $n\ge k$ we can write, using the chain rule for densities,

$$\frac{1}{n}\ln f_{X^n} - \frac{1}{n}\ln f_{X_k^{n-k}} = \frac{1}{n}\ln f_{X^k|X_k^{n-k}}.$$

Since for $k\le l<n$

$$\frac{1}{n}\ln f_{X^k|X_k^{n-k}} = \frac{1}{n}\ln f_{X^k|X_k^l} + \frac{1}{n}i(X^k;(X_{k+l},\cdots,X_{n-1})|X_k^l),$$

Lemma 8.4.1 and the fact that densities are finite with probability one imply that

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^k|X_k^{n-k}} = 0;\quad p\text{-a.e.}$$

This implies that there is a subsequence $k(n)\to\infty$ such that

$$\frac{1}{n}\ln f_{X^n}(X^n) - \frac{1}{n}\ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}) \to 0,\quad p\text{-a.e.}$$
To prove this, for each k choose N(k) large enough so that

$$p\left(\frac{1}{N(k)}\ln f_{X^k|X_k^{N(k)-k}}(X^k|X_k^{N(k)-k}) > 2^{-k}\right) \le 2^{-k}$$

and then let $k(n)=k$ for $N(k)\le n<N(k+1)$. Then from the Borel-Cantelli lemma we have for any $\epsilon>0$ that

$$p\left(\frac{1}{N(k)}\ln f_{X^k|X_k^{N(k)-k}}(X^k|X_k^{N(k)-k}) > \epsilon \text{ i.o.}\right) = 0$$

and hence

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) = \lim_{n\to\infty}\frac{1}{n}\ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)});\quad p\text{-a.e.}$$

In a similar manner we can also choose the sequence so that

$$\lim_{n\to\infty}\frac{1}{n}\ln \bar f_{X^n}(X^n) = \lim_{n\to\infty}\frac{1}{n}\ln \bar f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)});\quad \bar p\text{-a.e.}$$
From Markov's inequality

$$\bar p\left(\frac{1}{n}\ln f_{X_k^{n-k}}(X_k^{n-k}) \ge \frac{1}{n}\ln \bar f_{X_k^{n-k}}(X_k^{n-k}) + \epsilon\right)
= \bar p\left(\frac{f_{X_k^{n-k}}(X_k^{n-k})}{\bar f_{X_k^{n-k}}(X_k^{n-k})} \ge e^{n\epsilon}\right)$$
$$\le e^{-n\epsilon}\int d\bar p\,\frac{f_{X_k^{n-k}}(X_k^{n-k})}{\bar f_{X_k^{n-k}}(X_k^{n-k})}
= e^{-n\epsilon}\int dm\,f_{X_k^{n-k}}(X_k^{n-k}) = e^{-n\epsilon}.$$

Hence again invoking the Borel-Cantelli lemma we have that

$$\bar p\left(\frac{1}{n}\ln f_{X_k^{n-k}}(X_k^{n-k}) \ge \frac{1}{n}\ln \bar f_{X_k^{n-k}}(X_k^{n-k}) + \epsilon \text{ i.o.}\right) = 0$$

and therefore

$$\limsup_{n\to\infty}\frac{1}{n}\ln f_{X_k^{n-k}}(X_k^{n-k}) \le h,\quad \bar p\text{-a.e.} \eqno(8.18)$$

The above event is in the tail $\sigma$-field $\bigcap_n\sigma(X_n,X_{n+1},\cdots)$ since h is invariant, and $\bar p$ dominates p on the tail $\sigma$-field. Thus

$$\limsup_{n\to\infty}\frac{1}{n}\ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}) \le h;\quad p\text{-a.e.}$$

and hence

$$\limsup_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) \le h;\quad p\text{-a.e.},$$

which proves half of the theorem.
Since $\bar p$ asymptotically dominates p, given $\epsilon>0$ there is a k such that

$$p\left(\lim_{n\to\infty}\frac{1}{n}\ln \bar f_{X_k^{n-k}}(X_k^{n-k}) = h\right) \ge 1-\epsilon.$$

Again applying Markov's inequality and the Borel-Cantelli lemma as previously, we have that

$$\liminf_{n\to\infty}\frac{1}{n}\ln\frac{f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)})}{\bar f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)})} \ge 0;\quad p\text{-a.e.},$$

which implies that

$$p\left(\liminf_{n\to\infty}\frac{1}{n}\ln f_{X_{k(n)}^{n-k(n)}}(X_{k(n)}^{n-k(n)}) \ge h\right) \ge 1-\epsilon$$

and hence also that

$$p\left(\liminf_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) \ge h\right) \ge 1-\epsilon.$$

Since $\epsilon$ can be made arbitrarily small, this proves that p-a.e. $\liminf_{n\to\infty} n^{-1}\ln f_{X^n}(X^n) \ge h$, which completes the proof of the theorem. □
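The content of Theorem 8.4.1 can be illustrated with a simple AMS source: a binary Markov chain started from an arbitrary (hence generally nonstationary) initial state. Whatever the start, the sample relative entropy density with respect to an i.i.d. uniform reference measure has the same limit as under the stationary mean. The transition probabilities and the uniform reference below are illustrative assumptions, not quantities from the text.

```python
import math
import random

# Binary Markov chain with P(1|0)=a, P(0|1)=b, started deterministically.
# The sample density w.r.t. i.i.d. uniform m satisfies
# (1/n) ln f_{X^n} -> ln 2 - (entropy rate), independent of the start state.
a, b = 0.2, 0.4          # illustrative transition probabilities
n = 300_000

def sample_density(start, seed):
    rng = random.Random(seed)
    x, total = start, 0.0
    for _ in range(n):
        if x == 0:
            x_next = 1 if rng.random() < a else 0
            pr = a if x_next == 1 else 1 - a
        else:
            x_next = 0 if rng.random() < b else 1
            pr = b if x_next == 0 else 1 - b
        total += math.log(pr)
        x = x_next
    # ln f = ln p(X^n) + n ln 2; the single initial-state term is O(1) and dropped
    return total / n + math.log(2)

pi1 = a / (a + b)        # stationary probability of state 1
def hb(t):               # binary entropy in nats
    return -t * math.log(t) - (1 - t) * math.log(1 - t)
h = math.log(2) - ((1 - pi1) * hb(a) + pi1 * hb(b))

print(sample_density(0, 1), sample_density(1, 2), h)
```

Both runs approach the same invariant limit h, as the theorem asserts for an AMS source and its stationary mean.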
We can now extend the ergodic theorem for relative entropy densities to the general AMS case.

Corollary 8.4.1: Given the assumptions of Theorem 8.4.1,

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) = \bar H_{\bar p_\psi\|m}(X),$$

where $\bar p_\psi$ is the ergodic component of the stationary mean $\bar p$ of p.

Proof: The proof follows immediately from Theorem 8.4.1 and Corollary 8.3.1, the ergodic theorem for the relative entropy density for the stationary mean. □

8.5 Ergodic Theorems for Information Densities

As an application of the general theorem we prove an ergodic theorem for mutual
information densities for stationary and ergodic sources. The result can be
extended to AMS sources in the same manner that the results of Section 8.3
were extended to those of Section 8.4. As the stationary and ergodic result
suﬃces for the coding theorems and the AMS conditions are messy, only the
stationary case is considered here. The result is due to Barron [9].
Theorem 8.5.1 Let $\{X_n,Y_n\}$ be a stationary ergodic pair random process with standard alphabet. Let $P_{X^nY^n}$, $P_{X^n}$, and $P_{Y^n}$ denote the induced distributions and assume that for all n $P_{X^n}\times P_{Y^n} \gg P_{X^nY^n}$, and hence the information densities

$$i_n(X^n;Y^n) = \ln\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{Y^n})}$$

are well defined. Assume in addition that both the $\{X_n\}$ and $\{Y_n\}$ processes have the finite-gap information property of (8.8), and hence, by the comment following Corollary 7.3.1, there is a K such that both processes satisfy the K-gap property

$$I(X_K;X^-|X^K) < \infty,\qquad I(Y_K;Y^-|Y^K) < \infty.$$

Then

$$\lim_{n\to\infty}\frac{1}{n}i_n(X^n;Y^n) = \bar I(X;Y);\quad p\text{-a.e.}$$

Proof: Let $Z_n=(X_n,Y_n)$. Let $M_{X^n}=P^{(K)}_{X^n}$ and $M_{Y^n}=P^{(K)}_{Y^n}$ denote the K-th order Markov approximations of $\{X_n\}$ and $\{Y_n\}$, respectively. The finite-gap property implies, as in Section 8.3, that the densities

$$f_{X^n} = \frac{dP_{X^n}}{dM_{X^n}} \qquad\text{and}\qquad f_{Y^n} = \frac{dP_{Y^n}}{dM_{Y^n}}$$

are well defined. From Theorem 8.2.1

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) = \bar H_{p_X\|p_X^{(K)}}(X) = I(X_K;X^-|X^K) < \infty,$$
$$\lim_{n\to\infty}\frac{1}{n}\ln f_{Y^n}(Y^n) = \bar H_{p_Y\|p_Y^{(K)}}(Y) = I(Y_K;Y^-|Y^K) < \infty.$$
Define the measures $M_{Z^n}$ by $M_{X^n}\times M_{Y^n}$. Then this is a K-step Markov source and since

$$M_{X^n}\times M_{Y^n} \gg P_{X^n}\times P_{Y^n} \gg P_{X^n,Y^n} = P_{Z^n},$$

the density

$$f_{Z^n} = \frac{dP_{Z^n}}{dM_{Z^n}}$$

is well defined and from Theorem 8.2.1 has a limit

$$\lim_{n\to\infty}\frac{1}{n}\ln f_{Z^n}(Z^n) = H_{p\|m}(Z_0|Z^-).$$

If the density $i_n(X^n;Y^n)$ is infinite for any n, then it is infinite for all larger n and convergence is trivially to the infinite information rate. If it is finite, the chain rule for densities yields

$$\frac{1}{n}i_n(X^n;Y^n) = \frac{1}{n}\ln f_{Z^n}(Z^n) - \frac{1}{n}\ln f_{X^n}(X^n) - \frac{1}{n}\ln f_{Y^n}(Y^n)$$
$$\mathop{\longrightarrow}_{n\to\infty}\; \bar H_{p\|p^{(K)}}(X,Y) - \bar H_{p\|p^{(K)}}(X) - \bar H_{p\|p^{(K)}}(Y).$$

The limit is not indeterminate (of the form $\infty-\infty$) because the two subtracted terms are finite. Since convergence is to a constant, the constant must also be the limit of the expected values of $n^{-1}i_n(X^n;Y^n)$, that is, $\bar I(X;Y)$. □
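In the simplest jointly i.i.d. case the theorem reduces to a law of large numbers for single-letter information densities, which can be checked directly: take $X_n$ a fair coin and $Y_n$ its output through a memoryless binary symmetric channel, so that $\bar I(X;Y)=\ln 2 - h_b(\epsilon)$ nats. The crossover probability below is an arbitrary illustrative choice.

```python
import math
import random

# (1/n) i_n(X^n;Y^n) -> I(X;Y) for a jointly i.i.d. pair process:
# X fair coin, Y = X xor N with N ~ Bernoulli(eps) independent of X.
eps = 0.1                  # assumed crossover probability (illustrative)
n = 200_000
rng = random.Random(0)

total = 0.0
for _ in range(n):
    x = rng.randrange(2)
    y = x ^ (1 if rng.random() < eps else 0)
    pxy = 0.5 * (1 - eps) if x == y else 0.5 * eps
    total += math.log(pxy / (0.5 * 0.5))   # single-letter information density

sample_info = total / n
I = math.log(2) - (-eps * math.log(eps) - (1 - eps) * math.log(1 - eps))
print(sample_info, I)
```

The empirical average lands near $\bar I(X;Y)\approx 0.368$ nats; the theorem extends this convergence to stationary ergodic pair processes with memory, where the ergodic theorem replaces independence.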
Chapter 9

Channels and Codes

9.1 Introduction

We have considered a random process or source $\{X_n\}$ as a sequence of random
entities, where the object produced at each time could be quite general, e.g.,
a random variable, vector, or waveform. Hence sequences of pairs of random
objects such as {Xn , Yn } are included in the general framework. We now focus
on the possible interrelations between the two components of such a pair process.
In particular, we consider the situation where we begin with one source, say
{Xn }, called the input and use either a random or a deterministic mapping to
form a new source {Yn }, called the output. We generally refer to the mapping
as a channel if it is random and a code if it is deterministic. Hence a code is
a special case of a channel and results for channels will immediately imply the
corresponding results for codes. The initial point of interest will be conditions
on the structure of the channel under which the resulting pair process {Xn , Yn }
will inherit stationarity and ergodic properties from the original source {Xn }.
We will also be interested in the behavior resulting when the output of one
channel serves as the input to another, that is, when we form a new channel
as a cascade of other channels. Such cascades yield models of a communication
system which typically has a code mapping (called the encoder) followed by a
channel followed by another code mapping (called the decoder).
A fundamental nuisance in the development is the notion of time. So far we
have considered pair processes where at each unit of time, one random object is
produced for each coordinate of the pair. In the channel or code example, this
corresponds to one output for every input. Interesting communication systems
do not always easily ﬁt into this framework, and this can cause serious problems
in notation and in the interpretation and development of results. For example,
suppose that an input source consists of a sequence of real numbers and let
T denote the time shift on the real sequence space. Suppose that the output
source consists of a binary sequence and let S denote its shift. Suppose also
that the channel is such that for each input real number, three binary symbols are produced. This fits our usual framework if we consider each output variable to
consist of a binary three-tuple, since then there is one output vector for each
input symbol. One must be careful, however, when considering the stationarity
of such a system. Do we consider the output process to be physically stationary
if it is stationary with respect to S or with respect to S 3 ? The former might
make more sense if we are looking at the output alone, the latter if we are looking
at the output in relation to the input. How do we deﬁne stationarity for the pair
process? Given two sequence spaces, we might ﬁrst construct a shift on the pair
sequence space as simply the cartesian product of the shifts, e.g., given an input
sequence x and an output sequence y deﬁne a shift T ∗ by T ∗ (x, y ) = (T x, Sy ).
While this might seem natural given simply the pair random process {Xn , Yn },
it is not natural in the physical context that one symbol of X yields three
symbols of Y . In other words, the two shifts do not correspond to the same
amount of time. Here the more physically meaningful shift on the pair space
would be $T(x,y) = (Tx, S^3y)$ and the more physically meaningful questions on
stationarity and ergodicity relate to T and not to T ∗ . The problem becomes
even more complicated when channels or codes produce a varying number of
output symbols for each input symbol, where the number of symbols depends
on the input sequence. Such variable rate codes arise often in practice, especially
for noiseless coding applications such as Huffman, Lempel-Ziv, and arithmetic
codes. (See [142] for a survey of noiseless coding.) While we will not treat such
variable rate systems in any detail, they point out the diﬃculty that can arise
associating the mathematical shift operation with physical time when we are
considering cartesian products of spaces, each having their own shift.
There is no easy way to solve this problem notationally. We adopt the
following view as a compromise which is usually adequate for fixed-rate systems.
We will be most interested in pair processes that are stationary in the physical
sense, that is, whose statistics are not changed when both are shifted by an
equal amount of physical time. This is the same as stationarity with respect
to the product shift if the two shifts correspond to equal amounts of physical
time. Hence for simplicity we will usually focus on this case. More general cases
will be introduced when appropriate to point out their form and how they can
be put into the matching shift structure by considering groups of symbols and
diﬀerent shifts. This will necessitate occasional discussions about what is meant
by stationarity or ergodicity for a particular system.
The mathematical generalizations of Shannon's original notions of sources, codes, and channels are due to Khinchine [73] [74]. Khinchine's results characterizing stationarity and ergodicity of channels were corrected and developed by Adler [2].

9.2 Channels

Say we are given a source $[A,X,\mu]$, that is, a sequence of A-valued random variables $\{X_n;n\in T\}$ defined on a common probability space $(\Omega,\mathcal F,P)$ having a process distribution $\mu$ defined on the measurable sequence space $(A^T,\mathcal B_A^T)$. We shall let $X=\{X_n;n\in T\}$ denote the sequence-valued random variable,
that is, the random variable taking values in $A^T$ according to the distribution $\mu$. Let B be another alphabet with a corresponding measurable sequence space $(B^T,\mathcal B_B^T)$. We assume as usual that A and B are standard, and hence so are their sequence spaces and cartesian products. A channel $[A,\nu,B]$ with input alphabet A and output alphabet B (we denote the channel simply by $\nu$ when these alphabets are clear from context) is a family of probability measures $\{\nu_x;x\in A^T\}$ on $(B^T,\mathcal B_B^T)$ (the output sequence space) such that for every output event $F\in\mathcal B_B^T$, $\nu_x(F)$ is a measurable function of x. This measurability requirement ensures that the set function p specified on the joint input/output space $(A^T\times B^T,\mathcal B_A^T\times\mathcal B_B^T)$ by its values on rectangles as

$$p(G\times F) = \int_G d\mu(x)\,\nu_x(F);\quad F\in\mathcal B_B^T,\ G\in\mathcal B_A^T,$$

is well defined. The set function p is nonnegative, normalized, and countably additive on the field generated by the rectangles $G\times F$, $G\in\mathcal B_A^T$, $F\in\mathcal B_B^T$. Thus p extends to a probability measure on the joint input/output space, which is sometimes called the hookup of the source $\mu$ and channel $\nu$. We will often denote this joint measure by $\mu\nu$. The corresponding sequences of random variables
are called the input/output process.
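For finite alphabets the hookup formula is just a finite sum over input letters, which the following sketch makes concrete for a single-letter input space and a hypothetical binary symmetric channel (the input distribution and crossover probability are illustrative assumptions, not quantities from the text):

```python
import math

# Single-letter hookup: p(G x F) = sum_{x in G} mu(x) * nu_x(F)
mu = {0: 0.7, 1: 0.3}     # assumed input distribution
delta = 0.1               # assumed BSC crossover probability

def nu(x, F):
    # channel: probability under nu_x of the output event F
    return sum((1 - delta) if y == x else delta for y in F)

def hookup(G, F):
    return sum(mu[x] * nu(x, F) for x in G)

full_B = {0, 1}
print(hookup({0, 1}, full_B))   # -> 1.0 (total mass)
print(hookup({0}, full_B))      # -> mu(0) = 0.7 (input marginal recovered)
print(hookup({0, 1}, {1}))      # output marginal eta({1})
```

The last line computes the output distribution $\eta(F)=p(A^T\times F)$ restricted to one letter; the first two verify normalization and that the input marginal $\mu(G)=p(G\times B^T)$ is recovered, exactly as in the measure-theoretic construction.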
Thus a channel is a probability measure on the output sequence space for each input sequence such that a joint input/output probability measure is well defined. The above equation shows that a channel is simply a regular conditional probability; in particular,

$$\nu_x(F) = p((x,y): y\in F\,|\,x);\quad F\in\mathcal B_B^T,\ x\in A^T.$$

We can relate a channel to the notation used previously for conditional distributions by using the sequence-valued random variables $X=\{X_n;n\in T\}$ and $Y=\{Y_n;n\in T\}$:

$$\nu_x(F) = P_{Y|X}(F|x). \eqno(9.1)$$

Eq. (1.26) then provides the probability of an arbitrary input/output event:

$$p(F) = \int d\mu(x)\,\nu_x(F_x),$$

where $F_x=\{y:(x,y)\in F\}$ is the section of F at x.

If we start with a hookup p, then we can obtain the input distribution $\mu$ as

$$\mu(F) = p(F\times B^T);\quad F\in\mathcal B_A^T.$$

Similarly we can obtain the output distribution, say $\eta$, via

$$\eta(F) = p(A^T\times F);\quad F\in\mathcal B_B^T.$$

Suppose one now starts with a pair process distribution p and hence also with the induced source distribution $\mu$. Does there exist a channel $\nu$ for which $p=\mu\nu$? The answer is yes, since the spaces are standard. One can always define the conditional probability $\nu_x(F)=P_{Y|X}(F|x)$ for all input sequences x, but this need not possess a regular version, that is, be a probability measure for all x, in the case of arbitrary alphabets. If the alphabets are standard, however,
we have seen that a regular conditional probability measure always exists.

9.3 Stationarity Properties of Channels

We now define a variety of stationarity properties for channels that are related
to, but not the same as, those for sources. The motivation behind the various deﬁnitions is that stationarity properties of channels coupled with those
of sources should imply stationarity properties for the resulting source-channel
hookups.
The classical deﬁnition of a stationary channel is the following: Suppose that
we have a channel [A, ν, B ] and suppose that TA and TB are the shifts on the
input sequence space and output sequence space, respectively. The channel is
stationary with respect to $T_A$ and $T_B$, or $(T_A,T_B)$-stationary, if

$$\nu_x(T_B^{-1}F) = \nu_{T_Ax}(F),\quad x\in A^T,\ F\in\mathcal B_B^T. \eqno(9.2)$$

If the transformations are clear from context then we simply say that the channel is stationary. Intuitively, a right shift of an output event yields the same probability as the left shift of an input event. The different shifts are required because in general only $T_Ax$ and not $T_A^{-1}x$ exists, since the shift may not be invertible, and in general only $T_B^{-1}F$ and not $T_BF$ exists for the same reason. If the shifts are invertible, e.g., the processes are two-sided, then the definition is equivalent to

$$\nu_{T_Ax}(T_BF) = \nu_{T_A^{-1}x}(T_B^{-1}F) = \nu_x(F),\quad \text{all } x\in A^T,\ F\in\mathcal B_B^T, \eqno(9.3)$$

that is, shifting the input sequence and output event in the same direction does not change the probability.
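For a deterministic channel induced by a sliding-block code f (see Example 9.4.3 below), condition (9.2) amounts to the code commuting with the shifts, $f(T_Ax)=T_Bf(x)$. The sketch below checks this on finite windows for a hypothetical code $y_n = x_n \oplus x_{n+1}$; the code itself is an illustrative choice, not one used in the text.

```python
import random

# A sliding-block code f(x)_n = x_n xor x_{n+1} commutes with the shift,
# which is exactly the stationarity condition (9.2) for the deterministic
# channel it induces.
def code(x):
    # maps a length-m window to a length m-1 output window
    return [x[n] ^ x[n + 1] for n in range(len(x) - 1)]

def shift(x):
    # one-sided shift: drop the first symbol
    return x[1:]

rng = random.Random(0)
x = [rng.randrange(2) for _ in range(1000)]

# encode-then-shift agrees with shift-then-encode on the common window
print(shift(code(x)) == code(shift(x)))   # -> True
```

Any code computed by applying a fixed map to a sliding window has this property, which is why sliding-block codes furnish the standard examples of stationary deterministic channels.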
The fundamental importance of the stationarity of a channel is contained in
the following lemma.
Lemma 9.3.1 If a source [A, µ], stationary with respect to TA , is connected
to channel [A, ν, B ], stationary with respect to TA and TB , then the resulting
hookup µν is also stationary (with respect to the cartesian product shift T =
TA×B = TA × TB deﬁned by T (x, y ) = (TA x, TB y )).
Proof: We have that

$$\mu\nu(T^{-1}F) = \int d\mu(x)\,\nu_x((T^{-1}F)_x).$$

Now

$$(T^{-1}F)_x = \{y: T(x,y)\in F\} = \{y:(T_Ax,T_By)\in F\} = \{y: T_By\in F_{T_Ax}\} = T_B^{-1}F_{T_Ax}$$

and hence

$$\mu\nu(T^{-1}F) = \int d\mu(x)\,\nu_x(T_B^{-1}F_{T_Ax}).$$

Since the channel is stationary, however, this becomes

$$\mu\nu(T^{-1}F) = \int d\mu(x)\,\nu_{T_Ax}(F_{T_Ax}) = \int d\mu T_A^{-1}(x)\,\nu_x(F_x),$$

where we have used the change of variables formula. Since $\mu$ is stationary, however, the right-hand side is

$$\int d\mu(x)\,\nu_x(F_x) = \mu\nu(F),$$

which proves the lemma. □

Suppose next that we are told that a hookup $\mu\nu$ is stationary. Does it then follow that the source $\mu$ and channel $\nu$ are necessarily stationary? The source must be, since

$$\mu(T_A^{-1}F) = \mu\nu((T_A\times T_B)^{-1}(F\times B^T)) = \mu\nu(F\times B^T) = \mu(F).$$

The channel need not be stationary, however, since, for example, the stationarity
could be violated on a set of µ measure 0 without aﬀecting the proof of the
above lemma. This suggests a somewhat weaker notion of stationarity which is
more directly related to the stationarity of the hookup. We say that a channel
[A, ν, B ] is stationary with respect to a source [A, µ] if µν is stationary. We also
state that a channel is stationary µa.e. if it satisﬁes (9.2) for all x in a set of
µprobability one. If a channel is stationary µa.e. and µ is stationary, then
the channel is also stationary with respect to µ. Clearly a stationary channel
is stationary with respect to all stationary sources. The reason for this more
general view is that we wish to extend the deﬁnition of stationary channels to
asymptotically mean stationary channels. The general deﬁnition extends; the
classical deﬁnition of stationary channels does not.
Observe that the various definitions of stationarity of channels immediately extend to block shifts, since they hold for any shifts defined on the input and output sequence spaces; e.g., a channel stationary with respect to $T_A^N$ and $T_B^K$ could be a reasonable model for a channel or code that puts out K symbols from an alphabet B every time it takes in N symbols from an alphabet A. We shorten the name $(T_A^N,T_B^K)$-stationary to (N, K)-stationary channel in this case. A stationary channel (without modifiers) is simply a (1,1)-stationary channel in this sense.
The most general notion of stationarity that we are interested in is that of asymptotic mean stationarity. We define a channel $[A,\nu,B]$ to be asymptotically mean stationary or AMS for a source $[A,\mu]$ with respect to $T_A$ and $T_B$ if the hookup $\mu\nu$ is AMS with respect to the product shift $T_A\times T_B$. As in the stationary case, an immediate necessary condition is that the input source be AMS with respect to $T_A$. A channel will be said to be $(T_A,T_B)$-AMS if the hookup is $(T_A,T_B)$-AMS for all $T_A$-AMS sources.

The following lemma shows that an AMS channel is indeed a generalization
of the idea of a stationary channel and that the stationary mean of a hookup of
an AMS source to a stationary channel is simply the hookup of the stationary
mean of the source to the channel.
Lemma 9.3.2 Suppose that $\nu$ is $(T_A,T_B)$-stationary and that $\mu$ is AMS with respect to $T_A$. Let $\bar\mu$ denote the stationary mean of $\mu$ and observe that $\bar\mu\nu$ is stationary. Then the hookup $\mu\nu$ is AMS with stationary mean

$$\overline{\mu\nu} = \bar\mu\nu.$$

Thus, in particular, $\nu$ is an AMS channel.

Proof: We have that

$$(T^{-i}F)_x = \{y:(x,y)\in T^{-i}F\} = \{y:T^i(x,y)\in F\} = \{y:(T_A^ix,T_B^iy)\in F\} = \{y:T_B^iy\in F_{T_A^ix}\} = T_B^{-i}F_{T_A^ix}$$

and therefore, since $\nu$ is stationary,

$$\mu\nu(T^{-i}F) = \int d\mu(x)\,\nu_x(T_B^{-i}F_{T_A^ix}) = \int d\mu(x)\,\nu_{T_A^ix}(F_{T_A^ix}) = \int d\mu T_A^{-i}(x)\,\nu_x(F_x).$$

Therefore

$$\frac{1}{n}\sum_{i=0}^{n-1}\mu\nu(T^{-i}F) = \frac{1}{n}\sum_{i=0}^{n-1}\int d\mu T_A^{-i}(x)\,\nu_x(F_x) \mathop{\longrightarrow}_{n\to\infty} \int d\bar\mu(x)\,\nu_x(F_x) = \bar\mu\nu(F)$$

from Lemma 6.5.1 of [50]. This proves that $\mu\nu$ is AMS and that the stationary mean is $\bar\mu\nu$. □
A ﬁnal property crucial to quantifying the behavior of random processes is
that of ergodicity. Hence we deﬁne a (stationary, AMS) channel ν to be ergodic
with respect to (TA , TB ) if it has the property that whenever a (stationary, AMS)
ergodic source (with respect to TA ) is connected to the channel, the overall
input/output process is (stationary, AMS) ergodic. The following modiﬁcation
of Lemma 6.7.4 of [50] is the principal tool for proving a channel to be ergodic.
Lemma 9.3.3 An AMS (stationary) channel $[A,\nu,B]$ is ergodic if for all AMS (stationary) sources $\mu$ and all sets of the form $F=F_A\times F_B$, $G=G_A\times G_B$ for rectangles $F_A,G_A\in\mathcal B_A^\infty$ and $F_B,G_B\in\mathcal B_B^\infty$ we have that for $p=\mu\nu$

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1}p(T_{A\times B}^{-i}F\cap G) = \bar p(F)\bar p(G), \eqno(9.4)$$

where $\bar p$ is the stationary mean of p (p itself if p is already stationary).
Proof: The proof parallels that of Lemma 6.7.4 of [50]. The result does not follow immediately from that lemma since the collection of given sets does not itself form a field. Arbitrary events $F,G\in\mathcal B_{A\times B}^\infty$ can be approximated arbitrarily closely by events in the field generated by the above rectangles, and hence given $\epsilon>0$ we can find finite disjoint rectangles of the given form $F_i$, $G_i$, $i=1,\cdots,L$, such that if $F_0=\bigcup_{i=1}^LF_i$ and $G_0=\bigcup_{i=1}^LG_i$, then $p(F\Delta F_0)$, $p(G\Delta G_0)$, $\bar p(F\Delta F_0)$, and $\bar p(G\Delta G_0)$ are all less than $\epsilon$. Then

$$\left|\frac{1}{n}\sum_{k=0}^{n-1}p(T^{-k}F\cap G) - \bar p(F)\bar p(G)\right| \le
\left|\frac{1}{n}\sum_{k=0}^{n-1}p(T^{-k}F\cap G) - \frac{1}{n}\sum_{k=0}^{n-1}p(T^{-k}F_0\cap G_0)\right|$$
$$+\left|\frac{1}{n}\sum_{k=0}^{n-1}p(T^{-k}F_0\cap G_0) - \bar p(F_0)\bar p(G_0)\right| + \left|\bar p(F_0)\bar p(G_0) - \bar p(F)\bar p(G)\right|.$$

Exactly as in Lemma 6.7.4 of [50], the rightmost term is bounded above by $2\epsilon$ and the first term on the left goes to zero as $n\to\infty$. The middle term is the absolute magnitude of

$$\frac{1}{n}\sum_{k=0}^{n-1}p\Bigl(T^{-k}\bigcup_iF_i\cap\bigcup_jG_j\Bigr) - \bar p\Bigl(\bigcup_iF_i\Bigr)\bar p\Bigl(\bigcup_jG_j\Bigr) = \sum_{i,j}\left(\frac{1}{n}\sum_{k=0}^{n-1}p(T^{-k}F_i\cap G_j) - \bar p(F_i)\bar p(G_j)\right).$$

Each term in the finite sum converges to 0 by assumption. Thus p is ergodic from Lemma 6.7.4 of [50]. □
Because of the speciﬁc class of sets chosen, the above lemma considered
separate sets for shifting and remaining ﬁxed, unlike using the same set for
both purposes as in Lemma 6.7.4 of [50]. This was required so that the cross
products in the final sum considered would converge accordingly.

9.4 Examples of Channels

In this section a variety of examples of channels are introduced, ranging from the
trivially simple to the very complicated. The ﬁrst two channels are the simplest,
the ﬁrst being perfect and the second being useless (at least for communication
purposes). 168 CHAPTER 9. CHANNELS AND CODES Example 9.4.1: Noiseless Channel
A channel [A, ν, B] is said to be noiseless if A = B and

$$\nu_x(F) = \begin{cases} 1 & x \in F \\ 0 & x \notin F, \end{cases}$$

that is, with probability one the channel puts out what goes in. Such a channel
is clearly stationary and ergodic.

Example 9.4.2: Completely Random Channel
Suppose that η is a probability measure on the output space (B T , BB T ) and
deﬁne a channel
νx (F ) = η (F ), F ∈ BB T , x ∈ AT .
Then it is easy to see that the input/output measure satisﬁes
p(G × F ) = η (F )µ(G); F ∈ BB T , G ∈ BA T ,
and hence the input/output measure is a product measure and the input and
output sequences are therefore independent of each other. This channel is called
a completely random channel or product channel because the output is independent of the input. This channel is quite useless because the output tells us
nothing of the input. The completely random channel is stationary (AMS) if
the measure η is stationary (AMS). Perhaps surprisingly, such a channel need
not be ergodic even if η is ergodic since the product of two stationary and ergodic sources need not be ergodic. (See, e.g., [21].) We shall later see that if η
is also assumed to be weakly mixing, then the resulting channel is ergodic.
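As an illustration (not from the text), the contrast between these two extreme channels can be seen in a small simulation; the binary alphabet, the uniform output measure η, and the function names are assumptions made for this sketch.

```python
import random

def noiseless_channel(x):
    # Example 9.4.1: with probability one the output equals the input.
    return list(x)

def completely_random_channel(x, rng, alphabet=(0, 1)):
    # Example 9.4.2: the output is drawn from a fixed measure eta
    # (here i.i.d. uniform bits), ignoring the input entirely.
    return [rng.choice(alphabet) for _ in x]

rng = random.Random(0)
x = [rng.choice((0, 1)) for _ in range(10_000)]

assert noiseless_channel(x) == x          # perfect transmission

y = completely_random_channel(x, rng)
# agreement with the input is only at chance level
agree = sum(a == b for a, b in zip(x, y)) / len(x)
print(round(agree, 2))
```

The second channel illustrates why it is "useless": the empirical agreement with the input hovers near 1/2, exactly what guessing would achieve.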
A generalization of the noiseless channel that is of much greater interest is
the deterministic channel. Here the channel is not random, but the output is
formed by a general mapping of the input rather than being the input itself.

Example 9.4.3: Deterministic Channel and Sequence Coders
A channel [A, ν, B ] is said to be deterministic or nonrandom if each input string
is mapped into a ﬁxed output string, that is, if there is a mapping f : AT → B T
such that
$$\nu_x(G) = \begin{cases} 1 & f(x) \in G \\ 0 & f(x) \notin G. \end{cases}$$

The mapping f must be measurable in order to satisfy the measurability assumption of the channel. Note that such a channel can also be written as

$$\nu_x(G) = 1_{f^{-1}(G)}(x).$$

Define a sequence coder as a deterministic channel, that is, a measurable
mapping from one sequence space into another. It is easy to see that for a
deterministic code we have a hookup speciﬁed by
$$p(F \times G) = \mu(F \cap f^{-1}(G))$$

and an output process with distribution

$$\eta(G) = \mu(f^{-1}(G)).$$

A sequence coder is said to be (T_A, T_B)-stationary (or just stationary) or (T_A^N, T_B^K)-stationary (or just (N, K)-stationary) if the corresponding channel is. Thus a sequence coder f is stationary if and only if f(T_A x) = T_B f(x) and it is (N, K)-stationary if and only if f(T_A^N x) = T_B^K f(x).
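A deterministic channel can be sketched directly from the indicator form ν_x(G) = 1_{f^{-1}(G)}(x); the coordinatewise parity coder below is an illustrative choice, not one used by the text.

```python
def sequence_coder(x):
    # illustrative f: A^T -> B^T, here the coordinatewise parity map
    return tuple(xi % 2 for xi in x)

def nu(x, G):
    # deterministic channel: nu_x(G) = 1 if f(x) in G, else 0
    return 1 if sequence_coder(x) in G else 0

x = (3, 4, 7, 2)
G = {(1, 0, 1, 0)}            # an output event (a set of output sequences)
assert nu(x, G) == 1          # f(x) = (1, 0, 1, 0) lands in G
assert nu((2, 2, 2, 2), G) == 0
```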
Lemma 9.4.1 A stationary deterministic channel is ergodic.
Proof: From Lemma 9.3.3 it suffices to show that

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} p(T_{A\times B}^{-i}F \cap G) = p(F)\,p(G)$$

for all rectangles of the form F = F_A × F_B, F_A ∈ B_A^T, F_B ∈ B_B^T and G = G_A × G_B. Then

$$p(T_{A\times B}^{-i}F \cap G) = p\big((T_A^{-i}F_A \cap G_A) \times (T_B^{-i}F_B \cap G_B)\big) = \mu\big((T_A^{-i}F_A \cap G_A) \cap f^{-1}(T_B^{-i}F_B \cap G_B)\big).$$

Since f is stationary and since inverse images preserve set theoretic operations,

$$f^{-1}(T_B^{-i}F_B \cap G_B) = T_A^{-i}f^{-1}(F_B) \cap f^{-1}(G_B)$$

and hence

$$\frac{1}{n}\sum_{i=0}^{n-1} p(T_{A\times B}^{-i}F \cap G) = \frac{1}{n}\sum_{i=0}^{n-1}\mu\big(T_A^{-i}(F_A \cap f^{-1}(F_B)) \cap G_A \cap f^{-1}(G_B)\big)$$
$$\mathop{\longrightarrow}_{n\to\infty}\ \mu(F_A \cap f^{-1}(F_B))\,\mu(G_A \cap f^{-1}(G_B)) = p(F_A \times F_B)\,p(G_A \times G_B)$$

since µ is ergodic. This means that the rectangles meet the required condition.
Some algebra then will show that ﬁnite unions of disjoint sets meeting the
conditions also meet the conditions and that complements of sets meeting the
conditions also meet them. This implies from the good sets principle (see, for
example, p. 14 of [50]) that the ﬁeld generated by the rectangles also meets the
condition and hence the lemma is proved.
□

A stationary sequence coder has a simple and useful structure. Suppose one
has a mapping f : AT → B , that is, a mapping that maps an input sequence into
an output letter. We can deﬁne a complete output sequence y corresponding to
an input sequence x by

$$y_i = f(T_A^i x); \quad i \in T, \qquad (9.5)$$

that is, we produce an output, then shift or slide the input sequence by one time unit, and then we produce another output using the same function, and so on. A mapping of this form is called an infinite length sliding block code because it produces outputs by successively sliding an infinite length input sequence and each time using a fixed mapping to produce the output. The sequence-to-letter mapping implies a sequence coder, say f̄, defined by f̄(x) = {f(T_A^i x); i ∈ T}. Furthermore, f̄(T_A x) = T_B f̄(x), that is, a sliding block code induces a stationary sequence coder. Conversely, any stationary sequence coder f̄ induces a sliding block code f for which (9.5) holds by the simple identification f(x) = (f̄(x))_0, the output at time 0 of the sequence coder. Thus the ideas of stationary sequence coders mapping sequences into sequences and sliding block codes mapping sequences into letters by sliding the input sequence are equivalent. We can
similarly deﬁne an (N, K )sliding block code which is a mapping f : AT → B K
which forms an output sequence y from an input sequence x via the construction

$$y_{iK}^K = f(T_A^{iN} x).$$

By a similar argument, (N, K)-sliding block coders are equivalent to (N, K)-stationary sequence coders. When dealing with sliding block codes we will usually assume for simplicity that K is 1. This involves no loss in generality since it can be made true by redefining the output alphabet.

Example 9.4.4: B Processes
The above construction using sliding block or stationary codes provides an easy
description of an important class of random processes that has several nice
properties. A process is said to be a B process or Bernoulli process if it can be
deﬁned as a stationary coding of an independent identically distributed (i.i.d.)
process. Let µ denote the original distribution of the i.i.d. process and let η
denote the induced output distribution. Then for any output events F and G
$$\eta(F \cap T_B^{-n}G) = \mu\big(\bar{f}^{-1}(F \cap T_B^{-n}G)\big) = \mu\big(\bar{f}^{-1}(F) \cap T_A^{-n}\bar{f}^{-1}(G)\big),$$

since f̄ is stationary. But µ is stationary and mixing since it is i.i.d. (see Section 6.7 of [50]) and hence this probability converges to

$$\mu(\bar{f}^{-1}(F))\,\mu(\bar{f}^{-1}(G)) = \eta(F)\,\eta(G)$$

and hence η is also mixing. Thus a B process is mixing of all orders and hence is ergodic with respect to T_B^n for all positive integers n.
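A minimal sketch of a B process, assuming an i.i.d. binary source and an illustrative two-symbol sliding-block function (the XOR window is an assumption of the sketch, not from the text):

```python
import random

def sliding_block_code(x, f, window):
    # stationary coding: y_i = f(x_i, ..., x_{i+window-1}) = f(T^i x)
    return [f(x[i:i + window]) for i in range(len(x) - window + 1)]

rng = random.Random(1)
x = [rng.randint(0, 1) for _ in range(50_000)]      # i.i.d. fair bits

# stationary coding of an i.i.d. process yields a B process
y = sliding_block_code(x, lambda w: w[0] ^ w[1], window=2)

# the coded process inherits stationarity; its marginal is again uniform here
p1 = sum(y) / len(y)
print(round(p1, 2))
```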
While codes that depend on inﬁnite input sequences may not at ﬁrst glance
seem to be a reasonable physical model of a coding system, it is possible for such codes to depend on the infinite sequence only through a finite number of
coordinates. In addition, some real codes may indeed depend on an unboundedly
large number of past inputs because of feedback.
Suppose that we consider twosided processes and that we have a measurable
mapping

$$\phi: \mathop{\times}_{i=-M}^{D} A_i \to B$$

and we define a sliding block code by

$$f(x) = \phi(x_{-M}, \cdots, x_0, \cdots, x_D);$$

then f̄ is a stationary sequence coder. The mapping φ is also called a sliding
block code or a ﬁnitelength sliding block code or a ﬁnitewindow sliding block
code. M is called the memory of the code and D is called the delay of the code
since M past source symbols and D future symbols are required to produce the
current output symbol. The window length or constraint length of the code is
M + D +1, the number of input symbols viewed to produce an output symbol. If
D = 0 the code is said to be causal. If M = 0 the code is said to be memoryless.
There is a problem with the above model if we wish to code a onesided
source since if we wish to start coding at time 0, there are no input symbols with
negative indices. Hence we either must require the code be memoryless (M = 0)
or we must redeﬁne the code for the ﬁrst M instances (e.g., by “stuﬃng” the
code register with arbitrary symbols) or we must only deﬁne the output for times
i ≥ M . For twosided sources a ﬁnitelength sliding block code is stationary.
In the onesided case it is not even deﬁned precisely unless it is memoryless, in
which case it is stationary.
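The memory/delay structure above can be sketched as follows; the majority-vote window function φ and the sample data are illustrative assumptions, and (as discussed for one-sided sources) outputs are produced only where the full window exists.

```python
def finite_window_code(x, phi, M, D):
    """Sliding block code with memory M and delay D (window M + D + 1):
    y_i = phi(x_{i-M}, ..., x_i, ..., x_{i+D}); outputs are defined only
    for M <= i < len(x) - D, where the window fits the data."""
    return [phi(x[i - M:i + D + 1]) for i in range(M, len(x) - D)]

# illustrative phi: majority vote over the window (M = 2, D = 1)
phi = lambda w: 1 if sum(w) * 2 > len(w) else 0
x = [0, 0, 1, 1, 1, 0, 0, 1]
y = finite_window_code(x, phi, M=2, D=1)
print(y)  # → [0, 1, 1, 0, 0]
```

With M = 0 the code is memoryless and with D = 0 it is causal, matching the definitions in the text.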
Another case of particular interest is when we have a measurable mapping
γ: A^N → B^K and we define a sequence coder f(x) = y by

$$y_{nK}^K = (y_{nK}, y_{nK+1}, \cdots, y_{(n+1)K-1}) = \gamma(x_{nN}^N),$$

that is, the input is parsed into nonoverlapping blocks of length N and each is
successively coded into a block of length K outputs without regard to past or
previous input or output blocks. Clearly N input time units must correspond
to K output time units in physical time if the code is to make sense. A code of
this form is called a block code and it is a special case of an (N, K) sliding block code. Such a code is trivially (T_A^N, T_B^K)-stationary.
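A block code in this sense can be sketched as below; the particular γ with N = 2, K = 3 (repeat the block and append a parity symbol) is a hypothetical example, not a code from the text.

```python
def block_code(x, gamma, N):
    """Block code: parse the input into nonoverlapping N-blocks and code
    each one with gamma: A^N -> B^K, independently of all other blocks."""
    return [b for i in range(0, len(x) - N + 1, N)
              for b in gamma(x[i:i + N])]

# illustrative gamma with N = 2, K = 3: repeat the block, append its parity
gamma = lambda blk: list(blk) + [blk[0] ^ blk[1]]
x = [1, 0, 0, 0, 1, 1]
print(block_code(x, gamma, N=2))  # → [1, 0, 1, 0, 0, 0, 1, 1, 0]
```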
We now return to genuinely random channels. The next example is perhaps
the most popular model for a noisy channel because of its simplicity.

Example 9.4.5: Memoryless Channels
Suppose that qx0 (·) is a probability measure on BB for all x0 ∈ A and that for
ﬁxed F ,qx0 (F ) is a measurable function of x0 . Let ν be a channel speciﬁed by
its values on output rectangles by
$$\nu_x\Big(\mathop{\times}_{i\in J} F_i\Big) = \prod_{i\in J} q_{x_i}(F_i),$$

for any finite index set J ⊂ T. Then ν is said to be a memoryless channel. Intuitively,

$$\Pr(Y_i \in F_i;\ i \in J \mid X) = \prod_{i\in J} \Pr(Y_i \in F_i \mid X_i).$$

For later use we pause to develop a useful inequality for mutual information
between the input and output of a memoryless channel. For contrast we also
describe the corresponding result for a memoryless source and an arbitrary
channel.
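A memoryless channel can be simulated by drawing each output letter from q_{x_i} independently; the binary symmetric channel with crossover 0.1 used below is an illustrative choice of q, not one fixed by the text.

```python
import random

def memoryless_channel(x, q, rng):
    # each output letter is drawn from q_{x_i}, independently over i
    return [q(xi, rng) for xi in x]

def bsc(xi, rng, eps=0.1):
    # illustrative q: binary symmetric channel, flip with probability eps
    return xi ^ (rng.random() < eps)

rng = random.Random(2)
x = [rng.randint(0, 1) for _ in range(100_000)]
y = memoryless_channel(x, bsc, rng)

# the empirical symbol error rate matches the crossover probability
err = sum(a != b for a, b in zip(x, y)) / len(x)
print(round(err, 2))
```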
Lemma 9.4.2 Let {Xn} be a source with distribution µ and let ν be a channel. Let {Xn, Yn} be the hookup with distribution p. If the channel is memoryless, then for any n

$$I(X^n; Y^n) \le \sum_{i=0}^{n-1} I(X_i; Y_i).$$

If instead the source is memoryless, then the inequality is reversed:

$$I(X^n; Y^n) \ge \sum_{i=0}^{n-1} I(X_i; Y_i).$$

Thus if both source and channel are memoryless,

$$I(X^n; Y^n) = \sum_{i=0}^{n-1} I(X_i; Y_i).$$

Proof: First suppose that the process is discrete. Then
$$I(X^n; Y^n) = H(Y^n) - H(Y^n \mid X^n).$$

Since by construction

$$P_{Y^n \mid X^n}(y^n \mid x^n) = \prod_{i=0}^{n-1} P_{Y_0 \mid X_0}(y_i \mid x_i),$$

an easy computation shows that

$$H(Y^n \mid X^n) = \sum_{i=0}^{n-1} H(Y_i \mid X_i).$$

This combined with the inequality

$$H(Y^n) \le \sum_{i=0}^{n-1} H(Y_i)$$

(Lemma 2.3.2 used several times) completes the proof of the memoryless channel result for finite alphabets. If instead the source is memoryless, we have

$$I(X^n; Y^n) = H(X^n) - H(X^n \mid Y^n) = \sum_{i=0}^{n-1} H(X_i) - H(X^n \mid Y^n).$$

Extending Lemma 2.3.2 to conditional entropy yields

$$H(X^n \mid Y^n) \le \sum_{i=0}^{n-1} H(X_i \mid Y^n),$$

which can be further overbounded by using Lemma 2.5.2 (the fact that reducing conditioning increases conditional entropy) as

$$H(X^n \mid Y^n) \le \sum_{i=0}^{n-1} H(X_i \mid Y_i),$$

which implies that

$$I(X^n; Y^n) \ge \sum_{i=0}^{n-1}\big[H(X_i) - H(X_i \mid Y_i)\big] = \sum_{i=0}^{n-1} I(X_i; Y_i),$$

which completes the proof for finite alphabets.
To extend the result to standard alphabets, ﬁrst consider the case where the
Y n are quantized to a ﬁnite alphabet. If the Yk are conditionally independent
given X k , then the same is true for q (Yk ), k = 0, 1, · · · , n − 1. Lemma 5.5.6 then
implies that as in the discrete case, I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n) and the
remainder of the proof follows as in the discrete case. Letting the quantizers
become asymptotically accurate then completes the proof.
□
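The equality case of Lemma 9.4.2 (memoryless source through a memoryless channel) can be checked numerically on a small discrete example; the Bernoulli(0.3) input and BSC(0.1) transition probabilities below are assumptions for the sketch.

```python
from math import log2
from itertools import product

def mutual_information(pxy):
    # I(X;Y) in bits for a joint pmf given as {(x, y): prob}
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

pX = {0: 0.7, 1: 0.3}                                  # i.i.d. input marginal
W = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}  # W[(x, y)] = P(y|x)

single = {(x, y): pX[x] * W[(x, y)] for x, y in product((0, 1), repeat=2)}

# joint distribution of (X^2, Y^2): everything factorizes by memorylessness
pair = {((x0, x1), (y0, y1)): single[(x0, y0)] * single[(x1, y1)]
        for x0, x1, y0, y1 in product((0, 1), repeat=4)}

I1 = mutual_information(single)
I2 = mutual_information(pair)
assert abs(I2 - 2 * I1) < 1e-12    # I(X^2;Y^2) = sum_i I(X_i;Y_i)
```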
In fact two forms of memorylessness are evident in a memoryless channel.
The channel is input memoryless in that the probability of an output event
involving {Yi ; i ∈ {k, k + 1, · · · , m}} does not involve any inputs before time k ,
that is, the past inputs. The channel is also input nonanticipatory since this
event does not depend on inputs after time m, that is, the future inputs. The
channel is also output memoryless in the sense that for any given input x, output
events involving nonoverlapping times are independent, i.e.,
$$\nu_x(Y_1 \in F_1,\ Y_2 \in F_2) = \nu_x(Y_1 \in F_1)\,\nu_x(Y_2 \in F_2).$$

We pin down these ideas in the following examples.

Example 9.4.6: Channels with Finite Input Memory and Anticipation
A channel ν is said to have ﬁnite input memory of order M if for all onesided
events F and all n

$$\nu_x((Y_n, Y_{n+1}, \cdots) \in F) = \nu_{x'}((Y_n, Y_{n+1}, \cdots) \in F)$$

whenever x_i = x'_i for i ≥ n − M. In other words, for an event involving Y_i's
after some time n, knowing only the inputs for the same times and M time
units earlier completely determines the output probability. Channels with ﬁnite
input memory were introduced by Feinstein [40]. Similarly ν is said to have
ﬁnite anticipation of order L if for all onesided events F and all n
$$\nu_x((\cdots, Y_n) \in F) = \nu_{x'}((\cdots, Y_n) \in F)$$

provided x_i = x'_i for i ≤ n + L. That is, at most L future inputs must be known
to determine the probability of an event involving current and past outputs.

Example 9.4.7: Channels with Finite Output Memory
A channel ν is said to have ﬁnite output memory of order K if for all onesided
events F and G and all inputs x, if k > K then
$$\nu_x\big((\cdots, Y_n) \in F \cap (Y_{n+k}, \cdots) \in G\big) = \nu_x((\cdots, Y_n) \in F)\,\nu_x((Y_{n+k}, \cdots) \in G);$$

that is, output events involving output samples separated by more than K time
units are independent. Channels with ﬁnite output memory were introduced by
Wolfowitz [152].
Channels with ﬁnite memory and anticipation are historically important as
the ﬁrst real generalizations of memoryless channels for which coding theorems
could be proved. Furthermore, the assumption of ﬁnite anticipation is physically reasonable as a model for realworld communication channels. The ﬁnite
memory assumptions, however, exclude many important examples, e.g., ﬁnitestate or Markov channels and channels with feedback ﬁltering action. Hence
we will emphasize more general notions which can be viewed as approximations
or asymptotic versions of the ﬁnite memory assumption. The generalization of
ﬁnite input memory channels requires some additional tools and is postponed
to the next chapter. The notion of ﬁnite output memory can be generalized by
using the notion of mixing.

Example 9.4.8: Output Mixing Channels
A channel is said to be output mixing (or asymptotically output independent
or asymptotically output memoryless) if for all output rectangles F and G and
all input sequences x
$$\lim_{n\to\infty}\big|\nu_x(T^{-n}F \cap G) - \nu_x(T^{-n}F)\,\nu_x(G)\big| = 0.$$

More generally it is said to be output weakly mixing if

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1}\big|\nu_x(T^{-i}F \cap G) - \nu_x(T^{-i}F)\,\nu_x(G)\big| = 0.$$

Unlike mixing systems, the above definitions for channels place conditions only
on output rectangles and not on all output events. Output mixing channels
were introduced by Adler [2].
The principal property of output mixing channels is provided by the following
lemma.
Lemma 9.4.3 If a channel is stationary and output weakly mixing, then it is
also ergodic. That is, if ν is stationary and output weakly mixing and if µ is
stationary and ergodic, then also µν is stationary and ergodic.
Proof: The process µν is stationary by Lemma 9.3.1. To prove that it is ergodic
it suﬃces from Lemma 9.3.3 to prove that for all input/output rectangles of the
form F = F_B × F_A, F_B ∈ B_B^T, F_A ∈ B_A^T, and G = G_B × G_A that

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1}\mu\nu(T^{-i}F \cap G) = \mu\nu(F)\,\mu\nu(G).$$

We have that

$$\frac{1}{n}\sum_{i=0}^{n-1}\mu\nu(T^{-i}F \cap G) - \mu\nu(F)\,\mu\nu(G) = \frac{1}{n}\sum_{i=0}^{n-1}\mu\nu\big((T_B^{-i}F_B \cap G_B)\times(T_A^{-i}F_A \cap G_A)\big) - \mu\nu(F_B\times F_A)\,\mu\nu(G_B\times G_A)$$
$$= \frac{1}{n}\sum_{i=0}^{n-1}\int_{T_A^{-i}F_A \cap G_A} d\mu(x)\,\nu_x(T_B^{-i}F_B \cap G_B) - \mu\nu(F_B\times F_A)\,\mu\nu(G_B\times G_A)$$
$$= \frac{1}{n}\sum_{i=0}^{n-1}\int_{T_A^{-i}F_A \cap G_A} d\mu(x)\,\big[\nu_x(T_B^{-i}F_B \cap G_B) - \nu_x(T_B^{-i}F_B)\,\nu_x(G_B)\big]$$
$$+\ \left[\frac{1}{n}\sum_{i=0}^{n-1}\int_{T_A^{-i}F_A \cap G_A} d\mu(x)\,\nu_x(T_B^{-i}F_B)\,\nu_x(G_B) - \mu\nu(F_B\times F_A)\,\mu\nu(G_B\times G_A)\right].$$

The first term is bounded above by

$$\frac{1}{n}\sum_{i=0}^{n-1}\int_{T_A^{-i}F_A \cap G_A} d\mu(x)\,\big|\nu_x(T_B^{-i}F_B \cap G_B) - \nu_x(T_B^{-i}F_B)\,\nu_x(G_B)\big| \le \int d\mu(x)\,\frac{1}{n}\sum_{i=0}^{n-1}\big|\nu_x(T_B^{-i}F_B \cap G_B) - \nu_x(T_B^{-i}F_B)\,\nu_x(G_B)\big|,$$

which goes to zero from the dominated convergence theorem since the integrand converges to zero from the output weakly mixing assumption. The second term can be expressed using the stationarity of the channel as

$$\int_{G_A} d\mu(x)\,\nu_x(G_B)\,\frac{1}{n}\sum_{i=0}^{n-1} 1_{F_A}(T_A^i x)\,\nu_{T_A^i x}(F_B) - \mu\nu(F)\,\mu\nu(G).$$

The ergodic theorem implies that as n → ∞ the sample average goes to its expectation

$$\int d\mu(x)\,1_{F_A}(x)\,\nu_x(F_B) = \mu\nu(F)$$

and hence the above formula converges to 0, proving the lemma. □

The lemma provides an example of a completely random channel that is also
ergodic in the following corollary.
Corollary 9.4.1 Suppose that ν is a stationary completely random channel described by an output measure η . If η is weakly mixing, then ν is ergodic. That
is, if µ is stationary and ergodic and η is stationary and weakly mixing, then
µν = µ × η is stationary and ergodic.
Proof: If η is weakly mixing, then the channel ν deﬁned by νx (F ) = η (F ), all
x ∈ AT , F ∈ BB T is output weakly mixing. Thus ergodicity follows from the
lemma.
□
The idea of a memoryless channel can be extended to a block memoryless
or block independent channel, as described next.

Example 9.4.9: Block Memoryless Channels
Suppose now that we have integers N and K (usually K = N) and a probability measure q_{x^N}(·) on B_B^K for each x^N ∈ A^N such that q_{x^N}(F) is a measurable function of x^N for each F ∈ B_B^K. Let ν be specified by its values on output rectangles by

$$\nu_x(y : y_i \in F_i;\ i = m, \cdots, m+n-1) = \prod_{i=0}^{\lfloor n/K \rfloor} q_{x_{iN}^N}(G_i),$$

where G_i ∈ B_B^K for all i, where ⌊z⌋ is the largest integer contained in z, and where

$$G_i = \mathop{\times}_{j=m+iK}^{m+(i+1)K-1} F_j$$

with F_j = B if j ≥ m + n. Such channels are called block memoryless channels or block independent channels. They are a special case of the following class of channels.

Example 9.4.10: Conditionally Block Independent Channels
A conditionally block independent or CBI channel resembles the block memoryless channel in that for a given input sequence the outputs are block independent.
It is more general, however, in that the conditional probabilities of the output
block may depend on the entire input sequence (or at least on parts of the input
sequence not in the same time block). Thus a channel is CBI if its values on
output rectangles satisfy

$$\nu_x(y : y_i \in F_i;\ i = m, \cdots, m+n-1) = \prod_{i=0}^{\lfloor n/K \rfloor} \nu_x(y : y_{iN}^K \in G_i),$$

where as before

$$G_i = \mathop{\times}_{j=m+iK}^{m+(i+1)K-1} F_j$$

with F_j = B if j ≥ m + n. Block memoryless channels are clearly a special
case of CBI channels. These channels have only ﬁnite output memory, but
unlike the block memoryless channels they need not have ﬁnite input memory
or anticipation.
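A block memoryless channel with K = N can be sketched as follows; the block-dependent flip probability is a hypothetical choice of q_{x^N}, made up for this illustration.

```python
import random

def block_memoryless_channel(x, q_block, N, rng):
    """Each input N-block is passed through q_{x^N} independently of all
    other blocks (illustrative construction with K = N)."""
    y = []
    for i in range(0, len(x) - N + 1, N):
        y.extend(q_block(x[i:i + N], rng))
    return y

def q_block(blk, rng):
    # hypothetical q_{x^N}: noise level depends on the whole input block
    eps = 0.3 if all(blk) else 0.05
    return [b ^ (rng.random() < eps) for b in blk]

rng = random.Random(3)
x = [rng.randint(0, 1) for _ in range(9_000)]
y = block_memoryless_channel(x, q_block, N=3, rng=rng)
assert len(y) == len(x)
```

Because each block is corrupted independently, the construction is (N, N)-stationary but not stationary, which is exactly the defect the stationarized channel of Example 9.4.11 repairs.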
The primary use of block memoryless channels is in the construction of a
channel given ﬁnitedimensional conditional probabilities, that is, one has probabilities for output K tuples given input N tuples and one wishes to model a
channel consistent with these ﬁnitedimensional distributions. The ﬁnite dimensional distributions themselves may be the result of an optimization problem or
an estimate based on observed behavior. An immediate problem is that a channel constructed in this manner may not be stationary, although it is clearly
(N, K )stationary. The next example shows how to modify a block memoryless
channel so as to produce a stationary channel. The basic idea is to occasionally insert some random spacing between the blocks so as to “stationarize” the
channel.
Before turning to the example we ﬁrst develop the technical details required
for producing such random spacing.

Random Punctuation Sequences
We demonstrate that we can obtain a sequence with certain properties by stationary coding of an arbitrary stationary and ergodic process. The lemma is a
variant of a theorem of Shields and Neuhoﬀ [135] as simpliﬁed by Neuhoﬀ and
Gilbert [109] for sliding block codings of ﬁnite alphabet processes. One of the
uses to which the result will be put is the same as theirs: constructing sliding
block codes from block codes.
Lemma 9.4.4 Suppose that {Xn} is a stationary and ergodic process. Then given N and δ > 0 there exists a stationary (or sliding block) coding f: A^T → {0, 1, 2} yielding a ternary process {Zn} with the following properties:

(a) {Zn} is stationary and ergodic.

(b) {Zn} has a ternary alphabet {0, 1, 2} and it can output only N cells of the form 011···1 (0 followed by N − 1 ones) or individual 2's. In particular, each 0 is always followed by exactly N − 1 1's.

(c) For all integers k

$$\frac{1-\delta}{N} \le \Pr(Z_k^N = 011\cdots1) \le \frac{1}{N}$$

and hence for any n

$$\Pr(Z_n \text{ is in an } N\text{-cell}) \ge 1 - \delta.$$

A process {Zn} with these properties is called an (N, δ) random blocking process or punctuation sequence {Zn}.
Proof: A sliding block coding is stationary and hence coding a stationary and ergodic process will yield a stationary and ergodic process (Lemma 9.4.1), which proves the first part. Pick an ε > 0 such that Nε < δ. Given the stationary and ergodic process {Xn} (that is also assumed to be aperiodic in the sense that it does not place all of its probability on a finite set of sequences) we can find an event G ∈ B_A^T having probability less than ε. Consider the event

$$F = G - \bigcup_{i=1}^{N-1} T^{-i}G,$$

that is, F is the collection of sequences x for which x ∈ G, but T^i x ∉ G for i = 1, ···, N − 1. We next develop several properties of this set.
First observe that obviously µ(F) ≤ µ(G) and hence

$$\mu(F) \le \epsilon.$$
The sequence of sets T^{-i}F are disjoint since if y ∈ T^{-i}F, then T^i y ∈ F ⊂ G and T^{i+l} y ∉ G for l = 1, ···, N − 1, which means that T^j y ∉ G and hence T^j y ∉ F for N − 1 ≥ j > i. Lastly we need to show that although F may have small probability, it is not 0. To see this suppose the contrary, that is, suppose that µ(G − ∪_{i=1}^{N-1} T^{-i}G) = 0. Then

$$\mu\Big(G \cap \bigcup_{i=1}^{N-1} T^{-i}G\Big) = \mu(G) - \mu\Big(G \cap \big(\bigcup_{i=1}^{N-1} T^{-i}G\big)^c\Big) = \mu(G)$$

and hence µ(∪_{i=1}^{N-1} T^{-i}G | G) = 1. In words, if G occurs, then it is certain to occur again within the next N shifts. This means that with probability 1 the relative frequency of G in a sequence x must be no less than 1/N since if it ever occurs (which it must with probability 1), it must thereafter occur at least once every N shifts. This is a contradiction, however, since this means from the ergodic theorem that µ(G) ≥ 1/N when it was assumed that µ(G) ≤ ε < 1/N. Thus it must hold that µ(F) > 0.
We now use the rare event F to deﬁne a sliding block code. The general
idea is simple, but a more complicated detail will be required to handle a special case. Given a sequence x, define n(x) to be the smallest i for which T^i x ∈ F;
that is, we look into the future to ﬁnd the next occurrence of F . Since F has
nonzero probability, n(x) will be ﬁnite with probability 1. Intuitively, n(x)
should usually be large since F has small probability. Once F is found, we code
backwards from that point using blocks of a 0 preﬁx followed by N − 1 1’s. The
appropriate symbol is then the output of the sliding block code. More precisely,
if n(x) = kN + l, then the sliding block code prints a 0 if l = 0 and prints a
1 otherwise. This idea suﬃces until the event F actually occurs at the present
time, that is, when n(x) = 0. At this point the sliding block code has just
completed printing an N cell of 0111 · · · 1. It should not automatically start a
new N cell, because at the next shift it will be looking for a new F in the future
to code back from and the new cells may not align with the old cells. Thus
the coder looks into the future for the next F, that is, it again seeks n(x), the smallest i for which T^i x ∈ F. This time n(x) must be greater than or equal to N since x is now in F and the T^{-i}F are disjoint for i = 1, ···, N − 1. After finding n(x) = kN + l, the coder again codes back to the origin of time. If l = 0, then the two codes are aligned and the coder prints a 0 and continues as before. If l ≠ 0, then the two codes are not aligned, that is, the current time is in the middle of a new code word. By construction l ≤ N − 1. In this case the coder prints l 2's (filler symbols) and shifts the input sequence l times. At this point there is an n(x) = kN for some k such that T^{n(x)} x ∈ F and the coding can proceed as
before. Note that k is at least one, that is, there is at least one complete cell
before encountering the new F .
By construction, 2’s can occur only following the event F and then no more
than N 2's can be produced. Thus from the ergodic theorem the relative frequency of 2's (and hence the probability that Z_n is not in an N cell) is no greater than

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_2(Z_0(T^i x)) \le \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_F(T^i x)\,N = N\mu(F) \le N\epsilon \le N\frac{\delta}{N} = \delta, \qquad (9.6)$$

that is,

$$\Pr(Z_n \text{ is in an } N\text{-cell}) \ge 1 - \delta.$$

Since Z_n is stationary by construction, Pr(Z_k^N = 011···1) = Pr(Z_0^N = 011···1) for all k. Thus

$$\Pr(Z_0^N = 011\cdots1) = \frac{1}{N}\sum_{k=0}^{N-1}\Pr(Z_k^N = 011\cdots1).$$
The events {Z_k^N = 011···1}, k = 0, 1, ..., N − 1 are disjoint, however, since there can be at most one 0 in a single block of N symbols. Thus

$$\sum_{k=0}^{N-1}\Pr(Z_k^N = 011\cdots1) = \Pr\Big(\bigcup_{k=0}^{N-1}\{Z_k^N = 011\cdots1\}\Big). \qquad (9.7)$$

Thus since the rightmost probability is between 1 − δ and 1,

$$\frac{1}{N} \ge \Pr(Z_0^N = 011\cdots1) \ge \frac{1-\delta}{N},$$

which completes the proof. □
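A simplified, non-rigorous sketch of the coder in this proof: occurrences of a rare finite pattern stand in for the event F, and between consecutive occurrences the coder codes backwards from the later one in N-cells 011···1, padding any misalignment with 2's. The pattern, the value of N, and the backward-filling shortcut are assumptions of the sketch, not the exact construction of the lemma.

```python
import random

def punctuation_sequence(x, pattern, N):
    """Sketch of the Lemma 9.4.4 coder: between consecutive occurrences of
    the rare pattern (the event "F"), code backwards from the later one in
    N-cells 011...1; a leftover misalignment of l < N slots becomes 2's."""
    L = len(pattern)
    hits = [t for t in range(len(x) - L + 1)
            if tuple(x[t:t + L]) == tuple(pattern)]
    z = []
    for start, end in zip(hits, hits[1:]):
        k, l = divmod(end - start, N)
        z.extend([2] * l)                       # filler where cells misalign
        z.extend(([0] + [1] * (N - 1)) * k)     # complete N-cells
    return z

rng = random.Random(4)
x = [rng.randint(0, 1) for _ in range(50_000)]
z = punctuation_sequence(x, pattern=(1, 1, 1, 1, 1, 1), N=5)

# property (b): every 0 begins a complete N-cell 011...1
assert all(z[i + 1:i + 5] == [1] * 4 for i, s in enumerate(z) if s == 0)
# 2's are rare, so most symbols lie inside N-cells
assert z.count(2) / len(z) < 0.3
```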
The following corollary shows that a ﬁnite length sliding block code can be
used in the lemma.
Corollary 9.4.2 Given the assumptions of the lemma, a ﬁnitewindow sliding
block code exists with properties (a)–(c).
Proof: The sets G and hence also F can be chosen in the proof of the lemma to
be ﬁnite dimensional, that is, to be measurable with respect to σ (X−K , · · · , XK )
for some suﬃciently large K . Choose these sets as before with δ/2 replacing δ .
Deﬁne n(x) as in the proof of the lemma. Since n(x) is ﬁnite with probability
one, there must be an L such that if
$$B_L = \{x : n(x) > L\},$$

then

$$\mu(B_L) < \frac{\delta}{2}.$$

Modify the construction of the lemma so that if n(x) > L, then the sliding block code prints a 2. Thus if there is no occurrence of the desired finite dimensional pattern in a huge bunch of future symbols, a 2 is produced. If n(x) ≤ L, then f is chosen as in the proof of the lemma. The proof now proceeds as in the lemma until (9.6), which is replaced by

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_2(Z_0(T^i x)) \le \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_{B_L}(T^i x) + \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_F(T^i x)\,N \le \delta.$$

The remainder of the proof is the same. □
Application of the lemma to an i.i.d. source and merging the symbols 1 and
2 in the punctuation process immediately yield the following result since coding
an i.i.d. process yields a B process which is therefore mixing.

Corollary 9.4.3 Given an integer N and a δ > 0 there exists an (N, δ) punctuation sequence {Zn} with the following properties:
(a) {Zn } is stationary and mixing (and hence ergodic).
(b) {Zn} has a binary alphabet {0, 1} and it can output only N cells of the form 011···1 (0 followed by N − 1 ones) or individual ones, that is, each zero is always followed by at least N − 1 ones.

(c) For all integers k

$$\frac{1-\delta}{N} \le \Pr(Z_k^N = 011\cdots1) \le \frac{1}{N}$$

and hence for any n

$$\Pr(Z_n \text{ is in an } N\text{-cell}) \ge 1 - \delta.$$

Example 9.4.11: Stationarized Block Memoryless Channel
Intuitively, a stationarized block memoryless (SBM) channel is a block memoryless channel with random spacing inserted between the blocks according to a
random punctuation process. That is, when the random blocking process produces N cells (which is most of the time), the channel uses the N dimensional
conditional distribution. When it is not using an N cell, the channel produces
some arbitrary symbol in its output alphabet. We now make this idea precise.
Let N , K , and qxN (·) be as in the previous example. We now assume that
K = N , that is, one output symbol is produced for every input symbol and
hence output blocks have the same number of symbols as input blocks. Given
δ > 0 let γ denote the distribution of an (N, δ) random blocking sequence {Zn}. Let µ × γ denote the product distribution on (A^T × {0, 1}^T, B_A^T × B_{\{0,1\}}^T); that is,
µ × γ is the distribution of the pair process {Xn , Zn } consisting of the original
source {Xn } and the random blocking source {Zn } with the two sources being
independent of one another. Deﬁne a regular conditional probability (and hence
a channel) πx,z (F ), F ∈ {BB }T , x ∈ AT , z ∈ {0, 1}T by its values on rectangles
as follows: Given z , let J0 (z ) denote the collection of indices i for which zi is
not in an N cell and let J1 (z ) denote those indices i for which zi = 0, that
is, those indices where N cells begin. Let q ∗ denote a trivial probability mass
function on B placing all of its probability on a reference letter b∗ . Given an
output rectangle
$$F = \{y : y_j \in F_j;\ j \in J\} = \mathop{\times}_{j\in J} F_j,$$

define

$$\pi_{x,z}(F) = \prod_{i\in J \cap J_0(z)} q^*(F_i) \prod_{i\in J \cap J_1(z)} q_{x_i^N}\Big(\mathop{\times}_{j=i}^{i+N-1} F_j\Big),$$

where we assume that F_i = B if i ∉ J. Connecting the product source µ × γ
to the channel π yields a hookup process {Xn, Zn, Yn} with distribution, say, r, which in turn induces a distribution p on the pair process {Xn, Yn} having
distribution µ on {Xn }. If the alphabets are standard, p also induces a regular
conditional probability for Y given X and hence a channel ν for which p = µν .
A channel of this form is said to be an (N, δ )stationarized block memoryless or
SBM channel.
Lemma 9.4.5 An SBM channel is stationary and ergodic. Thus if a stationary
(and ergodic) source µ is connected to an SBM channel ν, then the output is stationary (and
ergodic).
Proof: The product source µ × γ is stationary and the channel π is stationary,
hence so is the hookup (µ × γ )π or {Xn , Zn , Yn }. Thus the pair process {Xn , Yn }
must also be stationary as claimed. The product source µ × γ is ergodic from
Corollary 9.4.1 since it can be considered as the input/output process of a
completely random channel described by a mixing (hence also weakly mixing)
output measure. The channel π is output strongly mixing by construction and
hence is ergodic from Lemma 9.4.3. Thus the hookup (µ × γ)π must be ergodic.
This implies that the coordinate process {Xn , Yn } must also be ergodic. This
completes the proof.
□
The block memoryless and SBM channels are principally useful for proving
theorems relating ﬁnitedimensional behavior to sequence behavior and for simulating channels with speciﬁed ﬁnite dimensional behavior. The SBM channels
will also play a key role in deriving sliding block coding theorems from block
coding theorems by replacing the block distributions by trivial distributions,
i.e., by ﬁnitedimensional deterministic mappings or block codes.
The SBM channel was introduced by Pursley and Davisson [29] for finite
alphabet channels and further developed by Gray and Saadat [61], who called it
a randomly blocked conditionally independent (RBCI) channel. We opt for the
alternative name because these channels resemble block memoryless channels
more than CBI channels.
We now consider some examples that provide useful models for realworld
channels.

Example 9.4.12: Primitive Channels
Primitive channels were introduced by Neuhoﬀ and Shields [114],[111] as a physically motivated general channel model. The idea is that most physical channels
combine the input process with a separate noise process that is independent of
the signal and then ﬁlter the combination in a stationary fashion. The noise
is assumed to be i.i.d. since the ﬁltering can introduce dependence. The construction of such channels strongly resembles that of the SBM channels. Let γ
be the distribution of an i.i.d. process {Zn } with alphabet W , let µ × γ denote the product source formed by an independent joining of the original source
distribution µ and the noise process Zn , let π denote the deterministic channel
induced by a stationary sequence coder f: A^T × W^T → B^T mapping an input sequence and a noise sequence into an output sequence. Let r = (µ × γ)π denote the resulting hookup distribution and {Xn, Zn, Yn} denote the resulting
process. Let p denote the induced distribution for the pair process {Xn , Yn }.
If the alphabets are standard, then p and µ together induce a channel νx (F ),
x ∈ AT , F ∈ BB T . A channel of this form is called a primitive channel.
Lemma 9.4.6 A primitive channel is stationary with respect to any stationary source and it is ergodic. Thus if µ is stationary and ergodic and ν is primitive, then µν is stationary and ergodic.

Proof: Since µ is stationary and ergodic and γ is i.i.d. and hence mixing, µ × γ is stationary and ergodic from Corollary 9.4.1. Since the deterministic channel is stationary, it is also ergodic from Lemma 9.4.1 and the resulting triple {X_n, Z_n, Y_n} is stationary and ergodic. This implies that the component process {X_n, Y_n} must also be stationary and ergodic, completing the proof. □

Example 9.4.13: Additive Noise Channels
Suppose that {X_n} is a source with distribution µ and that {W_n} is a “noise” process with distribution γ. Let {X_n, W_n} denote the induced product source, that is, the source with distribution µ × γ, so that the two processes are independent. Suppose that the two processes take values in a common alphabet A and that A has an addition operation +, e.g., it is a semigroup. Define the sliding block code f by f(x, w) = x_0 + w_0 and let f̄ denote the corresponding sequence coder. Then as in the primitive channels we have an induced distribution r on triples {X_n, W_n, Y_n} and hence a distribution on pairs {X_n, Y_n}, which with µ induces a channel ν if the alphabets are standard. A channel of this form is called an additive noise channel or a signal-independent additive noise channel.
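The construction is easy to simulate. The sketch below (illustrative Python, not from the text) draws a noise process i.i.d. and independent of a toy source, applies the per-coordinate map f(x, w) = x_0 + w_0 at every shift, and checks empirically that the output minus the input has the noise marginal.

```python
import random

def additive_noise_channel(source, noise_draw):
    """Apply the sliding block code f(x, w) = x0 + w0 at every time:
    the output at time n is Y_n = X_n + W_n with the noise {W_n}
    drawn independently of the source."""
    return [x + noise_draw() for x in source]

random.seed(0)
# A toy stationary source: i.i.d. uniform on {0, 1, 2}.
xs = [random.randrange(3) for _ in range(10_000)]
# Noise: i.i.d. uniform on {0, 1}, independent of the source.
ys = additive_noise_channel(xs, lambda: random.randrange(2))
# The pair process (X_n, Y_n) determines the channel; here we just
# check that Y_n - X_n has the noise marginal.
diffs = [y - x for x, y in zip(xs, ys)]
freq1 = diffs.count(1) / len(diffs)
print(round(freq1, 2))  # close to 0.5
```

Because the noise here is i.i.d. (a B process), the resulting channel is a special case of a primitive channel, consistent with the remark that follows.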
If the noise process is a B process, then this is easily seen to be a special case of a primitive channel and hence the channel is stationary with respect to any stationary source and ergodic. If the noise is only known to be stationary, the channel is still stationary with respect to any stationary source. Unless the noise is assumed to be at least weakly mixing, however, it is not known if the channel is ergodic in general.

Example 9.4.14: Markov Channels
We now consider a special case where A and B are finite sets with the same number of symbols. For a fixed positive integer K, let P denote the space of all K × K stochastic matrices P = {P(i, j); i, j = 1, 2, …, K}. Using the Euclidean metric on this space we can construct the Borel field P of subsets of P generated by the open sets to form a measurable space (P, P). This, in turn, gives a one-sided or two-sided sequence space (P^T, P^T).

A map φ : A^T → P^T is said to be stationary if φT_A = T_P φ. Given a sequence P ∈ P^T, let M(P) denote the set of all probability measures on (B^T, B^T) with respect to which Y_m, Y_{m+1}, Y_{m+2}, … forms a Markov chain with transition matrices P_m, P_{m+1}, … for any integer m; that is, λ ∈ M(P) if and only if for any m

λ[Y_m = y_m, …, Y_n = y_n] = λ[Y_m = y_m] ∏_{i=m}^{n−1} P_i(y_i, y_{i+1}),   n > m, y_m, …, y_n ∈ B.

In the one-sided case only m = 1 need be verified. Observe that in general the Markov chain is nonhomogeneous.
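The factorization defining M(P) is exactly the rule for sampling a (generally nonhomogeneous) Markov chain: draw Y_m from an initial distribution, then draw each Y_{i+1} from the row of P_i selected by Y_i. A minimal sketch (illustrative Python; the function names are ours, not the text's):

```python
import random

def sample_chain(init, matrices, rng):
    """Sample y_0, ..., y_n from a (generally nonhomogeneous) Markov
    chain: y_0 ~ init, then y_{i+1} ~ P_i(y_i, .) for each matrix P_i."""
    def draw(dist):
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(dist):
            acc += pj
            if u < acc:
                return j
        return len(dist) - 1

    y = [draw(init)]
    for P in matrices:
        y.append(draw(P[y[-1]]))  # row of the time-i matrix selected by y_i
    return y

rng = random.Random(1)
init = [0.5, 0.5]
P0 = [[0.9, 0.1], [0.1, 0.9]]   # "sticky" transition matrix at time 0
P1 = [[0.1, 0.9], [0.9, 0.1]]   # "flipping" transition matrix at time 1
path = sample_chain(init, [P0, P1], rng)
print(path)  # a length-3 path of 0's and 1's
```

Allowing the matrix to change with time is what makes the chain nonhomogeneous, as noted above.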
A channel [A, ν, B] is said to be Markov if there exists a stationary measurable map φ : A^T → P^T such that ν_x ∈ M(φ(x)), x ∈ A^T.
Markov channels were introduced by Kieffer and Rahe [87], who proved that one-sided and two-sided Markov channels are AMS. Their proof is not included as it is lengthy and involves techniques not otherwise used in this book. The channels are introduced for completeness and to show that several important channels and codes in the literature can be considered as special cases. A variety of conditions for ergodicity of Markov channels are considered in [60]. Most are equivalent to one already considered more generally here: a Markov channel is ergodic if it is output mixing.
The most important special cases of Markov channels are finite state channels and codes. Given a Markov channel with stationary mapping φ, the channel is said to be a finite state channel (FSC) if we have a collection of stochastic matrices P_a ∈ P, a ∈ A, and φ(x)_n = P_{x_n}; that is, the matrix produced by φ at time n depends only on the input at that time, x_n. If the matrices P_a, a ∈ A, contain only 0's and 1's, the channel is called a finite state code. There are several equivalent models of finite state channels and we pause to consider an alternative form that is more common in information theory. (See Gallager [43], Ch. 4, for a discussion of equivalent models of FSC's and numerous physical examples.) An FSC converts an input sequence x into an output sequence y and a state sequence s according to a conditional probability

Pr(Y_k = y_k, S_k = s_k; k = m, …, n | X_i = x_i; i ≤ n, S_i = s_i; i < m) = ∏_{i=m}^{n} P(y_i, s_i | x_i, s_{i−1}),

that is, conditioned on X_i, S_{i−1}, the pair Y_i, S_i is independent of all prior inputs, outputs, and states. This specifies an FSC defined as a special case of a Markov channel where the output sequence above is here the joint state-output sequence {y_i, s_i}. Note that with this setup, saying the Markov channel is AMS implies that the triple process of source, states, and outputs is AMS (and hence obviously so is the Gallager input-output process). We will adapt the Kieffer-Rahe viewpoint and call the outputs {Y_n} of the Markov channel states even though they may correspond to state-output pairs for a specific physical model.
In the two-sided case, the Markov channel is significantly more general than the FSC because the choice of matrices φ(x)_i can depend on the past in a very complicated (but stationary) way. One might think that a Markov channel is not a significant generalization of an FSC in the one-sided case, however, because there stationarity of φ does not permit a dependence on past channel inputs, only on future inputs, which might seem physically unrealistic. Many practical communications systems do effectively depend on the future, however, by incorporating delay in the coding. The prime examples of such look-ahead coders are trellis and tree codes used in an incremental fashion. Such codes investigate many possible output strings several steps into the future to determine the possible effect on the receiver and select the best path, often by a Viterbi algorithm. (See, e.g., Viterbi and Omura [147].) The encoder then outputs only the first symbol of the selected path. While clearly a finite state machine, this code does not fit the usual model of a finite state channel or code because of the dependence of the transition matrix on future inputs (unless, of course, one greatly expands the state space). It is, however, a Markov channel.

Example 9.4.15: Cascade Channels
We will often wish to connect more than one channel in cascade in order to form a communication system, e.g., the original source is connected to a deterministic channel (encoder) which is connected to a communications channel which is in turn connected to another deterministic channel (decoder). We now make this idea precise. Suppose that we are given two channels [A, ν^(1), C] and [C, ν^(2), B]. The cascade of ν^(1) and ν^(2) is defined as the channel [A, ν, B] given by

ν_x(F) = ∫_{C^T} ν_u^(2)(F) dν_x^(1)(u).

In other words, if the original source sequence is X, the output of the first channel and input to the second is U, and the output of the second channel is Y, then ν_x^(1)(F) = P_{U|X}(F|x), ν_u^(2)(G) = P_{Y|U}(G|u), and ν_x(G) = P_{Y|X}(G|x). Observe that by construction X → U → Y is a Markov chain.
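In the finite-alphabet memoryless special case, the defining integral reduces to a matrix product of the per-letter transition matrices. A sketch (illustrative only; this per-letter reduction is an assumption made for the example, not the general sequence-space definition):

```python
def cascade(nu1, nu2):
    """Cascade of two per-letter channels given as stochastic matrices:
    nu[x][y] = sum_u nu1[x][u] * nu2[u][y], the finite-alphabet analogue
    of nu_x(F) = integral of nu2_u(F) d nu1_x(u)."""
    return [[sum(row1[u] * nu2[u][y] for u in range(len(nu2)))
             for y in range(len(nu2[0]))]
            for row1 in nu1]

bsc = lambda p: [[1 - p, p], [p, 1 - p]]   # binary symmetric channel
nu = cascade(bsc(0.1), bsc(0.2))
# Two BSCs in cascade form another BSC with crossover
# probability 0.1*0.8 + 0.9*0.2 = 0.26.
print(round(nu[0][1], 10))  # 0.26
```

The rows of the cascade remain stochastic, as they must for it to define a channel.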
Lemma 9.4.7 A cascade of two stationary channels is stationary.

Proof: Let T denote the shift on all of the spaces. Then

ν_x(T^{−1}F) = ∫_{C^T} ν_u^(2)(T^{−1}F) dν_x^(1)(u) = ∫_{C^T} ν_u^(2)(F) dν_x^(1)T^{−1}(u),

where the second equality uses the stationarity of ν^(2) and a change of variables. But ν_x^(1)(T^{−1}F) = ν_{Tx}^(1)(F); that is, the measures ν_x^(1)T^{−1} and ν_{Tx}^(1) are identical, and hence the above integral is

∫_{C^T} ν_u^(2)(F) dν_{Tx}^(1)(u) = ν_{Tx}(F),

proving the lemma. □

Example 9.4.16: Communication System
A communication system consists of a source [A, µ], a sequence encoder f : A^T → B^T (a deterministic channel), a channel [B, ν, B], and a sequence decoder g : B^T → Â^T. The overall distribution r is specified by its values on rectangles as

r(F_1 × F_2 × F_3 × F_4) = ∫_{F_1 ∩ f^{−1}(F_2)} ν_{f(x)}(F_3 ∩ g^{−1}(F_4)) dµ(x).

Denoting the source by {X_n}, the encoded source or channel input process by {U_n}, the channel output process by {Y_n}, and the decoded process by {X̂_n}, then r is the distribution of the process {X_n, U_n, Y_n, X̂_n}. If we let X, U, Y, and X̂ denote the corresponding sequences, then observe that X → U → Y and U → Y → X̂ are Markov chains. We abbreviate a communication system to [µ, f, ν, g].
It is straightforward from Lemma 9.4.7 to show that if the source, channel, and coders are stationary, then so is the overall process.

The following is a basic property of a communication system: if the communication system is stationary, then the mutual information rate between the overall input and output cannot exceed that over the channel. The result is often called the data processing theorem.
Lemma 9.4.8 Suppose that a communication system is stationary in the sense that the process {X_n, U_n, Y_n, X̂_n} is stationary. Then

Ĩ(U; Y) ≥ Ī(X; Y) ≥ Ī(X; X̂).   (9.8)

If {U_n} has a finite alphabet or if it has the K-gap information property (6.14) and I(U^K; Y) < ∞, then

Ī(X; X̂) ≤ Ī(U; Y).
Proof: Since {X̂_n} is a stationary deterministic encoding of the {Y_n},

Ī(X; X̂) ≤ I*(X; Y).

From Theorem 6.4.1 the right hand side is bounded above by Ī(X; Y). For each n

I(X^n; Y^n) ≤ I((X^n, U); Y^n) = I(Y^n; U) + I(X^n; Y^n | U) = I(Y^n; U),

where U = {U_n, n ∈ T} and we have used the fact that X → U → Y is a Markov chain, hence so is X^n → U → Y^n, and hence the conditional mutual information is 0 (Lemma 5.5.2). Thus

Ī(X; Y) ≤ lim_{n→∞} I(Y^n; U) = Ĩ(Y; U).

Applying Theorem 6.4.1 then proves that

Ī(X; X̂) ≤ Ĩ(Y; U).

If {U_n} has finite alphabet or has the K-gap information property and I(U^K; Y) < ∞, then from Theorems 6.4.1 or 6.4.3, respectively, Ĩ(Y; U) = Ī(Y; U), completing the proof. □
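The single-letter, finite-alphabet shadow of the data processing theorem is easy to check numerically: build a Markov chain X → U → Y from two channels and compare mutual informations. The sketch below (illustrative Python; a toy stand-in for the process statement, with names of our own choosing) does exactly that.

```python
from math import log2
from itertools import product

def mutual_information(joint):
    """I(A;B) in bits from a joint pmf given as a dict {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# X uniform on {0,1}; U = X through a BSC(0.1); Y = U through a BSC(0.2).
px = {0: 0.5, 1: 0.5}
bsc = lambda eps: {(a, b): (1 - eps) if a == b else eps
                   for a, b in product((0, 1), repeat=2)}
k1, k2 = bsc(0.1), bsc(0.2)

pxu = {(x, u): px[x] * k1[x, u] for x, u in product((0, 1), repeat=2)}
pxy, puy = {}, {}
for x, u in pxu:
    for y in (0, 1):
        p = pxu[x, u] * k2[u, y]
        pxy[x, y] = pxy.get((x, y), 0.0) + p
        puy[u, y] = puy.get((u, y), 0.0) + p

i_uy = mutual_information(puy)
i_xy = mutual_information(pxy)
print(i_xy <= i_uy + 1e-12)  # True: data processing inequality
```

Here the overall X → Y link is a degraded version of the U → Y link, so the inequality is strict.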
The lemma can be easily extended to block stationary processes.

Corollary 9.4.4 Suppose that the process of the previous lemma is not stationary, but is (N, K)-stationary in the sense that the vector process {X^N_{nN}, U^K_{nK}, Y^K_{nK}, X̂^N_{nN}} is stationary. Then

Ī(X; X̂) ≤ (K/N) Ī(U; Y).

Proof: Apply the previous lemma to the stationary vector sequence to find that

Ī(X^N; X̂^N) ≤ Ī(U^K; Y^K).

But

Ī(X^N; X̂^N) = lim_{n→∞} (1/n) I(X^{nN}; X̂^{nN}),

which is the limit of the expectation of the information densities n^{−1} i_{X^{nN}, X̂^{nN}}, which is N times a subsequence of the densities n^{−1} i_{X^n, X̂^n}, whose expectation converges to Ī(X; X̂). Thus

Ī(X^N; X̂^N) = N Ī(X; X̂).

A similar manipulation for Ī(U^K; Y^K) completes the proof. □

9.5 The Rohlin-Kakutani Theorem

The punctuation sequences of the previous section provide a means for converting a block code into a sliding block code. Suppose, for example, that {X_n}
is a source with alphabet A and γ_N is a block code, γ_N : A^N → B^N. (The dimensions of the input and output vectors are assumed equal to simplify the discussion.) Typically B is binary. As has been argued, block codes are not stationary. One way to stationarize a block code is to use a procedure similar to that used to stationarize a block memoryless channel: send long sequences of blocks with occasional random spacing to make the overall encoded process stationary. Thus, for example, one could use a sliding block code to produce a punctuation sequence {Z_n} as in Corollary 9.4.2 which produces isolated 0's followed by KN 1's and occasionally produces 2's. The sliding block code uses γ_N to encode a sequence of K source blocks X^N_n, X^N_{n+N}, …, X^N_{n+(K−1)N} if and only if Z_n = 0. For those rare times l when Z_l = 2, the sliding block code produces an arbitrary symbol b* ∈ B. The resulting sliding block code inherits many of the properties of the original block code, as will be demonstrated when proving theorems for sliding block codes constructed in this manner. In fact this construction
suffices for source coding theorems, but an additional property will be needed when treating the channel coding theorems. The shortcoming of the results of Lemma 9.4.4 and Corollary 9.4.2 is that important source events can depend on the punctuation sequence. In other words, probabilities can be changed by conditioning on the occurrence of Z_n = 0, the beginning of a block code word. In this section we modify the simple construction of Lemma 9.4.4 to effectively obtain a new punctuation sequence that is approximately independent of certain prespecified events. The result is a variation of the Rohlin-Kakutani theorem of ergodic theory [128], [71]. The development here is patterned after that in Shields [133].
We begin by recasting the punctuation sequence result in different terms. Given a stationary and ergodic source {X_n} with process distribution µ and a punctuation sequence {Z_n} as in Section 9.4, define the set F = {x : Z_N(x) = 0}, where x ∈ A^∞ is a two-sided sequence x = (…, x_{−1}, x_0, x_1, …). Let T denote the shift on this sequence space. Restating Corollary 9.4.2 yields the following.

Lemma 9.5.1 Given δ > 0 and an integer N, there exist an L sufficiently large and a set F of sequences that is measurable with respect to (X_{−L}, …, X_L) with the following properties:

(A) The sets T^i F, i = 0, 1, …, N − 1, are disjoint.

(B) (1 − δ)/N ≤ µ(F) ≤ 1/N.

(C) 1 − δ ≤ µ(∪_{i=0}^{N−1} T^i F).
So far all that has been done is to rephrase the punctuation result in more ergodic theory oriented terminology. One can think of the lemma as representing sequence space as a “base” F together with its disjoint shifts T^i F, i = 1, 2, …, N − 1, which make up most of the space, together with whatever is left over, a set G = (∪_{i=0}^{N−1} T^i F)^c, a set which has probability less than δ and which will be called the “garbage set.” This picture is called a tower. The basic construction is pictured in Figure 9.1.
Next consider a partition P = {P_i; i = 0, 1, …, P − 1} of A^∞ (with the usual abuse of notation, P also denotes the number of atoms). One example would be the partition of a finite alphabet sequence space into its possible outputs at time 0, that is, P_i = {x : x_0 = a_i} for i = 0, 1, …, A − 1. Another partition would be according to the output of a sliding block coding of x. The most important example, however, will be when there is a finite collection of important events that we wish to force to be approximately independent of the punctuation sequence, and P is chosen so that the important events are unions of atoms of P.

We can now state the main result of this section.
[Figure 9.1: Rohlin-Kakutani Tower — the base F at the bottom, its successive shifts TF, T^2F, T^3F, … stacked above it, and the garbage set G to the side.]
Lemma 9.5.2 Given the assumptions of Lemma 9.5.1, L and F can be chosen so that in addition to properties (A)-(C) it is also true that

(D)

µ(P_i | F) = µ(P_i | T^l F); l = 1, 2, …, N − 1,   (9.9)

µ(P_i | F) = µ(P_i | ∪_{k=0}^{N−1} T^k F),   (9.10)

and

µ(P_i ∩ F) ≤ (1/N) µ(P_i).   (9.11)

Comment: Eq. (9.11) can be interpreted as stating that P_i and F are approximately independent since 1/N is approximately the probability of F. Only the upper bound is stated as it is all we need. Eq. (9.9) also implies that µ(P_i ∩ F) is bounded below by (µ(P_i) − δ)µ(F).
Proof: Eq. (9.10) follows from (9.9) since

µ(P_i | ∪_{l=0}^{N−1} T^l F) = µ(P_i ∩ (∪_{l=0}^{N−1} T^l F)) / µ(∪_{l=0}^{N−1} T^l F)
   = ∑_{l=0}^{N−1} µ(P_i ∩ T^l F) / ∑_{l=0}^{N−1} µ(T^l F)
   = ∑_{l=0}^{N−1} µ(P_i | T^l F) µ(T^l F) / (N µ(F))
   = (1/N) ∑_{l=0}^{N−1} µ(P_i | T^l F)
   = µ(P_i | F),

where we have used the disjointness of the T^l F, the equality µ(T^l F) = µ(F), and (9.9). Eq. (9.11) follows from (9.10) since
µ(P_i ∩ F) = µ(P_i | F) µ(F) = µ(P_i | ∪_{k=0}^{N−1} T^k F) µ(F)
   = µ(P_i ∩ (∪_{k=0}^{N−1} T^k F)) µ(F) / µ(∪_{k=0}^{N−1} T^k F)
   = (1/N) µ(P_i ∩ (∪_{k=0}^{N−1} T^k F))
   = (1/N) ∑_{k=0}^{N−1} µ(P_i ∩ T^k F) ≤ (1/N) µ(P_i),

since the T^k F are disjoint and have equal probability. The remainder of this section is devoted to proving (9.9). We begin by reviewing and developing some needed notation.
Given a partition P, we define the label function

label_P(x) = ∑_{i=0}^{P−1} i 1_{P_i}(x),

where as usual 1_P is the indicator function of a set P. Thus the label of a sequence is simply the index of the atom of the partition into which it falls. As P partitions the input space according to which atoms of P sequences belong to, T^{−i}P partitions the space according to which atoms of P the shifted sequences T^i x belong to; that is, x ∈ T^{−i}P_l ∈ T^{−i}P is equivalent to T^i x ∈ P_l and hence label_P(T^i x) = l. The join

P^N = ⋁_{i=0}^{N−1} T^{−i}P

partitions the space into sequences sharing N labels in the following sense: each atom Q of P^N has the form

Q = {x : label_P(x) = k_0, label_P(Tx) = k_1, …, label_P(T^{N−1}x) = k_{N−1}}

for some N-tuple of integers k = (k_0, …, k_{N−1}). For this reason we will index the atoms of P^N as Q_k. Thus P^N breaks up the sequence space into groups of sequences which have the same labels for N shifts.
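For the time-zero partition of a finite-alphabet sequence space, the label of T^i x is just the symbol x_i, so the atom of the join P^N containing x can be computed directly. A small illustrative sketch:

```python
def label(x, i=0):
    """label_P(T^i x) for the time-zero partition of a sequence x:
    the index of the partition atom is simply the symbol at time i."""
    return x[i]

def atom(x, n):
    """The index k of the atom Q_k of the join P^N containing x:
    the N-tuple (label(x), label(Tx), ..., label(T^{N-1} x))."""
    return tuple(label(x, i) for i in range(n))

x = (1, 0, 2, 2, 0, 1)
print(atom(x, 4))  # (1, 0, 2, 2): x and its first 3 shifts carry these labels
```

Two sequences lie in the same atom of P^N exactly when these N-tuples agree, which is the grouping described above.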
We first construct, using Lemma 9.5.1, a huge tower of height KN >> N, where N is the height of the tower to be produced for this lemma. Let S denote the base of this original tower and let ε be the probability of the garbage set. This height-KN tower with base S will be used to construct a new tower of height N and a base F with the additional desired property. First consider the restriction of the partition P^{KN} to S defined by P^{KN}|S = {Q_k ∩ S; all KN-tuples k with coordinates taking values in {0, 1, …, P − 1}}. P^{KN}|S divides up the original base according to the labels of KN shifts of base sequences. For each atom Q_k ∩ S in this base partition, the sets {T^l(Q_k ∩ S); l = 0, 1, …, KN − 1} are disjoint and together form a column of the tower {T^l S; l = 0, 1, …, KN − 1}. A set of the form T^l(Q_k ∩ S) is called the l-th level of the column containing it.
Observe that if y ∈ T^l(Q_k ∩ S), then y = T^l u for some u ∈ Q_k ∩ S and T^l u has label k_l. Thus we consider k_l to be the label of the column level T^l(Q_k ∩ S). This complicated structure of columns and levels can be used to recover the original partition by

P_j = (∪_{l,k: k_l = j} T^l(Q_k ∩ S)) ∪ (P_j ∩ G),   (9.12)

that is, P_j is the union of all column levels with label j together with that part of P_j in the garbage. We will focus on the pieces of P_j in the column levels as the garbage has very small probability.

We wish to construct a new tower with base F so that the probability of P_i for any of the N shifts of F is the same. To do this we form F by dividing each column of the original tower into N equal parts. We collect a group of these parts to form F so that F will contain only one part at each level, the N shifts of F will be disjoint, and the union of the N shifts will almost contain all of the original tower. By using the equal probability parts the new base will have conditional probabilities for P_j given T^l F equal for all l, as will be shown.
original tower. If the source is aperiodic in the sense of placing zero probability
on individual sequences, then the set Q can be divided into N disjoint sets of
equal probability, say W0 , W1 , · · · , WN −1 . Deﬁne the set FQ by
(K −2)N FQ = ( (K −2)N T iN W0 ) ( i=0 (K −2)N T 1+iN W1 ) i=0 T N −1+iN WN −1 ) ···(
i=0 N −1 (K −2)N T l+iN Wl . =
i=0 l=0 FQ contains (K − 2) N shifts of W0 , of T W1 , · · · of T l Wl , · · · and of T N −1 WN −1 .
Because it only takes N shifts of each small set and because it does not include
the top N levels of the original column, shifting FQ fewer than N times causes
no overlap, that is, T l FQ are disjoint for j = 0, 1, · · · , N − 1. The union of these
sets contains all of the original column of the tower except possibly portions of
the top and bottom N − 1 levels (which the construction may not include). The
new base F is now deﬁned to be the union of all of the FQk T S . The sets T l F
are then disjoint (since all the pieces are) and contain all of the levels of the
original tower except possibly the top and bottom N − 1 levels. Thus
(K −1)N −1 N −1 T lF ) ≥ µ( i=N l=0 ≥ (K −1)N −1 T iS) = µ(
K −2 µ(S )
i=N 1−
2
1−
=
−
.
KN
N
KN By choosing = δ/2 and K large this can be made larger than 1 − δ . Thus the
new tower satisﬁes conditions (A)(C) and we need only verify the new condition 192 CHAPTER 9. CHANNELS AND CODES (D), that is, (9.9). We have that
(D), that is, (9.9). We have that

µ(P_i | T^l F) = µ(P_i ∩ T^l F) / µ(F),

since µ(T^l F) = µ(F). Since the denominator does not depend on l, we need only show the numerator does not depend on l. From (9.12) applied to the original tower we have that

µ(P_i ∩ T^l F) = ∑_{j,k: k_j = i} µ(T^j(Q_k ∩ S) ∩ T^l F),

that is, the sum over all column levels (old tower) labeled i of the probability of the intersection of the column level and the l-th shift of the new base F. The intersection of a column level in the j-th level of the original tower with any shift of F must be an intersection of that column level with the j-th shift of one of the sets W_0, …, W_{N−1} (which particular set depends on l). Whichever set is chosen, however, the probability within the sum has the form

µ(T^j(Q_k ∩ S) ∩ T^l F) = µ(T^j(Q_k ∩ S) ∩ T^j W_m) = µ((Q_k ∩ S) ∩ W_m) = µ(W_m),

where the final step follows since W_m was originally chosen as a subset of Q_k ∩ S. Since these subsets were all chosen to have equal probability, this last probability does not depend on m and hence not on l, and

µ(T^j(Q_k ∩ S) ∩ T^l F) = (1/N) µ(Q_k ∩ S),

and hence

µ(P_i ∩ T^l F) = ∑_{j,k: k_j = i} (1/N) µ(Q_k ∩ S),

which proves (9.9) since there is no dependence on l. This completes the proof of the lemma. □

Chapter 10

Distortion

10.1 Introduction

We now turn to quantification of various notions of the distortion between random variables, vectors, and processes. A distortion measure is not a “measure”
10.1 Introduction We now turn to quantiﬁcation of various notions of the distortion between random variables, vectors and processes. A distortion measure is not a “measure”
in the sense used so far; it is an assignment of a nonnegative real number which
indicates how bad an approximation one symbol or random object is of another;
the smaller the distortion, the better the approximation. If the two objects correspond to the input and output of a communication system, then the distortion
provides a measure of the performance of the system. Distortion measures need
not have metric properties such as the triangle inequality and symmetry, but
such properties can be exploited when available. We shall encounter several
notions of distortion and a diversity of applications, with eventually the most
important application being a measure of the performance of a communications system by an average distortion between the input and output. Other
applications include extensions of ﬁnite memory channels to channels which approximate ﬁnite memory channels and diﬀerent characterizations of the optimal
performance of communications systems. 10.2 Distortion and Fidelity Criteria Given two measurable spaces (A, BA ) and (B, BB ), a distortion measure on
A × B is a nonnegative measurable mapping ρ : A × B → [0, ∞) which assigns a real number ρ(x, y) to each x ∈ A and y ∈ B, which can be thought of as the cost of reproducing x as y. The principal practical goal is to have a number by which the goodness or badness of communication systems can be compared. For example, if the input to a communication system is a random variable X ∈ A and the output is Y ∈ B, then one possible measure of the quality of the system is the average distortion Eρ(X, Y). Ideally one would like a distortion measure to have three properties:
• It should be tractable so that one can do useful theory.
• It should be computable so that it can be measured in real systems.
• It should be subjectively meaningful in the sense that small (large) distortion corresponds to good (bad) perceived quality.

Unfortunately these requirements are often inconsistent and one is forced
to compromise between tractability and subjective signiﬁcance in the choice of
distortion measures. Among the most popular choices for distortion measures
are metrics or distances, but many practically important distortion measures
are not metrics, e.g., they are not symmetric in their arguments or they do not
satisfy a triangle inequality. An example of a metric distortion measure that
will often be emphasized is that given when the input space A is a Polish space,
a complete separable metric space under a metric ρ, and B is either A itself
or a Borel subset of A. In this case the distortion measure is fundamental to
the structure of the alphabet and the alphabets are standard since the space is
Polish.
Suppose next that we have a sequence of product spaces A^n and B^n for n = 1, 2, …. A fidelity criterion ρ_n, n = 1, 2, …, is a sequence of distortion measures on A^n × B^n. If one has a pair random process, say {X_n, Y_n}, then it will be of interest to find conditions under which there is a limiting per symbol distortion in the sense that

ρ_∞(x, y) = lim_{n→∞} (1/n) ρ_n(x^n, y^n)

exists. As one might guess, the distortion measures in the sequence often are interrelated. The simplest and most common example is that of an additive or single-letter fidelity criterion, which has the form

ρ_n(x^n, y^n) = ∑_{i=0}^{n−1} ρ_1(x_i, y_i).

Here if the pair process is AMS, then the limiting distortion will exist and it is invariant from the ergodic theorem. By far the bulk of the information theory literature considers only single-letter fidelity criteria and we will share this emphasis. We will point out, however, other examples where the basic methods and results apply. For example, if ρ_n is subadditive in the sense that

ρ_n(x^n, y^n) ≤ ρ_k(x^k, y^k) + ρ_{n−k}(x_k^{n−k}, y_k^{n−k}),

then stationarity of the pair process will ensure that n^{−1} ρ_n converges from the subadditive ergodic theorem. For example, if d is a distortion measure on A × B, then

ρ_n(x^n, y^n) = (∑_{i=0}^{n−1} d(x_i, y_i)^p)^{1/p}

for p > 1 is subadditive from Minkowski's inequality.
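Both the additive criterion and the p-th power criterion are easy to evaluate, and the subadditivity claimed above can be spot-checked numerically. A sketch (illustrative Python, using the Hamming distortion as the per-letter distortion):

```python
def rho_additive(xs, ys, rho1):
    """Additive fidelity criterion: rho_n(x^n, y^n) = sum_i rho1(x_i, y_i)."""
    return sum(rho1(x, y) for x, y in zip(xs, ys))

def rho_pnorm(xs, ys, d, p):
    """rho_n(x^n, y^n) = (sum_i d(x_i, y_i)^p)^(1/p), subadditive for p >= 1."""
    return sum(d(x, y) ** p for x, y in zip(xs, ys)) ** (1.0 / p)

hamming = lambda a, b: 0 if a == b else 1
xs = [0, 1, 1, 0, 1, 0]
ys = [0, 1, 0, 0, 1, 1]
print(rho_additive(xs, ys, hamming))      # 2
# Subadditivity check for the p-norm criterion, splitting at k = 3:
whole = rho_pnorm(xs, ys, hamming, 2)
split = rho_pnorm(xs[:3], ys[:3], hamming, 2) + rho_pnorm(xs[3:], ys[3:], hamming, 2)
print(whole <= split + 1e-12)             # True, as Minkowski's inequality predicts
```

Dividing either quantity by n gives the per-symbol distortion whose limit is discussed above.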
As an even simpler example, if d is a distortion measure on A × B, then the following fidelity criterion converges for AMS pair processes:

(1/n) ρ_n(x^n, y^n) = (1/n) ∑_{i=0}^{n−1} f(d(x_i, y_i)).

This form often arises in the literature with d being a metric and f being a nonnegative nondecreasing function (sometimes assumed convex).
The fidelity criteria introduced here are all context-free in that the distortion between n successive input/output samples of a pair process does not depend on samples occurring before or after these n samples. Some work has been done on context-dependent distortion measures (see, e.g., [94]), but we do not consider their importance sufficient to merit the increased notational and technical difficulties involved. Hence we shall consider only context-free distortion measures.

10.3 Performance

As a first application of the notion of distortion, we define a performance measure of a communication system. Suppose that we have a communication system
[µ, f, ν, g] such that the overall input/output process is {X_n, X̂_n}. For the moment let p denote the corresponding distribution. Then one measure of the quality (or rather the lack thereof) of the communication system is the long term time average distortion per symbol between the input and output as determined by the fidelity criterion. Given two sequences x and y and a fidelity criterion ρ_n, n = 1, 2, …, define the limiting sample average distortion or sequence distortion by

ρ_∞(x, y) = limsup_{n→∞} (1/n) ρ_n(x^n, y^n).

Define the performance of a communication system by the expected value of the limiting sample average distortion:

Δ(µ, f, ν, g) = E_p ρ_∞ = E_p limsup_{n→∞} (1/n) ρ_n(X^n, X̂^n).   (10.1)
n (10.1) We will focus on two important special cases. The ﬁrst is that of AMS systems and additive ﬁdelity criteria. A large majority of the information theory
literature is devoted to additive distortion measures and this bias is reﬂected
here. We also consider the case of subadditive distortion measures and systems
that are either twosided and AMS or are onesided and stationary. Unhappily
the overall AMS onesided case cannot be handled as there is not yet a subadditive ergodic theorem for that case. In all of these cases we have that if ρ1 is 196 CHAPTER 10. DISTORTION integrable with respect to the stationary mean process p, then
¯
ρ∞ (x, y ) = lim n→∞ 1
ρn (xn , y n ); p − a.e.,
n (10.2) and ρ∞ is an invariant function of its two arguments, i.e.,
ρ∞ (TA x, TA y ) = ρ∞ (x, y ); p − a.e..
ˆ (10.3) When a system and ﬁdelity criterion are such that (10.2) and (10.3) are
satisﬁed we say that we have a convergent ﬁdelity criterion. We henceforth
make this assumption.
Since ρ∞ is invariant, we have from Lemma 6.3.1 of [50] that
∆ = Ep ρ∞ = Ep ρ∞ .
¯ (10.4) If the ﬁdelity criterion is additive, then we have from the stationarity of p
¯
that the performance is given by
∆ = Ep ρ1 (X0 , Y0 ).
¯ (10.5) If the ﬁdelity criterion is subadditive, then this is replaced by
∆ = inf
N 1
Ep ρN (X N , Y N ).
¯
N (10.6) Assume for the remainder of this section that ρn is an additive ﬁdelity criterion. Suppose now that we now that p is N stationary; that is, if T = TA × TA
ˆ
ˆ
denotes the shift on the input/output space AT × AT , then the overall process
is stationary with respect to T N . In this case
∆= 1
ˆ
EρN (XN , XN ).
N (10.7) We will have this N stationarity, for example, if the source and channel are
stationary and the coders are N stationary, e.g., are length N block codes. More
generally, the source could be N stationary, the ﬁrst sequence coder (N, K )stationary, the channel K stationary (e.g., stationary), and the second sequence
coder (K, N )stationary.
We can also consider the behavior of the N-shift more generally when the system is only AMS. This will be useful when considering block codes. Suppose now that p is AMS with stationary mean p̄. Then from Theorem 7.3.1 of [50], p is also T^N-AMS with an N-stationary mean, say p̄_N. Applying the ergodic theorem to the N-shift then implies that if ρ_N is p̄_N-integrable, then

lim_{n→∞} (1/n) ∑_{i=0}^{n−1} ρ_N(x^N_{iN}, y^N_{iN}) = ρ^{(N)}_∞   (10.8)

exists p̄_N (and hence also p) almost everywhere. In addition, ρ^{(N)}_∞ is N-invariant and

E_{p̄} ρ^{(N)}_∞ = E_{p̄_N} ρ^{(N)}_∞ = E_{p̄_N} ρ_N(X^N, Y^N).   (10.9)

Comparison of (10.2) and (10.9) shows that ρ^{(N)}_∞ = N ρ_∞ p̄-a.e. and hence

(1/N) E_{p̄_N} ρ_N(X^N, Y^N) = (1/N) E_{p̄} ρ^{(N)}_∞ = E_{p̄} ρ_1(X_0, Y_0).   (10.10)
Given a notion of the performance of a communication system, we can now define the optimal performance achievable for trying to communicate a given source {X_n} with distribution µ over a channel ν. Suppose that E is some class of sequence coders f : A^T → B^T. For example, E might consist of all sequence coders generated by block codes with some constraint or by finite-length sliding block codes. Similarly let D denote a class of sequence coders g : B^T → Â^T. Define the optimal performance theoretically achievable or OPTA function for the source µ, channel ν, and code classes E and D by

Δ*(µ, ν, E, D) = inf_{f∈E, g∈D} Δ(µ, f, ν, g).   (10.11)

The goal of the coding theorems of information theory is to relate the OPTA function to (hopefully) computable functions of the source and channel.

10.4 The rho-bar distortion

In the previous sections it was pointed out that if one has a distortion measure
ρ on two random objects X and Y and a joint distribution on the two random objects (and hence also marginal distributions for each), then a natural notion of the difference between the processes, or the poorness of their mutual approximation, is the expected distortion Eρ(X, Y). We now consider a different question: what if one does not have a joint probabilistic description of X and Y, but instead knows only their marginal distributions? What then is a natural notion of the distortion or poorness of approximation of the two random objects? In other words, we previously measured the distortion between two random variables whose stochastic connection was determined, possibly by a channel, a code, or a communication system. We now wish to find a similar quantity for the case when the two random objects are only described as individuals. One possible definition is to find the smallest possible distortion in the old sense consistent with the given information, that is, to minimize Eρ(X, Y) over all joint distributions consistent with the given marginal distributions. Note that this will necessarily give a lower bound to the distortion achievable when any specific joint distribution is specified.
To be precise, suppose that we have random variables X and Y with distributions PX and PY and alphabets A and B , respectively. Let ρ be a distortion
measure on A × B . Deﬁne the ρdistortion (pronounced ρbar) between the
¯
random variables X and Y by
ρ(PX , PY ) = inf Ep ρ(X, Y ),
¯
p∈P Where P = P (PX , PY ) is the collection of all measures on (A × B, BA × BB )
with PX and PY as marginals; that is,
p(A × F ) = PY (F ); F ∈ BB , 198 CHAPTER 10. DISTORTION and
p(G × B ) = PX (G); G ∈ BA .
Note that P is not empty since, for example, it contains the product measure
PX × PY .
Distortion measures of this type have a long history and have been rediscovered many times (see, e.g., R¨schendorf [129]). The original idea was developed
u
for compact metric spaces by Kantorovich [72] and formed a fundamental part
of the origins of linear programming. Levenshtein [95] and Vasershtein [146]
studied this quantity for the special case where A and B are the real line and ρ
is the Euclidean distance.
When as in these cases the distortion is a metric or distance, the ρdistortion
¯
is called the ρdistance. Ornstein [117] developed the distance and many of its
¯
properties for the special case where A and B were common discrete spaces and
¯
ρ was the Hamming distance. In this case the ρdistance is called the ddistance.
¯
R. L. Dobrushin suggested that because of the common suﬃx in the names of
its originators, this distance between distributions should be called the shtein
or stein distance.
The ρdistortion can be extended to processes in a natural way. Suppose
¯
now that {Xn } is a process with process distribution mX and that {Yn } is a
process with process distribution mY . Let PX n and PY n denote the induced
ﬁnite dimensional distributions. A ﬁdelity criterion provides the distortion ρn
between these n dimensional alphabets. Let ρn denote the corresponding ρ
¯
¯
distortion between the n dimensional distributions. Then
ρ(mX , mY ) = sup
¯
n 1
ρn (PX n , PY n );
¯
n that is, the ρdistortion between two processes is the maximum of the ρdistortions
¯
¯
per symbol between ntuples drawn from the process. The properties of the ρ
¯
distance are developed in [57] [120] and a detailed development may be found
in [50] . The following theorem summarizes the principal properties.
Theorem 10.4.1 Suppose that we are given an additive ﬁdelity criterion ρn
with a pseudometric perletter distortion ρ1 and suppose that both distributions
mX and mY are stationary and have the same standard alphabet. Then
(a) limn→∞ n−1 ρn (PX n , PY n ) exists and equals supn n−1 ρn (PX n , PY n ).
¯
¯
(b) ρn and ρ are pseudometrics. If ρ1 is a metric, then ρn and ρ are metrics.
¯
¯
¯
¯
(c) If mX and mY are both i.i.d., then ρ(mX , mY ) = ρ1 (PX0 , PY0 ).
¯
¯
(d) Let Ps = Ps (mX , mY ) denote the collection of all stationary distributions
pXY having mX and mY as marginals, that is, distributions on {Xn , Yn }
with coordinate processes {Xn } and {Yn } having the given distributions.
Deﬁne the process distortion measure ρ
¯
ρ (mX , mY ) =
¯ inf pXY ∈Ps EpXY ρ(X0 , Y0 ). 10.5. DBAR CONTINUOUS CHANNELS 199 Then
ρ(mX , mY ) = ρ (mX , mY );
¯
¯
that is, the limit of the ﬁnite dimensional minimizations is given by a
minimization over stationary processes.
(e) Suppose that mX and mY are both stationary and ergodic. Deﬁne Pe =
Pe (mX , mY ) as the subset of Ps containing only ergodic processes, then
ρ(mX , mY ) =
¯ inf pXY ∈Pe EpXY ρ(X0 , Y0 ), (f ) Suppose that mX and mY are both stationary and ergodic. Let GX denote a
collection of generic sequences for mX in the sense of Section 8.3 of [50].
Generic sequences are those along which the relative frequencies of a set of
generating events all converge and hence by measuring relative frequencies
on generic sequences one can deduce the underlying stationary and ergodic
measure that produced the sequence. An AMS process produces generic
sequences with probability 1. Similarly let GY denote a set of generic
sequences for mY . Deﬁne the process distortion measure
ρ (mX , mY ) =
¯ inf x∈GX ,y ∈GY lim sup
n→∞ 1
n n−1 ρ1 (x0 , y0 ).
i=0 Then
ρ(mX , mY ) = ρ (mX , mY );
¯
¯
that is, the ρ distance gives the minimum long term time average distortion
¯
obtainable between generic sequences from the two sources.
(g) The inﬁma deﬁning ρn and ρ are actually minima.
¯
¯ 10.5 dbar Continuous Channels We can now generalize some of the notions of channels by using the ρdistance
¯
to weaken the deﬁnitions. The ﬁrst deﬁnition is the most important for channel coding applications. We now conﬁne interest to the dbar distance, the
ρdistance for the special case of the Hamming distance:
ρ1 (x, y ) = d1 (x, y ) = 0 if x = y
1 if x = y. n
Suppose that [A, ν, B ] is a discrete alphabet channel and let νx denote the
n
restriction of the channel to B , that is, the output distribution on Y n given
¯
an input sequence x. The channel is said to be dcontinuous if for any > 0
¯n (ν n , ν n ) ≤ whenever xi = x i for
there is an n0 such that for all n > n0 d x x
¯
i = 0, 1, · · · , n. Alternatively, ν is dcontinuous if lim sup sup sup n→∞ an ∈An x,x ∈c(an ) ¯nn
dn (νx , νx ) = 0, 200 CHAPTER 10. DISTORTION where c(an ) is the rectangle deﬁned as all x with xi = ai ; i = 0, 1, · · · , n − 1.
¯
dcontinuity implies the distributions on output ntuples Y n given two input
sequences are very close provided that the input sequences are identical over the
same time period and that n is large. This generalizes the notions of 0 or ﬁnite
input memory and anticipation since the distributions need only approximate
each other and do not have to be exactly the same.
More generally we could consider ρcontinuous channels in a similar manner,
¯
¯
but we will focus on the simpler discrete dcontinuous channel.
¯continuous channels possess continuity properties that will be useful for
d
proving block and sliding block coding theorems. They are “continuous” in the
sense that knowing the input with suﬃciently high probability for a suﬃciently
long time also speciﬁes the output with high probability. The following two
lemmas make these ideas precise.
Lemma 10.5.1 Suppose that x, x ∈ c(an ) and
¯
¯n n
d(νx , νx ) ≤ δ 2 .
¯
¯
This is the case, for example, if the channel is d continuous and n is chosen
suﬃciently large. Then
n
n
νx (Gδ ) ≥ νx (G) − δ
¯
and hence
inf x∈c(an ) n
n
νx (Gδ ) ≥ sup νx (G) − δ.
x∈c(an ) ¯
Proof: From Theorem 10.4.1 the inﬁma deﬁning the d distance are actually
minima and hence there is a pmf p on B n × B n such that
n
p(y n , bn ) = νx (y n )
bn ∈B n and
n
p(bn , y n ) = νx (y n );
¯
bn ∈ B n
n
n
that is, p has νx and νx as marginals, and
¯ 1
¯n n
¯
Ep dn (Y n , Y n ) = d(νx , νx ).
¯
n
Using the Markov inequality we can write
n
νx (Gδ ) =
=
≥
≥ ¯
¯
p(Y n ∈ Gδ ) ≥ p(Y n ∈ G and dn (Y n , Y n ) ≤ nδ )
¯
¯
1 − p(Y n ∈ G or dn (Y n , Y n ) > nδ )
¯
¯
1 − p(Y n ∈ G) − p(dn (Y n , Y n ) > nδ )
1
n
n
¯
νx (G) − E (n−1 dn (Y n , Y n )) ≥ νx (G) − δ
¯
¯
δ proving the ﬁrst statement. The second statement follows from the ﬁrst. 2 10.5. DBAR CONTINUOUS CHANNELS 201 Next suppose that [G, µ, U ] is a stationary source, f is a stationary encoder
which could correspond to a ﬁnite length sliding block encoder or to an inﬁnite
length one, ν is a stationary channel, and g is a length m sliding block decoder.
The probability of error for the resulting hookup is deﬁned by
ˆ
Pe (µ, ν, f, g ) = Pr(U0 = U0 ) = µν (E ) = dµ(u)νf (u) (Eu ), where E is the error event {u, y : u0 = gm (Y− q m )} and Eu = {y : (u, y ) ∈ E } is
the section of E at u.
Lemma 10.5.2 Given a stationary channel ν , a stationary source [G, µ, U ], a
length m sliding block decoder, and two encoders f and φ, then for any positive
integer r
Pe (µ, ν, f, g ) − Pe (µ, ν, φ, g ) ≤
m
¯rr
+ r Pr(f = φ) + m maxr sup dr (νx , νx ).
ar ∈A x,x ∈c(ar )
r
Proof: Deﬁne Λ = {u : f (u) = φ(u)} and
r −1 Λr = {u : f (T i u) = φ(T i u); i = 0, 1 · · · , r − 1} = T i Λ.
i=0 From the union bound
µ(Λc ) ≤ rµ(Λc ) = rPr(f = φ).
r (10.12) m
From stationarity, if g = gm (Y−q ) then m
dµ(u)νf (u) (y : gm (y−q ) = u0 ) Pe (µ, ν, f, g ) = = ≤ m1
+
r
r 1
r r −1
m
dµ(u)νf (u) (y : gm (yi−q ) = u0 ) i=0
r −q
r
m
dµ(u)νf (u) (y r : gm (yi−q ) = ui ) + µ(Λc ).
r
i=q ¯r
Fix u ∈ Λr and let pu yield dr (νf (u),φ(u) ); that is,
r
r
r
r
y r pu (y , w ) = νφ(u) (w ), and
1
r (10.13) Λr wr r
pu (y r , wr ) = νf (u) (y r ), r −1 ¯r
pu (y r , wr : yi = wi ) = dr (νf (u),φ(u) ).
i=0 (10.14) 202 CHAPTER 10. DISTORTION We have that
1
r r −q
r
m
νf (u) (y r : gm (yi−q ) = ui )
i=q = ≤ ≤ ≤ 1
r
1
r
1
r
1
r r −q
m
pu (y r , wr : gm (yi−q ) = ui )
i=q
r −q
m
m
pu (y r , wr : gm (yi−q ) = wi−q ) +
i=q 1
r r −q
m
pu (y r , wr : gm (wi−q ) = ui )
i=q r −q
r
r
pu (y r , wr : yi−q = wi−q ) + Pe (µ, ν, φ, g )
i=q
r −q i−q +m pu (y r , wr : yj = wj ) + Pe (µ, ν, φ, g )
i=q j =i−q r
¯r
≤ mdr (νf (u) , νφ(u) ) + Pe (µ, ν, φ, g ), which with (10.12)(10.14) proves the lemma.
2
The following corollary states that the probability of error using sliding block
¯
codes over a dcontinuous channel is a continuous function of the encoder as
measured by the metric on encoders given by the probability of disagreement of
the outputs of two encoders.
¯
Corollary 10.5.1 Given a stationary dcontinuous channel ν and a ﬁnite length
decoder gm : B m → A, then given > 0 there is a δ > 0 so that if f and φ are
two stationary encoders such that Pr(f = g ) ≤ δ , then
Pe (µ, ν, f, g ) − Pe (µ, ν, φ, g ) ≤ .
Proof: Fix > 0 and choose r so large that
max
r
a sup
x,x ∈c(ar ) ¯rr
dr (νx , νx ) ≤
m
r ≤ 3m
3 , and choose δ = /(3r). Then Lemma 10.5.2 implies that
Pe (µ, ν, f, g ) − Pe (µ, ν, φ, g ) ≤ .
2
Given an arbitrary channel [A, ν, B ], we can deﬁne for any block length
N a closely related CBI channel [A, ν , B ] as the CBI channel with the same
˜
probabilities on output N blocks, that is, the same conditional probabilities for
N
YkN given x, but having conditionally independent blocks. We shall call ν the
˜
N CBI approximation to ν . A channel ν is said to be conditionally almost block 10.6. THE DISTORTIONRATE FUNCTION 203 independent or CABI if given there is an N0 such that for any N ≥ N0 there
is an M0 such that for any x and any N CBI approximation ν to ν
˜
¯ νM M
d(˜x , νx ) ≤ , all M ≥ M0 ,
M
N
where νx denotes the restriction of νx to BB , that is, the output distribution on
N
Y given x. A CABI channel is one such that the output distribution is close (in
¯
a d sense) to that of the N CBI approximation provided that N is big enough.
CABI channels were introduced by Neuhoﬀ and Shields [111] who provided
several examples alternative characterizations of the class. In particular they
¯
showed that ﬁnite memory channels are both dcontinuous and CABI. Their
¯
principal result, however, requires the notion of the d distance between channels.
¯
Given two channels [A, ν, B ] and [A, ν , B ], deﬁne the d distance between the
channels to be
¯
¯n N
d(ν, ν ) = lim sup sup d(νx , ν x ).
n→∞ x Neuhoﬀ and Shields [111] showed that the class of CABI channels is exactly
¯
the class of primitive channels together with the d limits of such channels. 10.6 The DistortionRate Function We close this chapter on distortion, approximation, and performance with the
introduction and discussion of Shannon’s distortionrate function. This function
(or functional) of the source and distortion measure will play a fundamental role
in evaluating the OPTA functions. In fact, it can be considered as a form of
information theoretic OPTA. Suppose now that we are given a source [A, µ]
ˆ
ˆ
and a ﬁdelity criterion ρn ; n = 1, 2, · · · deﬁned on A × A, where A is called
the reproduction alphabet. Then the Shannon distortion rate function (DRF) is
deﬁned in terms of a nonnegative parameter called rate by
D(R, µ) = lim sup
N →∞ 1
DN (R, µN )
N where
DN (R, µN ) = inf pN ∈RN (R,µN ) EpN ρN (X N , Y N ), where RN (R, µN ) is the collection of all distributions pN for the coordinate
N
N
ˆ
random vectors X N and Y N on the space (AN × AN , BA × BA ) with the
ˆ
properties that
ˆ
(1) pN induces the given marginal µN ; that is, pN (AN × F ) = µN (F ) for all
N
F ∈ BA , and
(2) the mutual information satisﬁes
1
ˆ
I N (X N ; X N ) ≤ R.
Np 204 CHAPTER 10. DISTORTION If RN (R, µN ) is empty, then DN (R, µN ) is ∞. DN is called the Nth order
distortionrate function.
Lemma 10.6.1 DN (R, µ) and D(R, µ) are nonnegative convex
R and hence are continuous in R for R > 0. functions of Proof: Nonnegativity is obvious from the nonnegativity of distortion. Suppose
that pi ∈ RN (Ri , µN ); i = 1, 2 yields
Epi ρN (X N , Y N ) ≤ DN (Ri , µ) + .
From Corollary 5.5.5 mutual information is a convex
function of the conditional distribution and hence if p = λp1 + (1 − λ)p2 , then
¯
Ip ≤ λIp1 + (1 − λ)Ip2 ≤ λR1 + (1 − λ)R2
¯
and hence p ∈ RN (λR1 + (1 − λ)R2 ) and therefore
¯
DN (λR1 + (1 − λ)R2 ) ≤ Ep ρN (X N , Y N )
¯ = λEp1 ρN (X N , Y N ) + (1 − λ)Ep2 ρN (X N , Y N ) ≤ λDN (R1 , µ) + (1 − λ)DN (R2 , µ). Since D(R, µ) is the limit of DN (R, µ), it too is convex. It is well known from
real analysis that convex functions are continuous except possibly at their end
points.
2
The following lemma shows that when the underlying source is stationary
and the ﬁdelity criterion is subadditive (e.g., additive), then the limit deﬁning
D(R, µ) is an inﬁmum.
Lemma 10.6.2 If the source µ is stationary and the ﬁdelity criterion is subadditive, then
1
D(R, µ) = lim DN (R, µ) = inf DN (R, µ).
N →∞
NN
Proof: Fix N and n < N and let pn ∈ Rn (R, µn ) yield
Epn ρn (X n , Y n ) ≤ Dn (R, µn ) + 2 and let pN −n ∈ RN −n (R, µN −n ) yield
EpN −n ρN −n (X N −n , Y N −n ) ≤ DN −n (R, µN −n ) + .
2
n
pn together with µn implies a regular conditional probability q (F xn ), F ∈ BA .
ˆ
Similarly pN −n and µN −n imply a regular conditional probability r(GxN −n ).
Deﬁne now a regular conditional probability t(·xN ) by its values on rectangles
as
N
n
t(F × GxN ) = q (F xn )r(GxN −n ); F ∈ BA , G ∈ BA −n .
ˆ
n
ˆ 10.6. THE DISTORTIONRATE FUNCTION 205 Note that this is the ﬁnite dimensional analog of a block memoryless channel
with two blocks. Let pN = µN t be the distribution induced by µ and t. Then
exactly as in Lemma 9.4.2 we have because of the conditional independence that
N
N
IpN (X N ; Y N ) ≤ IpN (X n ; Y n ) + IpN (Xn −n ; Yn −n ) and hence from stationarity
IpN (X N ; Y N ) ≤ Ipn (X n ; Y n ) + IpN −n (X N −n ; Y N −n )
≤ nR + (N − n)R = N R
N so that p N ∈ RN (R, µ ). Thus DN (R, µN ) ≤ EpN ρN (X N , Y N )
N
N
≤ EpN ρn (X n , Y n ) + ρN −n (Xn −n , Yn −n ) = Epn ρn (X n , Y n ) + EpN −n ρN −n (X N −n , Y N −n )
≤ Dn (R, µn ) + DN −n (R, µN −n ) + .
Thus since is arbitrary we have shown that if dn = Dn (R, µn ), then
dN ≤ dn + dN −n ; n ≤ N ; that is, the sequence dn is subadditive. The lemma then follows immediately
from Lemma 7.5.1 of [50].
2
As with the ρ distance, there are alternative characterizations of the distortion¯
rate function when the process is stationary. The remainder of this section is
devoted to developing these results. The idea of an SBM channel will play
an important role in relating nth order distortionrate functions to the process
deﬁnitions. We henceforth assume that the input source µ is stationary and
we conﬁne interest to additive ﬁdelity criteria based on a perletter distortion
ρ = ρ1 .
The basic process DRF is deﬁned by
¯
Ds (R, µ) = inf ¯
p∈Rs (R,µ) Ep ρ(X0 , Y0 ), ¯
where Rs (R, µ) is the collection of all stationary processes p having µ as an
¯
¯
input distribution and having mutual information rate Ip = Ip (X ; Y ) ≤ R. The
original idea of a process ratedistortion function was due to Kolmogorov and
his colleagues [88] [45] (see also [23]). The idea was later elaborated by Marton
[102] and Gray, Neuhoﬀ, and Omura [55].
Recalling that the L1 ergodic theorem for information density holds when
¯p = I ∗ ; that is, the two principal deﬁnitions of mutual information rate yield
I
p
the same value, we also deﬁne the process DRF
∗
Ds (R, µ) = inf p∈R∗ (R,µ)
s Ep ρ(X0 , Y0 ), 206 CHAPTER 10. DISTORTION where R∗ (R, µ) is the collection of all stationary processes p having µ as an
s
∗
¯
¯
input distribution, having mutual information rate Ip ≤ R, and having Ip = Ip .
If µ is both stationary and ergodic, deﬁne the corresponding ergodic process
DRF’s by
¯
De (R, µ) =
inf
Ep ρ(X0 , Y0 ),
¯
p∈Re (R,µ) ∗
De (R, µ) = inf p∈R∗ (R,µ)
e Ep ρ(X0 , Y0 ), ¯
¯
where Re (R, µ) is the subset of Rs (R, µ) containing only ergodic measures and
∗
∗
Re (R, µ) is the subset of Rs (R, µ) containing only ergodic measures.
Theorem 10.6.1 Given a stationary source which possesses a reference letter
ˆ
in the sense that there exists a letter a∗ ∈ A such that
Eµ ρ(X0 , a∗ ) ≤ ρ∗ < ∞. (10.15) Fix R > 0. If D(R, µ) < ∞, then
∗
¯
D(R, µ) = Ds (R, µ) = Ds (R, µ). If in addition µ is ergodic, then also
∗
¯
D(R, µ) = De (R, µ) = De (R, µ). The proof of the theorem depends strongly on the relations among distortion
and mutual information for vectors and for SBM channels. These are stated
and proved in the following lemma, the proof of which is straightforward but
somewhat tedious. The theorem is proved after the lemma.
Lemma 10.6.3 Let µ be the process distribution of a stationary source {Xn }.
Let ρn ; n = 1, 2, · · · be a subadditive (e.g., additive) ﬁdelity criterion. Suppose
ˆ
that there is a reference letter a∗ ∈ A for which (10.15) holds. Let pN be a
N
N
ˆ
measure on (AN × AN , BA × BA ) having µN as input marginal; that is, pN (F ×
ˆ
N
ˆ
AN ) = µN (F ) for F ∈ BA . Let q denote the induced conditional probability
N
measure; that is, qxN (F ), xN ∈ AN , F ∈ BA , is a regular conditional probability
ˆ
measure. (This exists because the spaces are standard.) We abbreviate this
relationship as pN = µN q . Let X N , Y N denote the coordinate functions on
ˆ
AN × AN and suppose that
EpN 1
ρN ( X N , Y N ) ≤ D
N (10.16) and 1
I N (X N ; Y N ) ≤ R.
(10.17)
Np
If ν is an (N, δ ) SBM channel induced by q as in Example 9.4.11 and if p = µν
is the resulting hookup and {Xn , Yn } the input/output pair process, then
1
E p ρN ( X N , Y N ) ≤ D + ρ∗ δ
N (10.18) 10.6. THE DISTORTIONRATE FUNCTION 207 and
∗
¯
Ip (X ; Y ) = Ip (X ; Y ) ≤ R; (10.19) that is, the resulting mutual information rate of the induced stationary process
satisﬁes the same inequality as the vector mutual information and the resulting
distortion approximately satisﬁes the vector inequality provided δ is suﬃciently
small. Observe that if the ﬁdelity criterion is additive, the (10.18) becomes
Ep ρ1 (X0 , Y0 ) ≤ D + ρ∗ δ.
Proof: We ﬁrst consider the distortion as it is easier to handle. Since the SBM
channel is stationary and the source is stationary, the hookup p is stationary
and
1
1
Ep ρn (X n , Y n ) =
dmZ (z )Epz ρn (X n , Y n ),
n
n
where pz is the conditional distribution of {Xn , Yn } given {Zn }. Note that the
above formula reduces to Ep ρ(X0 , Y0 ) if the ﬁdelity criterion is additive because
n
of the stationarity. Given z , deﬁne J0 (z ) to be the collection of indices of z n
for which zi is not in an N cell. (See the discussion in Example 9.4.11.) Let
n
J1 (z ) be the collection of indices for which zi begins an N cell. If we deﬁne
n
the event G = {z : z0 begins an N − cell}, then i ∈ J1 (z ) if T i z ∈ G. From
−1
Corollary 9.4.3 mZ (G) ≤ N . Since µ is stationary and {Xn } and {Zn } are
mutually independent,
Epz ρ(Xi , a∗ ) + N nEpz ρn (X n , Y n ) ≤
n
i∈J0 (z ) n−1 n−1 1Gc (T i z )ρ∗ + = N
Epz ρ(Xi , YiN )
n
i∈J1 (z ) i=0 EpN ρN 1G (T i z ).
i=0 Since mZ is stationary, integrating the above we have that
Ep ρ1 (X0 , Y0 ) = ρ∗ mZ (Gc ) + N mZ (G)EpN ρN ≤ ρ∗ δ + EpN ρN ,
proving (10.18).
ˆ
Let rm and tm denote asymptotically accurate quantizers on A and A; that
is, as in Corollary 6.2.1 deﬁne
ˆ
X n = rm (X )n = (rm (X0 ), · · · , rm (Xn−1 ))
ˆ
and similarly deﬁne Y n = tm (Y )n . Then
I (rm (X )n ; tm (Y )n ) → I (X n ; Y n )
m→∞ and
¯
I (rm (X ); tm (Y )) → I ∗ (X ; Y ).
m→∞ 208 CHAPTER 10. DISTORTION We wish to prove that
¯
I (X ; Y ) =
=
= 1
I (rm (X )n ; tm (Y )n )
n
1
lim lim I (rm (X )n ; tm (Y )n )
m→∞ n→∞ n
I ∗ (X ; Y )
lim lim n→∞ m→∞ ¯
Since I ≥ I ∗ , we must show that
lim lim n→∞ m→∞ 1
1
I (rm (X )n ; tm (Y )n ) ≤ lim lim I (rm (X )n ; tm (Y )n ).
m→∞ n→∞ n
n We have that
ˆˆ
ˆ
ˆ
ˆˆ
I (X n ; Y n ) = I ((X n , Z n ); Y n ) − I (Z n , Y n X n )
and
ˆ
ˆ
ˆˆ
ˆ
ˆˆ
I ((X n , Z n ); Y n ) = I (X n ; Y n Z n ) + I (Y n ; Z n ) = I (X n ; Y n Z n )
ˆ
since X n and Z n are independent. Similarly,
ˆˆ
I (Z n ; Y n X n ) ˆ
ˆˆ
= H (Z n X n ) − H (Z n X n , Y n )
ˆˆ
ˆˆ
= H (Z n ) − H (Z n X n , Y n ) = I (Z n ; (X n , Y n )). Thus we need to show that
1
1
I (rm (X )n ; tm (Y )n Z n ) − I (Z n , (rm (X )n , tm (Y )n )) ≤
n
n
1
1
lim lim
I (rm (X )n ; tm (Y )n Z n ) − I (Z n , (rm (X )n , tm (Y )n )) .
m→∞ n→∞ n
n lim lim n→∞ m→∞ Since Zn has a ﬁnite alphabet, the limits of n−1 I (Z n , (rm (X )n , tm (Y )n )) are
¯
the same regardless of the order from Theorem 6.4.1. Thus I will equal I ∗ if we
can show that
1
¯
I (X ; Y Z ) = lim lim I (rm (X )n ; tm (Y )n Z n )
n→∞ m→∞ n
1
≤ lim lim I (rm (X )n ; tm (Y )n Z n ) = I ∗ (X ; Y Z ). (10.20)
m→∞ n→∞ n
This we now proceed to do. From Lemma 5.5.7 we can write
I (rm (X )n ; tm (Y )n Z n ) = I (rm (X )n ; tm (Y )n Z n = z n ) dPZ n (z n ). ˆˆ
Abbreviate I (rm (X )n ; tm (Y )n Z n = z n ) to Iz (X n ; Y n ). This is simply the
ˆ n and Y n under the distribution for (X n , Y n )
ˆ
ˆˆ
mutual information between X
given a particular random blocking sequence z . We have that
ˆˆ
ˆ
ˆˆ
Iz (X n ; Y n ) = Hz (Y n ) − Hz (Y n X n ). 10.6. THE DISTORTIONRATE FUNCTION 209 n
n
Given z , let J0 (z ) be as before. Let J2 (z ) denote the collection of all indices i
of zi for which zi begins an N cell except for the ﬁnal such index (which may
n
n
begin an N cell not completed within z n ). Thus J2 (z ) is the same as J1 (z )
except that the largest index in the latter collection may have been removed
if the resulting N cell was not completed within the ntuple. We have using
standard entropy relations that ˆˆ
Iz (X n ; Y n ) ≥ ˆˆ
ˆˆ ˆ
Hz (Yi Y i ) − Hz (Yi Y i , X i+1 )
n
i∈J0 (z ) ˆˆ
ˆˆˆ
Hz (YiN Y i ) − Hz (YiN Y i , X i+N ) . (10.21) +
n
i∈J2 (z ) n
For i ∈ J0 (z ), however, Yi is a∗ with probability one and hence
ˆˆ
ˆ
Hz (Yi Y i ) ≤ Hz (Yi ) ≤ Hz (Yi ) = 0 and
ˆˆ ˆ
ˆ
Hz (Yi Y i , X i+1 ) ≤ Hz (Yi ) ≤ Hz (Yi ) = 0.
Thus we have the bound
ˆˆ
Iz (X n ; Y n ) ≥ ˆˆ
ˆˆˆ
Hz (YiN Y i ) − Hz (YiN Y i , X i+N ) .
n
i∈J2 (z ) ˆ
ˆˆ
ˆ
ˆ
Iz (YiN ; (Y i , X i + N )) − Iz (YiN ; Y i ) =
n
i∈J2 (z ) ˆ
ˆN
ˆ
ˆ
Iz (YiN ; Xi ) − Iz (YiN ; Y i ) , ≥ (10.22) n
i∈J2 (z ) where the last inequality follows from the fact that I (U ; (V, W )) ≥ I (U ; V ).
n
For i ∈ J2 (z ) we have by construction and the stationarity of µ that
ˆˆ
ˆN ˆ
(10.23)
Iz (Xi ; YiN ) = IpN (X N ; Y N ).
n
As before let G = {z : z0 begins an N − cell}. Then i ∈ J2 (z ) if T i z ∈ G and
i < n − N and we can write 1
ˆˆ
Iz (X n ; Y n ) ≥
n
1
ˆˆ
I N (X N ; Y N )
np n−N −1 1G (T i z ) −
i=0 1
n n−N −1 ˆ
ˆ
Iz (YiN ; Y i )1G (T i z ).
i=0 All of the above terms are measurable functions of z and are nonnegative. Hence
they are integrable (although we do not yet know if the integral is ﬁnite) and
we have that
1 ˆn ˆn
I (X ; Y ) ≥
n
ˆˆ
Ipn (X N ; Y N )mZ (G) n−N
1
−
n
n n−N −1 ˆ
ˆ
dmZ (z )Iz (YiN ; Y i )1G (T i z ).
i=0 210 CHAPTER 10. DISTORTION To continue we use the fact that since the processes are stationary, we can
consider it to be a two sided process (if it is one sided, we can imbed it in a two
sided process with the same probabilities on rectangles). By construction
ˆ
ˆ
ˆ
Iz (YiN ; Y i ) = IT i z (Y0N ; (Y−i , · · · , Y−1 ))
and hence since mZ is stationary we can change variables to obtain
1 ˆn ˆn
I (X ; Y ) ≥
n
ˆˆ
Ipn (X N ; Y N )mZ (G) n−N 1
−
n
n n−N −1 ˆ
ˆ
ˆ
dmZ (z )Iz (Y0N ; (Y−i , · · · , Y−1 ))1G (z ).
i=0 We obtain a further bound from the inequalities
ˆ
ˆ
ˆ
Iz (Y0N ; (Y−i , · · · , Y−1 )) ≤ Iz (Y0N ; (Y−i , · · · , Y−1 )) ≤ Iz (Y0N ; Y − )
where Y − = (· · · , Y−2 , Y−1 ). Since Iz (Y0N ; Y − ) is measurable and nonnegative,
its integral is deﬁned and hence
lim n→∞ 1 ˆn ˆn n
ˆˆ
I (X ; Y Z ) ≥ Ipn (X N ; Y N )mZ (G) −
n dmZ (z )Iz (Y0N ; Y − ).
G We can now take the limit as m → ∞ to obtain
I ∗ (X ; Y Z ) ≥ Ipn (X N ; Y N )mZ (G) − dmZ (z )Iz (Y0N ; Y − ). (10.24) G This provides half of what we need.
Analogous to (10.21) we have the upper bound
ˆˆ
Iz (X n ; Y n ) ≤ ˆ
ˆˆ
ˆ
ˆ
Iz (YiN ; (Y i , X i+N )) − Iz (YiN ; Y i ) (10.25) n
i∈ J 1 ( z ) We note in passing that the use of J1 here assumes that we are dealing with a
one sided channel and hence there is no contribution to the information from
any initial symbols not contained in the ﬁrst N cell. In the two sided case time
0 could occur in the middle of an N cell and one could ﬁx the upper bound by
adding the ﬁrst index less than 0 for which zi begins an N cell to the above
sum. This term has no aﬀect on the limits. Taking the limits as m → ∞ using
Lemma 5.5.1 we have that
Iz (X n ; Y n ) ≤ Iz (YiN ; (Y i , X i+N )) − Iz (YiN ; Y i ) .
n
i∈J1 (z ) N
n
Given Z n = z n and i ∈ J1 (z ), (X i , Y i ) → Xi → YiN forms a Markov chain
because of the conditional independence and hence from Lemma 5.5.2 and Corollary 5.5.3
N
Iz (YiN , (Y i , X i+N )) = Iz (Xi ; YiN ) = IpN (X N ; Y N ). 10.6. THE DISTORTIONRATE FUNCTION 211 Thus we have the upper bound
n−1 1
1
1
Iz (X n ; Y n ) ≤ IpN (X N ; Y N )
1G (T i z ) −
n
n
n
i=0 n−1 Iz (YiN ; Y i )1G (T i z ).
i=0 Taking expectations and using stationarity as before we ﬁnd that
1
I (X n ; Y n Z n ) ≤ IpN (X N ; Y N )mZ (G)
n
− 1
n n−1 dmZ (z )Iz (Y0N ; (Y−i , · · · , Y−1 )).
i=0 G Taking the limit as n → ∞ using Lemma 5.6.1 yields
¯
I (X ; Y Z ) ≤ IpN (X N ; Y N )mZ (G) − dmZ (z )Iz (Y0N ; Y − ). (10.26) G Combining this with (10.24) proves that
¯
I (X ; Y Z ) ≤ I ∗ (X ; Y Z )
and hence that
¯
I (X ; Y ) = I ∗ (X ; Y ).
It also proves that
¯
I (X ; Y ) =
≤ ¯
¯
¯
I (X ; Y Z ) − I (Z ; (X, Y )) ≤ I (X ; Y Z )
1
IpN (X N ; Y N )mZ (G) ≤ IpN (X N ; Y N )
N using Corollary 9.4.3 to bound mX (G). This proves (10.19). 2 Proof of the theorem: We have immediately that
¯
R∗ (R, µ) ⊂ R∗ (R, µ) ⊂ Rs (R, µ)
e
s
and
¯
¯
R∗ (R, µ) ⊂ Re (R, µ) ⊂ Rs (R, µ),
e
and hence we have for stationary sources that
∗
¯
Ds (R, µ) ≤ Ds (R, µ) (10.27) ∗
∗
¯
Ds (R, µ) ≤ Ds (R, µ) ≤ De (R, µ) (10.28) ∗
¯
¯
Ds (R, µ) ≤ De (R, µ) ≤ De (R, µ). (10.29) and for ergodic sources that and 212 CHAPTER 10. DISTORTION We next prove that
¯
Ds (R, µ) ≥ D(R, µ). (10.30) ¯
If Ds (R, µ) is inﬁnite, the inequality is obvious. Otherwise ﬁx > 0 and choose
¯
¯
a p ∈ Rs (R, µ) for which Ep ρ1 (X0 , Y0 ) ≤ Ds (R, µ) + and ﬁx δ > 0 and choose
m so large that for n ≥ m we have that
¯
n−1 Ip (X n ; Y n ) ≤ Ip (X ; Y ) + δ ≤ R + δ.
For n ≥ m we therefore have that pn ∈ Rn (R + δ, µn ) and hence
¯
Ds (R, µ) + = Epn ρn ≥ Dn (R + δ, µ) ≥ D(R + δ, µ).
From Lemma 10.6.1 D(R, µ) is continuous in R and hence (10.30) is proved.
Lastly, ﬁx > 0 and choose N so large and pN ∈ RN (R, µN ) so that
EpN ρN ≤ DN (R, µN ) + 3 ≤ D(R, µ) + 2
.
3 Construct the corresponding (N, δ )SBM channel as in Example 9.4.11 with δ
small enough to ensure that δρ∗ ≤ /3. Then from Lemma 10.6.2 we have
∗
¯
that the resulting hookup p is stationary and that Ip = Ip ≤ R and hence
¯
p ∈ R∗ (R, µ) ⊂ Rs (R, µ). Furthermore, if µ is ergodic then so is p and hence
s
¯
p ∈ R∗ (R, µ) ⊂ Re (R, µ). From Lemma 10.6.2 the resulting distortion is
e
Ep ρ1 (X0 , Y0 ) ≤ EpN ρN + ρ∗ δ ≤ D(R, µ) + .
Since > 0 this implies the exisitence of a p ∈ R∗ (R, µ) (p ∈ R∗ (R, µ) if
e
s
µ is ergodic) yielding Ep ρ1 (X0 , Y0 ) arbitrarily close to D(R, µ. Thus for any
stationary source
∗
Ds (R, µ) ≤ D(R, µ)
and for any ergodic source
∗
De (R, µ) ≤ D(R, µ). With (10.27)–(10.30) this completes the proof. 2 The previous lemma is technical but important. It permits the construction
of a stationary and ergodic pair process having rate and distortion near that
of that for a ﬁnite dimensional vector described by the original source and a
ﬁnitedimensional conditional probability. Chapter 11 Source Coding Theorems
11.1 Source Coding and Channel Coding In this chapter and the next we develop the basic coding theorems of information
theory. As is traditional, we consider two important special cases ﬁrst and then
later form the overall result by combining these special cases. In the ﬁrst case
we assume that the channel is noiseless, but it is constrained in the sense that
it can only pass R bits per input symbol to the receiver. Since this is usually
insuﬃcient for the receiver to perfectly recover the source sequence, we attempt
to code the source so that the receiver can recover it with as little distortion as
possible. This leads to the theory of source coding or source coding subject to
a ﬁdelity criterion or data compression, where the latter name reﬂects the fact
that sources with inﬁnite or very large entropy are “compressed” to ﬁt across the
given communication link. In the next chapter we ignore the source and focus
on a discrete alphabet channel and construct codes that can communicate any of
a ﬁnite number of messages with small probability of error and we quantify how
large the message set can be. This operation is called channel coding or error
control coding. We then develop joint source and channel codes which combine
source coding and channel coding so as to code a given source for communication
over a given channel so as to minimize average distortion. The ad hoc division
into two forms of coding is convenient and will permit performance near that of
the OPTA function for the codes considered. 11.2 Block Source Codes for AMS Sources We ﬁrst consider a particular class of codes: block codes. For the time being
we also concentrate on additive distortion measures. Extensions to subadditive
distortion measures will be considered later. Let {Xn } be a source with a
standard alphabet A. Recall that an (N, K ) block code of a source {Xn } maps
N
successive nonoverlapping input vectors {XnN } into successive channel vectors
K
N
UnK = α(XnN ), where α : AN → B K is called the source encoder. We assume
213 214 CHAPTER 11. SOURCE CODING THEOREMS that the channel is noiseless, but that it is constrained in the sense that N source
time units corresponds to the same amount of physical time as K channel time
units and that
K log B 
≤ R,
N
where the inequality can be made arbitrarily close to equality by taking N and
K large enough subject to the physical stationarity constraint. R is called the
source coding rate or resolution in bits or nats per input symbol. We may wish
to change the values of N and K , but the rate is ﬁxed.
A reproduction or approximation of the original source is obtained by a
source decoder, which we also assume to be a block code. The decoder is a
ˆ
ˆ
ˆN
mapping β : B K → AN which forms the reproduction process {Xn } via XnN =
K
β (UnK ); n = 1, 2, . . .. In general we could have a reproduction dimension
diﬀerent from that of the input vectors provided they corresponded to the same
amount of physical time and a suitable distortion measure was deﬁned. We will
make the simplifying assumption that they are the same, however.
Because N source symbols are mapped into N reproduction symbols, we
will often refer to N alone as the block length of the source code. Observe that
the resulting sequence coder is N stationary. Our immediate goal is now the
following: Let E and D denote the collection of all block codes with rate no
greater than R and let ν be the given channel. What is the OPTA function
∆(µ, E , ν, D) for this system? Our ﬁrst step toward evaluating the OPTA is to
ﬁnd a simpler and equivalent expression for the current special case.
Given a source code consisting of encoder \alpha and decoder \beta, define
the codebook to be
\[
  C = \{\,\text{all } \beta(u^K);\ u^K \in B^K\,\},
\]
that is, the collection of all possible reproduction vectors available to the
receiver. For convenience we can index these words as
\[
  C = \{y_i;\ i = 1, 2, \ldots, M\},
\]
where N^{-1} \log M \le R by construction. Observe that if we are given only
a decoder \beta or, equivalently, a codebook, and if our goal is to minimize
the average distortion for the current block, then no encoder can do better
than the encoder \alpha^* which maps an input word x^N into the minimum
distortion available reproduction word, that is, define \alpha^*(x^N) to be
the u^K minimizing \rho_N(x^N, \beta(u^K)), an assignment we denote by
\[
  \alpha^*(x^N) = \min_{u^K}{}^{-1}\, \rho_N(x^N, \beta(u^K)).
\]
Observe that by construction we therefore have that
\[
  \rho_N(x^N, \beta(\alpha^*(x^N))) = \min_{y \in C} \rho_N(x^N, y)
\]
and the overall mapping of x^N into a reproduction is a minimum distortion or
nearest neighbor mapping. Define
\[
  \rho_N(x^N, C) = \min_{y \in C} \rho_N(x^N, y).
\]
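The minimum distortion encoder is exactly the nearest-neighbor rule used in practical vector quantizers. A minimal sketch of the rule (the codebook, block length, and squared-error distortion below are hypothetical illustrative choices, not taken from the text):

```python
def rho_N(x, y):
    # additive (single-letter sum) distortion over the block;
    # squared error is an assumed example choice of rho_1
    return sum((a - b) ** 2 for a, b in zip(x, y))

def encode(x, codebook):
    """Nearest-neighbor rule: return the index of the minimum-distortion
    reproduction word, i.e. the encoder alpha*(x^N)."""
    return min(range(len(codebook)), key=lambda i: rho_N(x, codebook[i]))

# hypothetical block-length-2 codebook with M = 4 words
# (rate = log2(4) / 2 = 1 bit per source symbol)
C = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
i = encode((0.9, 0.1), C)   # nearest word is (1.0, 0.0)
reproduction = C[i]
```

The decoder is then just the table lookup i -> C[i], matching the codebook/minimum-distortion structure described below.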
11.2. BLOCK SOURCE CODES FOR AMS SOURCES

To formally prove that this is the best encoder, observe that if the source \mu is
AMS and p is the joint distribution of the source and reproduction, then p is
also AMS. This follows since the channel induced by the block code is
N-stationary and hence also AMS with respect to T^N. This means that p is AMS
with respect to T^N, which in turn implies that it is AMS with respect to T
(Theorem 7.3.1 of [50]). Letting \bar{p} denote the stationary mean of p and
\bar{p}_N denote the N-stationary mean, we then have from (10.10) that for
any block code with codebook C
\[
  \frac{1}{N} E_{\bar{p}_N} \rho_N(X^N, Y^N)
  \ge \frac{1}{N} E_{\bar{p}_N} \rho_N(X^N, C),
\]
with equality if the minimum distortion encoder is used. For this reason we
can confine interest to block codes specified by a codebook: the encoder
produces the index of the minimum distortion codeword for the observed vector
and the decoder is a table lookup producing the codeword being indexed. A
code of this type is also called a vector quantizer or block quantizer.
Denote the performance of the block code with codebook C on the source \mu by
\[
  \Delta = \rho(C, \mu) = E_p \rho_\infty.
\]
Lemma 11.2.1 Given an AMS source \mu and a block length N codebook C, let
\bar{\mu}_N denote the N-stationary mean of \mu (which exists from Corollary
7.3.1 of [50]), let p denote the induced input/output distribution, and let
\bar{p} and \bar{p}_N denote its stationary mean and N-stationary mean,
respectively. Then
\[
  \rho(C, \mu) = E_{\bar{p}} \rho_1(X_0, Y_0)
  = \frac{1}{N} E_{\bar{p}_N} \rho_N(X^N, Y^N)
  = \frac{1}{N} E_{\bar{\mu}_N} \rho_N(X^N, C) = \rho(C, \bar{\mu}_N).
\]
Proof: The first two equalities follow from (10.10), the next from the use of
the minimum distortion encoder, the last from the definition of the
performance of a block code. □
It need not be true in general that \rho(C, \mu) equal \rho(C, \bar{\mu}).
For example, if \mu produces a single periodic waveform with period N and C
consists of a single period, then \rho(C, \mu) = 0 and
\rho(C, \bar{\mu}) > 0. It is the N-stationary mean and not the stationary
mean that is most useful for studying an N-stationary code.
We now define the operational distortion-rate function (DRF) for block codes
to be
\[
  \delta(R, \mu) = \Delta^*(\mu, \nu, E, D) = \inf_N \delta_N(R, \mu),
  \qquad
  \delta_N(R, \mu) = \inf_{C \in K(N,R)} \rho(C, \mu),
\]
where \nu is the noiseless channel as described previously, E and D are
classes of block codes for the channel, and K(N, R) is the class of all block
length N codebooks C with
\[
  \frac{1}{N} \log \|C\| \le R.
\]
\delta(R, \mu) is called the operational block coding distortion-rate
function (DRF).

Corollary 11.2.1 Given an AMS source \mu, then for any N and
i = 0, 1, \ldots, N-1,
\[
  \delta_N(R, \mu T^{-i}) = \delta_N(R, \bar{\mu}_N T^{-i}).
\]
Proof: For i = 0 the result is immediate from the lemma. For i \ne 0 it
follows from the lemma and the fact that the N-stationary mean of \mu T^{-i}
is \bar{\mu}_N T^{-i} (as is easily verified from the definitions). □

Reference Letters
Many of the source coding results will require a technical condition that is
a generalization of the reference letter condition of Theorem 10.6.1 for
stationary sources. An AMS source \mu is said to have a reference letter
a^* \in \hat{A} with respect to a distortion measure \rho = \rho_1 on
A \times \hat{A} if
\[
  \sup_n E_{\mu T^{-n}} \rho(X_0, a^*)
  = \sup_n E_\mu \rho(X_n, a^*) = \rho^* < \infty,   \tag{11.1}
\]
that is, there exists a letter for which E_\mu \rho(X_n, a^*) is uniformly
bounded above. If we define for any k the vector
a^{*k} = (a^*, a^*, \cdots, a^*) consisting of k a^*'s, then (11.1) implies
that
\[
  \sup_n \frac{1}{k} E_{\mu T^{-n}} \rho_k(X^k, a^{*k}) \le \rho^* < \infty.
  \tag{11.2}
\]
We assume for convenience that any block code of length N contains the
reference vector a^{*N}. This ensures that
\rho_N(x^N, C) \le \rho_N(x^N, a^{*N}) and hence that \rho_N(x^N, C) is
bounded above by a \mu-integrable function and hence is itself
\mu-integrable. This implies that
\[
  \delta(R, \mu) \le \delta_N(R, \mu) \le \rho^*.   \tag{11.3}
\]
The reference letter also works for the stationary mean source \bar{\mu}
since
\[
  \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \rho(x_i, a^*)
  = \rho_\infty(x, a^*),
\]
\mu-a.e. and \bar{\mu}-a.e., where a^* here also denotes an infinite sequence
of a^*'s. Since \rho_\infty is invariant we have from Lemma 6.3.1 of [50] and
Fatou's lemma that
\[
  E_{\bar{\mu}} \rho(X_0, a^*)
  = E_\mu \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \rho(X_i, a^*)
  \le \liminf_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} E_\mu \rho(X_i, a^*)
  \le \rho^*.
\]

Performance and distortion-rate functions
We next develop several basic properties of the performance and the
operational DRFs for block coding AMS sources with additive fidelity
criteria.

Lemma 11.2.2 Given two sources \mu_1 and \mu_2 and \lambda \in (0, 1), then
for any block code C
\[
  \rho(C, \lambda\mu_1 + (1-\lambda)\mu_2)
  = \lambda\rho(C, \mu_1) + (1-\lambda)\rho(C, \mu_2)
\]
and for any N
\[
  \delta_N(R, \lambda\mu_1 + (1-\lambda)\mu_2)
  \ge \lambda\delta_N(R, \mu_1) + (1-\lambda)\delta_N(R, \mu_2)
\]
and
\[
  \delta(R, \lambda\mu_1 + (1-\lambda)\mu_2)
  \ge \lambda\delta(R, \mu_1) + (1-\lambda)\delta(R, \mu_2).
\]
Thus performance is linear in the source and the operational DRFs are concave
(convex \cap) functions of the source. Lastly,
\[
  \delta_N\left(R + \frac{1}{N}, \lambda\mu_1 + (1-\lambda)\mu_2\right)
  \le \lambda\delta_N(R, \mu_1) + (1-\lambda)\delta_N(R, \mu_2).
\]
Proof: The equality follows from the linearity of expectation since
\rho(C, \mu) = E_\mu \rho(X^N, C). The first inequality follows from the
equality and the fact that the infimum of a sum is bounded below by the sum
of the infima. The next inequality follows similarly. To get the final
inequality, let C_i approximately yield \delta_N(R, \mu_i); that is,
\[
  \rho(C_i, \mu_i) \le \delta_N(R, \mu_i) + \epsilon.
\]
Form the union code C = C_1 \cup C_2 containing all of the words in both of
the codes. Then the rate of the code is
\[
  \frac{1}{N} \log \|C\| = \frac{1}{N} \log(\|C_1\| + \|C_2\|)
  \le \frac{1}{N} \log(2^{NR} + 2^{NR}) = R + \frac{1}{N}.
\]
This code yields performance
\[
  \rho(C, \lambda\mu_1 + (1-\lambda)\mu_2)
  = \lambda\rho(C, \mu_1) + (1-\lambda)\rho(C, \mu_2)
  \le \lambda\rho(C_1, \mu_1) + (1-\lambda)\rho(C_2, \mu_2)
  \le \lambda\delta_N(R, \mu_1) + (1-\lambda)\delta_N(R, \mu_2) + \epsilon.
\]
Since the leftmost term in the above equation can be no smaller than
\delta_N(R + 1/N, \lambda\mu_1 + (1-\lambda)\mu_2), the lemma is proved.
2
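The rate bookkeeping in the union-code step of the proof can be checked numerically. The sketch below (the block length and rate are hypothetical values chosen for illustration) verifies that merging two rate-R, block-length-N codebooks yields rate at most R + 1/N:

```python
from math import log2

def rate(num_words, block_length):
    # rate of a block-length-N codebook with the given number of words,
    # in bits per source symbol: (1/N) log2 ||C||
    return log2(num_words) / block_length

N, R = 8, 2.0
M = 2 ** int(N * R)          # each codebook has 2^{NR} words
union_words = M + M          # ||C1 union C2|| <= ||C1|| + ||C2||
r_union = rate(union_words, N)   # log2(2 * 2^{NR}) / N = R + 1/N
```

The 1/N penalty vanishes as the block length grows, which is why the lemma suggests that the DRFs are "very nearly" affine.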
The first and last inequalities in the lemma suggest that \delta_N is very
nearly an affine function of the source and hence perhaps \delta is as well.
We will later pursue this possibility, but we are not yet equipped to do so.

Before developing the connection between the distortion-rate functions of
AMS sources and those of their stationary mean, we pause to develop some
additional properties of operational DRFs in the special case of stationary
sources. These results follow Kieffer [77].

Lemma 11.2.3 Suppose that \mu is a stationary source. Then
\[
  \delta(R, \mu) = \lim_{N \to \infty} \delta_N(R, \mu).
\]
Thus the infimum over block lengths is given by the limit, so that longer
codes can do better.
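As the proof below shows, the sequence N\delta_N(R, \mu) is subadditive, and for any subadditive sequence a_N the ratio a_N/N converges to inf_N a_N/N (Fekete's lemma, the role played here by Lemma 7.5.1 of [50]). A quick numerical illustration with a hypothetical subadditive sequence:

```python
from math import sqrt

def a(n):
    # a hypothetical subadditive sequence: sqrt is subadditive,
    # so a(n + m) <= a(n) + a(m)
    return n + sqrt(n)

# spot-check subadditivity on a grid of index pairs
subadditive = all(a(n + m) <= a(n) + a(m) + 1e-12
                  for n in range(1, 40) for m in range(1, 40))

# Fekete: a(n)/n approaches its infimum (the limit is 1 here)
ratios = [a(n) / n for n in range(1, 2001)]
```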
Proof: Fix an N and an n < N and choose codes C_n \subset \hat{A}^n and
C_{N-n} \subset \hat{A}^{N-n} for which
\[
  \rho(C_n, \mu) \le \delta_n(R, \mu) + \frac{\epsilon}{2}, \qquad
  \rho(C_{N-n}, \mu) \le \delta_{N-n}(R, \mu) + \frac{\epsilon}{2}.
\]
Form the block length N code C = C_n \times C_{N-n}. This code has rate no
greater than R and has distortion
\[
  N\rho(C, \mu) = E \min_{y \in C} \rho_N(X^N, y)
  = E \min_{y^n \in C_n} \rho_n(X^n, y^n)
  + E \min_{v^{N-n} \in C_{N-n}} \rho_{N-n}(X_n^{N-n}, v^{N-n})
\]
\[
  = E \min_{y^n \in C_n} \rho_n(X^n, y^n)
  + E \min_{v^{N-n} \in C_{N-n}} \rho_{N-n}(X^{N-n}, v^{N-n})
  = n\rho(C_n, \mu) + (N-n)\rho(C_{N-n}, \mu)
\]
\[
  \le n\delta_n(R, \mu) + (N-n)\delta_{N-n}(R, \mu) + N\epsilon/2,
  \tag{11.4}
\]
where we have made essential use of the stationarity of the source. Since
\epsilon is arbitrary and since the leftmost term in the above equation can
be no smaller than N\delta_N(R, \mu), we have shown that
\[
  N\delta_N(R, \mu) \le n\delta_n(R, \mu) + (N-n)\delta_{N-n}(R, \mu)
\]
and hence that the sequence N\delta_N is subadditive. The result then follows
immediately from Lemma 7.5.1 of [50]. □
tion of R and hence is continuous for R > 0. func Proof: Pick R1 > R2 and λ ∈ (0, 1). Deﬁne R = λR1 + (1 − λ)R2 . For large n
deﬁne n1 = λn be the largest integer less than λn and let n2 = n − n1 . Pick
ˆ
codebooks Ci ⊂ Ani with rate Ri with distortion
ρ(Ci , µ) ≤ δni (Ri , µ) + . 11.2. BLOCK SOURCE CODES FOR AMS SOURCES 219 Analogous to (11.4), for the product code C = C1 × C2 we have
nρ(C , µ) = n1 ρ(C1 , µ) + n2 ρ(C2 , µ)
≤ n1 δn1 (R1 , µ) + n2 δn2 (R2 , µ) + n . The rate of the product code is no greater than R and hence the leftmost term
above is bounded below by nδn (R, µ). Dividing by n we have since is arbitrary
that
n1
n2
δn (R, µ) ≤
δn1 (R1 , µ) +
δn (R2 , µ).
n
n2
Taking n → ∞ we have using the lemma and the choice of ni that
δ (R, µ) ≤ λδ (R1 , µ) + (1 − λ)δ (R2 , µ),
proving the claimed convexity. 2 Corollary 11.2.3 If µ is stationary, then δ (R, µ) is an aﬃne function of µ.
Proof: From Lemma 11.2.2 we need only prove that
δ (R, λµ1 + (1 − λ)µ2 ) ≤ λδ (R, µ1 ) + (1 − λ)δ (R, µ2 ).
From the same lemma we have that for any N
δN (R + 1
, λµ1 + (1 − λ)µ2 ) ≤ λδN (R, µ1 ) + (1 − λ)δN (R, µ2 )
N For any K ≤ N we have since δN (R, µ) is nonincreasing in R that
δN (R + 1
, λµ1 + (1 − λ)µ2 ) ≤ λδN (R, µ1 ) + (1 − λ)δN (R, µ2 ).
K Taking the limit as N → ∞ yields from Lemma 11.2.3 that
δ (R + 1
, µ) ≤ λδ (R, µ1 ) + (1 − λ)δ (R, µ2 ).
K From Corollary 11.2.2, however, δ is continuous in R and the result follows by
letting K → ∞.
2
The following lemma provides the principal tool necessary for relating the
operational DRF of an AMS source with that of its stationary mean. It shows
that the DRF of an AMS source is not changed by shifting or, equivalently, by
redeﬁning the time origin.
Lemma 11.2.4 Let µ be an AMS source with a reference letter. Then for any
integer i δ (R, µ) = δ (R, µT −i ). 220 CHAPTER 11. SOURCE CODING THEOREMS Proof: Fix > 0 and let CN be a rate R block length N codebook for which
ρ(CN , µ) ≤ δ (R, µ) + /2. For 1 ≤ i ≤ N − 1 choose J large and deﬁne the block
length K = JN code CK (i) by
J −2 CK (i) = a∗ (N −i) × × CN × a∗ i ,
j =0 where a∗ l is an ltuple containing all a∗ ’s. CK (i) can be considered to be a code
consisting of the original code shifted by i time units and repeated many times,
with some ﬁller at the beginning and end. Except for the edges of the long
product code, the eﬀect on the source is to use the original code with a delay.
The code has at most (2N R )J −1 = 2KR 2−N R words; the rate is no greater than
R.
(i) For any K block xK the distortion resulting from using Ck is given by
KρK (xK , CK (i)) ≤ (N − i)ρN −i (xN −i , a∗ (N −i) ) + iρi (xi −i , a∗ i ).
K (11.5) Let {xn } denote the encoded process using the block code CK (i). If n is a
ˆ
multiple of K , then
nρn (xn , xn ) ≤
ˆ
n
K ((N − i)ρN −i (xN −i , a∗ (N −i) ) + iρi (xik+1)K −i , a∗ i ))
(
kK
k=0
n
K J −1 N ρN (xN −i+kN , CN ).
N +
k=0 If n is not a multiple of K we can further overbound the distortion by including
the distortion contributed by enough future symbols to complete a K block,
that is,
nρn (xn , xn ) ≤ nγn (x, x)
ˆ
ˆ
n
K +1 (N − i)ρN −i (xN −i , a∗ (N −i ) + iρi (xik+1)K −i , a∗ i )
(
kK =
k=0 ( n
K +1)J −1 N ρN (xN −i+kN , CN ).
N +
k=0 11.2. BLOCK SOURCE CODES FOR AMS SOURCES 221 Thus
N −i 1
ρn (x , x ) ≤
ˆ
K n/K
n n
K +1 ρN −i (X N −i (T kK x), a∗ (N −i ) n k=0
n
K i1
+
K n/K +1 ρi (X i (T (k+1)K −i x, a∗ i )
k=0 1
+
n/N ( n
K +1)J −1 ρN (X N (T (N −i)+kN x), CN ).
k=0 Since µ is AMS these quantities all converge to invariant functions:
lim ρn (xn , xn ) ≤
ˆ n→∞ + 1
N −i
lim
K m→∞ m
i
1
lim
K m→∞ m m−1 ρN −i (X N −i (T kK x), a∗ (N −i )
k=0 m−1 ρi (X i (T (k+1)K −i x, a∗ i )
k=0 1
m→∞ m m−1 ρN (X N (T (N −i)+kN x), CN ). + lim k=0 We now apply Fatou’s lemma, a change of variables, and Lemma 11.2.1 to
obtain
δ (R, µT −i ) ≤ ρ(CK (i), µT −i )
N −i
1
≤
lim sup
K
m
m→∞
+ i
1
lim
K m→∞ m m EµT −i ρN −i (X N i T kK , a∗ (N −i) )
k=0 m−1 EµT −i ρi (X i T (k+1)K −i , a∗ i )
k=0 1
m→∞ m m−1 ρN (X N T (N −i)+kN ), CN ). + EµT −i lim
≤ k=0 i
N −i ∗
1
ρ + ρ∗ + Eµ lim
m→∞ m
K
K m−1 ρN (X N T kN CN )
k=1 ≤ N∗
ρ + ρ(CN , µ).
K Thus if J and hence K are chosen large enough to ensure that N/K ≤ /2, then
δ (R, µT −i ) ≤ δ (R, µ),
which proves that δ (R, µT −i ) ≤ δ (R, µ). The reverse implication is found in
a similar manner: Let CN be a codebook for µT −i and construct a codebook 222 CHAPTER 11. SOURCE CODING THEOREMS CK (N − i) for use on µ. By arguments nearly identical to those above the reverse
inequality is found and the proof completed.
2
Corollary 11.2.4 Let \mu be an AMS source with a reference letter. Fix N and
let \bar{\mu} and \bar{\mu}_N denote the stationary and N-stationary means.
Then for R > 0
\[
  \delta(R, \bar{\mu}) = \delta(R, \bar{\mu}_N T^{-i});
  \quad i = 0, 1, \ldots, N-1.
\]
Proof: It follows from the previous lemma that the
\delta(R, \bar{\mu}_N T^{-i}) are all equal, and hence it follows from Lemma
11.2.2, Theorem 7.3.1 of [50], and Corollary 7.3.1 of [50] that
\[
  \delta(R, \bar{\mu}) \ge \frac{1}{N} \sum_{i=0}^{N-1}
  \delta(R, \bar{\mu}_N T^{-i}) = \delta(R, \bar{\mu}_N).
\]
To prove the reverse inequality, take \mu = \bar{\mu}_N in the previous lemma
and construct the codes C_K(i) as in the previous proof. Take the union code
C_K = \bigcup_{i=0}^{N-1} C_K(i), having block length K and rate at most
R + K^{-1} \log N. We have from Lemma 11.2.1 and (11.5) that
\[
  \rho(C_K, \bar{\mu})
  = \frac{1}{N} \sum_{i=0}^{N-1} \rho(C_K, \bar{\mu}_N T^{-i})
  \le \frac{1}{N} \sum_{i=0}^{N-1} \rho(C_K(i), \bar{\mu}_N T^{-i})
  \le \frac{N}{K}\rho^* + \rho(C_N, \bar{\mu}_N)
\]
and hence as before
\[
  \delta\left(R + \frac{1}{JN} \log N, \bar{\mu}\right)
  \le \delta(R, \bar{\mu}_N).
\]
From Corollary 11.2.2, \delta(R, \bar{\mu}) is continuous in R for R > 0
since \bar{\mu} is stationary. Hence taking J large enough yields
\delta(R, \bar{\mu}) \le \delta(R, \bar{\mu}_N). This completes the proof
since from the lemma
\delta(R, \bar{\mu}_N T^{-i}) = \delta(R, \bar{\mu}_N). □
We are now prepared to demonstrate the fundamental fact that the block source
coding operational distortion-rate function for an AMS source with an
additive fidelity criterion is the same as that of the stationary mean
process. This will allow us to assume stationarity when proving the actual
coding theorems.

Theorem 11.2.1 If \mu is an AMS source and \{\rho_n\} an additive fidelity
criterion with a reference letter, then for R > 0
\[
  \delta(R, \mu) = \delta(R, \bar{\mu}).
\]
Proof: We have from Corollaries 11.2.1 and 11.2.4 that
\[
  \delta(R, \bar{\mu}) \le \delta(R, \bar{\mu}_N)
  \le \delta_N(R, \bar{\mu}_N) = \delta_N(R, \mu).
\]
Taking the infimum over N yields
\[
  \delta(R, \bar{\mu}) \le \delta(R, \mu).
\]
Conversely, fix \epsilon > 0 and let C_N be a block length N codebook for
which \rho(C_N, \bar{\mu}) \le \delta(R, \bar{\mu}) + \epsilon. From Lemma
11.2.1, Corollary 11.2.1, and Lemma 11.2.4,
\[
  \delta(R, \bar{\mu}) + \epsilon \ge \rho(C_N, \bar{\mu})
  = \frac{1}{N} \sum_{i=0}^{N-1} \rho(C_N, \bar{\mu}_N T^{-i})
  \ge \frac{1}{N} \sum_{i=0}^{N-1} \delta_N(R, \bar{\mu}_N T^{-i})
  = \frac{1}{N} \sum_{i=0}^{N-1} \delta_N(R, \mu T^{-i})
\]
\[
  \ge \frac{1}{N} \sum_{i=0}^{N-1} \delta(R, \mu T^{-i}) = \delta(R, \mu),
\]
which completes the proof since \epsilon is arbitrary. □

Since the DRFs are the same for an AMS process and its stationary mean, this
immediately yields the following corollary from Corollary 11.2.2:

Corollary 11.2.5 If \mu is AMS, then \delta(R, \mu) is a convex function of R
and hence a continuous function of R for R > 0.

11.3 Block Coding Stationary Sources

We showed in the previous section that when proving block source coding
theorems for AMS sources, we could confine interest to stationary sources. In
this
section we show that in an important special case we can further conﬁne interest to only those stationary sources that are ergodic by applying the ergodic
decomposition. This will permit us to assume that sources are stationary and
ergodic in the next section when the basic Shannon source coding theorem is
proved and then extend the result to AMS sources which may not be ergodic.
As previously we assume that we have a stationary source {Xn } with distribution µ and we assume that {ρn } is an additive distortion measure and there
exists a reference letter. For this section we now assume in addition that the
alphabet A is itself a Polish space and that ρ1 (r, y ) is a continuous function of
ˆ
r for every y ∈ A. If the underlying alphabet has a metric structure, then it
is reasonable to assume that forcing input symbols to be very close in the underlying alphabet should force the distortion between either symbol and a ﬁxed
output to be close also. The following theorem is the ergodic decomposition of
the block source coding operational distortion rate function.
Theorem 11.3.1 Suppose that µ is the distribution of a stationary source and
that {ρn } is an additive ﬁdelity criterion with a reference letter. Assume also 224 CHAPTER 11. SOURCE CODING THEOREMS that ρ1 (·, y ) is a continuous function for all y . Let {µx } denote the ergodic
decomposition of µ. Then
δ (R, µ) = dµ(x)δ (R, µx ), that is, δ (R, µ) is the average of the operational DRFs of its ergodic components.
Proof: Analogous to the ergodic decomposition of entropy rate of Theorem 2.4.1,
we need to show that δ (R, µ) satisﬁes the conditions of Theorem 8.9.1 of [50].
We have already seen (Corollary 11.2.3) that it is an aﬃne function. We next see
that it is upper semicontinuous. Since the alphabet is Polish, choose a distance
dG on the space of stationary processes having this alphabet with the property
that G is constructed as in Section 8.2 of [50]. Pick an N large enough and a
length N codebook C so that
δ (R, µ) ≥ δN (R, µ) − 2 ≥ ρN (C , µ) − . ρN (xN , y ) is by assumption a continuous function of xN and hence so is ρN (xN , C ) =
miny∈C ρ(xN , y ). Since it is also nonnegative, we have from Lemma 8.2.4 of [50]
that if µn → µ then
lim sup Eµn ρN (X N , C ) ≤ Eµ ρN (X N , C ).
n→∞ The left hand side above is bounded below by
lim sup δN (R, µn ) ≥ lim sup δ (R, µn ).
n→∞ Thus since n→∞ is arbitrary,
lim sup δ (R, µn ) ≤ δ (R, µ)
n→∞ and hence δ (R, µ) upper semicontinuous in µ and hence also measurable. Since
the process has a reference letter, δ (R, µx ) is integrable since
δ (R, µX ) ≤ δN (R, µx ) ≤ Eµx ρ1 (X0 , a∗ )
which is integrable if ρ1 (x0 , a∗ ) is from the ergodic decomposition theorem.
Thus Theorem 8.9.1 of [50] yields the desired result.
2
The theorem was ﬁrst proved by Kieﬀer [77] for bounded continuous additive
distortion measures. The above extension removes the requirement that ρ1 be
bounded. 11.4 Block Coding AMS Ergodic Sources We have seen that the block source coding operational DRF of an AMS source
is given by that of its stationary mean. Hence we will be able to concentrate on
stationary sources when proving the coding theorem. 11.4. BLOCK CODING AMS ERGODIC SOURCES 225 Theorem 11.4.1 Let µ be an AMS ergodic source with a standard alphabet and
{ρn } an additive distortion measure with a reference letter. Then
δ (R, µ) = D(R, µ),
¯
where µ is the stationary mean of µ.
¯
Proof: From Theorem 11.2.1 δ (R, µ) = δ (R, µ) and hence we will be done if we
¯
can prove that
δ (R, µ) = D(R, µ).
¯
¯
This will follow if we can show that δ (R, µ) = D(R, µ) for any stationary ergodic
source with a reference letter. Henceforth we assume that µ is stationary and
ergodic.
We ﬁrst prove the negative or converse half of the theorem. First suppose
that we have a codebook C such that
ρN (C, µ) = Eµ min ρN (X N , y ) = δN (R, µ) + .
y ∈C ˆ
If we let XN denote the resulting reproduction random vector and let pN denote
ˆ
the resulting joint distribution of the input/output pair, then since X N has a
ﬁnite alphabet, Lemma 5.5.6 implies that
ˆ
ˆ
I (X N ; X N ) ≤ H (X N ) ≤ N R
and hence pN ∈ RN (R, µN ) and hence
ˆ
δN (R, µ) + ≥ EpN ρN (X N ; X N ) ≥ DN (R, µ).
Taking the limits as N → ∞ proves the easy half of the theorem:
δ (R, µ) ≥ D(R, µ).
(Recall that both operational DRF and the Shannon DRF are given by limits
if the source is stationary.)
The fundamental idea of Shannon’s positive source coding theorem is this:
for a ﬁxed block size N , choose a code at random according to a distribution
implied by the distortionrate function. That is, perform 2N R independent random selections of blocks of length N to form a codebook. This codebook is then
used to encode the source using a minimum distortion mapping as above. We
compute the average distortion over this doublerandom experiment (random
codebook selection followed by use of the chosen code to encode the random
source). We will ﬁnd that if the code generation distribution is properly chosen,
then this average will be no greater than D(R, µ) + . If the average over all
randomly selected codes is no greater than D(R, µ) + , however, than there
must be at least one code such that the average distortion over the source distribution for that one code is no greater than D(R, µ) + . This means that
there exists at least one code with performance not much larger than D(R, µ). 226 CHAPTER 11. SOURCE CODING THEOREMS Unfortunately the proof only demonstrates the existence of such codes, it does
not show how to construct them.
To ﬁnd the distribution for generating the random codes we use the ergodic process deﬁnition of the distortionrate function. From Theorem 10.6.1
(or Lemma 10.6.3) we can select a stationary and ergodic pair process with
distribution p which has the source distribution µ as one coordinate and which
has
1
Ep ρ(X0 , Y0 ) = EpN ρN (X N , Y N ) ≤ D(R, µ) +
(11.6)
N
and which has
¯
Ip (X ; Y ) = I ∗ (X ; Y ) ≤ R (11.7) (and hence information densities converge in L1 from Theorem 6.3.1). Denote
the implied vector distributions for (X N , Y N ), X N , and Y N by pN , µN , and
η N , respectively.
For any N we can generate a codebook C at random according to η N as
described above. To be precise, consider the random codebook as a large random
vector C = (W0 , W1 , · · · , WM ), where M = eN (R+ ) (where natural logarithms
are used in the deﬁnition of R), where W0 is the ﬁxed reference vector a∗ N and
where the remaining Wn are independent, and where the marginal distributions
for the Wn are given by η N . Thus the distribution for the randomly selected
code can be expressed as
M PC = × η N .
i=1 This codebook is then used with the optimal encoder and we denote the resulting
average distortion (over codebook generation and the source) by
∆N = Eρ(C , µ) = dPC (W )ρ(W , µ) (11.8) where
ρ(W , µ) = 1
1
EρN (X N , W ) =
N
N dµN (xN )ρN (xN , W ), and where
ρN (xN , C ) = min ρN (xN , y ).
y ∈C Choose δ > 0 and break up the integral over x into two pieces: one over a
set GN = {x : N −1 ρN (xN , a∗ N ) ≤ ρ∗ + δ } and the other over the complement
of this set. Then
∆N ≤
Gc
N 1
ρN (xN , a∗ N ) dµN (xN )
N
+ 1
N dµN (xN )ρN (xN , W ), (11.9) dPC (W )
GN 11.4. BLOCK CODING AMS ERGODIC SOURCES 227 where we have used the fact that ρN (xN , mW ) ≤ ρN (xN , a∗ N ). Fubini’s theorem implies that because
dµN (xN )ρN (xN , a∗ N ) < ∞
and
ρN (xN , W ) ≤ ρN (xN , a∗ N ),
the limits of integration in the second integral of (11.9) can be interchanged to
obtain the bound
∆N ≤ 1
N ρN (xN , a∗ N )dµN (xN )
Gc
N + 1
N dµN (xN ) dPC (W )ρN (xN , W ) (11.10) GN The rightmost term in (11.10) can be bound above by observing that
1
N dµN (xN )[ dPC (W )ρN (xN , W )] GN = 1
N dµN (xN )[
+ 1
N dPC (W )ρN (xN , W )
C :ρN (xN ,C )≤N (D +δ ) GN dPC (W )ρN (xN , W )]
W :ρN (xN ,W )>N (D +δ ) dµN (xN )[D + δ + ≤
GN 1∗
(ρ + δ )
N dpC (W )]
W :ρN (xN ,W )>N (D +δ ) where we have used the fact that for x ∈ G the maximum distortion is given by
ρ∗ + δ . Deﬁne the probability
P (N −1 ρN (xN , C ) > D + δ xN ) = dpC (W )
W :ρN (xN ,W )>N (D +δ ) and summarize the above bounds by
∆N ≤ D + δ + ( ρ ∗ + δ ) 1
N dµN (xN )P (N −1 ρN (xN , C ) > D + δ xN ) + 1
N dµN (xN )ρN (xN , a∗ N ). (11.11)
Gc
N The remainder of the proof is devoted to proving that the two integrals above
go to 0 as N → ∞ and hence
lim sup ∆N ≤ D + δ.
N →∞ (11.12) 228 CHAPTER 11. SOURCE CODING THEOREMS
Consider ﬁrst the integral aN = 1
N dµN (xN )ρN (xN , a∗ N ) =
Gc
N dµN (xN )1Gc (xN )
N 1
ρN (xN , a∗ N ).
N We shall see that this integral goes to zero as an easy application of the ergodic
theorem. The integrand is dominated by N −1 ρN (xN , a∗ N ) which is uniformly
integrable (Lemma 4.7.2 of [50]) and hence the integrand is itself uniformly
integrable (Lemma 4.4.4 of [50]). Thus we can invoke the extended Fatou lemma
to conclude that
lim sup aN ≤ N →∞ ≤ 1
ρN (xN , a∗ N )
N
N →∞
1
dµN (xN )(lim sup 1Gc (xN ))(lim sup ρN (xN , a∗ N )).
N
N →∞
N →∞ N dµN (xN ) lim sup 1Gc (xN )
N We have, however, that lim supN →∞ 1Gc (xN ) is 0 unless xN ∈ Gc i.o. But this
N
N
set has measure 0 since with µN probability 1, an x is produced so that
1
N →∞ N N −1 ρ(xi , a∗ ) = ρ∗ lim i=0 exists and hence with probability one one gets an x which can yield
N −1 ρN (xN , a∗ N ) > ρ∗ + δ
at most for a ﬁnite number of N . Thus the above integral of the product of a
function that is 0 a.e. with a dominated function must itself be 0 and hence
lim sup aN = 0. (11.13) N →∞ We now consider the second integral in (11.11):
bN = (ρ∗ + δ ) 1
N dµN (xN )P (N −1 ρN (xN , C ) > D + δ xN ). Recall that P (ρN (xN , C ) > D + δ xN ) is the probability that for a ﬁxed input
block xN , a randomly selected code will result in a minimum distortion codeword
larger than D + δ . This is the probability that none of the M words (excluding
the reference code word) selected independently at random according to to the
distribution η N lie within D + δ of the ﬁxed input word xN . This probability
is bounded above by
P( 1
1
ρN (xN , C ) > D + δ xN ) ≤ [1 − η N ( ρN (xN , Y N ) ≤ D + δ )]M
N
N where
ηN ( 1
ρN (xN , Y N ) ≤ D + δ )) =
N dη N (y N ).
1
y N : N ρN (xN ,y N )≤D +δ 11.4. BLOCK CODING AMS ERGODIC SOURCES 229 Now mutual information comes into the picture. The above probability can be
bounded below by adding a condition:
ηN ( 1
ρN (xN , Y N ) ≤ D + δ )
N
1
1
≥ η N ( ρN (xN , Y N ) ≤ D + δ and iN (xN , Y N ) ≤ R + δ ),
N
N where 1
1
iN (xN , y N ) =
ln fN (xN , y N ),
N
N where
fN (xN , y N ) = dpN (xN , y N )
,
d(µN × η N )(xN , y N ) the RadonNikodym derivative of pN with respect to the product measure µN ×
η N . Thus we require both the distortion and the sample information be less
than slightly more than their limiting value. Thus we have in the region of
integration that
1
1
iN (xN ; y N ) =
ln fN (xN , y N ) ≤ R + δ
N
N
and hence
ηN (ρN (xN , Y N ) ≤ D + δ ) ≥ dη N (y N )
y N :ρN (xN ,y N )≤D +δ,fN (xN ,y N )≤eN (R+δ) ≥ e−N (R+δ) dη N (y N )fN (xN , y N )
y N :ρN (xN ,y N )≤D +δ,fN (xN ,y N )≤eN (R+δ) which yields the bound
P( 1
1
ρN (xN , C ) > D + δ xN ) ≤ [1 − η N ( ρN (xN , Y N ) ≤ D + δ )]M
N
N ≤ [1−e−N (R+δ) dη N (y N )fN (xN , y N )]M ,
1
1
y N : N ρN (xN ,y N )≤D +δ, N iN (xN ,y N )≤R+δ Applying the inequality
(1 − αβ )M ≤ 1 − β + e−M α
for α, β ∈ [0, 1] yields
P( 1
ρN (xN , C ) > D + δ xN ) ≤
N
dη N (y N ) × fN (xN , y N ) 1−
1
1
y N : N ρN (xN ,y N )≤D +δ, N iN (xN ,y N )≤R+δ + e[−M e −N (R+δ ) . 230 CHAPTER 11. SOURCE CODING THEOREMS
Averaging with respect to the distribution µN yields bN
=
ρ∗ + δ
≤ dµN (xN )P (ρN (xN , C ) > D + δ xN ) dµN (xN ) 1 − dη N (y N )
1
y N :ρN (xN ,y N )≤N (D +δ ), N iN (xN ,y N )≤R+δ ×fN (xN , y N ) + e−M e −N (R+δ ) d(µN × η N )(xN , y N ) =1−
1
1
y N : N ρN (xN ,y N )≤D +δ, N iN (xN ,y N )≤R+δ × f N ( x N , y N ) + e −M e
= 1 + e−M e −N (R+δ ) −N (R+δ ) dpN (xN , y N ) −
1
1
y N : N ρN (xN ,y N )≤D +δ, N iN (xN ,y N )≤R+δ = 1 + e −M e
− pN (y N : −N (R+δ ) 1
1
ρN (xN , y N ) ≤ D + δ, iN (xN , y N ) ≤ R + δ ). (11.14)
N
N Since M is bounded below by eN (R+ ) 1, the exponential term is bounded
above by
(N (R+ ) −N (R+δ )
N ( −δ )
e
+e−N (R+δ) ]
+e−N (R+δ) ]
e[−e
= e[−e
.
> δ , this term goes to 0 as N → ∞.
The probability term in (11.14) goes to 1 from the mean ergodic theorem
applied to ρ1 and the mean ergodic theorem for information density since mean
convergence (or the almost everywhere convergence proved elsewhere) implies
convergence in probability. This implies that If lim sup bN = 0
n→∞ which with (11.13) gives (11.12). Choosing an N so large that ∆N ≤ δ , we
have proved that there exists a block code C with average distortion less than
D(R, µ) + δ and rate less than R + and hence
δ (R + , µ) ≤ D(R, µ) + δ. (11.15) Since and δ can be chosen as small as desired and since D(R, µ) is a continuous
function of R (Lemma 10.6.1), the theorem is proved.
2
The source coding theorem is originally due to Shannon [131] [132], who
proved it for discrete i.i.d. sources. It was extended to stationary and ergodic
discrete alphabet sources and Gaussian sources by Gallager [43] and to stationary and ergodic sources with abstract alphabets by Berger [10] [11], but an
error in the information density convergence result of Perez [124] (see Kieﬀer
[75]) left a gap in the proof, which was subsequently repaired by Dunham [35].
The result was extended to nonergodic stationary sources and metric distortion 11.5. SUBADDITIVE FIDELITY CRITERIA 231 measures and Polish alphabets by Gray and Davisson [53] and to AMS ergodic
processes by Gray and Saadat [61]. The method used here of using a stationary
and ergodic measure to construct the block codes and thereby avoid the block
ergodic decomposition of Nedoma [107] used by Gallager [43] and Berger [11]
was suggested by Pursley and Davisson [29] and developed in detail by Gray
and Saadat [61]. 11.5 Subadditive Fidelity Criteria In this section we generalize the block source coding theorem for stationary
sources to subadditive ﬁdelity criteria. Several of the interim results derived
previously are no longer appropriate, but we describe those that are still valid
in the course of the proof of the main result. Most importantly, we now consider only stationary and not AMS sources. The result can be extended to
AMS sources in the twosided case, but it is not known for the onesided case.
Source coding theorems for subadditive ﬁdelity criteria were ﬁrst developed by
Mackenthun and Pursley [97].
Theorem 11.5.1 Let µ denote a stationary and ergodic distribution of a source
{Xn } and let {ρn } be a subadditive ﬁdelity criterion with a reference letter, i.e.,
ˆ
there is an a∗ ∈ A such that
Eρ1 (X0 , a∗ ) = ρ∗ < ∞.
Then the operational DRF for the class of block codes of rate less than R is
given by the Shannon distortionrate function D(R, µ).
Proof: Suppose that we have a block code of length N , e.g., a block encoder
ˆ
α : AN → B K and a block decoder β : B K → AN . Since the source is stationary,
the induced input/output distribution is then N stationary and the performance
resulting from using this code on a source µ is
∆N = Ep ρ∞ = 1
ˆ
Ep ρN (X N , X N ),
N ˆ
where {X N } is the resulting reproduction process. Let δN (R, µ) denote the
inﬁmum over all codes of length N of the performance using such codes and let
δ (R, µ) denote the inﬁmum of δN over all N , that is, the operational distortion
rate function. We do not assume a codebook/minimum distortion structure
because the distortion is now eﬀectively context dependent and it is not obvious
that the best codes will have this form. Assume that given an > 0 we have
chosen for each N a length N code such that
δN (R, µ) ≥ ∆N − .
As previously we assume that
K log B 
≤ R,
N 232 CHAPTER 11. SOURCE CODING THEOREMS where the constraint R is the rate of the code. As in the proof of the converse
coding theorem for an additive distortion measure, we have that for the resulting
ˆ
process I (X N ; X N ) ≤ RN and hence
∆N ≥ DN (R, µ).
From Lemma 10.6.2 we can take the inﬁmum over all N to ﬁnd that
δ (R, µ) = inf δN (R, µ) ≥ inf DN (R, µ) − = D(R, µ) − .
N N Since is arbitrary, δ (R, µ) ≤ D(R, µ), proving the converse theorem.
To prove the positive coding theorem we proceed in an analogous manner
to the proof for the additive case, except that we use Lemma 10.6.3 instead of
Theorem 10.6.1. First pick an N large enough so that
DN (R, µ) ≤ D(R, µ) + δ
2 and then select a pN ∈ RN (R, µN ) such that
EpN 1
δ
ρN (X N , Y N ) ≤ DN (R, µ) + ≤ D(R, µ) + δ.
N
Now construct as in Lemma 10.6.3 a stationary and ergodic process p which will have (10.6.4) and (10.6.5) satisfied (the right N-th order distortion and information). This step taken, the proof proceeds exactly as in the additive case since the reference vector yields the bound

(1/N) ρ_N(x^N, a*^N) ≤ (1/N) Σ_{i=0}^{N−1} ρ_1(x_i, a*),

which converges, and since N^{−1} ρ_N(x^N, y^N) converges as N → ∞ with p-probability one from the subadditive ergodic theorem. Thus the existence of a code satisfying (11.15) can be demonstrated (which uses the minimum distortion encoder) and this implies the result since D(R, µ) is a continuous function of R (Lemma 10.6.1).
□

11.6 Asynchronous Block Codes

The block codes considered so far all assume block synchronous communication,
that is, that the decoder knows where the blocks begin and hence can deduce
the correct words in the codebook from the index represented by the channel
block. In this section we show that we can construct asynchronous block codes
with little loss in performance or rate; that is, we can construct a block code
so that a decoder can uniquely determine how the channel data are parsed and
hence deduce the correct decoding sequence. This result will play an important
role in the development in the next section of sliding block coding theorems.

Given a source µ let δ_async(R, µ) denote the operational distortion rate function for block codes with the added constraint that the decoder be able to synchronize, that is, correctly parse the channel codewords. Obviously

δ_async(R, µ) ≥ δ(R, µ)
since we have added a constraint. The goal of this section is to prove the
following result:
Theorem 11.6.1 Given an AMS source with an additive fidelity criterion and a reference letter,

δ_async(R, µ) = δ(R, µ),
that is, the operational DRF for asynchronous codes is the same as that for
ordinary codes.
Proof: A simple way of constructing a synchronized block code is to use a prefix code: every codeword begins with a short prefix or source synchronization word or, simply, sync word, that is not allowed to appear anywhere else within a word or as any part of an overlap of the prefix and a piece of the word. The decoder then need only locate the prefix in order to decode the block begun by the prefix.
The insertion of the sync word causes a reduction in the available number of
codewords and hence a loss in rate, but ideally this loss can be made negligible if
properly done. We construct a code in this fashion by ﬁnding a good codebook
of slightly smaller rate and then indexing it by channel K tuples with this preﬁx
property.
Suppose that our channel has a rate constraint R, that is, if source N-tuples are mapped into channel K-tuples then

(K log ‖B‖)/N ≤ R,

where B is the channel alphabet. We assume that the constraint is achievable on the channel in the sense that we can choose N and K so that the physical stationarity requirement is met (N source time units corresponds to K channel time units) and such that

‖B‖^K ≈ e^{NR},    (11.16)

at least for large N.
If K is to be the block length of the channel code words, let δ be small, define k(K) = ⌊δK⌋ + 1, and consider channel codewords which have a prefix of k(K) occurrences of a single channel letter, say b, followed by a sequence of K − k(K) channel letters which satisfy the following constraint: no k(K)-tuple beginning after the first symbol can be b^{k(K)}. We permit b's to occur at the end of a K-tuple so that a k(K)-tuple of b's may occur in the overlap of the end of a codeword and the new prefix, since this causes no confusion: e.g., if we see an elongated sequence of b's, the actual code information starts at the right edge.
Let M(K) denote the number of distinct channel K-tuples of this form. Since M(K) is the number of distinct reproduction codewords that can be indexed by channel codewords, the codebooks will be constrained to have rate

R_K = (ln M(K))/N.

We now study the behavior of R_K as K gets large. There are a total of
‖B‖^{K−k(K)} K-tuples having the given prefix. Of these, no more than (K − k(K))‖B‖^{K−2k(K)} have the sync sequence appearing somewhere within the word (there are fewer than K − k(K) possible locations for the sync word and for each location the remaining K − 2k(K) symbols can be anything). Lastly, we must also eliminate those words for which the first i symbols are b for i = 1, 2, . . . , k(K) − 1, since this will cause confusion about the right edge of the sync sequence. These terms contribute

Σ_{i=1}^{k(K)−1} ‖B‖^{K−k(K)−i}

bad words. Using the geometric progression formula to sum the above series, we have that it is bounded above by

‖B‖^{K−k(K)−1} / (1 − 1/‖B‖).

Thus the total number of available channel vectors is at least

M(K) ≥ ‖B‖^{K−k(K)} − (K − k(K))‖B‖^{K−2k(K)} − ‖B‖^{K−k(K)−1}/(1 − 1/‖B‖).

Thus
R_K ≥ (1/N) ln ‖B‖^{K−k(K)} + (1/N) ln(1 − (K − k(K))‖B‖^{−k(K)} − 1/(‖B‖ − 1))
    = ((K − k(K))/N) ln ‖B‖ + (1/N) ln((‖B‖ − 2)/(‖B‖ − 1) − (K − k(K))‖B‖^{−k(K)})
    ≥ (1 − δ)R + o(N),

where o(N) is a term that goes to 0 as N (and hence K) goes to infinity. Thus given a channel with rate constraint R and given ε > 0, we can construct for N sufficiently large a collection of approximately e^{N(R−ε)} channel K-tuples (where K ≈ NR) which are synchronizable, that is, satisfy the prefix condition.
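The counting bound above can be sanity-checked by brute force for small parameters. The following sketch is illustrative only and not from the text: the alphabet size ‖B‖ = 3, the lengths K = 6 and k = 2, and the choice of 0 as the sync letter are all arbitrary. It enumerates the K-tuples that begin with the sync prefix and contain no other k-run of the sync letter, then compares the exact count M(K) with the lower bound just derived.

```python
from itertools import product

def count_sync_words(B, K, k, b=0):
    """Count K-tuples over an alphabet of size B that start with the sync
    prefix b^k and contain no other run of k b's beginning after position 0."""
    count = 0
    prefix = (b,) * k
    for body in product(range(B), repeat=K - k):
        word = prefix + body
        # forbid the sync pattern at every starting position except 0
        if all(word[i:i + k] != prefix for i in range(1, K - k + 1)):
            count += 1
    return count

B, K, k = 3, 6, 2
M = count_sync_words(B, K, k)
# Lower bound from the text: B^{K-k} - (K-k) B^{K-2k} - B^{K-k-1}/(1 - 1/B)
bound = B ** (K - k) - (K - k) * B ** (K - 2 * k) - B ** (K - k - 1) / (1 - 1 / B)
assert M >= bound > 0
```

For these small parameters the bound is quite loose (it gives 4.5 while the exact count is 44), which is expected from the crude union-style counting; the point of the bound is only its asymptotic rate as K grows.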
We are now ready to construct the desired code. Fix δ > 0 and then choose ε > 0 small enough to ensure that

δ(R − ε, µ) ≤ δ(R, µ) + δ/3

(which we can do since δ(R, µ) is continuous in R). Then choose an N large enough to give a prefix channel code as above and to yield a rate R − ε codebook C so that

ρ_N(C, µ) ≤ δ_N(R − ε, µ) + δ/3 ≤ δ(R − ε, µ) + 2δ/3 ≤ δ(R, µ) + δ.    (11.17)
The resulting code proves the theorem. □

11.7 Sliding Block Source Codes

We now turn to sliding block codes. For simplicity we consider codes which
map blocks into single symbols. For example, a sliding block encoder will be a mapping f : A^N → B and the decoder will be a mapping g : B^L → Â. In the case of one-sided processes, for example, the channel sequence would be given by

U_n = f(X_n^N)

and the reproduction sequence by

X̂_n = g(U_n^L).
as delay. This is often done by having an encoder mapping f : A2N +1 → B ,
ˆ
a decoder g : B 2L+1 → A, and the channel and reproduction sequences being
deﬁned by
Un
ˆ
Xn = f (X−N , · · · , X0 , · · · , XN ),
= g (U−L , · · · , U0 , · · · , UN ). We shall emphasize the twosided case.
The ﬁnal output can be viewed as a sliding block coding of the input:
ˆ
Xn = g (f (Xn−L−N , · · · , Xn−L+N ), · · · , f (Xn+L−N , · · · , Xn+L+N ))
= gf (Xn−(N +L) , · · · , Xn+(N +L) ), where we use gf to denote the overall coding, that is, the cascade of g and f .
Note that the delay and memory of the overall code are the sums of those for
the encoder and decoder. The overall window length is 2(N + L) + 1
Since one channel symbol is sent for every source symbol, the rate of such a
code is given simply by R = log B  bits per source symbol. The obvious problem with this restriction is that we are limited to rates which are logarithms of
integers, e.g., we cannot get fractional rates. As previously discussed, however,
we could get fractional rates by appropriate redeﬁnition of the alphabets (or,
equivalently, of the shifts on the corresponding sequence spaces). For example, 236 CHAPTER 11. SOURCE CODING THEOREMS regardless of the code window lengths involved, if we shift l source symbols to
produce a new group of k channel symbols (to yield an (l, k )stationary encoder)
and then shift a group of k channel symbols to produce a new group of k source
symbols, then the rate is
k
R = log B 
l
bits or nats per source symbol and the overall code f g is lstationary. The
added notation to make this explicit is signiﬁcant and the generalization is
straightforward; hence we will stick to the simpler case.
We can deﬁne the sliding block operational DRF for a source and channel in
the natural way. Suppose that we have an encoder f and a decoder g . Deﬁne
the resulting performance by
ρ(f g, µ) = Eµf g ρ∞ ,
where µf g is the input/output hookup of the source µ connected to the deterministic channel f g and where ρ∞ is the sequence distortion. Deﬁne
δSBC (R, µ) = inf ρ(f g, µ) = ∆∗ (µ, E , ν, D),
f,g where E is the class of all ﬁnite length sliding block encoders and D is the
collection of all ﬁnite length sliding block decoders. The rate constraint R is
determined by the channel.
Assume as usual that µ is AMS with stationary mean µ. Since the cas¯
cade of stationary channels f g is itself stationary (Lemma 9.4.7), we have from
Lemma 9.3.2 that µf g is AMS with stationary mean µf g . This implies from
¯
(10.10) that for any sliding block codes f and g
Eµf g ρ∞ = Eµf g ρ∞
¯
and hence
δSBC (R, µ) = δSBC (R, µ).
¯
A fact we now formalize as a lemma.
Lemma 11.7.1 Suppose that µ is an AMS source with stationary mean µ and
¯
let {ρn } be an additive ﬁdelity criterion. Let δSBC (R, µ) denote the sliding block
coding operational distortion rate function for the source and a channel with
rate constraint R. Then
δSBC (R, µ) = δSBC (R, µ).
¯
The lemma permits us to concentrate on stationary sources when quantifying
the optimal performance of sliding block codes.
The principal result of this section is the following: 11.7. SLIDING BLOCK SOURCE CODES 237 Theorem 11.7.1 Given an AMS and ergodic source µ and an additive ﬁdelity
criterion with a reference letter,
δSBC (R, µ) = δ (R, µ),
that is, the class of sliding block codes is capable of exactly the same performance
as the class of block codes. If the source is only AMS and not ergodic, then at
least
δSBC (R, µ) ≥ δ (R, µ), (11.18) Proof: The proof of (11.18) follows that of Shields and Neuhoﬀ [135] for the ﬁnite
alphabet case, except that their proof was for ergodic sources and coded only
typical input sequences. Their goal was diﬀerent because they measured the rate
of a sliding block code by the entropy rate of its output, eﬀectively assuming
that further almostnoiseless coding was to be used. Because we consider a ﬁxed
channel and measure the rate in the usual way as a coding rate, this problem
does not arise here. From the previous lemma we need only prove the result for
stationary sources and hence we henceforth assume that µ is stationary. We ﬁrst
prove that sliding block codes can perform no better than block codes, that is,
ˆ
(11.18) holds. Fix δ > 0 and suppose that f : A2N +1 → B and g : B 2L+1 → A
are ﬁnitelength sliding block codes for which
ρ(f g, µ) ≤ δSBC (R, µ) + δ.
ˆ
This yields a cascade sliding block code f g : A2(N +L)+1 → A which we use to
construct a block codebook. Choose K large (to be speciﬁed later). Observe
an input sequence xn of length n = 2(N + L) + 1 + K and map it into a
reproduction sequence xn as follows: Set the ﬁrst and last (N + L) symbols
ˆ
+L
to the reference letter a∗ , that is, xN +L = xN−N −L = a∗ (N +L) . Complete the
0
n
remaining reproduction symbols by sliding block coding the source word using
the given codes, that is,
2(N +L)+1 xi = f g (xi−(N +L) ); i = N + L + 1, · · · , K + N + L.
ˆ
Thus the long block code is obtained by sliding block coding, except at the
edges where the sliding block code is not permitted to look at previous or future
source symbols and hence are ﬁlled with a reference symbol. Call the resulting
codebook C . The rate of the block code is less than R = log B  because n
channel symbols are used to produce a reproduction word of length n and hence
the codebook can have no more that B n possible vectors. Thus the rate
is log B  since the codebook is used to encode a source ntuple. Using this
codebook with a minimum distortion rule can do no worse (except at the edges)
ˆ
than if the original sliding block code had been used and therefore if Xi is the
reproduction process produced by the block code and Yi that produced by the 238 CHAPTER 11. SOURCE CODING THEOREMS sliding block code, we have (invoking stationarity) that
N +L−1 ρ(Xi , a∗ ))+ nρ(C , µ) ≤ E (
i=0 K +2(L+N ) K +N +L E( ρ(Xi , a∗ )) ρ(Xi , Yi )) + E (
i=N +L i=K +N +L+1 ≤ 2(N + L)ρ∗ + K (δSBC (R, µ) + δ )
and hence
δ (R, µ) ≤ K
2(N + L)
ρ∗ +
(δSBC (R, µ) + δ ).
2(N + L) + K
2(N + L) + K By choosing δ small enough and K large enough we can make make the right
hand side arbitrarily close to δSBC (R, µ), which proves (11.18).
We now proceed to prove the converse inequality,
δ (R, µ) ≥ δSBC (R, µ), (11.19) which involves a bit more work.
Before carefully tackling the proof, we note the general idea and an “almost
proof” that unfortunately does not quite work, but which may provide some
insight. Suppose that we take a very good block code, e.g., a block code C of
block length N such that
ρ(C , µ) ≤ δ (R, µ) + δ
for a ﬁxed δ > 0. We now wish to form a sliding block code for the same channel
with approximately the same performance. Since a sliding block code is just a
stationary code (at least if we permit an inﬁnite window length), the goal can be
viewed as “stationarizing” the nonstationary block code. One approach would
be the analogy of the SBM channel: Since a block code can be viewed as a deterministic block memoryless channel, we could make it stationary by inserting
occasional random spacing between long sequences of blocks. Ideally this would
then imply the existence of a sliding block code from the properties of SBM
channels. The problem is that the SBM channel so constructed would no longer
be a deterministic coding of the input since it would require the additional input
of a random punctuation sequence. Nor could one use a random coding argument to claim that there must be a speciﬁc (nonrandom) punctuation sequence
which could be used to construct a code since the deterministic encoder thus
constructed would not be a stationary function of the input sequence, that is,
it is only stationary if both the source and punctuation sequences are shifted
together. Thus we are forced to obtain the punctuation sequence from the
source input itself in order to get a stationary mapping. The original proofs that this could be done used a strong form of the Rohlin-Kakutani theorem of Section 9.5 given by Shields [133] [56] [58]. The Rohlin-Kakutani theorem demonstrates the existence of a punctuation sequence with the property that
the punctuation sequence is very nearly independent of the source. Lemma 9.5.2
is a slightly weaker result than the strong form considered by Shields.
The code construction described above can therefore be approximated by
using a coding of the source instead of an independent process. Shields and Neuhoff [135] provided a simpler proof of a result equivalent to the Rohlin-Kakutani theorem and provided such a construction for finite alphabet sources.
Davisson and Gray [27] provided an alternative heuristic development of a similar construction. We here adopt a somewhat diﬀerent tack in order to avoid
some of the problems arising in extending these approaches to general alphabet sources and to nonergodic sources. The principal diﬀerence is that we do
not try to prove or use any approximate independence between source and the
punctuation process derived from the source (which is code dependent in the
case of continuous alphabets). Instead we take a good block code and ﬁrst produce a much longer block code that is insensitive to shifts or starting positions
using the same construction used to relate block coding performance of AMS
processes and that of their stationary mean. This modiﬁed block code is then
made into a sliding block code using a punctuation sequence derived from the
source. Because the resulting block code is little aﬀected by starting time, the
only important property is that most of the time the block code is actually
in use. Independence of the punctuation sequence and the source is no longer
required. The approach is most similar to that of Davisson and Gray [27], but
the actual construction diﬀers in the details. An alternative construction may
be found in Kieﬀer [80].
Given δ > 0 and ε > 0, choose for large enough N an asynchronous block code C of block length N such that

(1/N) log ‖C‖ ≤ R − 2ε

and

ρ(C, µ) ≤ δ(R, µ) + δ.    (11.20)

The continuity of the block operational distortion rate function and the theorem for asynchronous block source coding ensure that we can do this. Next
we construct a longer block code that is more robust against shifts. For i = 0, 1, . . . , N − 1 construct the codes C_K(i) having length K = JN as in the proof of Lemma 11.2.4. These codebooks look like J − 1 repetitions of the codebook C starting from time i, with the leftover symbols at the beginning and end being filled by the reference letter. We then form the union code C_K = ∪_i C_K(i) as in the proof of Corollary 11.2.4 which has all the shifted versions. This code has rate no greater than R − 2ε + (JN)^{−1} log N. We assume that J is large enough to ensure that

(1/(JN)) log N ≤ ε    (11.21)

so that the rate is no greater than R − ε, and that

(3/J) ρ* ≤ δ.    (11.22)

We now construct a sliding block encoder f and decoder g from the given block
code. From Corollary 9.4.2 we can construct a finite-length sliding block code of {X_n} to produce a two-sided (NJ, γ)-random punctuation sequence {Z_n}. From the lemma P(Z_0 = 2) ≤ γ and hence by the continuity of integration (Corollary 4.4.2 of [50]) we can choose γ small enough to ensure that

∫_{x: Z_0(x)=2} ρ(x_0, a*) dµ(x) ≤ δ.    (11.23)

Recall that the punctuation sequence usually produces 0's followed by NJ − 1 1's, with occasional 2's interspersed to make things stationary. The sliding block encoder f begins with time 0 and scans backward NJ time units to find the first 0 in the punctuation sequence. If there is no such 0, then put out an arbitrary channel symbol b. If there is such a 0, say at time −n, then the block codebook C_K is applied to the input K-tuple x_{−n}^K to produce the minimum distortion codeword

u^K = min^{−1}_{y∈C_K} ρ_K(x_{−n}^K, y)

and the appropriate channel symbol, u_n, is put out. The sliding block encoder thus has length at most 2NJ + 1.
The decoder sliding block code g scans left N symbols to see if it ﬁnds a
codebook sync sequence (remember the codebook is asynchronous and begins
with a unique preﬁx or sync sequence). If it does not ﬁnd one, it produces a
reference letter. (In this case it is not in the middle of a code word.) If it
does ﬁnd one starting in position −n, then it produces the corresponding length
N codeword from C and then puts out the reproduction symbol in position n.
Note that the decoder sliding block code has a ﬁnite window length of at most
2N + 1.
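The decoder's scan can be illustrated on a toy stream. Everything below is a hypothetical example, not from the text: codewords of length 6 begin with the sync prefix 00 and their bodies contain no other 00-run, as required by the prefix construction of the previous section, so a codeword begins exactly where a run of sync letters ends at a nonzero symbol ("the information starts at the right edge").

```python
import random

random.seed(0)
K, k = 6, 2                     # codeword length and sync-prefix length
# Valid codewords: prefix 0^2, with no other 00-run starting after position 0.
codebook = [(0, 0, 1, 2, 1, 0), (0, 0, 2, 1, 1, 1), (0, 0, 1, 0, 2, 1)]

stream, starts = [], []
for _ in range(8):              # concatenate a random sequence of codewords
    starts.append(len(stream))
    stream.extend(random.choice(codebook))

# Sync rule: a codeword starts wherever a k-run of 0's is followed by a
# nonzero symbol; trailing 0's of the previous word merge harmlessly.
detected = [i for i in range(len(stream) - k)
            if tuple(stream[i:i + k]) == (0,) * k and stream[i + k] != 0]
assert detected == starts
```

Note that the word (0, 0, 1, 2, 1, 0) ends in a sync letter; when it is followed by another codeword the decoder sees a run of three 0's and correctly takes the last two as the prefix.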
We now evaluate the average distortion resulting from use of this sliding block code. As a first step we mimic the proof of Lemma 10.6.3 up to the assumption of mutual independence of the source and the punctuation process (which is not the case here) to get that for a long source sequence of length n, if the punctuation sequence is z, then

ρ_n(x^n, x̂^n) = Σ_{i∈J_0^n(z)} ρ(x_i, a*) + Σ_{i∈J_1^n(z)} ρ_{NJ}(x_i^{NJ}, x̂_i^{NJ}),

where J_0^n(z) is the collection of all i for which z_i is not in an NJ-cell (and hence filler is being sent) and J_1^n(z) is the collection of all i for which z_i is 0 and hence begins an NJ-cell and hence an NJ-length codeword. Each one of these length-NJ codewords contains at most N reference letters at the beginning and N reference letters at the end, and in the middle it contains all shifts of sequences of length-N codewords from C. Thus for any i ∈ J_1^n(z), we can write

ρ_{NJ}(x_i^{NJ}, x̂_i^{NJ}) ≤ ρ_N(x_i^N, a*^N) + ρ_N(x_{i+NJ−N}^N, a*^N) + Σ_{j=i/N}^{i/N+J−1} ρ_N(x_{jN}^N, C).

This yields the bound
(1/n) ρ_n(x^n, x̂^n) ≤ (1/n) Σ_{i∈J_0^n(z)} ρ(x_i, a*) + (1/n) Σ_{i∈J_1^n(z)} [ρ_N(x_i^N, a*^N) + ρ_N(x_{i+NJ−N}^N, a*^N)] + (1/n) Σ_{j=0}^{n/N} ρ_N(x_{jN}^N, C)

= (1/n) Σ_{i=0}^{n−1} 1_2(z_i) ρ(x_i, a*) + (1/n) Σ_{i=0}^{n−1} 1_0(z_i) [ρ_N(x_i^N, a*^N) + ρ_N(x_{i+NJ−N}^N, a*^N)] + (1/n) Σ_{j=0}^{n/N} ρ_N(x_{jN}^N, C),

where 1_a(z_i) is 1 if z_i = a and 0 otherwise. Taking expectations above we have
that

E((1/n) ρ_n(X^n, X̂^n)) ≤ (1/n) Σ_{i=0}^{n−1} E(1_2(Z_i) ρ(X_i, a*)) + (1/n) Σ_{i=0}^{n−1} E(1_0(Z_i) [ρ_N(X_i^N, a*^N) + ρ_N(X_{i+NJ−N}^N, a*^N)]) + (1/n) Σ_{j=0}^{n/N} E ρ_N(X_{jN}^N, C).

Invoke stationarity to write
E((1/n) ρ_n(X^n, X̂^n)) ≤ E(1_2(Z_0) ρ(X_0, a*)) + (1/(NJ)) E(1_0(Z_0) ρ_{2N+1}(X^{2N+1}, a*^{(2N+1)})) + (1/N) E ρ_N(X^N, C).

The first term is bounded above by δ from (11.23). The middle term can be bounded above using (11.22) by

(1/(JN)) E(1_0(Z_0) ρ_{2N+1}(X^{2N+1}, a*^{(2N+1)})) ≤ (1/(JN)) E ρ_{2N+1}(X^{2N+1}, a*^{(2N+1)}) = ((2N + 1)/(JN)) ρ* ≤ (3/J) ρ* ≤ δ.

Thus we have from the above and (11.20) that

E ρ(X_0, Y_0) ≤ ρ(C, µ) + 2δ ≤ δ(R, µ) + 3δ.

This proves the existence of a finite window sliding block encoder and a finite window length decoder with performance arbitrarily close to that achievable by block codes. □
The only use of ergodicity in the proof of the theorem was in the selection of the source sync sequence used to imbed the block code in a sliding block code. The result would extend immediately to nonergodic stationary sources (and hence to nonergodic AMS sources) if we could somehow find a single source sync sequence that would work for all ergodic components in the ergodic decomposition of the source. Note that the source sync sequence affects only the encoder and is irrelevant to the decoder, which looks for asynchronous codewords
prefixed by channel sync sequences (which consisted of a single channel letter
repeated several times). Unfortunately, one cannot guarantee the existence of a
single source sequence with small but nonzero probability under all of the ergodic
components. Since the components are ergodic, however, an inﬁnite length sliding block encoder could select such a source sequence in a simple (if impractical)
way: Proceed as in the proof of the theorem up to the use of Corollary 9.4.2.
Instead of using this result, we construct by brute force a punctuation sequence
for the ergodic component in eﬀect. Suppose that G = {Gi ; i = 1, 2, . . .} is a
countable generating ﬁeld for the input sequence space. Given δ , the inﬁnite
length sliding block encoder ﬁrst ﬁnds the smallest value of i for which
0 < lim_{n→∞} (1/n) Σ_{k=0}^{n−1} 1_{G_i}(T^k x)

and

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} 1_{G_i}(T^k x) ρ(x_k, a*) ≤ δ,

that is, we find a set with strictly positive relative frequency (and hence strictly
positive probability with respect to the ergodic component in eﬀect) which occurs rarely enough to ensure that the sample average distortion between the
symbols produced when Gi occurs and the reference letter is smaller than δ .
Given N and δ there must exist an i for which these relations hold (apply the
proof of Lemma 9.4.4 to the ergodic component in eﬀect with γ chosen to satisfy (11.23) for that component and then replace the arbitrary set G by a set
in the generating ﬁeld having very close probability). Analogous to the proof of
Lemma 9.4.4 we construct a punctuation sequence {Zn } using the event Gi in
place of G. The proof then follows in a like manner except that now from the
dominated convergence theorem we have that
E(1_2(Z_0) ρ(X_0, a*)) = lim_{n→∞} (1/n) Σ_{i=0}^{n−1} E(1_2(Z_i) ρ(X_i, a*)) = E(lim_{n→∞} (1/n) Σ_{i=0}^{n−1} 1_2(Z_i) ρ(X_i, a*)) ≤ δ

by construction.
The above argument is patterned after that of Davisson and Gray [27] and
extends the theorem to stationary nonergodic sources if inﬁnite window sliding
block encoders are allowed. We can then approximate this encoder by a finite-window encoder, but we must make additional assumptions to ensure that the resulting encoder yields a good approximation in the sense of overall distortion. Suppose that f is the infinite window length encoder and g is the finite window-length (say 2L + 1) decoder. Let G denote a countable generating field of rectangles for the input sequence space. Then from Corollary 4.2.2 applied to G, given ε > 0 we can find for sufficiently large N a finite window sliding block code r : A^{2N+1} → B such that Pr(r ≠ f) ≤ ε/(2L + 1), that is, the two encoders produce the same channel symbol with high probability. The issue is
whether this implies that ρ(f g, µ) and ρ(r̄g, µ) are therefore also close, which would complete the proof. Let r̄ : A^T → B denote the infinite-window sliding block encoder induced by r, i.e., r̄(x) = r(x_{−N}^{2N+1}). Then

ρ(f g, µ) = E(ρ(X_0, X̂_0)) = Σ_{b∈B^{2L+1}} ∫_{x∈V_f(b)} dµ(x) ρ(x_0, g(b)),

where

V_f(b) = {x : f(x)^{2L+1} = b},

where f(x)^{2L+1} is shorthand for (f(T^i x), i = −L, . . . , L), that is, the channel (2L + 1)-tuple produced by the source sequence x using the encoder f. We therefore have that
ρ(r̄g, µ) ≤ Σ_{b∈B^{2L+1}} ∫_{x∈V_f(b)} dµ(x) ρ(x_0, g(b)) + Σ_{b∈B^{2L+1}} ∫_{x∈V_r̄(b)−V_f(b)} dµ(x) ρ(x_0, g(b))
= ρ(f g, µ) + Σ_{b∈B^{2L+1}} ∫_{x∈V_r̄(b)−V_f(b)} dµ(x) ρ(x_0, g(b))
≤ ρ(f g, µ) + Σ_{b∈B^{2L+1}} ∫_{x∈V_r̄(b)∆V_f(b)} dµ(x) ρ(x_0, g(b)).

By making N large enough, however, we can make

µ(V_r̄(b) ∆ V_f(b))

arbitrarily small simultaneously for all b ∈ B^{2L+1} and hence force all of the integrals above to be arbitrarily small by the continuity of integration. With Lemma 11.7.1 and Theorem 11.7.1 this completes the proof of the following theorem.
Theorem 11.7.2 Given an AMS source µ and an additive
ﬁdelity criterion with a reference letter,
δ_SBC(R, µ) = δ(R, µ),

that is, the class of sliding block codes is capable of exactly the same performance as the class of block codes.

The sliding block source coding theorem immediately yields an alternative
coding theorem for a code structure known as trellis encoding source codes
wherein the sliding block decoder is kept but the encoder is replaced by a tree
or trellis search algorithm such as the Viterbi algorithm [41]. The details of
inferring the trellis encoding source coding theorem from the sliding-block source coding theorem can be found in [52].

11.8 A Geometric Interpretation of Operational DRFs

We close this chapter on source coding theorems with a geometric interpretation of the operational DRFs in terms of the ρ̄ distortion between sources. Suppose that µ is a stationary and ergodic source and that {ρ_n} is an additive fidelity criterion with a reference letter. Suppose that we have a nearly optimal sliding
block encoder and decoder for µ and a channel with rate R, that is, if the overall process is {X_n, X̂_n} and

E ρ(X_0, X̂_0) ≤ δ(R, µ) + δ.

If the overall hookup (source/encoder/channel/decoder) yields a distribution p on {X_n, X̂_n} and a distribution η on the reproduction process {X̂_n}, then clearly

ρ̄(µ, η) ≤ δ(R, µ) + δ.
Furthermore, since the channel alphabet is B, the channel process must have entropy rate less than R = log ‖B‖ and hence the reproduction process must also have entropy rate less than R from Corollary 4.2.5. Since δ is arbitrary,

δ(R, µ) ≥ inf_{η: H̄(η)≤R} ρ̄(µ, η).
Suppose next that µ and η are stationary and ergodic and that H̄(η) ≤ R. Choose a stationary p having µ and η as coordinate processes such that

E_p ρ(X_0, Y_0) ≤ ρ̄(µ, η) + δ.

We have easily that Ī(X; Y) ≤ H̄(η) ≤ R and hence the left-hand side is bounded below by the process distortion rate function D̄_s(R, µ). From Theorem 10.6.1 and the block source coding theorem, however, this is just the operational distortion rate function. We have therefore proved the following:
Theorem 11.8.1 Let µ be a stationary and ergodic source and let {ρ_n} be an additive fidelity criterion with a reference letter. Then

δ(R, µ) = inf_{η: H̄(η)≤R} ρ̄(µ, η),

that is, the operational DRF (and hence the distortion-rate function) of a stationary ergodic source is just the "distance" in the ρ̄ sense to the nearest stationary and ergodic process with the specified reproduction alphabet and with entropy rate less than R.
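For a feel of this "distance," consider the simplest case (an added illustration, not from the text): two i.i.d. binary sources with Hamming distortion. For memoryless sources the infimum over stationary couplings is achieved by a single-letter coupling, and a direct search over the feasible couplings recovers the familiar value ρ̄ = |p − q|.

```python
# Single-letter coupling search for rho-bar between two i.i.d. binary
# sources with Hamming distortion (illustrative sketch, not from the text).
p, q = 0.3, 0.55            # P(X=1) and P(Y=1) for the two sources

best = 1.0
steps = 10_000
lo, hi = max(0.0, p + q - 1.0), min(p, q)   # feasible values of m = P(X=1, Y=1)
for s in range(steps + 1):
    m = lo + (hi - lo) * s / steps
    mismatch = (p - m) + (q - m)            # P(X != Y) under this coupling
    best = min(best, mismatch)

assert abs(best - abs(p - q)) < 1e-3        # rho-bar = |p - q| in this case
```

The minimizing coupling sets P(X = 1, Y = 1) = min(p, q), i.e., it aligns the two marginals as much as possible, which is exactly the total-variation optimal coupling.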
This result originated in [55].

Chapter 12

Coding for noisy channels
12.1 Noisy Channels

In the treatment of source coding the communication channel was assumed to
be noiseless. If the channel is noisy, then the coding strategy must be diﬀerent.
Now some form of error control is required to undo the damage caused by the
channel. The overall communication problem is usually broken into two pieces:
A source coder is designed for a noiseless channel with a given resolution or rate
and an error correction code is designed for the actual noisy channel in order
to make it appear almost noiseless. The combination of the two codes then
provides the desired overall code or joint source and channel code. This division
is natural in the sense that optimizing a code for a particular source may suggest
quite diﬀerent structure than optimizing it for a channel. The structures must
be compatible at some point, however, so that they can be used together.
This division of source and channel coding is apparent in the subdivision of
this chapter. We shall begin with a basic lemma due to Feinstein [38] which is the
basis of traditional proofs of coding theorems for channels. It does not consider
a source at all, but ﬁnds for a given conditional distribution the maximum
number of inputs which lead to outputs which can be distinguished with high
probability. Feinstein’s lemma can be thought of as a channel coding theorem for
a channel which is used only once and which has no past or future. The lemma
immediately provides a coding theorem for the special case of a channel which
has no input memory or anticipation. The diﬃculties enter when the conditional
distributions of output blocks given input blocks depend on previous or future
inputs. This diﬃculty is handled by imposing some form of continuity on the
channel with respect to its input, that is, by assuming that if the channel input
is known for a big enough block, then the conditional probability of outputs
during the same block is known nearly exactly regardless of previous or future
inputs. The continuity condition which we shall consider is that of d̄-continuous channels. Joint source and channel codes have been obtained for more general channels called weakly continuous channels (see, e.g., Kieffer [81] [82]), but these results require a variety of techniques not yet considered here and do not follow as a direct descendent of Feinstein's lemma.
Block codes are extended to sliding-block codes in a manner similar to that
for source codes: First it is shown that asynchronous block codes can be synchronized and then that the block codes can be “stationarized” by the insertion
of random punctuation. The approach to synchronizing channel codes is based
on a technique of Dobrushin [33].
We consider stationary channels almost exclusively, thereby not including
interesting nonstationary channels such as finite state channels with an arbitrary starting state. We will discuss such generalizations and we point out that they are straightforward for two-sided processes, but the general theory of AMS channels for one-sided processes is not in a satisfactory state. Lastly, we emphasize ergodic channels. In fact, for the sliding block codes the channels are also
required to be totally ergodic, that is, ergodic with respect to all block shifts.
As previously discussed, we emphasize digital, i.e., discrete, channels. A
few of the results, however, are as easily proved under somewhat more general
conditions and hence we shall do so. For example, given the background of this
book it is actually easier to write things in terms of measures and integrals than
in terms of sums over probability mass functions. This additional generality
will also permit at least a description of how the results extend to continuous
alphabet channels.

12.2 Feinstein's Lemma

Let (A, B_A) and (B, B_B) be measurable spaces called the input space and the output space, respectively. Let P_X denote a probability distribution on (A, B_A) and let ν(F|x), F ∈ B_B, x ∈ A, denote a regular conditional probability distribution on the output space. ν can be thought of as a "channel" with random variables as input and output instead of sequences. Define the hookup P_X ν = P_XY by

P_XY(F) = ∫ dP_X(x) ν(F_x|x).

Let P_Y denote the induced output distribution and let P_X × P_Y denote the
resulting product distribution. Assume that P_XY << (P_X × P_Y) and define the Radon-Nikodym derivative

f = dP_XY / d(P_X × P_Y)

and the information density

i(x, y) = ln f(x, y).    (12.1)

We use abbreviated notation for densities when the meanings should be clear
from context, e.g., f instead of f_{XY}. Observe that for any set F ∈ B_A

∫_{F×B} d(P_X × P_Y)(x, y) f(x, y) = ∫_F dP_X(x) ∫ dP_Y(y) f(x, y)

and

∫_{F×B} d(P_X × P_Y)(x, y) f(x, y) = ∫_{F×B} dP_{XY}(x, y) = P_X(F) ≤ 1,

and hence

∫ dP_Y(y) f(x, y) ≤ 1; P_X-a.e. (12.2)

Feinstein's lemma shows that we can pick M inputs {x_i ∈ A; i = 1, 2, . . . , M},
and a corresponding collection of M disjoint output events {Γ_i ∈ B_B; i = 1, 2, . . . , M}, with the property that, given the input x_i, the output will with high probability be in Γ_i. We call the collection C = {x_i, Γ_i; i = 1, 2, . . . , M} a
code with codewords xi and decoding regions Γi . We do not require that the Γi
exhaust B .
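As a concrete finite-alphabet instance of these objects, the density f and the information density (12.1), together with the bound (12.2), can be computed directly. The binary symmetric "channel" and uniform input below are illustrative assumptions, not part of the text:

```python
import math

# Hypothetical setup: input and output alphabets {0, 1}, a binary symmetric
# kernel nu with crossover 0.1, and input distribution P_X = (0.5, 0.5).
P_X = [0.5, 0.5]
eps = 0.1
nu = [[1 - eps, eps], [eps, 1 - eps]]  # nu[x][y] = nu({y}|x)

# Hookup P_XY and induced output distribution P_Y.
P_XY = [[P_X[x] * nu[x][y] for y in range(2)] for x in range(2)]
P_Y = [sum(P_XY[x][y] for x in range(2)) for y in range(2)]

# Radon-Nikodym derivative f = dP_XY / d(P_X x P_Y): here just a ratio of
# probability mass functions.  The information density is i = ln f.
f = [[P_XY[x][y] / (P_X[x] * P_Y[y]) for y in range(2)] for x in range(2)]
i_dens = [[math.log(f[x][y]) for y in range(2)] for x in range(2)]

# The bound (12.2): for each x the integral of f(x, .) dP_Y is at most 1
# (here exactly 1, since the supports match).
for x in range(2):
    total = sum(P_Y[y] * f[x][y] for y in range(2))
    assert abs(total - 1.0) < 1e-12
```

The same arithmetic works for any finite joint distribution; absolute continuity holds automatically whenever the product distribution gives positive mass to every pair that the hookup does.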
The generalization of Feinstein’s original proof for ﬁnite alphabets to general
measurable spaces is due to Kadota [70] and the following proof is based on his.
Lemma 12.2.1 Given an integer M and a > 0 there exist x_i ∈ A; i = 1, . . . , M, and a measurable partition F = {Γ_i; i = 1, . . . , M} of B such that

ν(Γ_i^c|x_i) ≤ M e^{−a} + P_{XY}(i ≤ a).
Proof: Define G = {x, y : i(x, y) > a} and set ε = M e^{−a} + P_{XY}(i ≤ a) = M e^{−a} + P_{XY}(G^c). The result is obvious if ε ≥ 1 and hence we assume that ε < 1 and hence also that

P_{XY}(G^c) ≤ ε < 1

and therefore that

P_{XY}(i > a) = P_{XY}(G) = ∫ dP_X(x) ν(G_x|x) > 1 − ε > 0.

This implies that the set Ã = {x : ν(G_x|x) > 1 − ε and (12.2) holds} must have positive measure under P_X. We now construct a code consisting of input points x_i and output sets Γ_{x_i}. Choose an x_1 ∈ Ã and define Γ_{x_1} = G_{x_1}. Next choose, if possible, a point x_2 ∈ Ã for which ν(G_{x_2} − Γ_{x_1}|x_2) > 1 − ε. Continue in this way until either M points have been selected or all the points in Ã have been exhausted. In particular, given the pairs {x_j, Γ_{x_j}}; j = 1, 2, . . . , i − 1, satisfying the condition, find an x_i for which

ν(G_{x_i} − ⋃_{j<i} Γ_{x_j} | x_i) > 1 − ε, (12.3)

and set Γ_{x_i} = G_{x_i} − ⋃_{j<i} Γ_{x_j}. If the procedure terminates before M points have been collected, denote the final point's index by n. Observe that

ν(Γ_{x_i}^c|x_i) ≤ ε; i = 1, 2, . . . , n,

and hence the lemma will be proved if we can show that necessarily n cannot
be strictly less than M . We do this by assuming the contrary and ﬁnding a
contradiction.
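Before completing the argument, the greedy selection just described can be sketched in code. The finite alphabet, the dictionaries `nu` and `G`, and the function name are illustrative stand-ins for the abstract objects of the proof:

```python
def feinstein_select(inputs, nu, G, M, eps):
    """Greedy Feinstein-style selection: pick up to M inputs x_i with
    disjoint decoding sets Gamma_i = G_x minus previously claimed outputs,
    each satisfying nu(Gamma_i | x_i) > 1 - eps.

    nu[x] is a dict mapping outputs to probabilities, G[x] is the set of
    "good" outputs for input x (the section G_x of the proof)."""
    code = []          # list of (x_i, Gamma_i)
    claimed = set()    # outputs already assigned to earlier decoding sets
    for x in inputs:
        gamma = G[x] - claimed
        prob = sum(nu[x].get(y, 0.0) for y in gamma)
        if prob > 1 - eps:
            code.append((x, gamma))
            claimed |= gamma
            if len(code) == M:
                break
    return code

# Toy usage: a noiseless binary channel with G[x] = {x} and eps = 0.1; the
# greedy pass finds both codewords with disjoint decoding sets.
nu = {0: {0: 1.0}, 1: {1: 1.0}}
G = {0: {0}, 1: {1}}
code = feinstein_select([0, 1], nu, G, M=2, eps=0.1)
assert len(code) == 2
```

The lemma's content is precisely that this greedy pass cannot stall before M points when a = ln M plus a margin, which is what the counting argument below establishes.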
Suppose that the selection has terminated at n < M and define the set F = ⋃_{i=1}^n Γ_{x_i} ∈ B_B. Consider the probability

P_{XY}(G) = P_{XY}(G ∩ (A × F)) + P_{XY}(G ∩ (A × F^c)). (12.4)

The first term can be bounded above as

P_{XY}(G ∩ (A × F)) ≤ P_{XY}(A × F) = P_Y(F) = Σ_{i=1}^n P_Y(Γ_{x_i}).

We also have from the definitions and from (12.2) that

P_Y(Γ_{x_i}) = ∫_{Γ_{x_i}} dP_Y(y) ≤ ∫_{G_{x_i}} dP_Y(y) ≤ ∫_{G_{x_i}} dP_Y(y) f(x_i, y)/e^a ≤ e^{−a} ∫ dP_Y(y) f(x_i, y) ≤ e^{−a}

and hence

P_{XY}(G ∩ (A × F)) ≤ n e^{−a}. (12.5)

Consider the second term of (12.4):
P_{XY}(G ∩ (A × F^c)) = ∫ dP_X(x) ν((G ∩ (A × F^c))_x|x) (12.6)
= ∫ dP_X(x) ν(G_x ∩ F^c|x) (12.7)
= ∫ dP_X(x) ν(G_x − ⋃_{i=1}^n Γ_{x_i}|x). (12.8)

We must have, however, that

ν(G_x − ⋃_{i=1}^n Γ_{x_i}|x) ≤ 1 − ε

with P_X probability 1, or there would be a point x_{n+1} for which

ν(G_{x_{n+1}} − ⋃_{i=1}^n Γ_{x_i}|x_{n+1}) > 1 − ε,

that is, (12.3) would hold for i = n + 1, contradicting the definition of n as the largest integer for which (12.3) holds. Applying this observation to (12.8) yields

P_{XY}(G ∩ (A × F^c)) ≤ 1 − ε,

which with (12.4) and (12.5) implies that
P_{XY}(G) ≤ n e^{−a} + 1 − ε. (12.9)

From the definition of ε, however, we have also that

P_{XY}(G) = 1 − P_{XY}(G^c) = 1 − ε + M e^{−a},

which with (12.9) implies that M ≤ n, completing the proof. □

12.3 Feinstein's Theorem

Given a channel [A, ν, B], an (M, n, ε) block channel code for ν is a collection
{w_i, Γ_i}; i = 1, 2, . . . , M, where w_i ∈ A^n, Γ_i ∈ B_B^n, all i, with the property that

max_{i=1,...,M} sup_{x∈c(w_i)} ν_x^n(Γ_i^c) ≤ ε, (12.10)

where c(a^n) = {x : x^n = a^n} and where ν_x^n is the restriction of ν_x to B_B^n.
The rate of the code is defined as n^{−1} log M. Thus an (M, n, ε) channel code
is a collection of M input ntuples and corresponding output cells such that
regardless of the past or future inputs, if the input during time 1 to n is a
channel codeword, then the output during time 1 to n is very likely to lie in
the corresponding output cell. Channel codes will be useful in a communication
system because they permit nearly error-free communication of a select group
of messages or codewords. A communication system can then be constructed
for communicating a source over the channel reliably by mapping source blocks
into channel codewords. If there are enough channel codewords to assign to all
of the source blocks (at least the most probable ones), then that source can
be reliably reproduced by the receiver. Hence a fundamental issue for such an
application will be the number of messages M or, equivalently, the rate R of a
channel code.
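For a channel without input memory or anticipation the supremum over pasts and futures in (12.10) is vacuous and the error probability can be computed by direct enumeration. A minimal sketch, assuming a binary symmetric channel and a length-3 repetition code with majority-vote decoding (both purely illustrative choices):

```python
import itertools
import math

eps = 0.1  # illustrative crossover probability

def nu_n(y, w):
    """Probability of output n-tuple y given input n-tuple w for a
    memoryless binary symmetric channel (no input memory/anticipation,
    so nu_x^n depends only on the n input symbols)."""
    return math.prod(1 - eps if yi == wi else eps for yi, wi in zip(y, w))

codewords = [(0, 0, 0), (1, 1, 1)]
decode = lambda y: int(sum(y) >= 2)  # Gamma_i = {y : majority vote = i}

# The quantity (12.10): max_i nu^n(Gamma_i^c | w_i).
err = max(
    sum(nu_n(y, w) for y in itertools.product((0, 1), repeat=3)
        if decode(y) != i)
    for i, w in enumerate(codewords)
)

# The repetition code fails exactly when 2 or 3 of the symbols flip.
expected = 3 * eps**2 * (1 - eps) + eps**3
assert abs(err - expected) < 1e-12
```

So this is a (2, 3, 0.028) channel code in the sense above, with rate (1/3) log 2.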
Feinstein's lemma can be applied fairly easily to obtain something that resembles a coding theorem for a noisy channel. Suppose that [A, ν, B] is a channel and [A, µ] is a source and that [A × B, p = µν] is the resulting hookup. Denote the resulting pair process by {X_n, Y_n}. For any integer K let p^K denote the restriction of p to (A^K × B^K, B_A^K × B_B^K), that is, the distribution on input/output K-tuples (X^K, Y^K). The joint distribution p^K together with the input distribution µ^K induce a regular conditional probability ν̂^K defined by ν̂^K(F|x^K) = Pr(Y^K ∈ F|X^K = x^K). In particular,

ν̂^K(G|a^K) = Pr(Y^K ∈ G|X^K = a^K) = (1/µ^K(a^K)) ∫_{c(a^K)} ν_x^K(G) dµ(x), (12.11)

where c(a^K) = {x : x^K = a^K} is the rectangle of all input sequences with a common K-dimensional prefix a^K. We call ν̂^K the induced K-dimensional channel
of the channel ν and the source µ. It is important to note that the induced channel depends on the source as well as on the channel, a fact that will cause
some diﬃculty in applying Feinstein’s lemma. An exception to this case which
proves to be an easy application is that of a channel without input memory and
anticipation, in which case we have from the deﬁnitions that
ν̂^K(F|a^K) = ν_x(Y^K ∈ F); x ∈ c(a^K).
Application of Feinstein's lemma to the induced channel yields the following result, which was proved by Feinstein for stationary finite-alphabet channels and is known as Feinstein's theorem:

Lemma 12.3.1 Suppose that [A × B, µν] is an AMS and ergodic hookup of a source µ and channel ν. Let Ī_{µν} = Ī_{µν}(X; Y) denote the average mutual information rate and assume that Ī_{µν} = I*_{µν} is finite (as is the case if the alphabets are finite (Theorem 6.4.1) or have the finite-gap information property (Theorem 6.4.3)). Then for any R < Ī_{µν} and any ε > 0 there exists for sufficiently large n a code {w_i, Γ_i; i = 1, 2, . . . , M}, where M = ⌊e^{nR}⌋, w_i ∈ A^n, and Γ_i ∈ B_B^n, with the property that

ν̂^n(Γ_i^c|w_i) ≤ ε, i = 1, 2, . . . , M. (12.12)

Comment: We shall call a code {w_i, Γ_i; i = 1, 2, . . . , M} which satisfies (12.12) for a channel input process µ a (µ, M, n, ε)-Feinstein code. The quantity n^{−1} log M
Proof: Let η denote the output distribution induced by µ and ν. Define the information density

i_n = ln (dp^n/d(µ^n × η^n))

and define

δ = (Ī_{µν} − R)/2 > 0.

Apply Feinstein's lemma to the n-dimensional hookup (µν)^n with M = ⌊e^{nR}⌋ and a = n(R + δ) to obtain a code {w_i, Γ_i}; i = 1, 2, . . . , M with

max_i ν̂^n(Γ_i^c|w_i) ≤ M e^{−n(R+δ)} + p^n(i_n ≤ n(R + δ)) ≤ e^{nR} e^{−n(R+δ)} + p((1/n) i_n(X^n; Y^n) ≤ R + δ) (12.13)

and hence

max_i ν̂^n(Γ_i^c|w_i) ≤ e^{−nδ} + p((1/n) i_n(X^n; Y^n) ≤ Ī_{µν} − δ). (12.14)

From Theorem 6.3.1, n^{−1} i_n converges in L^1 to Ī_{µν} and hence it also converges in probability. Thus given ε we can choose n large enough to ensure that the right-hand side of (12.14) is smaller than ε, which completes the proof of the theorem. □

We said that the lemma "resembled" a coding theorem because a real coding theorem would prove the existence of an (M, n, ε) channel code; that is, it would concern the channel ν itself and not the induced channel ν̂, which depends on a channel input process distribution µ. The difference between a Feinstein code and a channel code is that the Feinstein code has a similar property for an induced channel which in general depends on a source distribution, while the channel code has this property independent of any source distribution and for any past or future inputs.
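The dependence of the induced channel on the input source can be seen in a tiny hypothetical example: a deterministic channel with one symbol of input memory, Y_n = X_{n−1}, for which the average in (12.11) with K = 1 runs over the unseen past input. All the numbers below are illustrative:

```python
# For the channel Y_n = X_{n-1} and K = 1, the induced channel is
# nu-hat(y | a) = Pr(X_{-1} = y | X_0 = a), which depends entirely on the
# joint law of (X_{-1}, X_0) -- i.e., on the source, not the channel alone.

def induced(joint):
    """joint[(x_prev, x0)] -> probability; return nu-hat(y=1 | a=1)."""
    p_a = joint[(0, 1)] + joint[(1, 1)]
    return joint[(1, 1)] / p_a

# Source 1: (X_{-1}, X_0) independent and uniform.
iid = {(0, 0): .25, (0, 1): .25, (1, 0): .25, (1, 1): .25}
# Source 2: a "sticky" pair with Pr(X_{-1} = X_0) = 0.9.
sticky = {(0, 0): .45, (0, 1): .05, (1, 0): .05, (1, 1): .45}

assert induced(iid) == 0.5              # nu-hat(1|1) = 1/2
assert abs(induced(sticky) - 0.9) < 1e-9  # nu-hat(1|1) = 9/10
```

Two different input processes thus yield two different induced one-dimensional channels for the *same* channel, which is exactly the difficulty flagged above.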
Feinstein codes will be used to construct block codes for noisy channels. The
simplest such construction is presented next.
Corollary 12.3.1 Suppose that a channel [A, ν, B] is input memoryless and input nonanticipatory (see Section 9.4). Then a (µ, M, n, ε)-Feinstein code for some channel input process µ is also an (M, n, ε) code.

Proof: Immediate since for a channel without input memory and anticipation we have that ν_x^n(F) = ν_u^n(F) if x^n = u^n. □
The principal idea of constructing channel codes from Feinstein codes for more general channels will be to place assumptions on the channel which ensure that for sufficiently large n the channel distribution ν_x^n and the induced finite-dimensional channel ν̂^n(·|x^n) are close. This general idea was proposed by McMillan [104], who suggested that coding theorems would follow for channels that were sufficiently continuous in a suitable sense.
The previous results did not require stationarity of the channel, but in a
sense stationarity is implicit if the channel codes are to be used repeatedly (as
they will be in a communication system). Thus the immediate applications of the Feinstein results will be to stationary channels.
The following is a rephrasing of Feinstein’s theorem that will be useful.
Corollary 12.3.2 Suppose that [A × B, µν] is an AMS and ergodic hookup of a source µ and channel ν. Let Ī_{µν} = Ī_{µν}(X; Y) denote the average mutual information rate and assume that Ī_{µν} = I*_{µν} is finite. Then for any R < Ī_{µν} and any ε > 0 there exists an n_0 such that for all n ≥ n_0 there are (µ, ⌊e^{nR}⌋, n, ε)-Feinstein codes.
As a final result of the Feinstein variety, we point out a variation that applies to nonergodic channels.

Corollary 12.3.3 Suppose that [A × B, µν] is an AMS hookup of a source µ and channel ν. Suppose also that the information density converges a.e. to a limiting density

i_∞ = lim_{n→∞} (1/n) i_n(X^n; Y^n).

(Conditions for this to hold are given in Theorem 8.5.1.) Then given ε > 0 and δ > 0 there exists for sufficiently large n a (µ, M, n, ε + µν(i_∞ ≤ R + δ))-Feinstein code with M = ⌊e^{nR}⌋.

Proof: Follows from the lemma and from Fatou's lemma, which implies that

lim sup_{n→∞} p((1/n) i_n(X^n; Y^n) ≤ a) ≤ p(i_∞ ≤ a). □

12.4 Channel Capacity

The form of the Feinstein lemma and its corollaries invites the question of how
large R (and hence M) can be made while still getting a code of the desired form. From Feinstein's theorem it is seen that for an ergodic channel R can be any number less than Ī_{µν}, which suggests that if we define the quantity

C_{AMS,e} = sup_{AMS and ergodic µ} Ī_{µν}, (12.15)

then if Ī_{µν} = I*_{µν} (e.g., the channel has finite alphabet), we can construct for some µ a Feinstein code for µ with rate R arbitrarily near C_{AMS,e}. C_{AMS,e} is
an example of a quantity called an information rate capacity or, simply, capacity
of a channel. We shall encounter a few variations on this deﬁnition just as there
were various ways of deﬁning distortionrate functions for sources by considering
either vectors or processes with diﬀerent constraints. In this section a few of
these deﬁnitions are introduced and compared.
A few possible deﬁnitions of information rate capacity are ¯
C_{AMS} = sup_{AMS µ} Ī_{µν}, (12.16)

C_s = sup_{stationary µ} Ī_{µν}, (12.17)

C_{s,e} = sup_{stationary and ergodic µ} Ī_{µν}, (12.18)

C_{ns} = sup_{n-stationary µ} Ī_{µν}, (12.19)

C_{bs} = sup_{block stationary µ} Ī_{µν} = sup_n sup_{n-stationary µ} Ī_{µν}. (12.20)

Several inequalities are obvious from the definitions:

C_{AMS} ≥ C_{bs} ≥ C_{ns} ≥ C_s ≥ C_{s,e}, (12.21)
C_{AMS} ≥ C_{AMS,e} ≥ C_{s,e}. (12.22)

In order to relate these definitions we need a variation on Lemma 12.3.1 described in the following lemma.
Lemma 12.4.1 Given a stationary finite-alphabet channel [A, ν, B], let µ be the distribution of a stationary channel input process and let {µ_x} be its ergodic decomposition. Then

Ī_{µν} = ∫ dµ(x) Ī_{µ_x ν}. (12.23)

Proof: We can write

Ī_{µν} = h_1(µ) − h_2(µ),

where

h_1(µ) = H̄_η(Y) = inf_n (1/n) H_η(Y^n)

is the entropy rate of the output, where η is the output measure induced by µ and ν, and where

h_2(µ) = H̄_{µν}(Y|X) = lim_{n→∞} (1/n) H_{µν}(Y^n|X^n)

is the conditional entropy rate of the output given the input. If µ_k → µ on any finite-dimensional rectangle, then also η_k → η and hence

H_{η_k}(Y^n) → H_η(Y^n),

and hence it follows as in the proof of Corollary 2.4.1 that h_1(µ) is an upper semicontinuous function of µ. It is also affine because H̄_η(Y) is an affine function of η (Lemma 2.4.2), which is in turn a linear function of µ. Thus from Theorem 8.9.1 of [50]

h_1(µ) = ∫ dµ(x) h_1(µ_x).

h_2(µ) is also affine in µ since h_1(µ) is affine in µ and Ī_{µν} is affine in µ (since it is affine in µν from Lemma 6.2.2). Hence we will be done if we can show that h_2(µ) is upper semicontinuous in µ, since then Theorem 8.9.1 of [50] will imply that

h_2(µ) = ∫ dµ(x) h_2(µ_x),

which with the corresponding result for h_1 proves the lemma. To see this observe that if µ_k → µ on finite-dimensional rectangles, then

H_{µ_k ν}(Y^n|X^n) → H_{µν}(Y^n|X^n). (12.24)

Next observe that for stationary processes

H(Y^n|X^n) ≤ H(Y^m|X^n) + H(Y_m^{n−m}|X^n)
≤ H(Y^m|X^m) + H(Y_m^{n−m}|X_m^{n−m}) = H(Y^m|X^m) + H(Y^{n−m}|X^{n−m}),

which as in Section 2.4 implies that H(Y^n|X^n) is a subadditive sequence and hence

lim_{n→∞} (1/n) H(Y^n|X^n) = inf_n (1/n) H(Y^n|X^n).

Coupling this with (12.24) proves upper semicontinuity exactly as in the proof of Corollary 2.4.1, which completes the proof of the lemma. □

Lemma 12.4.2 If a channel ν has a finite alphabet and is stationary, then all
of the above information rate capacities are equal.

Proof: From Theorem 6.4.1, Ī = I* for finite-alphabet processes and hence from Lemma 6.2.2 and Lemma 9.3.2 we have that if µ is AMS with stationary mean µ̄, then

Ī_{µν} = I*_{µν} = Ī_{µ̄ν},

and thus the supremum over AMS sources must be the same as that over stationary sources. The fact that C_s ≤ C_{s,e} follows immediately from the previous lemma, since the best stationary source can do no better than to put all of its measure on the ergodic component yielding the maximum information rate. Combining these facts with (12.21)–(12.22) proves the lemma. □
Because of the equivalence of the various forms of information rate capacity
for stationary channels, we shall use the symbol C to represent the information
rate capacity of a stationary channel and observe that it can be considered as
the solution to any of the above maximization problems.
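For a stationary memoryless channel this maximization reduces to a single-letter optimization that is numerically tractable. A minimal sketch of the classical Blahut–Arimoto iteration follows; the routine and the binary-symmetric test channel are illustrative and not part of the text (natural logarithms, so the answer is in nats):

```python
import math

def blahut_arimoto(nu, iters=200):
    """Sketch of the Blahut-Arimoto iteration for a discrete memoryless
    channel nu[x][y]; returns an estimate of C = sup_p I(X; Y) in nats."""
    nx, ny = len(nu), len(nu[0])
    p = [1.0 / nx] * nx  # current input distribution
    for _ in range(iters):
        # Induced output distribution and per-input divergences
        # d[x] = D(nu(.|x) || q).
        q = [sum(p[x] * nu[x][y] for x in range(nx)) for y in range(ny)]
        d = [sum(nu[x][y] * math.log(nu[x][y] / q[y])
                 for y in range(ny) if nu[x][y] > 0) for x in range(nx)]
        # Reweight inputs toward those with large divergence.
        w = [p[x] * math.exp(d[x]) for x in range(nx)]
        s = sum(w)
        p = [wx / s for wx in w]
    return sum(p[x] * d[x] for x in range(nx))

# Binary symmetric channel, crossover 0.1: capacity is ln 2 - h2(0.1).
eps = 0.1
bsc = [[1 - eps, eps], [eps, 1 - eps]]
h2 = -eps * math.log(eps) - (1 - eps) * math.log(1 - eps)
assert abs(blahut_arimoto(bsc) - (math.log(2) - h2)) < 1e-6
```

For symmetric channels the uniform input is already optimal and the iteration is stationary at the first step; for general channels the iterates increase monotonically to the supremum.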
Shannon's original definition of channel capacity applied to channels without input memory or anticipation. We pause to relate this definition to the process definitions. Suppose that a channel [A, ν, B] has no input memory or anticipation and hence for each n there are regular conditional probability measures ν̂^n(G|x^n); x^n ∈ A^n, G ∈ B_B^n, such that

ν_x^n(G) = ν̂^n(G|x^n).

Define the finite-dimensional capacity of the ν̂^n by

C^n(ν̂^n) = sup_{µ^n} I_{µ^n ν̂^n}(X^n; Y^n),

where the supremum is over all vector distributions µ^n on A^n. Define the Shannon capacity of the channel ν by

C_Shannon = lim_{n→∞} (1/n) C^n(ν̂^n)

if the limit exists. Suppose that the Shannon capacity exists for a channel ν without memory or anticipation. Choose N large enough so that N^{−1}C^N is very close to C_Shannon and let µ^N approximately yield C^N. Then construct a block
memoryless source using µ^N. A block memoryless source is AMS and hence if the channel is AMS we must have an information rate

Ī_{µν}(X; Y) = lim_{n→∞} (1/n) I_{µν}(X^n; Y^n) = lim_{k→∞} (1/kN) I_{µν}(X^{kN}; Y^{kN}).

Since the input process is block memoryless, we have from Lemma 9.4.2 that

I(X^{kN}; Y^{kN}) ≥ Σ_{i=0}^{k−1} I(X_{iN}^N; Y_{iN}^N).

If the channel is stationary then {X_n, Y_n} is N-stationary and hence if

(1/N) I_{µ^N ν̂^N}(X^N; Y^N) ≥ C_Shannon − ε,

then

(1/kN) I(X^{kN}; Y^{kN}) ≥ C_Shannon − ε.

Taking the limit as k → ∞ we have that

C_{AMS} = C ≥ Ī(X; Y) = lim_{k→∞} (1/kN) I(X^{kN}; Y^{kN}) ≥ C_Shannon − ε

and hence C ≥ C_Shannon.
Conversely, pick a stationary source µ which nearly yields C = C_s, that is,

Ī_{µν} ≥ C_s − ε.

Choose n_0 sufficiently large to ensure that

(1/n) I_{µν}(X^n; Y^n) ≥ Ī_{µν} − ε ≥ C_s − 2ε.

This implies, however, that for n ≥ n_0

(1/n) C^n ≥ C_s − 2ε,

and hence application of the previous lemma proves the following lemma.

Lemma 12.4.3 Given a finite-alphabet stationary channel ν with no input memory or anticipation,

C = C_{AMS} = C_s = C_{s,e} = C_Shannon.

The Shannon capacity is of interest because it can be numerically computed
while the process deﬁnitions are not always amenable to such computation.
With Corollary 12.3.2 and the deﬁnition of channel capacity we have the
following result.
Lemma 12.4.4 If ν is an AMS and ergodic channel and R < C, then there is an n_0 sufficiently large to ensure that for all n ≥ n_0 there exist (µ, ⌊e^{nR}⌋, n, ε) Feinstein codes for some channel input process µ.

Corollary 12.4.1 Suppose that [A, ν, B] is an AMS and ergodic channel with no input memory or anticipation. If R < C, the information rate capacity or Shannon capacity, then for ε > 0 there exists for sufficiently large n a (⌊e^{nR}⌋, n, ε) channel code.

Proof: Follows immediately from Corollary 12.3.3 by choosing a stationary and ergodic source µ with Ī_{µν} ∈ (R, C). □

There is another, quite different, notion of channel capacity that we introduce for comparison and to aid the discussion of nonergodic stationary channels. Define for an AMS channel ν and any λ ∈ (0, 1) the quantile

C*(λ) = sup_{AMS µ} sup{r : µν(i_∞ ≤ r) < λ},

where the supremum is over all AMS channel input processes and i_∞ is the limiting information density (which exists because µν is AMS and has finite alphabet). Define the information quantile capacity C* by

C* = lim_{λ→0} C*(λ).

The limit is well defined since the C*(λ) are bounded and nonincreasing. The information quantile capacity was introduced by Winkelbauer [151] and its properties were developed by him and by Kieffer [76]. Fix an R < C* and define δ = (C* − R)/2. Given ε > 0 we can find from the definition of C* an AMS channel input process µ for which µν(i_∞ ≤ R + δ) ≤ ε. Applying Corollary 12.3.3 with this δ and ε/2 then yields the following result for nonergodic channels.

Lemma 12.4.5 If ν is an AMS channel and R < C*, then there is an n_0 sufficiently large to ensure that for all n ≥ n_0 there exist (µ, ⌊e^{nR}⌋, n, ε) Feinstein codes for some channel input process µ.
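The quantile appearing in the definition of C*(λ) can be explored by simulation: for a memoryless channel with i.i.d. input, n^{−1}i_n is an average of i.i.d. per-letter densities and concentrates at the information rate. The binary symmetric channel and all parameters below are illustrative assumptions:

```python
import math
import random

random.seed(0)
eps, n, trials = 0.1, 200, 500

# Per-letter information density for a BSC with uniform input:
# i(x, y) = ln(2(1 - eps)) when y = x, ln(2 eps) when y != x.
dens = {True: math.log(2 * (1 - eps)), False: math.log(2 * eps)}

# Monte Carlo samples of n^{-1} i_n(X^n; Y^n).
samples = []
for _ in range(trials):
    s = sum(dens[random.random() > eps] for _ in range(n)) / n
    samples.append(s)
samples.sort()

lam = 0.05
quantile = samples[int(lam * trials)]  # empirical sup{r : Pr(i <= r) < lam}
rate = math.log(2) - (-eps * math.log(eps) - (1 - eps) * math.log(1 - eps))

# The sample mean is near the information rate, and the lambda-quantile
# sits strictly below it, shrinking toward it as n grows.
assert abs(sum(samples) / trials - rate) < 0.05
assert quantile < rate
```

For an ergodic pair process the distribution of n^{−1}i_n collapses onto the single point Ī_{µν}, which is the heuristic behind the equality C = C* noted at the end of the section.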
We close this section by relating C and C ∗ for AMS channels.
Lemma 12.4.6 Given an AMS channel ν,

C ≥ C*.

Proof: Fix λ > 0. If r < C*(λ) there is a µ such that λ > µν(i_∞ ≤ r) = 1 − µν(i_∞ > r) ≥ 1 − Ī_{µν}/r, where we have used the Markov inequality. Thus for all r < C*(λ) we have that Ī_{µν} ≥ r(1 − µν(i_∞ ≤ r)) and hence

C ≥ Ī_{µν} ≥ C*(λ)(1 − λ) →_{λ→0} C*. □

It can be shown that if a stationary channel is also ergodic, then C = C* by using the ergodic decomposition to show that the supremum defining C*(λ) can be taken over ergodic sources and then using the fact that for ergodic µ and ν, i_∞ equals Ī_{µν} with probability one. (See Kieffer [76].)

12.5 Robust Block Codes

Feinstein codes immediately yield channel codes when the channel has no input memory or anticipation because the induced vector channel is the same with respect to vectors as the original channel. When extending this technique
to channels with memory and anticipation we will try to ensure that the induced channels are still reasonable approximations to the original channel, but
the approximations will not be exact and hence the conditional distributions
considered in the Feinstein construction will not be the same as the channel
conditional distributions. In other words, the Feinstein construction guarantees
a code that works well for a conditional distribution formed by averaging the
channel over its past and future using a channel input distribution that approximately yields channel capacity. This does not in general imply that the code
will also work well when used on the unaveraged channel with a particular past
and future input sequence. We solve this problem by considering channels for
which the two distributions are close if the block length is long enough.
In order to use the Feinstein construction for one distribution on an actual
channel, we will modify the block codes slightly so as to make them robust in
the sense that if they are used on channels with slightly diﬀerent conditional
distributions, their performance as measured by probability of error does not
change much. In this section we prove that this can be done. The basic technique
is due to Dobrushin [33] and a similar technique was studied by Ahlswede and
G´cs [4]. (See also Ahlswede and Wolfowitz [5].) The results of this section are
a
due to Gray, Ornstein, and Dobrushin [59].
A channel block length n code {wi , Γi ; i = 1, 2, . . . , M will be called δ robust (in the Hamming distance sense) if the decoding sets Γi are such that the
expanded sets
1
(Γi )δ ≡ {y n : dn (y n , Γi ) ≤ δ }
n
are disjoint, where
dn (y n , Γi ) = min dn (y n , un )
n
u ∈Γi and n−1 dn (y n , un ) = dH (yi , ui )
i=0 and dH (a, b) is the Hamming distance (1 if a = b and 0 if a = b). Thus the code
is δ robust if received ntuples in a decoding set can be changed by an average
Hamming distance of up to δ without falling in a diﬀerent decoding set. We
show that by reducing the rate of a code slightly we can always make a Feinstein
code robust.
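The expanded sets and the disjointness condition can be checked by brute force for small n; a sketch with an illustrative binary alphabet and repetition-style decoding sets:

```python
import itertools

def d_n(y, u):
    """Unnormalized Hamming distance between two n-tuples."""
    return sum(a != b for a, b in zip(y, u))

def expand(gamma, delta, n, alphabet=(0, 1)):
    """(Gamma)_delta = {y^n : min_{u in Gamma} d_n(y, u)/n <= delta}."""
    return {y for y in itertools.product(alphabet, repeat=n)
            if min(d_n(y, u) for u in gamma) <= delta * n}

def is_delta_robust(decoding_sets, delta, n):
    """A code is delta-robust when all expanded decoding sets are disjoint."""
    exp = [expand(g, delta, n) for g in decoding_sets]
    return all(exp[i].isdisjoint(exp[j])
               for i in range(len(exp)) for j in range(i + 1, len(exp)))

# Length-4 repetition-style decoding sets around 0000 and 1111.
g0, g1 = {(0, 0, 0, 0)}, {(1, 1, 1, 1)}
assert is_delta_robust([g0, g1], 1 / 4, 4)      # radius-1 balls are disjoint
assert not is_delta_robust([g0, g1], 2 / 4, 4)  # radius-2 balls overlap
```

This is exactly the geometric margin the construction below buys at a small cost in rate: decoding by the expanded sets then tolerates an average Hamming perturbation of δ per symbol.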
Lemma 12.5.1 Let {w_i, Γ_i; i = 1, 2, . . . , M} be a (µ, ⌊e^{nR}⌋, n, ε)-Feinstein code for a channel ν. Given δ ∈ (0, 1/4) and

R′ < R − h_2(2δ) − 2δ log(|B| − 1),

where as before h_2(a) is the binary entropy function −a log a − (1 − a) log(1 − a), there exists a δ-robust (µ, ⌊e^{nR′}⌋, n, ε_n)-Feinstein code for ν with

ε_n ≤ ε + e^{−n(R − R′ − h_2(2δ) − 2δ log(|B|−1) − 3/n)}.

Proof: For i = 1, 2, . . . , M let r_i(y^n) denote the indicator function of (Γ_i)_{2δ}.
For a fixed y^n there can be at most

Σ_{i=0}^{⌊2δn⌋} C(n, i) (|B| − 1)^i = |B|^n Σ_{i=0}^{⌊2δn⌋} C(n, i) (1 − 1/|B|)^i (1/|B|)^{n−i}

n-tuples b^n ∈ B^n such that n^{−1} d_n(y^n, b^n) ≤ 2δ, where C(n, i) denotes the binomial coefficient. Set p = 1 − 1/|B| and apply Lemma 2.3.5 to the sum to obtain the bound

Σ_{i=0}^{⌊2δn⌋} C(n, i) p^i (1 − p)^{n−i} ≤ e^{−n h_2(2δ‖p)},

where

h_2(2δ‖p) = 2δ ln(2δ/p) + (1 − 2δ) ln((1 − 2δ)/(1 − p))
= −h_2(2δ) + 2δ ln(|B|/(|B| − 1)) + (1 − 2δ) ln |B|
= −h_2(2δ) + ln |B| − 2δ ln(|B| − 1).

Combining this bound with the fact that the Γ_i are disjoint we have that

Σ_{i=1}^M r_i(y^n) ≤ Σ_{i=0}^{⌊2δn⌋} C(n, i) (|B| − 1)^i ≤ e^{n(h_2(2δ) + 2δ ln(|B| − 1))}.
random equally likely independent selection without replacement so that each
index pair (kj , km ); j, m = 1, . . . , 2M ; j = m, assumes any unequal pair with
probability (M (M − 1))−1 . We then have that E = ≤
≤
≤ 1
2M 2M 2M ν (Γkj
ˆ (Γkm )2δ wkj ) j =1 m=1,m=j 1
2M
1
2M 2M 2M M M j =1 m=1,m=j k=1 i=1,i=k
2M 2M M j =1 m=1,m=j k=1 1
M (M − 1) 1
M (M − 1) ν (y n wk )ri (y n )
ˆ
y n ∈Γk
M ν (y n wk )
ˆ
y n ∈Γk 2M n(h2 (2δ)+2δ log(B −1)
e
M −1
4e−n(R −R−h2 (2δ)−2δ log(B −1) ≡ λn , ri (y n )
i=1,i=k 12.5. ROBUST BLOCK CODES 261 where we have assumed that M ≥ 2 so that M − 1 ≥ M /2. Analogous to
a random coding argument, since the above expectation is less than λn , there
must exist a ﬁxed collection of subscripts i1 , · · · , i2M such that
1
2M 2M 2M (Γim )2δ wi j ) ≤ λn . ν (Γij
ˆ
j =1 m=1,m=j Since no more than half of the above indices can exceed twice the expected
value, there must exist indices k1 , · · · , kM ∈ {j1 , · · · , j2M } for which
M (Γkm )2δ wkj ) ≤ 2λn ; i = 1, 2, . . . , M. ν (Γkj
ˆ
m=1,m=j Deﬁne the code {wi , Γi ; i = 1, . . . , M } by wi = wki and
M Γi = Γki − (Γkm )2δ .
m=1,m=i The (Γi )δ are obviously disjoint since we have removed from Γki all words within
2δ of a word in any other decoding set. Furthermore, we have for all i =
1, 2, . . . , M that
1− ≤
= ν (Γki wki )
ˆ ν (Γki
ˆ (Γkm )2δ wki ) + ν (Γki
ˆ m=i ≤ ν (Γki
ˆ c (Γkm )2δ wki ) m=i (Γkm )2δ wki ) + ν (Γi wi )
ˆ m=i < 2λn + ν (Γi wi )
ˆ and hence
ν (Γi wi ) ≥ 1 − − 8e−n(R −R−h2 (2δ)−2δ log(B −1) ,
ˆ
2 which proves the lemma. Corollary 12.5.1 Let ν be a stationary channel and let Cn be a sequence of
(µ_n, ⌊e^{nR}⌋, n, ε/2) Feinstein codes for n ≥ n_0. Given an R′ > 0 and δ > 0 such that R′ < R − h_2(2δ) − 2δ log(|B| − 1), there exists for n_1 sufficiently large a sequence C′_n; n ≥ n_1, of δ-robust (µ_n, ⌊e^{nR′}⌋, n, ε) Feinstein codes.

Proof: The corollary follows from the lemma by choosing n_1 so that

e^{−n_1(R − R′ − h_2(2δ) − 2δ ln(|B|−1) − 3/n_1)} ≤ ε/2. □

Note that the sources may be different for each n and that n_1 does not depend on the channel input measure.

12.6 Block Coding Theorems for Noisy Channels

Suppose now that ν is a stationary finite-alphabet d̄-continuous channel. Suppose also that for n ≥ n_1 we have a sequence of δ-robust (µ_n, ⌊e^{nR}⌋, n, ε) Feinstein codes {w_i, Γ_i} as in the previous section. We now quantify the performance of these codes when used as channel block codes, that is, used on the actual channel ν instead of on an induced channel. As previously, let ν̂^n be the n-dimensional channel induced by µ_n and the channel ν, that is, for µ_n(a^n) > 0

ν̂^n(G|a^n) = Pr(Y^n ∈ G|X^n = a^n) = (1/µ_n(a^n)) ∫_{c(a^n)} ν_x^n(G) dµ(x), (12.25)

where c(a^n) is the rectangle {x : x ∈ A^T; x^n = a^n}, a^n ∈ A^n, and where G ∈ B_B^n. We have for the Feinstein codes that

max_i ν̂^n(Γ_i^c|w_i) ≤ ε.

We use the same codewords w_i for the channel code, but we now use the expanded regions (Γ_i)_δ for the decoding regions. Since the Feinstein codes were
¯
dcontinuous we can choose an n large enough to ensure that if xn = xn , then
¯
¯nn
dn (νx , νx ) ≤ δ 2 .
¯
Suppose that we have a Feinstein code such that for the induced channel
ν (Γi wi ) ≥ 1 − .
ˆ
Then if the conditions of Lemma 10.5.1 are met and µn is the channel input
source of the Feinstein code, then
ν n (Γi wi )
ˆ =
≤ 1 n
νx (Γi ) dµ(x)
c(wi )
n
νx ((Γi )δ ) + δ µn (wi )
n
inf x∈c(wi ) n
≤ sup νx (Γi )
x∈c(wi ) and hence
inf x∈c(wi ) n
νx ((Γi )δ ) ≥ ν n (Γi wi ) − δ ≥ 1 − − δ.
ˆ Thus if the channel block code is constructed using the expanded decoding sets,
we have that
max sup νx ((Γi )c ) ≤ + δ ;
δ
i x∈c(wi ) that is, the code {wi , (Γi )δ } is a ( enR , n, + δ ) channel code. We have now
proved the following result. 12.7. JOINT SOURCE AND CHANNEL BLOCK CODES 263 ¯
Lemma 12.6.1 Let ν be a stationary d̄-continuous channel and C_n; n ≥ n_0, a sequence of δ-robust (µ_n, ⌊e^{nR}⌋, n, ε) Feinstein codes. Then for n_1 sufficiently large and each n ≥ n_1 there exists a (⌊e^{nR}⌋, n, ε + δ) block channel code.

Combining the lemma with Lemma 12.4.4 and Lemma 12.4.5 yields the following theorem.

Theorem 12.6.1 Let ν be an AMS ergodic d̄-continuous channel. If R < C then given ε > 0 there is an n_0 such that for all n ≥ n_0 there exist (⌊e^{nR}⌋, n, ε) channel codes. If the channel is not ergodic, then the same holds true if C is replaced by C*.

Up to this point the channel coding theorems have been "one shot" theorems in that they consider only a single use of the channel. In a communication system, however, a channel will be used repeatedly in order to communicate a sequence of outputs from a source.

12.7 Joint Source and Channel Block Codes

We can now combine a source block code and a channel block code of comparable rates to obtain a block code for communicating a source over a noisy
channel. Suppose that we wish to communicate a source {X_n} with a distribution µ over a stationary and ergodic d̄-continuous channel [B, ν, B̂]. The channel coding theorem states that if K is chosen to be sufficiently large, then we can reliably communicate length K messages from a collection of ⌊e^{KR}⌋ messages if R < C. Suppose that R = C − ε/2. If we wish to send the given source across this channel, then instead of having a source coding rate of (K/N) log |B| bits or nats per source symbol for a source (N, K) block code, we reduce the source coding rate to slightly less than the channel coding rate R, say R_source = (K/N)(R − ε/2) = (K/N)(C − ε). We then construct a block source codebook C of this rate with performance near δ(R_source, µ). Every codeword in the source codebook is assigned a channel codeword as index. The source is encoded by selecting the minimum-distortion word in the codebook and then inserting the resulting channel codeword into the channel. The decoder then uses its decoding sets to decide which channel codeword was sent and then puts out the corresponding reproduction vector. Since the indices of the source codewords are accurately decoded by the receiver with high probability, the reproduction vector should yield performance near δ((K/N)C, µ). Since ε is arbitrary and δ(R, µ) is a continuous function of R, this implies that the OPTA for block coding µ for ν is given by δ((K/N)C, µ), that is, by the OPTA for block coding a source evaluated at the channel capacity normalized to bits or nats per source symbol. Making this argument precise yields the block joint source and channel coding theorem.
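The encoder/decoder composition just described can be sketched end to end. The scalar source codebook, repetition channel code, and squared-error distortion below are all hypothetical stand-ins for the codes of the theorem:

```python
# Joint source/channel block code sketch: the source encoder picks the
# minimum-distortion reproduction word, its index selects a channel
# codeword, and the receiver maps the decoded index back to the
# reproduction word.

source_codebook = [0.0, 1.0]                 # reproduction words (N = 1)
channel_codewords = [(0, 0, 0), (1, 1, 1)]   # K = 3 repetition code

def jscc_encode(x):
    """Minimum-distortion source encoding, then channel codeword lookup."""
    i = min(range(len(source_codebook)),
            key=lambda j: (x - source_codebook[j]) ** 2)
    return channel_codewords[i]

def jscc_decode(y):
    """Majority-vote channel decoding, then reproduction lookup."""
    i = int(sum(y) >= 2)
    return source_codebook[i]

# Over a noiseless channel the cascade reproduces the nearest codebook word.
assert jscc_decode(jscc_encode(0.2)) == 0.0
assert jscc_decode(jscc_encode(0.9)) == 1.0
```

With a noisy channel in between, the extra distortion is bounded by ρ_max times the channel decoding error probability, which is exactly how the proof of the theorem below splits the expected distortion.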
A joint source and channel (N, K) block code consists of an encoder α : A^N → B^K and decoder β : B̂^K → Â^N. It is assumed that N source time units correspond to K channel time units. The block code yields sequence coders ᾱ : A^T → B^T and β̄ : B̂^T → Â^T defined by

ᾱ(x) = {α(x_{iN}^N); all i},
β̄(y) = {β(y_{iK}^K); all i}.

Let E denote the class of all such codes (all N and K consistent with the physical stationarity requirement). Let ∆*(µ, ν, E) denote the block coding OPTA function and D(R, µ) the distortion-rate function of the source with respect to an additive fidelity criterion {ρ_n}. We assume also that ρ_n is bounded, that is, there is a finite value ρ_max such that

(1/n) ρ_n(x^n, x̂^n) ≤ ρ_max

for all n. This assumption is an unfortunate restriction, but it yields a simple proof of the basic result.
Theorem 12.7.1 Let {X_n} be a stationary source with distribution µ and let ν be a stationary and ergodic d̄-continuous channel with channel capacity C. Let {ρ_n} be a bounded additive fidelity criterion. Given ε > 0 there exists for sufficiently large N and K (where K channel time units correspond to N source time units) an encoder α : A^N → B^K and decoder β : B̂^K → Â^N such that if ᾱ : A^T → B^T and β̄ : B̂^T → Â^T are the induced sequence coders, then the resulting performance is bounded above as

∆(µ, ᾱ, ν, β̄) = Eρ_N(X^N, X̂^N) ≤ δ((K/N)C, µ) + ε.

Proof: Given ε, choose γ > 0 so that

δ((K/N)(C − γ), µ) ≤ δ((K/N)C, µ) + ε/3

and choose N large enough to ensure the existence of a source codebook C of length N and rate R_source = (K/N)(C − γ) with performance

ρ(C, µ) ≤ δ(R_source, µ) + ε/3.

We also assume that N and hence K is chosen large enough so that for a suitably small δ (to be specified later) there exists a channel (⌊e^{KR}⌋, K, δ) code, with R = C − γ/2. Index the ⌊e^{N R_source}⌋ words in the source codebook by the ⌊e^{K(C−γ/2)}⌋ channel codewords. By construction there are more indices than source codewords so that this is possible. We now evaluate the performance of this code.
Suppose that there are M words in the source codebook and hence M of the channel words are used. Let x̂_i and w_i denote corresponding source and channel codewords, that is, if x̂_i is the minimum-distortion word in the source codebook for an observed vector, then w_i is transmitted over the channel. Let Γ_i denote the corresponding decoding region. Then

Eρ_N(X^N, X̂^N) = Σ_{i=1}^M Σ_{j=1}^M ∫_{x:α(x^N)=w_i} dµ(x) ν_x^K(Γ_j) ρ_N(x^N, x̂_j)
= Σ_{i=1}^M ∫_{x:α(x^N)=w_i} dµ(x) ν_x^K(Γ_i) ρ_N(x^N, x̂_i) + Σ_{i=1}^M Σ_{j=1,j≠i}^M ∫_{x:α(x^N)=w_i} dµ(x) ν_x^K(Γ_j) ρ_N(x^N, x̂_j)
≤ Σ_{i=1}^M ∫_{x:α(x^N)=w_i} dµ(x) ρ_N(x^N, x̂_i) + Σ_{i=1}^M Σ_{j=1,j≠i}^M ∫_{x:α(x^N)=w_i} dµ(x) ν_x^K(Γ_j) ρ_N(x^N, x̂_j).

The first term is bounded above by δ(R_source, µ) + ε/3 by construction. The second is bounded above by ρ_max times the channel error probability, which is less than δ by assumption. If δ is chosen so that ρ_max δ is less than ε/3, the theorem is proved. □
Theorem 12.7.2 Let {Xn } be a stationary source source with distribution µ
and let ν be a stationary channel with channel capacity C . Let {ρn } be a
bounded additive ﬁdelity criterion. For any block stationary communication system (µ, f, ν, g ), the average performance satisﬁes
∆(µ, f, ν, g ) ≤ dµ(x)D(C, µx ),
¯
¯
x where µ is the stationary mean of µ and {µx } is the ergodic decomposition of
¯
¯
µ, C is the capacity of the channel, and D(R, µ) the distortionrate function.
¯
N
K
K
ˆN
Proof: Suppose that the process {XnN , UnK , YnK , XnN } is stationary and con¯(X ; X ). From the data processing
ˆ
sider the overall mutual information rate I
theorem (Lemma 9.4.8) K
K¯
¯
ˆ
I (X ; X ) ≤ I (U ; Y ) ≤ C.
N
N
Choose L suﬃciently large so that
1
K
ˆ
I (X n ; X n ) ≤ C +
n
N
and
Dn ( K
K
C + , µ) ≥ D( C + , µ) − δ
N
N 266 CHAPTER 12. CODING FOR NOISY CHANNELS for n ≥ L. Then if the ergodic component µx is in eﬀect, the performance can
be no better than
ˆ
Eµx ρN (X n , X N ) ≥ inf pN ∈R K
N(N C+ ,µN )
x K
ˆ
ρN (X N , X N ) ≥ DN ( C + , µx )
N which when integrated yields a lower bound of
dµ(x)D( K
C + , µx ) − δ.
N Since δ and are arbitrary, the lemma follows from the continuity of the distortion rate function.
2
Combining the previous results yields the block coding operational DRF for
¯
stationary sources and stationary and ergodic dcontinuous channels.
Corollary 12.7.1 Let {Xn } be a stationary source with distribution µ and let
¯
ν be a stationary and ergodic dcontinuous process with channel capacity C . Let
{ρn } be a bounded additive ﬁdelity criterion. The block coding operational DRF
is given by
∆∗ (µ, ν, E , D) = 12.8 dµ(x)D(C, µx ).
¯
¯ Synchronizing Block Channel Codes As in the source coding case, the ﬁrst step towards proving a sliding block coding
theorem is to show that a block code can be synchronized, that is, that the decoder can determine (at least with high probability) where the block code words
begin and end. Unlike the source coding case, this cannot be accomplished by
the use of a simple synchronization sequence which is prohibited from appearing
within a block code word since channel errors can cause the appearance of the
sync word at the receiver by accident. The basic idea still holds, however, if the
codes are designed so that it is very unlikely that a nonsync word can be con¯
verted into a valid sync word. If the channel is dcontinuous, then good robust
Feinstein codes as in Corollary 12.5.1 can be used to obtain good codebooks
. The basic result of this section is Lemma 12.8.1 which states that given a
sequence of good robust Feinstein codes, the code length can be chosen large
enough to ensure that there is a sync word for a slightly modiﬁed codebook;
that is, the synch word has length a speciﬁed fraction of the codeword length
and the sync decoding words never appear as a segment of codeword decoding words. The technique is due to Dobrushin [33] and is an application of
Shannon’s random coding technique. The lemma originated in [59].
The basic idea of the lemma is this: In addition to a good long code, one
selects a short good robust Feinstein code (from which the sync word will be
chosen) and then performs the following experiment. A word from the short
code and a word from the long code are selected independently and at random.
The probability that the short decoding word appears in the long decoding word 12.8. SYNCHRONIZING BLOCK CHANNEL CODES 267 is shown to be small. Since this average is small, there must be at least one short
word such that the probability of its decoding word appearing in the decoding
word of a randomly selected long code word is small. This in turn implies
that if all long decoding words containing the short decoding word are removed
from the long code decoding sets, the decoding sets of most of the original long
code words will not be changed by much. In fact, one must remove a bit more
from the long word decoding sets in order to ensure the desired properties are
preserved when passing from a Feinstein code to a channel codebook.
Lemma 12.8.1 Assume that ≤ 1/4 and {Cn ; n ≥ n0 } is a sequence of robust
¯
{τ, M (n), n, /2} Feinstein codes for a dcontinuous channel ν having capacity
C > 0. Assume also that h(2 ) + 2 log(B  − 1) < C , where B is the channel
output alphabet. Let δ ∈ (0, 1/4). Then there exists an n1 such that for all
n ≥ n1 the following statements are true.
(A) If Cn = {vi , Γi ; i = 1, . . . , M (n)}, then there is a modiﬁed codebook Wn =
{wi ; Wi ; i = 1, . . . , K (n)} and a set of K (n) indices Kn = {k1 , · · · , kK (n) ⊂
{1, · · · , M (n)} such that wi = vki , Wi ⊂ (Γi ) 2 ; i = 1, . . . , K (n), and
n
sup νx (Wjc ) ≤ . max 1≤j ≤K (n) x∈c(wj ) (12.26) (B) There is a sync word σ ∈ Ar , r = r(n) = δ n = smallest integer larger
r
than δn, and a sync decoding set S ∈ BB such that
r
sup νx (S c ) ≤ . (12.27) x∈c(σ ) and such that no rtuple in S appears in any ntuple in Wi ; that is, if
r
G(br ) = {y n : yi = br some i = 0, . . . , n − r} and G(S ) = br ∈S G(br ),
then
G(S ) Wi = ∅, i = 1, . . . , K (n).
(12.28)
(C) We have that
{k : k ∈ Kn } ≤ δM (n). (12.29) The modiﬁed code Wn has fewer words than the original code Cn , but (12.29)
ensures that Wn cannot be much smaller since
K (n) ≥ (1 − δ )M (n). (12.30) Given a codebook Wn = {wi , Wi ; i = 1, . . . , K (n)}, a sync word σ ∈ Ar ,
and a sync decoding set S , we call the length n + r codebook {σ × wi , S × Wi ;
i = 1, . . . , K (n)} a preﬁxed or punctuated codebook.
¯
Proof: Since ν is dcontinuous, n2 can be chosen so large that for n ≥ n2
max sup an ∈An x,x ∈c(an ) δ
¯nn
dn (νx , νx ) ≤ ( )2 .
2 (12.31) 268 CHAPTER 12. CODING FOR NOISY CHANNELS From Corollary 12.5.1 there is an n3 so large that for each r ≥ n3 there exists
an /2robust (τ, J, r, /2)Feinstein code Cs = {sj , Sj : j = 1, . . . , J }; J ≥ 2rRs ,
where Rs ∈ (0, C − h(2 ) − 2 log(B  − 1)). Assume that n1 is large enough
to ensure that δn1 ≥ n2 ; δn1 ≥ n3 , and n1 ≥ n0 . Let 1F denote the indicator
function of the set F and deﬁne λn by
J λn = J −1
j =1
J = J −1
j =1 = J −1 1
M (n)
1
M (n) 1
M (n) M (n) ν n (G((Sj ) )
ˆ Γi vi ) i=1
M (n) ν n (y n vi )1G(b ) (y n )
ˆ
i=1 b ∈(Sj ) y n ∈Γi M (n) J 1G(b ) (y n ) . (12.32) ν n (y n vi ) ˆ
i=1 y n ∈Γi j =1 b ∈(Sj ) Since the (Sj ) are disjoint and a ﬁxed y n can belong to at most n − r ≤ n sets
G(br ), the bracket term above is bound above by n and hence
λn ≤ n1
J M (n) M (n) ν n (y n vi ) ≤
ˆ
i=1 n
≤ n2−rRs ≤ n2−δnRs → 0
n→∞
J so that choosing n1 also so that n1 2−δnRs ≤ (δ )2 h we have that λn ≤ (δ )2 if
n ≥ n1 . From (12.32) this implies that for n ≥ n1 there must exist at least one
j for which
M (n) ν n (G((Sj ) )
ˆ Γi vi ) ≤ (δ )2 i=1 which in turn implies that for n ≥ n1 there must exist a set of indices Kn ⊂
{1, · · · , M (n)} such that
ν n (G((Sj ) )
ˆ Γi vi ) ≤ δ , i ∈ Kn , {i : i ∈ Kn } ≤ δ .
Deﬁne σ = sj ; S = (Sj ) /2 , wi = vki , and Wi = (Γki G((Sj ) )c ) δ ; i =
1, . . . , K (n). We then have from Lemma 12.6.1 and (12.31) that if x ∈ c(σ ),
then since δ ≤ /2
r
r
νx (S ) = νx ((Sj ) /2 ) ≥ ν r (Sj σ ) −
ˆ 2 ≥1− , proving (12.27). Next observe that if y n ∈ (G((Sj ) )c ) δ , then there is a bn ∈
G((Sj ) )c such that dn (y n , bn ) ≤ δ and thus for i = 0, 1, . . . , n − r we have that
r
dr (yi , br ) ≤
i nδ
≤.
r2
2 12.8. SYNCHRONIZING BLOCK CHANNEL CODES 269 Since bn ∈ G((Sj ) )c , it has no rtuple within of an rtuple in Sj and hence
r
the rtuples yi are at least /2 distant from Sj and hence y n ∈ H ((S ) /2 )c ). We
have therefore that (G((Sj ) )c ) δ ⊂ G((Sj ) )c and hence
G(S ) Wi = G((Sj ) ) ⊂ G((Sj ) /2 ) (Γki G((Sj ) )c )δ (G((Sj ) )c )δ = ∅, completing the proof.
2
Combining the preceding lemma with the existence of robust Feinstein codes
at rates less than capacity (Lemma 12.6.1) we have proved the following synchronized block coding theorem.
¯
Corollary 12.8.1 Le ν be a stationary ergodic dcontinuous channel and ﬁx
> 0 and R ∈ (0, C ). Then there exists for suﬃciently large blocklength N , a
length N codebook {σ × wi , S × Wi ; i = 1, . . . , M }, M ≥ 2N R , σ ∈ Ar , wi ∈ An ,
r + n = N , such that
r
sup νx (S c ) ≤ , x∈c(σ ) n
max νx (Wjc ) ≤ , i≤j ≤M Wj G(S ) = ∅. Proof: Choose δ ∈ (0, /2) so small that C − h(2δ ) − 2δ log(B  − 1) > (1 +
δ )R(1 − log(1 − δ 2 )) and choose R ∈ ((1 + δ )R(1 − log(1 − δ 2 )), C − h(2δ ) −
2δ log(B  − 1). From Lemma 12.6.1 there exists an n0 such that for n ≥
n0 there exist δ robust (τ, µ, n, δ ) Feinstein codes with M (n) ≥ 2nR . From
Lemma 12.8.1 there exists a codebook {wi , Wi ; i = 1, . . . , K (n)}, a sync word
r
σ ∈ Ar , and a sync decoding set S ∈ BB , r = δ n such that
n
max sup νx (Wjc ) ≤ 2δ ≤ ,
j x∈c(wj ) r
sup νx (S ) ≤ 2δ ≤ , x∈c(σ ) G(S ) Wj = ∅; j = 1, . . . , K (n), and from (12.30)
M = K (n) ≥ (1 − δ 2 )M (n). Therefore for N = n + r
N −1 log M completing the proof. ≥ (n nδ )−1 log((1 − δ 2 )2nR )
nR + log(1 − δ 2 )
=
n + nδ
R + n−1 log(1 − δ 2 )
=
1+δ
R + log(1 − δ 2 )
≥
≥ R,
1+δ
2 270 CHAPTER 12. CODING FOR NOISY CHANNELS 12.9 Sliding Block Source and Channel Coding Analogous to the conversion of block source codes into sliding block source
codes, the basic idea of constructing a sliding block channel code is to use a
punctuation sequence to stationarize a block code and to use sync words to
locate the blocks in the decoded sequence. The sync word can be used to
mark the beginning of a codeword and it will rarely be falsely detected during
a codeword. Unfortunately, however, an rtuple consisting of a segment of
a sync and a segment of a codeword may be erroneously detected as a sync
with nonnegligible probability. To resolve this confusion we look at the relative
frequency of syncdetects over a sequence of blocks instead of simply trying to
ﬁnd a single sync. The idea is that if we look at enough blocks, the relative
frequency of the syncdetects in each position should be nearly the probability
of occurrence in that position and these quantities taken together give a pattern
that can be used to determine the true sync location. For the ergodic theorem
to apply, however, we require that blocks be ergodic and hence we ﬁrst consider
totally ergodic sources and channels and then generalize where possible. Totally Ergodic Sources
¯
Lemma 12.9.1 Let ν be a totally ergodic stationary dcontinuous channel. Fix
, δ > 0 and assume that CN = {σ × wi ; S × Wi ; i = 1, . . . , K } is a preﬁxed
codebook satisfying (12.26)–(12.28). Let γn : GN → CN assign an N tuple in the
preﬁxed codebook to each N tuple in GN and let [G, µ, U ] be an N stationary, N ergodic source. Let c(an ) denote the cylinder set or rectangle of all sequences u =
(· · · , u−1 , u0 , u1 , · · · ) for which un = an . There exists for suﬃciently large L
(which depends on the source) a sync locating function s : B LN → {0, 1, . . . , N −
N
m
1} and a set Φ ∈ BG , m = (L +1)N , such that if um ∈ Φ and γN (ULN ) = σ × wi ,
then
inf x∈c(γm (um )) νx (y : s(y LN ) = θ, θ = 0, . . . , N −1; yLN ∈ S ×Wi ) ≥ 1−3 . (12.33) Comments: The lemma can be interpreted as follows. The source is block encoded using γN . The decoder observes a possible sync word and then looks
“back” in time at previous channel outputs and calculates s(y LN ) to obtain the
exact sync location, which is correct with high probability. The sync locator
function is constructed roughly as follows: Since µ and ν are N stationary and
N ergodic, if γ : A∞ → B ∞ is the sequence encoder induced by the length
¯
N block code γN , then the encoded source µγ −1 and the induced channel
¯
output process η are all N stationary and N ergodic. The sequence zj =
η (T j c(S ))); j = . . . , −1, 0, 1, . . . is therefore periodic with period N . Furthermore, zj can have no smaller period than N since from (12.26)–(12.28)
η (T j c(S )) ≤ , j = r + 1, . . . , n − r and η (c(S )) ≥ 1 − . Thus deﬁning the
sync pattern {zj ; j = 0, 1, . . . , N − 1}, the pattern is distinct from any cyclic
shift of itself of the form {zk , · · · , zN −1 , z0 , · · · , xk−1 }, where k ≤ N − 1. The
sync locator computes the relative frequencies of the occurrence of S at intervals of length N for each of N possible starting points to obtain, say, a 12.9. SLIDING BLOCK SOURCE AND CHANNEL CODING 271 vector z N = (ˆ0 , z1 , · · · , zN −1 ). The ergodic theorem implies that the zi will
ˆ
zˆ
ˆ
ˆ
be near their expectation and hence with high probability (ˆ0 , · · · , zN −1 ) =
z
ˆ
(zθ , zθ+1 , · · · , zN −1 , z0 , · · · , zθ−1 ), determining θ. Another way of looking at
the result is to observe that the sources ηT j ; j = 0, . . . , N − 1 are each N ergodic and N stationary and hence any two are either identical or orthogonal
in the sense that they place all of their measure on disjoint N invariant sets.
(See, e.g., Exercise 1, Chapter 6 of [50].) No two can be identical, however,
since if ηT i = ηT j for i = j ; 0 ≤ i, j ≤ N − 1, then η would be periodic with
period i − j  strictly less than N , yielding a contradiction. Since membership
in any set can be determined with high probability by observing the sequence
for a long enough time, the sync locator attempts to determine which of the
N distinct sources ηT j is being observed. In fact, synchronizing the output
is exactly equivalent to forcing the N sources ηT j ; j = 0, 1, . . . , N − 1 to be
distinct N ergodic sources. After this is accomplished, the remainder of the
¯
proof is devoted to using the properties of dcontinuous channels to show that
synchronization of the output source when driven by µ implies that with high
probability the channel output can be synchronized for all ﬁxed input sequences
in a set of high µ probability.
The lemma is stronger (and more general) than the similar results of Nedoma
[108] and Vajda [143], but the extra structure is required for application to
sliding block decoding.
Proof: Choose ζ > 0 so that ζ < /2 and
ζ< 1
min zi − zj .
8 i,j :zi =zj (12.34) LN
˜
For α > 0 and θ = 0, 1, . . . , N − 1 deﬁne the sets ψ (θ, α) ∈ BB and ψ (θ, α) ∈
m
BB , m = (L + 1)N by ψ (θ, α) = {y LN :  ˜
ψ (θ, α) θ 1
L−1 L−2
r
1S (yj +iN ) − zθ+j  ≤ α; j = 0, 1, . . . , N − 1}
i=0
N −θ = B × ψ (θ, α) × B . From the ergodic theorem L can be chosen large enough so that
N −1 N −1 ˜
ψ (θ, ζ )) ≥ 1 − ζ 2 . T −θ c(ψ (θ, ζ ))) = η m ( η(
θ =0 (12.35) θ =0 Assume also that L is large enough so that if xi = xi , i = 0, . . . , m − 1 then
ζ
m
m
¯
dm (νx , νx ) ≤ ( )2 .
N (12.36) 272 CHAPTER 12. CODING FOR NOISY CHANNELS From (12.35)
N −1 N −1 ˜
ψ (θ, ζ ))c ) ζ 2 ≥ η m (( =
c(am ) am ∈Gm θ =0 ˜
ψ (θ, ζ )c )) m
dµ(u)νγ (u) ((
¯
θ =0
N −1 ˜
ψ (θ, ζ ))c γm (am )) µm (am )ˆ((
ν =
am ∈ G m and hence there must be a set Φ ∈ m
BB θ =0 such that N −1 ˜
ψ (θ, ζ ))c γm (am )) ≤ ζ, am ∈ Φ, ν m ((
ˆ (12.37) θ =0 µm (Φ) ≤ ζ. (12.38) LN Deﬁne the sync locating function s : B
→ {0, 1, · · · , N − 1} as follows:
Deﬁne the set ψ (θ) = {y LN ∈ (ψ (θ, ζ ))2ζ/N } and then deﬁne
y LN ∈ ψ (θ)
otherwise θ
1 s(y LN ) = We show that s is well deﬁned by showing that ψ (θ) ⊂ ψ (θ, 4ζ ), which sets are
disjoint for θ = 0, 1, . . . , N − 1 from (12.34). If y LN ∈ ψ (θ), there is a bLN ∈
ψ (θ, ζ ) for which dLN (y LN , bLN ) ≤ 2ζ/N and hence for any j ∈ {0, 1, · · · , N − 1}
N
at most LN (2ζ/N ) = 2ζL of the consecutive nonoverlapping N tuples yj +iN ,
i = 0, 1, . . . , L − 2, can diﬀer from the corresponding bN iN and therefore
j+
 1
L−1 L −2
r
1S (yj +iN ) − zθ+j  ≤ 
i=0 1
L−1 L −2 1S (br+iN ) − zθ+j  + 2ζ ≤ 3ζ
j
i=0 m
˜
and hence y LN ∈ ψ (θ, 4ζ ). If ψ (θ) is deﬁned to be B θ × ψ (θ) × B N −θ ∈ BB ,
then we also have that
N −1 N −1 ˜
ψ (θ, ζ ))ζ/N ⊂ (
θ =0 ˜
ψ (θ )
θ =0 N −1
θ =0 ˜
since if y ∈ (
ψ (θ, ζ ))ζ/N , then there is a bm such that bLN ∈ ψ (θ, ζ );
θ
θ = 0, 1, . . . , N − 1 and dm (y m , bm ) ≤ ζ/N for θ = 0, 1, . . . , N − 1. This implies
from Lemma 12.6.1 and (12.36)–(12.38) that if x ∈ γ m (am ) and am ∈ Φ, then
n N −1
m
˜
νx (
ψ (θ))
θ =0 N −1 ≥ ˜
ψ (θ, ζ ))ζ/N ) m
νx ((
θ =0
N −1 ≥ ν(
ˆ
θ =0 ≥ ζ
˜
ψ (θ, ζ )γ m (am )) −
N 1−ζ − ζ
≥1− .
N (12.39) 12.9. SLIDING BLOCK SOURCE AND CHANNEL CODING 273 To complete the proof, we use (12.26)–(12.28) and (12.39) to obtain for
am ∈ Φ and γm (aN LN ) = σ × wi that
LN
N
νx (y : s(yθ ) = θ, θ = 0, 1, . . . , N − 1; yLN ∈ S × Wi ) ≥ N −1
m
νx (
ψ (θ))
θ =0 N
− νT −N L x (S × Wic ) ≥ 1 − − 2 . 2
Next the preﬁxed block code and the sync locator function are combined
with a random punctuation sequence of Lemma 9.5.2 to construct a good sliding
block code for a totally ergodic source with entropy less than capacity.
¯
Lemma 12.9.2 Given a dcontinuous totally ergodic stationary channel ν with
Shannon capacity C , a stationary totally ergodic source [G, µ, U ] with entropy
rate H (µ) < C , and δ > 0, there exists for suﬃciently large n, m a sliding block
encoder f : Gn → A and decoder g : B m → G such that Pe (µ, ν, f, g ) ≤ δ .
¯
¯
Proof: Choose R, H < R < C , and ﬁx > 0 so that ≤ δ/5 and ≤ (R − H )/2.
Choose N large enough so that the conditions and conclusions of Corollary 12.8.1
hold. Construct ﬁrst a joint source and channel block encoder γN as follows:
From the asymptotic equipartition property (Lemma 3.2.1 or Section 3.5)there
is an n0 large enough to ensure that for N ≥ n0 the set
= ¯
{uN : N −1 hN (u) − H  ≥ } = {uN : e−N (H + ) ≤ µ(uN ) ≤ e−N (H − ) } (12.40) µU N (GN ) ≥ 1 − . GN (12.41) ¯ ¯ has probability
Observe that if M = GN , then
¯ ¯ 2N (H − ) ≤ M ≤ 2N (H + ) ≤ 2N (R− ) . (12.42) Index the members of GN as βi ; i = 1, . . . , M . If uN = βi , set γN (uN )
= σ × wi . Otherwise set γN (uN ) = σ × wM +1 . Since for large N , 2N (R− ) + 1 ≤
2N R , γN is well deﬁned. γN can be viewed as a synchronized extension of the
almost noiseless code of Section 3.5. Deﬁne also the block decoder ψN (y N ) = βi
if y N ∈ S × Wi ; i = 1, . . . , M . Otherwise set ψN (y N ) = β ∗ , an arbitrary
reference vector. Choose L so large that the conditions and conclusions of
Lemma 12.9.1 hold for C and γN . The sliding block decoder gm : B m → G,
m
ˆ
m = (L + 1)N , yielding decoded process Uk = gm (Yk−N L ) is deﬁned as follows:
ˆ
If s(yk−N L , · · · , yk − 1) = θ, form bN = ψN (yk−θ , · · · , yk−θ−N ) and set Uk (y ) =
gm (yk−N L , · · · , yk+N ) = bθ , the appropriate symbol of the appropriate block.
The sliding block encoder f will send very long sequences of block words
with random spacing to make the code stationary. Let K be a large number 274 CHAPTER 12. CODING FOR NOISY CHANNELS satisfying K ≥ L + 1 so that m ≤ KN and recall that N ≥ 3 and L ≥ 1. We
then have that
1
1
≤
≤.
(12.43)
KN
3K
6
Use Corollary 9.4.2 to produce a (KN, ) punctuation sequence Zn using a
ﬁnite length sliding block code of the input sequence. The punctuation process
is stationary and ergodic, has a ternary output and can produce only isolated
0’s followed by KN 1’s or individual 2’s. The punctuation sequence is then used
to convert the block encoder γN into a sliding block coder: Suppose that the
encoder views an input sequence u = · · · , u−1 , u0 , u1 , · · · and is to produce a
single encoded symbol x0 . If u0 is a 2, then the encoder produces an arbitrary
channel symbol, say a∗ . If x0 is not a 2, then the encoder inspects u0 , u−1 , u−2
and so on into the past until it locates the ﬁrst 0. This must happen within KN
input symbols by construction of the punctuation sequence. Given that the ﬁrst
1 occurs at, say, Zl = 1,, the encoder then uses the block code γN to encode
successive blocks of input N tuples until the block including the symbol at time 0
is encoded. The sliding block encoder than produces the corresponding channel
symbol x0 . Thus if Zl = 1, then for some J < Kx0 = (γN (ul+JN ))l mod N where
the subscript denotes that the (l mod N )th coordinate of the block codeword is
put out. The ﬁnal sliding block code has a ﬁnite length given by the maximum
of the lengths of the code producing the punctuation sequence and the code
imbedding the block code γN into the sliding block code.
ˆ
We now proceed to compute the probability of the error event {u, y : U0 (y ) =
¯
ˆ0 (y ) = U0 (u)}, f be the sequence
U0 (u)} = E . Let Eu denote the section {y : U
coder induced by f , and F = {u : Z0 (u) = 0}. Note that if u ∈ T −1 F ,
then T u ∈ F and hence Z0 (T u) = Z1 (u) since the coding is stationary. More
generally, if uT −i F , then Zi = 0. By construction any 1 must be followed by
KN 1’s and hence the sets T −i F are disjoint for i = 0, 1, . . . , KN − 1 and hence
we can write
ˆ
Pe = Pr(U0 = U0 ) = µν (E ) = dµ(u)νf (u) (Eu )
¯
KN −1 LN −1 ≤
i=0 T −i F dµ(u)νf (u) (Eu ) +
¯
i=LN + T −i F dµ(u)νf (u) (Eu )
¯ dµ(u)
S
( KN −1 T −i F )c
i=0
KN −1 = LN µ(F ) +
i=LN T −i F dµ(u)νf (u) (Eu ) + a
¯ KN −1 ≤2 +
i=LN akN ∈GkN u ∈T −i (F T c(aK N )) ˆ
dµ(u )νf (u ) (y : U0 (u ) = U0 (u )),
¯
(12.44) where we have used the fact that µ(F ) ≤ (KN )−1 (from Corollary 9.4.2) and
hence LN µ(F ) ≤ L/K ≤ . Fix i = kN + j ; 0 ≤ j ≤ N − 1 and deﬁne 12.9. SLIDING BLOCK SOURCE AND CHANNEL CODING 275 u = T j +LN u and y = T j +LN y , and the integrals become
dµ(u )×
u ∈T −i (F T c(aKN ))
m
νf (u ) (y : U0 (u ) = gm (Y−N L (y ))
¯ dµ(u )× =
u∈T −(k−L)N (F T c(aKN )) νf (T −(j+LN ) u) (y :U0 (T j +LN u) = gm (Y− N Lm (T j +N L y )))
¯
dµ(u )× =
u∈T −(k−L)N (F T c(aKN )) m
νf (T −(j+LN ) u) (y : uj +LN = gm (yj ))
¯ = dµ(u )
u∈T −(k−L)N (F T c(aKN )) N
LN
× νf (T −(j+LN ) u) (y : uN = ψN (yLN ) or s(yj = j )). (12.45)
¯
LN
N
N
= βj ∈ GN , then uN = ψN (yLN ) if yLN ∈ S × Wi . If u ∈
LN
KN
m
m
T
c(a ), then u = a(k−L)N and hence from Lemma 12.9.1 and stationarity we have for i = kN + j that If uN
LN
−(k−L)N aKN ∈GKN T −i (c(aKN ) T dµ(u)νf (u) (Eu )
¯
F) µ(T −(k−L)N (c(aKN ) ≤3 ×
KN F )) KN a
∈G
am −L)N ∈ Φ (GLN × GN )
(k
µ(T −(k−L)N (c(aKN ) +
KN F )) KN a
∈G
am −L)N ∈ Φ (GLN × GN )
(k
µ(c(aKN ) ≤3 × F )) aKN ∈GKN µ(c(aKN ) + F )) S
am −L)N ∈Φc (GLN ×GN )c
(k ≤ 3 µ(F ) + µ(c(Φc ) F ) + µ(c(GN ) F ). (12.46) Choose the partition in Lemmas 9.5.1–9.5.2 to be that generated by the sets
c(Φc ) and c(GN ) (the partition with all four possible intersections of these sets
or their complements). Then the above expression is bounded above by
3
+
+
≤5
NK
NK
NK
NK
and hence from (12.44)
Pe ≤ 5 ≤ δ (12.47) 276 CHAPTER 12. CODING FOR NOISY CHANNELS
2 which completes the proof.
The lemma immediately yields the following corollary. ¯
Corollary 12.9.1 If ν is a stationary dcontinuous totally ergodic channel with
Shannon capacity C , then any totally ergodic source [G, µ, U ] with H (µ) < C is
admissible. Ergodic Sources
If a preﬁxed blocklength N block code of Corollary 12.9.1 is used to block encode
a general ergodic source [G, µ, U ], then successive N tuples from µ may not be
ergodic, and hence the previous analysis does not apply. From the Nedoma
ergodic decomposition [107] (see, e.g., [50], p. 232), any ergodic source µ can be
represented as a mixture of N ergodic sources, all of which are shifted versions
of each other. Given an ergodic measure µ and an integer N , then there exists
a decomposition of µ into M N ergodic, N stationary components where M
∞
divides N , that is, there is a set Π ∈ BG such that
TMΠ = Π (12.48) T Π) = 0; i, j ≤ M, i = j (12.49) T i Π) = 1 µ(Π) i = 1
,
M j µ(T Π
M −1 µ(
i=0 such that the sources [G, µi , U ], where πi (W ) = µ(W T i Π) = M µ(W
are N ergodic and N stationary and
µ(W ) = 1
M M −1 πi ( W ) =
i=0 1
M T i Π) M −1 µ(W T i Π). (12.50) i=0 This decomposition provides a method of generalizing the results for totally
ergodic sources to ergodic sources. Since µ(·Π) is N ergodic, Lemma 12.9.2 is
valid if µ is replaced by µ(·Π). If an inﬁnite length sliding block encoder f is
used, it can determine the ergodic component in eﬀect by testing for T −i Π in
the base of the tower and insert i dummy symbols and then encode using the
length N preﬁxed block code. In other words, the encoder can line up the block
code with a prespeciﬁed one of the N possible N ergodic modes. A ﬁnite length
encoder can then be obtained by approximating the inﬁnite length encoder by
a ﬁnite length encoder. Making these ideas precise yields the following result.
¯
Theorem 12.9.1 If ν is a stationary dcontinuous totally ergodic channel with
Shannon capacity C , then any ergodic source [G, µ, U ] with H (µ) < C is admissible. 12.9. SLIDING BLOCK SOURCE AND CHANNEL CODING 277 Proof: Assume that N is large enough for Corollary 12.8.1 and (12.40)–(12.42)
to hold. From the Nedoma decomposition
1
M M −1 µN (GN T i Π) = µN (GN ) ≥ 1 − .
i=0 and hence there exists at least one i for which µN (GN T i Π) ≥ 1 − ; that is, at
least one N ergodic mode must put high probability on the set GN of typical
N tuples for µ. For convenience relabel the indices so that this good mode is
µ(·Π) and call it the design mode. Since µ(·Π) is N ergodic and N stationary,
Lemma 12.9.1 holds with µ replaced by µ(·Π); that is, there is a source/channel
block code (γN , ψN ) and a sync locating function s : B LN → {0, 1, · · · , M − 1}
such that there is a set Φ ∈ Gm ; m = (L + 1)N , for which (12.33) holds and
µm (ΦΠ) ≥ 1 − . The sliding block decoder is exactly as in Lemma 12.9.1. The
sliding block encoder, however, is somewhat diﬀerent. Consider a punctuation
sequence or tower as in Lemma 9.5.2, but now consider the partition generated
by Φ, GN , and T i Π, i = 0, 1, . . . , M − 1. The inﬁnite length sliding block
N K −1
code is deﬁned as follows: If u ∈ k=0 T k F , then f (u) = a∗ , an arbitrary
channel symbol. If u ∈ T i (F T −j Π) and if i < j , set f (u) = a∗ (these
are spacing symbols to force alignment with the proper N ergodic mode). If
j ≤ i ≤ KN − (M − j ), then i = j + kN + r for some 0 ≤ k ≤ (K − 1)N ,
r ≤ N − 1. Form GN (uN kN ) = aN and set f (u) = ar . This is the same
j+
encoder as before, except that if u ∈ T j Π, then block encoding is postponed for
j symbols (at which time u ∈ Π). Lastly, if KN − (M − j ) ≤ i ≤ KN − 1, then
f (u) = a∗ .
As in the proof of Lemma 12.9.2
Pe (µ, ν, f, gm ) = m
dµ(u)νf (u) (y : U0 (u) = gm (Y−LN (y )))
KN −1 ˆ
u ∈ T i F dµ(u)νf (u) (y : U0 (u) = U0 (y )) ≤2 +
i=LN KN −1 M −1 =2 +
i=LN j =0 aKN ∈GKN ˆ
dµ(u)νf (u) (y : U0 (u) = U0 (y ))
u∈T i (c(aKN ) T F T T −j Π)
M −1 KN −(M −j ) ≤2 +
j =0 i=LN +j aKN ∈GKN ˆ
dµ(u)νf (u) (y : U0 (u) = U0 (y ))
u∈T i (c(aKN ) T F T T −j Π)
M −1 + M µ(F
j =0 T −j Π), (12.51) 278 CHAPTER 12. CODING FOR NOISY CHANNELS where the rightmost term is
M −1 M 1
M
≤
≤.
KN
K T −j Π) ≤ µ(F
j =0 Thus
M −1 KN −(M −j ) Pe (µ, ν, f, gm ) ≤ 3 +
j =0 i=LN +j aKN ∈GKN ˆ
dµ(u)νf (u) (y : U0 (u) = U0 (y )).
u∈T i (c(aKN ) T F T T −j Π) Analogous to (12.45) (except that here i = j + kN + r, u = T −(LN +r) u ) u ∈T i (c(aKN ) T F T T −j Π) m
dµ(u )νf (u ) (y : U0 (u ) = gm (Y−LN (y ))) ≤ dµ(u)×
T j +(k−L)N (c(aKN ) T F T T −j Π) N
LN
νf (T i +LN u) (y : uN = ψN (yLN )ors(yr ) = r).
LN Thus since u ∈ T j +(k−L)N (c(aKN ) F T −j Π implies um = am k−L)N , analj +(
ogous to (12.46) we have that for i = j + kN + r dµ(u)νf (u) (y : U0 (u) = gm (Y− LN m (y )))
T i (c(aKN ) aKN ∈GKN T F T T −j Π) µ(T j +(k−L)N (c(aKN ) = F T −j Π)) aKN :am k−L)N ∈Φ
j +( µ(T j +(k−L)N (c(aKN ) + F T −j Π)) aKN :am k−L)N ∈Φ
j +( µ(c(aKN ) = F T −j Π) aKN :am k−L)N ∈Φ
j +( µ(c(aKN ) + F T −j Π) aKN :am k−L)N ∈Φ
j +( = µ(T −(j +(k−L)N ) c(Φ) F T −j Π) + µ(T −(j +(k−L)N ) c(Φ)c F T −j Π). 12.9. SLIDING BLOCK SOURCE AND CHANNEL CODING 279 From Lemma 9.5.2 (the RohlinKakutani theorem), this is bounded above
by
µ(T −(j +(k−L)N ) c(Φ) T −j Π) µ(T −(j +(k−L)N ) c(Φ)c T −j Π)
+
KN
KN
µ(T −(j +(k−L)N ) c(Φ)T −j Π)µ(Π) µ(T −(j +(k−L)N ) c(Φ)c T −j Π)µ(Π)
=
+
KN
KN
µ(Π)
2
µ(Π)
= µ(c(Φ)Π)
µ(c(Φ)c Π)
+≤
.
KN
KN
M KN
With (12.50)–(12.51) this yields
Pe (µ, ν, f, gm ) ≤ 3 + M KN 2
≤5 ,
M KN (12.52) which completes the result for an inﬁnite sliding block code.
The proof is completed by applying Corollary 10.5.1, which shows that by
choosing a ﬁnite length sliding block code f0 from Lemma 4.2.4 so that Pr(f =
f0 ) is suﬃciently small, then the resulting Pe is close to that for the inﬁnite
length sliding block code.
2
In closing we note that the theorem can be combined with the sliding block
source coding theorem to prove a joint source and channel coding theorem similar to Theorem 12.7.1, that is, one can show that given a source with distortion
rate function D(R) and a channel with capacity C , then sliding block codes
exist with average distortion approximately D(C ). 280 CHAPTER 12. CODING FOR NOISY CHANNELS Bibliography
[1] N. M. Abramson. Information Theory and Coding. McGrawHill, New
York, 1963.
[2] R. Adler. Ergodic and mixing properties of inﬁnite memory channels.
Proc. Amer. Math. Soc., 12:924–930, 1961.
[3] R. L. Adler, D. Coppersmith, and M. Hassner. Algorithms for slidingblock codes–an application of symbolic dynamics to information theory.
IEEE Trans. Inform. Theory, IT29:5–22, 1983.
[4] R. Ahlswede and P. G´cs. Two contributions to information theory. In
a
Topics in Information Theory, pages 17–40, Keszthely,Hungary, 1975.
[5] R. Ahlswede and J. Wolfowitz. Channels without synchronization. Adv.
in Appl. Probab., 3:383–403, 1971.
[6] P. Algoet. LogOptimal Investment. PhD thesis, Stanford University, 1985.
[7] P. Algoet and T. Cover. A sandwich proof of the ShannonMcMillanBreiman theorem. Ann. Probab., 16:899–909, 1988.
[8] E. Ayan˘glu and R. M. Gray. The design of joint source and channel trellis
o
waveform coders. IEEE Trans. Inform. Theory, IT33:855–865, November
1987.
[9] A. R. Barron. The strong ergodic theorem for densities: generalized
ShannonMcMillanBreiman theorem. Ann. Probab., 13:1292–1303, 1985.
[10] T. Berger. Rate distortion theory for sources with abstract alphabets and
memory. Inform. and Control, 13:254–273, 1968.
[11] T. Berger. Rate Distortion Theory.
Cliﬀs,New Jersey, 1971. PrenticeHall Inc., Englewood [12] T. Berger. Multiterminal source coding. In G. Longo, editor, The Information Theory Approach to Communications, volume 229 of CISM
Courses and Lectures, pages 171–231. SpringerVerlag, Vienna and New
York, 1978.
281 282 BIBLIOGRAPHY [13] E. Berlekamp. Algebraic Coding Theory. McGrawHill, New York, 1968.
[14] E. Berlekamp, editor. Key Papers in the Development of Coding Theory.
IEEE Press, New York, 1974.
[15] P. Billingsley. Ergodic Theory and Information. Wiley, New York, 1965.
[16] G. D. Birkhoﬀ. Proof of the ergodic theorem. Proc. Nat. Acad. Sci.,
17:656–660, 1931.
[17] R. E. Blahut. Computation of channel capacity and ratedistortion functions. IEEE Trans. Inform. Theory, IT18:460–473, 1972.
[18] R. E. Blahut. Theory and Practice of Error Control Codes. Addison
Wesley, Reading, Mass., 1987.
[19] L. Breiman. The individual ergodic theorem of information theory. Ann.
of Math. Statist., 28:809–811, 1957.
[20] L. Breiman. A correction to ‘The individual ergodic theorem of information theory’. Ann. of Math. Statist., 31:809–810, 1960.
[21] J. R. Brown. Ergodic Theory and Topological Dynamics. Academic Press,
New York, 1976.
[22] J. A. Bucklew. A large deviation theory proof of the abstract alphabet
source coding theorem. IEEE Trans. Inform. Theory, IT34:1081–1083,
1988.
[23] T. M. Cover, P. Gacs, and R. M. Gray. Kolmogorov’s contributions to
information theory and algorithmic complexity. Ann. Probab., 17:840–865,
1989.
[24] I. Csisz´r. Informationtype measures of diﬀerence of probability distria
butions and indirect observations. Studia Scientiarum Mathematicarum
Hungarica, 2:299–318, 1967.
[25] I. Csisz´r. Idivergence geometry of probability distributions and minia
mization problems. Ann. Probab., 3(1):146–158, 1975.
[26] I. Csiszár and J. Körner. Coding Theorems of Information Theory. Academic Press/Hungarian Academy of Sciences, Budapest, 1981.
[27] L. D. Davisson and R. M. Gray. A simplified proof of the sliding-block
source coding theorem and its universal extension. In Conf. Record 1978
Int’l. Conf. on Comm. 2, pages 34.4.1–34.4.5, Toronto, 1978.
[28] L. D. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace. Efficient universal noiseless source codes. IEEE Trans. Inform. Theory, IT-27:269–279, 1981.
[29] L. D. Davisson and M. B. Pursley. An alternate proof of the coding theorem for stationary ergodic sources. In Proceedings of the Eighth Annual Princeton Conference on Information Sciences and Systems, 1974.
[30] M. Denker, C. Grillenberger, and K. Sigmund. Ergodic Theory on Compact Spaces, volume 57 of Lecture Notes in Mathematics. Springer-Verlag, New York, 1970.
[31] J.-D. Deuschel and D. W. Stroock. Large Deviations, volume 137 of Pure and Applied Mathematics. Academic Press, Boston, 1989.
[32] R. L. Dobrushin. A general formulation of the fundamental Shannon theorem in information theory. Uspehi Mat. Akad. Nauk. SSSR, 14:3–104, 1959. Translation in Transactions Amer. Math. Soc., series 2, vol. 33, 323–438.
[33] R. L. Dobrushin. Shannon's theorems for channels with synchronization errors. Problemy Peredaci Informatsii, 3:18–36, 1967. Translated in Problems of Information Transmission, vol. 3, 11–36 (1967), Plenum Publishing Corporation.
[34] M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain
Markov process expectations for large time. J. Comm. Pure Appl. Math.,
28:1–47, 1975.
[35] J. G. Dunham. A note on the abstract alphabet block source coding with a fidelity criterion theorem. IEEE Trans. Inform. Theory, IT-24:760, November 1978.
[36] P. Elias. Two famous papers. IRE Transactions on Information Theory,
page 99, 1958.
[37] R. M. Fano. Transmission of Information. Wiley, New York, 1961.
[38] A. Feinstein. A new basic theorem of information theory. IRE Transactions on Information Theory, pages 2–20, 1954.
[39] A. Feinstein. Foundations of Information Theory. McGraw-Hill, New
York, 1958.
[40] A. Feinstein. On the coding theorem and its converse for finite-memory
channels. Inform. and Control, 2:25–44, 1959.
[41] G. D. Forney, Jr. The Viterbi algorithm. Proc. IEEE, 61:268–278, March
1973.
[42] N. A. Friedman. Introduction to Ergodic Theory. Van Nostrand Reinhold
Company, New York, 1970.
[43] R. G. Gallager. Information Theory and Reliable Communication. John
Wiley & Sons, New York, 1968.
[44] A. El Gamal and T. Cover. Multiple user information theory. Proc. IEEE,
68:1466–1483, 1980.
[45] I. M. Gelfand, A. N. Kolmogorov, and A. M. Yaglom. On the general
deﬁnitions of the quantity of information. Dokl. Akad. Nauk, 111:745–
748, 1956. (In Russian.).
[46] A. Gersho and V. Cuperman. Vector quantization: A pattern-matching
technique for speech coding. IEEE Communications Magazine, 21:15–21,
December 1983.
[47] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression.
Kluwer Academic Publishers, Boston, 1992.
[48] R. M. Gray. Tree-searched block source codes. In Proceedings of the 1980 Allerton Conference, Allerton, IL, Oct. 1980.
[49] R. M. Gray. Vector quantization. IEEE ASSP Magazine, 1, no. 2:4–29, April 1984.
[50] R. M. Gray. Probability, Random Processes, and Ergodic Properties.
Springer-Verlag, New York, 1988.
[51] R. M. Gray. Spectral analysis of quantization noise in a single-loop sigma-delta modulator with dc input. IEEE Trans. Comm., COM-37:588–599,
1989.
[52] R. M. Gray. Source Coding Theory. Kluwer Academic Press, Boston,
1990.
[53] R. M. Gray and L. D. Davisson. Source coding without the ergodic assumption. IEEE Trans. Inform. Theory, IT-20:502–516, 1974.
[54] R. M. Gray and J. C. Kieﬀer. Asymptotically mean stationary measures.
Ann. Probab., 8:962–973, 1980.
[55] R. M. Gray, D. L. Neuhoff, and J. K. Omura. Process definitions of distortion-rate functions and source coding theorems. IEEE Trans. Inform. Theory, IT-21:524–532, 1975.
[56] R. M. Gray, D. L. Neuhoff, and D. Ornstein. Nonblock source coding with a fidelity criterion. Ann. Probab., 3:478–491, 1975.
[57] R. M. Gray, D. L. Neuhoff, and P. C. Shields. A generalization of Ornstein's d̄ distance with applications to information theory. Ann. Probab., 3:315–328, April 1975.
[58] R. M. Gray and D. S. Ornstein. Sliding-block joint source/noisy-channel coding theorems. IEEE Trans. Inform. Theory, IT-22:682–690, 1976.
[59] R. M. Gray, D. S. Ornstein, and R. L. Dobrushin. Block synchronization, sliding-block coding, invulnerable sources and zero error codes for discrete noisy channels. Ann. Probab., 8:639–674, 1980.
[60] R. M. Gray, M. Ostendorf, and R. Gobbi. Ergodicity of Markov channels.
IEEE Trans. Inform. Theory, 33:656–664, September 1987.
[61] R. M. Gray and F. Saadat. Block source coding theory for asymptotically
mean stationary sources. IEEE Trans. Inform. Theory, 30:64–67, 1984.
[62] P. R. Halmos. Lectures on Ergodic Theory. Chelsea, New York, 1956.
[63] G. H. Hardy, J. E. Littlewood, and G. Polya. Inequalities. Cambridge
Univ. Press, London, 1952. Second edition, 1959.
[64] R. V. L. Hartley. Transmission of information. Bell System Tech. J.,
7:535–563, 1928.
[65] E. Hopf. Ergodentheorie. Springer-Verlag, Berlin, 1937.
[66] K. Jacobs. Die Übertragung diskreter Informationen durch periodische und fastperiodische Kanäle. Math. Annalen, 137:125–135, 1959.
[67] K. Jacobs. Über die Struktur der mittleren Entropie. Math. Z., 78:33–43, 1962.
[68] K. Jacobs. The ergodic decomposition of the Kolmogorov-Sinai invariant. In F. B. Wright, editor, Ergodic Theory. Academic Press, New York, 1963.
[69] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice-Hall, Englewood Cliffs, New Jersey, 1984.
[70] T. Kadota. Generalization of Feinstein's fundamental lemma. IEEE Trans. Inform. Theory, IT-16:791–792, 1970.
[71] S. Kakutani. Induced measure preserving transformations. In Proceedings
of the Imperial Academy of Tokyo, volume 19, pages 635–641, 1943.
[72] L. V. Kantorovich. On one effective method of solving certain classes of extremal problems. Dokl. Akad. Nauk, 28:212–215, 1940.
[73] A. J. Khinchine. The entropy concept in probability theory. Uspekhi
Matematicheskikh Nauk., 8:3–20, 1953. Translated in Mathematical Foundations of Information Theory, Dover, New York (1957).
[74] A. J. Khinchine. On the fundamental theorems of information theory.
Uspekhi Matematicheskikh Nauk., 11:17–75, 1957. Translated in Mathematical Foundations of Information Theory, Dover, New York (1957).
[75] J. C. Kieffer. A counterexample to Perez's generalization of the Shannon-McMillan theorem. Ann. Probab., 1:362–364, 1973.
[76] J. C. Kieffer. A general formula for the capacity of stationary nonanticipatory channels. Inform. and Control, 26:381–391, 1974.
[77] J. C. Kieffer. On the optimum average distortion attainable by fixed-rate coding of a nonergodic source. IEEE Trans. Inform. Theory, IT-21:190–193, March 1975.
[78] J. C. Kieffer. A generalization of the Pursley-Davisson-Mackenthun universal variable-rate coding theorem. IEEE Trans. Inform. Theory, IT-23:694–697, 1977.
[79] J. C. Kieffer. A unified approach to weak universal source coding. IEEE Trans. Inform. Theory, IT-24:674–682, 1978.
[80] J. C. Kieffer. Extension of source coding theorems for block codes to sliding block codes. IEEE Trans. Inform. Theory, IT-26:679–692, 1980.
[81] J. C. Kieffer. Block coding for weakly continuous channels. IEEE Trans. Inform. Theory, IT-27:721–727, 1981.
[82] J. C. Kieffer. Sliding-block coding for weakly continuous channels. IEEE Trans. Inform. Theory, IT-28:2–10, 1982.
[83] J. C. Kieffer. Coding theorem with strong converse for block source coding subject to a fidelity constraint, 1989. Preprint.
[84] J. C. Kieffer. An ergodic theorem for constrained sequences of functions. Bulletin American Math Society, 1989.
[85] J. C. Kieffer. Sample converses in source coding theory, 1989. Preprint.
[86] J. C. Kieffer. Elementary information theory. Unpublished manuscript,
1990.
[87] J. C. Kieffer and M. Rahe. Markov channels are asymptotically mean stationary. SIAM Journal on Mathematical Analysis, 12:293–305, 1980.
[88] A. N. Kolmogorov. On the Shannon theory of information in the case
of continuous signals. IRE Transactions Inform. Theory, IT-2:102–108,
1956.
[89] A. N. Kolmogorov. A new metric invariant of transitive dynamic systems and automorphisms in Lebesgue spaces. Dokl. Akad. Nauk SSSR, 119:861–864, 1958. (In Russian.).
[90] A. N. Kolmogorov. On the entropy per unit time as a metric invariant of automorphisms. Dokl. Akad. Nauk SSSR, 124:768–771, 1959. (In Russian.).
[91] A. N. Kolmogorov, A. M. Yaglom, and I. M. Gelfand. Quantity of information and entropy for continuous distributions. In Proceedings 3rd All-Union Mat. Conf., volume 3, pages 300–320. Izd. Akad. Nauk. SSSR,
1956.
[92] S. Kullback. A lower bound for discrimination in terms of variation. IEEE
Trans. Inform. Theory, IT-13:126–127, 1967.
[93] S. Kullback. Information Theory and Statistics. Dover, New York, 1968.
Reprint of 1959 edition published by Wiley.
[94] B. M. Leiner and R. M. Gray. Bounds on rate-distortion functions for stationary sources and context-dependent fidelity criteria. IEEE Trans. Inform. Theory, IT-19:706–708, Sept. 1973.
[95] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl., 10:707–710, 1966.
[96] S. Lin. Introduction to Error Correcting Codes. Prentice-Hall, Englewood Cliffs, NJ, 1970.
[97] K. M. Mackenthun and M. B. Pursley. Strongly and weakly universal
source coding. In Proceedings of the 1977 Conference on Information
Science and Systems, pages 286–291, Johns Hopkins University, 1977.
[98] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland, New York, 1977.
[99] A. Maitra. Integral representations of invariant measures. Transactions
of the American Mathematical Society, 228:209–235, 1977.
[100] J. Makhoul, S. Roucos, and H. Gish. Vector quantization in speech coding.
Proc. IEEE, 73, no. 11:1551–1587, November 1985.
[101] B. Marcus. Sofic systems and encoding data. IEEE Trans. Inform. Theory, IT-31:366–377, 1985.
[102] K. Marton. On the rate distortion function of stationary sources. Problems
of Control and Information Theory, 4:289–297, 1975.
[103] R. McEliece. The Theory of Information and Coding. Cambridge University Press, New York, NY, 1984.
[104] B. McMillan. The basic theorems of information theory. Ann. of Math.
Statist., 24:196–219, 1953.
[105] L. D. Meshalkin. A case of isomorphism of Bernoulli schemes. Dokl. Akad.
Nauk SSSR, 128:41–44, 1959. (In Russian.).
[106] Shu-Teh C. Moy. Generalizations of Shannon-McMillan theorem. Pacific Journal Math., 11:705–714, 1961.
[107] J. Nedoma. On the ergodicity and r-ergodicity of stationary probability measures. Z. Wahrsch. Verw. Gebiete, 2:90–97, 1963.
[108] J. Nedoma. The synchronization for ergodic channels. Transactions Third Prague Conf. Information Theory, Stat. Decision Functions, and Random Processes, pages 529–539, 1964.
[109] D. L. Neuhoff and R. K. Gilbert. Causal source codes. IEEE Trans. Inform. Theory, IT-28:701–713, 1982.
[110] D. L. Neuhoff, R. M. Gray, and L. D. Davisson. Fixed rate universal block source coding with a fidelity criterion. IEEE Trans. Inform. Theory,
21:511–523, 1975.
[111] D. L. Neuhoff and P. C. Shields. Channels with almost finite memory.
IEEE Trans. Inform. Theory, pages 440–447, 1979.
[112] D. L. Neuhoff and P. C. Shields. Channel distances and exact representation. Inform. and Control, 55(1), 1982.
[113] D. L. Neuhoff and P. C. Shields. Channel entropy and primitive approximation. Ann. Probab., 10(1):188–198, 1982.
[114] D. L. Neuhoff and P. C. Shields. Indecomposable finite state channels and primitive approximation. IEEE Trans. Inform. Theory, IT-28:11–19, 1982.
[115] D. Ornstein. Bernoulli shifts with the same entropy are isomorphic. Advances in Math., 4:337–352, 1970.
[116] D. Ornstein. An application of ergodic theory to probability theory. Ann.
Probab., 1:43–58, 1973.
[117] D. Ornstein. Ergodic Theory, Randomness, and Dynamical Systems. Yale
University Press, New Haven, 1975.
[118] D. Ornstein and B. Weiss. The Shannon-McMillan-Breiman theorem for
a class of amenable groups. Israel J. of Math, 44:53–60, 1983.
[119] D. O’Shaughnessy. Speech Communication. Addison-Wesley, Reading,
Mass., 1987.
[120] P. Papantoni-Kazakos and R. M. Gray. Robustness of estimators on stationary observations. Ann. Probab., 7:989–1002, Dec. 1979.
[121] A. Perez. Notions généralisées d’incertitude, d’entropie et d’information du point de vue de la théorie des martingales. In Transactions First Prague Conf. on Information Theory, Stat. Decision Functions, and Random Processes, pages 183–208. Czech. Acad. Sci. Publishing House, 1957.
[122] A. Perez. Sur la convergence des incertitudes, entropies et informations échantillon vers leurs valeurs vraies. In Transactions First Prague Conf. on Information Theory, Stat. Decision Functions, and Random Processes, pages 245–252. Czech. Acad. Sci. Publishing House, 1957.
[123] A. Perez. Sur la théorie de l’information dans le cas d’un alphabet abstrait. In Transactions First Prague Conf. on Information Theory, Stat. Decision Functions, Random Processes, pages 209–244. Czech. Acad. Sci. Publishing House, 1957.
[124] A. Perez. Extensions of Shannon-McMillan’s limit theorem to more general stochastic processes. In Third Prague Conf. on Inform. Theory, Decision Functions, and Random Processes, pages 545–574, Prague and New York, 1964. Publishing House Czech. Akad. Sci. and Academic Press.
[125] K. Petersen. Ergodic Theory. Cambridge University Press, Cambridge,
1983.
[126] M. S. Pinsker. Dynamical systems with completely positive or zero entropy. Soviet Math. Dokl., 1:937–938, 1960.
[127] D. Ramachandran. Perfect Measures. ISI Lecture Notes, No. 6 and 7. Indian Statistical Institute, Calcutta, India, 1979.
[128] V. A. Rohlin and Ya. G. Sinai. Construction and properties of invariant
measurable partitions. Soviet Math. Dokl., 2:1611–1614, 1962.
[129] L. Rüschendorf. Wasserstein metric. In Michiel Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I, II, III. Kluwer Academic Publishers, 1997–2001.
[130] V. V. Sazonov. On perfect measures. Izv. Akad. Nauk SSSR, 26:391–414, 1962. American Math. Soc. Translations, Series 2, No. 48, pp. 229–254, 1965.
[131] C. E. Shannon. A mathematical theory of communication. Bell Syst.
Tech. J., 27:379–423, 623–656, 1948.
[132] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. In IRE National Convention Record, Part 4, pages 142–163, 1959.
[133] P. C. Shields. The Theory of Bernoulli Shifts. The University of Chicago
Press, Chicago, Ill., 1973.
[134] P. C. Shields. The ergodic and entropy theorems revisited. IEEE Trans. Inform. Theory, IT-33:263–266, 1987.
[135] P. C. Shields and D. L. Neuhoff. Block and sliding-block source coding. IEEE Trans. Inform. Theory, IT-23:211–215, 1977.
[136] Ya. G. Sinai. On the concept of entropy of a dynamical system. Dokl. Akad. Nauk. SSSR, 124:768–771, 1959. (In Russian.).
[137] Ya. G. Sinai. Weak isomorphism of transformations with an invariant
measure. Soviet Math. Dokl., 3:1725–1729, 1962.
[138] Ya. G. Sinai. Introduction to Ergodic Theory. Mathematical Notes, Princeton University Press, Princeton, 1976.
[139] D. Slepian. A class of binary signaling alphabets. Bell Syst. Tech. J., 35:203–234, 1956.
[140] D. Slepian, editor. Key Papers in the Development of Information Theory.
IEEE Press, New York, 1973.
[141] A. D. Sokal. Existence of compatible families of proper regular conditional
probabilities. Z. Wahrsch. Verw. Gebiete, 56:537–548, 1981.
[142] J. Storer. Data Compression. Computer Science Press, Rockville, Maryland, 1988.
[143] I. Vajda. A synchronization method for totally ergodic channels. In Transactions of the Fourth Prague Conf. on Information Theory, Decision Functions, and Random Processes, pages 611–625, Prague, 1965.
[144] E. van der Meulen. A survey of multiway channels in information theory:
1961–1976. IEEE Trans. Inform. Theory, IT-23:1–37, 1977.
[145] S. R. S. Varadhan. Large Deviations and Applications. Society for Industrial and Applied Mathematics, Philadelphia, 1984.
[146] L. N. Vasershtein. Markov processes on countable product space describing
large systems of automata. Problemy Peredachi Informatsii, 5:64–73, 1969.
[147] A. J. Viterbi and J. K. Omura. Principles of Digital Communication and Coding. McGraw-Hill, New York, 1979.
[148] J. von Neumann. Zur Operatorenmethode in der klassischen Mechanik.
Ann. of Math., 33:587–642, 1932.
[149] P. Walters. Ergodic Theory: Introductory Lectures. Lecture Notes in Mathematics No. 458. Springer-Verlag, New York, 1975.
[150] E. J. Weldon, Jr. and W. W. Peterson. Error Correcting Codes. MIT
Press, Cambridge, Mass., 1971. Second Ed.
[151] K. Winkelbauer. Communication channels with finite past history. Transactions of the Second Prague Conf. on Information Theory, Decision Functions, and Random Processes, pages 685–831, 1960.
[152] J. Wolfowitz. Strong converse of the coding theorem for the general discrete finite-memory channel. Inform. and Control, 3:89–93, 1960.
[153] J. Wolfowitz. Coding Theorems of Information Theory. Springer-Verlag, New York, 1978. Third edition.
[154] A. Wyner. A definition of conditional mutual information for arbitrary ensembles. Inform. and Control, pages 51–59, 1978.