Preface
Once in a while you get shown the light,
In the strangest of places if you look at it right.
Grateful Dead

The first and most obvious use for this book is as a textbook for a one-year
graduate course in probability taught to students who are familiar with measure
theory. An Appendix, which gives complete proofs of the results from measure
theory we need, is provided so that the book can be used whether or not the
students are assumed to be familiar with measure theory.
The title of the book indicates that as we develop the theory, we will focus
our attention on examples. Hoping that the book would be a useful reference
for people who apply probability in their work, we have tried to emphasize the
results that can be used to solve problems.
Exercises are integrated into the text because they are an integral part of
it. In general, the exercises embedded in the text can be done immediately
using the material just presented, and the reader should do these exercises to
check her understanding and prepare for later developments. Exercises at the
end of the section present extensions of the results and various complements.
Changes in the Second and Third Edition. The second edition, published in 1995, brought four major changes: (i) More than 500 typographical
errors were corrected. (ii) More details were added to many proofs to make
them easier to understand. For example, Chapter 1 grew from 63 to 78 pages.
(iii) Some sections were rearranged or divided into subsections. (iv) Last, and
most important, I worked all the problems and prepared a solutions manual.
While the second edition was an improvement over the first, several hundred, mostly minor, typos remained. In this, the third edition, I have concentrated on correcting errors and adding a few lines here and there where I
couldn’t ﬁgure out what the author had in mind. I am grateful to Antal Jarai,
who sorted through a two-inch-thick folder of emails and made the first typo
list that was posted on my web page in the summer of 2000.
With the third edition, the book enters the Duxbury Classics series, where
it will hopefully live long, prosper, and be reasonably inexpensive. I would
like to express my appreciation to my editor Carolyn Crockett for her help in
making this happen. I don’t plan on doing a fourth edition, but small errors can
be corrected in future reprints, so keep those emails coming to [email protected].
Acknowledgements. I am always grateful to the many people who sent
me comments and typos. Helping to correct the ﬁrst edition were David Aldous, Ken Alexander, Daren Cline, Ted Cox, Robert Dalang, Joe Glover, David
Griﬀeath, Phil Griﬃn, Joe Horowitz, Olav Kallenberg, Jim Kuelbs, Robin Pemantle, Yuval Peres, Ken Ross, Byron Schmuland, Steve Samuels, Jon Wellner,
and Ruth Williams.
The third edition beneﬁtted from input from Manel Baucells, Eric Blair,
Zhen-Qing Chen, Ted Cox, Bradford Crain, Winston Crandall, Finn Christensen, Amir Dembo, Neil Falkner, Changyong Feng, Brighten Godfrey, Boris
Granovsky, Jan Hannig, Andrew Hayen, Martin Hildebrand, Kyoungmun Jang,
Anatole Joﬀe, Daniel Kifer, Steve Krone, Greg Lawler, T.Y. Lee, Shlomo Levental, Torgny Lindvall, Arif Mardin, Carl Mueller, Robin Pemantle, Yuval Peres,
Mark Pinsky, Ross Pinsky, Boris Pittel, David Pokorny, Vinayak Prabhu, Brett
Presnell, Jim Propp, Yossi Schwarzfuchs, Rami Shakarchi, Lian Shen, Marc
Shivers, Rich Sowers, Bob Strain, Tsachy Weissman, and Hao Zhang.
Family Update. Turning to the home front, where the date is March
2003, David and Greg are now 16 and 14. Life is the same and it is diﬀerent. The
game console (now the Nintendo Game Cube) and the computer games have
changed, in the latter case from being exclusively on their PCs to primarily
being found on or played over the Internet, but as I write this we are again
waiting for the latest Legend of Zelda game to be released. High school brings
new challenges to David and Greg inside and outside the classroom. David
works on the Ithaca High School paper and has an internship at our local
paper, the Ithaca Journal. Greg plays clarinet and golf with his father. I am
looking forward to this summer when he can teach me how to program in Java.
Most of Greg and David’s achievements would not be possible without their
mother, who drives them to their lessons and jobs, makes sure they do their
homework, and helps clear up problems when they arise. In between, she frets
about the damage to her trees from this January’s ice storm and bides her time
waiting for Spring by having the kitchen redone. It is impossible to encapsulate
23 years of married life into a ten-second sound bite. However, the recent meltdown of two 20+ year marriages involving people we know well has given me a
new appreciation for her many fine qualities. The three editions of this book
(including the courses I taught in their preparation) represent a total of four to
ﬁve years of my life. I hope you enjoy and learn from this the “ﬁnal” version.
Rick Durrett

Contents

Introductory Lecture

1 Laws of Large Numbers
   1. Basic Definitions
   2. Random Variables
   3. Expected Value
      a. Inequalities
      b. Integration to the limit
      c. Computing expected values
   4. Independence
      a. Sufficient conditions for independence
      b. Independence, distribution, and expectation
      c. Constructing independent random variables
   5. Weak Laws of Large Numbers
      a. L^2 weak laws
      b. Triangular arrays
      c. Truncation
   6. Borel-Cantelli Lemmas
   7. Strong Law of Large Numbers
   *8. Convergence of Random Series
   *9. Large Deviations

2 Central Limit Theorems
   1. The De Moivre-Laplace Theorem
   2. Weak Convergence
      a. Examples
      b. Theory
   3. Characteristic Functions
      a. Definition, inversion formula
      b. Weak convergence
      c. Moments and derivatives
      *d. Polya's criterion
      *e. The moment problem
   4. Central Limit Theorems
      a. i.i.d. sequences
      b. Triangular arrays
      *c. Prime divisors (Erdös-Kac)
      *d. Rates of convergence (Berry-Esseen)
   *5. Local Limit Theorems
   6. Poisson Convergence
      a. Basic limit theorem
      b. Two examples with dependence
      c. Poisson processes
   *7. Stable Laws
   *8. Infinitely Divisible Distributions
   *9. Limit theorems in R^d

3 Random Walks
   1. Stopping Times
   2. Recurrence
   *3. Visits to 0, Arcsine Laws
   *4. Renewal Theory

4 Martingales
   1. Conditional Expectation
      a. Examples
      b. Properties
      *c. Regular conditional probabilities
   2. Martingales, Almost Sure Convergence
   3. Examples
      a. Bounded increments
      b. Polya's urn scheme
      c. Radon-Nikodym derivatives
      d. Branching processes
   4. Doob's Inequality, L^p Convergence
      * Square integrable martingales
   5. Uniform Integrability, Convergence in L^1
   6. Backwards Martingales
   7. Optional Stopping Theorems

5 Markov Chains
   1. Definitions and Examples
   2. Extensions of the Markov Property
   3. Recurrence and Transience
   4. Stationary Measures
   5. Asymptotic Behavior
      a. Convergence theorems
      *b. Periodic case
      *c. Tail σ-field
   *6. General State Space
      a. Recurrence and transience
      b. Stationary measures
      c. Convergence theorem
      d. GI/G/1 queue

6 Ergodic Theorems
   1. Definitions and Examples
   2. Birkhoff's Ergodic Theorem
   3. Recurrence
   *4. Mixing
   *5. Entropy
   *6. A Subadditive Ergodic Theorem
   *7. Applications

7 Brownian Motion
   1. Definition and Construction
   2. Markov Property, Blumenthal's 0-1 Law
   3. Stopping Times, Strong Markov Property
   4. Maxima and Zeros
   5. Martingales
   6. Donsker's Theorem
   *7. CLT's for Dependent Variables
      a. Martingales
      b. Stationary sequences
      c. Mixing properties
   *8. Empirical Distributions, Brownian Bridge
   *9. Laws of the Iterated Logarithm

Appendix: Measure Theory
   1. Lebesgue-Stieltjes Measures
   2. Carathéodory's Extension Theorem
   3. Completion, etc.
   4. Integration
   5. Properties of the Integral
   6. Product Measures, Fubini's Theorem
   7. Kolmogorov's Extension Theorem
   8. Radon-Nikodym Theorem
   9. Differentiating Under the Integral

References

Notation

Normal Table

Index

Introductory Lecture

As Breiman should have said in his preface, “Probability theory has a right
and a left hand. On the left is the rigorous foundational work using the tools
of measure theory. The right hand ‘thinks probabilistically,’ reduces problems
to gambling situations, coin-tossing, and motions of a physical particle.” We
have interchanged Breiman’s hands in the quote because we learned in a high
school English class that the left hand is sinister and the right is dexterous.
While measure theory does not, as the dictionary says, “threaten harm, evil
or misfortune,” it is an unfortunate fact that we will need four sections of
deﬁnitions before we come to the ﬁrst interesting result. To motivate the reader
for this necessary foundational work, we will now give some previews of coming
attractions.
For a large part of the ﬁrst two chapters, we will be concerned with the laws
of large numbers and the central limit theorem. To introduce these theorems
and to illustrate their use, we will begin by giving their interpretation for a
person playing roulette. In doing this, we will use some terms (e.g. independent,
mean, variance) without explaining them. If some of the words that we use are
unfamiliar, don’t worry. There will be more than enough deﬁnitions when the
time comes.
A roulette wheel has 38 slots: 18 red, 18 black, and 2 green ones that are
numbered 0 and 00. Thus, if our gambler bets $1 on red coming up, he wins
$1 with probability 18/38 and loses $1 with probability 20/38. Let X1 , X2 , . . .
be the outcomes of the ﬁrst, second, and subsequent bets. If the house and
gambler are honest, X1 , X2 , . . . are independent random variables and each has
the same distribution, namely P (Xi = 1) = 9/19 and P (Xi = −1) = 10/19.
The gambler’s main interest is in what we can tell him about the amount he has
won at time n: Sn = X1 + · · · + Xn .
The ﬁrst facts we can tell him are that (i) the average amount of money
he will win on one play ( = the mean of X1 and denoted EX1 ) is
(9/19) · $1 + (10/19) · (−$1) = −$1/19 = −$.05263
and (ii) on the average after n plays his winnings will be ESn = nEX1 =
−$n/19. For most values of n, the probability of having lost exactly n/19 dollars
is zero, so the next question to be answered is: How close will his experience
be to the average? The ﬁrst answer is provided by the
Weak Law of Large Numbers. If X1 , X2 , . . . are independent and identically
distributed random variables with mean EX1 = µ, then for all ε > 0

P(|Sn/n − µ| > ε) → 0   as n → ∞
Less formally, if n is large then Sn /n is close to µ with high probability.
This result provides some information but leaves several questions unanswered. The ﬁrst one is: If our gambler was statistically minded and wrote
down the values of Sn /n, would the resulting sequence of numbers converge to
−1/19? The answer to this question is given by the
Strong Law of Large Numbers. If X1 , X2 , . . . are independent and identically distributed random variables with mean EXi = µ then with probability
one, Sn /n converges to µ.
An immediate consequence of the last result of interest to our gambler is that
with probability one Sn → −∞ as n → ∞. That is, the gambler will eventually
go bankrupt no matter how much money he starts with.
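The strong law is easy to watch in action. The sketch below (an illustration of mine, not part of the text) simulates n independent $1 bets on red and reports the average winnings per play, which settles near µ = −1/19 ≈ −0.0526:

```python
import random

def roulette_average(n, seed=0):
    """Average winnings S_n / n over n independent $1 bets on red:
    win +1 with probability 18/38, lose -1 with probability 20/38."""
    rng = random.Random(seed)
    s = 0
    for _ in range(n):
        s += 1 if rng.random() < 18 / 38 else -1
    return s / n

print(roulette_average(10**6))  # close to -1/19 = -0.05263...
```

By the strong law the output approaches −1/19 as n grows; the size of the remaining fluctuation, of order n^{−1/2}, is what the central limit theorem quantifies.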
The laws of large numbers tell us what happens in the long run but do not
provide much information about what happens over the short run. That
gap is ﬁlled by the
Central Limit Theorem. If X1, X2, . . . are independent and identically distributed random variables with mean EXi = µ and variance σ^2 = E(Xi − µ)^2,
then for any y,

P((Sn − nµ)/(σ n^{1/2}) ≤ y) → N(y)

where N(y) = ∫_{−∞}^y (2π)^{−1/2} e^{−x^2/2} dx is the (standard) normal distribution.

If we let χ denote a random variable with a normal distribution, then the
last conclusion can be written informally as

Sn ≈ nµ + σ n^{1/2} χ
In the example we have been considering, µ = −1/19 and

σ^2 = (9/19)(1 + 1/19)^2 + (10/19)(−1 + 1/19)^2 = 1 − (1/19)^2 = .9972
If we use σ^2 ≈ 1 to simplify the arithmetic, then the central limit theorem tells
us
Sn ≈ −n/19 + n^{1/2} χ
or when n = 100,
S100 ≈ −5.26 + 10χ
If we are interested in the probability that S100 ≥ 0, this is
P (−5.26 + 10χ ≥ 0) = P (χ ≥ .526) ≈ .30
from the table of the normal distribution at the back of the book.
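The table lookup is easy to reproduce in code, using the identity P(χ ≥ y) = erfc(y/√2)/2 for a standard normal χ (a sketch of mine, not part of the text):

```python
import math

def normal_tail(y):
    """P(chi >= y) for a standard normal chi, via erfc(y / sqrt(2)) / 2."""
    return math.erfc(y / math.sqrt(2)) / 2

# P(S_100 >= 0) ~ P(-5.26 + 10*chi >= 0) = P(chi >= 0.526)
print(round(normal_tail(0.526), 4))  # about 0.30
```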
The last result shows that after 100 plays the negative drift is not too
noticeable. The gambler has lost $5.26 on the average and has a probability .3
of being ahead. To see why casinos make money, suppose there are 100 gamblers
playing 100 times and set n = 10, 000 to get
S10,000 ≈ −526 + 100χ
Now P (χ ≤ 2.3) = .99 so with that probability S10,000 ≤ −296, that is, the
casino is slowly but surely making money.

1 Laws of Large Numbers

In the first three sections, we will recall some definitions and results from measure theory. Our purpose is not only to review that material but also to introduce the terminology of probability theory, which differs slightly from that of
measure theory. In Section 1.4, we introduce the crucial concept of independence and explore its properties. In Section 1.5, we prove the weak law of large
numbers and give several applications. In Section 1.6, we prove some Borel-Cantelli lemmas to prepare for the proof of the strong law of large numbers
in Section 1.7. In Section 1.8, we investigate the convergence of random series
that leads to estimates on the rate of convergence in the law of large numbers.
Finally, in Section 1.9, we show that in nice situations convergence in the weak
law occurs exponentially rapidly.

1.1. Basic Definitions
Here and throughout the book, terms being deﬁned are set in boldface. We
begin with the most basic quantity. A probability space is a triple (Ω, F , P )
where Ω is a set of “outcomes,” F is a set of “events,” and P : F → [0, 1] is
a function that assigns probabilities to events. We assume that F is a σ-field
(or σ-algebra), i.e., a (nonempty) collection of subsets of Ω that satisfies
(i) if A ∈ F then A^c ∈ F, and
(ii) if Ai ∈ F is a countable sequence of sets then ∪i Ai ∈ F.
Here and in what follows, countable means finite or countably infinite. Since
∩i Ai = (∪i Ai^c)^c, it follows that a σ-field is closed under countable intersections.
We omit the last property from the definition to make it easier to check.
Without P , (Ω, F ) is called a measurable space, i.e., it is a space on
which we can put a measure. A measure is a nonnegative countably additive
set function; that is, a function µ : F → R with
(i) µ(A) ≥ µ(∅) = 0 for all A ∈ F, and
(ii) if Ai ∈ F is a countable sequence of disjoint sets, then

µ(∪i Ai) = Σ_i µ(Ai)

If µ(Ω) = 1, we call µ a probability measure. In this book, probability
measures are usually denoted by P . The next exercise gives some consequences
of the deﬁnition that we will need later. In all cases, we assume that the sets
we mention are in F . For (i) one needs to know that B − A = B ∩ Ac . For (iv),
it is useful to note that (ii) of the deﬁnition with A1 = A and A2 = Ac implies
P (Ac ) = 1 − P (A).
Exercise 1.1. Let P be a probability measure on (Ω, F).
(i) monotonicity. If A ⊂ B then P(B) − P(A) = P(B − A) ≥ 0.
(ii) subadditivity. If A_m ∈ F for m ≥ 1 and A ⊂ ∪_{m=1}^∞ A_m then P(A) ≤ Σ_{m=1}^∞ P(A_m).
(iii) continuity from below. If Ai ↑ A (i.e., A1 ⊂ A2 ⊂ . . . and ∪i Ai = A)
then P (Ai ) ↑ P (A).
(iv) continuity from above. If Ai ↓ A (i.e., A1 ⊃ A2 ⊃ . . . and ∩i Ai = A)
then P (Ai ) ↓ P (A).
Some examples of probability measures should help to clarify the concept. We
leave it to the reader to check that they are examples, i.e., F is a σ ﬁeld and P
is a probability measure.
Example 1.1. Discrete probability spaces. Let Ω = a countable set, i.e.,
ﬁnite or countably inﬁnite. Let F = the set of all subsets of Ω. Let
P(A) = Σ_{ω∈A} p(ω)   where p(ω) ≥ 0 and Σ_{ω∈Ω} p(ω) = 1

A little thought reveals that this is the most general probability measure on
this space. In many cases when Ω is a finite set, we have p(ω) = 1/|Ω| where
|Ω| = the number of points in Ω. Concrete examples in this category are:
a. ﬂipping a fair coin: Ω = { Heads, Tails }
b. rolling a die: Ω = {1, 2, 3, 4, 5, 6}
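In code, such a space is just a table of weights summing to 1. A minimal sketch (illustrative, using the fair-die example b):

```python
# Fair die: Omega = {1,...,6}, p(w) = 1/6, and P(A) = sum of p(w) for w in A
p = {w: 1 / 6 for w in range(1, 7)}

def P(A):
    """P(A) = sum_{w in A} p(w) for an event A, a subset of Omega."""
    return sum(p[w] for w in A)

print(P({2, 4, 6}))  # probability of an even roll, 1/2 up to rounding
```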
Example 1.2. Real line and unit interval. Let R = the real line, R =
the Borel sets = the smallest σ ﬁeld containing the open sets, λ = Lebesgue
measure = the only measure on R with λ((a, b]) = b − a for all a < b. The
construction of Lebesgue measure is carried out in Section 1 of the Appendix.
λ(R) = ∞. To get a probability space, let Ω = (0, 1), F = {A ∩ (0, 1) : A ∈ R}
and P (B ) = λ(B ) for B ∈ F . P is Lebesgue measure restricted to the Borel
subsets of (0,1).
Exercise 1.2. (i) If Fi, i ∈ I are σ-fields then ∩_{i∈I} Fi is as well. Here I ≠ ∅ is an
arbitrary index set (i.e., possibly uncountable). (ii) Use the result in (i) to
show that if we are given a set Ω and a collection A of subsets of Ω, then there is
a smallest σ-field containing A. We will call this the σ-field generated by A
and denote it by σ(A).
Example 1.3. Product spaces. If (Ωi, Fi, Pi), i = 1, . . . , n, are probability
spaces, we can let Ω = Ω1 × · · · × Ωn = {(ω1, . . . , ωn) : ωi ∈ Ωi} and F =
F1 × · · · × Fn = the σ-field generated by {A1 × · · · × An : Ai ∈ Fi}. Let
P = P1 × · · · × Pn = the measure on F that has
P (A1 × · · · × An ) = P1 (A1 ) · P2 (A2 ) · · · Pn (An )
For more details, see Section 6 of the Appendix. Concrete examples of product
spaces are:
a. Roll two dice. Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}, F = all subsets of Ω,
P(A) = |A|/36.
b. Unit cube. If Ωi = (0, 1), Fi = the Borel sets, and Pi =Lebesgue measure,
then the product space deﬁned above is the unit cube Ω = (0, 1)n , F = the
Borel subsets of Ω, and P is n-dimensional Lebesgue measure restricted to F.
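Example a can be checked by brute force: since P(A) = |A|/36, every probability is a count over the 36 outcomes. An illustrative sketch:

```python
from itertools import product

# Omega = {1,...,6} x {1,...,6} with the uniform measure P(A) = |A| / 36
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = |A| / |Omega|, where A = {w in Omega : event(w)}."""
    return len([w for w in omega if event(w)]) / len(omega)

print(prob(lambda w: w[0] + w[1] == 7))  # 6/36 = 1/6
```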
Exercise 1.3. Let Rn = {(x1 , . . . , xn ) : xi ∈ R}. Rn = the Borel subsets
of Rn is deﬁned to be the σ ﬁeld generated by the open subsets of Rn . Prove
this is the same as R × · · · × R = the σ-field generated by sets of the form A1 ×
· · · × An . Hint: Show that both σ ﬁelds coincide with the one generated by
(a1 , b1 ) × · · · × (an , bn ).
Probability spaces become a little more interesting when we deﬁne random
variables on them. A real valued function X deﬁned on Ω is said to be a
random variable if for every Borel set B ⊂ R we have
X −1 (B ) = {ω : X (ω ) ∈ B } ∈ F
When we need to emphasize the σ ﬁeld, we will say that X is F measurable
or write X ∈ F. If Ω is a discrete probability space (see Example 1.1), then any
function X : Ω → R is a random variable. A second trivial, but useful, type of
example of a random variable is the indicator function of a set A ∈ F :
1_A(ω) = 1 if ω ∈ A,   0 if ω ∉ A

The notation is supposed to remind you that this function is 1 on A. Analysts
call this object the characteristic function of A. In probability, that term is
used for something quite diﬀerent. (See Section 2.3.)
If X is a random variable, then X induces a probability measure on R
called its distribution by setting µ(A) = P (X ∈ A) for Borel sets A. Using the
notation introduced above, the righthand side can be written as P (X −1 (A)).
In words, we pull A ∈ R back to X −1 (A) ∈ F and then take P of that set.
To check that µ is a probability measure we observe that if the Ai are
disjoint then using the deﬁnition of µ; the fact that X lands in the union if and
only if it lands in one of the Ai ; the fact that if the sets Ai ∈ R are disjoint
then the events {X ∈ Ai } are disjoint; and the deﬁnition of µ again; we have:
µ(∪i Ai) = P(X ∈ ∪i Ai) = P(∪i {X ∈ Ai}) = Σ_i P(X ∈ Ai) = Σ_i µ(Ai)

The distribution of a random variable X is usually described by giving its
distribution function, F (x) = P (X ≤ x).
(1.1) Theorem. Any distribution function F has the following properties:
(i) F is nondecreasing.
(ii) limx→∞ F (x) = 1, limx→−∞ F (x) = 0.
(iii) F is right continuous, i.e. limy↓x F (y ) = F (x).
(iv) If F(x−) = lim_{y↑x} F(y) then F(x−) = P(X < x).
(v) P (X = x) = F (x) − F (x−).
Proof To prove (i), note that if x ≤ y then {X ≤ x} ⊂ {X ≤ y }, and then
use (i) in Exercise 1.1 to conclude that P (X ≤ x) ≤ P (X ≤ y ).
To prove (ii), we observe that if x ↑ ∞, then {X ≤ x} ↑ Ω, while if x ↓ −∞
then {X ≤ x} ↓ ∅ and then use (iii) and (iv) of Exercise 1.1.
To prove (iii), we observe that if y ↓ x, then {X ≤ y} ↓ {X ≤ x}.
To prove (iv), we observe that if y ↑ x, then {X ≤ y } ↑ {X < x}.
For (v), note P(X = x) = P(X ≤ x) − P(X < x) and use (iii) and (iv).
The next result shows that we have found more than enough properties to
characterize distribution functions.
(1.2) Theorem. If F satisﬁes (i), (ii), and (iii) in (1.1), then it is the distribution function of some random variable.
Proof Let Ω = (0, 1), F = the Borel sets, and P = Lebesgue measure. If
ω ∈ (0, 1), let
X (ω ) = sup{y : F (y ) < ω }
Once we show that
(⋆)   {ω : X(ω) ≤ x} = {ω : ω ≤ F(x)}

the desired result follows immediately since P(ω : ω ≤ F(x)) = F(x). (Recall P
is Lebesgue measure.) To check (⋆), we observe that if ω ≤ F(x) then X(ω) ≤ x,
since x ∉ {y : F(y) < ω}. On the other hand, if ω > F(x), then since F is right
continuous, there is an ε > 0 so that F(x + ε) < ω and X(ω) ≥ x + ε > x.
Even though F may not be 1-1 and onto, we will call X the inverse of F and
denote it by F −1 . The scheme in the proof of (1.2) is useful in generating random variables on a computer. Standard algorithms generate random variables
U with a uniform distribution, then one applies the inverse of the distribution
function deﬁned in (1.2) to get a random variable F −1 (U ) with distribution
function F .
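This recipe is easy to try. In the sketch below (my illustration; the distribution choice is arbitrary) we take the exponential distribution F(x) = 1 − e^{−x} of Example 1.5, whose inverse is F^{−1}(u) = −log(1 − u), and push uniform samples through it:

```python
import math
import random

def exp_inverse_cdf(u):
    """F^{-1}(u) = -log(1 - u) for F(x) = 1 - e^{-x}, with 0 <= u < 1."""
    return -math.log(1.0 - u)

rng = random.Random(1)
samples = [exp_inverse_cdf(rng.random()) for _ in range(200_000)]
print(sum(samples) / len(samples))  # sample mean, close to E X = 1
```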
An immediate consequence of (1.2) is
(1.3) Corollary. If F satisﬁes (i), (ii), and (iii) in (1.1), there is a unique
probability measure µ on (R, R) that has µ((a, b]) = F (b) − F (a) for all a, b.
Proof (1.2) gives the existence of a random variable X with distribution function F . The measure it induces on (R, R) is the desired µ. There is only one
measure associated with a given F because the sets (a, b] are closed under intersection and generate the σ ﬁeld. (See (2.2) in the Appendix.)
If X and Y induce the same distribution µ on (R, R) we say X and Y are equal
in distribution. In view of (1.3), this holds if and only if X and Y have the
same distribution function, i.e., P (X ≤ x) = P (Y ≤ x) for all x. When X and
Y have the same distribution, we like to write
X =^d Y
but this is too tall to use in text, so for typographical reasons we will also use
X =d Y.
When the distribution function F(x) = P(X ≤ x) has the form

(∗)   F(x) = ∫_{−∞}^{x} f(y) dy

we say that X has density function f. In remembering formulas, it is often
useful to think of f(x) as being P(X = x), although

P(X = x) = lim_{ε→0} ∫_{x−ε}^{x+ε} f(y) dy = 0

We can start with f and use (∗) to define F. In order to end up with a
distribution function, it is necessary and sufficient that f(x) ≥ 0 and ∫ f(x) dx =
1. Three examples that will be important in what follows are:
Example 1.4. Uniform distribution on (0,1). f (x) = 1 for x ∈ (0, 1), 0
otherwise. Distribution function:
F(x) = 0 for x ≤ 0,   F(x) = x for 0 ≤ x ≤ 1,   F(x) = 1 for x > 1

Example 1.5. Exponential distribution. f(x) = e^{−x} for x ≥ 0, 0 otherwise.
Distribution function:
F(x) = 0 for x ≤ 0,   F(x) = 1 − e^{−x} for x ≥ 0

Example 1.6. Standard normal distribution.
f(x) = (2π)^{−1/2} exp(−x^2/2)
In this case, there is no closed form expression for F (x), but we have the
following bounds that are useful for large x:
(1.4) Theorem. For x > 0,

(x^{−1} − x^{−3}) exp(−x^2/2) ≤ ∫_x^∞ exp(−y^2/2) dy ≤ x^{−1} exp(−x^2/2)

Proof  Changing variables y = x + z and using exp(−z^2/2) ≤ 1 gives

∫_x^∞ exp(−y^2/2) dy ≤ exp(−x^2/2) ∫_0^∞ exp(−xz) dz = x^{−1} exp(−x^2/2)

For the other direction, we observe

∫_x^∞ (1 − 3y^{−4}) exp(−y^2/2) dy = (x^{−1} − x^{−3}) exp(−x^2/2)
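Dividing (1.4) through by √(2π) gives bounds on the standard normal tail P(χ ≥ x), which can be compared with the exact value erfc(x/√2)/2. A numerical sanity check (illustrative code, not part of the text):

```python
import math

def tail_exact(x):
    """P(chi >= x) for a standard normal chi, via erfc(x / sqrt(2)) / 2."""
    return math.erfc(x / math.sqrt(2)) / 2

def tail_bounds(x):
    """The bounds of (1.4), divided by sqrt(2*pi) to bound P(chi >= x)."""
    c = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return (1 / x - 1 / x**3) * c, (1 / x) * c

lo, hi = tail_bounds(4.0)
print(lo <= tail_exact(4.0) <= hi)  # the bounds bracket the exact tail
```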
A distribution function on R is said to be absolutely continuous if it has
a density and singular if the corresponding measure is singular w.r.t. Lebesgue
measure. See Section 8 of the Appendix for more on these notions. An example
of a singular distribution is:
Example 1.7. Uniform distribution on the Cantor set. The Cantor set C
is deﬁned by removing (1/3, 2/3) from [0,1] and then removing the middle third
of each interval that remains. We deﬁne an associated distribution function by
setting F (x) = 0 for x ≤ 0, F (x) = 1 for x ≥ 1, F (x) = 1/2 for x ∈ [1/3, 2/3],
F (x) = 1/4 for x ∈ [1/9, 2/9], F (x) = 3/4 for x ∈ [7/9, 8/9], ... The function F
that results is called Lebesgue’s singular function because there is no f for
which (∗) holds. From the deﬁnition, it is immediate that the corresponding
measure has µ(C^c) = 0.
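F can be evaluated from the ternary expansion of x: read digits until the first 1, turning each digit 2 into a binary digit 1. A sketch (the function name and the truncation depth are my own choices):

```python
def cantor_F(x, depth=40):
    """Approximate Lebesgue's singular function on [0, 1]: scan the
    ternary digits of x, mapping digit 2 -> binary 1 and digit 0 -> 0,
    and stop (adding the current bit) at the first digit equal to 1."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    total, bit = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:        # x fell in a removed middle-third interval
            return total + bit
        total += bit * (digit // 2)
        bit /= 2
    return total

print(cantor_F(0.5))  # 0.5, since F = 1/2 on [1/3, 2/3]
```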
A probability measure P (or its associated distribution function) is said to
be discrete if there is a countable set S with P (S c ) = 0. The simplest example
of a discrete distribution is
Example 1.8. Point mass at 0. F(x) = 1 for x ≥ 0, F(x) = 0 for x < 0.
The next example shows that the distribution function associated with a discrete probability measure can be quite wild.
Example 1.9. Dense discontinuities. Let q1 , q2 , ... be an enumeration of
the rationals and let
F(x) = Σ_{i=1}^∞ 2^{−i} 1_{[qi,∞)}(x)

where 1_{[θ,∞)}(x) = 1 if x ∈ [θ, ∞), = 0 otherwise.
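Truncating the series gives a computable approximation to F. In the sketch below (illustrative; the short list merely stands in for the first terms of some enumeration q1, q2, . . .), exact rational arithmetic makes the jump sizes visible:

```python
from fractions import Fraction

# Stand-in for the first few terms of an enumeration of the rationals,
# with weight 2^{-i} attached to the i-th term (truncating the series).
qs = [Fraction(0), Fraction(1), Fraction(-1), Fraction(1, 2),
      Fraction(-1, 2), Fraction(2), Fraction(-2), Fraction(1, 3)]
weights = [Fraction(1, 2) ** (i + 1) for i in range(len(qs))]

def F_trunc(x):
    """Truncated F(x): sum of 2^{-i} over the listed q_i with q_i <= x."""
    return sum(w for q, w in zip(qs, weights) if q <= x)

# F jumps by exactly 2^{-4} = 1/16 at the 4th listed rational, 1/2:
print(F_trunc(Fraction(1, 2)) - F_trunc(Fraction(499, 1000)))  # prints 1/16
```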
Exercises
1.4. Let Ω = R, F = all sets A so that A or A^c is countable, P(A) = 0 in the
first case and = 1 in the second. Show that (Ω, F, P) is a probability space.
1.5. A σ ﬁeld F is said to be countably generated if there is a countable
collection C ⊂ F so that σ(C) = F. Show that R^d is countably generated.
1.6. Suppose X and Y are random variables on (Ω, F , P ) and let A ∈ F . Show
that if we let Z (ω ) = X (ω ) for ω ∈ A and Z (ω ) = Y (ω ) for ω ∈ Ac , then Z is
a random variable.
1.7. Let χ have the standard normal distribution. Use (1.4) to get upper and
lower bounds on P (χ ≥ 4).
1.8. Show that a distribution function has at most countably many discontinuities.
1.9. Show that if F (x) = P (X ≤ x) is continuous then Y = F (X ) has a uniform
distribution on (0,1), that is, if y ∈ [0, 1], P (Y ≤ y ) = y.
1.10. Suppose X has continuous density f, P(α ≤ X ≤ β) = 1, and g is
a function that is strictly increasing and differentiable on (α, β). Then g(X)
has density f(g^{−1}(y))/g′(g^{−1}(y)) for y ∈ (g(α), g(β)) and 0 otherwise. When
g(x) = ax + b with a > 0, the answer is f((y − b)/a)/a.
1.11. Suppose X has a normal distribution. Use the previous exercise to compute the density of exp(X ). (The answer is called the lognormal distribution.)
1.12. (i) Suppose X has density function f . Compute the distribution function
of X 2 and then diﬀerentiate to ﬁnd its density function. (ii) Work out the
answer when X has a standard normal distribution to find the density of the
chi-square distribution.

1.2. Random Variables
In this section, we will develop some results that will help us later to prove that
quantities we deﬁne are random variables, i.e., they are measurable. Since most
of what we have to say is true for random elements of an arbitrary measurable
space (S, S ) and the proofs are the same (sometimes easier), we will develop
our results in that generality. First we need a deﬁnition. A function X : Ω → S
is said to be a measurable map from (Ω, F ) to (S, S ) if
X^{−1}(B) ≡ {ω : X(ω) ∈ B} ∈ F   for all B ∈ S

If (S, S) = (R^d, R^d) and d > 1, then X is called a random vector. Of course,
if d = 1, X is called a random variable, or r.v. for short.
The next result is useful for proving that maps are measurable.
(2.1) Theorem. If {ω : X (ω ) ∈ A} ∈ F for all A ∈ A and A generates S
(i.e., S is the smallest σ-field that contains A), then X is measurable.
Proof Writing {X ∈ B } as shorthand for {ω : X (ω ) ∈ B }, we have
{X ∈ ∪i Bi} = ∪i {X ∈ Bi}     {X ∈ B^c} = {X ∈ B}^c

So the class of sets B = {B : {X ∈ B} ∈ F} is a σ-field. Since B ⊃ A and A
generates S, B ⊃ S.
It follows from the two equations displayed in the previous proof that if S
is a σ ﬁeld, then {{X ∈ B } : B ∈ S} is a σ ﬁeld. It is the smallest σ ﬁeld on
Ω that makes X a measurable map. It is called the σ ﬁeld generated by X
and denoted σ (X ).
Exercise 2.1. Show that if A generates S , then X −1 (A) ≡ {{X ∈ A} : A ∈ A}
generates σ (X ) = {{X ∈ B } : B ∈ S}.
Example 2.1. If (S, S ) = (R, R) then possible choices of A in (2.1) are
{(−∞, x] : x ∈ R} or {(−∞, x) : x ∈ Q} where Q = the rationals.
Example 2.2. If (S, S ) = (Rd , Rd ), a useful choice of A is
{(a1 , b1 ) × · · · × (ad , bd ) : −∞ < ai < bi < ∞}
or occasionally the larger collection of open sets.
(2.2) Theorem. If X : (Ω, F ) → (S, S ) and f : (S, S ) → (T, T ) are measurable
maps, then f(X) is a measurable map from (Ω, F) to (T, T).
Proof Let B ∈ T . {ω : f (X (ω )) ∈ B } = {ω : X (ω ) ∈ f −1 (B )} ∈ F , since by
assumption f −1 (B ) ∈ S .
From (2.2), it follows immediately that if X is a random variable then so
is cX for all c ∈ R, X 2 , sin(X ), etc. The next result shows why we wanted to
prove (2.2) for measurable maps.
(2.3) Theorem. If X1 , . . . Xn are random variables and f : (Rn , Rn ) → (R, R)
is measurable, then f (X1 , . . . , Xn ) is a random variable.
Proof In view of (2.2), it suﬃces to show that (X1 , . . . , Xn ) is a random
vector. To do this, we observe that if A1 , . . . , An are Borel sets then
{(X1, . . . , Xn) ∈ A1 × · · · × An} = ∩i {Xi ∈ Ai} ∈ F
Since sets of the form A1 × · · · × An generate Rn , (2.3) follows from (2.1).
(2.4) Corollary. If X1 , . . . , Xn are random variables then X1 + . . . + Xn is a
random variable.
Proof In view of (2.3) it suﬃces to show that f (x1 , . . . , xn ) = x1 + . . . + xn is
measurable. To do this, we use Example 2.1 and note that {x : x1 + . . . + xn < a}
is an open set and hence is in Rn .
By combining (2.4) with a remark after (2.2), we see that if X and Y are
random variables, then X − Y is. To get a feeling for the barehands approach
to proving measurability, try
Exercise 2.2. Prove (2.4) when n = 2 by checking {X1 + X2 < x} ∈ F .
(2.5) Theorem. If X1, X2, . . . are random variables then so are

inf_n Xn     sup_n Xn     lim sup_n Xn     lim inf_n Xn

Proof  Since the infimum of a sequence is < a if and only if some term is < a
(if all terms are ≥ a then the infimum is ≥ a), we have

{inf_n Xn < a} = ∪_n {Xn < a} ∈ F

A similar argument shows {sup_n Xn > a} = ∪_n {Xn > a} ∈ F. For the last
two, we observe

lim inf_{n→∞} Xn = sup_n (inf_{m≥n} Xm)       lim sup_{n→∞} Xn = inf_n (sup_{m≥n} Xm)

To complete the proof in the first case, note that Yn = inf_{m≥n} Xm is a random
variable for each n, so sup_n Yn is as well.
From (2.5), we see that
Ωo ≡ {ω : lim_{n→∞} Xn exists} = {ω : lim sup_{n→∞} Xn − lim inf_{n→∞} Xn = 0}

is a measurable set. (Here ≡ indicates that the first equality is a definition.)
If P(Ωo) = 1, we say that Xn converges almost surely, or a.s. for short.
This type of convergence is called almost everywhere convergence in measure theory. To have
a limit deﬁned on the whole space, it is convenient to let
X∞ = lim sup_{n→∞} Xn

but this random variable may take the value +∞ or −∞. To accommodate this
and some other headaches, we will generalize the deﬁnition of random variable.
A function whose domain is a set D ∈ F and whose range is R∗ ≡ [−∞, ∞]
is said to be a random variable if for all B ∈ R∗ we have X −1 (B ) = {ω :
X (ω ) ∈ B } ∈ F . Here R∗ = the Borel subsets of R∗ with R∗ given the usual
topology, i.e., the one generated by intervals of the form [−∞, a), (a, b) and
(b, ∞] where a, b ∈ R. The reader should note that the extended real line
(R∗ , R∗ ) is a measurable space, so all the results above generalize immediately.
Exercises
2.3. Show that if f is continuous and Xn → X almost surely then f (Xn ) →
f (X ) almost surely.
2.4. (i) Show that a continuous function from Rd → R is a measurable map
from (Rd , Rd ) to (R, R). (ii) Show that Rd is the smallest σ ﬁeld that makes
all the continuous functions measurable.
2.5. A function f is said to be lower semicontinuous or l.s.c. if
lim inf_{y→x} f(y) ≥ f(x)

and upper semicontinuous (u.s.c.) if −f is l.s.c. Show that f is l.s.c. if and
only if {x : f (x) ≤ a} is closed for each a ∈ R and conclude that semicontinuous
functions are measurable.
2.6. Let f : Rᵈ → R be an arbitrary function and let f^δ(x) = sup{f(y) : |y − x| < δ} and f_δ(x) = inf{f(y) : |y − x| < δ}, where |z| = (z₁² + · · · + z_d²)^{1/2}. Show that f^δ is l.s.c. and f_δ is u.s.c. Let f⁰ = lim_{δ↓0} f^δ and f₀ = lim_{δ↓0} f_δ, and conclude that the set of points at which f is discontinuous, {f⁰ ≠ f₀}, is measurable.
2.7. A function ϕ : Ω → R is said to be simple if

ϕ(ω) = Σ_{m=1}^{n} c_m 1_{A_m}(ω)

where the c_m are real numbers and A_m ∈ F. Show that the class of F-measurable functions is the smallest class containing the simple functions and closed under pointwise limits.

Chapter 1 Laws of Large Numbers
2.8. Use the previous exercise to conclude that Y is measurable with respect
to σ (X ) if and only if Y = f (X ) where f : R → R is measurable.
2.9. To get a constructive proof of the last result, note that {ω : m2⁻ⁿ ≤ Y < (m + 1)2⁻ⁿ} = {X ∈ B_{m,n}} for some B_{m,n} ∈ R, set f_n(x) = m2⁻ⁿ for x ∈ B_{m,n}, and show that as n → ∞, f_n(x) → f(x) and Y = f(X).

1.3. Expected Value
If X ≥ 0 is a random variable on (Ω, F, P), then we define its expected value to be EX = ∫ X dP, which always makes sense but may be ∞. (The integral
is deﬁned in Section 4 of the Appendix.) To reduce the general case to the
nonnegative case, let x+ = max{x, 0} be the positive part and let x− =
max{−x, 0} be the negative part of x. We declare that EX exists and set
EX = EX + − EX − whenever the subtraction makes sense, i.e., EX + < ∞ or
EX − < ∞.
EX is often called the mean of X and denoted by µ. EX is deﬁned by
integrating X , so it has all the properties that integrals do. From (4.5) and
(4.7) in the Appendix and the trivial observation that E (b) = b for any real
number b, we get the following:
Theorem. Suppose X, Y ≥ 0, or E|X|, E|Y| < ∞.

(3.1a) E(X + Y) = EX + EY.
(3.1b) E (aX + b) = aE (X ) + b for any real numbers a, b.
(3.1c) If X ≥ Y then EX ≥ EY .
Exercise 3.1. Suppose E|X|, E|Y| < ∞. Show that equality holds in (3.1c) if and only if X = Y a.s. Hint: Use (3.4) below.
Exercise 3.2. Suppose only that EX and EY exist. Show that (3.1c) always
holds; (3.1a) holds unless one expected value is ∞ and the other is −∞; and
(3.1b) holds unless a = 0 and EX is inﬁnite.
In this section, we will recall some properties of expected value and prove some
new ones. To organize things, we will divide the developments into three subsections.

a. Inequalities
Our ﬁrst two results are (5.1) and (5.2) from the Appendix.
(3.2) Jensen’s inequality. Suppose ϕ is convex, that is,
λϕ(x) + (1 − λ)ϕ(y ) ≥ ϕ(λx + (1 − λ)y )
for all λ ∈ (0, 1) and x, y ∈ R. Then
E (ϕ(X )) ≥ ϕ(EX )
provided both expectations exist, i.e., E|X| and E|ϕ(X)| < ∞.
Two useful special cases are

|EX| ≤ E|X|    and    (EX)² ≤ E(X²)

To recall the direction in which the inequality goes, note that if P(X = x) = λ and P(X = y) = 1 − λ, then

Eϕ(X) = λϕ(x) + (1 − λ)ϕ(y) ≥ ϕ(λx + (1 − λ)y) = ϕ(EX)
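As a quick numerical sanity check (not part of the original text), we can verify the direction of Jensen's inequality for the convex function ϕ(x) = x² on an arbitrarily chosen three-point distribution:

```python
# Check Jensen's inequality E phi(X) >= phi(EX) for the convex phi(x) = x**2
# on an arbitrary three-point distribution (values and weights made up).
values = [-1.0, 0.0, 2.0]
probs = [0.25, 0.25, 0.5]
assert abs(sum(probs) - 1.0) < 1e-12

EX = sum(p * x for x, p in zip(values, probs))           # EX = 0.75
E_phi = sum(p * x * x for x, p in zip(values, probs))    # E X^2 = 2.25
assert E_phi >= EX ** 2                                  # 2.25 >= 0.5625
```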
Exercise 3.3. Suppose ϕ is strictly convex, i.e., > holds for λ ∈ (0, 1). Show
that, under the assumptions of (3.2), ϕ(EX ) = Eϕ(X ) implies X = EX a.s.
Exercise 3.4. Suppose ϕ : Rn → R is convex. Imitate the proof of (5.1) in
the Appendix to show
Eϕ(X1 , . . . , Xn ) ≥ ϕ(EX1 , . . . , EXn )
provided E|ϕ(X1, . . . , Xn)| < ∞ and E|Xi| < ∞ for all i.
(3.3) Hölder's inequality. If p, q ∈ [1, ∞] with 1/p + 1/q = 1, then

E|XY| ≤ ||X||_p ||Y||_q

Here ||X||_r = (E|X|^r)^{1/r} for r ∈ [1, ∞); ||X||_∞ = inf{M : P(|X| > M) = 0}. The special case p = q = 2 is called the Cauchy-Schwarz inequality:

E|XY| ≤ (EX² · EY²)^{1/2}
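A small numerical illustration of the Cauchy-Schwarz case (not part of the original text), on an arbitrarily chosen discrete joint distribution:

```python
import math

# Check Cauchy-Schwarz: E|XY| <= (E X^2 E Y^2)^{1/2} on an arbitrary
# discrete joint distribution given as (x, y, probability) triples.
pairs = [(-1.0, 2.0, 0.2), (0.5, -1.0, 0.3), (2.0, 1.5, 0.5)]
assert abs(sum(p for _, _, p in pairs) - 1.0) < 1e-12

EXY = sum(p * abs(x * y) for x, y, p in pairs)   # E|XY|  = 2.05
EX2 = sum(p * x * x for x, y, p in pairs)        # E X^2  = 2.275
EY2 = sum(p * y * y for x, y, p in pairs)        # E Y^2  = 2.225
assert EXY <= math.sqrt(EX2 * EY2)               # 2.05 <= 2.2499...
```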
To state our next result, we need some notation. If we only integrate over A ⊂ Ω, we write

E(X; A) = ∫_A X dP

(3.4) Chebyshev's inequality. Suppose ϕ : R → R has ϕ ≥ 0, let A ∈ R, and let i_A = inf{ϕ(y) : y ∈ A}. Then

i_A P(X ∈ A) ≤ E(ϕ(X); X ∈ A) ≤ Eϕ(X)

Proof The definition of i_A and the fact that ϕ ≥ 0 imply that

i_A 1_{(X ∈ A)} ≤ ϕ(X) 1_{(X ∈ A)} ≤ ϕ(X)

Taking expected values and using (3.1c) gives the desired result.
Remark. Some authors call (3.4) Markov's inequality and use the name Chebyshev's inequality for the special case ϕ(x) = x² and A = {x : |x| ≥ a}:

(∗)    a² P(|X| ≥ a) ≤ EX²

Our next four exercises are concerned with how good (∗) is and with complements and converses. These constitute a digression from the main story and can be skipped without much loss.
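The bound (∗) is easy to check numerically; here is a sketch (not part of the original text) on an arbitrary discrete distribution:

```python
# Numerical check of (*): a^2 P(|X| >= a) <= E X^2 for a discrete X
# (the values and probabilities below are arbitrary illustrations).
values = [-2.0, -1.0, 0.0, 1.0, 3.0]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
assert abs(sum(probs) - 1.0) < 1e-12

EX2 = sum(p * x * x for x, p in zip(values, probs))     # E X^2 = 1.7

for a in [0.5, 1.0, 2.0, 3.0]:
    tail = sum(p for x, p in zip(values, probs) if abs(x) >= a)
    assert a * a * tail <= EX2 + 1e-12
```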
Exercise 3.5. (∗) is and is not sharp. (i) Show that (∗) is sharp by showing that if 0 < a ≤ b are fixed, there is an X with EX² = b² for which equality holds. (ii) Show that (∗) is not sharp by showing that if X has 0 < EX² < ∞, then

lim_{a→∞} a² P(|X| ≥ a)/EX² = 0

Exercise 3.6. One-sided Chebyshev bound. (i) Let a > b > 0, 0 < p < 1,
and let X have P (X = a) = p and P (X = −b) = 1 − p. Apply (3.4) to
ϕ(x) = (x + b)2 and conclude that if Y is any random variable with EY = EX
and var(Y ) = var(X ), then P (Y ≥ a) ≤ p and equality holds when Y = X .
(ii) Suppose EY = 0, var(Y ) = σ 2 , and a > 0. Show that P (Y ≥ a) ≤
σ 2 /(a2 + σ 2 ), and there is a Y for which equality holds.
Exercise 3.7. Two nonexistent lower bounds. Show that:
(i) if ε > 0, inf{P(|X| > ε) : EX = 0, var(X) = 1} = 0.
(ii) if y ≥ 1 and σ² ∈ (0, ∞), inf{P(|X| > y) : EX = 1, var(X) = σ²} = 0.

Exercise 3.8. A useful lower bound. Let Y ≥ 0 with EY² < ∞. Apply the Cauchy-Schwarz inequality to Y 1_{(Y > 0)} and conclude

P(Y > 0) ≥ (EY)² / EY²

b. Integration to the limit
There are three classic real analysis results, (5.4)–(5.6) in the Appendix, about
what happens when we interchange limits and integrals.
(3.5) Fatou’s lemma. If Xn ≥ 0 then lim inf n→∞ EXn ≥ E (lim inf n→∞ Xn ).
To recall the direction of the inequality, think of the special case Xn = n1(0,1/n)
(on the unit interval equipped with the Borel sets and Lebesgue measure). Here
Xn → 0 a.s. but EXn = 1 for all n.
(3.6) Monotone convergence theorem. If 0 ≤ Xn ↑ X then EXn ↑ EX.
This follows immediately from (3.5): Fatou's lemma gives lim inf_{n→∞} EXn ≥ E(lim inf_{n→∞} Xn) = EX, while Xn ≤ X and (3.1c) imply

lim sup_{n→∞} EXn ≤ EX

(3.7) Dominated convergence theorem. If Xn → X a.s., |Xn| ≤ Y for all n, and EY < ∞, then EXn → EX.

The special case of (3.7) in which Y is constant is called the bounded convergence theorem.
In the developments below, we will need another result on integration to
the limit. Perhaps the most important special case of this result occurs when
g (x) = xp with p > 1 and h(x) = x.
(3.8) Theorem. Suppose Xn → X a.s. Let g, h be continuous functions with (i) g ≥ 0 and g(x) > 0 when |x| is large, (ii) |h(x)|/g(x) → 0 as |x| → ∞, and (iii) Eg(Xn) ≤ K < ∞ for all n. Then Eh(Xn) → Eh(X).
Proof By subtracting a constant from h, we can suppose without loss of generality that h(0) = 0. Pick M large so that P(|X| = M) = 0 and g(x) > 0 when |x| ≥ M. Given a random variable Y, let Ȳ = Y 1_{(|Y| ≤ M)}. Since P(|X| = M) = 0, X̄n → X̄ a.s. Since h(X̄n) is bounded and h is continuous, it follows from the bounded convergence theorem that

(a)    Eh(X̄n) → Eh(X̄)

To control the effect of the truncation, we use the following:

(b)    |Eh(Ȳ) − Eh(Y)| ≤ E|h(Ȳ) − h(Y)| ≤ E(|h(Y)|; |Y| > M) ≤ ε_M Eg(Y)

where ε_M = sup{|h(x)|/g(x) : |x| ≥ M}. To check the second inequality, note that when |Y| ≤ M, Ȳ = Y, and we have supposed h(0) = 0. The third inequality follows from the definition of ε_M.

Taking Y = Xn in (b) and using (iii), it follows that

(c)    |Eh(X̄n) − Eh(Xn)| ≤ K ε_M

To estimate |Eh(X̄) − Eh(X)|, we observe that g ≥ 0 and g is continuous, so Fatou's lemma implies

Eg(X) ≤ lim inf_{n→∞} Eg(Xn) ≤ K

Taking Y = X in (b) gives

(d)    |Eh(X̄) − Eh(X)| ≤ K ε_M

The triangle inequality implies

|Eh(Xn) − Eh(X)| ≤ |Eh(Xn) − Eh(X̄n)| + |Eh(X̄n) − Eh(X̄)| + |Eh(X̄) − Eh(X)|

Taking limits and using (a), (c), (d), we have

lim sup_{n→∞} |Eh(Xn) − Eh(X)| ≤ 2K ε_M

which proves the desired result, since K < ∞ and ε_M → 0 as M → ∞.

A simple example shows that (3.8) can sometimes be applied when (3.7) cannot.
Exercise 3.9. Let Ω = (0, 1) equipped with the Borel sets and Lebesgue
measure. Let α ∈ (1, 2) and Xn = n^α 1_{(1/(n+1), 1/n)} → 0 a.s. Show that (3.8) can be applied with h(x) = x and g(x) = x^{2/α}, but the Xn are not dominated by an integrable function.

c. Computing Expected Values
Integrating over (Ω, F , P ) is nice in theory, but to do computations we have to
shift to a space on which we can do calculus. In most cases, we will apply the
next result with S = Rd .
(3.9) Change of variables formula. Let X be a random element of (S, S) with distribution µ, i.e., µ(A) = P(X ∈ A). If f is a measurable function from (S, S) to (R, R) so that f ≥ 0 or E|f(X)| < ∞, then

Ef(X) = ∫_S f(y) µ(dy)

Remark. To explain the name, write h for X and P ∘ h⁻¹ for µ to get

∫_Ω f(h(ω)) dP = ∫_S f(y) d(P ∘ h⁻¹)

Proof We will prove this result by verifying it in four increasingly general special cases that parallel the way the integral is defined (see Section 4 of the Appendix). The reader should note the method employed, since it will be used several times below.
Case 1: Indicator functions. If B ∈ S and f = 1_B, then recalling the relevant definitions shows

E 1_B(X) = P(X ∈ B) = µ(B) = ∫_S 1_B(y) µ(dy)

Case 2: Simple functions. Let f(x) = Σ_{m=1}^n c_m 1_{B_m}, where c_m ∈ R and B_m ∈ S. The linearity of expected value, the result of Case 1, and the linearity of integration imply

Ef(X) = Σ_{m=1}^n c_m E 1_{B_m}(X) = Σ_{m=1}^n c_m ∫_S 1_{B_m}(y) µ(dy) = ∫_S f(y) µ(dy)

Case 3: Nonnegative functions. If f ≥ 0 and we let

f_n(x) = ([2ⁿ f(x)]/2ⁿ) ∧ n

where [x] = the largest integer ≤ x and a ∧ b = min{a, b}, then the f_n are simple and f_n ↑ f, so using the result for simple functions and the monotone convergence theorem:

Ef(X) = lim_n Ef_n(X) = lim_n ∫_S f_n(y) µ(dy) = ∫_S f(y) µ(dy)

Case 4: Integrable functions. The general case now follows by writing f(x) = f(x)⁺ − f(x)⁻. The condition E|f(X)| < ∞ guarantees that Ef(X)⁺ and Ef(X)⁻ are finite. So using the result for nonnegative functions and the linearity of expected value and integration:

Ef(X) = Ef(X)⁺ − Ef(X)⁻ = ∫_S f(y)⁺ µ(dy) − ∫_S f(y)⁻ µ(dy) = ∫_S f(y) µ(dy)

For practice with the proof technique of (3.9), do
Exercise 3.10. Suppose that the probability measure µ has µ(A) = ∫_A f(x) dx for all A ∈ R. Then for any g with g ≥ 0 or ∫ |g(x)| µ(dx) < ∞, we have

∫ g(x) µ(dx) = ∫ g(x) f(x) dx

A consequence of (3.9) is that we can compute expected values of functions
of random variables by performing integrals on the real line. Before we can do
some examples, we need to introduce the terminology for what we are about
to compute. If k is a positive integer then EX k is called the kth moment
of X . The ﬁrst moment EX is usually called the mean and denoted by µ. If
EX 2 < ∞ then the variance of X is deﬁned to be var(X ) = E (X − µ)2 . To
compute the variance the following formula is useful:
(3.10a)    var(X) = E(X − µ)² = EX² − 2µEX + µ² = EX² − µ²

From this it is immediate that

(3.10b)    var(X) ≤ EX²

Here EX² is the expected value of X². When we want the square of EX, we
will write (EX )2 . Since E (aX + b) = aEX + b by (3.1b), it follows easily from
the deﬁnition that
(3.10c)    var(aX + b) = E(aX + b − E(aX + b))² = a² E(X − EX)² = a² var(X)

We turn now to concrete examples and leave the calculus in the first two examples to the reader. (Integrate by parts.)
Example 3.1. If X has a standard exponential distribution, with density e^{−x} for x ≥ 0, then

EX^k = ∫₀^∞ x^k e^{−x} dx = k!

So the mean of X is 1 and the variance is EX² − (EX)² = 2 − 1² = 1. If we let Y = X/λ, then by Exercise 1.10, Y has density λe^{−λy} for y ≥ 0, the exponential density with parameter λ. From (3.1b) and (3.10c), it follows that Y has mean 1/λ and variance 1/λ².
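The moment formula EX^k = k! can be checked numerically; the sketch below (not part of the original text) approximates the integral with a midpoint rule on a truncated range, where the truncation point and step count are arbitrary choices:

```python
import math

# Midpoint-rule check that E X^k = k! for the standard exponential
# (truncation at x = 50 and 100000 steps are arbitrary choices).
def exp_moment(k, upper=50.0, n=100000):
    h = upper / n
    return sum(((i + 0.5) * h) ** k * math.exp(-(i + 0.5) * h) * h
               for i in range(n))

for k in range(1, 5):
    assert abs(exp_moment(k) - math.factorial(k)) < 1e-3
```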
Example 3.2. If X has a standard normal distribution, then by symmetry

EX = ∫ x (2π)^{−1/2} exp(−x²/2) dx = 0

and

var(X) = EX² = ∫ x² (2π)^{−1/2} exp(−x²/2) dx = 1

If we let σ > 0, µ ∈ R, and Y = σX + µ, then (3.1b) and (3.10c) imply EY = µ and var(Y) = σ². By Exercise 1.10, Y has density

(2πσ²)^{−1/2} exp(−(y − µ)²/2σ²)

the normal density with mean µ and variance σ².
We will next consider some discrete distributions. The ﬁrst is ridiculously
simple, but we will need the result several times below, so we record it here.
Example 3.3. We say that X has a Bernoulli distribution with parameter
p if P (X = 1) = p and P (X = 0) = 1 − p. Clearly,
EX = p · 1 + (1 − p) · 0 = p
Since X² = X, we have EX² = EX = p and

var(X) = EX² − (EX)² = p − p² = p(1 − p)

Example 3.4. We say that X has a Poisson distribution with parameter λ if

P(X = k) = e^{−λ} λ^k / k!    for k = 0, 1, 2, . . .
To evaluate the moments of the Poisson random variable, we use a little inspiration to observe that for k ≥ 1

E(X(X − 1) · · · (X − k + 1)) = Σ_{j=k}^∞ j(j − 1) · · · (j − k + 1) e^{−λ} λ^j / j!
    = λ^k Σ_{j=k}^∞ e^{−λ} λ^{j−k} / (j − k)! = λ^k

where the equalities follow from the facts that (i) j(j − 1) · · · (j − k + 1) = 0 when j < k, (ii) we can cancel part of the factorial, and (iii) the Poisson distribution has total mass 1. Using the last formula, it follows that EX = λ, while

var(X) = EX² − (EX)² = E(X(X − 1)) + EX − λ² = λ
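The factorial-moment formula and var(X) = λ can be checked by summing the Poisson mass function directly; the sketch below (not part of the original text) truncates the series far into the tail, with λ and the cutoff chosen arbitrarily:

```python
import math

# Check E[X(X-1)...(X-k+1)] = lam**k and var(X) = lam for the Poisson
# distribution by direct summation (lam and the cutoff N are arbitrary).
lam = 2.5
N = 80
pmf = [math.exp(-lam)]
for j in range(1, N):
    pmf.append(pmf[-1] * lam / j)           # P(X = j) = e^{-lam} lam^j / j!

for k in range(1, 4):
    fact_moment = sum(math.prod(range(j - k + 1, j + 1)) * pmf[j]
                      for j in range(N))    # falling factorial j(j-1)...(j-k+1)
    assert abs(fact_moment - lam ** k) < 1e-9

mean = sum(j * pmf[j] for j in range(N))
second = sum(j * j * pmf[j] for j in range(N))
assert abs(mean - lam) < 1e-9
assert abs(second - mean ** 2 - lam) < 1e-9   # var(X) = lam
```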
Example 3.5. N is said to have a geometric distribution with success probability p ∈ (0, 1) if

P(N = k) = p(1 − p)^{k−1}    for k = 1, 2, . . .

N is the number of independent trials needed to observe an event with probability p. Differentiating the identity

Σ_{k=0}^∞ (1 − p)^k = 1/p

once, and then again (see Example 9.2 in the Appendix for the justification), gives

−Σ_{k=1}^∞ k(1 − p)^{k−1} = −1/p²    and    Σ_{k=2}^∞ k(k − 1)(1 − p)^{k−2} = 2/p³

From this it follows that

EN = Σ_{k=1}^∞ k p(1 − p)^{k−1} = 1/p

EN(N − 1) = Σ_{k=1}^∞ k(k − 1) p(1 − p)^{k−1} = 2(1 − p)/p²

var(N) = EN² − (EN)² = EN(N − 1) + EN − (EN)² = 2(1 − p)/p² + 1/p − 1/p² = (1 − p)/p²

Exercises
3.11. Inclusion-exclusion formula. Let A1, A2, . . . , An be events and A = ∪_{i=1}^n Ai. Prove that 1_A = 1 − Π_{i=1}^n (1 − 1_{Ai}). Expand out the right-hand side, then take expected values to conclude

P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − . . . + (−1)^{n−1} P(∩_{i=1}^n Ai)

3.12. Bonferroni inequalities. Let A1, A2, . . . , An be events and A = ∪_{i=1}^n Ai. Show that 1_A ≤ Σ_{i=1}^n 1_{Ai}, etc., and then take expected values to conclude

P(∪_{i=1}^n Ai) ≤ Σ_i P(Ai)

P(∪_{i=1}^n Ai) ≥ Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj)

P(∪_{i=1}^n Ai) ≤ Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak)

In general, if we stop the inclusion-exclusion formula after an even (odd) number of sums, we get a lower (upper) bound.
3.13. If E|X|^k < ∞, then for 0 < j < k, E|X|^j < ∞, and furthermore

E|X|^j ≤ (E|X|^k)^{j/k}
3.14. Apply Jensen's inequality with ϕ(x) = e^x and P(X = log y_m) = p(m) to conclude that if Σ_{m=1}^n p(m) = 1 and p(m), y_m > 0, then

Σ_{m=1}^n p(m) y_m ≥ Π_{m=1}^n y_m^{p(m)}

When p(m) = 1/n, this says the arithmetic mean exceeds the geometric mean.
3.15. If EX₁⁻ < ∞ and Xn ↑ X, then EXn ↑ EX.

3.16. Let X ≥ 0 but do NOT assume E(1/X) < ∞. Show

lim_{y→∞} y E(1/X; X > y) = 0    and    lim_{y↓0} y E(1/X; X > y) = 0

3.17. If Xn ≥ 0, then E(Σ_{n=0}^∞ Xn) = Σ_{n=0}^∞ EXn.

3.18. If X is integrable and An are disjoint sets with union A, then

Σ_{n=0}^∞ E(X; An) = E(X; A)

i.e., the sum converges absolutely and has the value on the right.

1.4. Independence
We begin with what is hopefully a familiar deﬁnition and then work our way
up to a deﬁnition that is appropriate for our current setting.
Two events A and B are independent if P (A ∩ B ) = P (A)P (B ).
Two random variables X and Y are independent if for all C, D ∈ R,
P (X ∈ C, Y ∈ D) = P (X ∈ C )P (Y ∈ D)
i.e., the events A = {X ∈ C } and B = {Y ∈ D} are independent.
Two σ ﬁelds F and G are independent if for all A ∈ F and B ∈ G the events
A and B are independent.
As the next exercise shows, the second deﬁnition is a special case of the third.
Exercise 4.1. (i) Show that if X and Y are independent then σ (X ) and σ (Y )
are. (ii) Conversely, if F and G are independent, X ∈ F , and Y ∈ G , then X
and Y are independent.
The ﬁrst deﬁnition is, in turn, a special case of the second.
Exercise 4.2. (i) Show that if A and B are independent then so are Ac and B ,
A and B c , and Ac and B c . (ii) Conclude that events A and B are independent
if and only if their indicator random variables 1A and 1B are independent.
In view of the fact that the ﬁrst deﬁnition is a special case of the second,
which is a special case of the third, we take things in the opposite order when we
say what it means for several things to be independent. We begin by reducing
to the case of ﬁnitely many objects. An inﬁnite collection of objects (σ ﬁelds,
random variables, or sets) is said to be independent if every ﬁnite subcollection
is.
σ fields F1, F2, . . . , Fn are independent if whenever Ai ∈ Fi for i = 1, . . . , n, we have

P(∩_{i=1}^n Ai) = Π_{i=1}^n P(Ai)

Random variables X1, . . . , Xn are independent if whenever Bi ∈ R for i = 1, . . . , n, we have

P(∩_{i=1}^n {Xi ∈ Bi}) = Π_{i=1}^n P(Xi ∈ Bi)
Sets A1, . . . , An are independent if whenever I ⊂ {1, . . . , n} we have

P(∩_{i∈I} Ai) = Π_{i∈I} P(Ai)

At first glance, it might seem that the last definition does not match the other
two. However, if you think about it for a minute, you will see that if the
indicator variables 1_{Ai}, 1 ≤ i ≤ n, are independent and we take Bi = {1} for i ∈ I and Bi = R for i ∉ I, then the condition in the definition results.
Conversely,
Exercise 4.3. Let A1, A2, . . . , An be independent. Show that (i) A1^c, A2, . . . , An are independent; (ii) 1_{A1}, . . . , 1_{An} are independent.
One of the ﬁrst things to understand about the deﬁnition of independent
events is that it is not enough to assume P (Ai ∩ Aj ) = P (Ai )P (Aj ) for all
i = j . A sequence of events A1 , . . . , An with the last property is called pairwise
independent. It is clear that independent events are pairwise independent.
The next example shows that the converse is not true.
Example 4.1. Let X1 , X2 , X3 be independent random variables with
P (Xi = 0) = P (Xi = 1) = 1/2
Let A1 = {X2 = X3 }, A2 = {X3 = X1 } and A3 = {X1 = X2 }. These events
are pairwise independent, since if i ≠ j then

P(Ai ∩ Aj) = P(X1 = X2 = X3) = 1/4 = P(Ai)P(Aj)

but they are not independent, since

P(A1 ∩ A2 ∩ A3) = P(X1 = X2 = X3) = 1/4 ≠ 1/8 = P(A1)P(A2)P(A3)
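Example 4.1 can be checked by brute force (a numerical aside, not part of the original text), enumerating the 8 equally likely values of (X1, X2, X3):

```python
from itertools import product

# Brute-force check of Example 4.1: for X1, X2, X3 fair independent 0/1
# variables, A1={X2=X3}, A2={X3=X1}, A3={X1=X2} are pairwise independent
# but not independent.
outcomes = list(product([0, 1], repeat=3))      # 8 equally likely points

def prob(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = [lambda w: w[1] == w[2],
     lambda w: w[2] == w[0],
     lambda w: w[0] == w[1]]
pA = [prob(a) for a in A]
assert pA == [0.5, 0.5, 0.5]

for i in range(3):
    for j in range(i + 1, 3):
        both = prob(lambda w: A[i](w) and A[j](w))
        assert both == pA[i] * pA[j]            # pairwise independent

p123 = prob(lambda w: A[0](w) and A[1](w) and A[2](w))
assert p123 == 0.25 and pA[0] * pA[1] * pA[2] == 0.125   # not independent
```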
In order to show that random variables X and Y are independent, we have
to check that P (X ∈ A, Y ∈ B ) = P (X ∈ A)P (Y ∈ B ) for all Borel sets A and
B. Since there are a lot of Borel sets, our next topic is

a. Sufficient Conditions for Independence

Our main result is (4.2). To state that result, we need a definition that generalizes all our earlier definitions.
Collections of sets A1, A2, . . . , An ⊂ F are said to be independent if whenever Ai ∈ Ai and I ⊂ {1, . . . , n} we have P(∩_{i∈I} Ai) = Π_{i∈I} P(Ai).

If each collection is a single set, i.e., Ai = {Ai}, this definition reduces to the one for sets. If each Ai contains Ω (e.g., if Ai is a σ field), the condition is equivalent to P(∩_{i=1}^n Ai) = Π_{i=1}^n P(Ai) whenever Ai ∈ Ai, since we can set Ai = Ω for i ∉ I. Conversely, if A1, A2, . . . , An are independent and Āi = Ai ∪ {Ω}, then Ā1, Ā2, . . . , Ān are independent, so there is no loss of generality in supposing Ω ∈ Ai.
The proof of (4.2) is based on Dynkin’s π − λ theorem ((2.1) in the Appendix). To state this result, we need two deﬁnitions. We say that A is a
π system if it is closed under intersection, i.e., if A, B ∈ A then A ∩ B ∈ A.
We say that L is a λsystem if: (i) Ω ∈ L. (ii) If A, B ∈ L and A ⊂ B then
B − A ∈ L. (iii) If An ∈ L and An ↑ A then A ∈ L.
(4.1) π − λ Theorem. If P is a π system and L is a λsystem that contains P
then σ (P ) ⊂ L.
(4.2) Theorem. Suppose A1 , A2 , . . . , An are independent and each Ai is a
π system. Then σ (A1 ), σ (A2 ), . . . , σ (An ) are independent.
Proof Let A2, . . . , An be sets with Ai ∈ Ai, let F = A2 ∩ · · · ∩ An, and let L = {A : P(A ∩ F) = P(A)P(F)}. As noted after the definition, we can without loss of generality suppose Ω ∈ Ai. So we have P(F) = Π_{i=2}^n P(Ai)
and (i) Ω ∈ L. To check (ii), we note that if A, B ∈ L with A ⊂ B then
(B − A) ∩ F = (B ∩ F ) − (A ∩ F ). So using (i) in Exercise 1.1, the fact A, B ∈ L
and then (i) in Exercise 1.1 again:
P ((B − A) ∩ F ) = P (B ∩ F ) − P (A ∩ F ) = P (B )P (F ) − P (A)P (F )
= {P (B ) − P (A)}P (F ) = P (B − A)P (F )
and we have B − A ∈ L. To check (iii) let Bk ∈ L with Bk ↑ B and note that
(Bk ∩ F ) ↑ (B ∩ F ) so using (iii) in Exercise 1.1, then the fact Bk ∈ L and then
(iii) in Exercise 1.1 again:
P (B ∩ F ) = lim P (Bk ∩ F ) = lim P (Bk )P (F ) = P (B )P (F )
k k Applying the π − λ theorem now gives L ⊃ σ (A1 ) and since A2 , . . . , An are
arbitrary members of A2 , . . . , An , we have:
(4.2 ) If A1 , A2 , . . . , An are independent then σ (A1 ), A2 , . . . , An are independent. Section 1.4 Independence
Applying (4.2 ) to A2 , . . . , An , σ (A1 ) (which are independent since the definition is unchanged by permuting the order of the collections) shows that
σ (A2 ), A3 , . . . , An , σ (A1 ) are independent, and after n iterations we have the
desired result.
Remark. The reader should note that it is not easy to show that if A, B ∈ L
then A ∩ B ∈ L, or A ∪ B ∈ L, but it is easy to check that if A, B ∈ L with
A ⊂ B then B − A ∈ L.
Having worked to establish (4.2), we get several corollaries.
(4.3) Corollary. In order for X1, . . . , Xn to be independent, it is sufficient that for all x1, . . . , xn ∈ (−∞, ∞]

P(X1 ≤ x1, . . . , Xn ≤ xn) = Π_{i=1}^n P(Xi ≤ xi)

Proof Let Ai = the sets of the form {Xi ≤ xi}. Since {Xi ≤ x} ∩ {Xi ≤ y} = {Xi ≤ x ∧ y}, Ai is a π system. Since we have allowed xi = ∞, Ω ∈ Ai. Exercise 2.1 implies σ(Ai) = σ(Xi), so the result follows from (4.2).
The last result expresses independence of random variables in terms of their distribution functions. The next two exercises treat density functions and discrete
random variables.
Exercise 4.4. Suppose (X1, . . . , Xn) has density f(x1, x2, . . . , xn), that is,

P((X1, X2, . . . , Xn) ∈ A) = ∫_A f(x) dx    for A ∈ Rⁿ

If f(x) can be written as g1(x1) · · · gn(xn), where the gm ≥ 0 are measurable, then X1, X2, . . . , Xn are independent. Note that the gm are not assumed to be probability densities.
Exercise 4.5. Suppose X1 , . . . , Xn are random variables that take values in
countable sets S1 , . . . , Sn . Then in order for X1 , . . . , Xn to be independent, it
is sufficient that whenever xi ∈ Si

P(X1 = x1, . . . , Xn = xn) = Π_{i=1}^n P(Xi = xi)

Our next goal is to prove that functions of disjoint collections of independent random variables are independent. See (4.5) for the precise statement. First we will prove an analogous result for σ fields.
(4.4) Corollary. Suppose Fi,j , 1 ≤ i ≤ n, 1 ≤ j ≤ m(i) are independent and
let Gi = σ (∪j Fi,j ). Then G1 , . . . , Gn are independent.
Proof Let Ai be the collection of sets of the form ∩j Ai,j where Ai,j ∈ Fi,j . Ai
is a π system that contains Ω and contains ∪j Fi,j so (4.2) implies σ (Ai ) = Gi
are independent.
(4.5) Corollary. If for 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), Xi,j are independent and
fi : Rm(i) → R are measurable then fi (Xi,1 , . . . , Xi,m(i) ) are independent.
Proof Let Fi,j = σ (Xi,j ) and Gi = σ (∪j Fi,j ). Since fi (Xi,1 , . . . , Xi,m(i) ) ∈ Gi ,
the desired result follows from (4.4) and Exercise 4.1.
A concrete special case of (4.5) that we will use in a minute is: if X1 , . . . , Xn are
independent then X = X1 and Y = X2 · · · Xn are independent. Later, when we
study sums Sm = X1 + · · · + Xm of independent random variables X1 , . . . , Xn ,
we will use (4.5) to conclude that if m < n then Sn − Sm is independent of the
indicator function of the event {max_{1≤k≤m} Sk > x}.

b. Independence, Distribution, and Expectation
Our next goal is to obtain formulas for the distribution and expectation of
independent random variables.
(4.6) Theorem. Suppose X1, . . . , Xn are independent random variables and Xi has distribution µi. Then (X1, . . . , Xn) has distribution µ1 × · · · × µn.

Proof Using the definitions of (i) A1 × · · · × An, (ii) independence, (iii) µi, and (iv) µ1 × · · · × µn:

P((X1, . . . , Xn) ∈ A1 × · · · × An) = P(X1 ∈ A1, . . . , Xn ∈ An)
    = Π_{i=1}^n P(Xi ∈ Ai) = Π_{i=1}^n µi(Ai) = (µ1 × · · · × µn)(A1 × · · · × An)

The last formula shows that the distribution of (X1, . . . , Xn) and the measure µ1 × · · · × µn agree on sets of the form A1 × · · · × An, a π system that generates Rⁿ. So (2.2) in the Appendix implies they must agree.
(4.7) Theorem. Suppose X and Y are independent and have distributions µ and ν. If h : R² → R is a measurable function with h ≥ 0 or E|h(X, Y)| < ∞, then

Eh(X, Y) = ∫∫ h(x, y) µ(dx) ν(dy)

In particular, if h(x, y) = f(x)g(y), where f, g : R → R are measurable functions with f, g ≥ 0 or with E|f(X)|, E|g(Y)| < ∞, then

E f(X)g(Y) = Ef(X) · Eg(Y)

Proof Using (3.9) and then Fubini's theorem ((6.2) in the Appendix), we have

Eh(X, Y) = ∫_{R²} h d(µ × ν) = ∫∫ h(x, y) µ(dx) ν(dy)

To prove the second result, we start with the case f, g ≥ 0. In this case,
using the ﬁrst result, the fact that g (y ) does not depend on x and then (3.9)
twice, we get

E f(X)g(Y) = ∫∫ f(x)g(y) µ(dx) ν(dy) = ∫ g(y) ( ∫ f(x) µ(dx) ) ν(dy)
    = ∫ Ef(X) g(y) ν(dy) = Ef(X) · Eg(Y)

Applying the result for nonnegative f and g to |f| and |g| shows E|f(X)g(Y)| = E|f(X)| · E|g(Y)| < ∞, and we can repeat the last argument to prove the desired result.
From (4.7), it is only a small step to

(4.8) Theorem. If X1, . . . , Xn are independent and have (a) Xi ≥ 0 for all i, or (b) E|Xi| < ∞ for all i, then

E(Π_{i=1}^n Xi) = Π_{i=1}^n EXi

i.e., the expectation on the left exists and has the value given on the right.

Proof X = X1 and Y = X2 · · · Xn are independent by (4.5), so taking f(x) = |x| and g(y) = |y| in (4.7) we have E|X1 · · · Xn| = E|X1| E|X2 · · · Xn|, and it follows by induction that for 1 ≤ m ≤ n

E|Xm · · · Xn| = Π_{i=m}^n E|Xi|

If the Xi ≥ 0, then |Xi| = Xi and the desired result follows from the special case m = 1. To prove the result in general, note that the special case m = 2 implies E|Y| = E|X2 · · · Xn| < ∞, so using (4.7) with f(x) = x and g(y) = y shows E(X1 · · · Xn) = EX1 · E(X2 · · · Xn), and the desired result follows by induction.
Example 4.2. It can happen that E(XY) = EX · EY without the variables being independent. Suppose the joint distribution of X and Y is given by the following table:

             Y = 1    Y = 0    Y = −1
  X = 1        0        a         0
  X = 0        b        c         b
  X = −1       0        a         0

where a, b > 0, c ≥ 0, and 2a + 2b + c = 1. Things are arranged so that XY ≡ 0. Symmetry implies EX = 0 and EY = 0, so E(XY) = 0 = EX · EY. The random variables are not independent since

P(X = 1, Y = 1) = 0 < ab = P(X = 1)P(Y = 1)
Two random variables X and Y with EX², EY² < ∞ that have EXY = EX · EY are said to be uncorrelated. The finite second moments are needed so that we know E|XY| < ∞ by the Cauchy-Schwarz inequality.
Exercise 4.6. Let Ω = (0, 1), F = the Borel sets, and P = Lebesgue measure. Show that Xn(ω) = sin(2πnω), n = 1, 2, . . . , are uncorrelated but not independent.
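A numerical sketch of the first claim in Exercise 4.6 (not part of the original text): integrating against Lebesgue measure on (0,1) with a midpoint rule shows the Xn have mean 0 and are pairwise uncorrelated, while E Xn² = 1/2:

```python
import math

# Midpoint-rule check that X_n(w) = sin(2*pi*n*w) on (0,1) have mean 0,
# are pairwise uncorrelated, and have E X_n^2 = 1/2 (step count arbitrary).
def integral(f, N=20000):
    h = 1.0 / N
    return sum(f((i + 0.5) * h) * h for i in range(N))

def X(n):
    return lambda w: math.sin(2 * math.pi * n * w)

for n, m in [(1, 2), (2, 3), (1, 3)]:
    assert abs(integral(X(n))) < 1e-6                          # E X_n = 0
    assert abs(integral(lambda w: X(n)(w) * X(m)(w))) < 1e-6   # E X_n X_m = 0
    assert abs(integral(lambda w: X(n)(w) ** 2) - 0.5) < 1e-6  # E X_n^2 = 1/2
```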
We turn now to the distribution of the sum of two independent r.v.’s
(4.9) Theorem. If X and Y are independent, F(x) = P(X ≤ x), and G(y) = P(Y ≤ y), then

P(X + Y ≤ z) = ∫ F(z − y) dG(y)

The integral on the right-hand side is called the convolution of F and G and is denoted F ∗ G(z). The meaning of dG(y) will be explained in the proof.
Proof Let h(x, y) = 1_{(x+y ≤ z)}, and let µ and ν be the probability measures with distribution functions F and G. Since for fixed y

∫ h(x, y) µ(dx) = ∫ 1_{(−∞, z−y]}(x) µ(dx) = F(z − y)

using (4.7) gives

P(X + Y ≤ z) = ∫∫ 1_{(x+y ≤ z)} µ(dx) ν(dy) = ∫ F(z − y) ν(dy) = ∫ F(z − y) dG(y)

The last equality is just a change of notation: we regard dG(y) as shorthand for "integrate with respect to the measure ν with distribution function G."
Exercise 4.7. (i) Show that if X and Y are independent with distributions µ and ν, then

P(X + Y = 0) = Σ_y µ({−y}) ν({y})

(ii) Conclude that if X has a continuous distribution, then P(X = Y) = 0.
To treat concrete examples, we need a special case of (4.9).
(4.10) Theorem. Suppose that X with density f and Y with distribution function G are independent. Then X + Y has density

h(x) = ∫ f(x − y) dG(y)

When Y has density g, the last formula can be written as

h(x) = ∫ f(x − y) g(y) dy

Proof From (4.9), the definition of density function, and Fubini's theorem
((6.2) in the Appendix), which is justiﬁed since everything is nonnegative, we
get
P(X + Y ≤ z) = ∫ F(z − y) dG(y) = ∫ ( ∫_{−∞}^{z} f(x − y) dx ) dG(y)
    = ∫_{−∞}^{z} ( ∫ f(x − y) dG(y) ) dx

The last equation says that X + Y has density h(x) = ∫ f(x − y) dG(y). The second formula follows from the first when we recall the meaning of dG(y) given in (4.9) and use Exercise 3.10.
(4.10) plus some ugly calculus allows us to treat two standard examples. These facts should be familiar from undergraduate probability. We give one calculation and leave the other to the reader.
Example 4.3. The gamma density with parameters α and λ is given by

f(x) = λ^α x^{α−1} e^{−λx} / Γ(α)    for x ≥ 0,    and f(x) = 0 for x < 0

where Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx. We will now show:

If X = gamma(α, λ) and Y = gamma(β, λ) are independent, then X + Y is gamma(α + β, λ).
Proof Writing f_{X+Y}(x) for the density function of X + Y and using (4.10),

f_{X+Y}(x) = ∫₀^x [λ^α (x − y)^{α−1} e^{−λ(x−y)} / Γ(α)] · [λ^β y^{β−1} e^{−λy} / Γ(β)] dy
    = [λ^{α+β} e^{−λx} / (Γ(α)Γ(β))] ∫₀^x (x − y)^{α−1} y^{β−1} dy

so it suffices to show the integral is x^{α+β−1} Γ(α)Γ(β)/Γ(α + β). To do this, we begin by changing variables y = xu, dy = x du, to get

∫₀^x (x − y)^{α−1} y^{β−1} dy = x^{α+β−1} ∫₀^1 (1 − u)^{α−1} u^{β−1} du

There are two ways to complete the proof at this point. The soft solution is to note that we have shown that the density f_{X+Y}(x) = c_{α,β} λ^{α+β} e^{−λx} x^{α+β−1}, where

c_{α,β} = [1/(Γ(α)Γ(β))] ∫₀^1 (1 − u)^{α−1} u^{β−1} du

There is only one norming constant c_{α,β} that makes this a probability density, so we must have c_{α,β} = 1/Γ(α + β).

The less elegant approach is to check the last equality by calculus. Multiplying each side by e^{−x}, integrating from 0 to ∞, and then using Fubini's theorem on the right, we have

Γ(α + β) ∫₀^1 (1 − u)^{α−1} u^{β−1} du = ∫₀^∞ ∫₀^x y^{β−1} e^{−y} (x − y)^{α−1} e^{−(x−y)} dy dx
    = ∫₀^∞ y^{β−1} e^{−y} ( ∫_y^∞ (x − y)^{α−1} e^{−(x−y)} dx ) dy = Γ(α)Γ(β)

which gives the desired result.
Exercise 4.8. Use the fact that a gamma(1, λ) is an exponential with parameter λ, and induction to show that the sum of n independent exponential(λ)
r.v.’s, X1 + · · · + Xn , has a gamma(n, λ) distribution.
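The n = 2 case of Exercise 4.8 can be checked numerically from the convolution formula (4.10); in this sketch (not part of the original text), the integrand f(x − y)f(y) for two exponential(1) densities equals the constant e^{−x} on (0, x), so a midpoint rule reproduces the gamma(2, 1) density x e^{−x} essentially exactly:

```python
import math

# Midpoint-rule evaluation of h(x) = int_0^x f(x-y) f(y) dy for the
# exponential(1) density f; the result should be the gamma(2,1) density
# x * e^{-x}.  (The integrand e^{-(x-y)} e^{-y} = e^{-x} is constant in y,
# so the midpoint rule is essentially exact here.)
def f(x):
    return math.exp(-x) if x >= 0 else 0.0

def density_of_sum(x, n=10000):
    h = x / n
    return sum(f(x - (i + 0.5) * h) * f((i + 0.5) * h) * h for i in range(n))

for x in [0.5, 1.0, 2.0, 4.0]:
    assert abs(density_of_sum(x) - x * math.exp(-x)) < 1e-9
```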
Exercise 4.9. In Example 3.2, we introduced the normal density with mean
µ and variance a, (2πa)−1/2 exp(−(x − µ)2 /2a). Show that if X = normal(µ, a)
and Y = normal(ν, b) are independent then X + Y = normal(µ + ν, a + b). To
simplify this tedious calculation notice that it is enough to prove the result for
µ = ν = 0. In Exercise 3.4 of Chapter 2 you will give a simpler proof of this
result.

c. Constructing Independent Random Variables
The last question that we have to address before we can study independent
random variables is: Do they exist? (If they don’t exist, then there is no point
in studying them!) If we are given a ﬁnite number of distribution functions
Fi , 1 ≤ i ≤ n, it is easy to construct independent random variables X1 , . . . , Xn
with P (Xi ≤ x) = Fi (x). Let Ω = Rn , F = Rn , Xi (ω1 , . . . , ωn ) = ωi (the ith
coordinate of ω ∈ Rn ), and let P be the measure on Rn that has
P ((a1 , b1 ] × · · · × (an , bn ]) = (F1 (b1 ) − F1 (a1 )) · · · (Fn (bn ) − Fn (an ))
If µi is the measure with distribution function Fi then P = µ1 × · · · × µn .
To construct an inﬁnite sequence X1 , X2 , . . . of independent random variables with given distribution functions, we want to perform the last construction
on the inﬁnite product space
RN = {(ω1 , ω2 , . . .) : ωi ∈ R} = {functions ω : N → R}
where N = {1, 2, . . .} and N stands for natural numbers. We deﬁne Xi (ω ) =
ωi and we equip RN with the product σ ﬁeld RN , which is generated by the
ﬁnite dimensional sets = sets of the form {ω : ωi ∈ Bi , 1 ≤ i ≤ n} where
Bi ∈ R. It is clear how we want to deﬁne P for ﬁnite dimensional sets. To assert
the existence of a unique extension to RN we use (7.1) from the Appendix:
(4.11) Kolmogorov’s extension theorem. Suppose we are given probability
measures µn on (Rn , Rn ) that are consistent, that is,
µn+1 ((a1 , b1 ] × · · · × (an , bn ] × R) = µn ((a1 , b1 ] × · · · × (an , bn ])
Then there is a unique probability measure P on (RN , RN ) with
$$P(\omega : \omega_i \in (a_i, b_i],\ 1 \le i \le n) = \mu_n\big((a_1, b_1] \times \cdots \times (a_n, b_n]\big)$$

Chapter 1 Laws of Large Numbers
In what follows we will need to construct sequences of random variables
that take values in other measurable spaces (S, S ). Unfortunately, (4.11) is not
valid for arbitrary measurable spaces. The ﬁrst example (on an inﬁnite product
of diﬀerent spaces Ω1 × Ω2 × . . .) was constructed by Andersen and Jessen (1948).
(See Halmos (1950) p. 214 or Neveu (1965) p. 84.) For an example in which all
the spaces Ωi are the same see Wegner (1973). Fortunately, there is a class of
spaces that is adequate for all of our results and for which the generalization of
Kolmogorov’s theorem is trivial.
(S, S) is said to be nice if there is a 1-1 map ϕ from S into R so that ϕ and
ϕ−1 are both measurable.
Such spaces are often called standard Borel spaces, but we already have too
many things named after Borel. The next result shows that most spaces arising
in applications are nice.
(4.12) Theorem. If S is a Borel subset of a complete separable metric space
M , and S is the collection of Borel subsets of S , then (S, S ) is nice.
Proof We begin with the special case S = [0, 1)^N with metric

$$\rho(x, y) = \sum_{n=1}^{\infty} |x_n - y_n| / 2^n$$

If x = (x_1, x_2, x_3, . . .), expand each component in binary, $x_j = .x_j^1 x_j^2 x_j^3 \ldots$ (taking the expansion with an infinite number of 0's). Let

$$\varphi_0(x) = .x_1^1\, x_2^1\, x_1^2\, x_3^1\, x_2^2\, x_1^3\, x_4^1\, x_3^2\, x_2^3\, x_1^4 \ldots$$
To treat the general case, we observe that by letting
d(x, y ) = ρ(x, y )/(1 + ρ(x, y ))
(for more details, see Exercise 4.10) we can suppose that the metric has d(x, y ) <
1 for all x, y . Let q1 , q2 , . . . be a countable dense set in S. Let
ψ (x) = (d(x, q1 ), d(x, q2 ), . . .).
ψ : S → [0, 1)^N is continuous and 1-1. ϕ_0 ◦ ψ gives the desired mapping.
Exercise 4.10. Let ρ(x, y) be a metric. (i) Suppose h is differentiable with h(0) = 0, h′(x) > 0 for x > 0 and h′(x) decreasing on [0, ∞). Then h(ρ(x, y)) is a metric. (ii) h(x) = x/(x + 1) satisfies the hypotheses in (i).
Caveat emptor. The proof above is somewhat light when it comes to details.
For a more comprehensive discussion, see Section 13.1 of Dudley (1989). An
interesting consequence of the analysis there is that for Borel subsets of a complete separable metric space the continuum hypothesis is true: i.e., all sets are
either ﬁnite, countably inﬁnite, or have the cardinality of the real numbers.
Exercises
4.11. Prove directly from the deﬁnition that if X and Y are independent and
f and g are measurable functions then f (X ) and g (Y ) are independent.
4.12. Let K ≥ 3 be a prime and let X and Y be independent random variables
that are uniformly distributed on {0, 1, . . . , K − 1}. For 0 ≤ n < K , let Zn =
X + nY mod K . Show that Z0 , Z1 , . . . , ZK −1 are pairwise independent, i.e.,
each pair is independent, but if we know the values of two of the variables then
we know the values of all the variables.
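Because K is small, Exercise 4.12 can be verified by exhaustive enumeration. The sketch below is not part of the text (K = 5 and the pair (Z_0, Z_1) are arbitrary choices): it counts joint values of one pair over all K² equally likely outcomes (X, Y); any other pair Z_a, Z_b with a ≠ b can be checked the same way.

```python
from itertools import product

K = 5  # a prime, as the exercise requires

def z(n, x, y):
    # Z_n = X + nY mod K
    return (x + n * y) % K

# Tabulate the joint distribution of (Z_0, Z_1) over the K^2 equally
# likely values of (X, Y).
counts = {}
for x, y in product(range(K), repeat=2):
    key = (z(0, x, y), z(1, x, y))
    counts[key] = counts.get(key, 0) + 1

# Pairwise independence: all K^2 value pairs occur, each exactly once,
# so P(Z_0 = i, Z_1 = j) = 1/K^2 = P(Z_0 = i) P(Z_1 = j).
print(len(counts), set(counts.values()))  # → 25 {1}
```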
4.13. Find four random variables taking values in {−1, 1} so that any three
are independent but all four are not. Hint: Consider products of independent
random variables.
4.14. Let Ω = {1, 2, 3, 4}, F = all subsets of Ω, and P ({i}) = 1/4. Give an
example of two collections of sets A1 and A2 that are independent but whose
generated σ ﬁelds are not.
4.15. Show that if X and Y are independent, integer-valued random variables, then

$$P(X + Y = n) = \sum_{m} P(X = m)\, P(Y = n - m)$$

4.16. In Example 3.4, we introduced the Poisson distribution with parameter
λ, which is given by P (Z = k ) = e−λ λk /k ! for k = 0, 1, 2, . . . Use the previous
exercise to show that if X = Poisson(λ) and Y = Poisson(µ) are independent
then X + Y = Poisson(λ + µ).
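Exercise 4.16 can be checked by computing the convolution from the previous exercise directly. This sketch is not part of the text; λ = 1.5, µ = 2.5, and the truncation point N are arbitrary (the neglected tail mass is negligible here).

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

lam, mu, N = 1.5, 2.5, 30

# Convolution formula from Exercise 4.15 applied to two Poisson pmfs.
conv = [sum(poisson_pmf(lam, m) * poisson_pmf(mu, n - m) for m in range(n + 1))
        for n in range(N)]
direct = [poisson_pmf(lam + mu, n) for n in range(N)]

err = max(abs(a - b) for a, b in zip(conv, direct))
print(err < 1e-12)  # → True: the convolution is the Poisson(lam + mu) pmf
```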
4.17. X is said to have a Binomial(n, p) distribution if
$$P(X = m) = \binom{n}{m} p^m (1-p)^{n-m}$$

(i) Show that if X = Binomial(n, p) and Y = Binomial(m, p) are independent
then X + Y = Binomial(n + m, p). (ii) Look at Example 3.3 and use induction
to conclude that the sum of n independent Bernoulli(p) random variables is
Binomial(n, p).
4.18. It should not be surprising that the distribution of X + Y can be F ∗ G
without the random variables being independent. Suppose X, Y ∈ {0, 1, 2}
and take each value with probability 1/3. (a) Find the distribution of X + Y
assuming X and Y are independent. (b) Find all the joint distributions (X, Y )
so that the distribution of X + Y is the same as the answer to (a).
4.19. Let X, Y ≥ 0 be independent with distribution functions F and G. Find
the distribution function of XY.
4.20. If we want an inﬁnite sequence of coin tossings, we do not have to use
Kolmogorov’s theorem. Let Ω be the unit interval (0,1) equipped with the Borel
sets F and Lebesgue measure P. Let Y_n(ω) = 1 if [2^n ω] is odd and 0 if [2^n ω] is even. Show that Y_1, Y_2, . . . are independent with P(Y_k = 0) = P(Y_k = 1) = 1/2.

1.5. Weak Laws of Large Numbers
In this section, we will prove several “weak laws of large numbers.” The ﬁrst
order of business is to deﬁne the mode of convergence that appears in the
conclusion of the theorems. We say that Y_n converges to Y in probability if for all ε > 0, P(|Y_n − Y| > ε) → 0 as n → ∞.

a. L2 Weak Laws

Our first set of weak laws comes from computing variances and using Chebyshev's inequality. Extending a definition given in Example 4.2 for two random variables, a family of random variables X_i, i ∈ I with EX_i² < ∞ is said to be uncorrelated if we have

$$E(X_i X_j) = EX_i\, EX_j \qquad \text{whenever } i \ne j$$

The key to our weak law for uncorrelated random variables, (5.2), is:
(5.1) Lemma. Let X_1, . . . , X_n have E(X_i²) < ∞ and be uncorrelated. Then

var(X_1 + · · · + X_n) = var(X_1) + · · · + var(X_n)

where var(Y) = the variance of Y.

Proof Let $\mu_i = EX_i$ and $S_n = \sum_{i=1}^n X_i$. Since $ES_n = \sum_{i=1}^n \mu_i$, using the definition of the variance, writing the square of the sum as the product of two copies of the sum, and then expanding, we have

$$\mathrm{var}(S_n) = E(S_n - ES_n)^2 = E\Big(\sum_{i=1}^{n} (X_i - \mu_i)\Big)^2 = E \sum_{i=1}^{n} \sum_{j=1}^{n} (X_i - \mu_i)(X_j - \mu_j)$$
$$= \sum_{i=1}^{n} E(X_i - \mu_i)^2 + 2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} E\big((X_i - \mu_i)(X_j - \mu_j)\big)$$

where in the last equality we have separated out the diagonal terms i = j and used the fact that the sum over 1 ≤ i < j ≤ n is the same as the sum over 1 ≤ j < i ≤ n.
The ﬁrst sum is var(X1 )+ . . . +var(Xn ) so we want to show that the second
sum is zero. To do this, we observe
E ((Xi − µi )(Xj − µj )) = EXi Xj − µi EXj − µj EXi + µi µj
= EXi Xj − µi µj = 0
since Xi and Xj are uncorrelated.
In words, (5.1) says that for uncorrelated random variables the variance of
the sum is the sum of the variances. The second ingredient in our proof of (5.2)
is the following consequence of (3.10c):
var(cY ) = c2 var(Y )
This result and (5.1) lead easily to
(5.2) L2 weak law. Let X_1, X_2, . . . be uncorrelated random variables with EX_i = µ and var(X_i) ≤ C < ∞. If S_n = X_1 + . . . + X_n then as n → ∞, S_n/n → µ in L² and in probability.

Proof To prove L² convergence, observe that E(S_n/n) = µ, so

$$E(S_n/n - \mu)^2 = \mathrm{var}(S_n/n) = \frac{1}{n^2}\big(\mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)\big) \le \frac{Cn}{n^2} \to 0$$

To conclude there is also convergence in probability, we apply the next result to Z_n = S_n/n − µ.
(5.3) Lemma. If p > 0 and E|Z_n|^p → 0 then Z_n → 0 in probability.

Proof Chebyshev's inequality, (3.4), with φ(x) = x^p and X = |Z_n| implies that if ε > 0 then P(|Z_n| ≥ ε) ≤ ε^{−p} E|Z_n|^p → 0.
The most important special case of (5.2) occurs when X1 , X2 , . . . are independent random variables that all have the same distribution. In the jargon,
they are independent and identically distributed or i.i.d. for short. The
L2 weak law (5.2) tells us in this case that if EXi2 < ∞ then Sn /n converges to
µ = EX_i in probability as n → ∞. In (5.8) below, we will see that E|X_i| < ∞
is suﬃcient for the last conclusion, but for the moment we will concern ourselves
with consequences of the weaker result.
Our ﬁrst application is to a situation that on the surface has nothing to do
with randomness.
Example 5.1. Polynomial approximation. Let f be a continuous function
on [0,1], and let
$$f_n(x) = \sum_{m=0}^{n} \binom{n}{m} x^m (1-x)^{n-m} f(m/n) \qquad\text{where}\qquad \binom{n}{m} = \frac{n!}{m!\,(n-m)!}$$

be the Bernstein polynomial of degree n associated with f. Then as n → ∞

$$\sup_{x \in [0,1]} |f_n(x) - f(x)| \to 0$$

Proof First observe that if S_n is the sum of n independent random variables
with P (Xi = 1) = p and P (Xi = 0) = 1 − p then EXi = p, var(Xi ) = p(1 − p)
and
$$P(S_n = m) = \binom{n}{m} p^m (1-p)^{n-m}$$
so Ef (Sn /n) = fn (p). (5.2) tells us that as n → ∞, Sn /n → p in probability.
The last two observations motivate the deﬁnition of fn (p), but to prove the
desired conclusion we have to use the proof of (5.2) rather than the result itself.
Combining the proof of (5.2) with our formula for the variance of Xi and
the fact that p(1 − p) ≤ 1/4 when p ∈ [0, 1], we have
$$P(|S_n/n - p| > \delta) \le \mathrm{var}(S_n/n)/\delta^2 = p(1-p)/n\delta^2 \le 1/4n\delta^2$$
To conclude now that Ef(S_n/n) → f(p), let $M = \sup_{x\in[0,1]} |f(x)|$, let ε > 0, and pick δ > 0 so that if |x − y| < δ then |f(x) − f(y)| < ε. (This is possible since a continuous function is uniformly continuous on each bounded interval.) Now, using Jensen's inequality gives

$$|Ef(S_n/n) - f(p)| \le E|f(S_n/n) - f(p)| \le \varepsilon + 2M\, P(|S_n/n - p| > \delta)$$

Letting n → ∞, we have $\limsup_{n\to\infty} |Ef(S_n/n) - f(p)| \le \varepsilon$, but ε is arbitrary, so this gives the desired result.

Our next result is for comic relief.
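The convergence in Example 5.1 is easy to observe numerically. This sketch is not part of the text; the test function f(x) = |x − 1/2| and the evaluation grid are arbitrary choices.

```python
from math import comb

def bernstein(f, n, x):
    # f_n(x) = sum_{m=0}^n C(n, m) x^m (1-x)^(n-m) f(m/n)
    return sum(comb(n, m) * x ** m * (1 - x) ** (n - m) * f(m / n)
               for m in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [i / 200 for i in range(201)]

def sup_err(n):
    # sup over the grid of |f_n(x) - f(x)|
    return max(abs(bernstein(f, n, x) - f(x)) for x in grid)

e10, e100 = sup_err(10), sup_err(100)
print(e10, e100)  # the sup error shrinks as n grows
```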
Example 5.2. A high-dimensional cube is almost the boundary of a ball. Let X_1, X_2, . . . be independent and uniformly distributed on (−1, 1). Let Y_i = X_i², which are independent since they are functions of independent random variables. EY_i = 1/3 and var(Y_i) ≤ EY_i² ≤ 1, so (5.2) implies

$$(X_1^2 + \ldots + X_n^2)/n \to 1/3 \quad\text{in probability as } n \to \infty$$

Let $A_{n,\varepsilon} = \{x \in \mathbf{R}^n : (1-\varepsilon)\sqrt{n/3} < |x| < (1+\varepsilon)\sqrt{n/3}\}$ where $|x| = (x_1^2 + \cdots + x_n^2)^{1/2}$. If we let |S| denote the Lebesgue measure of S then the last conclusion implies that for any ε > 0, $|A_{n,\varepsilon} \cap (-1,1)^n|/2^n \to 1$, or, in words, most of the volume of the cube (−1, 1)^n comes from $A_{n,\varepsilon}$, which is almost the boundary of the ball of radius $\sqrt{n/3}$.

b. Triangular Arrays
Many classical limit theorems in probability concern arrays X_{n,k}, 1 ≤ k ≤ n, of random variables and investigate the limiting behavior of their row sums S_n = X_{n,1} + · · · + X_{n,n}. In most cases, we assume that the random variables on each row are independent, but for the next trivial (but useful) result we do not need that assumption. Indeed, here S_n can be any sequence of random variables.

(5.4) Theorem. Let $\mu_n = ES_n$ and $\sigma_n^2 = \mathrm{var}(S_n)$. If $\sigma_n^2 / b_n^2 \to 0$ then

$$\frac{S_n - \mu_n}{b_n} \to 0 \quad\text{in probability}$$

Proof Our assumptions imply $E((S_n - \mu_n)/b_n)^2 = b_n^{-2}\,\mathrm{var}(S_n) \to 0$, so the desired conclusion follows from (5.3).
We will now give three applications of (5.4). For these three examples, the following calculation is useful:

$$\sum_{m=1}^{n} \frac{1}{m} \ \ge\ \int_1^n \frac{dx}{x} \ \ge\ \sum_{m=2}^{n} \frac{1}{m}$$

so that

$$(*)\qquad \log n \ \le\ \sum_{m=1}^{n} \frac{1}{m} \ \le\ 1 + \log n$$
Example 5.3. Coupon collector's problem. Let X_1, X_2, . . . be i.i.d. uniform on {1, 2, . . . , n}. To motivate the name, think of collecting baseball cards (or coupons). Suppose that the ith item we collect is chosen at random from the set of possibilities and is independent of the previous choices. Let $\tau_k^n = \inf\{m : |\{X_1, \ldots, X_m\}| = k\}$ be the first time we have k different items. In this problem, we are interested in the asymptotic behavior of $T_n = \tau_n^n$, the time to collect a complete set. It is easy to see that $\tau_1^n = 1$. To make later formulas work out nicely, we will set $\tau_0^n = 0$. For 1 ≤ k ≤ n, $X_{n,k} \equiv \tau_k^n - \tau_{k-1}^n$ represents the time to get a choice different from our first k − 1, so X_{n,k} has a geometric distribution with parameter 1 − (k − 1)/n and is independent of the earlier waiting times X_{n,j}, 1 ≤ j < k. Example 3.5 tells us that if X has a geometric distribution with parameter p then EX = 1/p and var(X) ≤ 1/p². Using the linearity of expected value, (∗), and (5.1), we see that

$$ET_n = \sum_{k=1}^{n} \Big(1 - \frac{k-1}{n}\Big)^{-1} = n \sum_{m=1}^{n} m^{-1} \sim n \log n$$
$$\mathrm{var}(T_n) \le \sum_{k=1}^{n} \Big(1 - \frac{k-1}{n}\Big)^{-2} = n^2 \sum_{m=1}^{n} m^{-2} \le n^2 \sum_{m=1}^{\infty} m^{-2}$$

Taking b_n = n log n and using (5.4), it follows that

$$\frac{T_n - n \sum_{m=1}^{n} m^{-1}}{n \log n} \to 0 \quad\text{in probability}$$

and hence T_n/(n log n) → 1 in probability. For a concrete example, take n = 500. In this case the limit theorem says it will take about 500 log 500 ≈ 3107 tries to get a complete set.
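Simulating the n = 500 case of Example 5.3 is straightforward. The sketch below is not part of the text, and the number of repetitions is arbitrary; it draws coupons until a complete set appears and compares T_n to n log n.

```python
import random
from math import log

random.seed(42)

def collect_time(n):
    # Draw uniformly from n coupon types until all have been seen.
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        t += 1
    return t

n, reps = 500, 20
ratios = [collect_time(n) / (n * log(n)) for _ in range(reps)]
avg = sum(ratios) / reps
print(avg)  # near 1, as T_n / (n log n) -> 1 in probability
```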
Example 5.4. Random permutations. Let Ω_n consist of the n! permutations (i.e., one-to-one mappings from {1, . . . , n} onto {1, . . . , n}) and make this into a probability space by assuming all the permutations are equally likely. This application of the weak law concerns the cycle structure of a random permutation π, so we begin by describing the decomposition of a permutation into cycles. Consider the sequence 1, π(1), π(π(1)), . . . Eventually, π^k(1) = 1. When
it does, we say the ﬁrst cycle is completed and has length k . To start the
second cycle, we pick the smallest integer i not in the ﬁrst cycle and look at
i, π (i), π (π (i)), . . . until we come back to i. We repeat the construction until all
the elements are accounted for. For example, if the permutation is
i      1  2  3  4  5  6  7  8  9
π(i)   3  9  6  8  2  1  5  4  7

then the cycle decomposition is (136) (2975) (48).
Let Xn,k = 1 if a right parenthesis occurs after the k th number in the
decomposition, Xn,k = 0 otherwise and let Sn = Xn,1 + . . . + Xn,n = the
number of cycles. (In the example, X9,3 = X9,7 = X9,9 = 1, and the other
X9,m = 0.) I claim that
Lemma. Xn,1 , . . . , Xn,n are independent and P (Xn,j = 1) = 1/(n − j + 1).
Intuitively, this is true since, independent of what has happened so far, there
are n − j + 1 values that have not appeared in the range, and only one of them
will complete the cycle.
Proof To prove this, it is useful to generate the permutation in a special way. Let i_1 = 1. Pick j_1 at random from {1, . . . , n} and let π(i_1) = j_1. If j_1 ≠ 1, let i_2 = j_1. If j_1 = 1, let i_2 = 2. In either case, pick j_2 at random from {1, . . . , n} − {j_1}. In general, if i_1, j_1, . . . , i_{k−1}, j_{k−1} have been selected and we have set π(i_ℓ) = j_ℓ for 1 ≤ ℓ < k, then (a) if j_{k−1} ∈ {i_1, . . . , i_{k−1}}, so a cycle has just been completed, we let i_k = inf({1, . . . , n} − {i_1, . . . , i_{k−1}}), and (b) if j_{k−1} ∉ {i_1, . . . , i_{k−1}} we let i_k = j_{k−1}. In either case we pick j_k at random from {1, . . . , n} − {j_1, . . . , j_{k−1}} and let π(i_k) = j_k.

The construction above is tedious to write out, or to read, but now I can claim with a clear conscience that X_{n,1}, . . . , X_{n,n} are independent and P(X_{n,k} = 1) = 1/(n − k + 1), since when we pick j_k there are n − k + 1 values in {1, . . . , n} − {j_1, . . . , j_{k−1}} and only one of them will complete the cycle.
To check the conditions of (5.4), now note

$$ES_n = 1/n + 1/(n-1) + \cdots + 1/2 + 1$$
$$\mathrm{var}(S_n) = \sum_{k=1}^{n} \mathrm{var}(X_{n,k}) \le \sum_{k=1}^{n} E(X_{n,k}^2) = \sum_{k=1}^{n} E(X_{n,k}) = ES_n$$

where the results on the second line follow from (5.1), (3.10b), and $X_{n,k}^2 = X_{n,k}$. Now $ES_n \sim \log n$, so if $b_n = (\log n)^{.5+\varepsilon}$, the conditions of (5.4) are satisfied and it follows that

$$(*)\qquad \frac{S_n - \sum_{m=1}^{n} m^{-1}}{(\log n)^{.5+\varepsilon}} \to 0 \quad\text{in probability}$$

Taking ε = 0.5 we have that S_n/log n → 1 in probability, but (∗) says more. We will see in Example 4.6 of Chapter 2 that (∗) is false if ε = 0.
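A simulation sketch for Example 5.4 (not part of the text; n = 2000 and the repetition count are arbitrary): the average number of cycles of a uniform random permutation should be near log n, up to the O(1) difference between the harmonic sum and the logarithm.

```python
import random
from math import log

random.seed(1)

def cycle_count(perm):
    # Count cycles by following each orbit of the permutation to closure.
    n, seen, cycles = len(perm), [False] * len(perm), 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

n, reps = 2000, 200
avg = sum(cycle_count(random.sample(range(n), n)) for _ in range(reps)) / reps
print(avg, log(n))  # close: E S_n = 1 + 1/2 + ... + 1/n ~ log n
```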
Example 5.5. An occupancy problem. Suppose we put r balls at random in n boxes, i.e., all n^r assignments of balls to boxes have equal probability. Let A_i be the event that the ith box is empty and N_n = the number of empty boxes. It is easy to see that

$$P(A_i) = (1 - 1/n)^r \qquad\text{and}\qquad EN_n = n(1 - 1/n)^r$$

A little calculus (take logarithms) shows that if r/n → c, EN_n/n → e^{−c}. (For a proof, see (1.3) in Chapter 2.) To compute the variance of N_n, we observe that

$$EN_n^2 = E\Big(\sum_{m=1}^{n} 1_{A_m}\Big)^2 = \sum_{1 \le k, m \le n} P(A_k \cap A_m)$$
$$\mathrm{var}(N_n) = EN_n^2 - (EN_n)^2 = \sum_{1 \le k, m \le n} \big(P(A_k \cap A_m) - P(A_k)P(A_m)\big)$$
$$= n(n-1)\{(1 - 2/n)^r - (1 - 1/n)^{2r}\} + n\{(1 - 1/n)^r - (1 - 1/n)^{2r}\}$$

The first term comes from k ≠ m and the second from k = m. Since (1 − 2/n)^r → e^{−2c} and (1 − 1/n)^r → e^{−c}, it follows easily from the last formula that var(N_n/n) = var(N_n)/n² → 0. Taking b_n = n in (5.4) now we have
$$N_n/n \to e^{-c} \quad\text{in probability}$$

c. Truncation
To truncate a random variable X at level M means to consider

$$\bar X = X\, 1_{(|X| \le M)} = \begin{cases} X & \text{if } |X| \le M \\ 0 & \text{if } |X| > M \end{cases}$$
if X  > M To extend the weak law to random variables without a ﬁnite second moment, we
will truncate and then use Chebyshev’s inequality. We begin with a very general
but also very useful result. Its proof is easy because we have assumed what we
need for the proof. Later we will have to work a little to verify the assumptions
in special cases, but the general result serves to identify the essential ingredients
in the proof.
(5.5) Weak law for triangular arrays. For each n let X_{n,k}, 1 ≤ k ≤ n, be independent. Let b_n > 0 with b_n → ∞, and let $\bar X_{n,k} = X_{n,k}\, 1_{(|X_{n,k}| \le b_n)}$. Suppose that as n → ∞

(i) $\sum_{k=1}^{n} P(|X_{n,k}| > b_n) \to 0$, and

(ii) $b_n^{-2} \sum_{k=1}^{n} E \bar X_{n,k}^2 \to 0$.

If we let S_n = X_{n,1} + . . . + X_{n,n} and put $a_n = \sum_{k=1}^{n} E \bar X_{n,k}$ then

$$(S_n - a_n)/b_n \to 0 \quad\text{in probability}$$
Proof Let $\bar S_n = \bar X_{n,1} + \cdots + \bar X_{n,n}$. Clearly,

$$P\Big(\Big|\frac{S_n - a_n}{b_n}\Big| > \varepsilon\Big) \le P(S_n \ne \bar S_n) + P\Big(\Big|\frac{\bar S_n - a_n}{b_n}\Big| > \varepsilon\Big)$$

To estimate the first term, we note that

$$P(S_n \ne \bar S_n) \le P\big(\cup_{k=1}^{n} \{\bar X_{n,k} \ne X_{n,k}\}\big) \le \sum_{k=1}^{n} P(|X_{n,k}| > b_n) \to 0$$

by (i). For the second term, we note that Chebyshev's inequality, $a_n = E\bar S_n$, (5.1), and var(X) ≤ EX² imply

$$P\Big(\Big|\frac{\bar S_n - a_n}{b_n}\Big| > \varepsilon\Big) \le \varepsilon^{-2} E\Big(\frac{\bar S_n - a_n}{b_n}\Big)^2 = \varepsilon^{-2} b_n^{-2}\, \mathrm{var}(\bar S_n)$$
$$= (\varepsilon b_n)^{-2} \sum_{k=1}^{n} \mathrm{var}(\bar X_{n,k}) \le (\varepsilon b_n)^{-2} \sum_{k=1}^{n} E(\bar X_{n,k})^2 \to 0$$

by (ii), and the proof is complete.
From (5.5), we get the following result for a single sequence.
(5.6) Weak law of large numbers. Let X_1, X_2, . . . be i.i.d. with

$$x\, P(|X_i| > x) \to 0 \quad\text{as } x \to \infty$$

Let S_n = X_1 + · · · + X_n and let $\mu_n = E(X_1 1_{(|X_1| \le n)})$. Then S_n/n − µ_n → 0 in probability.

Remark. The assumption in the theorem is necessary for the existence of constants a_n so that S_n/n − a_n → 0. See Feller, Vol. II (1971) p. 234–236 for a proof.

Proof We will apply (5.5) with X_{n,k} = X_k and b_n = n. To check (i), we note

$$\sum_{k=1}^{n} P(|X_{n,k}| > n) = n\, P(|X_i| > n) \to 0$$
by assumption. To check (ii), we need to show $n^{-2} \cdot n E\bar X_{n,1}^2 \to 0$. To do this, we need the following result, which will be useful several times below.

(5.7) Lemma. If Y ≥ 0 and p > 0 then $E(Y^p) = \int_0^\infty p y^{p-1} P(Y > y)\, dy$.

Proof Using the definition of expected value, Fubini's theorem (for nonnegative random variables), and then calculating the resulting integrals gives

$$\int_0^\infty p y^{p-1} P(Y > y)\, dy = \int_0^\infty \int_\Omega p y^{p-1} 1_{(Y > y)}\, dP\, dy$$
$$= \int_\Omega \int_0^\infty p y^{p-1} 1_{(Y > y)}\, dy\, dP = \int_\Omega \int_0^Y p y^{p-1}\, dy\, dP = \int_\Omega Y^p\, dP = EY^p$$

Returning to the proof of (5.6), we observe that (5.7) and the fact that $\bar X_{n,1} = X_1 1_{(|X_1| \le n)}$ imply

$$E(\bar X_{n,1}^2) = \int_0^\infty 2y\, P(|\bar X_{n,1}| > y)\, dy \le \int_0^n 2y\, P(|X_1| > y)\, dy$$

since $P(|\bar X_{n,1}| > y) = 0$ for y ≥ n and $= P(|X_1| > y) - P(|X_1| > n)$ for y ≤ n. We claim that $y\, P(|X_1| > y) \to 0$ implies

$$E(\bar X_{n,1}^2)/n = \frac{1}{n} \int_0^n 2y\, P(|X_1| > y)\, dy \to 0$$

as n → ∞. Intuitively, this holds since the right-hand side is the average of $g(y) = 2y\, P(|X_1| > y)$ over [0, n] and g(y) → 0 as y → ∞. To spell out the details, note that 0 ≤ g(y) ≤ 2y and g(y) → 0 as y → ∞, so we must have M = sup g(y) < ∞. If we let $\varepsilon_K = \sup\{g(y) : y > K\}$ then by considering the integrals over [0, K] and [K, n] separately

$$\int_0^n 2y\, P(|X_1| > y)\, dy \le KM + (n - K)\varepsilon_K$$

Dividing by n and letting n → ∞, we have

$$\limsup_{n\to\infty} \frac{1}{n} \int_0^n 2y\, P(|X_1| > y)\, dy \le \varepsilon_K$$

Since K is arbitrary and $\varepsilon_K \to 0$ as K → ∞, the desired result follows.
Finally, we have the weak law in its most familiar form.

(5.8) Corollary. Let X_1, X_2, . . . be i.i.d. with E|X_i| < ∞. Let S_n = X_1 + · · · + X_n and let µ = EX_1. Then S_n/n → µ in probability.

Remark. Applying (5.7) with p = 1 − ε and ε > 0, we see that $x\, P(|X_1| > x) \to 0$ implies $E|X_1|^{1-\varepsilon} < \infty$, so the assumption in (5.6) is not much weaker than finite mean.

Proof Two applications of the dominated convergence theorem imply

$$x\, P(|X_1| > x) \le E(|X_1| 1_{(|X_1| > x)}) \to 0 \quad\text{as } x \to \infty$$
$$\mu_n = E(X_1 1_{(|X_1| \le n)}) \to E(X_1) = \mu \quad\text{as } n \to \infty$$

Using (5.6), we see that if ε > 0 then P(|S_n/n − µ_n| > ε/2) → 0. Since µ_n → µ, it follows that P(|S_n/n − µ| > ε) → 0.
Example 5.6. For an example where the weak law does not hold, suppose X_1, X_2, . . . are independent and have a Cauchy distribution:

$$P(X_i \le x) = \int_{-\infty}^{x} \frac{dt}{\pi(1 + t^2)}$$

As x → ∞,

$$P(|X_1| > x) = 2\int_x^\infty \frac{dt}{\pi(1 + t^2)} \sim \frac{2}{\pi} \int_x^\infty t^{-2}\, dt = \frac{2}{\pi}\, x^{-1}$$

From the necessity of the condition above, we can conclude that there is no sequence of constants µ_n so that S_n/n − µ_n → 0. We will see later that S_n/n always has the same distribution as X_1. (See Exercise 3.10 in Chapter 2.)
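The failure of the weak law in Example 5.6 shows up clearly in simulation. This sketch is not part of the text (sample sizes are arbitrary): since S_n/n has the same Cauchy distribution for every n, the probability P(|S_n/n| > 1) stays near P(|X_1| > 1) = 1/2 instead of shrinking.

```python
import random
from math import pi, tan

random.seed(3)

def cauchy():
    # Inverse-CDF sampling: tan(pi (U - 1/2)) has the standard Cauchy law.
    return tan(pi * (random.random() - 0.5))

def frac_large(n, reps=2000):
    # Fraction of repetitions with |S_n / n| > 1.
    count = 0
    for _ in range(reps):
        s = sum(cauchy() for _ in range(n))
        if abs(s / n) > 1:
            count += 1
    return count / reps

f10, f100 = frac_large(10), frac_large(100)
print(f10, f100)  # both near 1/2: the averages never settle down
```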
As the next example shows, we can have a weak law in some situations in which E|X| = ∞.
Example 5.7. The "St. Petersburg paradox." Let X_1, X_2, . . . be independent random variables with

$$P(X_i = 2^j) = 2^{-j} \quad\text{for } j \ge 1$$

In words, you win 2^j dollars if it takes j tosses to get a heads. The paradox here is that EX_1 = ∞, but you clearly wouldn't pay an infinite amount to play this game. An application of (5.5) will tell us how much we should pay to play the game n times.

In this example, X_{n,k} = X_k. To apply (5.5), we have to pick b_n. To do this, we are guided by the principle that in checking (ii) we want to take b_n as small as we can and have (i) hold. With this in mind, we observe that if m is an integer

$$P(X_1 \ge 2^m) = \sum_{j=m}^{\infty} 2^{-j} = 2^{-m+1}$$

Let m(n) = log₂ n + K(n) where K(n) → ∞ and is chosen so that m(n) is an integer (and hence the displayed formula is valid). Letting $b_n = 2^{m(n)}$, we have

$$n\, P(X_1 \ge b_n) = n\, 2^{-m(n)+1} = 2^{-K(n)+1} \to 0$$

proving (i). To check (ii), we observe that if $\bar X_{n,k} = X_k 1_{(X_k \le b_n)}$ then

$$E \bar X_{n,k}^2 = \sum_{j=1}^{m(n)} 2^{2j} \cdot 2^{-j} \le 2^{m(n)} \sum_{k=0}^{\infty} 2^{-k} = 2 b_n$$

So the expression in (ii) is smaller than 2n/b_n, which → 0 since $b_n = 2^{m(n)} = n 2^{K(n)}$ and K(n) → ∞.

The last two steps are to evaluate a_n and to apply (5.5):

$$E \bar X_{n,k} = \sum_{j=1}^{m(n)} 2^j\, 2^{-j} = m(n)$$

so a_n = n m(n). We have m(n) = log n + K(n) (here and until the end of the example all logs are base 2), so if we pick K(n)/log n → 0 then a_n/(n log n) → 1 as n → ∞. Using (5.5) now, we have

$$\frac{S_n - a_n}{n 2^{K(n)}} \to 0 \quad\text{in probability}$$

If we suppose that K(n) ≤ log log n for large n then the last conclusion holds with the denominator replaced by n log n, and it follows that S_n/(n log n) → 1 in probability.

Returning to our original question, we see that a fair price for playing n times is $ log₂ n per play. When n = 1024, this is $10 per play. Nicolas Bernoulli wrote in 1713, "There ought not to exist any even halfway sensible person who would not sell the right of playing the game for 40 ducates." If the wager were 1 ducat, one would need $2^{40} \approx 10^{12}$ plays to start to break even.
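A simulation sketch of Example 5.7 (not part of the text; n = 2^14 and the repetition count are arbitrary): the ratios S_n/(n log₂ n) cluster near 1, though the heavy upper tail of the payoff produces occasional large excursions.

```python
import random
from math import log2

random.seed(9)

def st_petersburg():
    # Win 2^j dollars where j is the number of tosses needed to get heads,
    # so P(payoff = 2^j) = 2^(-j).
    j = 1
    while random.random() < 0.5:
        j += 1
    return 2 ** j

n, reps = 2 ** 14, 10
ratios = [sum(st_petersburg() for _ in range(n)) / (n * log2(n))
          for _ in range(reps)]
med = sorted(ratios)[reps // 2]
print(med)  # typically near 1 (convergence is slow, so expect some spread)
```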
Exercises
5.1. Let X_1, X_2, . . . be uncorrelated random variables with EX_i = µ_i and var(X_i)/i → 0 as i → ∞. Let S_n = X_1 + . . . + X_n and ν_n = ES_n/n. Then as n → ∞, S_n/n − ν_n → 0 in L² and in probability.
5.2. The L2 weak law generalizes immediately to certain dependent sequences.
Suppose EXn = 0 and EXn Xm ≤ r(n − m) for m ≤ n (no absolute value on
the lefthand side!) with r(k ) → 0 as k → ∞. Show that (X1 + . . . + Xn )/n → 0
in probability.
5.3. Monte Carlo integration. (i) Let f be a measurable function on [0, 1] with $\int_0^1 |f(x)|\, dx < \infty$. Let U_1, U_2, . . . be independent and uniformly distributed on [0, 1], and let

$$I_n = n^{-1}\big(f(U_1) + \ldots + f(U_n)\big)$$

Show that $I_n \to I \equiv \int_0^1 f\, dx$ in probability. (ii) Suppose $\int_0^1 f(x)^2\, dx < \infty$. Use Chebyshev's inequality to estimate $P(|I_n - I| > a/n^{1/2})$.
5.4. Let X_1, X_2, . . . be i.i.d. with $P(X_i = (-1)^k k) = C/(k^2 \log k)$ for k ≥ 2 where C is chosen to make the sum of the probabilities = 1. Show that E|X_i| = ∞, but there is a finite constant µ so that S_n/n → µ in probability.
5.5. Let X_1, X_2, . . . be i.i.d. with $P(X_i > x) = e/(x \log x)$ for x ≥ e. Show that E|X_i| = ∞, but there is a sequence of constants µ_n → ∞ so that S_n/n − µ_n → 0 in probability.
5.6. (i) Show that if X ≥ 0 is integer valued then $EX = \sum_{n \ge 1} P(X \ge n)$. (ii) Find a similar expression for EX².

5.7. Generalize (5.7) to conclude that if $H(x) = \int_{(-\infty, x]} h(y)\, dy$ with h(y) ≥ 0, then

$$E\,H(X) = \int_{-\infty}^{\infty} h(y)\, P(X \ge y)\, dy$$

An important special case is H(x) = exp(θx) with θ > 0.
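Exercise 5.6(i) can be checked on a concrete distribution. This sketch is not part of the text; the geometric law on {1, 2, . . .} with p = 0.3 and the truncation point N are arbitrary choices (the tail beyond N is negligible).

```python
# Geometric on {1, 2, ...}: P(X = k) = (1 - p)^(k - 1) p, so EX = 1/p.
p, N = 0.3, 500

pmf = [(1 - p) ** (k - 1) * p for k in range(1, N + 1)]
EX = sum(k * q for k, q in zip(range(1, N + 1), pmf))

# Exercise 5.6(i): EX should equal sum_{n >= 1} P(X >= n).
tail_sum = sum(sum(pmf[n - 1:]) for n in range(1, N + 1))

print(EX, tail_sum)  # both essentially 1/p = 3.333...
```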
5.8. An unfair "fair game." Let $p_k = 1/(2^k k(k+1))$, k = 1, 2, . . . and $p_0 = 1 - \sum_{k \ge 1} p_k$. Show that

$$\sum_{k=1}^{\infty} 2^k p_k = \Big(1 - \frac{1}{2}\Big) + \Big(\frac{1}{2} - \frac{1}{3}\Big) + \ldots = 1$$

so if we let X_1, X_2, . . . be i.i.d. with P(X_n = −1) = p_0 and

$$P(X_n = 2^k - 1) = p_k \quad\text{for } k \ge 1$$

then EX_n = 0. Let S_n = X_1 + . . . + X_n. Use (5.5) with $b_n = 2^{m(n)}$ where $m(n) = \min\{m : 2^{-m} m^{-3/2} \le n^{-1}\}$ to conclude that

$$S_n/(n/\log_2 n) \to -1 \quad\text{in probability}$$
5.9. Weak law for positive variables. Suppose X_1, X_2, . . . are i.i.d., P(0 ≤ X_i < ∞) = 1 and P(X_i > x) > 0 for all x. Let $\mu(s) = \int_0^s x\, dF(x)$ and $\nu(s) = \mu(s)/\big(s(1 - F(s))\big)$. It is known that there exist constants a_n so that S_n/a_n → 1 in probability, if and only if ν(s) → ∞ as s → ∞. Pick b_n ≥ 1 so that nµ(b_n) = b_n (this works for large n), and use (5.5) to prove that the condition is sufficient.

1.6. Borel-Cantelli Lemmas
If A_n is a sequence of subsets of Ω, we let

$$\limsup A_n = \lim_{m\to\infty} \cup_{n=m}^{\infty} A_n = \{\omega \text{ that are in infinitely many } A_n\}$$

(the limit exists since the sequence is decreasing in m) and let

$$\liminf A_n = \lim_{m\to\infty} \cap_{n=m}^{\infty} A_n = \{\omega \text{ that are in all but finitely many } A_n\}$$

(the limit exists since the sequence is increasing in m). The names lim sup and lim inf can be explained by noting that

$$\limsup_{n\to\infty} 1_{A_n} = 1_{(\limsup A_n)} \qquad\qquad \liminf_{n\to\infty} 1_{A_n} = 1_{(\liminf A_n)}$$

It is common to write lim sup A_n = {ω : ω ∈ A_n i.o.}, where i.o. stands for infinitely often. An example which illustrates the use of this notation is: "X_n → 0 a.s. if and only if for all ε > 0, P(|X_n| > ε i.o.) = 0." The reader will see many other examples below. The next result should be familiar from measure theory even though its name may not be.
(6.1) Borel-Cantelli lemma. If $\sum_{n=1}^{\infty} P(A_n) < \infty$ then P(A_n i.o.) = 0.

Proof Let $N = \sum_k 1_{A_k}$ be the number of events that occur. Fubini's theorem implies $EN = \sum_k P(A_k) < \infty$, so we must have N < ∞ a.s.

The next result is a typical application of the Borel-Cantelli lemma.
(6.2) Theorem. X_n → X in probability if and only if for every subsequence X_{n(m)} there is a further subsequence X_{n(m_k)} that converges almost surely to X.

Proof Let ε_k be a sequence of positive numbers that ↓ 0. For each k, there is an n(m_k) > n(m_{k−1}) so that $P(|X_{n(m_k)} - X| > \varepsilon_k) \le 2^{-k}$. Since

$$\sum_{k=1}^{\infty} P(|X_{n(m_k)} - X| > \varepsilon_k) < \infty$$

the Borel-Cantelli lemma implies $P(|X_{n(m_k)} - X| > \varepsilon_k \text{ i.o.}) = 0$, i.e., X_{n(m_k)} → X a.s. To prove the second conclusion, we note that if for every subsequence X_{n(m)} there is a further subsequence X_{n(m_k)} that converges almost surely to X then we can apply the next lemma to the sequence of numbers $y_n = P(|X_n - X| > \delta)$ for any δ > 0 to get the desired result.
(6.3) Lemma. Let yn be a sequence of elements of a topological space. If every
subsequence yn(m) has a further subsequence yn(mk ) that converges to y then
yn → y .
Proof If y_n does not converge to y then there is an open set G containing y and a subsequence y_{n(m)} with y_{n(m)} ∉ G for all m, but clearly no subsequence of y_{n(m)} converges to y.
Remark. Since there is a sequence of random variables that converges in
probability but not a.s. (for an example, see Exercises 6.14 or 6.15), it follows
from (6.3) that a.s. convergence does not come from a metric, or even from a
topology. Exercises 6.4 and 6.5 will give a metric for convergence in probability,
and show that the space of random variables is a complete space under this
metric.
(6.2) allows us to upgrade convergence in probability to convergence almost
surely. An example of the usefulness of this is
(6.4) Corollary. If f is continuous and Xn → X in probability then f (Xn ) →
f (X ) in probability. If, in addition, f is bounded then Ef (Xn ) → Ef (X ).
Proof If Xn(m) is a subsequence then (6.2) implies there is a further subsequence Xn(mk ) → X almost surely. Since f is continuous, Exercise 2.3 implies f (Xn(mk ) ) → f (X ) almost surely and (6.2) implies f (Xn ) → f (X ) in
probability. If f is bounded then the bounded convergence theorem implies
Ef (Xn(mk ) ) → Ef (X ), and applying (6.3) to yn = Ef (Xn ) gives the desired
result.
Exercise 6.1. Prove the first result in (6.4) directly from the definition.
Exercise 6.2. Fatou’s lemma. Suppose Xn ≥ 0 and Xn → X in probability.
Show that lim inf n→∞ EXn ≥ EX .
Exercise 6.3. Dominated convergence. Suppose X_n → X in probability and (a) |X_n| ≤ Y with EY < ∞ or (b) there is a continuous function g with g(x) > 0 for large x with |x|/g(x) → 0 as |x| → ∞ so that Eg(X_n) ≤ C < ∞ for all n. Show that EX_n → EX.
Exercise 6.4. Show (a) that $d(X, Y) = E\big(|X - Y|/(1 + |X - Y|)\big)$ defines a metric on the set of random variables, i.e., (i) d(X, Y) = 0 if and only if X = Y a.s., (ii) d(X, Y) = d(Y, X), (iii) d(X, Z) ≤ d(X, Y) + d(Y, Z), and (b) that d(X_n, X) → 0 as n → ∞ if and only if X_n → X in probability.
Exercise 6.5. Show that random variables are a complete space under the
metric deﬁned in the previous exercise, i.e., if d(Xm , Xn ) → 0 whenever m,
n → ∞ then there is a r.v. X∞ so that Xn → X∞ in probability.
As our second application of the BorelCantelli lemma, we get our ﬁrst
strong law of large numbers:
(6.5) Theorem. Let X1 , X2 , . . . be i.i.d. with EXi = µ and EXi4 < ∞. If
Sn = X1 + · · · + Xn then Sn /n → µ a.s.
Proof By letting $X_i' = X_i - \mu$, we can suppose without loss of generality that µ = 0. Now

$$ES_n^4 = E\Big(\sum_{i=1}^{n} X_i\Big)^4 = E \sum_{1 \le i,j,k,\ell \le n} X_i X_j X_k X_\ell$$

Terms in the sum of the form $E(X_i^3 X_j)$, $E(X_i^2 X_j X_k)$, and $E(X_i X_j X_k X_\ell)$ are 0 (if i, j, k, ℓ are distinct) since the expectation of the product is the product of the expectations, and in each case one of the terms has expectation 0. The only terms that do not vanish are those of the form $EX_i^4$ and $EX_i^2 X_j^2 = (EX_i^2)^2$. There are n and 3n(n − 1) of these terms, respectively. (In the second case we can pick the two indices in n(n − 1)/2 ways, and with the indices fixed, the term can arise in a total of 6 ways.) The last observation implies

$$ES_n^4 = n\, EX_1^4 + 3(n^2 - n)(EX_1^2)^2 \le Cn^2$$

where C < ∞. Chebyshev's inequality gives us

$$P(|S_n| > n\varepsilon) \le E(S_n^4)/(n\varepsilon)^4 \le C/(n^2 \varepsilon^4)$$

Summing on n and using the Borel-Cantelli lemma gives P(|S_n| > nε i.o.) = 0. Since ε is arbitrary, the proof is complete.
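A simulation sketch of (6.5) (not part of the text; the ±1 coin flips, path length, and observation window are arbitrary choices): for ±1 flips, which have mean 0 and finite fourth moment, a single long path of running averages S_n/n stays uniformly close to 0, which is what almost sure convergence looks like along one trajectory.

```python
import random

random.seed(11)

N = 200_000
s, worst_late = 0, 0.0

# Track the worst deviation of S_n / n over the second half of one path.
for n in range(1, N + 1):
    s += random.choice((-1, 1))
    if n > N // 2:
        worst_late = max(worst_late, abs(s / n))

print(worst_late)  # small: the running averages have settled near 0
```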
The converse of the Borel-Cantelli lemma is trivially false.

Example 6.1. Let Ω = (0, 1), F = Borel sets, P = Lebesgue measure. If A_n = (0, a_n) where a_n → 0 as n → ∞ then lim sup A_n = ∅, but if a_n ≥ 1/n, we have $\sum a_n = \infty$.
The example just given suggests that for general sets we cannot say much more
than the next result.
Exercise 6.6. Prove that P(lim sup A_n) ≥ lim sup P(A_n) and

    P(lim inf A_n) ≤ lim inf P(A_n)

For independent events, however, the necessary condition Σ P(A_n) = ∞ for P(lim sup A_n) > 0 is sufficient for P(lim sup A_n) = 1.
(6.6) The second Borel-Cantelli lemma. If the events A_n are independent then Σ P(A_n) = ∞ implies P(A_n i.o.) = 1.

Proof  Let M < N < ∞. Independence and 1 − x ≤ e^{−x} imply

    P( ∩_{n=M}^N A_n^c ) = Π_{n=M}^N (1 − P(A_n)) ≤ Π_{n=M}^N exp(−P(A_n))
                         = exp( − Σ_{n=M}^N P(A_n) ) → 0  as N → ∞

So P(∪_{n=M}^∞ A_n) = 1 for all M, and since ∪_{n=M}^∞ A_n ↓ lim sup A_n it follows that P(lim sup A_n) = 1.
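A quick simulation illustrates the conclusion. The events below are an illustrative choice with P(A_n) = 1/n, so the probabilities sum to ∞ and (6.6) says infinitely many A_n occur a.s.:

```python
import random

random.seed(7)

# Independent events A_n with P(A_n) = 1/n: the hit count up to N should
# keep growing (its mean is the harmonic sum ~ log N).
N = 100_000
occurrences = sum(1 for n in range(1, N + 1) if random.random() < 1 / n)
expected = sum(1 / n for n in range(1, N + 1))  # harmonic sum ~ log N + 0.577
print(occurrences, round(expected, 2))
```

With N = 10^5 the expected count is about 12.1; by contrast, Σ 1/n^2 < ∞ would force only finitely many occurrences.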
A typical application of the second Borel-Cantelli lemma is:

(6.7) Theorem. If X_1, X_2, ... are i.i.d. with E|X_i| = ∞, then P(|X_n| ≥ n i.o.) = 1. So if S_n = X_1 + · · · + X_n then P(lim S_n/n exists ∈ (−∞, ∞)) = 0.

Proof  From (5.7), we get

    E|X_1| = ∫_0^∞ P(|X_1| > x) dx ≤ Σ_{n=0}^∞ P(|X_1| > n)
Since E|X_1| = ∞ and X_1, X_2, ... are i.i.d., it follows from the second Borel-Cantelli lemma that P(|X_n| ≥ n i.o.) = 1. To prove the second claim, observe that

    S_{n+1}/(n+1) − S_n/n = X_{n+1}/(n+1) − S_n/(n(n+1))

and on C ≡ {ω : lim_{n→∞} S_n/n exists ∈ (−∞, ∞)}, S_n/(n(n+1)) → 0. So, on C ∩ {ω : |X_n| ≥ n i.o.}, we have

    |S_{n+1}/(n+1) − S_n/n| > 2/3  i.o.

contradicting the fact that ω ∈ C. From the last observation, we conclude that

    {ω : |X_n| ≥ n i.o.} ∩ C = ∅

and since P(|X_n| ≥ n i.o.) = 1, it follows that P(C) = 0.
(6.7) shows that E|X_i| < ∞ is necessary for the strong law of large numbers. The reader will have to wait until (7.1) to see that this condition is also sufficient. The next result extends the second Borel-Cantelli lemma and sharpens its conclusion.
(6.8) Theorem. If A_1, A_2, ... are pairwise independent and Σ_{n=1}^∞ P(A_n) = ∞, then as n → ∞

    Σ_{m=1}^n 1_{A_m} / Σ_{m=1}^n P(A_m) → 1  a.s.

Proof  Let X_m = 1_{A_m} and let S_n = X_1 + · · · + X_n. Since the A_m are pairwise independent, the X_m are uncorrelated and hence (5.1) implies

    var(S_n) = var(X_1) + · · · + var(X_n)

(3.10b) and the fact that X_m ∈ {0, 1} imply var(X_m) ≤ E(X_m^2) = E(X_m), so var(S_n) ≤ E(S_n). Chebyshev's inequality implies

    (∗)  P(|S_n − ES_n| > δES_n) ≤ var(S_n)/(δES_n)^2 ≤ 1/(δ^2 ES_n) → 0

as n → ∞, since the hypothesis Σ P(A_n) = ∞ means ES_n → ∞.
The last computation shows that S_n/ES_n → 1 in probability. To get almost sure convergence, we have to take subsequences. Let n_k = inf{n : ES_n ≥ k^2}. Let T_k = S_{n_k} and note that the definition and EX_m ≤ 1 imply k^2 ≤ ET_k ≤ k^2 + 1. Replacing n by n_k in (∗) and using ET_k ≥ k^2 shows

    P(|T_k − ET_k| > δET_k) ≤ 1/(δ^2 k^2)

So Σ_{k=1}^∞ P(|T_k − ET_k| > δET_k) < ∞, and the Borel-Cantelli lemma implies P(|T_k − ET_k| > δET_k i.o.) = 0. Since δ is arbitrary, it follows that T_k/ET_k → 1 a.s. To show S_n/ES_n → 1 a.s., pick an ω so that T_k(ω)/ET_k → 1 and observe that if n_k ≤ n < n_{k+1} then

    T_k(ω)/ET_{k+1} ≤ S_n(ω)/ES_n ≤ T_{k+1}(ω)/ET_k

To show that the terms at the left and right ends → 1, we rewrite the last inequalities as

    (ET_k/ET_{k+1}) · (T_k(ω)/ET_k) ≤ S_n(ω)/ES_n ≤ (T_{k+1}(ω)/ET_{k+1}) · (ET_{k+1}/ET_k)

From this, we see it is enough to show ET_{k+1}/ET_k → 1, but this follows from

    k^2 ≤ ET_k ≤ ET_{k+1} ≤ (k+1)^2 + 1

and the fact that {(k+1)^2 + 1}/k^2 = 1 + 2/k + 2/k^2 → 1.
The moral of the proof of (6.8) is that if you want to show that Xn /cn → 1
a.s. for sequences cn , Xn ≥ 0 that are increasing, it is enough to prove the
result for a subsequence n(k ) that has cn(k+1) /cn(k) → 1. For practice with this
technique, try the following.
Exercise 6.7. Let 0 ≤ X_1 ≤ X_2 ≤ ... be random variables with EX_n ∼ an^α with a, α > 0, and var(X_n) ≤ Bn^β with β < 2α. Show that X_n/n^α → a a.s.

Exercise 6.8. Let X_n be independent Poisson r.v.'s with EX_n = λ_n, and let S_n = X_1 + · · · + X_n. Show that if Σ λ_n = ∞ then S_n/ES_n → 1 a.s.
Example 6.2. Record values. Let X1 , X2 , . . . be a sequence of random
variables and think of Xk as the distance for an individual’s k th high jump or
shotput toss so that Ak = {Xk > supj<k Xj } is the event that a record occurs
at time k. Ignoring the fact that an athlete's performance may get better with
more experience or that injuries may occur, we will suppose that X1 , X2 , . . .
are i.i.d. with a distribution F (x) that is continuous. Even though it may seem
that the occurrence of a record at time k will make it less likely that one will
occur at time k + 1, we
Claim. The Ak are independent with P (Ak ) = 1/k .
To prove this, we start by observing that since F is continuous, P(X_j = X_k) = 0 for any j ≠ k (see Exercise 4.7), so we can let Y_1^n > Y_2^n > · · · > Y_n^n be the random variables X_1, ..., X_n put into decreasing order and define a random permutation of {1, ..., n} by π_n(i) = j if X_i = Y_j^n, i.e., if the ith random variable has rank j. Since the distribution of (X_1, ..., X_n) is not affected by changing the order of the random variables, it is easy to see:
(a) The permutation πn is uniformly distributed over the set of n! possibilities.
Proof of (a) This is “obvious” by symmetry, but if one wants to hear more,
we can argue as follows. Let πn be the permutation induced by (X1 , . . . , Xn ),
and let σn be a randomly chosen permutation of {1, . . . , n} independent of the
X sequence. Then we can say two things about the permutation induced by
(Xσ(1) , . . . , Xσ(n) ): (i) it is πn ◦ σn , and (ii) it has the same distribution as πn .
The desired result now follows by noting that for any permutation π, π ◦ σ_n is uniform over the n! possibilities.
Once you believe (a), the rest is easy:
(b) P(A_n) = P(π_n(n) = 1) = 1/n.

(c) If m < n and i_{m+1}, ..., i_n are distinct elements of {1, ..., n} then

    P(A_m | π_n(j) = i_j for m + 1 ≤ j ≤ n) = 1/m

Intuitively, this is true since if we condition on the ranks of X_{m+1}, ..., X_n then this determines the set of ranks available for X_1, ..., X_m, but all possible orderings of the ranks are equally likely and hence there is probability 1/m that the smallest rank will end up at m.

Proof of (c)  If we let σ_m be a randomly chosen permutation of {1, ..., m} then (i) π_n ◦ σ_m has the same distribution as π_n, and (ii) since the application of σ_m randomly rearranges π_n(1), ..., π_n(m), the desired result follows.

If we let m_1 < m_2 < ... < m_k then it follows from (c) that

    P(A_{m_1} | A_{m_2} ∩ ... ∩ A_{m_k}) = P(A_{m_1})

and the claim follows by induction.
From (6.8) and the by now familiar fact that Σ_{m=1}^n 1/m ∼ log n, it follows that if R_n = Σ_{m=1}^n 1_{A_m} is the number of records at time n then as n → ∞,

    (6.9)  R_n/log n → 1  a.s.

The reader should note that the last result is independent of the distribution F (as long as it is continuous).
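A short Python sketch of (6.9); the record counter below is a hypothetical helper, not part of the text:

```python
import math
import random

random.seed(0)

def count_records(xs):
    """Count indices where x exceeds every earlier value (a 'record')."""
    best = float("-inf")
    records = 0
    for x in xs:
        if x > best:
            best, records = x, records + 1
    return records

# For i.i.d. continuous draws, ER_n = 1 + 1/2 + ... + 1/n ~ log n, per (6.9).
n = 100_000
r = count_records([random.random() for _ in range(n)])
print(r, round(math.log(n), 2))
```

Because R_n grows only like log n, even 10^5 trials typically produce on the order of a dozen records, matching the claim that records become rare.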
Remark. Let X_1, X_2, ... be i.i.d. with a distribution that is continuous. Let Y_i be the number of j ≤ i with X_j > X_i. It follows from (a) that the Y_i are independent random variables with P(Y_i = j) = 1/i for 0 ≤ j ≤ i − 1.
Comic relief. Let X_0, X_1, ... be i.i.d. and imagine they are the offers you get for a car you are going to sell. Let N = inf{n ≥ 1 : X_n > X_0}. Symmetry implies P(N > n) ≥ 1/(n + 1). (When the distribution is continuous this probability is exactly 1/(n + 1), but our distribution now is general and ties go to the first person who calls.) Using Exercise 5.6 now:

    EN = Σ_{n=0}^∞ P(N > n) ≥ Σ_{n=0}^∞ 1/(n + 1) = ∞

so the expected time you have to wait until you get an offer better than the first one is ∞. To avoid lawsuits, let me hasten to add that I am not suggesting that you should take the first offer you get!
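The identity P(N > n) = 1/(n + 1) for a continuous distribution is easy to test by simulation (first_better below is an illustrative helper, not from the text):

```python
import random

random.seed(1)

def first_better(offers):
    """Index of the first offer strictly better than offers[0], or None."""
    for n in range(1, len(offers)):
        if offers[n] > offers[0]:
            return n
    return None

# For a continuous distribution, P(N > 10) = P(the first offer is the best
# of the 11 values X_0, ..., X_10) = 1/11 by symmetry.
trials = 20_000
misses = sum(1 for _ in range(trials)
             if first_better([random.random() for _ in range(11)]) is None)
print(misses / trials)  # near 1/11 ~ 0.0909
```

The slow 1/(n + 1) decay of P(N > n) is exactly what makes EN = Σ P(N > n) diverge.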
Example 6.3. Head runs. Let X_n, n ∈ Z, be i.i.d. with P(X_n = 1) = P(X_n = −1) = 1/2. Let ℓ_n = max{m : X_{n−m+1} = ... = X_n = 1} be the length of the run of +1's at time n, and let L_n = max_{1≤m≤n} ℓ_m be the longest run at time n. We use a two-sided sequence so that for all n, P(ℓ_n = k) = (1/2)^{k+1} for k ≥ 0. Since ℓ_1 < ∞, the result we are going to prove,

    (6.10)  L_n/log_2 n → 1  a.s.

is also true for a one-sided sequence. To prove (6.10), we begin by observing

    P(ℓ_n ≥ (1 + ε) log_2 n) ≤ n^{−(1+ε)}

for any ε > 0, so it follows from the Borel-Cantelli lemma that ℓ_n ≤ (1 + ε) log_2 n for n ≥ N. Since ε is arbitrary, it follows that

    lim sup_{n→∞} L_n/log_2 n ≤ 1  a.s.

To get a result in the other direction, we break the first n trials into disjoint blocks of length [(1 − ε) log_2 n] + 1, on which the variables are all 1 with probability 2^{−[(1−ε) log_2 n]−1} ≥ n^{−(1−ε)}/2, to conclude that if n is large enough so that [n/{[(1 − ε) log_2 n] + 1}] ≥ n/log_2 n then

    P(L_n ≤ (1 − ε) log_2 n) ≤ (1 − n^{−(1−ε)}/2)^{n/log_2 n} ≤ exp(−n^ε/(2 log_2 n))

which is summable, so the Borel-Cantelli lemma implies

    lim inf_{n→∞} L_n/log_2 n ≥ 1  a.s.

Exercise 6.9. Show that lim sup_{n→∞} ℓ_n/log_2 n = 1 a.s. and lim inf_{n→∞} ℓ_n = 0 a.s.

Exercises
6.10. If Xn is any sequence of random variables, there are constants cn → ∞
so that Xn /cn → 0 a.s.
6.11. (i) If P(A_n) → 0 and Σ_{n=1}^∞ P(A_n^c ∩ A_{n+1}) < ∞ then P(A_n i.o.) = 0. (ii) Find an example of a sequence A_n to which the result in (i) can be applied but the Borel-Cantelli lemma cannot.
6.12. Let An be a sequence of independent events with P (An ) < 1 for all n.
Show that P (∪An ) = 1 implies P (An i.o.) = 1.
6.13. Let X_1, X_2, ... be independent. Show that sup X_n < ∞ a.s. if and only if Σ_n P(X_n > A) < ∞ for some A.

6.14. Let X_1, X_2, ... be independent with P(X_n = 1) = p_n and P(X_n = 0) = 1 − p_n. Show that (i) X_n → 0 in probability if and only if p_n → 0, and (ii) X_n → 0 a.s. if and only if Σ p_n < ∞.
6.15. Let Y1 , Y2 , . . . be i.i.d. Find necessary and suﬃcient conditions for
(i) Yn /n → 0 almost surely, (ii) (maxm≤n Ym )/n → 0 almost surely,
(iii) (maxm≤n Ym )/n → 0 in probability, and (iv) Yn /n → 0 in probability.
6.16. The last two exercises give examples with Xn → X in probability without
Xn → X a.s. There is one situation in which the two notions are equivalent.
Let X1 , X2 , . . . be a sequence of r.v.’s on (Ω, F , P ) where Ω is a countable set
and F consists of all subsets of Ω. Show that Xn → X in probability implies
Xn → X a.s.
6.17. Show that if Xn is the outcome of the nth play of the St. Petersburg
game (Example 5.7) then lim supn→∞ Xn /(n log2 n) = ∞ a.s. and hence the
same result holds for Sn . This shows that the convergence Sn /(n log2 n) → 1
in probability proved in Section 5 does not occur a.s.
6.18. Let X1 , X2 , . . . be i.i.d. with P (Xi > x) = e−x , let Mn = max1≤m≤n Xm .
Show that (i) lim supn→∞ Xn / log n = 1 a.s. and (ii) Mn / log n → 1 a.s.
6.19. Let X_1, X_2, ... be i.i.d. with distribution F, let λ_n ↑ ∞, and let A_n = {max_{1≤m≤n} X_m > λ_n}. Show that P(A_n i.o.) = 0 or 1 according as Σ_{n≥1} (1 − F(λ_n)) < ∞ or = ∞.

6.20. Kochen-Stone lemma. Suppose Σ P(A_k) = ∞. Use Exercises 3.8 and 6.6 to show that if

    lim sup_{n→∞} ( Σ_{k=1}^n P(A_k) )^2 / Σ_{1≤j,k≤n} P(A_j ∩ A_k) = α > 0

then P(A_n i.o.) ≥ α. The case α = 1 contains (6.6).

1.7. Strong Law of Large Numbers
We are now ready to give Etemadi’s (1981) proof of
(7.1) Strong law of large numbers. Let X_1, X_2, ... be pairwise independent identically distributed random variables with E|X_i| < ∞. Let EX_i = µ and S_n = X_1 + ... + X_n. Then S_n/n → µ a.s. as n → ∞.

Proof  As in the proof of the weak law of large numbers, we begin by truncating.

(a) Lemma. Let Y_k = X_k 1_{(|X_k| ≤ k)} and T_n = Y_1 + · · · + Y_n. It is sufficient to prove that T_n/n → µ a.s.

Proof  Σ_{k=1}^∞ P(|X_k| > k) ≤ ∫_0^∞ P(|X_1| > t) dt = E|X_1| < ∞, so P(X_k ≠ Y_k i.o.) = 0. This shows that |S_n(ω) − T_n(ω)| ≤ R(ω) < ∞ a.s. for all n, from which the desired result follows.

The second step is not so intuitive, but it is an important part of this proof and
the one given in Section 1.8.
(b) Lemma. Σ_{k=1}^∞ var(Y_k)/k^2 ≤ 4E|X_1| < ∞.

Proof  To bound the sum, we observe

    var(Y_k) ≤ E(Y_k^2) = ∫_0^∞ 2y P(|Y_k| > y) dy ≤ ∫_0^k 2y P(|X_1| > y) dy

so using Fubini's theorem (since everything is ≥ 0 and the sum is just an integral with respect to counting measure on {1, 2, ...})

    Σ_{k=1}^∞ E(Y_k^2)/k^2 ≤ Σ_{k=1}^∞ k^{−2} ∫_0^∞ 1_{(y<k)} 2y P(|X_1| > y) dy
                           = ∫_0^∞ ( Σ_{k=1}^∞ k^{−2} 1_{(y<k)} ) 2y P(|X_1| > y) dy

Since E|X_1| = ∫_0^∞ P(|X_1| > y) dy, we can complete the proof by showing

(c) Lemma. If y ≥ 0 then 2y Σ_{k>y} k^{−2} ≤ 4.

Proof  We begin with the observation that if m ≥ 2 then

    Σ_{k≥m} k^{−2} ≤ ∫_{m−1}^∞ x^{−2} dx = (m − 1)^{−1}

When y ≥ 1, the sum starts with k = [y] + 1 ≥ 2, so

    2y Σ_{k>y} k^{−2} ≤ 2y/[y] ≤ 4

since y/[y] ≤ 2 for y ≥ 1 (the worst case being y close to 2). To cover 0 ≤ y < 1, we note that in this case

    2y Σ_{k>y} k^{−2} ≤ 2( 1 + Σ_{k=2}^∞ k^{−2} ) ≤ 4

The first two steps, (a) and (b) above, are standard. Etemadi's inspiration was that since X_n^+, n ≥ 1, and X_n^−, n ≥ 1, satisfy the assumptions of the theorem and X_n = X_n^+ − X_n^−, we can without loss of generality suppose X_n ≥ 0. As in the proof of (6.8), we will prove the result first for a subsequence and then use monotonicity to control the values in between. This time, however, we let α > 1 and k(n) = [α^n]. Chebyshev's inequality implies that if ε > 0

    Σ_{n=1}^∞ P(|T_{k(n)} − ET_{k(n)}| > εk(n)) ≤ ε^{−2} Σ_{n=1}^∞ var(T_{k(n)})/k(n)^2
        = ε^{−2} Σ_{n=1}^∞ k(n)^{−2} Σ_{m=1}^{k(n)} var(Y_m)
        = ε^{−2} Σ_{m=1}^∞ var(Y_m) Σ_{n: k(n)≥m} k(n)^{−2}

where we have used Fubini's theorem to interchange the two summations of nonnegative terms. Now k(n) = [α^n] and [α^n] ≥ α^n/2 for n ≥ 1, so summing the geometric series and noting that the first term is ≤ m^{−2}:

    Σ_{n: α^n ≥ m} [α^n]^{−2} ≤ 4 Σ_{n: α^n ≥ m} α^{−2n} ≤ 4(1 − α^{−2})^{−1} m^{−2}

Combining our computations shows

    Σ_{n=1}^∞ P(|T_{k(n)} − ET_{k(n)}| > εk(n)) ≤ 4(1 − α^{−2})^{−1} ε^{−2} Σ_{m=1}^∞ E(Y_m^2) m^{−2} < ∞

by (b). Since ε is arbitrary, (T_{k(n)} − ET_{k(n)})/k(n) → 0 a.s. The dominated convergence theorem implies EY_k → EX_1 as k → ∞, so ET_{k(n)}/k(n) → EX_1 and we have shown T_{k(n)}/k(n) → EX_1 a.s. To handle the intermediate values, we observe that if k(n) ≤ m < k(n + 1) then

    T_{k(n)}/k(n + 1) ≤ T_m/m ≤ T_{k(n+1)}/k(n)

(here we use Y_i ≥ 0), so recalling k(n) = [α^n], we have k(n + 1)/k(n) → α and

    (1/α) EX_1 ≤ lim inf_{m→∞} T_m/m ≤ lim sup_{m→∞} T_m/m ≤ α EX_1

Since α > 1 is arbitrary, the proof is complete.
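To see (7.1) in action, here is a hedged Python sketch; exponential variables with mean 2 are an arbitrary integrable choice, not dictated by the theorem:

```python
import random

random.seed(3)

# (7.1) numerically: running averages of i.i.d. Exponential(mean 2) draws
# should settle near mu = 2 as n grows.
mu, n = 2.0, 100_000
s = 0.0
snapshots = {}
for i in range(1, n + 1):
    s += random.expovariate(1 / mu)  # Exp with rate 1/2 has mean 2
    if i in (100, 10_000, 100_000):
        snapshots[i] = s / i

print(snapshots)
```

Printing the average at several scales shows the convergence: the later snapshots cluster far more tightly around 2 than the early one.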
The next result shows that the strong law holds whenever EX_i exists.

(7.2) Theorem. Let X_1, X_2, ... be i.i.d. with EX_i^+ = ∞ and EX_i^− < ∞. If S_n = X_1 + · · · + X_n then S_n/n → ∞ a.s.

Proof  Let M > 0 and X_i^M = X_i ∧ M. The X_i^M are i.i.d. with E|X_i^M| < ∞, so if S_n^M = X_1^M + · · · + X_n^M then (7.1) implies S_n^M/n → EX_i^M. Since X_i ≥ X_i^M, it follows that

    lim inf_{n→∞} S_n/n ≥ lim_{n→∞} S_n^M/n = EX_i^M

The monotone convergence theorem implies E(X_i^M)^+ ↑ EX_i^+ = ∞ as M ↑ ∞, so EX_i^M = E(X_i^M)^+ − E(X_i^M)^− ↑ ∞, and we have lim inf_{n→∞} S_n/n = ∞, which is the desired result.
The rest of this section is devoted to applications of the strong law of large
numbers.
Example 7.1. Renewal theory. Let X_1, X_2, ... be i.i.d. with 0 < X_i < ∞. Let T_n = X_1 + ... + X_n and think of T_n as the time of the nth occurrence of some event. For a concrete situation, consider a diligent janitor who replaces a light bulb the instant it burns out. Suppose the first bulb is put in at time 0 and let X_i be the lifetime of the ith light bulb. In this interpretation, T_n is the time the nth light bulb burns out and N_t = sup{n : T_n ≤ t} is the number of light bulbs that have burnt out by time t.
(7.3) Theorem. If EX_1 = µ ≤ ∞ then as t → ∞, N_t/t → 1/µ a.s. (1/∞ = 0)

Proof  By (7.1) and (7.2), T_n/n → µ a.s. From the definition of N_t, it follows that T(N_t) ≤ t < T(N_t + 1), so dividing through by N_t gives

    T(N_t)/N_t ≤ t/N_t ≤ (T(N_t + 1)/(N_t + 1)) · ((N_t + 1)/N_t)

To take the limit, we note that since T_n < ∞ for all n, we have N_t ↑ ∞ as t → ∞. The strong law of large numbers implies that for ω ∈ Ω_0 with P(Ω_0) = 1, we have T_n(ω)/n → µ, N_t(ω) ↑ ∞, and hence

    T_{N_t(ω)}(ω)/N_t(ω) → µ    and    (N_t(ω) + 1)/N_t(ω) → 1

From this it follows that for ω ∈ Ω_0, t/N_t(ω) → µ, which gives the desired result.
The last argument shows that if Xn → X∞ a.s. and N (n) → ∞ a.s. then
XN (n) → X∞ a.s. We have written this out with care because the analogous
result for convergence in probability is false.
Exercise 7.1. Give an example with Xn ∈ {0, 1}, Xn → 0 in probability,
N (n) ↑ ∞ a.s., and XN (n) → 1 a.s.
Exercise 7.2. Lazy janitor. Suppose the ith light bulb burns for an amount
of time Xi and then remains burned out for time Yi before being replaced. Suppose the Xi , Yi are positive and independent with the X ’s having distribution
F and the Y ’s having distribution G, both of which have ﬁnite mean. Let Rt
be the amount of time in [0, t] that we have a working light bulb. Show that
Rt /t → EXi /(EXi + EYi ) almost surely.
Example 7.2. Empirical distribution functions. Let X_1, X_2, ... be i.i.d. with distribution F and let

    F_n(x) = n^{−1} Σ_{m=1}^n 1_{(X_m ≤ x)}

F_n(x) is the observed frequency of values that are ≤ x, hence the name given above. The next result shows that F_n converges uniformly to F as n → ∞.
(7.4) The Glivenko-Cantelli theorem. As n → ∞,

    sup_x |F_n(x) − F(x)| → 0  a.s.

Proof  Fix x and let Y_n = 1_{(X_n ≤ x)}. Since the Y_n are i.i.d. with EY_n = P(X_n ≤ x) = F(x), the strong law of large numbers implies that F_n(x) = n^{−1} Σ_{m=1}^n Y_m → F(x) a.s. In general, if F_n is a sequence of nondecreasing functions that converges pointwise to a bounded and continuous limit F then sup_x |F_n(x) − F(x)| → 0. However, the distribution function F(x) may have jumps, so we have to work a little harder.

Again, fix x and let Z_n = 1_{(X_n < x)}. Since the Z_n are i.i.d. with EZ_n = P(X_n < x) = F(x−) = lim_{y↑x} F(y), the strong law of large numbers implies that F_n(x−) = n^{−1} Σ_{m=1}^n Z_m → F(x−) a.s. For 1 ≤ j ≤ k − 1 let x_{j,k} = inf{y : F(y) ≥ j/k}. The pointwise convergence of F_n(x) and F_n(x−) implies that we can pick N_k(ω) so that if n ≥ N_k(ω) then

    |F_n(x_{j,k}) − F(x_{j,k})| < k^{−1}    and    |F_n(x_{j,k}−) − F(x_{j,k}−)| < k^{−1}

for 1 ≤ j ≤ k − 1. If we let x_{0,k} = −∞ and x_{k,k} = ∞, then the last two inequalities hold for j = 0 or k. If x ∈ (x_{j−1,k}, x_{j,k}) with 1 ≤ j ≤ k and n ≥ N_k(ω), then using the monotonicity of F_n and F, and F(x_{j,k}−) − F(x_{j−1,k}) ≤ k^{−1}, we have

    F_n(x) ≤ F_n(x_{j,k}−) ≤ F(x_{j,k}−) + k^{−1} ≤ F(x_{j−1,k}) + 2k^{−1} ≤ F(x) + 2k^{−1}
    F_n(x) ≥ F_n(x_{j−1,k}) ≥ F(x_{j−1,k}) − k^{−1} ≥ F(x_{j,k}−) − 2k^{−1} ≥ F(x) − 2k^{−1}

so sup_x |F_n(x) − F(x)| ≤ 2k^{−1}, and we have proved the result.
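For a continuous F this is easy to check numerically. The sketch below computes sup_x |F_n(x) − F(x)| exactly for Uniform(0,1) data, using the standard sorted-sample formula (the supremum is attained at the jump points of F_n):

```python
import random

random.seed(9)

def ks_uniform(sample):
    """Exact sup_x |F_n(x) - F(x)| against F(x) = x on (0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    # At the m-th order statistic, F_n jumps from m/n to (m+1)/n.
    return max(max((m + 1) / n - x, x - m / n) for m, x in enumerate(xs))

d100 = ks_uniform([random.random() for _ in range(100)])
d10000 = ks_uniform([random.random() for _ in range(10_000)])
print(d100, d10000)  # the second is typically much smaller
```

The sup-distance shrinks roughly like n^{−1/2}, consistent with F_n → F uniformly.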
Example 7.3. Shannon's theorem. Let X_1, X_2, ... ∈ {1, ..., r} be independent with P(X_i = k) = p(k) > 0 for 1 ≤ k ≤ r. Here we are thinking of 1, ..., r as the letters of an alphabet, and X_1, X_2, ... are the successive letters produced by an information source. In this i.i.d. case, it is the proverbial monkey at a typewriter. Let π_n(ω) = p(X_1(ω)) · · · p(X_n(ω)) be the probability of the realization we observed in the first n trials. Since log π_n(ω) is a sum of independent random variables, it follows from the strong law of large numbers that

    −n^{−1} log π_n(ω) → H ≡ − Σ_{k=1}^r p(k) log p(k)  a.s.

The constant H is called the entropy of the source and is a measure of how random it is. The last result is the asymptotic equipartition property: If ε > 0 then as n → ∞

    P{ exp(−n(H + ε)) ≤ π_n(ω) ≤ exp(−n(H − ε)) } → 1

We will give a more general version of this result in (5.1) of Chapter 6.
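A small simulation makes the asymptotic equipartition property concrete; the three-letter source below is an arbitrary toy choice:

```python
import math
import random

random.seed(11)

# AEP sketch: -(1/n) log pi_n should concentrate near the entropy H.
p = {1: 0.5, 2: 0.25, 3: 0.25}
H = -sum(q * math.log(q) for q in p.values())  # ~ 1.0397 nats

n = 100_000
letters = random.choices(list(p), weights=list(p.values()), k=n)
estimate = -sum(math.log(p[x]) for x in letters) / n
print(estimate, round(H, 4))
```

The estimate is an average of n i.i.d. terms −log p(X_i) with mean H, so the strong law does exactly what the example describes.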
Exercises
7.3. Let X_0 = (1, 0) and define X_n ∈ R^2 inductively by declaring that X_{n+1} is chosen at random from the ball of radius |X_n| centered at the origin, i.e., X_{n+1}/|X_n| is uniformly distributed on the ball of radius 1 and independent of X_1, ..., X_n. Prove that n^{−1} log |X_n| → c a.s. and compute c.

7.4. Investment problem. We assume that at the beginning of each year you can buy bonds for $1 that are worth $a at the end of the year, or stocks that are worth a random amount V ≥ 0. If you always invest a fixed proportion p of your wealth in bonds, then your wealth at the end of year n + 1 is W_{n+1} = (ap + (1 − p)V_n)W_n. Suppose V_1, V_2, ... are i.i.d. with EV_n^2 < ∞ and E(V_n^{−2}) < ∞. (i) Show that n^{−1} log W_n → c(p) a.s. (ii) Show that c(p) is concave. [Use (9.1) in the Appendix to justify differentiating under the expected value.] (iii) By investigating c′(0) and c′(1), give conditions on V that guarantee that the optimal choice of p is in (0, 1). (iv) Suppose P(V = 1) = P(V = 4) = 1/2. Find the optimal p as a function of a.

*1.8. Convergence of Random Series
In this section, we will pursue a second approach to the strong law of large
numbers based on the convergence of random series. This approach has the
advantage that it leads to estimates on the rate of convergence under moment
assumptions, (8.7) and (8.8), and to a negative result for the inﬁnite mean
case, (8.9), which is stronger than the one in (6.7). The ﬁrst two results in this
section are of considerable interest in their own right, although we will see more
general versions in (1.1) of Chapter 3 and (4.2) of Chapter 4.
To state the first result, we need some notation. Let F_n' = σ(X_n, X_{n+1}, ...) = the future after time n = the smallest σ-field with respect to which all the X_m, m ≥ n, are measurable. Let T = ∩_n F_n' = the remote future, or tail σ-field. Intuitively, A ∈ T if and only if changing a finite number of values does not affect the occurrence of the event. As usual, we turn to examples to help explain the definition.
Example 8.1. If B_n ∈ R then {X_n ∈ B_n i.o.} ∈ T. If we let X_n = 1_{A_n} and B_n = {1}, this example becomes {A_n i.o.}.

Example 8.2. Let S_n = X_1 + ... + X_n. It is easy to check that

    {lim_{n→∞} S_n exists} ∈ T
    {lim sup_{n→∞} S_n/c_n > x} ∈ T if c_n → ∞

(Note, however, that {lim sup_{n→∞} S_n > 0} need not be in T: changing X_1 shifts every S_n by the same amount.)
The next result shows that all examples are trivial.
(8.1) Kolmogorov's 0-1 law. If X_1, X_2, ... are independent and A ∈ T then P(A) = 0 or 1.
Proof We will show that A is independent of itself, that is, P (A ∩ A) =
P (A)P (A), so P (A) = P (A)2 , and hence P (A) = 0 or 1. We will sneak up on
this conclusion in two steps:
(a) A ∈ σ(X_1, ..., X_k) and B ∈ σ(X_{k+1}, X_{k+2}, ...) are independent.

Proof of (a)  If B ∈ σ(X_{k+1}, ..., X_{k+j}) for some j, this follows from (4.5). Since σ(X_1, ..., X_k) and ∪_j σ(X_{k+1}, ..., X_{k+j}) are π-systems that contain Ω, (a) follows from (4.2).

(b) A ∈ σ(X_1, X_2, ...) and B ∈ T are independent.

Proof of (b)  Since T ⊂ σ(X_{k+1}, X_{k+2}, ...), if A ∈ σ(X_1, ..., X_k) for some k, this follows from (a). ∪_k σ(X_1, ..., X_k) and T are π-systems that contain Ω, so (b) follows from (4.2).

Since T ⊂ σ(X_1, X_2, ...), (b) implies that an A ∈ T is independent of itself, and (8.1) follows.
If A1 , A2 , . . . are independent then (8.1) implies P (An i.o.) = 0 or 1. Applying (8.1) to Example 8.2 gives P (limn→∞ Sn exists) = 0 or 1. The next
result will help us prove the probability is 1 in certain situations.
(8.2) Kolmogorov's maximal inequality. Suppose X_1, ..., X_n are independent with EX_i = 0 and var(X_i) < ∞. If S_n = X_1 + · · · + X_n then

    P( max_{1≤k≤n} |S_k| ≥ x ) ≤ x^{−2} var(S_n)

Remark. Under the same hypotheses, Chebyshev's inequality (3.4) gives only

    P(|S_n| ≥ x) ≤ x^{−2} var(S_n)

Proof  Let A_k = {|S_k| ≥ x but |S_j| < x for j < k}, i.e., we break things down according to the time that |S_k| first exceeds x. Since the A_k are disjoint and (S_n − S_k)^2 ≥ 0,

    ES_n^2 ≥ Σ_{k=1}^n ∫_{A_k} S_n^2 dP = Σ_{k=1}^n ∫_{A_k} S_k^2 + 2S_k(S_n − S_k) + (S_n − S_k)^2 dP
           ≥ Σ_{k=1}^n ∫_{A_k} S_k^2 dP + Σ_{k=1}^n ∫ 2S_k 1_{A_k} · (S_n − S_k) dP

S_k 1_{A_k} ∈ σ(X_1, ..., X_k) and S_n − S_k ∈ σ(X_{k+1}, ..., X_n) are independent by (4.5), so using (4.8) and E(S_n − S_k) = 0 shows

    ∫ 2S_k 1_{A_k} · (S_n − S_k) dP = E(2S_k 1_{A_k}) · E(S_n − S_k) = 0

Using now the fact that |S_k| ≥ x on A_k and the A_k are disjoint,

    ES_n^2 ≥ Σ_{k=1}^n ∫_{A_k} S_k^2 dP ≥ Σ_{k=1}^n x^2 P(A_k) = x^2 P( max_{1≤k≤n} |S_k| ≥ x )

Exercise 8.1. Suppose X_1, X_2, ... are i.i.d. with EX_i = 0, var(X_i) = C < ∞.
Use (8.2) with n = m^α, where α(2p − 1) > 1, to conclude that if S_n = X_1 + · · · + X_n and p > 1/2 then S_n/n^p → 0 almost surely.
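A Monte Carlo sketch of (8.2) with standard normal increments (an arbitrary choice with EX_i = 0 and var(X_i) = 1, so var(S_n) = n):

```python
import random

random.seed(13)

# Estimate P(max_{k<=n} |S_k| >= x) for a random walk with N(0,1) steps and
# compare it with the bound var(S_n)/x^2 from (8.2).
n, x, paths = 100, 25.0, 2_000
hits = 0
for _ in range(paths):
    s = peak = 0.0
    for _ in range(n):
        s += random.gauss(0, 1)
        peak = max(peak, abs(s))
    hits += peak >= x

estimate, bound = hits / paths, n / x**2
print(estimate, bound)
```

Here x is 2.5 standard deviations of S_n, so the true probability is a few percent while the Chebyshev-type bound is 0.16; the inequality is far from sharp but costs nothing to verify.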
We turn now to our results on convergence of series. To state them, we need a definition. We say that Σ_{n=1}^∞ a_n converges if lim_{N→∞} Σ_{n=1}^N a_n exists.

(8.3) Theorem. Suppose X_1, X_2, ... are independent with EX_n = 0. If Σ_{n=1}^∞ var(X_n) < ∞ then with probability one Σ_{n=1}^∞ X_n(ω) converges.

Proof  Let S_N = Σ_{n=1}^N X_n. From (8.2), we get

    P( max_{M≤m≤N} |S_m − S_M| > ε ) ≤ ε^{−2} var(S_N − S_M) = ε^{−2} Σ_{n=M+1}^N var(X_n)

Letting N → ∞ in the last result, we get

    P( sup_{m≥M} |S_m − S_M| > ε ) ≤ ε^{−2} Σ_{n=M+1}^∞ var(X_n) → 0  as M → ∞

If we let w_M = sup_{m,n≥M} |S_m − S_n| then w_M ↓ as M ↑ and

    P(w_M > 2ε) ≤ P( sup_{m≥M} |S_m − S_M| > ε ) → 0

as M → ∞, so w_M ↓ 0 almost surely. But w_M(ω) ↓ 0 implies S_n(ω) is a Cauchy sequence and hence lim_{n→∞} S_n(ω) exists, so we have proved (8.3).
Example 8.3. Let X_1, X_2, ... be independent with

    P(X_n = n^{−α}) = P(X_n = −n^{−α}) = 1/2

EX_n = 0 and var(X_n) = n^{−2α}, so if α > 1/2 it follows from (8.3) that Σ X_n converges. (8.4) shows that α > 1/2 is also necessary for this conclusion. Notice that there is absolute convergence, i.e., Σ |X_n| < ∞, if and only if α > 1.
(8.3) is sufficient for all of our applications, but our treatment would not be complete if we did not mention the last word on convergence of random series.

(8.4) Kolmogorov's three-series theorem. Let X_1, X_2, ... be independent. Let A > 0 and let Y_i = X_i 1_{(|X_i| ≤ A)}. In order that Σ_{n=1}^∞ X_n converges a.s., it is necessary and sufficient that

    (i) Σ_{n=1}^∞ P(|X_n| > A) < ∞,  (ii) Σ_{n=1}^∞ EY_n converges, and  (iii) Σ_{n=1}^∞ var(Y_n) < ∞

Proof  We will prove the necessity in Example 4.7 of Chapter 2 as an application of the central limit theorem. To prove the sufficiency, let µ_n = EY_n. (iii) and (8.3) imply that Σ_{n=1}^∞ (Y_n − µ_n) converges a.s. Using (ii) now gives that Σ_{n=1}^∞ Y_n converges a.s. (i) and the Borel-Cantelli lemma imply P(X_n ≠ Y_n i.o.) = 0, so Σ_{n=1}^∞ X_n converges a.s.
The link between convergence of series and the strong law of large numbers
is provided by
(8.5) Kronecker's lemma. If a_n ↑ ∞ and Σ_{n=1}^∞ x_n/a_n converges then

    a_n^{−1} Σ_{m=1}^n x_m → 0

Proof  Let a_0 = 0, b_0 = 0, and for m ≥ 1, let b_m = Σ_{k=1}^m x_k/a_k. Then x_m = a_m(b_m − b_{m−1}) and so

    a_n^{−1} Σ_{m=1}^n x_m = a_n^{−1} Σ_{m=1}^n (a_m b_m − a_m b_{m−1})
        = a_n^{−1} ( a_n b_n + Σ_{m=2}^n a_{m−1} b_{m−1} − Σ_{m=1}^n a_m b_{m−1} )
        = b_n − Σ_{m=1}^n ((a_m − a_{m−1})/a_n) b_{m−1}

(Recall a_0 = 0.) By hypothesis, b_n → b_∞ as n → ∞. Since a_m − a_{m−1} ≥ 0, the last sum is an average of b_0, ..., b_{n−1}. Intuitively, if ε > 0 and M < ∞ are fixed and n is large, the average assigns mass ≥ 1 − ε to the b_m with m ≥ M, so

    Σ_{m=1}^n ((a_m − a_{m−1})/a_n) b_{m−1} → b_∞

To argue formally, let B = sup |b_n|, pick M so that |b_m − b_∞| < ε/2 for m ≥ M, then pick N so that a_M/a_n < ε/4B for n ≥ N. Now if n ≥ N, we have

    | Σ_{m=1}^n ((a_m − a_{m−1})/a_n) b_{m−1} − b_∞ | ≤ Σ_{m=1}^n ((a_m − a_{m−1})/a_n) |b_{m−1} − b_∞|
        ≤ (a_M/a_n) · 2B + ((a_n − a_M)/a_n) · ε/2 < ε

proving the desired result since ε is arbitrary.

(8.6) The strong law of large numbers. Let X_1, X_2, ... be i.i.d. random variables with E|X_i| < ∞. Let EX_i = µ and S_n = X_1 + ... + X_n. Then S_n/n → µ a.s. as n → ∞.
Proof  Let Y_k = X_k 1_{(|X_k| ≤ k)} and T_n = Y_1 + · · · + Y_n. By (a) in the proof of (7.1) it suffices to show that T_n/n → µ. Let Z_k = Y_k − EY_k and note that (3.10b) and (b) in the proof of (7.1) imply

    Σ_{k=1}^∞ var(Z_k)/k^2 ≤ Σ_{k=1}^∞ EY_k^2/k^2 < ∞

Applying (8.3) now, we conclude that Σ_{k=1}^∞ Z_k/k converges a.s., so (8.5) implies

    n^{−1} Σ_{k=1}^n (Y_k − EY_k) → 0  and hence  T_n/n − n^{−1} Σ_{k=1}^n EY_k → 0  a.s.

The dominated convergence theorem implies EY_k → µ as k → ∞. From this, it follows easily that n^{−1} Σ_{k=1}^n EY_k → µ and hence T_n/n → µ.
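Kronecker's lemma is the engine of this proof. Here is a fully deterministic numeric check of (8.5) with a_n = n and x_n = (−1)^{n+1}, an illustrative choice:

```python
# Kronecker's lemma (8.5) check: with a_n = n and x_n = (-1)^(n+1),
# sum x_n/a_n is the alternating harmonic series (converges to log 2),
# so (1/n)(x_1 + ... + x_n) must tend to 0 -- and indeed the partial sums
# of x_n alternate between 0 and 1, so the average is at most 1/n.
n = 100_000
series = sum((-1) ** (m + 1) / m for m in range(1, n + 1))  # ~ log 2
cesaro = sum((-1) ** (m + 1) for m in range(1, n + 1)) / n  # -> 0
print(round(series, 4), cesaro)
```

The same mechanism is at work above: Σ Z_k/k converging forces the Cesàro-type average n^{−1} Σ Z_k to vanish.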
Rates of convergence. As mentioned earlier, one of the advantages of the random series proof is that it provides estimates on the rate of convergence of S_n/n → µ. By subtracting µ from each random variable, we can and will suppose without loss of generality that µ = 0.

(8.7) Theorem. Let X_1, X_2, ... be i.i.d. random variables with EX_i = 0 and EX_i^2 = σ^2 < ∞. Let S_n = X_1 + ... + X_n. If ε > 0 then

    S_n/(n^{1/2} (log n)^{1/2+ε}) → 0  a.s.

Remark. Kolmogorov's test, (9.6) in Chapter 7, will show that

    lim sup_{n→∞} S_n/(n^{1/2} (log log n)^{1/2}) = σ√2  a.s.

so the last result is not far from the best possible.

Proof  Let a_n = n^{1/2} (log n)^{1/2+ε} for n ≥ 2 and a_1 > 0.

    Σ_{n=1}^∞ var(X_n/a_n) = σ^2 ( 1/a_1^2 + Σ_{n=2}^∞ 1/(n (log n)^{1+2ε}) ) < ∞

so applying (8.3) we get Σ_{n=1}^∞ X_n/a_n converges a.s., and the indicated result follows from (8.5).

The next result, due to Marcinkiewicz and Zygmund, treats the situation in
(8.8) Theorem. Let X1 , X2 , . . . be i.i.d. with EX1 = 0 and E X1 p < ∞ where
1 < p < 2. If Sn = X1 + . . . + Xn then Sn /n1/p → 0 a.s.
Proof Let Yk = Xk 1(Xk ≤k1/p ) and Tn = Y1 + · · · + Yn .
∞ ∞ P (  X k p > k ) ≤ E X k p < ∞ P ( Y k = Xk ) =
k=1 k=1 so the BorelCantelli lemma implies P (Yk = Xk i.o.) = 0, and it suﬃces to show
Tn /n1/p → 0. Using (3.10b), (5.7) with p = 2, P (Ym  > y ) ≤ P (X1  > y ), and
Fubini’s theorem (everything is ≥ 0) we have
∞ ∞ var(Ym /m1/p ) ≤
m=1 2
EYm /m2/p
m=1
∞m n1/p ≤
1/p
m=1 n=1 (n−1)
1/p
∞
∞
n =
n=1 (n−1)1/p 2y
P (X1  > y ) dy
m2/p 2y
P (X1  > y ) dy
m2/p
m=n 65 66 Chapter 1 Laws of Large Numbers
To bound the integral, we note that for n ≥ 2 comparing the sum with the
integral of x−2/p
∞ p
(n − 1)(p−2)/p ≤ Cy p−2
2−p m−2/p ≤
m=n when y ∈ [(n − 1)1/p , n1/p ]. Since E Xi p =
follows that ∞
0 pxp−1 P (Xi  > x) dx < ∞, it ∞ var(Ym /m1/p ) < ∞
m=1 If we let µm = EYm and apply (8.3) and (8.5) it follows that
n n−1/p ( Ym − µ m ) → 0 a.s. m=1 To estimate µm , we note that since EXm = 0, µm = −E (Xi ; Xi  > m1/p ), so
µm  ≤ E (X ; Xi  > m1/p ) = m1/p E (X /m1/p ; Xi  > m1/p ) dx
≤ m1/p E ((X /m1/p )p ; Xi  > m1/p ) dx
≤ m−1+1/p p−1 E (Xi p ; Xi  > m1/p )
Now
n−1/p n
−1+1/p
m=1 m
n
m=1 µm → 0 ≤ Cn1/p and E (Xi p ; Xi  > m1/p ) → 0 as m → ∞, so
and the desired result follows. Exercise 8.2. The converse of the last result is much easier. Let p > 0. If
Sn /n1/p → 0 a.s. then E X1 p < ∞.
Infinite Mean. The St. Petersburg game, discussed in Example 5.7 and Exercise 6.17, is a situation in which EX_i = ∞, S_n/(n log_2 n) → 1 in probability but

    lim sup_{n→∞} S_n/(n log_2 n) = ∞  a.s.

The next result, due to Feller (1946), shows that when E|X_1| = ∞, S_n/a_n cannot converge almost surely to a nonzero limit. In (6.7), we considered the special case a_n = n.

(8.9) Theorem. Let X_1, X_2, ... be i.i.d. with E|X_1| = ∞ and let S_n = X_1 + · · · + X_n. Let a_n be a sequence of positive numbers with a_n/n increasing. Then lim sup_{n→∞} |S_n|/a_n = 0 or ∞ according as Σ_n P(|X_1| ≥ a_n) < ∞ or = ∞.
Proof  Since a_n/n ↑, a_{kn} ≥ ka_n for any integer k. Using this and a_n ↑,

    Σ_{n=1}^∞ P(|X_1| ≥ ka_n) ≥ Σ_{n=1}^∞ P(|X_1| ≥ a_{kn}) ≥ (1/k) Σ_{m=k}^∞ P(|X_1| ≥ a_m)

The last observation shows that if the sum is infinite, lim sup_{n→∞} |X_n|/a_n = ∞. Since max{|S_{n−1}|, |S_n|} ≥ |X_n|/2, it follows that lim sup_{n→∞} |S_n|/a_n = ∞.

To prove the other half, we begin with the identity

    (∗)  Σ_{m=1}^∞ m P(a_{m−1} ≤ |X_i| < a_m) = Σ_{n=1}^∞ P(|X_i| ≥ a_{n−1})

To see this, write m = Σ_{n=1}^m 1 and then use Fubini's theorem. We now let Y_n = X_n 1_{(|X_n| < a_n)} and T_n = Y_1 + ... + Y_n. When the sum is finite, P(Y_n ≠ X_n i.o.) = 0, and it suffices to investigate the behavior of the T_n. To do this, we let a_0 = 0 and compute

    Σ_{n=1}^∞ var(Y_n/a_n) ≤ Σ_{n=1}^∞ EY_n^2/a_n^2 = Σ_{n=1}^∞ a_n^{−2} Σ_{m=1}^n ∫_{[a_{m−1},a_m)} y^2 dF(y)
        = Σ_{m=1}^∞ ∫_{[a_{m−1},a_m)} y^2 dF(y) Σ_{n=m}^∞ a_n^{−2}

Since a_n ≥ na_m/m, we have Σ_{n=m}^∞ a_n^{−2} ≤ (m^2/a_m^2) Σ_{n=m}^∞ n^{−2} ≤ Cm a_m^{−2}, so

    Σ_{n=1}^∞ var(Y_n/a_n) ≤ C Σ_{m=1}^∞ m ∫_{[a_{m−1},a_m)} dF(y)

Using (∗) now, we conclude Σ_{n=1}^∞ var(Y_n/a_n) < ∞.

The last step is to show ET_n/a_n → 0. To begin, we note that if E|X_i| = ∞, Σ_{n=1}^∞ P(|X_i| > a_n) < ∞, and a_n/n ↑, we must have a_n/n ↑ ∞. To estimate ET_n/a_n now, we observe that

    a_n^{−1} Σ_{m=1}^n EY_m ≤ a_n^{−1} Σ_{m=1}^n E(|X_m|; |X_m| < a_m)
        ≤ (n a_N)/a_n + (n/a_n) E(|X_i|; a_N ≤ |X_i| < a_n)

where the last inequality holds for any fixed N. Since a_n/n → ∞, the first term converges to 0. Since m/a_m ↓, the second is

    ≤ Σ_{m=N+1}^n (m/a_m) E(|X_i|; a_{m−1} ≤ |X_i| < a_m) ≤ Σ_{m=N+1}^∞ m P(a_{m−1} ≤ |X_i| < a_m)

(∗) shows that the sum is finite, so it is small if N is large and the desired result follows.
Exercises

8.3. Let X_1, X_2, ... be i.i.d. standard normals. Show that for any t

    Σ_{n=1}^∞ X_n · sin(nπt)/n  converges a.s.

We will see this series again at the end of Section 7.1.
8.4. Let X_1, X_2, ... be independent with EX_n = 0, var(X_n) = σ_n^2. (i) Show that if Σ_n σ_n^2/n^2 < ∞ then Σ_n X_n/n converges a.s. and hence n^{−1} Σ_{m=1}^n X_m → 0 a.s. (ii) Suppose Σ σ_n^2/n^2 = ∞ and without loss of generality that σ_n^2 ≤ n^2 for all n. Show that there are independent random variables X_n with EX_n = 0 and var(X_n) ≤ σ_n^2 so that X_n/n, and hence n^{−1} Σ_{m≤n} X_m, does not converge to 0 a.s.

8.5. Let X_n ≥ 0 be independent for n ≥ 1. The following are equivalent:
(i) Σ_{n=1}^∞ X_n < ∞ a.s., (ii) Σ_{n=1}^∞ [P(X_n > 1) + E(X_n 1_{(X_n ≤ 1)})] < ∞,
(iii) Σ_{n=1}^∞ E(X_n/(1 + X_n)) < ∞.
8.6. Let ψ (x) = x2 when x ≤ 1 and = x when x ≥ 1. Show that if X1 , X2 , . . .
∞
∞
are independent with EXn = 0 and n=1 Eψ (Xn ) < ∞ then n=1 Xn converges a.s.
8.7. Suppose ∞ E Xn p(n) < ∞ where 0 < p(n) ≤ 2 for all n and EXn = 0
n=1
∞
when p(n) > 1. Show that n=1 Xn converges a.s.
8.8. Let X1 , X2 , . . . be i.i.d. and not ≡ 0. Then the radius of convergence of the
power series n≥1 Xn (ω )z n (i.e., r(ω ) = sup{c :
Xn (ω )cn < ∞}) is 1 a.s. or
+
0 a.s., according as E log X1  < ∞ or = ∞ where log+ x = max(log x, 0).
8.9. Let X1 , X2 , . . . be independent and let Sm,n = Xm+1 + . . . + Xn . Then
() P max Sm,j  > 2a m<j ≤n min P (Sk,n  ≤ a) ≤ P (Sm,n  > a) m<k≤n 1.9 Large Deviations 8.10. Use ( ) to prove a theorem of P. L´vy: Let X1 , X2 , . . . be independent
e
and let Sn = X1 + . . . + Xn . If limn→∞ Sn exists in probability then it also
exists a.s.
8.11. Let X1 , X2 , . . . be i.i.d. and Sn = X1 + . . . + Xn . Use ( ) to conclude that
if Sn /n → 0 in probability then (max1≤m≤n Sm )/n → 0 in probability.
8.12. Let X1 , X2 , . . . be i.i.d. and Sn = X1 + . . . + Xn . Suppose an ↑ ∞ and
a(2n )/a(2n−1 ) is bounded. (i) Use ( ) to show that if Sn /a(n) → 0 in probability and S2n /a(2n ) → 0 a.s. then Sn /a(n) → 0 a.s. (ii) Suppose in addition that
2
EX1 = 0 and EX1 < ∞. Use the previous exercise and Chebyshev’s inequality
to conclude that Sn /n1/2 (log2 n)1/2+ → 0 a.s. *1.9. Large Deviations
Let X_1, X_2, . . . be i.i.d. and let S_n = X_1 + · · · + X_n. In this section, we will investigate the rate at which P(S_n > na) → 0 for a > µ = EX_i. We will ultimately conclude that if the moment generating function ϕ(θ) = E exp(θX_i) < ∞ for some θ > 0, then P(S_n ≥ na) → 0 exponentially rapidly, and we will identify

γ(a) = lim_{n→∞} (1/n) log P(S_n ≥ na)

Our first step is to prove that the limit exists. This is based on an observation that will be useful several times below. Let π_n = P(S_n ≥ na). Then

π_{m+n} ≥ P(S_m ≥ ma, S_{n+m} − S_m ≥ na) = π_m π_n

since S_m and S_{n+m} − S_m are independent. Letting γ_n = log π_n transforms multiplication into addition.
(9.1) Lemma. If γ_{m+n} ≥ γ_m + γ_n then, as n → ∞, γ_n/n → sup_m γ_m/m.

Proof  Clearly, lim sup γ_n/n ≤ sup γ_m/m. To complete the proof, it suffices to prove that for any m, lim inf γ_n/n ≥ γ_m/m. Writing n = km + ℓ with 0 ≤ ℓ < m and making repeated use of the hypothesis gives γ_n ≥ kγ_m + γ_ℓ. Dividing by n = km + ℓ gives

γ_n/n ≥ (km/(km + ℓ)) · (γ_m/m) + γ_ℓ/n

Letting n → ∞ and recalling n = km + ℓ with 0 ≤ ℓ < m gives the desired result.
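The superadditivity γ_{m+n} ≥ γ_m + γ_n, and the climb of γ_n/n toward its supremum, can be watched numerically for fair ±1 coin flips, where π_n = P(S_n ≥ na) is an exact binomial tail. A minimal sketch (the choice a = 1/2 and the ranges of n are arbitrary illustrations, not part of the text):

```python
import math

def gamma_n(n, a=0.5):
    # gamma_n = log P(S_n >= n*a) for a sum of n fair +/-1 flips:
    # S_n >= n*a  <=>  number of heads >= n*(1+a)/2, an exact binomial tail.
    k_min = math.ceil(n * (1 + a) / 2)
    tail = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return math.log(tail) - n * math.log(2)

g = {n: gamma_n(n) for n in range(1, 201)}

# superadditivity: gamma_{m+n} >= gamma_m + gamma_n
for m, n in [(3, 5), (10, 17), (40, 60), (75, 125)]:
    assert g[m + n] >= g[m] + g[n]

# hence gamma_{2n}/(2n) >= gamma_n/n: the normalized sequence improves
# along doubling, consistent with gamma_n/n -> sup_m gamma_m/m in (9.1)
for n in (5, 10, 25, 50, 100):
    assert g[2 * n] / (2 * n) >= g[n] / n
```

The inequalities hold exactly (not just asymptotically) because the binomial tails are computed with integer arithmetic before taking logarithms.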
(9.1) implies that γ(a) = lim_{n→∞} (1/n) log P(S_n ≥ na) exists and is ≤ 0. It follows from the formula for the limit that

(9.2)  P(S_n ≥ na) ≤ e^{nγ(a)}

The last two observations give us some useful information about γ(a).
Exercise 9.1. The following are equivalent: (a) γ(a) = −∞, (b) P(X_1 ≥ a) = 0, and (c) P(S_n ≥ na) = 0 for all n.

Exercise 9.2. Use the definition to conclude that if λ ∈ [0, 1] is rational then γ(λa + (1 − λ)b) ≥ λγ(a) + (1 − λ)γ(b). Use monotonicity to conclude that the last relationship holds for all λ ∈ [0, 1], so γ is concave and hence Lipschitz continuous on compact subsets of {a : γ(a) > −∞}.
The conclusions above are valid for any distribution. For the rest of this section, we will suppose:

(H1)  ϕ(θ) = E exp(θX_i) < ∞ for some θ > 0

Let θ_+ = sup{θ : ϕ(θ) < ∞}, θ_− = inf{θ : ϕ(θ) < ∞}, and note that ϕ(θ) < ∞ for θ ∈ (θ_−, θ_+). (H1) implies that EX_i^+ < ∞, so µ = EX_i^+ − EX_i^− ∈ [−∞, ∞). If θ > 0, Chebyshev's inequality implies

e^{θna} P(S_n ≥ na) ≤ E exp(θS_n) = ϕ(θ)^n

or, letting κ(θ) = log ϕ(θ),

(9.3)  P(S_n ≥ na) ≤ exp(−n{aθ − κ(θ)})

Our first goal is to show:
(9.4) Lemma. If a > µ and θ > 0 is small then aθ − κ(θ) > 0.

Proof  κ(0) = log ϕ(0) = 0, so it suffices to show that (i) κ is continuous at 0, (ii) κ is differentiable on (0, θ_+), and (iii) κ′(θ) → µ as θ → 0. For then

aθ − κ(θ) = ∫_0^θ (a − κ′(x)) dx > 0

for small θ. The first step is to show that the derivative exists. Let F(x) = P(X_i ≤ x). Since

|e^{hx} − 1| = |∫_0^{hx} e^y dy| ≤ |hx| (e^{hx} + 1)

(the two terms are to cover the cases h > 0 and h < 0), an application of the dominated convergence theorem shows that

ϕ′(θ) = lim_{h→0} (ϕ(θ + h) − ϕ(θ))/h = lim_{h→0} ∫ ((e^{hx} − 1)/h) e^{θx} dF(x)
 = ∫ x e^{θx} dF(x)   for θ ∈ (θ_−, θ_+)

From the last equation, it follows that κ(θ) = log ϕ(θ) has κ′(θ) = ϕ′(θ)/ϕ(θ). To take the limit as θ ↓ 0, we note that using the monotone convergence theorem for x ≤ 0 and the dominated convergence theorem for x ≥ 0 shows that ϕ′(θ) → µ as θ ↓ 0. A similar but simpler argument shows that ϕ(θ) → 1 as θ ↓ 0, so we have shown (i)–(iii) and proved (9.4).
Having found an upper bound on P(S_n ≥ na), it is natural to optimize it by finding the maximum of θa − κ(θ):

d/dθ {θa − log ϕ(θ)} = a − ϕ′(θ)/ϕ(θ)

so (assuming things are nice) the maximum occurs when a = ϕ′(θ)/ϕ(θ). To turn the parenthetical clause into a mathematical hypothesis, we begin by defining

F_θ(x) = (1/ϕ(θ)) ∫_{−∞}^x e^{θy} dF(y)

whenever ϕ(θ) < ∞. It follows from the proof of (9.4) that if θ ∈ (θ_−, θ_+), F_θ is a distribution function with mean

∫ x dF_θ(x) = (1/ϕ(θ)) ∫_{−∞}^∞ x e^{θx} dF(x) = ϕ′(θ)/ϕ(θ)

Repeating the proof in (9.4), it is easy to see that if θ ∈ (θ_−, θ_+) then

ϕ″(θ) = ∫_{−∞}^∞ x² e^{θx} dF(x)

So we have

d/dθ (ϕ′(θ)/ϕ(θ)) = ϕ″(θ)/ϕ(θ) − (ϕ′(θ)/ϕ(θ))² = ∫ x² dF_θ(x) − ( ∫ x dF_θ(x) )² ≥ 0

since the last expression is the variance of F_θ. If we assume

(H2)  the distribution F is not a point mass at µ

then ϕ′(θ)/ϕ(θ) is strictly increasing and aθ − log ϕ(θ) is concave. Since ϕ′(0)/ϕ(0) = µ, this shows that for each a > µ there is at most one θ_a ≥ 0 that solves a = ϕ′(θ_a)/ϕ(θ_a), and this value of θ maximizes aθ − log ϕ(θ). Before discussing the existence of θ_a we will consider some examples.
Example 9.1. Normal distribution.

∫ e^{θx} (2π)^{−1/2} exp(−x²/2) dx = exp(θ²/2) ∫ (2π)^{−1/2} exp(−(x − θ)²/2) dx

The integrand in the last integral is the density of a normal distribution with mean θ and variance 1, so ϕ(θ) = exp(θ²/2) for θ ∈ (−∞, ∞). In this case, ϕ′(θ)/ϕ(θ) = θ and F_θ is a normal distribution with mean θ and variance 1.

Example 9.2. Exponential distribution with parameter λ. If θ < λ,

ϕ(θ) = ∫_0^∞ e^{θx} λ e^{−λx} dx = λ/(λ − θ)

ϕ′(θ)/ϕ(θ) = 1/(λ − θ), and F_θ is an exponential distribution with parameter λ − θ and hence mean 1/(λ − θ).

Example 9.3. Coin flips. P(X_i = 1) = P(X_i = −1) = 1/2, so

ϕ(θ) = (e^θ + e^{−θ})/2
ϕ′(θ)/ϕ(θ) = (e^θ − e^{−θ})/(e^θ + e^{−θ})

F_θ({1}) = e^θ/(e^θ + e^{−θ}) and F_θ({−1}) = e^{−θ}/(e^θ + e^{−θ}).

Example 9.4. Perverted exponential. Let g(x) = C x^{−3} e^{−x} for x ≥ 1, g(x) = 0 otherwise, and choose C so that g is a probability density. In this case,

ϕ(θ) = ∫ e^{θx} g(x) dx < ∞ if and only if θ ≤ 1

and when θ ≤ 1, we have

ϕ′(θ)/ϕ(θ) ≤ ϕ′(1)/ϕ(1) = ( ∫_1^∞ C x^{−2} dx ) / ( ∫_1^∞ C x^{−3} dx ) = 2
Recall θ_+ = sup{θ : ϕ(θ) < ∞}. In Examples 9.1 and 9.2, we have ϕ′(θ)/ϕ(θ) ↑ ∞ as θ ↑ θ_+, so we can solve a = ϕ′(θ)/ϕ(θ) for any a > µ. In Example 9.3, ϕ′(θ)/ϕ(θ) ↑ 1 as θ → ∞, but we cannot hope for much more since F, and hence F_θ, is supported on {−1, 1}.

Exercise 9.3. Let x_o = sup{x : F(x) < 1}. Show that if x_o < ∞ then ϕ(θ) < ∞ for all θ > 0 and ϕ′(θ)/ϕ(θ) → x_o as θ ↑ ∞.

Example 9.4 presents a problem since we cannot solve a = ϕ′(θ)/ϕ(θ) when a > 2. (9.6) will cover this problem case, but first we will treat the cases in which we can solve the equation.
(9.5) Theorem. Suppose in addition to (H1) and (H2) that there is a θ_a ∈ (0, θ_+) so that a = ϕ′(θ_a)/ϕ(θ_a). Then, as n → ∞,

n^{−1} log P(S_n ≥ na) → −aθ_a + log ϕ(θ_a)

Proof  The fact that the lim sup of the left-hand side is ≤ the right-hand side follows from (9.3). To prove the other inequality, pick λ ∈ (θ_a, θ_+), let X_1^λ, X_2^λ, . . . be i.i.d. with distribution F_λ, and let S_n^λ = X_1^λ + · · · + X_n^λ. Writing dF/dF_λ for the Radon-Nikodym derivative of the associated measures, it is immediate from the definition that dF/dF_λ = e^{−λx} ϕ(λ). If we let F^n_λ and F^n denote the distributions of S_n^λ and S_n, then

Lemma. dF^n/dF^n_λ = e^{−λx} ϕ(λ)^n.

Proof  We will prove this by induction. The result holds when n = 1. For n > 1, we note that

F^n(z) = F^{n−1} ∗ F(z) = ∫_{−∞}^∞ dF^{n−1}(x) ∫_{−∞}^{z−x} dF(y)
 = ∫∫ 1_{(x+y≤z)} e^{−λ(x+y)} ϕ(λ)^n dF^{n−1}_λ(x) dF_λ(y)
 = E( 1_{(S^λ_{n−1}+X^λ_n ≤ z)} e^{−λ(S^λ_{n−1}+X^λ_n)} ) ϕ(λ)^n
 = ∫_{−∞}^z e^{−λu} ϕ(λ)^n dF^n_λ(u)

where in the last two equalities we have used (3.9) for (S^λ_{n−1}, X^λ_n) and S^λ_n.

If ν > a, then the lemma and monotonicity imply

(∗)  P(S_n ≥ na) ≥ ∫_{na}^{nν} e^{−λx} ϕ(λ)^n dF^n_λ(x) ≥ ϕ(λ)^n e^{−λnν} (F^n_λ(nν) − F^n_λ(na))

F_λ has mean ϕ′(λ)/ϕ(λ), so if we have a < ϕ′(λ)/ϕ(λ) < ν, then the weak law of large numbers implies

F^n_λ(nν) − F^n_λ(na) → 1   as n → ∞

From the last conclusion and (∗) it follows that

lim inf_{n→∞} n^{−1} log P(S_n > na) ≥ −λν + log ϕ(λ)

Since λ > θ_a and ν > a are arbitrary, the proof is complete.
To get a feel for what the answers look like, we consider our examples. To prepare for the computations, we recall some important information:

κ(θ) = log ϕ(θ)    κ′(θ) = ϕ′(θ)/ϕ(θ)    θ_a solves κ′(θ_a) = a
γ(a) = lim_{n→∞} (1/n) log P(S_n ≥ na) = −aθ_a + κ(θ_a)

Normal distribution (Example 9.1)

κ(θ) = θ²/2    κ′(θ) = θ    θ_a = a    γ(a) = −aθ_a + κ(θ_a) = −a²/2

Exercise 9.4. Check the last result by observing that S_n has a normal distribution with mean 0 and variance n, and then using (1.4).

Exponential distribution (Example 9.2) with λ = 1

κ(θ) = −log(1 − θ)    κ′(θ) = 1/(1 − θ)    θ_a = 1 − 1/a    γ(a) = −aθ_a + κ(θ_a) = −a + 1 + log a
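Both closed forms can be checked directly, since in these two examples P(S_n ≥ na) is available exactly: for the normal, S_n/√n is standard normal, and for the exponential with λ = 1, S_n has a Gamma(n, 1) distribution. A numerical sketch (the values a = 1 and a = 2 and the choices of n are arbitrary assumptions for illustration):

```python
import math

def log_tail_normal(n, a):
    # P(S_n >= n*a) = P(N(0,1) >= a*sqrt(n)) = erfc(a*sqrt(n)/sqrt(2))/2
    return math.log(0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2)))

def log_tail_gamma(n, t):
    # For Exponential(1) summands, S_n ~ Gamma(n,1) and
    # P(S_n >= t) = e^{-t} * sum_{k=0}^{n-1} t^k / k!; summed in log space.
    logs = [-t + k * math.log(t) - math.lgamma(k + 1) for k in range(n)]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

# Normal with a = 1: gamma(a) = -1/2.  Exponential with a = 2: -2 + 1 + log 2.
for n in (100, 400):
    print(n, log_tail_normal(n, 1.0) / n, log_tail_gamma(n, 2.0 * n) / n)

assert abs(log_tail_normal(400, 1.0) / 400 - (-0.5)) < 0.02
assert abs(log_tail_gamma(400, 800.0) / 400 - (-2 + 1 + math.log(2))) < 0.02
```

The n^{−1} log-probabilities approach the rate functions with an error of order (log n)/n, reflecting the polynomial prefactor hidden by the exponential scale.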
With these two examples as models, the reader should be able to do

Exercise 9.5. Let X_1, X_2, . . . be i.i.d. Poisson with mean 1, and let S_n = X_1 + · · · + X_n. Find lim_{n→∞} (1/n) log P(S_n ≥ na) for a > 1. The answer and another proof can be found in Exercise 1.4 of Chapter 2.

Coin flips (Example 9.3). Here we take a different approach. To find the θ that makes the mean of F_θ equal to a, we set F_θ({1}) = e^θ/(e^θ + e^{−θ}) = (1 + a)/2. Letting x = e^θ gives

2x = (1 + a)(x + x^{−1}),  i.e.,  (a − 1)x² + (1 + a) = 0

So x = √((1 + a)/(1 − a)) and θ_a = log x = {log(1 + a) − log(1 − a)}/2.

ϕ(θ_a) = (e^{θ_a} + e^{−θ_a})/2 = e^{θ_a}/(1 + a) = 1/√((1 + a)(1 − a))

γ(a) = −aθ_a + κ(θ_a) = −{(1 + a) log(1 + a) + (1 − a) log(1 − a)}/2

In Exercise 1.3 of Chapter 2, this result will be proved by a direct computation.
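Since the coin-flip tail is an exact binomial sum, the convergence of (1/n) log P(S_n ≥ na) to this closed form can also be watched numerically; by (9.2), the finite-n values must sit below γ(a). A small sketch (a = 1/2 is an arbitrary choice):

```python
import math

def rate(n, a):
    # (1/n) log P(S_n >= n*a) for fair +/-1 flips, via the exact binomial tail:
    # S_n >= n*a  <=>  number of heads >= n*(1+a)/2
    k_min = math.ceil(n * (1 + a) / 2)
    tail = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return (math.log(tail) - n * math.log(2)) / n

a = 0.5
gamma = -((1 + a) * math.log(1 + a) + (1 - a) * math.log(1 - a)) / 2

for n in (50, 100, 200):
    est = rate(n, a)
    assert est <= gamma          # (9.2): P(S_n >= na) <= exp(n*gamma(a))
    print(n, est, gamma)

assert abs(rate(200, a) - gamma) < 0.03
```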
Since the formula for γ(a) is rather ugly, the following simpler bound is useful.

Exercise 9.6. Show that for coin flips ϕ(θ) ≤ exp(ϕ(θ) − 1) ≤ exp(βθ²) for |θ| ≤ 1, where β = Σ_{n=1}^∞ 1/(2n)! ≈ .586, and use (9.3) to conclude that P(S_n ≥ an) ≤ exp(−na²/4β) for all a ∈ [0, 1]. It is customary to simplify this further by using β ≤ Σ_{n=1}^∞ 2^{−n} = 1.
Turning now to the problematic values for which we cannot solve a = ϕ′(θ_a)/ϕ(θ_a), we begin by observing that if x_o = sup{x : F(x) < 1} < ∞ and F is not a point mass at x_o, then ϕ′(θ)/ϕ(θ) ↑ x_o as θ ↑ ∞ but ϕ′(θ)/ϕ(θ) < x_o for all θ < ∞. However, the result for a = x_o is trivial:

(1/n) log P(S_n ≥ n x_o) = log P(X_i = x_o)   for all n

Exercise 9.7. Show that as a ↑ x_o, γ(a) ↓ log P(X_i = x_o).
When x_o = ∞, ϕ′(θ)/ϕ(θ) ↑ ∞ as θ ↑ ∞, so the only case that remains is covered by

(9.6) Theorem. Suppose x_o = ∞, θ_+ < ∞, and ϕ′(θ)/ϕ(θ) increases to a finite limit a_0 as θ ↑ θ_+. If a_0 ≤ a < ∞,

n^{−1} log P(S_n ≥ na) → −aθ_+ + log ϕ(θ_+)

i.e., γ(a) is linear for a ≥ a_0.

Proof  Since (log ϕ(θ))′ = ϕ′(θ)/ϕ(θ), integrating from 0 to θ_+ shows that log ϕ(θ_+) < ∞. Letting θ = θ_+ in (9.3) shows that the lim sup of the left-hand side is ≤ the right-hand side. To get the other direction we will use the transformed distribution F_λ for λ = θ_+. Letting θ ↑ θ_+ and using the dominated convergence theorem for x ≤ 0 and the monotone convergence theorem for x ≥ 0, we see that F_λ has mean a_0. From (∗) in the proof of (9.5), we see that if a_0 ≤ a < ν = a + 3ε,

P(S_n ≥ na) ≥ ϕ(λ)^n e^{−nλν} (F^n_λ(nν) − F^n_λ(na))
and hence

(1/n) log P(S_n ≥ na) ≥ log ϕ(λ) − λν + (1/n) log P(S^λ_n ∈ (na, nν])

Letting X_1^λ, X_2^λ, . . . be i.i.d. with distribution F_λ and S^λ_n = X_1^λ + · · · + X_n^λ, we have

P(S^λ_n ∈ (na, nν]) ≥ P(S^λ_{n−1} ∈ ((a_0 − ε)n, (a_0 + ε)n]) · P(X^λ_n ∈ ((a − a_0 + ε)n, (a − a_0 + 2ε)n])
 ≥ (1/2) P(X^λ_n ∈ ((a − a_0 + ε)n, (a − a_0 + ε)(n + 1)])

for large n, by the weak law of large numbers. To get a lower bound on the right-hand side of the last equation, we observe that

lim sup_{n→∞} (1/n) log P(X^λ_1 ∈ ((a − a_0 + ε)n, (a − a_0 + ε)(n + 1)]) = 0

for if the lim sup were < 0, we would have E exp(ηX^λ_1) < ∞ for some η > 0 and hence E exp((λ + η)X_1) < ∞, contradicting the definition of λ = θ_+. To finish the argument now, we recall that (9.1) implies that

lim_{n→∞} (1/n) log P(S_n ≥ na) = γ(a)

exists, so our lower bound on the lim sup is good enough.
By adapting the proof of the last result, you can show that (H1) is necessary for exponential convergence:

Exercise 9.8. Suppose EX_i = 0 and E exp(θX_i) = ∞ for all θ > 0. Then

(1/n) log P(S_n ≥ na) → 0   for all a > 0

Exercise 9.9. Suppose EX_i = 0. Show that if ε > 0 then

lim inf_{n→∞} P(S_n ≥ na) / ( n P(X_1 ≥ n(a + ε)) ) ≥ 1

Hint: Let F_n = {X_i ≥ n(a + ε) for exactly one i ≤ n}.

2 Central Limit Theorems

The first four sections of this chapter develop the central limit theorem. The last five treat various extensions and complements. We begin this chapter by considering special cases of these results that can be treated by elementary computations.

2.1. The De Moivre-Laplace Theorem
Let X_1, X_2, . . . be i.i.d. with P(X_1 = 1) = P(X_1 = −1) = 1/2 and let S_n = X_1 + · · · + X_n. In words, we are betting $1 on the flipping of a fair coin and S_n is our winnings at time n. If n and k are integers,

P(S_{2n} = 2k) = C(2n, n+k) 2^{−2n}

(writing C(n, k) for the binomial coefficient), since S_{2n} = 2k if and only if there are n + k flips that are +1 and n − k flips that are −1 in the first 2n. The first factor gives the number of such outcomes and the second the probability of each one. Stirling's formula (see Feller, Vol. I (1968), p. 52) tells us

(1.1)  n! ∼ n^n e^{−n} √(2πn)   as n → ∞

where a_n ∼ b_n means a_n/b_n → 1 as n → ∞, so

C(2n, n+k) = (2n)! / ((n + k)! (n − k)!)
 ∼ (2n)^{2n} / ((n + k)^{n+k} (n − k)^{n−k}) · (2π(2n))^{1/2} / ((2π(n + k))^{1/2} (2π(n − k))^{1/2})

and we have

(1.2)  C(2n, n+k) 2^{−2n} ∼ (1 + k/n)^{−n−k} · (1 − k/n)^{−n+k} · (πn)^{−1/2} · (1 + k/n)^{−1/2} · (1 − k/n)^{−1/2}

The first two terms on the right are

(1 + k/n)^{−n−k} (1 − k/n)^{−n+k} = (1 − k²/n²)^{−n} · (1 + k/n)^{−k} · (1 − k/n)^{k}

A little calculus shows that:
(1.3) Lemma. If c_j → 0, a_j → ∞ and a_j c_j → λ then (1 + c_j)^{a_j} → e^λ.

Proof  As x → 0, log(1 + x)/x → 1, so a_j log(1 + c_j) → λ and the desired result follows.

Exercise 1.1. Generalize the last proof to conclude that if max_{1≤j≤n} |c_{j,n}| → 0, Σ_{j=1}^n c_{j,n} → λ, and sup_n Σ_{j=1}^n |c_{j,n}| < ∞, then Π_{j=1}^n (1 + c_{j,n}) → e^λ.

Using (1.3) now we see that if 2k = x√(2n), i.e., k = x√(n/2), then

(1 − k²/n²)^{−n} = (1 − x²/2n)^{−n} → e^{x²/2}
(1 + k/n)^{−k} = (1 + x/√(2n))^{−x√(n/2)} → e^{−x²/2}
(1 − k/n)^{k} = (1 − x/√(2n))^{x√(n/2)} → e^{−x²/2}

For this choice of k, k/n → 0, so

(1 + k/n)^{−1/2} · (1 − k/n)^{−1/2} → 1

and putting things together gives:

(1.4) Theorem. If 2k/√(2n) → x then P(S_{2n} = 2k) ∼ (πn)^{−1/2} e^{−x²/2}.
Our next step is to compute

P(a√(2n) ≤ S_{2n} ≤ b√(2n)) = Σ_{m ∈ [a√(2n), b√(2n)] ∩ 2Z} P(S_{2n} = m)

Changing variables m = x√(2n), we have that the above is

≈ Σ_{x ∈ [a,b] ∩ (2Z/√(2n))} (2π)^{−1/2} e^{−x²/2} · (2/n)^{1/2}

where 2Z/√(2n) = {2z/√(2n) : z ∈ Z}. We have multiplied and divided by √2 since the space between points in the sum is (2/n)^{1/2}, so if n is large the sum above is

≈ ∫_a^b (2π)^{−1/2} e^{−x²/2} dx

The integrand is the density of the (standard) normal distribution, so changing notation we can write the last quantity as P(a ≤ χ ≤ b) where χ is a random variable with that distribution. It is not hard to fill in the details to get:

(1.5) The De Moivre-Laplace Theorem. If a < b then as m → ∞

P(a ≤ S_m/√m ≤ b) → ∫_a^b (2π)^{−1/2} e^{−x²/2} dx

(To remove the restriction to even integers observe S_{2n+1} = S_{2n} ± 1.) The last result is a special case of the central limit theorem given in Section 2.4, so further details are left to the reader.

Another special case that can be treated with Stirling's formula is
Exercise 1.2. Let X_1, X_2, . . . be independent and have a Poisson distribution with mean 1. Then S_n = X_1 + · · · + X_n has a Poisson distribution with mean n, i.e., P(S_n = k) = e^{−n} n^k/k!. Use Stirling's formula to show that if (k − n)/√n → x then

√(2πn) P(S_n = k) → exp(−x²/2)

As in the case of coin flips it follows that

P(a ≤ (S_n − n)/√n ≤ b) → ∫_a^b (2π)^{−1/2} e^{−x²/2} dx

but proving the last conclusion is not part of the exercise.
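Both local limit statements, (1.4) for coin flips and the Poisson version above, are easy to check numerically because the point probabilities are exact. A sketch (the value of n and the target x ≈ 1 are arbitrary choices; x is recomputed from the rounded k so the comparison is made at an exact lattice point):

```python
import math

n = 400

# Coin flips, (1.4): P(S_{2n} = 2k) ~ (pi*n)^{-1/2} exp(-x^2/2), x = 2k/sqrt(2n)
k = round(math.sqrt(n / 2))            # aim for x near 1
x = 2 * k / math.sqrt(2 * n)
exact = math.comb(2 * n, n + k) / 4 ** n
approx = math.exp(-x * x / 2) / math.sqrt(math.pi * n)
assert abs(exact / approx - 1) < 0.02

# Poisson (Exercise 1.2): sqrt(2*pi*n) P(S_n = k) -> exp(-x^2/2), x = (k-n)/sqrt(n)
k = n + round(math.sqrt(n))            # again x near 1
x = (k - n) / math.sqrt(n)
log_p = -n + k * math.log(n) - math.lgamma(k + 1)
lhs = math.sqrt(2 * math.pi * n) * math.exp(log_p)
assert abs(lhs / math.exp(-x * x / 2) - 1) < 0.05
```

The relative errors shrink like 1/n, which is why modest n already gives agreement to a few percent.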
Stirling's formula can also be used to compute some large deviations probabilities considered in Section 1.9. In the next two exercises, X_1, X_2, . . . are i.i.d. and S_n = X_1 + · · · + X_n. In each case you should begin by considering P(S_n = k) when k/n → a and then relate P(S_n = j + 1) to P(S_n = j) to show P(S_n ≥ k) ≤ C P(S_n = k).

Exercise 1.3. Suppose P(X_i = 1) = P(X_i = −1) = 1/2. Show that if a ∈ (0, 1),

(1/2n) log P(S_{2n} ≥ 2na) → −γ(a)

where γ(a) = ½{(1 + a) log(1 + a) + (1 − a) log(1 − a)}.

Exercise 1.4. Suppose P(X_i = k) = e^{−1}/k! for k = 0, 1, . . . Show that if a > 1,

(1/n) log P(S_n ≥ na) → a − 1 − a log a

2.2. Weak Convergence
In this section, we will define the type of convergence that appears in the central limit theorem and explore some of its properties. A sequence of distribution functions is said to converge weakly to a limit F (written F_n ⇒ F) if F_n(y) → F(y) for all y that are continuity points of F. A sequence of random variables X_n is said to converge weakly or converge in distribution to a limit X_∞ (written X_n ⇒ X_∞) if their distribution functions F_n(x) = P(X_n ≤ x) converge weakly. To see that convergence at continuity points is enough to identify the limit, observe that F is right continuous and, by Exercise 1.8 in Chapter 1, the discontinuities of F are at most a countable set.

a. Examples
Two examples of weak convergence that we have seen earlier are:

Example 2.1. Let X_1, X_2, . . . be i.i.d. with P(X_i = 1) = P(X_i = −1) = 1/2 and let S_n = X_1 + · · · + X_n. Then (1.5) implies

F_n(y) = P(S_n/√n ≤ y) → ∫_{−∞}^y (2π)^{−1/2} e^{−x²/2} dx

Example 2.2. Let X_1, X_2, . . . be i.i.d. with distribution F. The Glivenko-Cantelli theorem ((7.4) in Chapter 1) implies that for almost every ω,

F_n(y) = n^{−1} Σ_{m=1}^n 1_{(X_m(ω) ≤ y)} → F(y)   for all y
In the last two examples convergence occurred for all y, even though in the second case the distribution function could have discontinuities. The next example shows why we restrict our attention to continuity points.

Example 2.3. Let X have distribution F. Then X + 1/n has distribution

F_n(x) = P(X + 1/n ≤ x) = F(x − 1/n)

As n → ∞, F_n(x) → F(x−) = lim_{y↑x} F(y), so convergence only occurs at continuity points.
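The failure at a discontinuity can be made completely concrete by taking F to be the point mass at 0, so that X + 1/n has df 1_{[1/n,∞)}; a tiny sketch:

```python
def F(x):            # df of the point mass at 0
    return 1.0 if x >= 0 else 0.0

def F_n(x, n):       # df of X + 1/n, i.e. F(x - 1/n)
    return F(x - 1 / n)

# At the discontinuity x = 0: F_n(0) = 0 for every n, so the limit is
# F(0-) = 0, not F(0) = 1 -- convergence fails exactly at the jump.
assert all(F_n(0, n) == 0.0 for n in (1, 10, 1000))
assert F(0) == 1.0

# At continuity points the convergence does hold (here it is even exact
# once 1/n is smaller than the distance to the jump).
assert F_n(0.5, 1000) == F(0.5) == 1.0
assert F_n(-0.5, 1000) == F(-0.5) == 0.0
```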
Example 2.4. Waiting for rare events. Let X_p be the number of trials needed to get a success in a sequence of independent trials with success probability p. Then P(X_p ≥ n) = (1 − p)^{n−1} for n = 1, 2, 3, . . . and it follows from (1.3) that as p → 0,

P(pX_p > x) → e^{−x}   for all x ≥ 0

In words, pX_p converges weakly to an exponential distribution.

Example 2.5. Birthday problem. Let X_1, X_2, . . . be independent and uniformly distributed on {1, . . . , N}, and let T_N = min{n : X_n = X_m for some m < n}. Then

P(T_N > n) = Π_{m=2}^n (1 − (m − 1)/N)

When N = 365 this is the probability that two people in a group of size n do not have the same birthday (assuming all birthdays are equally likely). Using Exercise 1.1, it is easy to see that

P(T_N/N^{1/2} > x) → exp(−x²/2)   for all x ≥ 0

Taking N = 365 and noting 22/√365 = 1.1515 and (1.1515)²/2 = 0.6630, this says that

P(T_365 > 22) ≈ e^{−.6630} ≈ .515

This answer is 2% smaller than the true probability .524.
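The two numbers quoted here are easy to reproduce; a quick check of the exact product against the exp(−x²/2) approximation:

```python
import math

N, n = 365, 22

# Exact: P(T_N > n) = prod_{m=2}^{n} (1 - (m-1)/N)
exact = 1.0
for m in range(2, n + 1):
    exact *= 1 - (m - 1) / N

# Approximation from the weak limit: exp(-x^2/2) with x = n / sqrt(N)
x = n / math.sqrt(N)
approx = math.exp(-x * x / 2)

print(exact, approx)        # roughly 0.524 vs 0.515, about 2% apart
assert abs(exact - 0.524) < 1e-3
assert abs(approx - 0.515) < 1e-3
```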
Before giving our sixth example, we need a simple result called Scheffé's Theorem. Suppose we have probability densities f_n, 1 ≤ n ≤ ∞, and f_n → f_∞ pointwise as n → ∞. Then for all Borel sets B

| ∫_B f_n(x) dx − ∫_B f_∞(x) dx | ≤ ∫ |f_n(x) − f_∞(x)| dx = 2 ∫ (f_∞(x) − f_n(x))^+ dx → 0

by the dominated convergence theorem, the equality following from the fact that the f_n ≥ 0 and have integral 1. Writing µ_n for the corresponding measures, we have shown that the total variation norm

‖µ_n − µ_∞‖ ≡ sup_B |µ_n(B) − µ_∞(B)| → 0

a conclusion stronger than weak convergence. (Take B = (−∞, x].) The example µ_n = a point mass at 1/n (with 1/∞ = 0) shows that we may have µ_n ⇒ µ_∞ with ‖µ_n − µ_∞‖ = 1 for all n.
Exercise 2.1. Give an example of random variables Xn with densities fn so
that Xn ⇒ a uniform distribution on (0,1) but fn (x) does not converge to 1 for
any x ∈ [0, 1].
Example 2.6. Central order statistic. Put 2n + 1 points at random in (0,1), i.e., with locations that are independent and uniformly distributed. Let V_{n+1} be the (n + 1)th largest point. It is easy to see that

Lemma. V_{n+1} has density

(2n + 1) C(2n, n) x^n (1 − x)^n

Proof  There are 2n + 1 ways to pick the observation that falls at x, then we have to pick n indices for observations < x, which can be done in C(2n, n) ways. Once we have decided on the indices that will land < x and > x, the probability the corresponding random variables will do what we want is x^n (1 − x)^n, and the probability density that the remaining one will land at x is 1. If you don't like the previous sentence, compute the probability X_1 < x − ε, . . . , X_n < x − ε, x − ε < X_{n+1} < x + ε, X_{n+2} > x + ε, . . . , X_{2n+1} > x + ε, then let ε → 0.

To compute the density function of Y_n = 2(V_{n+1} − 1/2)√(2n), we use Exercise 1.10 in Chapter 1, or simply change variables x = 1/2 + y/(2√(2n)), dx = dy/(2√(2n)), to get

(2n + 1) C(2n, n) (1/2 + y/(2√(2n)))^n (1/2 − y/(2√(2n)))^n · 1/(2√(2n))
 = C(2n, n) 2^{−2n} · (1 − y²/2n)^n · (2n + 1)/(2√(2n))

The first factor is P(S_{2n} = 0) for a simple random walk, so (1.4) and (1.3) imply that

P(Y_n = y) → (2π)^{−1/2} exp(−y²/2)   as n → ∞

Here and in what follows we write P(Y_n = y) for the density function of Y_n. Using Scheffé's theorem now, we conclude that Y_n converges weakly to a standard normal distribution.
Exercise 2.2. Convergence of maxima. Let X_1, X_2, . . . be independent with distribution F, and let M_n = max_{m≤n} X_m. Then P(M_n ≤ x) = F(x)^n. Prove the following limit laws for M_n:

(i) If F(x) = 1 − x^{−α} for x ≥ 1, where α > 0, then for y > 0

P(M_n/n^{1/α} ≤ y) → exp(−y^{−α})

(ii) If F(x) = 1 − |x|^β for −1 ≤ x ≤ 0, where β > 0, then for y < 0

P(n^{1/β} M_n ≤ y) → exp(−|y|^β)

(iii) If F(x) = 1 − e^{−x} for x ≥ 0, then for all y ∈ (−∞, ∞)

P(M_n − log n ≤ y) → exp(−e^{−y})

The limits that appear above are called the extreme value distributions. The last one is called the double exponential or Gumbel distribution. Necessary and sufficient conditions for (M_n − b_n)/a_n to converge to these limits were obtained by Gnedenko (1943). For a recent treatment, see Resnick (1987).
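In case (iii) the convergence is elementary to check numerically, since P(M_n − log n ≤ y) = (1 − e^{−y}/n)^n exactly whenever y + log n ≥ 0. A quick sketch (n = 10^6 and the grid of y values are arbitrary choices):

```python
import math

def exact_cdf(y, n):
    # P(M_n - log n <= y) = F(y + log n)^n = (1 - e^{-y}/n)^n, valid for
    # y + log n >= 0 so that the exponential df is evaluated on [0, inf)
    return (1 - math.exp(-y) / n) ** n

def gumbel(y):
    return math.exp(-math.exp(-y))

n = 10 ** 6
for y in (-1.0, 0.0, 1.0, 2.0):
    assert abs(exact_cdf(y, n) - gumbel(y)) < 1e-4
```

The error is of order e^{−2y}/n, so the Gumbel limit is already accurate to several digits at n = 10^6.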
Exercise 2.3. Let X_1, X_2, . . . be i.i.d. and have the standard normal distribution. (i) From (1.4) in Chapter 1, we know

P(X_i > x) ∼ (2π)^{−1/2} x^{−1} e^{−x²/2}   as x → ∞

Use this to conclude that for any real number θ,

P(X_i > x + (θ/x)) / P(X_i > x) → e^{−θ}

(ii) Show that if we define b_n by P(X_i > b_n) = 1/n, then

P(b_n(M_n − b_n) ≤ x) → exp(−e^{−x})

(iii) Show that b_n ∼ (2 log n)^{1/2} and conclude M_n/(2 log n)^{1/2} → 1 in probability.

b. Theory
The next result is useful for proving things about weak convergence.

(2.1) Theorem. If F_n ⇒ F_∞ then there are random variables Y_n, 1 ≤ n ≤ ∞, with distribution F_n so that Y_n → Y_∞ a.s.

Proof  Let Ω = (0, 1), F = the Borel sets, P = Lebesgue measure, and let Y_n(x) = sup{y : F_n(y) < x}. By (1.1) in Chapter 1, Y_n has distribution F_n. We will now show that Y_n(x) → Y_∞(x) for all but a countable number of x. To do this, it is convenient to write Y_n(x) as F_n^{−1}(x) and drop the subscript when n = ∞.

We begin by identifying the exceptional set. Let a_x = sup{y : F(y) < x}, b_x = inf{y : F(y) > x}, and Ω_0 = {x : (a_x, b_x) = ∅} where (a_x, b_x) is the open interval with the indicated endpoints. Ω − Ω_0 is countable since the (a_x, b_x) are disjoint and each nonempty interval contains a different rational number. If x ∈ Ω_0 then F(y) < x for y < F^{−1}(x) and F(z) > x for z > F^{−1}(x). To prove that F_n^{−1}(x) → F^{−1}(x) for x ∈ Ω_0, there are two things to show:

(a) lim inf_{n→∞} F_n^{−1}(x) ≥ F^{−1}(x)

Proof of (a)  Let y < F^{−1}(x) be such that F is continuous at y. Since x ∈ Ω_0, F(y) < x, and if n is sufficiently large F_n(y) < x, i.e., F_n^{−1}(x) ≥ y. Since this holds for all y satisfying the indicated restrictions, the result follows.

(b) lim sup_{n→∞} F_n^{−1}(x) ≤ F^{−1}(x)

Proof of (b)  Let y > F^{−1}(x) be such that F is continuous at y. Since x ∈ Ω_0, F(y) > x, and if n is sufficiently large F_n(y) > x, i.e., F_n^{−1}(x) ≤ y. Since this holds for all y satisfying the indicated restrictions, the result follows and we have completed the proof of (2.1).
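The generalized inverse Y_n(x) = sup{y : F_n(y) < x} used in this proof is exactly the recipe behind inverse-transform sampling: plugging a uniform into it produces a variable with df F_n. A sketch for a discrete df (the particular atoms and sample size are arbitrary assumptions for illustration):

```python
import random
import bisect

atoms = [(-1.0, 0.2), (0.0, 0.5), (2.0, 0.3)]   # (value, probability)
cum = []
s = 0.0
for _, p in atoms:
    s += p
    cum.append(s)

def quantile(u):
    # sup{y : F(y) < u}: the first atom whose cumulative probability is >= u
    return atoms[bisect.bisect_left(cum, u)][0]

random.seed(0)
n = 200_000
counts = {v: 0 for v, _ in atoms}
for _ in range(n):
    counts[quantile(random.random())] += 1

# empirical frequencies should reproduce the atom probabilities
for v, p in atoms:
    assert abs(counts[v] / n - p) < 0.01
```

The a.s. convergence in (2.1) is the statement that when F_n ⇒ F_∞, these quantile functions, evaluated at one shared uniform variable, converge for all but countably many values of that variable.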
(2.1) allows us to immediately generalize some of our earlier results.

Exercise 2.4. Fatou's lemma. Let g ≥ 0 be continuous. If X_n ⇒ X_∞ then

lim inf_{n→∞} Eg(X_n) ≥ Eg(X_∞)

Exercise 2.5. Integration to the limit. Suppose g, h are continuous with g(x) > 0, and |h(x)|/g(x) → 0 as |x| → ∞. If F_n ⇒ F and ∫ g(x) dF_n(x) ≤ C < ∞ then

∫ h(x) dF_n(x) → ∫ h(x) dF(x)

The next result illustrates the usefulness of (2.1) and gives an equivalent definition of weak convergence that makes sense in any topological space.
(2.2) Theorem. X_n ⇒ X_∞ if and only if for every bounded continuous function g we have Eg(X_n) → Eg(X_∞).

Proof  Let Y_n have the same distribution as X_n and converge a.s. Since g is continuous, g(Y_n) → g(Y_∞) a.s. and the bounded convergence theorem implies

Eg(X_n) = Eg(Y_n) → Eg(Y_∞) = Eg(X_∞)

To prove the converse, let

g_{x,ε}(y) = 1 for y ≤ x,  0 for y ≥ x + ε,  linear for x ≤ y ≤ x + ε

Since g_{x,ε}(y) = 1 for y ≤ x, g_{x,ε} is continuous, and g_{x,ε}(y) = 0 for y > x + ε,

lim sup_{n→∞} P(X_n ≤ x) ≤ lim sup_{n→∞} E g_{x,ε}(X_n) = E g_{x,ε}(X_∞) ≤ P(X_∞ ≤ x + ε)

Letting ε → 0 gives lim sup_{n→∞} P(X_n ≤ x) ≤ P(X_∞ ≤ x). The last conclusion is valid for any x. To get the other direction, we observe

lim inf_{n→∞} P(X_n ≤ x) ≥ lim inf_{n→∞} E g_{x−ε,ε}(X_n) = E g_{x−ε,ε}(X_∞) ≥ P(X_∞ ≤ x − ε)

Letting ε → 0 gives lim inf_{n→∞} P(X_n ≤ x) ≥ P(X_∞ < x) = P(X_∞ ≤ x) if x is a continuity point. The results for the lim sup and the lim inf combine to give the desired result.
The next result is a trivial but useful generalization of (2.2).

(2.3) Continuous mapping theorem. Let g be a measurable function and D_g = {x : g is discontinuous at x}. If X_n ⇒ X_∞ and P(X_∞ ∈ D_g) = 0 then g(X_n) ⇒ g(X_∞). If in addition g is bounded then Eg(X_n) → Eg(X_∞).

Remark. D_g is always a Borel set. See Exercise 2.6 in Chapter 1.

Proof  Let Y_n =_d X_n with Y_n → Y_∞ a.s. If f is continuous then D_{f∘g} ⊂ D_g, so P(Y_∞ ∈ D_{f∘g}) = 0 and it follows that f(g(Y_n)) → f(g(Y_∞)) a.s. If, in addition, f is bounded then the bounded convergence theorem implies Ef(g(Y_n)) → Ef(g(Y_∞)). Since this holds for all bounded continuous functions, it follows from (2.2) that g(X_n) ⇒ g(X_∞).

The second conclusion is easier. Since P(Y_∞ ∈ D_g) = 0, g(Y_n) → g(Y_∞) a.s., and the desired result follows from the bounded convergence theorem.

The next result provides a number of useful alternative definitions of weak convergence.
(2.4) Theorem. The following statements are equivalent:
(i) X_n ⇒ X_∞.
(ii) For all open sets G, lim inf_{n→∞} P(X_n ∈ G) ≥ P(X_∞ ∈ G).
(iii) For all closed sets K, lim sup_{n→∞} P(X_n ∈ K) ≤ P(X_∞ ∈ K).
(iv) For all sets A with P(X_∞ ∈ ∂A) = 0, lim_{n→∞} P(X_n ∈ A) = P(X_∞ ∈ A).

Remark. To help remember the directions of the inequalities in (ii) and (iii), consider the special case in which P(X_n = x_n) = 1. In this case, if x_n ∈ G and x_n → x_∞ ∈ ∂G, then P(X_n ∈ G) = 1 for all n but P(X_∞ ∈ G) = 0. Letting K = G^c gives an example for (iii).

Proof  We will prove four things and leave it to the reader to check that we have proved the result given above.

(i) implies (ii): Let Y_n have the same distribution as X_n and Y_n → Y_∞ a.s. Since G is open,

lim inf_{n→∞} 1_G(Y_n) ≥ 1_G(Y_∞)

so Fatou's lemma implies

lim inf_{n→∞} P(Y_n ∈ G) ≥ P(Y_∞ ∈ G)

(ii) is equivalent to (iii): This follows easily from the facts that A is open if and only if A^c is closed, and P(A) + P(A^c) = 1.

(ii) and (iii) imply (iv): Let K = Ā and G = A° be the closure and interior of A, respectively. The boundary of A is ∂A = Ā − A°, and P(X_∞ ∈ ∂A) = 0, so

P(X_∞ ∈ K) = P(X_∞ ∈ A) = P(X_∞ ∈ G)

Using (ii) and (iii) now,

lim sup_{n→∞} P(X_n ∈ A) ≤ lim sup_{n→∞} P(X_n ∈ K) ≤ P(X_∞ ∈ K) = P(X_∞ ∈ A)
lim inf_{n→∞} P(X_n ∈ A) ≥ lim inf_{n→∞} P(X_n ∈ G) ≥ P(X_∞ ∈ G) = P(X_∞ ∈ A)

(iv) implies (i): Let x be such that P(X_∞ = x) = 0 and let A = (−∞, x].
(2.5) Helly’s selection theorem. For every sequence Fn of distribution functions, there is a subsequence Fn(k) and a right continuous nondecreasing function F so that limk→∞ Fn(k) (y ) = F (y ) at all continuity points y of F . Section 2.2 Weak Convergence
Remark. The limit may not be a distribution function. For example if a +
b + c = 1 and Fn (x) = a 1(x≥n) + b 1(x≥−n) + c G(x) where G is a distribution
function, then Fn (x) → F (x) = b + cG(x),
lim F (x) = b and x↓−∞ lim F (x) = b + c = 1 − a x↑∞ In words, an amount of mass a escapes to +∞, and mass b escapes to −∞.
The type of convergence that occurs in (2.5) is sometimes called vague convergence, and will be denoted here by ⇒v .
Proof of (2.5) The ﬁrst step is a diagonal argument. Let q1 , q2 , . . . be an
enumeration of the rationals. Since for each k , Fm (qk ) ∈ [0, 1] for all m, there
is a sequence mk (i) → ∞ that is a subsequence of mk−1 (j ) (let m0 (j ) ≡ j ) so
that
Fmk (i) (qk ) converges to G(qk ) as i → ∞
Let Fn(k) = Fmk (k) . By construction Fn(k) (q ) → G(q ) for all rational q . The
function G may not be right continuous but F (x) = inf {G(q ) : q ∈ Q, q > x} is
since
lim F (xn ) = inf {G(q ) : q ∈ Q, q > xn for some n}
xn ↓x = inf {G(q ) : q ∈ Q, q > x} = F (x)
To complete the proof, let x be a continuity point of F . Pick rationals r1 , r2 , s
with r1 < r2 < x < s so that
F (x) − < F (r1 ) ≤ F (r2 ) ≤ F (x) ≤ F (s) < F (x) +
Since Fn (r2 ) → G(r2 ) ≥ F (r1 ), and Fn (s) → G(s) ≤ F (s) it follows that if k is
large
F (x) − < Fnk (r2 ) ≤ Fnk (x) ≤ Fnk (s) < F (x) +
which is the desired conclusion.
The last result raises a question: When can we conclude that no mass is lost in the limit in (2.5)?

(2.6) Theorem. Every subsequential limit is the distribution function of a probability measure if and only if the sequence F_n is tight, i.e., for all ε > 0 there is an M so that

lim sup_{n→∞} 1 − F_n(M) + F_n(−M) ≤ ε

Proof  Suppose the sequence is tight and F_{n(k)} ⇒_v F. Let r < −M and s > M be continuity points of F. Since F_n(r) → F(r) and F_n(s) → F(s), we have

1 − F(s) + F(r) = lim_{k→∞} 1 − F_{n(k)}(s) + F_{n(k)}(r) ≤ lim sup_{n→∞} 1 − F_n(M) + F_n(−M) ≤ ε

The last result implies lim sup_{x→∞} 1 − F(x) + F(−x) ≤ ε. Since ε is arbitrary, it follows that F is the distribution function of a probability measure.

To prove the converse, now suppose F_n is not tight. In this case, there is an ε > 0 and a subsequence n(k) → ∞ so that

1 − F_{n(k)}(k) + F_{n(k)}(−k) ≥ ε

for all k. By passing to a further subsequence F_{n(k_j)} we can suppose that F_{n(k_j)} ⇒_v F. Let r < 0 < s be continuity points of F.

1 − F(s) + F(r) = lim_{j→∞} 1 − F_{n(k_j)}(s) + F_{n(k_j)}(r)
 ≥ lim inf_{j→∞} 1 − F_{n(k_j)}(k_j) + F_{n(k_j)}(−k_j) ≥ ε

Letting s → ∞ and r → −∞, we see that F is not the distribution function of a probability measure.

The following sufficient condition for tightness is often useful.

(2.7) Theorem. If there is a ϕ ≥ 0 so that ϕ(x) → ∞ as |x| → ∞ and

C = sup_n ∫ ϕ(x) dF_n(x) < ∞

then F_n is tight.

Proof  1 − F_n(M) + F_n(−M) ≤ C / inf_{|x|≥M} ϕ(x).

Exercises
2.6. If Fn ⇒ F and F is continuous then sup_x |Fn(x) − F(x)| → 0.
2.7. If F is any distribution function there is a sequence of distribution functions of the form Σ_{m=1}^{n} a_{n,m} 1_{(x_{n,m} ≤ x)} with Fn ⇒ F. Hint: use (7.4) in Chapter 1.
2.8. Let Xn, 1 ≤ n ≤ ∞, be integer valued. Show that Xn ⇒ X∞ if and only if P(Xn = m) → P(X∞ = m) for all m.
2.9. Show that if Xn → X in probability then Xn ⇒ X and that, conversely, if
Xn ⇒ c, where c is a constant then Xn → c in probability.
2.10. Converging together lemma. If Xn ⇒ X and Yn ⇒ c, where c is a
constant then Xn + Yn ⇒ X + c. A useful consequence of this result is that if
Xn ⇒ X and Zn − Xn ⇒ 0 then Zn ⇒ X.
2.11. Suppose Xn ⇒ X , Yn ≥ 0, and Yn ⇒ c, where c > 0 is a constant then
Xn Yn ⇒ cX. This result is true without the assumptions Yn ≥ 0 and c > 0.
We have imposed these only to make the proof less tedious.
2.12. Show that if Xn = (X_n^{(1)}, . . . , X_n^{(n)}) is uniformly distributed over the surface of the sphere of radius √n in R^n then X_n^{(1)} ⇒ a standard normal. Hint: Let Y1, Y2, . . . be i.i.d. standard normals and let X_n^{(i)} = Y_i (n / Σ_{m=1}^{n} Y_m^2)^{1/2}.
2.13. Suppose Yn ≥ 0, EY_n^α → 1 and EY_n^β → 1 for some 0 < α < β. Show that Yn → 1 in probability.
2.14. For each K < ∞ and y < 1 there is a c_{y,K} > 0 so that EX^2 = 1 and EX^4 ≤ K imply P(|X| > y) ≥ c_{y,K}.
2.15. The Lévy Metric. Show that
    ρ(F, G) = inf{ε : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε for all x}
defines a metric on the space of distributions and ρ(Fn, F) → 0 if and only if Fn ⇒ F.
2.16. The Ky Fan metric on random variables is defined by
    α(X, Y) = inf{ε ≥ 0 : P(|X − Y| > ε) ≤ ε}
Show that if α(X, Y) = α then the corresponding distributions have Lévy distance ρ(F, G) ≤ α.
2.17. Let α(X, Y) be the metric in the previous exercise and let β(X, Y) = E(|X − Y|/(1 + |X − Y|)) be the metric of Exercise 6.4 in Chapter 1. If α(X, Y) = a then
    a^2/(1 + a) ≤ β(X, Y) ≤ a + (1 − a)a/(1 + a)

2.3. Characteristic Functions
This long section is divided into five parts. The first three are required reading; the last two are optional. In part a we show that the characteristic function ϕ(t) = E exp(itX) determines F(x) = P(X ≤ x), and we give recipes for computing F from ϕ. In part b we relate weak convergence of distributions to the behavior of the corresponding characteristic functions. In part c we relate the behavior of ϕ(t) at 0 to the moments of X. In part d we prove Polya's criterion and use it to construct some famous and some strange examples of characteristic functions. Finally, in part e we consider the moment problem, i.e., when is a distribution characterized by its moments.

a. Definition, Inversion Formula
If X is a random variable we deﬁne its characteristic function (ch.f.) by
ϕ(t) = EeitX = E cos tX + iE sin tX
The last formula requires taking the expected value of a complex valued random
variable but as the second equality may suggest no new theory is required. If Z
is complex valued we deﬁne EZ = E (Re Z ) + iE (Im Z ) where Re (a + bi) = a
is the real part and Im (a + bi) = b is the imaginary part. Some properties
are immediate from the deﬁnition:
(3.1a) ϕ(0) = 1
(3.1b) ϕ(−t) = E(cos(−tX) + i sin(−tX)) = \overline{ϕ(t)}
where \bar{z} denotes the complex conjugate of z, \overline{a + bi} = a − bi.
(3.1c) |ϕ(t)| = |Ee^{itX}| ≤ E|e^{itX}| = 1
Here |z| denotes the modulus of the complex number z, |a + bi| = (a^2 + b^2)^{1/2}. The inequality follows from Exercise 3.4 in Chapter 1 since ϕ(x, y) = (x^2 + y^2)^{1/2} is convex.
(3.1d) |ϕ(t + h) − ϕ(t)| = |E(e^{i(t+h)X} − e^{itX})|
       ≤ E|e^{i(t+h)X} − e^{itX}| = E|e^{ihX} − 1|
since |zw| = |z| · |w|. The last quantity → 0 as h → 0 by the bounded convergence theorem, so ϕ(t) is uniformly continuous on (−∞, ∞).
(3.1e) Ee^{it(aX+b)} = e^{itb} Ee^{i(ta)X} = e^{itb} ϕ(at)
(3.1e) and (3.1b) imply that if X has ch.f. ϕ(t) then −X has ch.f. ϕ(−t) = \overline{ϕ(t)}.
(3.1f) If X1 and X2 are independent and have ch.f.’s ϕ1 and ϕ2 then X1 + X2
has ch.f. ϕ1 (t)ϕ2 (t).
Proof Eeit(X1 +X2 ) = E (eitX1 eitX2 ) = EeitX1 EeitX2 . The next order of business is to give some examples.
Example 3.1. Coin ﬂips. If P (X = 1) = P (X = −1) = 1/2 then
EeitX = (eit + e−it )/2 = cos t
Example 3.2. Poisson distribution. If P(X = k) = e^{−λ} λ^k/k! for k = 0, 1, 2, . . . then
    Ee^{itX} = Σ_{k=0}^{∞} e^{−λ} λ^k e^{itk}/k! = exp(λ(e^{it} − 1))

Example 3.3. Normal distribution
    Density: (2π)^{−1/2} exp(−x^2/2)
    Ch.f.:   exp(−t^2/2)
Combining this result with (3.1e), we see that a normal distribution with mean µ and variance σ^2 has ch.f. exp(iµt − σ^2 t^2/2). Similar scalings can be applied to other examples, so we will often just give the ch.f. for one member of the family.
Physics Proof
    ∫ e^{itx} (2π)^{−1/2} e^{−x^2/2} dx = e^{−t^2/2} ∫ (2π)^{−1/2} e^{−(x−it)^2/2} dx
The integral is 1 since the integrand is the normal density with mean it and variance 1.
Math Proof  Now that we have cheated and figured out the answer, we can verify it by a formal calculation that gives very little insight into why it is true. Let
    ϕ(t) = ∫ e^{itx} (2π)^{−1/2} e^{−x^2/2} dx = ∫ cos(tx) (2π)^{−1/2} e^{−x^2/2} dx
since i sin(tx) is an odd function. Differentiating with respect to t (referring to Example 9.1 in the Appendix for the justification) and then integrating by parts gives
    ϕ′(t) = ∫ −x sin(tx) (2π)^{−1/2} e^{−x^2/2} dx
          = −∫ t cos(tx) (2π)^{−1/2} e^{−x^2/2} dx = −tϕ(t)
This implies (d/dt){ϕ(t) exp(t^2/2)} = 0, so ϕ(t) exp(t^2/2) = ϕ(0) = 1.

In the next three examples, the density is 0 outside the indicated range.
Example 3.4. Uniform distribution on (a, b)
    Density: 1/(b − a) for x ∈ (a, b)
    Ch.f.:   (e^{itb} − e^{ita})/(it(b − a))
In the special case a = −c, b = c the ch.f. is (e^{itc} − e^{−itc})/(2cit) = (sin ct)/ct.
Proof  Once you recall that ∫_a^b e^{λx} dx = (e^{λb} − e^{λa})/λ holds for complex λ, this is immediate.

Example 3.5. Triangular distribution
    Density: 1 − |x| for x ∈ (−1, 1)
    Ch.f.:   2(1 − cos t)/t^2
Proof  To see this, notice that if X and Y are independent and uniform on (−1/2, 1/2) then X + Y has a triangular distribution. Using Example 3.4 now and (3.1f), it follows that the desired ch.f. is
    {(e^{it/2} − e^{−it/2})/it}^2 = {2 sin(t/2)/t}^2
Using the trig identity cos 2θ = 1 − 2 sin^2 θ with θ = t/2 converts the answer into the form given above.
Example 3.6. Exponential distribution
    Density: e^{−x} for x ∈ (0, ∞)
    Ch.f.:   1/(1 − it)
Proof  Integrating gives
    ∫_0^∞ e^{itx} e^{−x} dx = [e^{(it−1)x}/(it − 1)]_0^∞ = 1/(1 − it)
since exp((it − 1)x) → 0 as x → ∞.
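Both of the last two formulas can be verified numerically (an illustrative sketch, not part of the text; the values of t, a, b are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

# Spot checks (illustration only) of the uniform and exponential
# characteristic functions by direct numerical integration.
t = 0.7

# Uniform on (a, b) = (-1, 2): ch.f. should be (e^{itb} - e^{ita})/(it(b-a)).
a, b = -1.0, 2.0
re_u, _ = quad(lambda x: np.cos(t * x) / (b - a), a, b)
im_u, _ = quad(lambda x: np.sin(t * x) / (b - a), a, b)
unif_closed = (np.exp(1j * t * b) - np.exp(1j * t * a)) / (1j * t * (b - a))
unif_err = abs((re_u + 1j * im_u) - unif_closed)

# Exponential(1): ch.f. should be 1/(1 - it).
re_e, _ = quad(lambda x: np.cos(t * x) * np.exp(-x), 0, np.inf)
im_e, _ = quad(lambda x: np.sin(t * x) * np.exp(-x), 0, np.inf)
exp_err = abs((re_e + 1j * im_e) - 1 / (1 - 1j * t))
```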
Example 3.7. Bilateral exponential
    Density: (1/2)e^{−|x|} for x ∈ (−∞, ∞)
    Ch.f.:   1/(1 + t^2)
Proof  This follows from a more general fact:
(3.1g) If F1, . . . , Fn have ch.f.'s ϕ1, . . . , ϕn and λi ≥ 0 have λ1 + . . . + λn = 1 then Σ_{i=1}^{n} λi Fi has ch.f. Σ_{i=1}^{n} λi ϕi.
Applying (3.1g) with F1 the distribution of an exponential random variable X, F2 the distribution of −X, and λ1 = λ2 = 1/2, then using (3.1b), we see the desired ch.f. is
    1/(2(1 − it)) + 1/(2(1 + it)) = ((1 + it) + (1 − it))/(2(1 + t^2)) = 1/(1 + t^2)

Exercise 3.1. Show that if ϕ is a ch.f. then Re ϕ and |ϕ|^2 are also.
The first issue to be settled is that the characteristic function uniquely determines the distribution. This and more is provided by
(3.2) The inversion formula. Let ϕ(t) = ∫ e^{itx} µ(dx) where µ is a probability measure. If a < b then
    lim_{T→∞} (2π)^{−1} ∫_{−T}^{T} (e^{−ita} − e^{−itb})/(it) ϕ(t) dt = µ(a, b) + (1/2)µ({a, b})
Remark. The existence of the limit is part of the conclusion. If µ = δ0, a point mass at 0, then ϕ(t) ≡ 1. In this case, if a = −1 and b = 1, the integrand is (2 sin t)/t and the integral does not converge absolutely.
Proof  Let
    I_T = ∫_{−T}^{T} (e^{−ita} − e^{−itb})/(it) ϕ(t) dt = ∫_{−T}^{T} ∫ (e^{−ita} − e^{−itb})/(it) e^{itx} µ(dx) dt
The integrand may look bad near t = 0, but if we observe that
    (e^{−ita} − e^{−itb})/(it) = ∫_a^b e^{−ity} dy
we see that the modulus of the integrand is bounded by b − a. Since µ is a probability measure and [−T, T] is a finite interval, it follows from Fubini's theorem, cos(−x) = cos x, and sin(−x) = −sin x that
    I_T = ∫ ∫_{−T}^{T} (e^{−ita} − e^{−itb})/(it) e^{itx} dt µ(dx)
        = ∫ [ ∫_{−T}^{T} sin(t(x − a))/t dt − ∫_{−T}^{T} sin(t(x − b))/t dt ] µ(dx)
Introducing R(θ, T) = ∫_{−T}^{T} sin(θt)/t dt, we can write the last result as
(∗)    I_T = ∫ {R(x − a, T) − R(x − b, T)} µ(dx)
If we let S(T) = ∫_0^T (sin x)/x dx, then for θ > 0 changing variables t = x/θ shows that
    R(θ, T) = 2 ∫_0^{Tθ} (sin x)/x dx = 2S(Tθ)
while for θ < 0, R(θ, T) = −R(−θ, T). Introducing the function sgn x, which is 1 if x > 0, −1 if x < 0, and 0 if x = 0, we can write the last two formulas together as
    R(θ, T) = 2(sgn θ) S(T|θ|)
As T → ∞, S(T) → π/2 (see Exercise 6.6 in the Appendix), so we have R(θ, T) → π sgn θ and
    R(x − a, T) − R(x − b, T) →  2π  if a < x < b
                                 π   if x = a or x = b
                                 0   if x < a or x > b
|R(θ, T)| ≤ 2 sup_y S(y) < ∞, so using the bounded convergence theorem with (∗) implies
    (2π)^{−1} I_T → µ(a, b) + (1/2)µ({a, b})
proving (3.2).
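The inversion formula is easy to test numerically (an illustration, not part of the text; the choices of distribution, a, b, T, and grid are arbitrary). For the standard normal, whose ch.f. is exp(−t²/2), the truncated integral should approach µ(a, b) as T grows:

```python
import numpy as np
from scipy.stats import norm

# Numerical illustration of the inversion formula (3.2) for the standard
# normal: (2*pi)^{-1} int_{-T}^{T} (e^{-ita} - e^{-itb})/(it) phi(t) dt
# should be close to mu(a, b) for large T (a, b continuity points).
a, b, T = -0.5, 1.0, 40.0
dt = 2e-4
t = np.arange(-T, T, dt)
t = t[np.abs(t) > 1e-12]           # the integrand extends continuously to t = 0
phi = np.exp(-t**2 / 2)
integrand = (np.exp(-1j * t * a) - np.exp(-1j * t * b)) / (1j * t) * phi
approx = np.real(np.sum(integrand) * dt) / (2 * np.pi)
exact = norm.cdf(b) - norm.cdf(a)
inversion_err = abs(approx - exact)
```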
Exercise 3.2. (i) Imitate the proof of (3.2) to show that
    µ({a}) = lim_{T→∞} (1/2T) ∫_{−T}^{T} e^{−ita} ϕ(t) dt
(ii) If P(X ∈ hZ) = 1 where h > 0 then its ch.f. has ϕ(2π/h + t) = ϕ(t), so
    P(X = x) = (h/2π) ∫_{−π/h}^{π/h} e^{−itx} ϕ(t) dt    for x ∈ hZ
(iii) If X = Y + b then E exp(itX) = e^{itb} E exp(itY). So if P(X ∈ b + hZ) = 1, the inversion formula in (ii) is valid for x ∈ b + hZ.
Two trivial consequences of the inversion formula are:
Exercise 3.3. If ϕ is real then X and −X have the same distribution.
Exercise 3.4. If Xi, i = 1, 2 are independent and have normal distributions with mean 0 and variance σ_i^2, then X1 + X2 has a normal distribution with mean 0 and variance σ_1^2 + σ_2^2.
The inversion formula is simpler when ϕ is integrable, but as the next result
shows this only happens when the underlying measure is nice.
(3.3) Theorem. If ∫ |ϕ(t)| dt < ∞ then µ has bounded continuous density
    f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt
Proof  As we observed in the proof of (3.2),
    |(e^{−ita} − e^{−itb})/(it)| = |∫_a^b e^{−ity} dy| ≤ b − a
so the integral in (3.2) converges absolutely in this case and
    µ(a, b) + (1/2)µ({a, b}) = (1/2π) ∫_{−∞}^{∞} (e^{−ita} − e^{−itb})/(it) ϕ(t) dt ≤ ((b − a)/2π) ∫_{−∞}^{∞} |ϕ(t)| dt
2π −∞ The last result implies µ has no point masses and
µ(x, x + h) = 1
2π = 1
2π
x+h =
x e−itx − e−it(x+h)
ϕ(t) dt
it
x+h e−ity dy ϕ(t) dt
x 1
2π e−ity ϕ(t) dt dy ∞ ϕ(t)dt
−∞ 95 96 Chapter 2 Central Limit Theorems
by Fubini’s theorem, so the distribution µ has density function
f (y ) = 1
2π e−ity ϕ(t) dt The dominated convergence theorem implies f is continuous and the proof is
complete.
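A quick numerical sketch of (3.3) in action (an illustration only; the grid and truncation are chosen for convenience): starting from the integrable ch.f. ϕ(t) = exp(−|t|), the density formula recovers 1/(π(1 + y²)), the Cauchy density of Example 3.9 below.

```python
import numpy as np

# Illustration of Theorem (3.3): when int |phi| < infinity the density is
# f(y) = (1/2*pi) int e^{-ity} phi(t) dt.  For phi(t) = exp(-|t|) this
# should recover the Cauchy density 1/(pi (1 + y^2)).
dt = 1e-4
t = np.arange(-50, 50, dt)
phi = np.exp(-np.abs(t))

def density(y):
    return np.real(np.sum(np.exp(-1j * t * y) * phi)) * dt / (2 * np.pi)

ys = [-2.0, 0.0, 1.5]
cauchy_err = max(abs(density(y) - 1 / (np.pi * (1 + y**2))) for y in ys)
```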
Exercise 3.5. Give an example of a measure µ with a density but for which ∫ |ϕ(t)| dt = ∞. Hint: Two of the examples above have this property.
Exercise 3.6. Show that if X1, . . . , Xn are independent and uniformly distributed on (−1, 1), then for n ≥ 2, X1 + · · · + Xn has density
    f(x) = (1/π) ∫_0^∞ (sin t / t)^n cos(tx) dt
Although it is not obvious from the formula, f is a polynomial in each interval (k, k + 1), k ∈ Z, and vanishes on [−n, n]^c.
(3.3) and the next result show that the behavior of ϕ at infinity is related to the smoothness of the underlying measure.
Exercise 3.7. Suppose X and Y are independent and have ch.f. ϕ and distribution µ. Apply Exercise 3.2 to X − Y and use Exercise 4.7 in Chapter 1 to get
    lim_{T→∞} (1/2T) ∫_{−T}^{T} |ϕ(t)|^2 dt = P(X − Y = 0) = Σ_x µ({x})^2
Remark. The last result implies that if ϕ(t) → 0 as t → ∞, µ has no point masses. Exercise 3.13 gives an example to show that the converse is false. The Riemann-Lebesgue Lemma (Exercise 4.4 in the Appendix) shows that if µ has a density, ϕ(t) → 0 as t → ∞.
Applying the inversion formula (3.3) to the ch.f.'s in Examples 3.5 and 3.7 gives us two more examples of ch.f.'s. The first one does not have an official name, so we gave it one to honor its role in the proof of Polya's criterion (see (3.10)).
Example 3.8. Polya's distribution
    Density: (1 − cos x)/πx^2
    Ch.f.:   (1 − |t|)^+
Proof  (3.3) implies
    (1/2π) ∫ 2(1 − cos s)/s^2 · e^{−isy} ds = (1 − |y|)^+
Now let s = x, y = −t.
Example 3.9. The Cauchy distribution
    Density: 1/(π(1 + x^2))
    Ch.f.:   exp(−|t|)
Proof  (3.3) implies
    (1/2π) ∫ 1/(1 + s^2) · e^{−isy} ds = (1/2) e^{−|y|}
Now let s = x, y = −t and multiply each side by 2.
Exercise 3.8. Use the last result to conclude that if X1, X2, . . . are independent and have the Cauchy distribution, then (X1 + · · · + Xn)/n has the same distribution as X1.

b. Weak Convergence
Our next step toward the central limit theorem is to relate convergence of
characteristic functions to weak convergence.
(3.4) Continuity theorem. Let µn , 1 ≤ n ≤ ∞ be probability measures
with ch.f. ϕn . (i) If µn ⇒ µ∞ then ϕn (t) → ϕ∞ (t) for all t. (ii) If ϕn (t)
converges pointwise to a limit ϕ(t) that is continuous at 0, then the associated
sequence of distributions µn is tight and converges weakly to the measure µ
with characteristic function ϕ.
Remark. To see why continuity of the limit at 0 is needed in (ii), let µn have a normal distribution with mean 0 and variance n. In this case ϕn(t) = exp(−nt^2/2) → 0 for t ≠ 0, and ϕn(0) = 1 for all n, but the measures do not converge weakly since µn((−∞, x]) → 1/2 for all x.
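The remark can be checked directly (a numerical illustration, not part of the text's argument; n and the evaluation points are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Illustration of the remark: for mu_n = Normal(0, n),
# phi_n(t) = exp(-n t^2 / 2) tends to the indicator 1_{t=0}, which is
# discontinuous at 0, and mu_n((-inf, x]) -> 1/2 for every x.
n = 10_000
phi_at_zero = np.exp(-n * 0.0**2 / 2)      # = 1 for every n
phi_small_t = np.exp(-n * 0.1**2 / 2)      # already ~ 0 for t != 0
cdf_at_3 = norm.cdf(3.0 / np.sqrt(n))      # N(0, n) cdf at x = 3, close to 1/2
```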
Proof  (i) is easy. e^{itx} is bounded and continuous, so if µn ⇒ µ∞ then (2.2) implies ϕn(t) → ϕ∞(t). To prove (ii), our first goal is to prove tightness. We begin with some calculations that may look mysterious but will prove to be very useful.
    ∫_{−u}^{u} (1 − e^{itx}) dt = 2u − ∫_{−u}^{u} (cos tx + i sin tx) dt = 2u − (2 sin ux)/x
Dividing both sides by u, integrating µn(dx), and using Fubini's theorem on the left-hand side gives
    u^{−1} ∫_{−u}^{u} (1 − ϕn(t)) dt = 2 ∫ (1 − sin(ux)/(ux)) µn(dx)
To bound the right-hand side, we note that |sin x| ≤ |x| for all x, so we have 1 − sin(ux)/(ux) ≥ 0. Discarding the integral over (−2/u, 2/u) and using |sin ux| ≤ 1 on the rest, the right-hand side is
    ≥ 2 ∫_{|x|≥2/u} (1 − 1/(u|x|)) µn(dx) ≥ µn({x : |x| > 2/u})
Since ϕ(t) → 1 as t → 0,
    u^{−1} ∫_{−u}^{u} (1 − ϕ(t)) dt → 0    as u → 0
Pick u so that the integral is < ε. Since ϕn(t) → ϕ(t) for each t, it follows from the dominated convergence theorem that for n ≥ N
    2ε ≥ u^{−1} ∫_{−u}^{u} (1 − ϕn(t)) dt ≥ µn({x : |x| > 2/u})
Since ε is arbitrary, the sequence µn is tight.
To complete the proof now we observe that if µn(k) ⇒ µ, then it follows
from the ﬁrst sentence of the proof that µ has ch.f. ϕ. The last observation
and tightness imply that every subsequence has a further subsequence that
converges to µ. I claim that this implies the whole sequence converges to µ.
To see this, observe that we have shown that if f is bounded and continuous then every subsequence of ∫ f dµn has a further subsequence that converges to ∫ f dµ, so (6.3) in Chapter 1 implies that the whole sequence converges to that limit. This shows ∫ f dµn → ∫ f dµ for all bounded continuous functions f, so the desired result follows from (2.2).
Exercise 3.9. Suppose that Xn ⇒ X and Xn has a normal distribution with mean 0 and variance σ_n^2. Prove that σ_n^2 → σ^2 ∈ [0, ∞).
Exercise 3.10. Show that if Xn and Yn are independent for 1 ≤ n ≤ ∞,
Xn ⇒ X∞ , and Yn ⇒ Y∞ , then Xn + Yn ⇒ X∞ + Y∞ .
Exercise 3.11. Let X1, X2, . . . be independent and let Sn = X1 + · · · + Xn. Let ϕj be the ch.f. of Xj and suppose that Sn → S∞ a.s. Then S∞ has ch.f. Π_{j=1}^{∞} ϕj(t).
Exercise 3.12. Using the identity sin t = 2 sin(t/2) cos(t/2) repeatedly leads to (sin t)/t = Π_{m=1}^{∞} cos(t/2^m). Prove the last identity by interpreting each side as a characteristic function.
Exercise 3.13. Let X1, X2, . . . be independent taking values 0 and 1 with probability 1/2 each. X = 2 Σ_{j≥1} Xj/3^j has the Cantor distribution. Compute the ch.f. ϕ of X and notice that ϕ has the same value at t = 3^k π for k = 0, 1, 2, . . .

c. Moments and Derivatives
In the proof of (3.4), we derived the inequality
(3.5)    µ({x : |x| > 2/u}) ≤ u^{−1} ∫_{−u}^{u} (1 − ϕ(t)) dt
which shows that the smoothness of the characteristic function at 0 is related to the decay of the measure at ∞. The next result continues this theme. We leave the proof to the reader. (Use (9.1) in the Appendix.)
Exercise 3.14. If ∫ |x|^n µ(dx) < ∞ then its characteristic function ϕ has a continuous derivative of order n given by ϕ^{(n)}(t) = ∫ (ix)^n e^{itx} µ(dx).
Exercise 3.15. Use the last exercise and the series expansion for e^{−t^2/2} to show that the standard normal distribution has
    EX^{2n} = (2n)!/(2^n n!) = (2n − 1)(2n − 3) · · · 3 · 1 ≡ (2n − 1)!!
The result in Exercise 3.14 shows that if E|X|^n < ∞, then its characteristic function is n times differentiable at 0, and ϕ^{(n)}(0) = E(iX)^n. Expanding ϕ in a Taylor series about 0 leads to
    ϕ(t) = Σ_{m=0}^{n} E(itX)^m/m! + o(t^n)
where o(t^n) indicates a quantity g(t) that has g(t)/t^n → 0 as t → 0. For our purposes below, it will be important to have a good estimate on the error term, so we will now derive the last result. The starting point is a little calculus.
n (3.6) Lemma. eix − (ix)m
≤ min
m!
m=0 xn+1 2xn
,
(n + 1)! n! 99 100 Chapter 2 Central Limit Theorems
The ﬁrst term on the right is the usual order of magnitude we expect in the
correction term. The second is better for large x and will help us prove the
central limit theorem without assuming ﬁnite third moments.
Proof  Integrating by parts gives
    ∫_0^x (x − s)^n e^{is} ds = x^{n+1}/(n + 1) + i/(n + 1) ∫_0^x (x − s)^{n+1} e^{is} ds
When n = 0, this says
    ∫_0^x e^{is} ds = x + i ∫_0^x (x − s) e^{is} ds
The left-hand side is (e^{ix} − 1)/i, so rearranging gives
    e^{ix} = 1 + ix + i^2 ∫_0^x (x − s) e^{is} ds
Using the result for n = 1 now gives
    e^{ix} = 1 + ix + i^2 x^2/2 + (i^3/2) ∫_0^x (x − s)^2 e^{is} ds
and iterating we arrive at
(a)    e^{ix} − Σ_{m=0}^{n} (ix)^m/m! = (i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds
To prove (3.6) now it only remains to estimate the "error term" on the right-hand side. Since |e^{is}| ≤ 1 for all s,
(b)    |(i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds| ≤ |x|^{n+1}/(n + 1)!
The last estimate is good when x is small. The next is designed for large x. Integrating by parts,
    (i/n) ∫_0^x (x − s)^n e^{is} ds = −x^n/n + ∫_0^x (x − s)^{n−1} e^{is} ds
Noticing x^n/n = ∫_0^x (x − s)^{n−1} ds now gives
    (i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds = (i^n/(n − 1)!) ∫_0^x (x − s)^{n−1} (e^{is} − 1) ds
and since |e^{is} − 1| ≤ 2, it follows that
(c)    |(i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds| ≤ (2/(n − 1)!) ∫_0^x (x − s)^{n−1} ds ≤ 2|x|^n/n!
Combining (a), (b), and (c) we have (3.6).
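The two bounds in (3.6) can be spot-checked numerically over a grid of x and small n (an illustration only; the grid is an arbitrary choice):

```python
import numpy as np
from math import factorial

# Spot check (illustration) of Lemma (3.6): the Taylor remainder of e^{ix}
# is bounded by min(|x|^{n+1}/(n+1)!, 2|x|^n/n!).
def remainder(x, n):
    partial = sum((1j * x) ** m / factorial(m) for m in range(n + 1))
    return abs(np.exp(1j * x) - partial)

ok = all(
    remainder(x, n)
    <= min(abs(x) ** (n + 1) / factorial(n + 1), 2 * abs(x) ** n / factorial(n)) + 1e-9
    for x in np.linspace(-20, 20, 81)
    for n in range(6)
)
```

Note how the first bound wins for small |x| and the second for large |x|, which is exactly the point made after the lemma's statement.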
Taking expected values, using Jensen's inequality, and applying (3.6) to x = tX gives
(3.7)    |Ee^{itX} − Σ_{m=0}^{n} E(itX)^m/m!| ≤ E|e^{itX} − Σ_{m=0}^{n} (itX)^m/m!|
         ≤ E min( |tX|^{n+1}/(n + 1)!, 2|tX|^n/n! )
In the next section, the following special case will be useful.
(3.8) Theorem. If E|X|^2 < ∞ then
    ϕ(t) = 1 + itEX − t^2 E(X^2)/2 + o(t^2)
Proof  The error term is ≤ t^2 E(|t||X|^3 ∧ 6|X|^2)/3!. The variable in parentheses is smaller than 6|X|^2 and converges to 0 as t → 0, so the desired conclusion follows from the dominated convergence theorem.
Remark. The point of the estimate in (3.7), which involves the minimum of two terms rather than just the first one that would result from a naive application of Taylor series, is that we get the conclusion in (3.8) under the assumption E|X|^2 < ∞, i.e., we do not have to assume E|X|^3 < ∞.
Exercise 3.16. (i) Suppose that the family of measures {µi, i ∈ I} is tight, i.e., sup_i µi([−M, M]^c) → 0 as M → ∞. Use (3.1d) and (3.7) with n = 0 to show that their ch.f.'s ϕi are equicontinuous, i.e., if ε > 0 we can pick δ > 0 so that if |h| < δ then |ϕi(t + h) − ϕi(t)| < ε. (ii) Suppose µn ⇒ µ∞. Use (3.4) and equicontinuity to conclude that the ch.f.'s ϕn → ϕ∞ uniformly on compact sets. [Argue directly. You don't need to go to AA.] (iii) Give an example to show that the convergence need not be uniform on the whole real line.
Exercise 3.17. Let X1, X2, . . . be i.i.d. with characteristic function ϕ. (i) If ϕ′(0) = ia and Sn = X1 + · · · + Xn then Sn/n → a in probability. (ii) If Sn/n → a in probability then ϕ(t/n)^n → e^{iat}. Use this to conclude that ϕ′(0) = ia, so the weak law holds if and only if ϕ′(0) exists. This is due to E.J.G. Pitman (1956).
The last exercise in combination with Exercise 5.4 from Chapter 1 shows that ϕ′(0) may exist when E|X| = ∞.
Exercise 3.18. 2 ∫_0^∞ (1 − Re ϕ(t))/(πt^2) dt = ∫ |y| dF(y). Hint: Change variables x = yt in the density function of Example 3.8, which integrates to 1.
The next result shows that the existence of second derivatives implies the existence of second moments.
(3.9) Theorem. If lim sup_{h↓0} {ϕ(h) − 2ϕ(0) + ϕ(−h)}/h^2 > −∞, then E|X|^2 < ∞.
Proof  (e^{ihx} − 2 + e^{−ihx})/h^2 = −2(1 − cos hx)/h^2 ≤ 0 and 2(1 − cos hx)/h^2 → x^2 as h → 0, so Fatou's lemma and Fubini's theorem imply
    ∫ x^2 dF(x) ≤ 2 lim inf_{h→0} ∫ (1 − cos hx)/h^2 dF(x)
               = −lim sup_{h→0} {ϕ(h) − 2ϕ(0) + ϕ(−h)}/h^2 < ∞
h→0 Exercise 3.19. Show that if limt↓0 (ϕ(t) − 1)/t2 = c > −∞ then EX = 0 and
E X 2 = −2c < ∞. In particular, if ϕ(t) = 1 + o(t2 ) then ϕ(t) ≡ 1.
Exercise 3.20. If Yn are r.v.’s with ch.f.’s ϕn then Yn ⇒ 0 if and only if there
is a δ > 0 so that ϕn (t) → 1 for t ≤ δ .
Exercise 3.21. Let X1 , X2 , . . . be independent. If Sn = m≤n Xm converges
in distribution then it converges in probability (and hence a.s. by Exercise
8.11 in Chapter 1). Hint: The last exercise implies that if m, n → ∞ then
Sm − Sn → 0 in probability. Now use Exercise 8.10 in Chapter 1. *d. Polya’s Criterion
The next result is useful for constructing examples of ch.f.’s.
(3.10) Polya's criterion. Let ϕ(t) be real, nonnegative and have ϕ(0) = 1, ϕ(t) = ϕ(−t), and ϕ decreasing and convex on (0, ∞) with
    lim_{t↓0} ϕ(t) = 1,    lim_{t↑∞} ϕ(t) = 0
Then there is a probability measure ν on (0, ∞) so that
(∗)    ϕ(t) = ∫_0^∞ (1 − |t|/s)^+ ν(ds)
and hence ϕ is a characteristic function.
Remark. Before we get lost in the details of the proof, the reader should
note that (∗) displays ϕ as a convex combination of ch.f.’s of the form given in
Example 3.8, so an extension of (3.1g) (to be proved below) implies that this is
a ch.f.
The assumption that limt→0 ϕ(t) = 1 is necessary because the function
ϕ(t) = 1{0} (t) which is 1 at 0 and 0 otherwise satisﬁes all the other hypotheses.
We could allow limt→∞ ϕ(t) = c > 0 by having a point mass of size c at 0, but
we leave this extension to the reader.
Proof  Let ϕ′ be the right derivative of ϕ, i.e.,
    ϕ′(t) = lim_{h↓0} (ϕ(t + h) − ϕ(t))/h
Since ϕ is convex, this exists and is right continuous and increasing. So we can let µ be the measure on (0, ∞) with µ(a, b] = ϕ′(b) − ϕ′(a) for all 0 ≤ a < b < ∞, and let ν be the measure on (0, ∞) with dν/dµ = s.
Now ϕ′(t) → 0 as t → ∞ (for if ϕ′(t) ↓ −ε we would have ϕ(t) ≤ 1 − εt for all t), so Exercise 8.7 in the Appendix implies
    −ϕ′(s) = ∫_s^∞ r^{−1} ν(dr)
Integrating again and using Fubini's theorem we have for t ≥ 0
    ϕ(t) = ∫_t^∞ ∫_s^∞ r^{−1} ν(dr) ds = ∫_t^∞ r^{−1}(r − t) ν(dr)
         = ∫_t^∞ (1 − t/r) ν(dr) = ∫_0^∞ (1 − t/r)^+ ν(dr)
Using ϕ(−t) = ϕ(t) to extend the formula to t ≤ 0, we have (∗). Setting t = 0 in (∗) shows ν has total mass 1.
If ϕ is piecewise linear, ν has a finite number of atoms and the result follows from Example 3.8 and (3.1g). To prove the general result, let νn be a sequence of measures on (0, ∞) with a finite number of atoms that converges weakly to ν (see Exercise 2.7) and let
    ϕn(t) = ∫_0^∞ (1 − |t|/s)^+ νn(ds)
Since s → (1 − |t|/s)^+ is bounded and continuous, ϕn(t) → ϕ(t) and the desired result follows from part (ii) of (3.4).
A classic application of Polya's criterion is:
Exercise 3.22. Show that exp(−|t|^α) is a characteristic function for 0 < α ≤ 1. (The case α = 1 corresponds to the Cauchy distribution.)
The next argument, which we learned from Frank Spitzer, proves that this is true for 0 < α ≤ 2. The case α = 2 corresponds to a normal distribution, so that case can be safely ignored in the proof.
Example 3.10. exp(−|t|^α) is a characteristic function for 0 < α < 2.
Proof  A little calculus shows that for any β and |x| < 1
    (1 − x)^β = Σ_{n=0}^{∞} (β choose n)(−x)^n    where (β choose n) = β(β − 1) · · · (β − n + 1)/(1 · 2 · · · n)
Let ψ(t) = 1 − (1 − cos t)^{α/2} = Σ_{n=1}^{∞} c_n (cos t)^n where
    c_n = (−1)^{n+1} (α/2 choose n)
c_n ≥ 0 (here we use α < 2), and Σ_{n=1}^{∞} c_n = 1 (take t = 0 in the definition of ψ). cos t is a characteristic function (see Example 3.1), so an easy extension of (3.1g) shows that ψ is a ch.f. We have 1 − cos t ∼ t^2/2 as t → 0, so
    1 − cos(t · 2^{1/2} · n^{−1/α}) ∼ n^{−2/α} t^2
Using (1.2) and (ii) of (3.4) now, it follows that
    exp(−|t|^α) = lim_{n→∞} {ψ(t · 2^{1/2} · n^{−1/α})}^n
is a ch.f.
Exercise 3.19 shows that exp(−|t|^α) is not a ch.f. when α > 2. A reason for interest in these characteristic functions is explained by the following generalization of Exercise 3.8.
Exercise 3.23. If X1, X2, . . . are independent and have characteristic function exp(−|t|^α) then (X1 + · · · + Xn)/n^{1/α} has the same distribution as X1.
We will return to this topic in Section 2.7. Polya's criterion can also be used to construct some "pathological examples."
Exercise 3.24. Let ϕ1 and ϕ2 be ch.f.'s. Show that A = {t : ϕ1(t) = ϕ2(t)} is closed, contains 0, and is symmetric about 0. Show that if A is a set with these properties and ϕ1(t) = e^{−|t|} there is a ϕ2 so that {t : ϕ1(t) = ϕ2(t)} = A.
Example 3.11. For some purposes, it is nice to have an explicit example of two ch.f.'s that agree on [−1, 1]. From Example 3.8, we know that ϕ(t) = (1 − |t|)^+ is the ch.f. of the density (1 − cos x)/πx^2. Define ψ(t) to be equal to ϕ on [−1, 1] and periodic with period 2, i.e., ψ(t) = ψ(t + 2). The Fourier series for ψ is
    ψ(u) = 1/2 + Σ_{n=−∞}^{∞} (2/(π^2 (2n − 1)^2)) exp(i(2n − 1)πu)
The right-hand side is the ch.f. of a discrete distribution with
    P(X = 0) = 1/2    and    P(X = (2n − 1)π) = 2π^{−2}(2n − 1)^{−2}, n ∈ Z.

Exercise 3.25. Find independent r.v.'s X, Y, and Z so that Y and Z do not
have the same distribution but X + Y and X + Z do.
Exercise 3.26. Show that if X and Y are independent and X + Y and X have
the same distribution then Y = 0 a.s.
For more curiosities, see Feller, Vol. II (1971), Section XV.2a.

*e. The Moment Problem
Suppose ∫ x^k dFn(x) has a limit µk for each k. Then the sequence of distributions is tight by (2.7), and every subsequential limit has the moments µk by Exercise 2.5, so we can conclude the sequence converges weakly if there is only one distribution with these moments. It is easy to see that this is true if F is concentrated on a finite interval [−M, M], since every continuous function can be approximated uniformly on [−M, M] by polynomials. The result is false in general.
Counterexample 1. Heyde (1963) Consider the lognormal density
    f0(x) = (2π)^{−1/2} x^{−1} exp(−(log x)^2/2),    x ≥ 0
and for −1 ≤ a ≤ 1 let
    fa(x) = f0(x){1 + a sin(2π log x)}
To see that fa is a density and has the same moments as f0, it suffices to show that
    ∫_0^∞ x^r f0(x) sin(2π log x) dx = 0    for r = 0, 1, 2, . . .
Changing variables x = exp(s + r), s = log x − r, ds = dx/x, the integral becomes
    ∫_{−∞}^{∞} (2π)^{−1/2} exp(rs + r^2) exp(−(s + r)^2/2) sin(2π(s + r)) ds
        = (2π)^{−1/2} exp(r^2/2) ∫_{−∞}^{∞} exp(−s^2/2) sin(2πs) ds = 0
The two equalities hold because r is an integer and the integrand is odd.
From the proof, it should be clear that we could let
    g(x) = f0(x) (1 + Σ_{k=1}^{∞} a_k sin(kπ log x))    if Σ_{k=1}^{∞} |a_k| ≤ 1
to get a large family of densities having the same moments as the lognormal.
The moments of the lognormal are easy to compute. Recall that if χ has the standard normal distribution, then Exercise 1.11 in Chapter 1 implies exp(χ) has the lognormal distribution.
    EX^n = E exp(nχ) = ∫ e^{nx} (2π)^{−1/2} e^{−x^2/2} dx
         = e^{n^2/2} ∫ (2π)^{−1/2} e^{−(x−n)^2/2} dx = exp(n^2/2)
since the last integrand is the density of the normal with mean n and variance 1. Somewhat remarkably, there is a family of discrete random variables with these moments. Let a > 0 and
    P(Ya = ae^k) = a^{−k} exp(−k^2/2)/ca    for k ∈ Z
where ca is chosen to make the total mass 1.
    exp(−n^2/2) EYa^n = exp(−n^2/2) Σ_k (ae^k)^n a^{−k} exp(−k^2/2)/ca
                      = Σ_k a^{−(k−n)} exp(−(k − n)^2/2)/ca = 1
by the definition of ca.
The lognormal density decays like exp(−(log x)2 /2) as x → ∞. The next
counterexample has more rapid decay. Since the exponential distribution, e−x
for x ≥ 0, is determined by its moments (see Exercise 3.28 below) we cannot
hope to do much better than this.
Counterexample 2. Let λ ∈ (0, 1) and for −1 ≤ a ≤ 1 let
    f_{a,λ}(x) = cλ exp(−|x|^λ){1 + a sin(β|x|^λ sgn(x))}
where β = tan(λπ/2) and 1/cλ = ∫ exp(−|x|^λ) dx. To prove that these are density functions and that for a fixed value of λ they have the same moments, it suffices to show
    ∫ x^n exp(−|x|^λ) sin(β|x|^λ sgn(x)) dx = 0    for n = 0, 1, 2, . . .
This is clear for even n since the integrand is odd. To prove the result for odd n, it suffices to integrate over [0, ∞). Using the identity
    ∫_0^∞ t^{p−1} e^{−qt} dt = Γ(p)/q^p    when Re q > 0
with p = (n + 1)/λ, q = 1 + βi, and changing variables t = x^λ, we get
    Γ((n + 1)/λ)/(1 + βi)^{(n+1)/λ} = ∫_0^∞ x^{λ((n+1)/λ − 1)} exp(−(1 + βi)x^λ) λx^{λ−1} dx
        = λ ∫_0^∞ x^n exp(−x^λ) cos(βx^λ) dx − iλ ∫_0^∞ x^n exp(−x^λ) sin(βx^λ) dx
Since β = tan(λπ/2),
    (1 + βi)^{(n+1)/λ} = (cos λπ/2)^{−(n+1)/λ} (exp(iλπ/2))^{(n+1)/λ}
The right-hand side is real since λ < 1 and (n + 1) is even, so
    ∫_0^∞ x^n exp(−x^λ) sin(βx^λ) dx = 0
A useful sufficient condition for a distribution to be determined by its moments is:
(3.11) Theorem. If lim sup_{k→∞} µ_{2k}^{1/2k}/2k = r < ∞ then there is at most one d.f. F with µk = ∫ x^k dF(x) for all positive integers k.
Remark. This is slightly stronger than Carleman's condition
    Σ_{k=1}^{∞} 1/µ_{2k}^{1/2k} = ∞
which is also sufficient for the conclusion of (3.11).
Proof  Let F be any d.f. with the moments µk and let νk = ∫ |x|^k dF(x). The Cauchy-Schwarz inequality implies ν_{2k+1}^2 ≤ µ_{2k} µ_{2k+2}, so
    lim sup_{k→∞} ν_k^{1/k}/k = r < ∞
k→∞ Taking x = tX in (3.6) and multiplying by eiθX , we have
n−1 eiθX eitX − (itX )m
m!
m=0 ≤ tX n
n! Taking expected values and using Exercise 3.14 gives
ϕ(θ + t) − ϕ(θ) − tϕ (θ) . . . − t n
tn−1
ϕ(n−1) (θ) ≤
νn
(n − 1)!
n! Using the last result, the fact that νk ≤ (r + )k k k for large k , and the trivial
bound ek ≥ k k /k ! (expand the lefthand side in its power series), we see that
for any θ
∞ (∗) ϕ(θ + t) = ϕ(θ ) + t m ( m)
ϕ (θ )
m!
m=1 for t < 1/er Let G be another distribution with the given moments and ψ its ch.f. Since
ϕ(0) = ψ (0) = 1, it follows from (∗) and induction that ϕ(t) = ψ (t) for t ≤
k/3r for all k , so the two ch.f.’s coincide and the distributions are equal. Section 2.3 Characteristic Functions
Combining (3.11) with the discussion that began our consideration of the moment problem gives:
(3.12) Theorem. Suppose ∫ x^k dFn(x) has a limit µk for each k and
    lim sup_{k→∞} µ_{2k}^{1/2k}/2k < ∞
then Fn converges weakly to the unique distribution with these moments.
Exercise 3.27. Let G(x) = P(|X| < x), λ = sup{x : G(x) < 1}, and νk = E|X|^k. Show that ν_k^{1/k} → λ, so (3.12) holds if λ < ∞.
Exercise 3.28. Suppose |X| has density Cx^α exp(−x^λ) on (0, ∞). Changing variables y = x^λ, dy = λx^{λ−1} dx gives
    E|X|^n = ∫_0^∞ (C/λ) y^{(n+α+1)/λ − 1} exp(−y) dy = (C/λ) Γ((n + α + 1)/λ)
Use the identity Γ(x + 1) = xΓ(x) for x ≥ 0 to conclude that (3.12) is satisfied for λ ≥ 1 but not for λ < 1. This shows the normal (λ = 2) and gamma (λ = 1) distributions are determined by their moments.
Our results so far have been for the so-called Hamburger moment problem. If we assume a priori that the distribution is concentrated on [0, ∞), we have the Stieltjes moment problem. There is a 1-1 correspondence between distributions of X ≥ 0 and symmetric distributions on R given by X → ξX^{1/2}, where ξ ∈ {−1, 1} is independent of X and takes its two values with equal probability. From this we see that
    lim sup_{k→∞} ν_k^{1/2k}/2k < ∞
is sufficient for there to be a unique distribution on [0, ∞) with the given moments νk. The next example shows that for nonnegative random variables, the last result is close to the best possible.
Counterexample 3. Let λ ∈ (0, 1/2), β = tan(λπ), −1 ≤ a ≤ 1 and
    fa(x) = cλ exp(−x^λ)(1 + a sin(βx^λ))    for x ≥ 0
where 1/cλ = ∫_0^∞ exp(−x^λ) dx. By imitating the calculations in Counterexample 2, it is easy to see that the fa are probability densities that have the same moments. This example seems to be due to Stoyanov (1987) p. 92–93. The special case λ = 1/4 is widely known.

2.4. Central Limit Theorems
We are now ready for the main business of the chapter. We will first prove the central limit theorem for

a. i.i.d. Sequences

(4.1) Theorem. Let X1, X2, . . . be i.i.d. with EXi = µ, var(Xi) = σ^2 ∈ (0, ∞). If Sn = X1 + · · · + Xn then
    (Sn − nµ)/σn^{1/2} ⇒ χ
where χ has the standard normal distribution.
Proof. By considering X_i′ = X_i − μ, it suffices to prove the result when μ = 0. From (3.8),

ϕ(t) = E exp(itX_1) = 1 − σ²t²/2 + o(t²)

so

E exp(itS_n/σn^{1/2}) = (1 − t²/2n + o(n^{−1}))^n

From (1.3) it should be clear that the last quantity → exp(−t²/2) as n → ∞, which with (3.4) completes the proof. However, (1.3) is a fact about real numbers, so we need to extend it to the complex case to complete the proof.
(4.2) Theorem. If c_n → c ∈ C then (1 + c_n/n)^n → e^c.
The proof is based on two simple facts:

(4.3) Lemma. Let z_1, . . . , z_n and w_1, . . . , w_n be complex numbers of modulus ≤ θ. Then

|∏_{m=1}^n z_m − ∏_{m=1}^n w_m| ≤ θ^{n−1} ∑_{m=1}^n |z_m − w_m|

Proof. The result is true for n = 1. To prove it for n > 1, observe that

|∏_{m=1}^n z_m − ∏_{m=1}^n w_m| ≤ |z_1 ∏_{m=2}^n z_m − z_1 ∏_{m=2}^n w_m| + |z_1 ∏_{m=2}^n w_m − w_1 ∏_{m=2}^n w_m|
≤ θ |∏_{m=2}^n z_m − ∏_{m=2}^n w_m| + θ^{n−1} |z_1 − w_1|

and use induction.
(4.4) Lemma. If b is a complex number with |b| ≤ 1 then |e^b − (1 + b)| ≤ |b|².

Proof. e^b − (1 + b) = b²/2! + b³/3! + b⁴/4! + . . . , so if |b| ≤ 1 then

|e^b − (1 + b)| ≤ (|b|²/2)(1 + 1/2 + 1/2² + . . .) = |b|²

Proof of (4.2). Let z_m = (1 + c_n/n), w_m = exp(c_n/n), and γ > |c|. For large n, |c_n| < γ and |c_n/n| ≤ 1, so it follows from (4.3) and (4.4) that as n → ∞

|(1 + c_n/n)^n − e^{c_n}| ≤ (e^{γ/n})^{n−1} n |c_n/n|² ≤ e^γ γ²/n → 0

To get a feel for what the central limit theorem says, we will look at some
concrete cases.
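A quick numerical illustration of (4.2) (a sketch; the O(1/n) rate is what the proof's bound e^γ γ²/n predicts):

```python
import cmath

c = 1.0 + 2.0j
for n in (10 ** 3, 10 ** 5):
    c_n = c + 1.0 / n                      # any sequence with c_n -> c works
    err = abs((1 + c_n / n) ** n - cmath.exp(c))
    # the proof's bound is on the order of e^gamma * gamma^2 / n, gamma > |c|
    assert err < 10 * abs(cmath.exp(c)) * abs(c) ** 2 / n
```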
Example 4.1. Roulette. A roulette wheel has slots numbered 1–36 (18 red
and 18 black) and two slots numbered 0 and 00 that are painted green. Players
can bet $1 that the ball will land in a red (or black) slot and win $1 if it does.
If we let Xi be the winnings on the ith play then X1 , X2 , . . . are i.i.d. with
P (Xi = 1) = 18/38 and P (Xi = −1) = 20/38.
EX_i = −1/19 and var(X_i) = EX_i² − (EX_i)² = 1 − (1/19)² = .9972

We are interested in

P(S_n ≥ 0) = P((S_n − nμ)/σ√n ≥ −nμ/σ√n)

Taking n = 361 = 19² and replacing σ by 1 to keep the computations simple,

−nμ/σ√n = 361 · (1/19)/√361 = 1

So the central limit theorem and our table of the normal distribution in the back of the book tell us that

P(S_n ≥ 0) ≈ P(χ ≥ 1) = 1 − .8413 = .1587
In words, after 361 spins of the roulette wheel the casino will have won $19 of
your money on the average, but there is a probability of about 0.16 that you
will be ahead.
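The normal approximation can be checked exactly, since S_n ≥ 0 means winning at least 181 of the 361 independent plays. A small sketch, assuming nothing beyond the example's setup:

```python
import math

n, p = 361, 18 / 38
# S_n >= 0 iff the number of wins W >= 181, where W ~ Binomial(361, 18/38)
exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(181, n + 1))
assert abs(exact - 0.1587) < 0.01   # close to the CLT answer P(chi >= 1) = .1587
```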
Example 4.2. Coin ﬂips. Let X1 , X2 , . . . be i.i.d. with P (Xi = 0) = P (Xi =
1) = 1/2. If Xi = 1 indicates that heads occurred on the ith toss then
Sn = X1 + · · · + Xn is the total number of heads at time n.
EX_i = 1/2 and var(X_i) = EX_i² − (EX_i)² = 1/2 − 1/4 = 1/4

So the central limit theorem tells us (S_n − n/2)/√(n/4) ⇒ χ. Our table of the normal distribution tells us that

P(χ > 2) = 1 − .9773 = .0227

so P(|χ| ≤ 2) = 1 − 2(.0227) = .9546, or plugging into the central limit theorem

.95 ≈ P((S_n − n/2)/√(n/4) ∈ [−2, 2]) = P(S_n − n/2 ∈ [−√n, √n])

Taking n = 10,000 this says that 95% of the time the number of heads will be between 4900 and 5100.
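Summing the exact binomial probabilities confirms the 95% figure (a sketch; the pmf is computed in log space to avoid enormous factorials):

```python
import math

def binom_half_pmf(n, k):
    # Binomial(n, 1/2) pmf via log-gamma
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1) - n * math.log(2))

n = 10_000
prob = sum(binom_half_pmf(n, k) for k in range(4900, 5101))
assert 0.95 < prob < 0.96
```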
Example 4.3. Normal approximation to the binomial. Let X_1, X_2, . . . and S_n be as in the previous example. To estimate P(S_16 = 8) using the central limit theorem, we regard 8 as the interval [7.5, 8.5]. Since μ = 1/2 and σ√n = 2 for n = 16,

P(|S_16 − 8| ≤ .5) = P(|S_n − nμ|/σ√n ≤ .25) ≈ P(|χ| ≤ .25) = 2(.5987 − .5) = .1974

Even though n is small, this agrees well with the exact probability

(16 choose 8) 2^{−16} = (13 · 11 · 10 · 9)/65,536 = .1964

The computations above motivate the histogram correction, which is important in using the normal approximation for small n. For example, if we are going to approximate P(S_16 ≤ 11), then we regard this probability as P(S_16 ≤ 11.5). One obvious reason for doing this is to get the same answer if we regard P(S_16 ≤ 11) = 1 − P(S_16 ≥ 12).
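The two numbers in the example are easy to reproduce (a sketch; Φ is evaluated via the error function):

```python
import math

Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal d.f.

exact = math.comb(16, 8) / 2 ** 16        # = 12870/65536
approx = 2 * (Phi(0.25) - 0.5)            # P(|chi| <= .25)
assert abs(exact - 0.1964) < 5e-5
assert abs(approx - 0.1974) < 5e-5
assert abs(exact - approx) < 2e-3         # the agreement claimed in the text
```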
Exercise 4.1. Suppose you roll a die 180 times. Use the normal approximation (with the histogram correction) to estimate the probability that you will get fewer than 25 sixes.
Example 4.4. Normal approximation to the Poisson. Let Zλ have a
Poisson distribution with mean λ. If X1 , X2 , . . . are independent and have
Poisson distributions with mean 1, then Sn = X1 + · · · + Xn has a Poisson
distribution with mean n. Since var(X_i) = 1, the central limit theorem implies

(S_n − n)/n^{1/2} ⇒ χ as n → ∞

To deal with values of λ that are not integers, let N_1, N_2, N_3 be independent Poisson with means [λ], λ − [λ], and [λ] + 1 − λ. If we let S_{[λ]} = N_1, Z_λ = N_1 + N_2, and S_{[λ]+1} = N_1 + N_2 + N_3, then S_{[λ]} ≤ Z_λ ≤ S_{[λ]+1}, and using the limit theorem for the S_n it follows that

(Z_λ − λ)/λ^{1/2} ⇒ χ as λ → ∞

Example 4.5. Pairwise independence is good enough for the strong law of
large numbers (see (7.1) in Chapter 1). It is not good enough for the central limit theorem. Let ξ_1, ξ_2, . . . be i.i.d. with P(ξ_i = 1) = P(ξ_i = −1) = 1/2. We will arrange things so that for n ≥ 1

S_{2^n} = ξ_1(1 + ξ_2) · · · (1 + ξ_{n+1}) = ±2^n with prob 2^{−n−1} each, and 0 with prob 1 − 2^{−n}

To do this we let X_1 = ξ_1, X_2 = ξ_1ξ_2, and for m = 2^{n−1} + j, 0 < j ≤ 2^{n−1}, n ≥ 2, let X_m = X_j ξ_{n+1}. Each X_m is a product of a different set of ξ_j's, so they are pairwise independent.
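The construction can be verified exhaustively for n = 3, so 2^n = 8 variables built from ξ_1, . . . , ξ_4. The sketch below checks the product identity for S_8 and that every pair (X_i, X_j) is uniform on {−1, 1}², i.e. pairwise independent, even though the X_m are far from jointly independent:

```python
from itertools import product

def build_X(xi):
    # xi = (xi_1, ..., xi_4); X_1 = xi_1, X_2 = xi_1*xi_2, and for
    # m = 2**(n-1) + j with 0 < j <= 2**(n-1): X_m = X_j * xi_{n+1}
    X = [xi[0], xi[0] * xi[1]]
    for level in (2, 3):                    # n = 2, then n = 3
        X += [x * xi[level] for x in X]
    return X                                # (X_1, ..., X_8)

all_X = []
for xi in product((-1, 1), repeat=4):
    X = build_X(xi)
    # S_{2^n} = xi_1 (1 + xi_2)(1 + xi_3)(1 + xi_4), exactly
    assert sum(X) == xi[0] * (1 + xi[1]) * (1 + xi[2]) * (1 + xi[3])
    all_X.append(X)

# pairwise independence: every pair (X_i, X_j) hits each of the four sign
# patterns in exactly 4 of the 16 equally likely outcomes
for i in range(8):
    for j in range(i + 1, 8):
        counts = {}
        for X in all_X:
            counts[(X[i], X[j])] = counts.get((X[i], X[j]), 0) + 1
        assert sorted(counts.values()) == [4, 4, 4, 4]
```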
Exercises
4.2. Let X_1, X_2, . . . be i.i.d. with EX_i = 0, 0 < var(X_i) < ∞, and let S_n = X_1 + · · · + X_n. (a) Use the central limit theorem and Kolmogorov's zero-one law to conclude that lim sup S_n/√n = ∞ a.s. (b) Use an argument by contradiction to show that S_n/√n does not converge in probability. Hint: Consider n = m!.
4.3. Let X_1, X_2, . . . be i.i.d. and let S_n = X_1 + · · · + X_n. Assume that S_n/√n ⇒ a limit and conclude that EX_i² < ∞. Sketch: Suppose EX_i² = ∞. Let X_1′, X_2′, . . . be an independent copy of the original sequence. Let Y_i = X_i − X_i′, U_i = Y_i 1_{(|Y_i| ≤ A)}, V_i = Y_i 1_{(|Y_i| > A)}, and observe that for any K

P(∑_{m=1}^n Y_m ≥ K√n) ≥ P(∑_{m=1}^n U_m ≥ K√n, ∑_{m=1}^n V_m ≥ 0) ≥ (1/2) P(∑_{m=1}^n U_m ≥ K√n) ≥ 1/5
for large n if A is large enough. Since K is arbitrary, this is a contradiction.
4.4. Let X_1, X_2, . . . be i.i.d. with X_i ≥ 0, EX_i = 1, and var(X_i) = σ² ∈ (0, ∞). Show that 2(√S_n − √n) ⇒ σχ.

4.5. Self-normalized sums. Let X_1, X_2, . . . be i.i.d. with EX_i = 0 and EX_i² = σ² ∈ (0, ∞). Then

∑_{m=1}^n X_m / (∑_{m=1}^n X_m²)^{1/2} ⇒ χ

4.6. Random index central limit theorem. Let X_1, X_2, . . . be i.i.d. with EX_i = 0 and EX_i² = σ² ∈ (0, ∞), and let S_n = X_1 + · · · + X_n. Let N_n be a sequence of nonnegative integer-valued random variables and a_n a sequence of integers with a_n → ∞ and N_n/a_n → 1 in probability. Show that

S_{N_n}/σ√a_n ⇒ χ

Hint: Use Kolmogorov's inequality ((8.2) in Chapter 1) to conclude that if Y_n = S_{N_n}/σ√a_n and Z_n = S_{a_n}/σ√a_n, then Y_n − Z_n → 0 in probability.
4.7. A central limit theorem in renewal theory. Let Y_1, Y_2, . . . be i.i.d. positive random variables with EY_i = μ and var(Y_i) = σ² ∈ (0, ∞). Let S_n = Y_1 + · · · + Y_n and N_t = sup{m : S_m ≤ t}. Apply the previous exercise to X_i = Y_i − μ to prove that as t → ∞

(μN_t − t)/(σ²t/μ)^{1/2} ⇒ χ

4.8. A second proof of the renewal CLT. Let Y_1, Y_2, . . ., S_n, and N_t be as in the last exercise. Let u = [t/μ] and D_t = S_u − t. Use Kolmogorov's inequality to show

P(|S_{u+m} − (S_u + mμ)| > t^{2/5} for some m ∈ [−t^{3/5}, t^{3/5}]) → 0 as t → ∞

Conclude |N_t − (t − D_t)/μ|/t^{1/2} → 0 in probability and then obtain the result in the previous exercise.
Our next step is to generalize the central limit theorem to:

b. Triangular Arrays

(4.5) The Lindeberg-Feller theorem. For each n, let X_{n,m}, 1 ≤ m ≤ n, be independent random variables with EX_{n,m} = 0. Suppose

(i) ∑_{m=1}^n EX_{n,m}² → σ² > 0

(ii) for all ε > 0, lim_{n→∞} ∑_{m=1}^n E(|X_{n,m}|²; |X_{n,m}| > ε) = 0.

Then S_n = X_{n,1} + · · · + X_{n,n} ⇒ σχ as n → ∞.
Remarks. In words, the theorem says that a sum of a large number of small independent effects has approximately a normal distribution. To see that (4.5) contains our first central limit theorem, let Y_1, Y_2, . . . be i.i.d. with EY_i = 0 and EY_i² = σ² ∈ (0, ∞), and let X_{n,m} = Y_m/n^{1/2}. Then ∑_{m=1}^n EX_{n,m}² = σ², and if ε > 0

∑_{m=1}^n E(|X_{n,m}|²; |X_{n,m}| > ε) = n E(|Y_1/n^{1/2}|²; |Y_1/n^{1/2}| > ε) = E(|Y_1|²; |Y_1| > ε n^{1/2}) → 0

by the dominated convergence theorem since EY_1² < ∞.
Proof. Let ϕ_{n,m}(t) = E exp(itX_{n,m}) and σ²_{n,m} = EX_{n,m}². By (3.4), it suffices to show that

∏_{m=1}^n ϕ_{n,m}(t) → exp(−t²σ²/2)

Let z_{n,m} = ϕ_{n,m}(t) and w_{n,m} = 1 − t²σ²_{n,m}/2. By (3.7),

|z_{n,m} − w_{n,m}| ≤ E(|tX_{n,m}|³/3! ∧ 2|tX_{n,m}|²/2!)
≤ E(|tX_{n,m}|³/6; |X_{n,m}| ≤ ε) + E(|tX_{n,m}|²; |X_{n,m}| > ε)
≤ (ε|t|³/6) E(|X_{n,m}|²; |X_{n,m}| ≤ ε) + t² E(|X_{n,m}|²; |X_{n,m}| > ε)

Summing from m = 1 to n, letting n → ∞, and using (i) and (ii) gives

lim sup_{n→∞} ∑_{m=1}^n |z_{n,m} − w_{n,m}| ≤ ε|t|³σ²/6

Since ε > 0 is arbitrary, it follows that the sum converges to 0. Our next step is to use (4.3) with θ = 1 to get

|∏_{m=1}^n ϕ_{n,m}(t) − ∏_{m=1}^n (1 − t²σ²_{n,m}/2)| → 0

To check the hypotheses of (4.3), note that since ϕ_{n,m} is a ch.f., |ϕ_{n,m}(t)| ≤ 1 for all n, m. For the terms in the second product, we note that

σ²_{n,m} ≤ ε² + E(|X_{n,m}|²; |X_{n,m}| > ε)

and ε is arbitrary, so (ii) implies sup_m σ²_{n,m} → 0 and thus, if n is large, 1 ≥ 1 − t²σ²_{n,m}/2 > −1 for all m.

To complete the proof now, we apply Exercise 1.1 with c_{n,m} = −t²σ²_{n,m}/2. We have just shown sup_m σ²_{n,m} → 0, and (i) implies

∑_{m=1}^n c_{n,m} → −σ²t²/2

so ∏_{m=1}^n (1 − t²σ²_{n,m}/2) → exp(−t²σ²/2) and the proof is complete.

Example 4.6. Cycles in a random permutation and record values.
Continuing the analysis of Examples 5.4 and 6.2 in Chapter 1, let Y_1, Y_2, . . . be independent with P(Y_m = 1) = 1/m and P(Y_m = 0) = 1 − 1/m. Then EY_m = 1/m and var(Y_m) = 1/m − 1/m². So if S_n = Y_1 + · · · + Y_n, then ES_n ∼ log n and var(S_n) ∼ log n. Let

X_{n,m} = (Y_m − 1/m)/(log n)^{1/2}

Then EX_{n,m} = 0, ∑_{m=1}^n EX_{n,m}² → 1, and for any ε > 0

∑_{m=1}^n E(|X_{n,m}|²; |X_{n,m}| > ε) → 0

since the sum is 0 as soon as (log n)^{−1/2} < ε. Applying (4.5) now gives

(log n)^{−1/2} (S_n − ∑_{m=1}^n 1/m) ⇒ χ

Observing that

∑_{m=1}^{n−1} 1/m ≥ ∫_1^n x^{−1} dx = log n ≥ ∑_{m=2}^n 1/m

shows |log n − ∑_{m=1}^n 1/m| ≤ 1, and the conclusion can be written as

(S_n − log n)/(log n)^{1/2} ⇒ χ
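The bracketing used in the last step is easy to confirm numerically; in fact the difference ∑ 1/m − log n converges to Euler's constant ≈ .5772 (a sketch):

```python
import math

n = 1_000_000
H = sum(1.0 / m for m in range(1, n + 1))        # ES_n = sum of 1/m
assert 0 <= H - math.log(n) <= 1                 # the inequality used above
assert abs(H - math.log(n) - 0.5772156649) < 1e-5  # in fact -> Euler's constant
```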
Example 4.7. The converse of the three series theorem. Recall the set-up of (8.4) in Chapter 1. Let X_1, X_2, . . . be independent, let A > 0, and let Y_m = X_m 1_{(|X_m| ≤ A)}. In order that ∑_{n=1}^∞ X_n converges (i.e., lim_{N→∞} ∑_{n=1}^N X_n exists), it is necessary that:

(i) ∑_{n=1}^∞ P(|X_n| > A) < ∞, (ii) ∑_{n=1}^∞ EY_n converges, and (iii) ∑_{n=1}^∞ var(Y_n) < ∞

Proof. The necessity of the first condition is clear, for if that sum is infinite, P(|X_n| > A i.o.) > 0 and lim_{n→∞} ∑_{m=1}^n X_m cannot exist. Suppose next that the sum in (i) is finite but the sum in (iii) is infinite. Let

c_n = ∑_{m=1}^n var(Y_m) and X_{n,m} = (Y_m − EY_m)/c_n^{1/2}

Then EX_{n,m} = 0, ∑_{m=1}^n EX_{n,m}² = 1, and for any ε > 0

∑_{m=1}^n E(|X_{n,m}|²; |X_{n,m}| > ε) → 0

since the sum is 0 as soon as 2A/c_n^{1/2} < ε. Applying (4.5) now gives that if S_n = X_{n,1} + · · · + X_{n,n}, then S_n ⇒ χ. Now

(i) if lim_{n→∞} ∑_{m=1}^n X_m exists, then lim_{n→∞} ∑_{m=1}^n Y_m exists, and

(ii) if we let T_n = (∑_{m≤n} Y_m)/c_n^{1/2}, then T_n ⇒ 0.

The last two results and Exercise 2.10 imply (S_n − T_n) ⇒ χ. Since

S_n − T_n = −∑_{m≤n} EY_m / c_n^{1/2}

is not random, this is absurd.

Finally, assume the series in (i) and (iii) are finite. (8.3) in Chapter 1 implies that lim_{n→∞} ∑_{m=1}^n (Y_m − EY_m) exists, so if lim_{n→∞} ∑_{m=1}^n X_m and hence lim_{n→∞} ∑_{m=1}^n Y_m does, taking differences shows that (ii) holds.
Example 4.8. Infinite variance. Suppose X_1, X_2, . . . are i.i.d. with P(X_1 > x) = P(X_1 < −x) and P(|X_1| > x) = x^{−2} for x ≥ 1. Then

E|X_1|² = ∫_0^∞ 2x P(|X_1| > x) dx = ∞

but it turns out that when S_n = X_1 + · · · + X_n is suitably normalized it converges to a normal distribution. Let

Y_{n,m} = X_m 1_{(|X_m| ≤ n^{1/2} log log n)}

The truncation level c_n = n^{1/2} log log n is chosen large enough to make

∑_{m=1}^n P(Y_{n,m} ≠ X_m) ≤ n P(|X_1| > c_n) → 0

However, we want the variance of Y_{n,m} to be as small as possible, so we keep the truncation close to the lowest possible level.

Our next step is to show EY_{n,m}² ∼ log n. For this we need upper and lower bounds. Since P(|Y_{n,m}| > x) ≤ P(|X_1| > x) and is 0 for x > c_n, we have

EY_{n,m}² ≤ ∫_0^{c_n} 2y P(|X_1| > y) dy = 1 + ∫_1^{c_n} 2/y dy = 1 + 2 log c_n = 1 + log n + 2 log log log n ∼ log n

In the other direction, we observe P(|Y_{n,m}| > x) = P(|X_1| > x) − P(|X_1| > c_n), and the right-hand side is ≥ (1 − (log log n)^{−2}) P(|X_1| > x) when x ≤ √n, so

EY_{n,m}² ≥ (1 − (log log n)^{−2}) ∫_1^{√n} 2/y dy ∼ log n

If S̄_n = Y_{n,1} + · · · + Y_{n,n}, then var(S̄_n) ∼ n log n, so we apply (4.5) to X_{n,m} = Y_{n,m}/(n log n)^{1/2}. Things have been arranged so that (i) is satisfied. Since |Y_{n,m}| ≤ n^{1/2} log log n, the sum in (ii) is 0 for large n, and it follows that S̄_n/(n log n)^{1/2} ⇒ χ. Since the choice of c_n guarantees P(S_n ≠ S̄_n) → 0, the same result holds for S_n.
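The asymptotics EY_{n,m}² ∼ log n can be seen numerically from the upper bound 1 + 2 log c_n computed above, though the log log log n term makes the convergence glacial (a sketch):

```python
import math

def truncated_bound(n):
    # the upper bound from the text: 1 + 2 log c_n with c_n = sqrt(n) log log n
    c_n = math.sqrt(n) * math.log(math.log(n))
    return 1 + 2 * math.log(c_n)

ratios = [truncated_bound(10 ** e) / math.log(10 ** e) for e in (4, 8, 16, 32)]
assert all(b < a for a, b in zip(ratios, ratios[1:]))   # decreasing toward 1
assert 1.0 < ratios[-1] < 1.2                           # still ~1.05 at n = 1e32
```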
Remark. In Section 2.6, we will see that if we replace P(|X_1| > x) = x^{−2} in Example 4.8 by P(|X_1| > x) = x^{−α} where 0 < α < 2, then S_n/n^{1/α} converges to a limit which is not χ. The last word on convergence to the normal distribution is the next result, due to Lévy.

(4.6) Theorem. Let X_1, X_2, . . . be i.i.d. and S_n = X_1 + · · · + X_n. In order that there exist constants a_n and b_n > 0 so that (S_n − a_n)/b_n ⇒ χ, it is necessary and sufficient that as y → ∞

y² P(|X_1| > y)/E(|X_1|²; |X_1| ≤ y) → 0.
A proof can be found in Gnedenko and Kolmogorov (1954), a reference that
contains the last word on many results about sums of independent random
variables.
Exercises
In the next ﬁve problems X1 , X2 , . . . are independent and Sn = X1 + · · · + Xn .
4.9. Suppose P(X_m = m) = P(X_m = −m) = m^{−2}/2 and

P(X_m = 1) = P(X_m = −1) = (1 − m^{−2})/2

Show that var(S_n)/n → 2 but S_n/√n ⇒ χ. The trouble here is that X_{n,m} = X_m/√n does not satisfy (ii) of Theorem (4.5).

4.10. Show that if |X_i| ≤ M and ∑_n var(X_n) = ∞ then

(S_n − ES_n)/√(var(S_n)) ⇒ χ

4.11. Suppose EX_i = 0, EX_i² = 1, and E|X_i|^{2+δ} ≤ C for some 0 < δ, C < ∞. Show that S_n/√n ⇒ χ.

4.12. Prove Lyapunov's Theorem. Let α_n = {var(S_n)}^{1/2}. If there is a δ > 0 so that

lim_{n→∞} α_n^{−(2+δ)} ∑_{m=1}^n E(|X_m − EX_m|^{2+δ}) = 0

then (S_n − ES_n)/α_n ⇒ χ. Note that the previous exercise is a special case of this result.

4.13. Suppose P(X_j = j) = P(X_j = −j) = 1/(2j^β) and P(X_j = 0) = 1 − j^{−β}, where β > 0. Show that (i) if β > 1 then S_n → S_∞ a.s.; (ii) if β < 1 then S_n/n^{(3−β)/2} ⇒ cχ; (iii) if β = 1 then S_n/n ⇒ ℵ, where

E exp(itℵ) = exp(−∫_0^1 x^{−1}(1 − cos xt) dx)

*c. Prime Divisors (Erdös-Kac)
Our aim here is to prove that an integer picked at random from {1, 2, . . . , n}
has about
log log n + χ(log log n)^{1/2}
prime divisors. Since exp(e^4) = 5.15 × 10^23, this result does not apply to most
numbers we encounter in “everyday life.” The ﬁrst step in deriving this result
is to give a
Second proof of (4.5). The first step is to let

h_n(ε) = ∑_{m=1}^n E(|X_{n,m}|²; |X_{n,m}| > ε)

and observe:

(4.7) Lemma. h_n(ε) → 0 for each fixed ε > 0, so we can pick ε_n → 0 so that h_n(ε_n) → 0.

Proof. Let N_m be chosen so that h_n(1/m) ≤ 1/m for n ≥ N_m and m → N_m is increasing. Let ε_n = 1/m for N_m ≤ n < N_{m+1}, and ε_n = 1 for n < N_1. When N_m ≤ n < N_{m+1}, ε_n = 1/m, so h_n(ε_n) = h_n(1/m) ≤ 1/m and the desired result follows.
Let X̄_{n,m} = X_{n,m} 1_{(|X_{n,m}| > ε_n)}, Y_{n,m} = X_{n,m} 1_{(|X_{n,m}| ≤ ε_n)}, and Z_{n,m} = Y_{n,m} − EY_{n,m}. Clearly |Z_{n,m}| ≤ 2ε_n. Using X_{n,m} = X̄_{n,m} + Y_{n,m}, Z_{n,m} = Y_{n,m} − EY_{n,m}, EY_{n,m} = −EX̄_{n,m}, the fact that the variance of a sum of independent variables is the sum of the variances, and var(W) ≤ EW², we have

E(∑_{m=1}^n X_{n,m} − ∑_{m=1}^n Z_{n,m})² = E(∑_{m=1}^n X̄_{n,m} − EX̄_{n,m})²
= ∑_{m=1}^n E(X̄_{n,m} − EX̄_{n,m})² ≤ ∑_{m=1}^n E(X̄_{n,m})² → 0

as n → ∞, by the choice of ε_n.
Let S_n = ∑_{m=1}^n X_{n,m} and T_n = ∑_{m=1}^n Z_{n,m}. The last computation shows S_n − T_n → 0 in L², and hence in probability by (5.3) in Chapter 1. Thus, by Exercise 2.10, it suffices to show T_n ⇒ σχ. (i) implies ES_n² → σ². We have just shown that E(S_n − T_n)² → 0, so the triangle inequality for the L² norm implies ET_n² → σ². To compute higher moments, we observe

T_n^r = ∑_{k=1}^r ∑_{r_i} (r!/(r_1! · · · r_k!)) (1/k!) ∑_{i_j} Z_{n,i_1}^{r_1} · · · Z_{n,i_k}^{r_k}

where ∑_{r_i} extends over all k-tuples of positive integers with r_1 + · · · + r_k = r and ∑_{i_j} extends over all k-tuples of distinct integers with 1 ≤ i_j ≤ n. If we let

A_n(r_1, . . . , r_k) = ∑_{i_j} EZ_{n,i_1}^{r_1} · · · EZ_{n,i_k}^{r_k}

then

ET_n^r = ∑_{k=1}^r ∑_{r_i} (r!/(r_1! · · · r_k!)) (1/k!) A_n(r_1, . . . , r_k)

To evaluate the limit of ET_n^r we observe:

(a) If some r_j = 1, then A_n(r_1, . . . , r_k) = 0 since EZ_{n,i_j} = 0.
(b) If all r_j = 2, then

∑_{i_j} EZ_{n,i_1}² · · · EZ_{n,i_k}² ≤ (∑_{m=1}^n EZ_{n,m}²)^k → σ^{2k}

To argue the other inequality, we note that for any 1 ≤ a < b ≤ k we can estimate the sum over all the i_1, . . . , i_k with i_a = i_b by replacing EZ_{n,i_a}² by (2ε_n)², to get (the factor (k choose 2) giving the number of ways to pick 1 ≤ a < b ≤ k)

(∑_{m=1}^n EZ_{n,m}²)^k − ∑_{i_j} EZ_{n,i_1}² · · · EZ_{n,i_k}² ≤ (k choose 2) (2ε_n)² (∑_{m=1}^n EZ_{n,m}²)^{k−1} → 0

(c) If all the r_i ≥ 2 but some r_j > 2, then using

E|Z_{n,i_j}|^{r_j} ≤ (2ε_n)^{r_j − 2} EZ_{n,i_j}²

we have

A_n(r_1, . . . , r_k) ≤ ∑_{i_j} E|Z_{n,i_1}|^{r_1} · · · E|Z_{n,i_k}|^{r_k} ≤ (2ε_n)^{r−2k} A_n(2, . . . , 2) → 0

When r is odd, some r_j must be = 1 or ≥ 3, so ET_n^r → 0 by (a) and (c). If r = 2k is even, (a)–(c) imply

ET_n^r → σ^{2k} (2k)!/(2^k k!) = E(σχ)^r

and the result follows from (3.12).
Turning to the result for prime divisors, let P_n denote the uniform distribution on {1, . . . , n}. If P_∞(A) ≡ lim P_n(A) exists, the limit is called the density of A ⊂ Z. Let A_p be the set of integers divisible by p. Clearly, if p is a prime, P_∞(A_p) = 1/p, and if q ≠ p is another prime,

P_∞(A_p ∩ A_q) = 1/pq = P_∞(A_p) P_∞(A_q)

Even though P_∞ is not a probability measure (since P_∞({i}) = 0 for all i), we can interpret this as saying that the events of being divisible by p and q are independent. Let δ_p(n) = 1 if n is divisible by p, and = 0 otherwise, and let

g(n) = ∑_{p≤n} δ_p(n) be the number of prime divisors of n

this and future sums on p being over the primes. Intuitively, the δ_p(n) behave like X_p that are i.i.d. with

P(X_p = 1) = 1/p and P(X_p = 0) = 1 − 1/p

The mean and variance of ∑_{p≤n} X_p are ∑_{p≤n} 1/p and ∑_{p≤n} (1/p)(1 − 1/p), respectively. It is known that

(∗) ∑_{p≤n} 1/p = log log n + O(1)

(see Hardy and Wright (1959), Chapter XXII), while anyone can see ∑_p 1/p² < ∞, so applying (4.5) to the X_p and making a small leap of faith gives us:
(4.8) Erdös-Kac central limit theorem. As n → ∞,

P_n(m ≤ n : g(m) − log log n ≤ x(log log n)^{1/2}) → P(χ ≤ x)
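Both ingredients of the centering are easy to probe numerically. By (∗), the mean of g(m) over m ≤ n tracks ∑_{p≤n} 1/p = log log n + O(1); the O(1) term in fact converges to the Meissel-Mertens constant ≈ 0.2615. A sketch using a simple sieve:

```python
import math

n = 10 ** 6
sieve = bytearray([1]) * (n + 1)     # Sieve of Eratosthenes
sieve[0:2] = b"\x00\x00"
for i in range(2, int(n ** 0.5) + 1):
    if sieve[i]:
        sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))

# (*): sum of 1/p over primes p <= n is log log n + O(1)
recip = sum(1.0 / p for p in range(2, n + 1) if sieve[p])
assert abs(recip - math.log(math.log(n)) - 0.2615) < 0.01  # Meissel-Mertens

# mean of g(m), the number of distinct prime divisors of m <= n
omega = bytearray(n + 1)
for p in range(2, n + 1):
    if sieve[p]:
        for m in range(p, n + 1, p):
            omega[m] += 1
mean_g = sum(omega[1:]) / n
# mean_g = (1/n) sum_p [n/p], which differs from sum_p 1/p by at most pi(n)/n
assert abs(mean_g - recip) < 0.1
```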
Proof. We begin by showing that we can ignore the primes "near" n. Let

α_n = n^{1/ log log n}

so that log α_n = log n/log log n and log log α_n = log log n − log log log n. The sequence α_n has two nice properties:

(a) ∑_{α_n < p ≤ n} 1/p / (log log n)^{1/2} → 0 by (∗)

Proof of (a). By (∗),

∑_{α_n < p ≤ n} 1/p = ∑_{p≤n} 1/p − ∑_{p≤α_n} 1/p = log log n − log log α_n + O(1) = log log log n + O(1)
(b) If ε > 0, then α_n ≤ n^ε for large n, and hence α_n^r/n → 0 for all r < ∞.

Proof of (b). This holds since 1/log log n → 0 as n → ∞.

Let g_n(m) = ∑_{p≤α_n} δ_p(m) and let E_n denote expected value with respect to P_n. Then

E_n(∑_{α_n<p≤n} δ_p) = ∑_{α_n<p≤n} P_n(m : δ_p(m) = 1) ≤ ∑_{α_n<p≤n} 1/p
so by (a) it is enough to prove the result for g_n. Let

S_n = ∑_{p≤α_n} X_p

where the X_p are the independent random variables introduced above. Let b_n = ES_n and a_n² = var(S_n). (a) tells us that b_n and a_n² are both

log log n + o((log log n)^{1/2})

so it suffices to show

P_n(m : g_n(m) − b_n ≤ x a_n) → P(χ ≤ x)

An application of (4.5) shows (S_n − b_n)/a_n ⇒ χ, and since |X_p| ≤ 1 it follows from the second proof of (4.5) that

E((S_n − b_n)/a_n)^r → Eχ^r for all r

Using notation from that proof (and replacing i_j by p_j),

ES_n^r = ∑_{k=1}^r ∑_{r_i} (r!/(r_1! · · · r_k!)) (1/k!) ∑_{p_j} E(X_{p_1}^{r_1} · · · X_{p_k}^{r_k})

Since X_p ∈ {0, 1}, the summand is

E(X_{p_1} · · · X_{p_k}) = 1/(p_1 · · · p_k)

A little thought reveals that

E_n(δ_{p_1} · · · δ_{p_k}) = (1/n)[n/(p_1 · · · p_k)]

The two moments differ by ≤ 1/n, so

|E(S_n^r) − E_n(g_n^r)| ≤ ∑_{k=1}^r ∑_{r_i} (r!/(r_1! · · · r_k!)) (1/k!) ∑_{p_j} 1/n ≤ (1/n)(∑_{p≤α_n} 1)^r ≤ α_n^r/n → 0

by (b). Now
E(S_n − b_n)^r = ∑_{m=0}^r (r choose m) E(S_n^m)(−b_n)^{r−m}

E_n(g_n − b_n)^r = ∑_{m=0}^r (r choose m) E_n(g_n^m)(−b_n)^{r−m}

so subtracting and using our bound on |E(S_n^m) − E_n(g_n^m)|,

|E(S_n − b_n)^r − E_n(g_n − b_n)^r| ≤ ∑_{m=0}^r (r choose m) (1/n) α_n^m b_n^{r−m} = (α_n + b_n)^r/n → 0

since b_n ≤ α_n. This is more than enough to conclude that

E_n((g_n − b_n)/a_n)^r → Eχ^r

and the desired result follows from (3.12).

*d. Rates of Convergence (Berry-Esseen)
(4.9) Theorem. Let X_1, X_2, . . . be i.i.d. with EX_i = 0, EX_i² = σ², and E|X_i|³ = ρ < ∞. If F_n(x) is the distribution of (X_1 + · · · + X_n)/σ√n and N(x) is the standard normal distribution, then

|F_n(x) − N(x)| ≤ 3ρ/σ³√n

Remarks. The reader should note that the inequality holds for all n and x, but since ρ ≥ σ³ it only has nontrivial content for n ≥ 10. It is easy to see that the rate cannot be faster than n^{−1/2}. When P(X_i = 1) = P(X_i = −1) = 1/2, symmetry and (1.4) imply

F_{2n}(0) = (1/2){1 + P(S_{2n} = 0)} = (1/2)(1 + (πn)^{−1/2}) + o(n^{−1/2})
The constant 3 is not the best known (van Beek (1972) gets 0.8), but as Feller brags, "our streamlined method yields a remarkably good bound even though it avoids the usual messy numerical calculations." The hypothesis E|X|³ < ∞ is needed to get the rate n^{−1/2}. Heyde (1967) has shown that for 0 < δ < 1

∑_{n=1}^∞ n^{−1+δ/2} sup_x |F_n(x) − N(x)| < ∞

if and only if E|X|^{2+δ} < ∞. For this and more on rates of convergence, see Hall (1982).
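For coin flips (σ = ρ = 1) both the bound and the true error are computable exactly; at n = 100 the actual sup is roughly an order of magnitude below 3/√n, consistent with the remark that 3 is far from the optimal constant. A sketch:

```python
import math

Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal d.f.

n = 100                              # X_i = +-1: sigma = 1, rho = E|X|^3 = 1
pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
sup, cdf = 0.0, 0.0
for k in range(n + 1):
    x = (2 * k - n) / math.sqrt(n)   # atom of F_n; check both sides of the jump
    sup = max(sup, abs(cdf - Phi(x)))        # F_n(x-)
    cdf += pmf[k]
    sup = max(sup, abs(cdf - Phi(x)))        # F_n(x)
assert sup <= 3 / math.sqrt(n)       # (4.9): sup |F_n - N| <= 3 rho / sigma^3 sqrt(n)
assert sup > 0.03                    # the n^{-1/2} rate is genuinely attained
```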
Proof. Since neither side of the inequality is affected by scaling, we can suppose without loss of generality that σ² = 1. The first phase of the argument is to derive an inequality, (4.11), that relates the difference between the two distributions to the distance between their ch.f.'s. Polya's density (see Example 3.8 and use (3.1e))

h_L(x) = (1 − cos Lx)/(πLx²)

has ch.f. ω_L(θ) = (1 − |θ|/L) for |θ| ≤ L, and 0 otherwise. We will use H_L for its distribution function. We will convolve the distributions under consideration with H_L to get ch.f.'s that have compact support. The first step is to show that convolution with H_L does not reduce the difference between the distributions too much.
(4.10) Lemma. Let F and G be distribution functions with |G′(x)| ≤ λ < ∞. Let ∆(x) = F(x) − G(x), η = sup |∆(x)|, ∆_L = ∆ ∗ H_L, and η_L = sup |∆_L(x)|. Then

η_L ≥ η/2 − 12λ/(πL), or η ≤ 2η_L + 24λ/(πL)

Proof. ∆ goes to 0 at ±∞, G is continuous, and F is a d.f., so there is an x_0 with ∆(x_0) = η or ∆(x_0−) = −η. By looking at the d.f.'s of (−1) times the r.v.'s in the second case, we can suppose without loss of generality that ∆(x_0) = η. Since G′(x) ≤ λ and F is nondecreasing, ∆(x_0 + s) ≥ η − λs for s > 0. Letting δ = η/2λ and t = x_0 + δ, we have

∆(t − x) ≥ (η/2) + λx for |x| ≤ δ, and ≥ −η otherwise

To estimate the convolution ∆_L, we observe

2 ∫_δ^∞ h_L(x) dx ≤ 2 ∫_δ^∞ 2/(πLx²) dx = 4/(πLδ)

Looking at (−δ, δ) and its complement separately, and noticing that symmetry implies ∫_{−δ}^δ x h_L(x) dx = 0, we have

η_L ≥ ∆_L(t) ≥ (η/2)(1 − 4/(πLδ)) − η · 4/(πLδ) = η/2 − 6η/(πLδ) = η/2 − 12λ/(πL)

proving (4.10).
(4.11) Lemma. Let K_1 and K_2 be d.f.'s with mean 0 whose ch.f.'s κ_i are integrable. Then

K_1(x) − K_2(x) = (2π)^{−1} ∫ −e^{−itx} (κ_1(t) − κ_2(t))/(it) dt

Proof. Since the κ_i are integrable, the inversion formula (3.3) implies that K_i has density

k_i(y) = (2π)^{−1} ∫ e^{−ity} κ_i(t) dt

Subtracting the expression with i = 2 from the one with i = 1, then integrating from a to x and letting ∆K = K_1 − K_2, gives

∆K(x) − ∆K(a) = (2π)^{−1} ∫_a^x ∫ e^{−ity} {κ_1(t) − κ_2(t)} dt dy
= (2π)^{−1} ∫ {e^{−ita} − e^{−itx}} (κ_1(t) − κ_2(t))/(it) dt

the application of Fubini's theorem being justified since the κ_i are integrable in t and we are considering a bounded interval in y.

The factor 1/it could cause problems near zero, but we have supposed that the K_i have mean 0, so {1 − κ_i(t)}/t → 0 by Exercise 3.14, and hence (κ_1(t) − κ_2(t))/it is bounded and continuous. The factor 1/it improves the integrability for large t, so (κ_1(t) − κ_2(t))/it is integrable. Letting a → −∞ and using the Riemann-Lebesgue lemma (Exercise 4.5 in the Appendix) gives (4.11).
Let ϕ_F and ϕ_G be the ch.f.'s of F and G. Applying (4.11) to F_L = F ∗ H_L and G_L = G ∗ H_L gives

|F_L(x) − G_L(x)| ≤ (1/2π) ∫ |ϕ_F(t)ω_L(t) − ϕ_G(t)ω_L(t)| dt/|t| ≤ (1/2π) ∫_{−L}^L |ϕ_F(t) − ϕ_G(t)| dt/|t|

since |ω_L(t)| ≤ 1. Using (4.10) now, we have

|F(x) − G(x)| ≤ (1/π) ∫_{−L}^L |ϕ_F(θ) − ϕ_G(θ)| dθ/|θ| + 24λ/(πL)

where λ = sup_x |G′(x)|. Plugging in F = F_n and G = N gives

(4.12) |F_n(x) − N(x)| ≤ (1/π) ∫_{−L}^L |ϕ^n(θ/√n) − ψ(θ)| dθ/|θ| + 24λ/(πL)
and it remains to estimate the righthand side. This phase of the argument is
fairly routine, but there is a fair amount of algebra. To save the reader from
trying to improve the inequalities along the way in hopes of getting a better
bound, we would like to observe that we have used the fact that C = 3 to get
rid of the cases n ≤ 9, and we use n ≥ 10 in (e).
To estimate the second term in (4.12), we observe that

(a) sup_x |G′(x)| = G′(0) = (2π)^{−1/2} = .39894 < 2/5

For the first, we observe that if |α|, |β| ≤ γ then

(b) |α^n − β^n| ≤ ∑_{m=0}^{n−1} |α^{n−m}β^m − α^{n−m−1}β^{m+1}| ≤ n|α − β|γ^{n−1}

Using (3.7) now gives (recall we are supposing σ² = 1)

(c) |ϕ(t) − 1 + t²/2| ≤ ρ|t|³/6

(d) |ϕ(t)| ≤ 1 − t²/2 + ρ|t|³/6 if t² ≤ 2

Let L = 4√n/3ρ. If |θ| ≤ L, then by (d) and the fact that ρ|θ|/√n ≤ 4/3,

|ϕ(θ/√n)| ≤ 1 − θ²/2n + ρ|θ|³/6n^{3/2} ≤ 1 − 5θ²/18n ≤ exp(−5θ²/18n)

since 1 − x ≤ e^{−x}. We will now apply (b) with

α = ϕ(θ/√n), β = exp(−θ²/2n), γ = exp(−5θ²/18n)

Since we are supposing n ≥ 10,

(e) γ^{n−1} ≤ exp(−θ²/4)

For the other part of (b), we write
n|α − β| ≤ n|ϕ(θ/√n) − 1 + θ²/2n| + n|1 − θ²/2n − exp(−θ²/2n)|

To bound the first term on the right-hand side, observe that (c) implies

n|ϕ(θ/√n) − 1 + θ²/2n| ≤ ρ|θ|³/6n^{1/2}

For the second term, note that if 0 < x < 1 then we have an alternating series with decreasing terms, so

e^{−x} − (1 − x) = x²/2! − x³/3! + . . . ≤ x²/2

Taking x = θ²/2n, it follows that for |θ| ≤ L ≤ √(2n),

n|1 − θ²/2n − exp(−θ²/2n)| ≤ θ⁴/8n

Combining this with our estimate on the first term gives

(f) n|α − β| ≤ ρ|θ|³/6n^{1/2} + θ⁴/8n

Using (f) and (e) in (b) gives

(g) (1/|θ|)|ϕ^n(θ/√n) − exp(−θ²/2)| ≤ exp(−θ²/4)(ρθ²/6n^{1/2} + |θ|³/8n) ≤ exp(−θ²/4)(1/L)(2θ²/9 + |θ|³/18)

since ρ/√n = 4/3L, and 1/n = (1/√n) · (1/√n) ≤ (4/3L) · (1/3) since ρ ≥ 1 and n ≥ 10. Using (g) and (a) in (4.12) gives

πL|F_n(x) − N(x)| ≤ ∫ exp(−θ²/4)(2θ²/9 + |θ|³/18) dθ + 9.6

Recalling L = 4√n/3ρ, we see that the last result is of the form |F_n(x) − N(x)| ≤ Cρ/√n. To evaluate the constant, we observe
∫ (2πa)^{−1/2} x² exp(−x²/2a) dx = a

and, writing x³ = 2x² · x/2 and integrating by parts,

2 ∫_0^∞ x³ exp(−x²/4) dx = 2 ∫_0^∞ 4x exp(−x²/4) dx = −16e^{−x²/4} |_0^∞ = 16

This gives us

|F_n(x) − N(x)| ≤ (1/π) · (3ρ/4√n) · ((2/9) · 2√(4π) + 16/18 + 9.6) < 3ρ/√n

For the last step, you have to get out your calculator or trust Feller.

*2.5. Local Limit Theorems
In Section 2.1 we saw that if X1 , X2 , . . . are i.i.d. with P (X1 = 1) = P (X1 =
−1) = 1/2 and kn is a sequence of integers with 2kn /(2n)1/2 → x then
P (S2n = 2kn ) ∼ (πn)−1/2 exp(−x2 /2)
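The displayed relation is easy to test numerically: with 2n = 10,000 and 2k_n = 100 (so x = 1), the exact point probability and the local limit approximation agree to better than one percent. A sketch, with the pmf computed in log space:

```python
import math

two_n, two_k = 10_000, 100                   # S_{2n} = 2k means (2n+2k)/2 heads
heads = (two_n + two_k) // 2
log_pmf = (math.lgamma(two_n + 1) - math.lgamma(heads + 1)
           - math.lgamma(two_n - heads + 1) - two_n * math.log(2))
exact = math.exp(log_pmf)                    # P(S_{2n} = 2k)

x = two_k / math.sqrt(two_n)                 # here x = 1
approx = (math.pi * two_n / 2) ** -0.5 * math.exp(-x * x / 2)
assert abs(exact / approx - 1) < 0.01
```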
In this section, we will prove two theorems that generalize the last result. We
begin with two deﬁnitions. A random variable X has a lattice distribution
if there are constants b and h > 0 so that P (X ∈ b + hZ) = 1, where b + hZ =
{b + hz : z ∈ Z}. The largest h for which the last statement holds is called the
span of the distribution.
Example 5.1. If P (X = 1) = P (X = −1) = 1/2 then X has a lattice
distribution with span 2. When h is 2, one possible choice is b = −1.
The next result relates the last deﬁnition to the characteristic function. To
check (ii) in its statement, note that in the last example E(e^{itX}) = cos t has |cos t| = 1 when t = nπ.
(5.1) Theorem. Let ϕ(t) = Ee^{itX}. There are only three possibilities.

(i) |ϕ(t)| < 1 for all t ≠ 0.

(ii) There is a λ > 0 so that |ϕ(λ)| = 1 and |ϕ(t)| < 1 for 0 < t < λ. In this case, X has a lattice distribution with span 2π/λ.

(iii) |ϕ(t)| = 1 for all t. In this case, X = b a.s. for some b.
Proof. We begin with (ii). It suffices to show that |ϕ(t)| = 1 if and only if P(X ∈ b + (2π/t)Z) = 1 for some b. First, if P(X ∈ b + (2π/t)Z) = 1 then

ϕ(t) = Ee^{itX} = e^{itb} ∑_{n∈Z} e^{i2πn} P(X = b + (2π/t)n) = e^{itb}

Conversely, if |ϕ(t)| = 1, then there is equality in the inequality |Ee^{itX}| ≤ E|e^{itX}|, so the distribution of e^{itX} must be concentrated at some point e^{itb}, and P(X ∈ b + (2π/t)Z) = 1.

To prove the trichotomy now, we suppose that (i) and (ii) do not hold, i.e., there is a sequence t_n ↓ 0 so that |ϕ(t_n)| = 1. The first paragraph shows that there is a b_n so that P(X ∈ b_n + (2π/t_n)Z) = 1. Without loss of generality, we can pick b_n ∈ (−π/t_n, π/t_n]. As n → ∞, P(X ∉ (−π/t_n, π/t_n]) → 0, so it follows that P(X = b_n) → 1. This is only possible if b_n = b for n ≥ N, and P(X = b) = 1.
We call the three cases in (5.1): (i) nonlattice, (ii) lattice, and (iii)
degenerate. The reader should notice that this means that lattice random
variables are by deﬁnition nondegenerate. Before we turn to the main business
of this section, we would like to introduce one more special case. If X is a
lattice distribution and we can take b = 0, i.e., P (X ∈ hZ) = 1, then X is said
to be arithmetic. In this case, if λ = 2π/h then ϕ(λ) = 1 and ϕ is periodic:
ϕ ( t + λ) = ϕ ( t ) .
Our ﬁrst local limit theorem is for the lattice case. Let X1 , X2 , . . . be
i.i.d. with EXi = 0, EXi2 = σ 2 ∈ (0, ∞), and having a common lattice distribution with span h. If Sn = X1 + · · · + Xn and P (Xi ∈ b + hZ) = 1 then
P(S_n ∈ nb + hZ) = 1. We put

p_n(x) = P(S_n/√n = x) for x ∈ L_n = {(nb + hz)/√n : z ∈ Z}

and

n(x) = (2πσ²)^{−1/2} exp(−x²/2σ²) for x ∈ (−∞, ∞)

(5.2) Theorem. Under the hypotheses above, as n → ∞,

sup_{x∈L_n} |(n^{1/2}/h) p_n(x) − n(x)| → 0

Remark. To explain the statement, note that if we followed the approach in Example 4.3 then we would conclude that for x ∈ L_n

p_n(x) ≈ ∫_{x−h/2√n}^{x+h/2√n} n(y) dy ≈ (h/√n) n(x)
Proof. Let Y be a random variable with P(Y ∈ a + θZ) = 1 and ψ(t) = E exp(itY). It follows from part (iii) of Exercise 3.2 that

P(Y = x) = (θ/2π) ∫_{−π/θ}^{π/θ} e^{−itx} ψ(t) dt

Using this formula with θ = h/√n, ψ(t) = E exp(itS_n/√n) = ϕ^n(t/√n), and then multiplying each side by 1/θ, gives

(n^{1/2}/h) p_n(x) = (1/2π) ∫_{−π√n/h}^{π√n/h} e^{−itx} ϕ^n(t/√n) dt

Using the inversion formula (3.3) for n(x), which has ch.f. exp(−σ²t²/2), gives

n(x) = (1/2π) ∫ e^{−itx} exp(−σ²t²/2) dt

Subtracting the last two equations gives (recall π > 1, |e^{−itx}| ≤ 1)
|(n^{1/2}/h) p_n(x) − n(x)| ≤ ∫_{−π√n/h}^{π√n/h} |ϕ^n(t/√n) − exp(−σ²t²/2)| dt + ∫_{|t|>π√n/h} exp(−σ²t²/2) dt

The right-hand side is independent of x, so to prove (5.2) it suffices to show that it approaches 0. The second integral clearly → 0. To estimate the first integral, we observe that ϕ^n(t/√n) → exp(−σ²t²/2), so the integrand goes to 0, and it is now just a question of "applying the dominated convergence theorem."
To do this, we will divide the integral into three pieces. The bounded convergence theorem implies that for any A < ∞ the integral over (−A, A) approaches 0. To estimate the integral over (−A, A)^c, we observe that since EX_i = 0 and EX_i² = σ², formula (3.7) and the triangle inequality imply that

|ϕ(u)| ≤ 1 − σ²u²/2 + (u²/2) E(min(|u| · |X|³, 6|X|²))

The last expected value → 0 as u → 0. This means we can pick δ > 0 so that if |u| < δ, it is ≤ σ²/2 and hence

|ϕ(u)| ≤ 1 − σ²u²/2 + σ²u²/4 = 1 − σ²u²/4 ≤ exp(−σ²u²/4)

since 1 − x ≤ e^{−x}. Applying the last result to u = t/√n, we see that for |t| ≤ δ√n

(5.3) |ϕ(t/√n)|^n ≤ exp(−σ²t²/4)

So the integral over (−δ√n, δ√n) − (−A, A) is smaller than

2 ∫_A^{δ√n} exp(−σ²t²/4) dt

which is small if A is large.

To estimate the rest of the integral, we observe that since X has span h, (5.1) implies |ϕ(u)| ≠ 1 for u ∈ [δ, π/h]. ϕ is continuous, so there is an η < 1 so that |ϕ(u)| ≤ η < 1 for u ∈ [δ, π/h]. Letting u = t/√n again, we see that the integral over [−π√n/h, π√n/h] − (−δ√n, δ√n) is smaller than

2 ∫_{δ√n}^{π√n/h} (η^n + exp(−σ²t²/2)) dt

which → 0 as n → ∞. This completes the proof of (5.2).
We turn now to the nonlattice case. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$, $EX_i^2 = \sigma^2 \in (0,\infty)$, and having a common characteristic function $\varphi(t)$ that has $|\varphi(t)| < 1$ for all $t \ne 0$. Let $S_n = X_1 + \cdots + X_n$ and $n(x) = (2\pi\sigma^2)^{-1/2}\exp(-x^2/2\sigma^2)$.

(5.4) Theorem. Under the hypotheses above, if $x_n/\sqrt n \to x$ and $a < b$,
$$\sqrt n\,P(S_n \in (x_n + a, x_n + b)) \to (b - a)\,n(x)$$
Remark. The proof of (5.4) has to be a little devious because the assumption
above does not give us much control over the behavior of ϕ. For a bad example,
let q1 , q2 , . . . be an enumeration of the positive rationals which has qn ≤ n.
Suppose
$$P(X = q_n) = P(X = -q_n) = 1/2^{n+1}$$
In this case $EX = 0$, $EX^2 < \infty$, and the distribution is nonlattice. However, the characteristic function has $\limsup_{t\to\infty} |\varphi(t)| = 1$.
Proof  To tame bad ch.f.'s we use a trick. Let $\delta > 0$ and let
$$h_0(y) = \frac{1}{\pi}\cdot\frac{1 - \cos\delta y}{\delta y^2}$$
be the density of Polya's distribution, and let $h_\theta(x) = e^{i\theta x}h_0(x)$. If we introduce the Fourier transform
$$\hat g(u) = \int e^{iuy}\,g(y)\,dy$$
then it follows from Example 3.8 that
$$\hat h_0(u) = \begin{cases} 1 - |u/\delta| & \text{if } |u| \le \delta \\ 0 & \text{otherwise} \end{cases}$$
and it is easy to see that $\hat h_\theta(u) = \hat h_0(u + \theta)$. We will show that for any $\theta$
$$(*)\qquad \sqrt n\,E h_\theta(S_n - x_n) \to n(x)\int h_\theta(y)\,dy$$
Before proving $(*)$, we will show it implies (5.4). Let
$$\mu_n(A) = \sqrt n\,P(S_n - x_n \in A), \quad\text{and}\quad \mu(A) = n(x)\,|A|$$
where $|A|$ = the Lebesgue measure of $A$. Let
$$\alpha_n = \sqrt n\,E h_0(S_n - x_n) \quad\text{and}\quad \alpha = n(x)\int h_0(y)\,dy = n(x)$$
Finally, define probability measures by
$$\nu_n(B) = \frac{1}{\alpha_n}\int_B h_0(y)\,\mu_n(dy), \quad\text{and}\quad \nu(B) = \frac{1}{\alpha}\int_B h_0(y)\,\mu(dy)$$
Taking $\theta = 0$ in $(*)$ we see $\alpha_n \to \alpha$, and so $(*)$ implies
$$(**)\qquad \int e^{i\theta y}\,\nu_n(dy) \to \int e^{i\theta y}\,\nu(dy)$$
Since this holds for all $\theta$, it follows from (3.4) that $\nu_n \Rightarrow \nu$. Now if $|a|, |b| < 2\pi/\delta$, then the function
$$k(y) = \frac{1}{h_0(y)}\cdot 1_{(a,b)}(y)$$
is bounded and continuous a.s. with respect to $\nu$, so it follows from (2.3) that
$$\int k(y)\,\nu_n(dy) \to \int k(y)\,\nu(dy)$$
Since $\alpha_n \to \alpha$, this implies
$$\sqrt n\,P(S_n \in (x_n + a, x_n + b)) \to (b - a)\,n(x)$$
which is the conclusion of (5.4).
Turning now to the proof of $(*)$, the inversion formula (3.3) implies
$$h_0(x) = \frac{1}{2\pi}\int e^{-iux}\,\hat h_0(u)\,du$$
Recalling the definition of $h_\theta$, using the last result, and changing variables $u = v + \theta$, we have
$$h_\theta(x) = e^{i\theta x}h_0(x) = \frac{1}{2\pi}\int e^{-i(u-\theta)x}\,\hat h_0(u)\,du = \frac{1}{2\pi}\int e^{-ivx}\,\hat h_\theta(v)\,dv$$
since $\hat h_\theta(v) = \hat h_0(v + \theta)$. Letting $F_n$ be the distribution of $S_n - x_n$ and integrating gives
$$E h_\theta(S_n - x_n) = \int \frac{1}{2\pi}\int e^{-iux}\,\hat h_\theta(u)\,du\,dF_n(x) = \frac{1}{2\pi}\int\Bigl(\int e^{-iux}\,dF_n(x)\Bigr)\hat h_\theta(u)\,du$$
by Fubini's theorem. (Recall $\hat h_\theta(u)$ has compact support and $F_n$ is a distribution function.) Using (3.1e), we see that the last expression
$$= \frac{1}{2\pi}\int \varphi(-u)^n\,e^{iux_n}\,\hat h_\theta(u)\,du$$
To take the limit as $n \to \infty$ of this integral, let $[-M, M]$ be an interval with $\hat h_\theta(u) = 0$ for $u \notin [-M, M]$. By (5.3) above, we can pick $\delta$ so that for $|u| < \delta$
$$(5.5)\qquad |\varphi(u)| \le \exp(-\sigma^2 u^2/4)$$
Let $I = [-\delta, \delta]$ and $J = [-M, M] - I$. Since $|\varphi(u)| < 1$ for $u \ne 0$ and $\varphi$ is continuous, there is a constant $\eta < 1$ so that $|\varphi(u)| \le \eta < 1$ for $u \in J$. Since $|\hat h_\theta(u)| \le 1$, this implies that
$$\frac{\sqrt n}{2\pi}\Bigl|\int_J \varphi(-u)^n\,e^{iux_n}\,\hat h_\theta(u)\,du\Bigr| \le \frac{\sqrt n}{2\pi}\cdot 2M\eta^n \to 0$$
as $n \to \infty$. For the integral over $I$, change variables $u = t/\sqrt n$ to get
$$\frac{1}{2\pi}\int_{-\delta\sqrt n}^{\delta\sqrt n} \varphi(-t/\sqrt n)^n\,e^{itx_n/\sqrt n}\,\hat h_\theta(t/\sqrt n)\,dt$$
The central limit theorem implies $\varphi(-t/\sqrt n)^n \to \exp(-\sigma^2 t^2/2)$. Using (5.5) now and the dominated convergence theorem gives (recall $x_n/\sqrt n \to x$)
$$\frac{\sqrt n}{2\pi}\int_I \varphi(-u)^n\,e^{iux_n}\,\hat h_\theta(u)\,du \to \frac{1}{2\pi}\int \exp(-\sigma^2 t^2/2)\,e^{itx}\,\hat h_\theta(0)\,dt = n(x)\,\hat h_\theta(0) = n(x)\int h_\theta(y)\,dy$$
by the inversion formula (3.3) and the definition of $\hat h_\theta(0)$. This proves $(*)$ and completes the proof of (5.4).

2.6. Poisson Convergence
a. The Basic Limit Theorem
Our ﬁrst result is sometimes facetiously called the “weak law of small numbers”
or the “law of rare events.” These names derive from the fact that the Poisson
appears as the limit of a sum of indicators of events that have small probabilities.
(6.1) Theorem. For each $n$ let $X_{n,m}$, $1 \le m \le n$, be independent random variables with $P(X_{n,m} = 1) = p_{n,m}$, $P(X_{n,m} = 0) = 1 - p_{n,m}$. Suppose
(i) $\sum_{m=1}^n p_{n,m} \to \lambda \in (0,\infty)$, and (ii) $\max_{1\le m\le n} p_{n,m} \to 0$.
If $S_n = X_{n,1} + \cdots + X_{n,n}$ then $S_n \Rightarrow Z$ where $Z$ is Poisson($\lambda$).
Here Poisson(λ) is shorthand for Poisson distribution with mean λ, that is,
P (Z = k ) = e−λ λk /k !
Note that in the spirit of the Lindeberg-Feller theorem, no single term contributes very much to the sum. In contrast to that theorem, the contributions, when positive, are not small.
First proof  Let $\varphi_{n,m}(t) = E\exp(itX_{n,m}) = (1 - p_{n,m}) + p_{n,m}e^{it}$ and let $S_n = X_{n,1} + \cdots + X_{n,n}$. Then
$$E\exp(itS_n) = \prod_{m=1}^n \bigl(1 + p_{n,m}(e^{it} - 1)\bigr)$$
Let $0 \le p \le 1$. $|\exp(p(e^{it}-1))| = \exp(p\,\mathrm{Re}(e^{it}-1)) \le 1$, and $|1 + p(e^{it}-1)| \le 1$ since it is on the line segment connecting 1 to $e^{it}$. Using (4.3) with $\theta = 1$ and then (4.4), which is valid when $\max_m p_{n,m} \le 1/2$ since $|e^{it}-1| \le 2$,
$$\Bigl|\exp\Bigl(\sum_{m=1}^n p_{n,m}(e^{it}-1)\Bigr) - \prod_{m=1}^n\{1 + p_{n,m}(e^{it}-1)\}\Bigr| \le \sum_{m=1}^n\bigl|\exp(p_{n,m}(e^{it}-1)) - \{1 + p_{n,m}(e^{it}-1)\}\bigr| \le \sum_{m=1}^n p_{n,m}^2\,|e^{it}-1|^2$$
Using $|e^{it}-1| \le 2$ again, it follows that the last expression
$$\le 4\Bigl(\max_{1\le m\le n} p_{n,m}\Bigr)\sum_{m=1}^n p_{n,m} \to 0$$
by assumptions (i) and (ii). The last conclusion and $\sum_{m=1}^n p_{n,m} \to \lambda$ imply
$$E\exp(itS_n) \to \exp(\lambda(e^{it}-1))$$
To complete the proof now, we consult Example 3.2 for the ch.f. of the Poisson
distribution and apply (3.4).
We will now consider some concrete situations in which (6.1) can be applied. In each case we are considering a situation in which pn,m = c/n, so we
approximate the distribution of the sum by a Poisson with mean c.
Example 6.1. In a calculus class with 400 students, the number of students
who have their birthday on the day of the ﬁnal exam has approximately a Poisson distribution with mean 400/365 = 1.096. This means that the probability
no one was born on that date is about e−1.096 = .334. Similar reasoning shows
that the number of babies born on a given day or the number of people who
arrive at a bank between 1:15 and 1:30 should have a Poisson distribution.
Example 6.2. Suppose we roll two dice 36 times. The probability of “double
ones” (one on each die) is 1/36 so the number of times this occurs should have
approximately a Poisson distribution with mean 1. Comparing the Poisson
approximation with exact probabilities shows that the agreement is good even
though the number of trials is small.
k         0       1       2       3
Poisson   .3678   .3678   .1839   .0613
exact     .3627   .3730   .1865   .0604

After we give the second proof of (6.1) (see (6.5)), we will discuss rates of convergence. Those results will show that for large $n$ the largest discrepancy occurs for $k = 1$ and is about $1/(2en)$ (= .0051 in this case).
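The agreement claimed in the table can be recomputed directly; the short Python computation below (an illustration, not part of the text's development) reproduces both rows:

```python
from math import comb, exp, factorial

n, p, lam = 36, 1/36, 1.0   # 36 rolls, P(double ones) = 1/36, Poisson mean 1

def binom_pmf(k):
    # exact probability of k occurrences of "double ones" in 36 rolls
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    # Poisson(1) approximation
    return exp(-lam) * lam**k / factorial(k)

for k in range(4):
    print(k, round(poisson_pmf(k), 4), round(binom_pmf(k), 4))
```

Rounding to four places recovers the table, e.g. the exact probability of one occurrence is $\approx .3730$.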
Example 6.3. Let ξn,1 , . . . , ξn,n be independent and uniformly distributed over
$[-n, n]$. Let $X_{n,m} = 1$ if $\xi_{n,m} \in (a, b)$, $= 0$ otherwise. $S_n$ is the number of points that land in $(a, b)$. $p_{n,m} = (b-a)/2n$, so $\sum_m p_{n,m} = (b-a)/2$. This shows (i) and (ii) in (6.1) hold, and we conclude that $S_n \Rightarrow Z$, a Poisson r.v. with mean $(b-a)/2$. A two-dimensional version of the last theorem might explain why
the statistics of ﬂying bomb hits in the South of London during World War II Section 2.6 Poisson Convergence
ﬁt a Poisson distribution. As Feller, Vol. I (1968), p.160–161 reports, the area
was divided into 576 areas of 1/4 square kilometers each. The total number of
hits was 537 for an average of .9323 per cell. The table below compares Nk the
number of cells with k hits with the predictions of the Poisson approximation.
k         0        1        2       3       4      ≥5
Nk        229      211      93      35      7      1
Poisson   226.74   211.39   98.54   30.62   7.14   1.57

For other observations fitting a Poisson distribution, see Feller, Vol. I (1968),
Section VI.7.
Our second proof of (6.1) requires a little more work but provides information about the rate of convergence. (See (6.5) below.) We begin by deﬁning
the total variation distance between two measures on a countable set S.
$$\|\mu - \nu\| \equiv \sum_z |\mu(z) - \nu(z)| = 2\sup_{A\subset S}|\mu(A) - \nu(A)|$$
The first equality is a definition. To prove the second, note that for any $A$
$$\sum_z |\mu(z) - \nu(z)| \ge |\mu(A) - \nu(A)| + |\mu(A^c) - \nu(A^c)| = 2|\mu(A) - \nu(A)|$$
and there is equality when $A = \{z : \mu(z) \ge \nu(z)\}$.
Exercise 6.1. Show that (i) $d(\mu,\nu) = \|\mu - \nu\|$ defines a metric on probability measures on $\mathbf Z$ and (ii) $\|\mu_n - \mu\| \to 0$ if and only if $\mu_n(x) \to \mu(x)$ for each $x \in \mathbf Z$, which by Exercise 2.8 is equivalent to $\mu_n \Rightarrow \mu$.
Exercise 6.2. Show that $\|\mu - \nu\| \le 2\delta$ if and only if there are random variables $X$ and $Y$ with distributions $\mu$ and $\nu$ so that $P(X \ne Y) \le \delta$.
The next three lemmas are the keys to our second proof.
(6.2) Lemma. If µ1 × µ2 denotes the product measure on Z × Z that has
(µ1 × µ2 )(x, y ) = µ1 (x)µ2 (y ) then
µ1 × µ2 − ν1 × ν2 ≤ µ1 − ν1 + µ2 − ν2
Proof µ1 × µ2 − ν1 × ν2 =
≤ x,y µ1 (x)µ2 (y ) − ν1 (x)ν2 (y ) µ1 (x)µ2 (y ) − ν1 (x)µ2 (y ) +
x,y µ2 ( y ) =
y ν1 (x)µ2 (y ) − ν1 (x)ν2 (y )
x,y µ1 (x) − ν1 (x) +
x = µ1 − ν1 + µ2 − ν2 ν1 (x)
x µ2 (y ) − ν2 (y )
y 137 138 Chapter 2 Central Limit Theorems
(6.3) Lemma. If µ1 ∗ µ2 denotes the convolution of µ1 and µ2 , that is,
µ 1 ∗ µ 2 ( x) = µ1 ( x − y ) µ2 ( y )
y then µ1 ∗ µ2 − ν1 ∗ ν2 ≤ µ1 × µ2 − ν1 × ν2
Proof µ1 ∗ µ2 − ν1 ∗ ν2 =
≤ x y µ1 ( x − y ) µ2 ( y ) − y ν1 (x − y )ν2 (y ) µ1 (x − y )µ2 (y ) − ν1 (x − y )ν2 (y )
x y = µ1 × µ2 − ν1 × ν2
(6.4) Lemma. Let µ be the measure with µ(1) = p and µ(0) = 1 − p. Let ν be
a Poisson distribution with mean p. Then µ − ν ≤ 2p2 .
Proof µ − ν = µ(0) − ν (0) + µ(1) − ν (1) + ν ( n)
n≥2 = 1 − p − e−p  + p − p e−p  + 1 − e−p (1 + p)
Since 1 − x ≤ e−x ≤ 1 for x ≥ 0, the above
= e−p − 1 + p + p(1 − e−p ) + 1 − e−p − pe−p
= 2p(1 − e−p ) ≤ 2p2 Second proof of (6.1) Let µn,m be the distribution of Xn,m . Let µn be the
distribution of Sn . Let νn,m , νn , and ν be Poisson distributions with means
pn,m , λn = m≤n pn,m , and λ respectively. Since µn = µn,1 ∗ · · · ∗ µn,n and
νn = νn,1 ∗ · · · ∗ νn,n , (6.3), (6.2), and (6.4) imply
n (6.5) µn − νn ≤ n p2
n,m µn,m − νn,m ≤ 2
m=1 m=1 Using the deﬁnition of total variation distance now gives
n p2
n,m sup µn (A) − νn (A) ≤
A m=1 (i) and (ii) in (6.1) imply that the righthand side → 0. Since νn ⇒ ν as n → ∞,
the result follows. Section 2.6 Poisson Convergence
Remark. The proof above is due to Hodges and Le Cam (1960). By diﬀerent
methods, C. Stein (1987) (see (43) on p. 89) has proved
n sup µn (A) − νn (A) ≤ (λ ∨ 1)−1
A p2
n,m
m=1 Rates of convergence. When pn,m = 1/n, (6.5) becomes
sup µn (A) − νn (A) ≤ 1/n
A To assess the quality of this bound, we will compare the Poisson and binomial
probabilities for k successes.
k
0 Poisson
e−1 1
2 e−1
e−1 /2!
−1 3 e Binomial
1n
1− n
1
n · n−1 1 − n
n
1
−2
1− n
2n
n
3 /3! n −3 1− n−1
n−2 1
= 1− n
1
= 1− n 1 n−3
n 2
n = 1− n−1
n−1 / 2! 1− 1 n−2
n 3! Since (1 − x) ≤ e−x , we have µn (0) − νn (0) ≤ 0. Expanding
log(1 + x) = x − x3
x2
+
−...
2
3 gives
(n − 1) log 1 − 1
n =− 1
n−1 n−1
− . . . = −1 +
−
+ O(n−2 )
n
2n2
2n So
n 1− 1
n n−1 and it follows that − e−1 = ne−1 exp{1/2n + O(n−2 )} − 1 → e−1 /2 n(µn (1) − νn (1)) → e−1 /2
n(µn (2) − νn (2)) → e−1 /4 For k ≥ 3, using (1 − 2/n) ≤ (1 − 1/n)2 and (1 − x) ≤ e−x shows µn (k ) − νn (k ) ≤
0, so
sup µn (A) − νn (A) ≈ 3/4en
A⊂Z 139 140 Chapter 2 Central Limit Theorems
There is a large literature on Poisson approximations for dependent events.
Here we consider b. Two Examples with Dependence
that can be treated by exact calculations.
Example 6.4. Matching. Let π be a random permutation of {1, 2, . . . , n},
let Xn,m = 1 if m is a ﬁxed point (0 otherwise), and let Sn = Xn,1 + · · · + Xn,n
be the number of ﬁxed points. We want to compute P (Sn = 0). (For a more
exciting story consider men checking hats or wives swapping husbands.) Let
An,m = {Xn,m = 1}. The inclusionexclusion formula implies
P (∪n =1 Am ) =
m P (Am ) −
m P (A ∩ Am )
<m P (Ak ∩ A ∩ Am ) − . . . +
k< <m =n· 1
n (n − 2)!
n (n − 3)!
−
+
−...
n
2
n!
3
n! since the number of permutations with k speciﬁed ﬁxed points is (n − k )! Canceling some factorials gives
n (−1)m−1
m!
m=1 P (Sn > 0) = n so P (Sn = 0) = (−1)m
m!
m=0 Recognizing the second sum as the ﬁrst n + 1 terms in the expansion of e−1
gives
∞ P (Sn = 0) − e−1  = ≤ (−1)m
m!
m=n+1
1
(n + 1)! ∞ (n + 2)−k =
k=0 1
1
· 1−
(n + 1)!
n+2 −1 a much better rate of convergence than 1/n. To compute the other probabilities,
we observe that by considering the locations of the ﬁxed points
n
1
P (Sn−k = 0)
k n(n − 1) · · · (n − k + 1)
1
= P (Sn−k = 0) → e−1 /k !
k! P ( Sn = k ) = Section 2.6 Poisson Convergence
Example 6.5. Occupancy problem. Suppose that r balls are placed at
random into n boxes. It follows from the Poisson approximation to the Binomial
that if n → ∞ and r/n → c, then the number of balls in a given box will
approach a Poisson distribution with mean c. The last observation should
explain why the fraction of empty boxes approached e−c in Example 5.5 of
Chapter 1. Here we will show:
(6.6) Theorem. If ne−r/n → λ ∈ [0, ∞) the number of empty boxes approaches
a Poisson distribution with mean λ.
Proof To see where the answer comes from, notice that in the Poisson approximation the probability that a given box is empty is e−r/n ≈ λ/n, so if the
occupancy of the various boxes were independent, the result would follow from
(6.1). To prove the result, we begin by observing
P ( boxes i1 , i2 , . . . , ik are empty ) = 1− k
n r If we let pm (r, n) = the probability exactly m boxes are empty when r balls are
put in n boxes, then P ( no empty box ) = 1 − P ( at least one empty box ). So
by inclusionexclusion
n (a) n
k (−1)k p0 (r, n) =
k=0 1− k
n r By considering the locations of the empty boxes
(b) n
m pm (r, n) = 1− m
n r p0 (r, n − m) To evaluate the limit of pm (r, n) we begin by showing that if ne−r/n → λ then
n
m (c) 1− m
n r → λm /m! One half of this is easy. Since (1 − x) ≤ e−x and ne−r/n → λ
(d) n
m 1− m
n For the other direction, observe
n
m 1− m
n r ≤
n
m nm −mr/n
e
→ λm /m!
m! ≥ (n − m)m /m! so r ≥ 1− m
n m+r nm /m! 141 142 Chapter 2 Central Limit Theorems
Now (1 − m/n)m → 1 as n → ∞ and 1/m! is a constant. To deal with the rest,
we note that if 0 ≤ t ≤ 1/2 then
log(1 − t) = −t − t2 /2 − t3 /3 . . .
≥ −t − t2
1 + 2−1 + 2−2 + · · · = −t − t2
2 so we have
log nm 1 − m
n r ≥ m log n − rm/n − r(m/n)2 Our assumption ne−r/n → λ means
r = n log n − n log λ + o(n)
so r(m/n)2 → 0. Multiplying the last display by m/n and rearranging gives
m log n − rm/n → m log λ. Combining the last two results shows
lim inf nm 1 −
n→∞ m
n r ≥ λm and (c) follows. From (a), (c), and the dominated convergence theorem (using
(d) to get the domination) we get
(e) if ne−r/n → λ then p0 (r, n) → ∞
k λk
k=0 (−1) k! = e−λ For ﬁxed m, (n − m)e−r/(n−m) → λ, so it follows from (e) that p0 (r, n − m) →
e−λ . Combining this with (b) and (c) completes the proof of (6.6).
Example 6.6. Coupon collector’s problem. Let X1 , X2 , . . . be i.i.d. uniform on {1, 2, . . . , n} and Tn = inf {m : {X1 , . . . Xm } = {1, 2, . . . , n}}. Since
Tn ≤ m if and only if m balls ﬁll up all n boxes, it follows from (6.6) that
P (Tn − n log n ≤ nx) → exp(−e−x )
Proof If r = n log n + nx then ne−r/n → e−x . Note that Tn is the sum of n independent random variables (see Example 5.3 in
Chapter 1), but Tn does not converge to the normal distribution. The problem
is that the last few terms in the sum are of order n so the hypotheses of the
LindebergFeller theorem are not satisﬁed.
For a concrete instance of the previous result consider: What is the probability that in a village of 2190 (= 6 · 365) people all birthdays are represented?
Do you think the answer is much diﬀerent for 1825 (= 5 · 365) people? Section 2.6 Poisson Convergence
Solution Here n = 365, so 365 log 365 = 2153 and
P (T365 ≤ 2190) = P ((T365 − 2153)/365 ≤ 37/365)
≈ exp(−e−0.1014 ) = exp(−0.9036) = 0.4051
P (T365 ≤ 1825) = P ((T365 − 2153)/365 ≤ −328/365)
≈ exp(−e0.8986 ) = exp(−2.4562) = 0.085
n
As we observed in Example 5.3 of Chapter 1, if we let τk = inf {m :
n
n
n
{X1 , . . . , Xm } = k }, then τ1 = 1 and for 2 ≤ k ≤ n, τk − τk−1 are independent
and have a geometric distribution with parameter 1 − (k − 1)/n.
n
Exercise 6.3. Suppose k/n1/2 → λ ∈ [0, ∞) and show that τk − k ⇒
2
Poisson(λ /2). Hint: This is easy if you use (6.7) below.
n
2
n
Exercise 6.4. Let µn,k = Eτk and σn,k = var(τk ). Suppose k/n → a ∈ (0, 1),
√
n
and use the LindebergFeller theorem to show (τk − µn,k )/ n ⇒ σχ. The last result is true when k/n1/2 → ∞ and n − k → ∞, see Baum and
Billingsley (1966). Results for k = n − j can be obtained from (6.6), so we have
examined all the possibilities. c. Poisson Processes
(6.1) generalizes trivially to give the following result.
(6.7) Theorem. Let Xn,m , 1 ≤ m ≤ n be independent nonnegative integer
valued random variables with P (Xn,m = 1) = pn,m , P (Xn,m ≥ 2) = n,m .
(i) n
m=1 pn,m → λ ∈ (0, ∞), (ii) max1≤m≤n pn,m → 0,
and (iii) n
m=1 n,m → 0. If Sn = Xn,1 + · · · + Xn,n then Sn ⇒ Z where Z is Poisson(λ).
Proof Let Xn,m = 1 if Xn,m = 1, and 0 otherwise. Let Sn = Xn,1 + · · · + Xn,n .
(i)(ii) and (6.1) imply Sn ⇒ Z , (iii) tells us P (Sn = Sn ) → 0 and the result
follows from the converging together lemma, Exercise 2.10.
The next result, which uses (6.7), explains why the Poisson distribution
comes up so frequently in applications. Let N (s, t) be the number of arrivals
at a bank or an ice cream parlor in the time interval (s, t]. Suppose 143 144 Chapter 2 Central Limit Theorems
(i) the numbers of arrivals in disjoint intervals are independent,
(ii) the distribution of N (s, t) only depends on t − s,
(iii) P (N (0, h) = 1) = λh + o(h),
and (iv) P (N (0, h) ≥ 2) = o(h).
Here, the two o(h) stand for functions g1 (h) and g2 (h) with gi (h)/h → 0 as
h → 0.
(6.8) Theorem. If (i)–(iv) hold then N (0, t) has a Poisson distribution with
mean λt.
Proof Let Xn,m = N ((m − 1)t/n, mt/n) for 1 ≤ m ≤ n and apply (6.7). A family of random variables Nt , t ≥ 0 satisfying:
(i) if 0 = t0 < t1 < . . . < tn , N (tk ) − N (tk−1 ), 1 ≤ k ≤ n are independent,
(ii) N (t) − N (s) is Poisson(λ(t − s)),
is called a Poisson process with rate λ. To understand how Nt behaves, it
is useful to have another method to construct it. Let ξ1 , ξ2 , . . . be independent
random variables with P (ξi > t) = e−λt for t ≥ 0. Let Tn = ξ1 + · · · + ξn and
Nt = sup{n : Tn ≤ t} where T0 = 0. In the language of renewal theory (see
(7.3) in Chapter 1), Tn is the time of the nth arrival and Nt is the number of
arrivals by time t. To check that Nt is a Poisson process, we begin by recalling
(see Exercise 4.8 in Chapter 1):
P (Tn = s) = λn sn−1 −λs
e
for s ≥ 0
(n − 1)! i.e., the distribution of Tn has a density given by the righthand side. Now
P (Nt = 0) = P (T1 > t) = e−λt
and for n ≥ 1
t P (Nt = n) = P (Tn ≤ t < Tn+1 ) = P (Tn = s)P (ξn+1 > t − s) ds
0 t =
0 λn sn−1 −λs −λ(t−s)
(λt)n
e
e
ds = e−λt
(n − 1)!
n! The last two formulas show that Nt has a Poisson distribution with mean λt.
To check that the number of arrivals in disjoint intervals is independent, we
observe
P (Tn+1 ≥ uNt = n) = P (Tn+1 ≥ u, Tn ≤ t)/P (Nt = n) Section 2.6 Poisson Convergence
To compute the numerator, we observe
t P (Tn+1 ≥ u, Tn ≤ t) = P (Tn = s)P (ξn+1 ≥ u − s) ds
0
t =
0 λn sn−1 −λs −λ(u−s)
(λt)n
e
e
ds = e−λu
(n − 1)!
n! The denominator is P (Nt = n) = e−λt (λt)n /n!, so
P (Tn+1 ≥ uNt = n) = e−λu /e−λt = e−λ(u−t)
or rewriting things P (Tn+1 − t ≥ sNt = n) = e−λs . Let T1 = TN (t)+1 − t, and
Tk = TN (t)+k − TN (t)+k−1 for k ≥ 2. The last computation shows that T1 is
independent of Nt . If we observe that
P (Tn ≤ t, Tn+1 ≥ u, Tn+k − Tn+k−1 ≥ vk , k = 2, . . . , K )
K = P (Tn ≤ t, Tn+1 ≥ u) P (ξn+k ≥ vk )
k=2 then it follows that
(a) T1 , T2 , . . . are i.i.d. and independent of Nt .
The last observation shows that the arrivals after time t are independent of Nt
and have the same distribution as the original sequence. From this it follows
easily that:
(b) If 0 = t0 < t1 . . . < tn then N (ti ) − N (ti−1 ), i = 1, . . . , n are independent.
To see this, observe that the vector (N (t2 ) − N (t1 ), . . . , N (tn ) − N (tn−1 )) is
σ (Tk , k ≥ 1) measurable and hence is independent of N (t1 ). Then use induction
to conclude
n P (N (ti ) − N (ti−1 ) = ki , i = 1, . . . , n) = exp(−λ(ti − ti−1 ))
i=1 λ(ti − ti−1 ))ki
ki ! Remark. The key to the proof of (a) is the lack of memory property of the
exponential distribution:
(∗) P (T > t + sT > t) = P (T > s) which implies that the location of the ﬁrst arrival after t is independent of what
occurred before time t and has an exponential distribution. 145 146 Chapter 2 Central Limit Theorems
Exercise 6.5. Show that if P (T > 0) = 1 and (∗) holds then there is a λ > 0 so
that P (T > t) = e−λt for t ≥ 0. Hint: First show that this holds for t = m2−n .
Exercise 6.6. Show that (iii) and (iv) in (6.8) can be replaced by
(v) If Ns− = limr↑s Nr then P (Ns − Ns− ≥ 2 for some s) = 0.
That is, if (i), (ii), and (v) hold then there is a λ ≥ 0 so that N (0, t) has a Poisson
distribution with mean λt. Prove this by showing: (a) If u(s) = P (Ns = 0)
then (i) and (ii) imply u(r)u(s) = u(r + s). It follows that u(s) = e−λs for some
λ ≥ 0, so (iii) holds. (b) if v (s) = P (Ns ≥ 2) and An = {Nk/n − N(k−1)/n ≥ 2
for some k ≤ n} then (v) implies P (An ) → 0 as n → ∞ and (iv) holds.
Exercise 6.7. Let Tn be the time of the nth arrival in a rate λ Poisson
process. Let U1 , U2 , . . . , Un be independent uniform on (0,1) and let Vkn be the
n
kth smallest number in {U1 , . . . , Un }. Show that the vectors (V1n , . . . , Vn ) and
(T1 /Tn+1 , . . . , Tn /Tn+1 ) have the same distribution.
Spacings. The last result can be used to study the spacings between the order
statistics of i.i.d. uniforms. We use notation of Exercise 6.7 in the next four
n
exercises, taking λ = 1 and letting V0n = 0, and Vn+1 = 1.
n
Exercise 6.8. Smirnov (1949) nVk ⇒ Tk . Exercise 6.9. Weiss (1955) n−1 n
m=1 1(n(Vin −Vin 1 )>x) → e−x in probability.
− n
n
Exercise 6.10. (n/ log n) max1≤m≤n+1 Vm − Vm−1 → 1 in probability.
n
n
Exercise 6.11. P (n2 min1≤m≤n Vm − Vm−1 > x) → e−x . For the rest of the section, we concentrate on the Poisson process itself.
Exercise 6.12. Thinning. Let N have a Poisson distribution with mean λ
and let X1 , X2 , . . . be an independent i.i.d. sequence with P (Xi = j ) = pj for
j = 0, 1, . . . , k . Let Nj = {m ≤ N : Xm = j }. Show that N0 , N1 , . . . , Nk are
independent and Nj has a Poisson distribution with mean λpj .
In the important special case Xi ∈ {0, 1}, the result says that if we thin a
Poisson process by ﬂipping a coin with probability p of heads to see if we keep
the arrival, then the result is a Poisson process with rate λp.
Exercise 6.13. Poissonization and the occupancy problem. If we put
a Poisson number of balls with mean r in n boxes and let Ni be the number 2.7 Stable Laws of balls in box i, then the last exercise implies N1 , . . . , Nn are independent and
have a Poisson distribution with mean r/n. Use this observation to prove (6.6).
Hint: If r = n log n−(log λ)n+o(n) and si = n log n−(log µi )n with µ2 < λ < µ1
then the normal approximation to the Poisson tells us P (Poisson(s1 ) < r <
Poisson(s2)) → 1 as n → ∞.
Compound Poisson process. At the arrival times T1 , T2 , . . . of a Poisson process with rate λ, groups of customers of size ξ1 , ξ2 , . . . arrive at an ice
cream parlor. Suppose the ξi are i.i.d. and independent of the Tj s. This is a
compound Poisson process. The result of Exercise 6.12 shows that Ntk =
the number of groups of size k to arrive in [0, t] are independent Poisson’s with
mean pk λt.
A Poisson process on a measure space (S, S , µ) is a random map m :
S → {0, 1, . . .} that for each ω is a measure on S and has the following property:
if A1 , . . . , An are disjoint sets with µ(Ai ) < ∞ then m(A1 ), . . . , m(An ) are
independent and have Poisson distributions with means µ(Ai ). µ is called the
mean measure of the process. Exercise 6.12 implies that if µ(S ) < ∞ we
can construct m by the following recipe: let X1 , X2 , . . . be i.i.d. elements of S
with distribution ν (·) = µ(·)/µ(S ), let N be an independent Poisson random
variable with mean µ(S ), and let m(A) = {j ≤ N : Xj ∈ A}. To extend
the construction to inﬁnite measure spaces, e.g., S = Rd , S = Borel sets, µ =
Lebesgue measure, divide the space up into disjoint sets of ﬁnite measure and
put independent Poisson processes on each set. *2.7. Stable Laws
Let X1 , X2 , . . . be i.i.d. and Sn = X1 + · · · + Xn . In Section 2.4, we saw that if
EXi = µ and var(Xi ) = σ 2 ∈ (0, ∞) then
(Sn − nµ)/ σn1/2 ⇒ χ
2
In this section, we will investigate the case EX1 = ∞ and give necessary and
suﬃcient conditions for the existence of constants an and bn so that (Sn − bn )/an ⇒ Y where Y is nondegenerate We begin with an example. Suppose the distribution of Xi has
(7.1a) P (X1 > x) = P (X1 < −x) (7.1b) P (X1  > x) = x−α for x ≥ 1 147 148 Chapter 2 Central Limit Theorems
where 0 < α < 2. If ϕ(t) = E exp(itX1 ) then
∞ −1 α
dx +
2xα+1
1 − cos(tx)
dx
xα+1 (1 − eitx ) 1 − ϕ(t) =
1 ∞ =α
1 (1 − eitx )
−∞ α
dx
2xα+1 Changing variables tx = u, dx = du/t the last integral becomes
∞ =α
t 1 − cos u du
= tα α
(u/t)α+1 t 2 As u → 0, 1 − cos u ∼ u /2. So (1 − cos u)/u
since α < 2 implies −α + 1 > −1. If we let
∞ C=α
0 ∞
t α+1 1 − cos u
du
uα+1 ∼ u−α+1 /2 which is integrable, 1 − cos u
du < ∞
uα+1 and observe (7.1a) implies ϕ(t) = ϕ(−t), then the results above show
(7.2) 1 − ϕ(t) ∼ C tα as t → 0 Let X1 , X2 , . . . be i.i.d. with the distribution given in (7.1) and let Sn = X1 +
· · · + Xn .
E exp(itSn /n1/α ) = ϕ(t/n1/α )n = (1 − {1 − ϕ(t/n1/α )})n
As n → ∞, n(1 − ϕ(t/n1/α )) → C tα , so it follows from (4.2) that
E exp(itSn /n1/α ) → exp(−C tα )
From part (ii) of (3.4), it follows that the expression on the right is the characteristic function of some Y and
Sn /n1/α ⇒ Y (7.3) To prepare for our general result, we will now give another proof of (7.3).
If 0 < a < b and an1/α > 1 then
P (an1/α < X1 < bn1/α ) = 1 −α
(a − b−α )n−1
2 so it follows from (6.1) that
Nn (a, b) ≡ {m ≤ n : Xm /n1/α ∈ (a, b)} ⇒ N (a, b) 2.7 Stable Laws where N (a, b) has a Poisson distribution with mean (a−α − b−α )/2. An easy
extension of the last result shows that if A ⊂ R − (−δ, δ ) and δn1/α > 1 then
P (X1 /n1/α ∈ A) = n−1
A α
dx
2xα+1 so Nn (A) ≡ {m ≤ n : Xm /n1/α ∈ A} ⇒ N (A), where N (A) has a Poisson
distribution with mean
µ(A) =
A α
dx < ∞
2xα+1 The limiting family of random variables N (A) is called a Poisson process on
(−∞, ∞) with mean measure µ. (See the end of Section 2.6 for more on this
process.) Notice that for any > 0, µ( , ∞) = −α /2 < ∞, so N ( , ∞) < ∞.
The last paragraph describes the limiting behavior of the random set
Xn = {Xm /n1/α : 1 ≤ m ≤ n}
To describe the limit of Sn /n1/α , we will “sum up the points.” Let > 0 and In ( ) = {m ≤ n : Xm  > n1/α }
ˆ
ˆ
¯
Xm
Sn ( ) =
Sn ( ) = Sn − Sn ( )
m ∈ In ( ) ˆ
In ( ) = the indices of the “big terms,” i.e., those > n1/α in magnitude. Sn ( )
¯
is the sum of the big terms, and Sn ( ) is the rest of the sum. The ﬁrst thing
¯
we will do is show that the contribution of Sn ( ) is small if is. Let
¯
Xm ( ) = Xm 1(Xm ≤ n1/α ) ¯
¯
¯
Symmetry implies E Xm ( ) = 0, so E (Sn ( )2 ) = nE X1 ( )2 .
∞ ¯
E X1 ( ) 2 =
0 =1+ n1/α 1 ¯
2yP (X1 ( ) > y ) dy ≤
2
2−α 2−α 2/α−1 n − 2y y −α dy 2y dy +
0 1 2 2−α 2/α−1
2
≤
n
2−α
2−α where we have used α < 2 in computing the integral and α > 0 in the ﬁnal
inequality. From this it follows that
(7.4) 2 2−α
¯
E (Sn ( )/n1/α )2 ≤
2−α 149 150 Chapter 2 Central Limit Theorems
ˆ
To compute the limit of Sn ( )/n1/α , we observe that In ( ) has a binomial
ˆ
distribution with success probability p = −α /n. Given In ( ) = m, Sn ( )/n1/α
is the sum of m independent random variables with a distribution Fn that is
symmetric about 0 and has
1 − Fn (x) = P (X1 /n1/α > x  X1 /n1/α > ) = x−α /2 −α for x ≥ The last distribution is the same as that of X1 , so if ϕ(t) = E exp(itX1 ), the
distribution Fn has characteristic function ϕ( t). Combining the observations
in this paragraph gives
n ˆ
E exp(itSn ( )/n1/α ) =
m=0 n
(
m −α /n)m (1 − −α /n)n−m ϕ( t)m Writing
1 n(n − 1) · · · (n − m + 1)
1
n1
=
≤
m nm
m!
nm
m!
noting (1− −α /n)n ≤ exp(− −α ) and using the dominated convergence theorem
∞ (7.5) ˆ
E exp(itSn ( )/n1/α ) → exp(− −α )( −α m ) ϕ( t)m /m! m=0 = exp(− −α {1 − ϕ( t)}) To get (7.3) now, we use the following generalization of (4.7).
(7.6) Lemma. If hn ( ) → g ( ) for each > 0 and g ( ) → g (0) as
we can pick n → 0 so that hn ( n ) → g (0). → 0 then Proof Let Nm be chosen so that hn (1/m) − g (1/m) ≤ 1/m for n ≥ Nm and
m → Nm is increasing. Let n = 1/m for Nm ≤ n < Nm+1 and = 1 for n < N1 .
When Nm ≤ n < Nm+1 , n = 1/m so it follows from the triangle inequality
and the deﬁnition of n that
hn ( n ) − g (0) ≤ hn (1/m) − g (1/m) + g (1/m) − g (0)
≤ 1/m + g (1/m) − g (0)
When n → ∞, we have m → ∞ and the result follows.
ˆ
Let hn ( ) = E exp(itSn ( )/n1/α ) and g ( ) = exp(−
α
implies 1 − ϕ(t) ∼ C t as t → 0 so
g ( ) → exp(−C tα ) as →0 −α {1 − ϕ( t)}). (7.2) 2.7 Stable Laws and (7.6) implies we can pick n → 0 with hn ( n ) → exp(−C tα ). Introducing
ˆ
Y with E exp(itY ) = exp(−C tα ), it follows that Sn ( n )/n1/α ⇒ Y . If n → 0
then (7.4) implies
¯
Sn ( n )/n1/α ⇒ 0
and (7.3) follows from the converging together lemma, Exercise 2.10.
Once we give one ﬁnal deﬁnition, we will state and prove the general result
alluded to above. L is said to be slowly varying, if
lim L(tx)/L(x) = 1 x→∞ for all t > 0 Exercise 7.1. Show that L(t) = log t is slowly varying but t is not if = 0. (7.7) Theorem. Suppose X1 , X2 , . . . are i.i.d. with a distribution that satisﬁes
(i) limx→∞ P (X1 > x)/P (X1  > x) = θ ∈ [0, 1]
(ii) P (X1  > x) = x−α L(x)
where α < 2 and L is slowly varying. Let Sn = X1 + · · · + Xn
an = inf {x : P (X1  > x) ≤ n−1 } and bn = nE (X1 1(X1 ≤an ) )
As n → ∞, (Sn − bn )/an ⇒ Y where Y has a nondegenerate distribution.
Remark. This is not much of a generalization of the example, but the conditions are necessary for the existence of constants an and bn so that (Sn −
bn )/an ⇒ Y , where Y is nondegenerate. Proofs of necessity can be found in
Chapter 9 of Breiman (1968) or in Gnedenko and Kolmogorov (1954). (7.13)
gives the ch.f. of Y . The reader can skip to that point without much loss.
Proof It is not hard to see that (ii) implies (7.8) nP (X1  > an ) → 1 To prove this, note that nP (X1  > an ) ≤ 1 and let > 0. Taking x = an /(1+ )
and t = 1 + 2 , (ii) implies
(1 + 2 )−α = lim n→∞ proving (7.8) since
(7.9) P (X1  > (1 + 2 )an /(1 + ))
P (X1  > an )
≤ lim inf
n→∞
P (X1  > an /(1 + ))
1/n is arbitrary. Combining (7.8) with (i) and (ii) gives
nP (X1 > xan ) → θx−α for x > 0 151 152 Chapter 2 Central Limit Theorems
so {m ≤ n : Xm > xan } ⇒ Poisson(θx−α ). The last result leads, as before, to
the conclusion that Xn = {Xm /an : 1 ≤ m ≤ n} converges to a Poisson process
on (−∞, ∞) with mean measure
θαx−(α+1) dx + µ(A) =
A∩(0,∞) (1 − θ)αx−(α+1) dx
A∩(−∞,0) To sum up the points, let In ( ) = {m ≤ n : Xm  > an }
µ( ) = EXm 1(
ˆ an <Xm ≤an ) ˆ
Sn ( ) = Xm
m ∈ In ( ) µ( ) = EXm 1(Xm ≤
¯ an )
n ˆ
¯
ˆ
Sn ( ) = (Sn − bn ) − (Sn ( ) − nµ( )) = {Xm 1(Xm ≤ an ) − µ( ) }
¯ m=1 ¯
If we let Xm ( ) = Xm 1(Xm ≤ an ) then ¯
¯
¯
E (Sn ( )/an )2 = n var(X1 ( )/an ) ≤ nE (X1 ( )/an )2
¯
E (X1 ( )/an )2 ≤ 2yP (X1  > yan ) dy
0 = P (X1  > an ) 2y
0 P (X1  > yan )
dy
P (X1  > an ) We would like to use (7.8) and (ii) to conclude
¯
nE (X1 ( )/an )2 → 2y y −α dy =
0 2
2−α 2−α and hence
(7.10) 2 2−α
¯
lim sup E (Sn ( )/an )2 ≤
2−α
n→∞ To justify interchanging the limit and the integral and complete the proof of
(7.10), we show the following (take δ < 2 − α):
Lemma. For any δ > 0 there is C so that for all t ≥ t0 and y ≤ 1
P (X1  > yt)/P (X1  > t) ≤ Cy −α−δ
Proof (ii) implies that as t → ∞
P (X1  > t/2)/P (X1  > t) → 2α 2.7 Stable Laws so for t ≥ t0 we have
P (X1  > t/2)/P (X1  > t) ≤ 2α+δ
Iterating and stopping the ﬁrst time t/2m < t0 we have for all n ≥ 1
P (X1  > t/2n )/P (X1  > t) ≤ C 2(α+δ)n
where C = 1/P (X1  > t0 ). Applying the last result to the ﬁrst n with 1/2n < y
and noticing y ≤ 1/2n−1 , we have
P (X1  > yt)/P (X1  > t) ≤ C 2α+δ y −α−δ
To compute the limit of Ŝ_n(ε), we observe that |I_n(ε)| ⇒ Poisson(ε^{−α}). Given |I_n(ε)| = m, Ŝ_n(ε)/a_n is the sum of m independent random variables with distribution F_n^ε that has

1 − F_n^ε(x) = P(X_1/a_n > x | |X_1|/a_n > ε) → θ x^{−α}/ε^{−α}
F_n^ε(−x) = P(X_1/a_n < −x | |X_1|/a_n > ε) → (1−θ) x^{−α}/ε^{−α}

for x ≥ ε. If we let ψ_n^ε(t) denote the characteristic function of F_n^ε, then (3.4) implies

ψ_n^ε(t) → ψ^ε(t) = ∫_ε^∞ e^{itx} θ ε^α α x^{−(α+1)} dx + ∫_{−∞}^{−ε} e^{itx} (1−θ) ε^α α |x|^{−(α+1)} dx

as n → ∞. So repeating the proof of (7.5) gives
E exp(it Ŝ_n(ε)/a_n) → exp(−ε^{−α}{1 − ψ^ε(t)})
    = exp( ∫_ε^∞ (e^{itx} − 1) θ α x^{−(α+1)} dx + ∫_{−∞}^{−ε} (e^{itx} − 1)(1−θ) α |x|^{−(α+1)} dx )

where we have used ε^{−α} = ∫_ε^∞ α x^{−(α+1)} dx. To bring in μ̂(ε) = E X_m 1_{(ε a_n < |X_m| ≤ a_n)}, we observe that (7.9) implies nP(x a_n < X_m ≤ y a_n) → θ(x^{−α} − y^{−α}). So

n μ̂(ε)/a_n → ∫_ε^1 x θ α x^{−(α+1)} dx + ∫_{−1}^{−ε} x (1−θ) α |x|^{−(α+1)} dx
From this it follows that E exp(it{Ŝ_n(ε) − nμ̂(ε)}/a_n) →

(7.11)  exp( ∫_1^∞ (e^{itx} − 1) θ α x^{−(α+1)} dx + ∫_ε^1 (e^{itx} − 1 − itx) θ α x^{−(α+1)} dx
        + ∫_{−1}^{−ε} (e^{itx} − 1 − itx)(1−θ) α |x|^{−(α+1)} dx + ∫_{−∞}^{−1} (e^{itx} − 1)(1−θ) α |x|^{−(α+1)} dx )

The last expression is messy, but e^{itx} − 1 − itx ∼ −t²x²/2 as x → 0, so we need to subtract the itx to make

∫_0^1 (e^{itx} − 1 − itx) x^{−(α+1)} dx converge when α ≥ 1

To reduce the number of integrals from four to two, we can write the limit as ε → 0 of the left-hand side of (7.11) as

(7.12)  exp( itc + ∫_0^∞ (e^{itx} − 1 − itx/(1+x²)) θ α x^{−(α+1)} dx + ∫_{−∞}^0 (e^{itx} − 1 − itx/(1+x²)) (1−θ) α |x|^{−(α+1)} dx )

where c is a constant. Combining (7.10) and (7.11) using (7.6), it follows easily that (S_n − b_n)/a_n ⇒ Y where Ee^{itY} is given in (7.12).
Exercise 7.2. Show that when α < 1, centering is unnecessary, i.e., we can
let bn = 0.
By doing some calculus (see Breiman (1968), p. 204–206) one can rewrite (7.12) as

(7.13)  exp( itc − b|t|^α {1 + iκ sgn(t) w_α(t)} )

where −1 ≤ κ ≤ 1 (κ = 2θ − 1) and

w_α(t) = tan(πα/2)  if α ≠ 1
w_α(t) = (2/π) log|t|  if α = 1

The reader should note that while we have assumed 0 < α < 2 throughout the developments above, if we set α = 2 then the term with κ vanishes and (7.13) reduces to the characteristic function of the normal distribution with mean c and variance 2b.
The distributions whose characteristic functions are given in (7.13) are
called stable laws. α is commonly called the index. When α = 1 and κ = 0,
we have the Cauchy distribution. Apart from the Cauchy and the normal, there
is only one other case in which the density is known: When α = 1/2, κ = 1,
c = 0, and b = 1, the density is

(7.14)  (2πy³)^{−1/2} exp(−1/(2y))  for y ≥ 0

One can calculate the ch.f. and verify our claim. However, later (see Section
7.4) we will be able to check the claim without eﬀort, so we leave the somewhat
tedious calculation to the reader.
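One consistency check is available right away. Exercise 7.8(ii) below asserts that 1/W² has the density (7.14) when W is standard normal, and the distribution function P(1/W² ≤ y) = P(|W| ≥ y^{−1/2}) = 2(1 − Φ(y^{−1/2})) is explicit, so we can compare it against a direct numerical integral of (7.14). A hedged Python sketch (function names, grid sizes, and tolerances are our own choices, not from the text):

```python
import math

def levy_density(y):
    # the density in (7.14): (2*pi*y^3)^(-1/2) * exp(-1/(2y)) for y > 0
    return (2 * math.pi * y ** 3) ** -0.5 * math.exp(-1 / (2 * y))

def cdf_of_inverse_square_normal(y):
    # P(1/W^2 <= y) = P(|W| >= 1/sqrt(y)) = 2*(1 - Phi(1/sqrt(y))), W ~ N(0,1)
    z = 1 / math.sqrt(y)
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return 2 * (1 - phi)

def midpoint_integral(f, a, b, n=20000):
    # crude midpoint rule, good enough for this smooth integrand
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# the two computations agree if (7.14) really is the alpha = 1/2, kappa = 1 density
for y in (0.5, 2.0, 10.0):
    assert abs(midpoint_integral(levy_density, 1e-9, y)
               - cdf_of_inverse_square_normal(y)) < 1e-3
```

The integrand vanishes extremely fast near 0 (the exp(−1/(2y)) factor beats the y^{−3/2} singularity), which is why truncating the integral at 10^{−9} is harmless.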
We are now finally ready to treat some examples.
Example 7.1. Let X1 , X2 , . . . be i.i.d. with a density that is symmetric about
0, and continuous and positive at 0. We claim that
(1/n)(1/X_1 + · · · + 1/X_n) ⇒ a Cauchy distribution (α = 1, κ = 0)

To verify this, note that

P(1/X_i > x) = P(0 < X_i < x^{−1}) = ∫_0^{x^{−1}} f(y) dy ∼ f(0)/x

as x → ∞. A similar calculation shows P(1/X_i < −x) ∼ f(0)/x, so in (7.7) (i)
holds with θ = 1/2 and (ii) holds with α = 1. The scaling constant an ∼ 2f (0)n,
while the centering constant vanishes since we have supposed the distribution
of X is symmetric about 0.
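A quick simulation is consistent with the claim. With X_i uniform on (−1, 1), so that f(0) = 1/2, the averages (1/n)Σ 1/X_i do not concentrate the way averages of integrable variables would: they stay symmetric around 0 on the scale of a constant but keep a heavy tail. A hedged sketch (the seed, sample sizes, and thresholds are arbitrary choices of ours):

```python
import random
import statistics

random.seed(7)

def cauchy_like_average(n):
    # (1/n) * sum of reciprocals of Uniform(-1,1) variables; here f(0) = 1/2
    return sum(1 / random.uniform(-1, 1) for _ in range(n)) / n

samples = [cauchy_like_average(1000) for _ in range(2000)]

# the limit is a symmetric (Cauchy) law: the sample median sits near 0 ...
assert abs(statistics.median(samples)) < 0.5
# ... but the tail is heavy: a sizable fraction of the averages is far from 0,
# which could not happen if the averages obeyed a law of large numbers
assert sum(abs(s) > 10 for s in samples) > 50
```

The loose inequalities are deliberate: with a Cauchy limit the exceedance fraction near 10% and the small median are overwhelmingly likely, but they are statistical, not exact, statements.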
Remark. Readers who want a challenge should try to drop the symmetry
assumption, assuming for simplicity that f is diﬀerentiable at 0.
Example 7.2. Let X1 , X2 , . . . be i.i.d. with P (Xi = 1) = P (Xi = −1) = 1/2,
let Sn = X1 + · · · + Xn , and let τ = inf {n ≥ 1 : Sn = 1}. In Chapter 3 (see the
discussion after (3.4)) we will show
P(τ > 2n) ∼ π^{−1/2} n^{−1/2}  as n → ∞

Let τ_1, τ_2, . . . be independent with the same distribution as τ, and let T_n = τ_1 + · · · + τ_n. Results in Section 3.1 imply that T_n has the same distribution
as the nth time Sm hits 0. We claim that Tn /n2 converges to the stable law
with α = 1/2, κ = 1 and note that this is the key to the derivation of (7.14).
To prove the claim, note that in (7.7) (i) holds with θ = 1 and (ii) holds with
α = 1/2. The scaling constant an ∼ Cn2 . Since α < 1, Exercise 7.2 implies the
centering constant is unnecessary.
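The tail asymptotics quoted above can be checked by hand: a classical reflection-principle computation (not carried out in the text; we use it here as a known fact) gives P(τ > 2n) = P(S_{2n} = 0) = C(2n, n) 4^{−n}, which is ∼ (πn)^{−1/2} by Stirling's formula, with relative error of order 1/(8n). A short numerical sketch:

```python
import math
from fractions import Fraction

def tail(n):
    # P(tau > 2n) = P(S_{2n} = 0) = C(2n, n) / 4^n  (reflection principle)
    # Fraction keeps the huge integers exact before the final float conversion
    return float(Fraction(math.comb(2 * n, n), 4 ** n))

for n in (50, 200, 800):
    approx = (math.pi * n) ** -0.5   # the asymptotic pi^{-1/2} n^{-1/2}
    # Stirling's series gives ratio = 1 - 1/(8n) + O(n^-2)
    assert abs(tail(n) / approx - 1) < 1 / (4 * n)
```

The bound 1/(4n) is a generous cover for the first Stirling correction term; the agreement visibly tightens as n grows.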
Example 7.3. Assume n objects Xn,1 , . . . , Xn,n are placed independently and
at random in [−n, n]. Let
F_n = Σ_{m=1}^n sgn(X_{n,m})/|X_{n,m}|^p

be the net force exerted on 0. We will now show that if p > 1/2, then

lim_{n→∞} E exp(itF_n) = exp(−c|t|^{1/p})

To do this, it is convenient to let X_{n,m} = nY_m where the Y_i are i.i.d. uniform on [−1, 1]. Then

F_n = n^{−p} Σ_{m=1}^n sgn(Y_m)/|Y_m|^p

Letting Z_m = sgn(Y_m)/|Y_m|^p, Z_m is symmetric about 0 with P(|Z_m| > x) = P(|Y_m| < x^{−1/p}), so in (7.7) (i) holds with θ = 1/2 and (ii) holds with α = 1/p. The scaling constant a_n ∼ Cn^p and the centering constant is 0 by symmetry.
Exercise 7.3. Show that (i) if p < 1/2 then F_n/n^{1/2−p} ⇒ cχ, and (ii) if p = 1/2 then F_n/(log n)^{1/2} ⇒ cχ.
Example 7.4. In the examples above, we have had bn = 0. To get a feel for
the centering constants consider X1 , X2 , . . . i.i.d. with
P(X_i > x) = θ x^{−α}      P(X_i < −x) = (1 − θ) x^{−α}      for x ≥ 1

where 0 < α < 2. In this case a_n = n^{1/α} and

b_n = n ∫_1^{n^{1/α}} (2θ − 1) α x^{−α} dx ∼ cn if α > 1;  cn log n if α = 1;  cn^{1/α} if α < 1

When α < 1 the centering is the same size as the scaling and can be ignored.
When α > 1, bn ∼ nµ where µ = EXi .
Our next result explains the name stable laws. A random variable Y is
said to have a stable law if for every integer k > 0 there are constants a_k and b_k so that if Y_1, . . . , Y_k are i.i.d. and have the same distribution as Y, then (Y_1 + · · · + Y_k − b_k)/a_k =_d Y. The last definition makes half of the next result
obvious.
(7.15) Theorem. Y is the limit of (X1 +· · ·+Xk −bk )/ak for some i.i.d. sequence
Xi if and only if Y has a stable law.
Proof If Y has a stable law we can take X1 , X2 , . . . i.i.d. with distribution Y .
To go the other way, let

Z_n = (X_1 + · · · + X_n − b_n)/a_n

and S_n^j = X_{(j−1)n+1} + · · · + X_{jn}. A little arithmetic shows

Z_{nk} = (S_n^1 + · · · + S_n^k − b_{nk})/a_{nk}
a_{nk} Z_{nk} = (S_n^1 − b_n) + · · · + (S_n^k − b_n) + (kb_n − b_{nk})
a_{nk} Z_{nk}/a_n = (S_n^1 − b_n)/a_n + · · · + (S_n^k − b_n)/a_n + (kb_n − b_{nk})/a_n

The first k terms on the right-hand side ⇒ Y_1 + · · · + Y_k as n → ∞, where Y_1, . . . , Y_k are independent and have the same distribution as Y, and Z_{nk} ⇒ Y. So the desired result follows from the following. Take W_n = Z_{nk} and

W_n' = (a_{kn}/a_n) Z_{nk} − (kb_n − b_{nk})/a_n

(7.16) The convergence of types theorem. If W_n ⇒ W and there are
constants α_n > 0, β_n so that W_n' = α_n W_n + β_n ⇒ W', where W and W' are nondegenerate, then there are constants α and β so that α_n → α and β_n → β.
Proof Let ϕ_n(t) = E exp(itW_n) and

ψ_n(t) = E exp(it(α_n W_n + β_n)) = exp(itβ_n) ϕ_n(α_n t)

If ϕ and ψ are the characteristic functions of W and W', then

(a)  ϕ_n(t) → ϕ(t)      ψ_n(t) = exp(itβ_n) ϕ_n(α_n t) → ψ(t)

Take a subsequence α_{n(m)} that converges to a limit α ∈ [0, ∞]. Our first step is to observe that α = 0 is impossible. If this happens, then using the uniform convergence proved in Exercise 3.16,

(b)  |ψ_n(t)| = |ϕ_n(α_n t)| → 1

so |ψ(t)| ≡ 1, and the limit is degenerate by (5.1). Letting t = u/α_n and interchanging the roles of ϕ and ψ shows α = ∞ is impossible. If α is a subsequential limit, then arguing as in (b) gives |ψ(t)| = |ϕ(αt)|. If there are two subsequential limits α' < α, using the last equation for both limits implies |ϕ(u)| = |ϕ(uα'/α)|. Iterating gives |ϕ(u)| = |ϕ(u(α'/α)^k)| → 1 as k → ∞, contradicting our assumption that W is nondegenerate, so α_n → α ∈ [0, ∞).

To conclude that β_n → β, we observe that (ii) of Exercise 3.16 implies ϕ_n → ϕ uniformly on compact sets, so ϕ_n(α_n t) → ϕ(αt). If δ is small enough so that |ϕ(αt)| > 0 for |t| ≤ δ, it follows from (a) and another use of Exercise 3.16 that

exp(itβ_n) = ψ_n(t)/ϕ_n(α_n t) → ψ(t)/ϕ(αt)

uniformly on [−δ, δ]. exp(itβ_n) is the ch.f. of a point mass at β_n. Using (3.5) now as in the proof of (3.4), it follows that the sequence of distributions that are point masses at β_n is tight, i.e., β_n is bounded. If β_{n(m)} → β then exp(itβ) = ψ(t)/ϕ(αt) for |t| ≤ δ, so there can only be one subsequential limit.
(7.15) justiﬁes calling the distributions with characteristic functions given
by (7.13) or (7.12) stable laws. To complete the story, we should mention that
these are the only stable laws. Again, see Chapter 9 of Breiman (1968) or Gnedenko and Kolmogorov (1954). The next example shows that it is sometimes
useful to know what all the possible limits are.
Example 7.5. The Holtsmark distribution. (α = 3/2, κ = 0.) Suppose stars are distributed in space according to a Poisson process with density t and their masses are i.i.d. Let X_t be the x-component of the gravitational force at 0 when the density is t. A change of density 1 → t corresponds to a change of length 1 → t^{−1/3}, and gravitational attraction follows an inverse square law, so

(7.17)  X_t =_d t^{2/3} X_1

If we imagine thinning the Poisson process by rolling an n-sided die, then Exercise 6.12 implies

X_t =_d X_{t/n}^1 + · · · + X_{t/n}^n
where the random variables on the righthand side are independent and have
the same distribution as Xt/n . It follows from (7.15) that Xt has a stable law.
The scaling property (7.17) implies α = 3/2. Since Xt =d −Xt , κ = 0.
Exercises
Exercise 7.4. Let Y be a stable law with κ = 1. Use the limit theorem (7.7)
to conclude that Y ≥ 0 if α < 1.
Exercise 7.5. Let X be symmetric stable with index α. (i) Use (3.5) to show that E|X|^p < ∞ for p < α. (ii) Use the second proof of (7.3) to show that P(|X| ≥ x) ≥ Cx^{−α}, so E|X|^α = ∞.
Exercise 7.6. Let Y, Y1 , Y2 , . . . be independent and have a stable law with
index α. (7.15) implies there are constants αk and βk so that Y1 + · · · + Yk
and αk Y + βk have the same distribution. Use the proof of (7.15), (7.7) and
Exercise 7.2 to conclude that (i) α_k = k^{1/α}, and (ii) if α < 1 then β_k = 0.
Exercise 7.7. Let Y be a stable law with index α < 1 and κ = 1. Exercise 7.4
implies that Y ≥ 0, so we can deﬁne its Laplace transform ψ (λ) = E exp(−λY ).
The previous exercise implies that for any integer n ≥ 1 we have ψ(λ)^n = ψ(n^{1/α} λ). Use this to conclude E exp(−λY) = exp(−cλ^α).
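The functional equation in Exercise 7.7 really does hold for the claimed answer: if ψ(λ) = exp(−cλ^α), then ψ(λ)^n = exp(−cnλ^α) = ψ(n^{1/α}λ). A two-line numerical sanity check (α and c are arbitrary illustrative values of ours with 0 < α < 1):

```python
import math

alpha, c = 0.5, 1.3          # illustrative values: 0 < alpha < 1, c > 0
psi = lambda lam: math.exp(-c * lam ** alpha)

for n in (2, 5, 17):
    for lam in (0.1, 0.7, 3.0):
        # the relation from Exercise 7.7: psi(lambda)^n = psi(n^{1/alpha} * lambda)
        assert abs(psi(lam) ** n - psi(n ** (1 / alpha) * lam)) < 1e-12
```

Of course this only verifies that exp(−cλ^α) is *a* solution; the exercise asks for the converse, that the functional equation forces this form.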
Exercise 7.8. (i) Show that if X is symmetric stable with index α and Y ≥ 0
is an independent stable with index β < 1, then XY^{1/α} is symmetric stable with index αβ. (ii) Let W_1 and W_2 be independent standard normals. Check that 1/W_2² has the density given in (7.14), and use this to conclude that W_1/W_2 has a Cauchy distribution.

*2.8. Infinitely Divisible Distributions
In the last section, we identiﬁed the distributions that can appear as the limit
of normalized sums of i.i.d.r.v.’s. In this section, we will describe those that are
limits of sums
(∗) Sn = Xn,1 + · · · + Xn,n where the Xn,m are i.i.d. Note the verb “describe.” We will prove almost
nothing in this section, just state some of the most important facts to bring the
reader up to cocktail party literacy.
A suﬃcient condition for Z to be a limit of sums of the form (∗) is that Z has
an inﬁnitely divisible distribution, i.e., for each n there is an i.i.d. sequence
Yn,1 , . . . , Yn,n so that
Z =_d Y_{n,1} + · · · + Y_{n,n}
Our ﬁrst result shows that this condition is also necessary.
(8.1) Theorem. Z is a limit of sums of type (∗) if and only if Z has an inﬁnitely
divisible distribution.
Proof As remarked above, we only have to prove necessity. Write

S_{2n} = (X_{2n,1} + · · · + X_{2n,n}) + (X_{2n,n+1} + · · · + X_{2n,2n}) ≡ Y_n + Y_n'

The random variables Y_n and Y_n' are independent and have the same distribution. If S_n ⇒ Z then the distributions of Y_n are a tight sequence since

P(Y_n > y)² = P(Y_n > y) P(Y_n' > y) ≤ P(S_{2n} > 2y)

and similarly P(Y_n < −y)² ≤ P(S_{2n} < −2y). If we take a subsequence n_k so that Y_{n_k} ⇒ Y (and hence Y_{n_k}' ⇒ Y') then Z =_d Y + Y'. A similar argument shows that Z can be divided into n > 2 pieces, and the proof is complete.
With (8.1) established, we turn now to examples. In the ﬁrst three cases,
the distribution is inﬁnitely divisible because it is a limit of sums of the form
(∗). The number gives the relevant limit theorem.
Example 8.1. Normal distribution. (4.1)
Example 8.2. Stable Laws. (7.7)
Example 8.3. Poisson distribution. (6.1)
Example 8.4. Compound Poisson distribution. Let ξ1 , ξ2 , . . . be i.i.d. and
N (λ) be an independent Poisson r.v. with mean λ. Then Z = ξ1 + · · · + ξN (λ)
has an inﬁnitely divisible distribution. (Let Xn,j =d ξ1 + · · · + ξN (λ/n) .) For
developments below, we would like to observe that if ϕ(t) = E exp(itξi ) then
(8.2)  E exp(itZ) = Σ_{n=0}^∞ e^{−λ} (λ^n/n!) ϕ(t)^n = exp(−λ(1 − ϕ(t)))

Exercise 8.1. Show that the gamma distribution is infinitely divisible.
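Formula (8.2) is easy to test by simulation. Take ξ_i = ±1 with probability 1/2 each, so that ϕ(t) = cos t and (8.2) predicts E exp(itZ) = exp(−λ(1 − cos t)), which is real by symmetry. A hedged sketch with arbitrary parameter choices of ours (λ = 2, t = 1):

```python
import math
import random

random.seed(1)
lam, t, reps = 2.0, 1.0, 20000

def compound_poisson():
    # Z = xi_1 + ... + xi_N with N ~ Poisson(lam) and xi_i = +/-1, prob 1/2 each;
    # N is sampled by counting unit-rate exponential interarrivals below lam
    n, s = 0, random.expovariate(1.0)
    while s < lam:
        n += 1
        s += random.expovariate(1.0)
    return sum(random.choice((-1, 1)) for _ in range(n))

# empirical real part of E exp(itZ); the imaginary part vanishes by symmetry
emp = sum(math.cos(t * compound_poisson()) for _ in range(reps)) / reps
exact = math.exp(-lam * (1 - math.cos(t)))   # the prediction of (8.2)
assert abs(emp - exact) < 0.03
```

The tolerance 0.03 is about four standard errors for 20000 replicates, so the assertion is a loose statistical check, not an identity.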
The next two exercises give examples of distributions that are not inﬁnitely
divisible.
Exercise 8.2. Show that the distribution of a bounded r.v. Z is inﬁnitely
divisible if and only if Z is constant. Hint: Show var(Z ) = 0.
Exercise 8.3. Show that if µ is inﬁnitely divisible, its ch.f. ϕ never vanishes.
Hint: Look at ψ = |ϕ|², which is also infinitely divisible, to avoid taking nth roots of complex numbers; then use Exercise 3.20.
Example 8.4 is a son of 8.3 but a father of 8.1 and 8.2. To explain this
remark, we observe that if ξ = ε and −ε with probability 1/2 each, then ϕ(t) = (e^{iεt} + e^{−iεt})/2 = cos(εt). So if λ = ε^{−2}, then (8.2) implies

E exp(itZ) = exp(−ε^{−2}(1 − cos(εt))) → exp(−t²/2)

as ε → 0. In words, the normal distribution is a limit of compound Poisson
distributions. To see that Example 8.2 is also a special case (using the notation
from the proof of (7.7)), let
I_n(ε) = {m ≤ n : |X_m| > ε a_n}      Ŝ_n(ε) = Σ_{m ∈ I_n(ε)} X_m      S̄_n(ε) = S_n − Ŝ_n(ε)

If ε_n → 0 then S̄_n(ε_n)/a_n ⇒ 0. If ε is fixed, then as n → ∞ we have |I_n(ε)| ⇒ Poisson(ε^{−α}) and Ŝ_n(ε)/a_n ⇒ a compound Poisson distribution:

E exp(it Ŝ_n(ε)/a_n) → exp(−ε^{−α}{1 − ψ^ε(t)})

Combining the last two observations and using the proof of (7.7) shows that
stable laws are limits of compound Poisson distributions. The formula (7.12)
for the limiting ch.f.

(8.3)  exp( itc + ∫_0^∞ (e^{itx} − 1 − itx/(1+x²)) θ α x^{−(α+1)} dx + ∫_{−∞}^0 (e^{itx} − 1 − itx/(1+x²)) (1−θ) α |x|^{−(α+1)} dx )

helps explain:

(8.4) Lévy–Khinchin Theorem. Z has an infinitely divisible distribution if and only if its characteristic function has

log ϕ(t) = ict − σ²t²/2 + ∫ (e^{itx} − 1 − itx/(1+x²)) μ(dx)

where μ is a measure with μ({0}) = 0 and ∫ x²/(1+x²) μ(dx) < ∞.

For a proof, see Breiman (1968), Section
XVII.2. μ is called the Lévy measure of the distribution. Comparing with (8.3) and recalling the proof of (7.7) suggests the following interpretation of μ: if σ² = 0, then Z can be built up by making a Poisson process on R with mean measure μ and then summing up the points. As in the case of stable laws, we have to sum the points in [−ε, ε]^c, subtract an appropriate constant, and let ε → 0.
Exercise 8.4. What is the Lévy measure for the limit ℵ in part (iii) of Exercise 4.13?
The theory of inﬁnitely divisible distributions is simpler in the case of ﬁnite
variance. In this case, we have:
(8.5) Kolmogorov’s Theorem. Z has an inﬁnitely divisible distribution with
mean 0 and ﬁnite variance if and only if its ch.f. has
log ϕ(t) = ∫ (e^{itx} − 1 − itx) x^{−2} ν(dx)

Here the integrand is −t²/2 at x = 0; ν is called the canonical measure and var(Z) = ν(R). To explain the formula, note that if Z_λ has a Poisson distribution with mean λ, then

E exp(itx(Z_λ − λ)) = exp(λ(e^{itx} − 1 − itx))

so the measure for Z = x(Z_λ − λ) has ν({x}) = λx².
*2.9. Limit Theorems in R^d

Let X = (X_1, . . . , X_d) be a random vector. We define its distribution function by F(x) = P(X ≤ x). Here x ∈ R^d, and X ≤ x means X_i ≤ x_i for
i = 1, . . . , d. As in one dimension, F has three obvious properties:
(i) It is nondecreasing, i.e., if x ≤ y then F (x) ≤ F (y ).
(ii) lim_{x→∞} F(x) = 1, lim_{x_i→−∞} F(x) = 0.
(iii) F is right continuous, i.e., lim_{y↓x} F(y) = F(x).
Here x → ∞ means each coordinate xi goes to ∞, xi → −∞ means we let
xi → −∞ keeping the other coordinates ﬁxed, and y ↓ x means each coordinate
yi ↓ xi .
In one dimension, any function with properties (i)–(iii) is the distribution
of some random variable. See (1.2) in Chapter 1. In d ≥ 2, this is not the case.
Suppose d = 2 and let a1 < b1 , a2 < b2 .
P (X ∈ (a1 , b1 ] × (a2 , b2 ]) = F (b1 , b2 ) − F (a1 , b2 ) − F (b1 , a2 ) + F (a1 , a2 )
so if F is going to be a distribution function the last quantity has to be ≥ 0.
The next example shows that this is not guaranteed by (i)–(iii):

F(x_1, x_2) = 1 if x_1, x_2 ≥ 1;  2/3 if x_1 ≥ 1 and 0 ≤ x_2 < 1;  2/3 if x_2 ≥ 1 and 0 ≤ x_1 < 1;  0 otherwise
If 0 < a1 , a2 < 1 ≤ b1 , b2 < ∞, then
F (b1 , b2 ) − F (a1 , b2 ) − F (b1 , a2 ) + F (a1 , a2 ) = 1 − 2/3 − 2/3 + 0 = −1/3
A little thought reveals that F is the distribution function of the measure with
μ({(0, 1)}) = μ({(1, 0)}) = 2/3      μ({(1, 1)}) = −1/3

To formulate the additional condition needed to guarantee that F is the distribution function of a probability measure, let

A = (a_1, b_1] × · · · × (a_d, b_d]
V = {a_1, b_1} × · · · × {a_d, b_d}

be the vertices of the rectangle A. If v ∈ V, let

sgn(v) = (−1)^{# of a's in v}

The inclusion–exclusion formula implies

P(X ∈ A) = Σ_{v ∈ V} sgn(v) F(v)

So if we use ∆_A F to denote the right-hand side, we need
(iv) ∆A F ≥ 0 for all rectangles A.
The last condition guarantees that the measure assigned to each rectangle is
≥ 0. A standard result from measure theory (see (1.6) in the Appendix) now
implies there is a unique probability measure with distribution F.
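The failure of (i)–(iii) to imply (iv) in the counterexample above is a finite computation, and the inclusion–exclusion quantity ∆_A F can be coded directly for d = 2. A minimal sketch (the function names are ours):

```python
def F(x1, x2):
    # the two-dimensional counterexample: satisfies (i)-(iii)
    # but is not the distribution function of any probability measure
    if x1 >= 1 and x2 >= 1:
        return 1.0
    if x1 >= 1 and 0 <= x2 < 1:
        return 2 / 3
    if x2 >= 1 and 0 <= x1 < 1:
        return 2 / 3
    return 0.0

def delta(F, a1, b1, a2, b2):
    # inclusion-exclusion mass that F assigns to the rectangle (a1,b1] x (a2,b2]
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

# condition (iv) fails: this rectangle receives "mass" -1/3
assert abs(delta(F, 0.5, 1, 0.5, 1) - (-1 / 3)) < 1e-12

# by contrast, a genuine product d.f. (two independent Uniform(0,1) coordinates)
# assigns every rectangle nonnegative mass
F0 = lambda x1, x2: min(max(x1, 0), 1) * min(max(x2, 0), 1)
assert delta(F0, 0.2, 0.8, 0.1, 0.9) >= 0
```

The four-term alternating sum is exactly ∆_A F for d = 2; in d dimensions it becomes the 2^d-term sum over the vertices of A.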
Exercise 9.1. If F is the distribution of (X1 , . . . , Xd ) then Fi (x) = P (Xi ≤ x)
are its marginal distributions. How can they be obtained from F ?
Exercise 9.2. Let F1 , . . . , Fd be distributions on R. Show that for any α ∈
[−1, 1]
F(x_1, . . . , x_d) = {1 + α ∏_{i=1}^d (1 − F_i(x_i))} ∏_{j=1}^d F_j(x_j)

is a d.f. with the given marginals. The case α = 0 corresponds to independent
r.v.’s.
Exercise 9.3. A distribution F is said to have a density f if
F(x_1, . . . , x_k) = ∫_{−∞}^{x_1} · · · ∫_{−∞}^{x_k} f(y) dy_k · · · dy_1

Show that if f is continuous, then ∂^k F/∂x_1 · · · ∂x_k = f.
If Fn and F are distribution functions on Rd , we say that Fn converges
weakly to F , and write Fn ⇒ F , if Fn (x) → F (x) at all continuity points of
F . Our ﬁrst task is to show that there are enough continuity points for this to
be a sensible deﬁnition. For a concrete example, consider
F(x, y) = 1 if x ≥ 0, y ≥ 1;  y if x ≥ 0, 0 ≤ y < 1;  0 otherwise

F is the distribution function of (0, Y) where Y is uniform on (0, 1). Notice that
this distribution has no atoms, but F is discontinuous at (0, y ) when y > 0.
Keeping the last example in mind, observe that if xn < x, i.e., xn,i < xi
for all coordinates i, and xn ↑ x as n → ∞ then
F(x) − F(x_n) = P(X ≤ x) − P(X ≤ x_n) ↓ P(X ≤ x) − P(X < x)
In d = 2, the last expression is the probability X lies in
{(a, x2 ) : a ≤ x1 } ∪ {(x1 , b) : b ≤ x2 }
Let H_c^i = {x : x_i = c} be the hyperplane where the ith coordinate is c. For each i, the H_c^i are disjoint, so D_i = {c : P(X ∈ H_c^i) > 0} is at most countable. It is easy to see that if x has x_i ∉ D_i for all i, then F is continuous at x. This gives us more than enough points to reconstruct F.
As in Section 2.2, it will be useful to have several equivalent deﬁnitions of
weak convergence. In Chapter 7, we will need to know that this is valid for an
arbitrary metric space (S, ρ), so we will prove the result in that generality and
insert another equivalence that will be useful there. f is said to be Lipschitz
continuous if there is a constant C so that |f(x) − f(y)| ≤ C ρ(x, y).

(9.1) Theorem. The following statements are equivalent to X_n ⇒ X_∞.
(i) Ef (Xn ) → Ef (X∞ ) for all bounded continuous f.
(ii) Ef (Xn ) → Ef (X∞ ) for all bounded Lipschitz continuous f.
(iii) For all closed sets K , lim supn→∞ P (Xn ∈ K ) ≤ P (X∞ ∈ K ).
(iv) For all open sets G, lim inf n→∞ P (Xn ∈ G) ≥ P (X∞ ∈ G).
(v) For all sets A with P (X∞ ∈ ∂A) = 0, limn→∞ P (Xn ∈ A) = P (X∞ ∈ A).
(vi) Let Df = the set of discontinuities of f . For all bounded functions f with
P(X_∞ ∈ D_f) = 0, we have Ef(X_n) → Ef(X_∞).
Proof We will begin by showing that (i)–(vi) are equivalent. (i) implies (ii): Trivial.
(ii) implies (iii): Let ρ(x, K ) = inf {ρ(x, y ) : y ∈ K }, ϕj (r) = (1 − jr)+ , and
fj (x) = ϕj (ρ(x, K )). fj is Lipschitz continuous, has values in [0,1], and ↓ 1K (x)
as j ↑ ∞. So
lim sup P (Xn ∈ K ) ≤ lim Efj (Xn ) = Efj (X∞ ) ↓ P (X∞ ∈ K ) as j ↑ ∞
n→∞ n→∞ (iii) is equivalent to (iv): As in the proof of (2.3), this follows easily from
two facts: A is open if and only if Ac is closed; P (A) + P (Ac ) = 1.
(iii) and (iv) imply (v): Let K = Ā, G = A°, and reason as in the proof of (2.3).
(v) implies (vi): Suppose |f(x)| ≤ K and pick α_0 < α_1 < . . . < α_ℓ so that P(f(X_∞) = α_i) = 0 for 0 ≤ i ≤ ℓ, α_0 < −K < K < α_ℓ, and α_i − α_{i−1} < ε. This is always possible since {α : P(f(X_∞) = α) > 0} is a countable set. Let A_i = {x : α_{i−1} < f(x) ≤ α_i}. ∂A_i ⊂ {x : f(x) ∈ {α_{i−1}, α_i}} ∪ D_f, so P(X_∞ ∈ ∂A_i) = 0, and it follows from (v) that

Σ_{i=1}^ℓ α_i P(X_n ∈ A_i) → Σ_{i=1}^ℓ α_i P(X_∞ ∈ A_i)

The definition of the α_i implies

0 ≤ Σ_{i=1}^ℓ α_i P(X_n ∈ A_i) − Ef(X_n) ≤ ε  for 1 ≤ n ≤ ∞

Since ε is arbitrary, it follows that Ef(X_n) → Ef(X_∞).

(vi) implies (i): Trivial.
It remains to show that the ﬁve conditions are equivalent to weak convergence (⇒).
(v) implies (⇒) : If F is continuous at x, then A = (−∞, x1 ] × . . . × (−∞, xd ]
has µ(∂A) = 0, so Fn (x) = P (Xn ∈ A) → P (X∞ ∈ A) = F (x).
(⇒) implies (iv): Let D_i = {c : P(X_∞ ∈ H_c^i) > 0} where H_c^i = {x : x_i = c}. We say a rectangle A = (a_1, b_1] × . . . × (a_d, b_d] is good if a_i, b_i ∉ D_i for all
i. (⇒) implies that for all good rectangles P (Xn ∈ A) → P (X∞ ∈ A). This
is also true for B that are a ﬁnite disjoint union of good rectangles. Now any
open set G is an increasing limit of Bk ’s that are a ﬁnite disjoint union of good
rectangles, so
lim inf_{n→∞} P(X_n ∈ G) ≥ lim inf_{n→∞} P(X_n ∈ B_k) = P(X_∞ ∈ B_k) ↑ P(X_∞ ∈ G)

as k → ∞. The proof of (9.1) is complete.
Remark. In Section 2.2, we proved that (i)–(v) are consequences of weak
convergence by constructing r.v’s with the given distributions so that Xn → X∞
a.s. This can be done in Rd (or any complete separable metric space), but the
construction is rather messy. See Billingsley (1979), p. 337–340 for a proof in
Rd .
Exercise 9.4. Let Xn be random vectors. Show that if Xn ⇒ X then the
coordinates Xn,i ⇒ Xi .
A sequence of probability measures μ_n is said to be tight if for any ε > 0 there is an M so that lim inf_{n→∞} μ_n([−M, M]^d) ≥ 1 − ε.

(9.2) Theorem. If μ_n is tight, then there is a weakly convergent subsequence.
Proof Let Fn be the associated distribution functions, and let q1 , q2 , . . . be an
enumeration of Qd = the points in Rd with rational coordinates. By a diagonal
argument like the one in the proof of (2.5), we can pick a subsequence so that
Fn(k) (q ) → G(q ) for all q ∈ Qd . Let
F (x) = inf {G(q ) : q ∈ Qd , q > x}
where q > x means qi > xi for all i. It is easy to see that F is right continuous.
To check that it is a distribution function, we observe that if A is a rectangle
with vertices in Qd then ∆A Fn ≥ 0 for all n, so ∆A G ≥ 0, and taking limits we
see that the last conclusion holds for F for all rectangles A. Tightness implies
that F has properties (i) and (ii) of a distribution F . We leave it to the reader
to check that Fn ⇒ F . The proof of (2.5) works if you read inequalities such
as r1 < r2 < x < s as the corresponding relations between vectors.
The characteristic function of (X1 , . . . , Xd ) is ϕ(t) = E exp(it · X ) where
t · X = t_1X_1 + · · · + t_dX_d is the usual dot product of two vectors.
(9.3) Inversion formula. If A = [a_1, b_1] × . . . × [a_d, b_d] with μ(∂A) = 0, then

μ(A) = lim_{T→∞} (2π)^{−d} ∫_{[−T,T]^d} ∏_{j=1}^d ψ_j(t_j) ϕ(t) dt

where ψ_j(s) = (exp(−isa_j) − exp(−isb_j))/(is).

Proof Fubini's theorem implies

∫_{[−T,T]^d} ∫ ∏_{j=1}^d ψ_j(t_j) exp(it_j x_j) μ(dx) dt = ∫ ∏_{j=1}^d ( ∫_{−T}^T ψ_j(t_j) exp(it_j x_j) dt_j ) μ(dx)

It follows from the proof of (3.2) that

∫_{−T}^T ψ_j(t_j) exp(it_j x_j) dt_j → π ( 1_{(a_j,b_j)}(x_j) + 1_{[a_j,b_j]}(x_j) )

so the desired conclusion follows from the bounded convergence theorem.
distribution on Rd that corresponds to the ch.f. ψ (t1 , . . . , td ) = ϕ(t1 + · · · + td )?
Exercise 9.6. Show that random variables X1 , . . . , Xk are independent if and
only if
ϕ_{X_1,...,X_k}(t) = ∏_{j=1}^k ϕ_{X_j}(t_j)

(9.4) Convergence theorem. Let X_n, 1 ≤ n ≤ ∞, be random vectors with
ch.f. ϕn . A necessary and suﬃcient condition for Xn ⇒ X∞ is that ϕn (t) →
ϕ∞ (t).
Proof exp(it · x) is bounded and continuous, so if Xn ⇒ X∞ then ϕn (t) →
ϕ∞ (t). To prove the other direction it suﬃces, as in the proof of (3.4), to prove
that the sequence is tight. To do this, we observe that if we ﬁx θ ∈ Rd , then for
all s ∈ R, ϕn (sθ) → ϕ∞ (sθ), so it follows from (3.4) that the distributions of
θ · Xn are tight. Applying the last observation to the d unit vectors e1 , . . . , ed
shows that the distributions of X_n are tight and completes the proof.
Remark. As before, if ϕn (t) → ϕ∞ (t) with ϕ∞ (t) continuous at 0, then ϕ∞ (t)
is the ch.f. of some X∞ and Xn ⇒ X∞ .
(9.4) has an important corollary.
(9.5) Cramér–Wold device. A sufficient condition for X_n ⇒ X_∞ is that θ · X_n ⇒ θ · X_∞ for all θ ∈ R^d.
Proof The indicated condition implies E exp(iθ · Xn ) → E exp(iθ · X∞ ) for
all θ ∈ Rd .
(9.5) leads immediately to
(9.6) The central limit theorem in Rd . Let X1 , X2 , . . . be i.i.d. random
vectors with EXn = µ, and ﬁnite covariances
Γij = E ((Xn,i − µi )(Xn,j − µj ))
If S_n = X_1 + · · · + X_n, then (S_n − nμ)/n^{1/2} ⇒ χ, where χ has a multivariate normal distribution with mean 0 and covariance Γ, i.e.,

E exp(iθ · χ) = exp( −Σ_i Σ_j θ_i θ_j Γ_{ij} / 2 )

Proof By considering X_n' = X_n − μ, we can suppose without loss of generality that μ = 0. Let θ ∈ R^d. θ · X_n is a random variable with mean 0 and variance

E( Σ_i θ_i X_{n,i} )² = Σ_i Σ_j E(θ_i θ_j X_{n,i} X_{n,j}) = Σ_i Σ_j θ_i θ_j Γ_{ij}

so it follows from the one-dimensional central limit theorem and (9.5) that S_n/n^{1/2} ⇒ χ where E exp(iθ · χ) = exp( −Σ_i Σ_j θ_i θ_j Γ_{ij} / 2 ).

To illustrate the use of (9.6), we consider two examples. In each, e_1, . . . , e_d
are the d unit vectors.
Example 9.1. Simple random walk on Zd . Let X1 , X2 , . . . be i.i.d. with
P(X_n = +e_i) = P(X_n = −e_i) = 1/(2d)  for i = 1, . . . , d

EX_n^i = 0, and if i ≠ j then E X_n^i X_n^j = 0, since both components cannot be nonzero simultaneously. Since E(X_n^i)² = 2 · 1/(2d) = 1/d, the covariance matrix is Γ = (1/d)I.

Example 9.2. Let X_1, X_2, . . . be i.i.d. with P(X_n = e_i) = 1/6 for i =
1, 2, . . . , 6. In words, we are rolling a die and keeping track of the numbers
that come up. EX_{n,i} = 1/6 and E X_{n,i}X_{n,j} = 0 for i ≠ j, so Γ_{ij} = (1/6)(5/6) when i = j and Γ_{ij} = −(1/6)² when i ≠ j. In this case, the limiting distribution is concentrated on {x : Σ_i x_i = 0}.
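Both covariance matrices can be computed exactly from the definitions, since each step distribution is supported on finitely many points. A sketch in exact rational arithmetic (the helper function is ours):

```python
from fractions import Fraction as Fr

def exact_cov(outcomes):
    # outcomes: list of (vector, probability); returns the exact covariance matrix
    d = len(outcomes[0][0])
    mean = [sum(p * v[i] for v, p in outcomes) for i in range(d)]
    return [[sum(p * v[i] * v[j] for v, p in outcomes) - mean[i] * mean[j]
             for j in range(d)] for i in range(d)]

# Example 9.1: one step of simple random walk on Z^d (here d = 3)
d = 3
walk = [([Fr(s) if k == i else Fr(0) for k in range(d)], Fr(1, 2 * d))
        for i in range(d) for s in (1, -1)]
cov_walk = exact_cov(walk)
assert cov_walk[0][0] == Fr(1, d) and cov_walk[0][1] == 0

# Example 9.2: indicator vector of a die roll in R^6
die = [([Fr(1) if k == i else Fr(0) for k in range(6)], Fr(1, 6))
       for i in range(6)]
cov_die = exact_cov(die)
assert cov_die[0][0] == Fr(5, 36) and cov_die[0][1] == Fr(-1, 36)
# every row sums to 0: the limit law lives on the hyperplane {x : sum_i x_i = 0}
assert all(sum(row) == 0 for row in cov_die)
```

The zero row sums in the die example are the finite-dimensional reflection of the degeneracy noted in the text: the covariance matrix has rank 5, not 6.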
Our treatment of the central limit theorem would not be complete without
some discussion of the multivariate normal distribution. We begin by observing
that Γ_{ij} = Γ_{ji} and, if EX_i = 0 and EX_iX_j = Γ_{ij},

Σ_i Σ_j θ_i θ_j Γ_{ij} = E( Σ_i θ_i X_i )² ≥ 0

so Γ is symmetric and nonnegative definite. A well-known result implies that
there is an orthogonal matrix U (i.e., one with U t U = I , the identity matrix) so
that Γ = U t V U , where V ≥ 0 is a diagonal matrix. Let W be the nonnegative
diagonal matrix with W 2 = V . If we let A = W U , then Γ = At A. Let Y
be a ddimensional vector whose components are independent and have normal
distributions with mean 0 and variance 1. If we view vectors as 1 × d matrices
and let χ = Y A, then χ has the desired normal distribution. To check this,
observe that
θ · YA = Σ_i θ_i Σ_j Y_j A_{ji}

has a normal distribution with mean 0 and variance

Σ_j ( Σ_i A_{ji} θ_i )² = Σ_j ( Σ_i A_{ji} θ_i )( Σ_k A_{jk} θ_k ) = θ AᵗA θᵗ = θ Γ θᵗ

so E(exp(iθ · χ)) = exp(−(θΓθᵗ)/2).
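Only the factorization Γ = AᵗA is used in this construction, so any such A works, not just the one built from the spectral decomposition; a Cholesky factor does the same job. A minimal sketch for a 2 × 2 covariance (the numerical entries are arbitrary illustrative values of ours):

```python
import math

# target covariance: symmetric and positive definite
G = [[2.0, 0.6],
     [0.6, 1.0]]

# hand-rolled 2x2 Cholesky factorization: lower-triangular L with L L^T = G
l11 = math.sqrt(G[0][0])
l21 = G[1][0] / l11
l22 = math.sqrt(G[1][1] - l21 ** 2)
L = [[l11, 0.0], [l21, l22]]

# with A = L^T we get A^T A = L L^T = G, so chi = Y A has covariance G
# whenever Y has i.i.d. standard normal components
LLt = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
for i in range(2):
    for j in range(2):
        assert abs(LLt[i][j] - G[i][j]) < 1e-12
```

In practice one would call numpy.linalg.cholesky, which returns exactly this lower-triangular factor; the hand-rolled version just makes the algebra visible.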
If the covariance matrix has rank d, we say that the normal distribution is
nondegenerate. In this case, its density function is given by

(2π)^{−d/2} (det Γ)^{−1/2} exp( −Σ_{i,j} y_i (Γ^{−1})_{ij} y_j / 2 )
The joint distribution in degenerate cases can be computed by using a linear
transformation to reduce to the nondegenerate case. For instance, in Example
9.2 we can look at the distribution of (X1 , . . . , X5 ).
Exercise 9.7. Suppose (X1 , . . . , Xd ) has a multivariate normal distribution
with mean vector θ and covariance Γ. Show X1 , . . . , Xd are independent if and
only if Γij = 0 for i = j . In words, uncorrelated random variables with a joint
normal distribution are independent.
Exercise 9.8. Show that (X1 , . . . , Xd ) has a multivariate normal distribution
with mean vector θ and covariance Γ if and only if every linear combination
c_1X_1 + · · · + c_dX_d has a normal distribution with mean cθᵗ and variance cΓcᵗ.

3 Random Walks

Let X_1, X_2, . . . be i.i.d. taking values in R^d and let S_n = X_1 + . . . + X_n. S_n
is a random walk. In the last chapter, we were primarily concerned with
the distribution of Sn . In this one, we will look at properties of the sequence
S1 (ω ), S2 (ω ), . . . For example, does the last sequence return to (or near) 0 inﬁnitely often? The ﬁrst section introduces stopping times, a concept that will
be very important in this and the next two chapters. After the ﬁrst section is
completed, the remaining three can be read in any order or skipped without
much loss. The second section is not starred since it contains some basic facts
about random walks.

3.1. Stopping Times
Most of the results in this section are valid for i.i.d. X ’s taking values in some
nice measurable space (S, S ) and will be proved in that generality. For several
reasons, it is convenient to use the special probability space from the proof of
Kolmogorov’s extension theorem:
Ω = {(ω_1, ω_2, . . .) : ω_i ∈ S}
F = S × S × . . .
P = μ × μ × . . .  where μ is the distribution of X_i
X_n(ω) = ω_n
So, throughout this section, we will suppose (without loss of generality) that
our random variables are constructed on this special space.
Before taking up our main topic, we will prove a 0–1 law that, in the i.i.d. case, generalizes Kolmogorov's. To state the new 0–1 law we need two
deﬁnitions. A ﬁnite permutation of N = {1, 2, . . .} is a map π from N onto
N so that π (i) = i for only ﬁnitely many i. If π is a ﬁnite permutation of N and
ω ∈ S N we deﬁne (πω )i = ωπ(i) . In words, the coordinates of ω are rearranged
according to π . Since Xi (ω ) = ωi this is the same as rearranging the random
variables. An event A is permutable if π −1 A ≡ {ω : πω ∈ A} is equal to A
for any finite permutation π, or in other words, if its occurrence is not affected
by rearranging the random variables. The collection of permutable events is a
σ-field. It is called the exchangeable σ-field and denoted by E.
To see the reason for interest in permutable events, suppose S = R and let
Sn (ω ) = X1 (ω ) + · · · + Xn (ω ). Two examples of permutable events are
(i) {ω : Sn (ω ) ∈ B i.o.}
(ii) {ω : lim supn→∞ Sn (ω )/cn ≥ 1}
In each case, the event is permutable because Sn (ω ) = Sn (πω ) for large n. The
list of examples can be enlarged considerably by observing:
(iii) All events in the tail σ ﬁeld T are permutable.
To see this, observe that if A ∈ σ (Xn+1 , Xn+2 , . . .) then the occurrence of A is
unaﬀected by a permutation of X1 , . . . , Xn . (i) shows that the converse of (iii)
is false. The next result shows that for an i.i.d. sequence there is no diﬀerence
between E and T . They are both trivial.
(1.1) Hewitt–Savage 0–1 law. If X_1, X_2, . . . are i.i.d. and A ∈ E then P(A) ∈
{0, 1}.
Proof Let A ∈ E. As in the proof of Kolmogorov's 0-1 law, we will show A
is independent of itself, i.e., P (A) = P (A ∩ A) = P (A)P (A), so P (A) ∈ {0, 1}.
Let An ∈ σ(X1 , . . . , Xn ) so that
(a) P (An ∆A) → 0
Here A∆B = (A − B) ∪ (B − A) is the symmetric difference. The existence
of the An 's is proved in Exercise 3.1 in the Appendix. An can be written as
{ω : (ω1 , . . . , ωn ) ∈ Bn } with Bn ∈ S^n. Let
π(j) = j + n  if 1 ≤ j ≤ n
π(j) = j − n  if n + 1 ≤ j ≤ 2n
π(j) = j      if j ≥ 2n + 1
Observing that π² is the identity (so we don't have to worry about whether to
write π or π⁻¹) and the coordinates are i.i.d. (so the permuted coordinates are)
gives
(b) P (ω : ω ∈ An ∆A) = P (ω : πω ∈ An ∆A)
Now {ω : πω ∈ A} = {ω : ω ∈ A}, since A is permutable, and
{ω : πω ∈ An } = {ω : (ω_{n+1} , . . . , ω_{2n} ) ∈ Bn }
If we use A′n to denote the last event then we have
(c) {ω : πω ∈ An ∆A} = {ω : ω ∈ A′n ∆A}
Combining (b) and (c) gives
(d) P (A′n ∆A) = P (An ∆A)
Now A − C ⊂ (A − B) ∪ (B − C), and with the similar inequality for C − A this
implies A∆C ⊂ (A∆B) ∪ (B∆C). The last inequality, (d), and (a) imply
P (An ∆A′n ) ≤ P (An ∆A) + P (A∆A′n ) → 0
The last result implies
0 ≤ P (An ) − P (An ∩ A′n ) ≤ P (An ∪ A′n ) − P (An ∩ A′n ) = P (An ∆A′n ) → 0
so P (An ∩ A′n ) → P (A). But An and A′n are independent, so
P (An ∩ A′n ) = P (An )P (A′n ) → P (A)²
(Recall P (A′n ) = P (An ).) This shows P (A) = P (A)², and proves (1.1).
A typical application of (1.1) is
(1.2) Theorem. For a random walk on R, there are only four possibilities, one
of which has probability one.
(i) Sn = 0 for all n.
(ii) Sn → ∞.
(iii) Sn → −∞.
(iv) −∞ = lim inf Sn < lim sup Sn = ∞.
Proof (1.1) implies lim sup Sn is a constant c ∈ [−∞, ∞] almost surely. Let
S′n = S_{n+1} − X1. Since (S′n ) has the same distribution as (Sn ), lim sup S′n is
the same constant c; but lim sup S′n = (lim sup S_{n+1} ) − X1 = c − X1 , so c = c − X1.
If c is finite, subtracting c from both sides we conclude X1 ≡ 0 and (i) occurs.
Turning the last statement around, we see that if X1 is not identically 0 then
c = −∞ or ∞. The same analysis applies to the lim inf. Discarding the impossible
combination lim sup Sn = −∞ and lim inf Sn = +∞, we have proved the result.
Exercise 1.1. Symmetric random walk. Let X1 , X2 , . . . ∈ R be i.i.d. with a
distribution that is symmetric about 0 and nondegenerate (i.e., P (Xi = 0) < 1).
Show that we are in case (iv) of (1.2).
Exercise 1.2. Let X1 , X2 , . . . be i.i.d. with EXi = 0 and EXi2 = σ 2 ∈ (0, ∞).
Use the central limit theorem to conclude that we are in case (iv) of (1.2). Later
in Exercise 1.11 you will show that EXi = 0 and P (Xi = 0) < 1 is suﬃcient.
The special case in which P (Xi = 1) = P (Xi = −1) = 1/2 is called simple
random walk. Since a simple random walk cannot skip over any integers,
it follows from either exercise above that with probability one it visits every
integer inﬁnitely many times.
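This recurrence can be previewed numerically. A minimal sketch, using only the exact identity P (S_{2n} = 0) = C(2n, n) 2^{−2n} and the asymptotics ρ1 (2n) ∼ (πn)^{−1/2} that reappear in Section 3.2; the horizon of 20000 terms is an arbitrary choice:

```python
import math

# Return probabilities rho(2n) = P(S_{2n} = 0) = C(2n, n) / 4^n for simple
# random walk on Z, via the exact recursion rho(2n+2) = rho(2n)*(2n+1)/(2n+2).
def return_probs(N):
    probs = [1.0]              # rho(0) = 1
    for n in range(N):
        probs.append(probs[-1] * (2 * n + 1) / (2 * n + 2))
    return probs

probs = return_probs(20000)
partial = sum(probs[1:])       # partial sum of a divergent series

# rho(2n) ~ (pi n)^{-1/2}, so the partial sums grow like 2*sqrt(N/pi)
print(probs[10000] * math.sqrt(math.pi * 10000))    # close to 1
print(partial / (2 * math.sqrt(20000 / math.pi)))   # close to 1
```

The divergence of Σ_n P (S_{2n} = 0), visible in the growing partial sums, is exactly criterion (iii) of (2.2) in Section 3.2.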
Let Fn = σ (X1 , . . . , Xn ) = the information known at time n. A random
variable N taking values in {1, 2, . . .} ∪ {∞} is said to be a stopping time or
an optional random variable if for every n < ∞, {N = n} ∈ Fn . If we think
of Sn as giving the (logarithm of the) price of a stock at time n, and N as the
time we sell it, then the last deﬁnition says that the decision to sell at time n
must be based on the information known at that time. The last interpretation
gives one explanation for the second name. N is a time at which we can exercise
an option to buy a stock. Chung prefers the second name because N is “usually
rather a momentary pause after which the process proceeds again: time marches
on!”
The canonical example of a stopping time is N = inf {n : Sn ∈ A}, the
hitting time of A. To check that this is a stopping time, we observe that
{N = n} = {S1 ∈ Ac , . . . , Sn−1 ∈ Ac , Sn ∈ A} ∈ Fn
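The defining property {N = n} ∈ Fn can also be checked mechanically on short paths: whether N = n must be decidable from the first n steps alone. A small brute-force sketch (the target set A = {2} and the horizon of 6 steps are arbitrary illustrative choices):

```python
from itertools import product

# Hitting time N = inf{n >= 1 : S_n in A} along a finite path of steps;
# returns None when the path never enters A within its horizon.
def hitting_time(steps, A):
    s = 0
    for n, x in enumerate(steps, start=1):
        s += x
        if s in A:
            return n
    return None

# Brute-force check on all +-1 paths of length 6: the indicator of
# {N = n} is a function of the first n steps only.
A = {2}
L = 6
for n in range(1, L + 1):
    outcomes = {}  # first n steps -> indicator of {N = n}
    for steps in product([1, -1], repeat=L):
        key = steps[:n]
        ind = hitting_time(steps, A) == n
        assert outcomes.setdefault(key, ind) == ind, "not a stopping time?"
print("ok: {N = n} depends only on the first n steps")
```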
Two concrete examples of hitting times that have appeared above are
Example 1.1. N = inf {k : |Sk | ≥ x} from the proof of (8.2) in Chapter 1.
Example 1.2. If the Xi ≥ 0 and Nt = sup{n : Sn ≤ t} is the random variable
that ﬁrst appeared in Example 7.1 in Chapter 1, then Nt + 1 = inf {n : Sn > t}
is a stopping time.
The next result allows us to construct new examples from the old ones.
Exercise 1.3. If S and T are stopping times then S ∧ T and S ∨ T are stopping
times. Since constant times are stopping times, it follows that S ∧ n and S ∨ n
are stopping times.
Exercise 1.4. Suppose S and T are stopping times. Is S + T a stopping time?
Give a proof or a counterexample.
Associated with each stopping time N is a σ-field FN = the information
known at time N . Formally, FN is the collection of sets A that have A ∩ {N =
n} ∈ Fn for all n < ∞, i.e., when N = n, A must be measurable with respect
to the information known at time n. Trivial but important examples of sets in
FN are {N ≤ n}, i.e., N is measurable with respect to FN .
Exercise 1.5. Show that if Yn ∈ Fn and N is a stopping time, YN ∈ FN .
As a corollary of this result we see that if f : S → R is measurable, Tn =
Σ_{m≤n} f (Xm ), and Mn = max_{m≤n} Tm , then TN and MN ∈ FN . An important
special case is S = R, f (x) = x.
Exercise 1.6. Show that if M ≤ N are stopping times then FM ⊂ FN .
Exercise 1.7. Show that if L ≤ M and A ∈ FL then N defined by
N = L on A,  N = M on A^c
is a stopping time.
Our first result about FN is
(1.3) Theorem. Let X1 , X2 , . . . be i.i.d., Fn = σ (X1 , . . . , Xn ) and N be a
stopping time. Conditional on {N < ∞}, {XN +n , n ≥ 1} is independent of FN
and has the same distribution as the original sequence.
Proof By (2.2) in the Appendix it is enough to show that if A ∈ FN and
Bj ∈ S for 1 ≤ j ≤ k then
P (A, N < ∞, X_{N+j} ∈ Bj , 1 ≤ j ≤ k) = P (A ∩ {N < ∞}) Π_{j=1}^{k} µ(Bj )
where µ(B) = P (Xi ∈ B). The method ("divide and conquer") is one that we
will see many times below. We break things down according to the value of N
in order to replace N by n and reduce to the case of a fixed time:
P (A, N = n, X_{N+j} ∈ Bj , 1 ≤ j ≤ k) = P (A, N = n, X_{n+j} ∈ Bj , 1 ≤ j ≤ k)
= P (A ∩ {N = n}) Π_{j=1}^{k} µ(Bj )
since A ∩ {N = n} ∈ Fn and that σ-field is independent of X_{n+1} , . . . , X_{n+k} .
Summing over n now gives the desired result.
To delve further into properties of stopping times, we recall we have supposed
Ω = S^N and define the shift θ : Ω → Ω by
(θω)(n) = ω(n + 1), n = 1, 2, . . .
In words, we drop the ﬁrst coordinate and shift the others one place to the left.
The iterates of θ are deﬁned by composition. Let θ1 = θ, and for k ≥ 2 let
θk = θ ◦ θk−1 . Clearly, (θk ω )(n) = ω (n + k ), n = 1, 2, . . . To extend the last
definition to stopping times, we let
θN ω = θn ω on {N = n},   θN ω = ∆ on {N = ∞}
Here ∆ is an extra point that we add to Ω. According to the only joke in
Blumenthal and Getoor (1968), ∆ is a “cemetery or heaven depending upon
your point of view.” Seriously, ∆ is a convenience in making deﬁnitions like the
next one.
Example 1.3. Returns to 0. For a concrete example of the use of θ, suppose
S = Rd and let
τ (ω ) = inf {n : ω1 + · · · + ωn = 0}
where inf ∅ = ∞, and we set τ (∆) = ∞. If we let τ2 (ω ) = τ (ω ) + τ (θτ ω ) then
on {τ < ∞},
τ (θτ ω ) = inf {n : (θτ ω )1 + · · · + (θτ ω )n = 0}
= inf {n : ωτ +1 + · · · + ωτ +n = 0}
τ (ω ) + τ (θτ ω ) = inf {m > τ : ω1 + · · · + ωm = 0}
So τ2 is the time of the second visit to 0 (and thanks to the conventions θ∞ ω = ∆
and τ (∆) = ∞, this is true for all ω ). The last computation generalizes easily
to show that if we let
τn (ω ) = τn−1 (ω ) + τ (θτn−1 ω )
then τn is the time of the nth visit to 0.
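It is easy to watch this composition rule act on a concrete path; a minimal sketch in which finite tuples of steps stand in for elements of S^N (the particular path is an arbitrary choice):

```python
# theta drops the first k coordinates; tau is the first time the partial
# sums hit 0.  We use a tuple long enough that every time we inspect is
# found before the sequence runs out.
def theta(omega, k=1):
    return omega[k:]

def tau(omega):
    s = 0
    for n, x in enumerate(omega, start=1):
        s += x
        if s == 0:
            return n
    return float("inf")

omega = (1, -1, 1, 1, -2, 3, -3, 1, -1)   # a concrete sample path of steps
t1 = tau(omega)                            # first visit to 0
t2 = t1 + tau(theta(omega, t1))            # tau_2 = tau + tau(theta^tau omega)
t3 = t2 + tau(theta(omega, t2))            # tau_3

# Visit times read off directly from the partial sums, for comparison:
sums = [sum(omega[:n]) for n in range(1, len(omega) + 1)]
visits = [n + 1 for n, s in enumerate(sums) if s == 0]
print((t1, t2, t3), visits[:3])
```

The iterated formula and the direct scan of the partial sums give the same visit times, as the computation in the text predicts.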
If we have any stopping time T , we can deﬁne its iterates by T0 = 0 and
Tn (ω) = T_{n−1} (ω) + T (θ_{Tn−1} ω) for n ≥ 1
If we assume P = µ × µ × · · · then
(1.4) P (Tn < ∞) = P (T < ∞)^n
Proof We will prove this by induction. The result is trivial when n = 1.
Suppose now that it is valid for n − 1. Applying (1.3) to N = T_{n−1} , we see that
T < ∞, so
P (Tn < ∞) = P (Tn−1 < ∞, T (θTn−1 ω ) < ∞)
= P (Tn−1 < ∞)P (T < ∞) = P (T < ∞)^n
by the induction hypothesis.
Letting tn = T (θTn−1 ), we can extend (1.3) to
(1.5) Theorem. Suppose P (T < ∞) = 1. Then the “random vectors”
Vn = (tn , XTn−1 +1 , . . . , XTn )
are independent and identically distributed.
Proof It is clear from (1.3) that Vn and V1 have the same distribution. The
independence follows from (1.3) and induction since V1 , . . . , Vn−1 ∈ F(Tn−1 ).
Example 1.4. Ladder variables. Let α(ω) = inf {n : ω1 + · · · + ωn > 0}
where inf ∅ = ∞, and set α(∆) = ∞. Let α0 = 0 and let
αk (ω ) = αk−1 (ω ) + α(θαk−1 ω )
for k ≥ 1. At time αk , the random walk is at a record high value. The next
three exercises investigate these times.
Exercise 1.8. (i) If P (α < ∞) < 1 then P (sup Sn < ∞) = 1.
(ii) If P (α < ∞) = 1 then P (sup Sn = ∞) = 1.
Exercise 1.9. Let β = inf {n : Sn < 0}. Prove that the four possibilities
in (1.2) correspond to the four combinations of P (α < ∞) < 1 or = 1, and
P (β < ∞) < 1 or = 1.
Exercise 1.10. Let S0 = 0, β̄ = inf {n ≥ 1 : Sn ≤ 0} and
A_{n,m} = {0 ≥ Sm , S1 ≥ Sm , . . . , S_{m−1} ≥ Sm , Sm < S_{m+1} , . . . , Sm < Sn }
(i) Show 1 = Σ_{m=0}^{n} P (A_{n,m} ) = Σ_{m=0}^{n} P (α > m)P (β̄ > n − m).
(ii) Let n → ∞ and conclude Eα = 1/P (β̄ = ∞).
Exercise 1.11. (i) Combine the last exercise with the proof of (ii) in Exercise
1.8 to conclude that if EXi = 0 then P (β̄ = ∞) = 0. (ii) Show that if we
assume in addition that P (Xi = 0) < 1 then P (β = ∞) = 0 and Exercise 1.9
implies we are in case (iv) of (1.2).
Our ﬁnal result about stopping times is:
(1.6) Wald's equation. Let X1 , X2 , . . . be i.i.d. with E|Xi | < ∞. If N is a
stopping time with EN < ∞ then ES_N = EX1 · EN .
Proof First suppose the Xi ≥ 0. Then
ES_N = ∫ S_N dP = Σ_{n=1}^{∞} ∫ Sn 1_{N=n} dP = Σ_{n=1}^{∞} Σ_{m=1}^{n} ∫ Xm 1_{N=n} dP
Since the Xi ≥ 0, we can interchange the order of summation (i.e., use Fubini's
theorem) to conclude that the last expression
= Σ_{m=1}^{∞} Σ_{n=m}^{∞} ∫ Xm 1_{N=n} dP = Σ_{m=1}^{∞} ∫ Xm 1_{N≥m} dP
Now {N ≥ m} = {N ≤ m − 1}^c ∈ F_{m−1} and is independent of Xm , so the last
expression
= Σ_{m=1}^{∞} EXm P (N ≥ m) = EX1 · EN
To prove the result in general, we run the last argument backwards. If we have
EN < ∞ then
∞ > Σ_{m=1}^{∞} E|Xm | P (N ≥ m) = Σ_{m=1}^{∞} Σ_{n=m}^{∞} ∫ |Xm | 1_{N=n} dP
The last formula shows that the double sum converges absolutely in one order,
so Fubini's theorem gives
Σ_{m=1}^{∞} Σ_{n=m}^{∞} ∫ Xm 1_{N=n} dP = Σ_{n=1}^{∞} Σ_{m=1}^{n} ∫ Xm 1_{N=n} dP = ES_N
Using the independence of {N ≥ m} ∈ F_{m−1} and Xm , and rewriting the last
identity, it follows that
Σ_{m=1}^{∞} EXm P (N ≥ m) = ES_N
Since the left-hand side is EN · EX1 , the proof is complete.
Exercise 1.12. Let X1 , X2 , . . . be i.i.d. uniform on (0,1), let Sn = X1 +· · ·+Xn ,
and let T = inf {n : Sn > 1}. Show that P (T > n) = 1/n!, so ET = e and
EST = e/2.
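Both conclusions can be checked numerically: ET = Σ_{n≥0} P (T > n) = Σ_{n≥0} 1/n! = e exactly, and a seeded simulation agrees with ET = e and, via Wald's equation, EST = e/2. A small sketch (the sample size and seed are arbitrary choices):

```python
import math
import random

# Exact part: E T = sum_{n>=0} P(T > n) = sum_{n>=0} 1/n! = e.
et_exact = sum(1 / math.factorial(n) for n in range(30))

# Seeded Monte Carlo for T = inf{n : S_n > 1} with uniform(0,1) steps.
rng = random.Random(42)
reps = 200000
tot_T = tot_S = 0.0
for _ in range(reps):
    s, n = 0.0, 0
    while s <= 1:
        s += rng.random()
        n += 1
    tot_T += n
    tot_S += s
print(et_exact)                  # e = 2.71828...
print(tot_T / reps, tot_S / reps)
```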
Example 1.5. Simple random walk. Let X1 , X2 , . . . be i.i.d. with P (Xi =
1) = 1/2 and P (Xi = −1) = 1/2. Let a < 0 < b be integers and let N = inf {n :
Sn ∉ (a, b)}. To apply (1.6), we have to check that EN < ∞. To do this, we
observe that if x ∈ (a, b), then
P (x + S_{b−a} ∉ (a, b)) ≥ 2^{−(b−a)}
since b − a steps of size +1 in a row will take us out of the interval. Iterating
the last inequality, it follows that
P (N > n(b − a)) ≤ (1 − 2^{−(b−a)} )^n
so EN < ∞. Applying (1.6) now gives ES_N = 0, or
bP (S_N = b) + aP (S_N = a) = 0
Since P (S_N = b) + P (S_N = a) = 1, it follows that (b − a)P (S_N = b) = −a, so
P (S_N = b) = −a/(b − a)   P (S_N = a) = b/(b − a)
Letting Ta = inf {n : Sn = a}, we can write the last conclusion as
(1.7) P (Ta < Tb ) = b/(b − a) for a < 0 < b
Setting b = M and letting M → ∞ gives
P (Ta < ∞) ≥ P (Ta < TM ) → 1
for all a < 0. From symmetry (and the fact that T0 ≡ 0), it follows that
(1.8) P (Tx < ∞) = 1 for all x ∈ Z
Our final fact about Tx is that ETx = ∞ for x ≠ 0. To prove this, note that if
ETx < ∞ then (1.6) would imply
x = ES_{Tx} = EX1 · ETx = 0
a contradiction.
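The exit probabilities can be confirmed exactly by first-step analysis: h(x) = P (S exits at b | S0 = x) satisfies h(x) = (h(x − 1) + h(x + 1))/2 for a < x < b with boundary values h(a) = 0, h(b) = 1. A sketch that solves this small linear system in exact rational arithmetic (the choice a = −3, b = 5 is an arbitrary illustration):

```python
from fractions import Fraction

# Solve h(x) = (h(x-1) + h(x+1)) / 2, h(a) = 0, h(b) = 1, by Gauss-Jordan
# elimination over the interior states; return h(0) = P(S_N = b).
def exit_prob(a, b):
    states = list(range(a + 1, b))          # interior states
    idx = {x: i for i, x in enumerate(states)}
    m = len(states)
    aug = [[Fraction(0)] * (m + 1) for _ in range(m)]
    for i, x in enumerate(states):
        aug[i][i] = Fraction(1)
        for nb in (x - 1, x + 1):
            if nb == b:
                aug[i][m] += Fraction(1, 2)   # boundary h(b) = 1
            elif nb != a:                     # boundary h(a) = 0 contributes 0
                aug[i][idx[nb]] -= Fraction(1, 2)
    for col in range(m):                      # Gauss-Jordan elimination
        piv = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return aug[idx[0]][m]

a, b = -3, 5
print(exit_prob(a, b), Fraction(-a, b - a))   # both 3/8
```

The exact answer matches P (S_N = b) = −a/(b − a) from the text.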
In Section 3.3, we will compute the distribution of T1 and show that
P (T1 > t) ∼ C t−1/2
Exercise 1.13. Asymmetric simple random walk. Let X1 , X2 , . . . be
i.i.d. with P (X1 = 1) = p > 1/2 and P (X1 = −1) = 1 − p, and let Sn =
X1 + · · · + Xn . Let α = inf {m : Sm > 0} and β = inf {n : Sn < 0}.
(i) Use Exercise 1.9 to conclude that P (α < ∞) = 1 and P (β < ∞) < 1.
(ii) If Y = inf Sn , then P (Y ≤ −k) = P (β < ∞)^k.
(iii) Apply Wald's equation to α ∧ n and let n → ∞ to get Eα = 1/EX1 =
1/(2p − 1). Comparing with Exercise 1.10 shows P (β̄ = ∞) = 2p − 1.
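Part (iii) is easy to test by simulation; a seeded Monte Carlo sketch of Eα = 1/(2p − 1) (the value p = 0.7, the sample size, and the seed are arbitrary choices):

```python
import random

# Seeded Monte Carlo: for p > 1/2 the first ascent time
# alpha = inf{m : S_m > 0} has E(alpha) = 1/(2p - 1).
def mean_alpha(p, reps, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        s = n = 0
        while s <= 0:                    # run until the walk is positive
            s += 1 if rng.random() < p else -1
            n += 1
        total += n
    return total / reps

p = 0.7
ma = mean_alpha(p, 100000)
print(ma, 1 / (2 * p - 1))   # both near 2.5
```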
Exercise 1.14. An optimal stopping problem. Let Xn , n ≥ 1 be i.i.d. with
EX1⁺ < ∞ and let
Yn = max_{1≤m≤n} Xm − cn
That is, we are looking for a large value of X, but we have to pay c > 0 for each
observation. (i) Let T = inf {n : Xn > a}, p = P (Xn > a), and compute EYT .
(ii) Let α (possibly < 0) be the unique solution of E(X1 − α)⁺ = c. Show that
EYT = α in this case and use the inequality
Yn ≤ α + Σ_{m=1}^{n} ((Xm − α)⁺ − c)
m=1 for n ≥ 1 to conclude that if τ ≥ 1 is a stopping time with Eτ < ∞, then
EYτ ≤ α. The analysis above assumes that you have to play at least once. If
the optimal α < 0, then you shouldn’t play at all.
Exercise 1.15. Wald's second equation. Let X1 , X2 , . . . be i.i.d. with
EXn = 0 and EX²n = σ² < ∞. If T is a stopping time with ET < ∞ then
ES²_T = σ² ET . Hint: Compute ES²_{T∧n} by induction and show that S_{T∧n} is a
Cauchy sequence in L².
An amusing consequence of the last result is
(1.7) Theorem. Let X1 , X2 , . . . be i.i.d. with EXn = 0 and EX²n = 1, and let
Tc = inf {n ≥ 1 : |Sn | > cn^{1/2} }. Then ETc < ∞ for c < 1 and ETc = ∞ for c ≥ 1.
Proof One half of this is easy. If ETc < ∞ then the previous exercise implies
ETc = E(S²_{Tc} ) > c² ETc , a contradiction if c ≥ 1. To prove the other direction,
we let τ = Tc ∧ n and observe S²_{τ−1} ≤ c²(τ − 1), so using the Cauchy-Schwarz
inequality
Eτ = ES²_τ = ES²_{τ−1} + 2E(S_{τ−1} X_τ ) + EX²_τ ≤ c² Eτ + 2c(Eτ · EX²_τ )^{1/2} + EX²_τ
To complete the proof now, we will show
Lemma. If T is a stopping time with ET = ∞ then
EX²_{T∧n} / E(T ∧ n) → 0
(1.7) follows, for if ε < 1 − c² and n is large, we will have Eτ ≤ (c² + ε)Eτ , a
contradiction.
Proof We begin by writing
E(X²_{T∧n} ) = E(X²_{T∧n} ; X²_{T∧n} ≤ ε(T ∧ n)) + Σ_{j=1}^{n} E(X²_j ; T ∧ n = j, X²_j > εj)
The first term is ≤ ε E(T ∧ n). To bound the second, choose N ≥ 1 so that for
n ≥ N
Σ_{j=1}^{n} E(X²_j ; X²_j > εj) < εn
This is possible since the dominated convergence theorem implies E(X²_j ; X²_j >
εj) → 0 as j → ∞. For the first part of the sum, we use the trivial bound
Σ_{j=1}^{N} E(X²_j ; T ∧ n = j, X²_j > εj) ≤ N EX²_1
To bound the remainder of the sum, we note (i) X²_j ≥ 0; (ii) {T ∧ n ≥ j} is
∈ F_{j−1} and hence is independent of X²_j 1_{(X²_j > εj)} ; (iii) use some trivial
arithmetic; (iv) use Fubini's theorem and enlarge the range of j; (v) use the
choice of N and a trivial inequality:
Σ_{j=N}^{n} E(X²_j ; T ∧ n = j, X²_j > εj) ≤ Σ_{j=N}^{n} E(X²_j ; T ∧ n ≥ j, X²_j > εj)
= Σ_{j=N}^{n} P (T ∧ n ≥ j) E(X²_j ; X²_j > εj)
= Σ_{j=N}^{n} Σ_{k=j}^{∞} P (T ∧ n = k) E(X²_j ; X²_j > εj)
≤ Σ_{k=N}^{∞} P (T ∧ n = k) Σ_{j=1}^{k} E(X²_j ; X²_j > εj)
≤ Σ_{k=N}^{∞} εk P (T ∧ n = k) ≤ ε E(T ∧ n)
Combining our estimates shows
EX²_{T∧n} ≤ 2ε E(T ∧ n) + N EX²_1
Letting n → ∞ and noting E(T ∧ n) → ∞, we have
lim sup_{n→∞} EX²_{T∧n} / E(T ∧ n) ≤ 2ε
where ε is arbitrary.

3.2. Recurrence
Throughout this section, Sn will be a random walk, i.e., Sn = X1 + · · · + Xn
where X1 , X2 , . . . are i.i.d., and we will investigate the question mentioned at
the beginning of the chapter. Does the sequence S1 (ω ), S2 (ω ), . . . return to (or
near) 0 inﬁnitely often? The answer to the last question is either Yes or No,
and the random walk is called recurrent or transient accordingly. We begin
with some deﬁnitions that formulate the question precisely and a result that
establishes a dichotomy between the two cases.
The number x ∈ R^d is said to be a recurrent value for the random walk
Sn if for every ε > 0, P (‖Sn − x‖ < ε i.o.) = 1. Here ‖x‖ = sup_i |x_i |. The
reader will see the reason for this choice of norm in the proof of (2.5). The
Hewitt-Savage 0-1 law, (1.1), implies that if the last probability is < 1, it is 0.
Our first result shows that to know the set of recurrent values, it is enough to
check x = 0. A number x is said to be a possible value of the random walk if
for any ε > 0, there is an n so that P (‖Sn − x‖ < ε) > 0.
(2.1) Theorem. The set V of recurrent values is either ∅ or a closed subgroup
of Rd . In the second case, V = U , the set of possible values.
Proof Suppose V ≠ ∅. It is clear that V^c is open, so V is closed. To prove
that V is a group, we will first show that
(∗) if x ∈ U and y ∈ V then y − x ∈ V .
This statement has been formulated so that once it is established, (2.1) follows
easily. Let
p_{δ,m}(z) = P (‖Sn − z‖ ≥ δ for all n ≥ m)
If y − x ∉ V , there is an ε > 0 and m ≥ 1 so that p_{2ε,m}(y − x) > 0. Since x ∈ U ,
there is a k so that P (‖Sk − x‖ < ε) > 0. Since
P (‖Sn − Sk − (y − x)‖ ≥ 2ε for all n ≥ k + m) = p_{2ε,m}(y − x)
and is independent of {‖Sk − x‖ < ε}, it follows that
p_{ε,m+k}(y) ≥ P (‖Sk − x‖ < ε) p_{2ε,m}(y − x) > 0
contradicting y ∈ V , so y − x ∈ V .
To conclude V is a group when V ≠ ∅, let q, r ∈ V , and observe: (i) taking
x = y = r in (∗) shows 0 ∈ V , (ii) taking x = r, y = 0 shows −r ∈ V , and (iii)
taking x = −r, y = q shows q + r ∈ V . To prove that V = U now, observe that
if u ∈ U taking x = u, y = 0 shows −u ∈ V and since V is a group, it follows
that u ∈ V .
If V = ∅, the random walk is said to be transient, otherwise it is called
recurrent. Before plunging into the technicalities needed to treat a general
random walk, we begin by analyzing the special case Pólya considered in 1921.
Legend has it that Pólya thought of this problem while wandering around in a
park near Zürich when he noticed that he kept encountering the same young
couple. History does not record what the young couple thought.
Example 2.1. Simple random walk on Zd .
P (Xi = ej ) = P (Xi = −ej ) = 1/2d
for each of the d unit vectors ej . To analyze this case, we begin with a result
that is valid for any random walk. Let τ0 = 0 and τn = inf {m > τn−1 : Sm = 0}
be the time of the nth return to 0. From (1.4), it follows that
P (τn < ∞) = P (τ1 < ∞)^n
a fact that leads easily to:
(2.2) Theorem. For any random walk, the following are equivalent:
(i) P (τ1 < ∞) = 1, (ii) P (Sm = 0 i.o.) = 1, and (iii) Σ_{m=0}^{∞} P (Sm = 0) = ∞.
Proof If P (τ1 < ∞) = 1, then P (τn < ∞) = 1 for all n and P (Sm = 0 i.o.) = 1.
Let
V = Σ_{m=0}^{∞} 1_{(Sm = 0)} = Σ_{n=0}^{∞} 1_{(τn < ∞)}
be the number of visits to 0, counting the visit at time 0. Taking expected value
and using Fubini's theorem to put the expected value inside the sum:
EV = Σ_{m=0}^{∞} P (Sm = 0) = Σ_{n=0}^{∞} P (τn < ∞) = Σ_{n=0}^{∞} P (τ1 < ∞)^n = 1 / (1 − P (τ1 < ∞))
The second equality shows (ii) implies (iii), and in combination with the last
two shows that if (i) is false then (iii) is false (i.e., (iii) implies (i)).
(2.3) Theorem. Simple random walk is recurrent in d ≤ 2 and transient in
d ≥ 3.
To steal a joke from Kakutani (U.C.L.A. colloquium talk): “A drunk man will
eventually ﬁnd his way home but a drunk bird may get lost forever.”
Proof of (2.3) Let ρd (m) = P (Sm = 0). ρd (m) is 0 if m is odd. From (1.4)
in Chapter 2, we get ρ1 (2n) ∼ (πn)−1/2 as n → ∞. This and (2.2) gives the
result in one dimension. Our next step is
Simple random walk is recurrent in two dimensions. Note that in order
for S2n = 0 we must for some 0 ≤ m ≤ n have m up steps, m down steps, n − m
to the left and n − m to the right, so
ρ2 (2n) = 4^{−2n} Σ_{m=0}^{n} (2n)! / (m! m! (n − m)! (n − m)!)
= 4^{−2n} C(2n, n) Σ_{m=0}^{n} C(n, m) C(n, n − m) = ( 4^{−n} C(2n, n) )² = ρ1 (2n)²
To see the next to last equality, consider choosing n students from a class with
n boys and n girls and observe that for some 0 ≤ m ≤ n you must choose m
boys and n − m girls. Using the asymptotic formula ρ1 (2n) ∼ (πn)^{−1/2} , we get
ρ2 (2n) ∼ (πn)^{−1} . Since Σ n^{−1} = ∞, the result follows from (2.2).
Remark. For a direct proof of ρ2 (2n) = ρ1 (2n)², note that if T¹n and T²n are
independent, one-dimensional random walks then Tn = (T¹n , T²n ) jumps from x
to x + (1, 1), x + (1, −1), x + (−1, 1), and x + (−1, −1) with equal probability, so
rotating Tn by 45 degrees and dividing by √2 gives Sn .
Simple random walk is transient in three dimensions. Intuitively, this
holds since the probability of being back at 0 after 2n steps is ∼ cn^{−3/2} and
this is summable. We will not compute the probability exactly but will get an
upper bound of the right order of magnitude. Again, since the number of steps
in the directions ±ei must be equal for i = 1, 2, 3,
ρ3 (2n) = 6^{−2n} Σ_{j,k} (2n)! / (j! k! (n − j − k)!)²
= 2^{−2n} C(2n, n) Σ_{j,k} ( 3^{−n} n! / (j! k! (n − j − k)!) )²
≤ 2^{−2n} C(2n, n) max_{j,k} 3^{−n} n! / (j! k! (n − j − k)!)
where in the last inequality we have used the fact that if the a_{j,k} are ≥ 0 and
sum to 1 then Σ_{j,k} a²_{j,k} ≤ max_{j,k} a_{j,k} . Our last step is to show
max_{j,k} 3^{−n} n! / (j! k! (n − j − k)!) ≤ C n^{−1}
j !k !(n − j − k )! To do this, we note that (a) if any of the numbers j , k or n − j − k is < [n/3]
increasing the smallest number and decreasing the largest number decreases
the denominator (since x(1 − x) is maximized at 1/2), so the maximum occurs
when all three numbers are as close as possible to n/3; (b) Stirlings’ formula
implies
nn
n!
∼ jk
·
j !k !(n − j − k )!
j k (n − j − k )n−j −k 1
n
·
jk (n − j − k ) 2π Taking j and k within 1 of n/3 the ﬁrst term on the right is ≤ C 3n , and the
desired result follows.
Simple random walk is transient in d > 3. Let Tn = (S¹n , S²n , S³n ), N (0) = 0,
and N (n) = inf {m > N (n − 1) : Tm ≠ T_{N(n−1)} }. It is easy to see that T_{N(n)}
is a three-dimensional simple random walk. Since T_{N(n)} returns infinitely often
to 0 with probability 0 and the first three coordinates are constant in between
the N (n), Sn is transient.
Remark. Let πd = P (Sn = 0 for some n ≥ 1) be the probability simple
random walk on Z^d returns to 0. The last display in the proof of (2.2) implies
Σ_{n=0}^{∞} P (S2n = 0) = 1 / (1 − πd )
In d = 3, P (S2n = 0) ∼ Cn^{−3/2} so Σ_{n=N}^{∞} P (S2n = 0) ∼ C N^{−1/2} , and the
series converges rather slowly. For example, if we want to compute the return
probability to 5 decimal places, we would need 10^{10} terms. At the end of the
section, we will give another formula that leads very easily to accurate results.
The rest of this section is devoted to proving the following facts about
random walks:
(2.7) Sn is recurrent in d = 1 if Sn /n → 0 in probability.
(2.8) Sn is recurrent in d = 2 if Sn /n^{1/2} ⇒ a nondegenerate normal distribution.
(2.12) Sn is transient in d ≥ 3 if it is "truly three-dimensional."
To prove (2.12), we will give a necessary and suﬃcient condition for recurrence,
(2.9), that shows that the conditions in (2.7) and (2.8) are close to the best
possible. The ﬁrst step in deriving these results is to generalize (2.2):
(2.4) Lemma. If Σ_{n=1}^{∞} P (‖Sn ‖ < ε) < ∞ then P (‖Sn ‖ < ε i.o.) = 0.
If Σ_{n=1}^{∞} P (‖Sn ‖ < ε) = ∞ then P (‖Sn ‖ < 2ε i.o.) = 1.
Proof The first conclusion follows from the Borel-Cantelli lemma. To prove
the second, let F = {‖Sn ‖ < ε i.o.}^c . Breaking things down according to the
last time ‖Sn ‖ < ε,
P (F ) = Σ_{m=0}^{∞} P (‖Sm ‖ < ε, ‖Sn ‖ ≥ ε for all n ≥ m + 1)
≥ Σ_{m=0}^{∞} P (‖Sm ‖ < ε, ‖Sn − Sm ‖ ≥ 2ε for all n ≥ m + 1)
= Σ_{m=0}^{∞} P (‖Sm ‖ < ε) ρ_{2ε,1}
where ρ_{δ,k} = P (‖Sn ‖ ≥ δ for all n ≥ k). Since P (F ) ≤ 1, and
Σ_{m=0}^{∞} P (‖Sm ‖ < ε) = ∞
it follows that ρ_{2ε,1} = 0. To extend this conclusion to ρ_{2ε,k} with k ≥ 2, let
Am = {‖Sm ‖ < ε, ‖Sn ‖ ≥ ε for all n ≥ m + k}
Since any ω can be in at most k of the Am , repeating the argument above gives
k ≥ Σ_{m=0}^{∞} P (Am ) ≥ Σ_{m=0}^{∞} P (‖Sm ‖ < ε) ρ_{2ε,k}
So ρ_{2ε,k} = P (‖Sn ‖ ≥ 2ε for all n ≥ k) = 0, and since k is arbitrary, the desired
conclusion follows.
Our second step is to show that the convergence or divergence of the sums
in (2.4) is independent of ε. The previous proof is valid for any norm. For the
next one, we need ‖x‖ = sup_i |x_i |.
(2.5) Lemma. Let m be an integer ≥ 2. Then
Σ_{n=0}^{∞} P (‖Sn ‖ < mε) ≤ (2m)^d Σ_{n=0}^{∞} P (‖Sn ‖ < ε)
Proof We begin by observing
Σ_{n=0}^{∞} P (‖Sn ‖ < mε) ≤ Σ_{n=0}^{∞} Σ_k P (Sn ∈ kε + [0, ε)^d )
where the inner sum is over k ∈ {−m, . . . , m − 1}^d . If we let
T_k = inf {ℓ ≥ 0 : S_ℓ ∈ kε + [0, ε)^d }
then breaking things down according to the value of T_k and using Fubini's
theorem gives
Σ_{n=0}^{∞} P (Sn ∈ kε + [0, ε)^d ) = Σ_{n=0}^{∞} Σ_{ℓ=0}^{n} P (Sn ∈ kε + [0, ε)^d , T_k = ℓ)
≤ Σ_{ℓ=0}^{∞} Σ_{n=ℓ}^{∞} P (‖Sn − S_ℓ ‖ < ε, T_k = ℓ)
Since {T_k = ℓ} and {‖Sn − S_ℓ ‖ < ε} are independent, the last sum
= Σ_{ℓ=0}^{∞} P (T_k = ℓ) · Σ_{j=0}^{∞} P (‖Sj ‖ < ε) ≤ Σ_{j=0}^{∞} P (‖Sj ‖ < ε)
Since there are (2m)^d values of k in {−m, . . . , m − 1}^d , the proof of (2.5) is
complete.
Combining (2.4) and (2.5) gives:
(2.6) Corollary. The convergence (resp. divergence) of Σ_n P (‖Sn ‖ < ε) for a
single value of ε > 0 is sufficient for transience (resp. recurrence).
In d = 1, if EXi = µ ≠ 0, then the strong law of large numbers implies
Sn /n → µ, so |Sn | → ∞ and Sn is transient. As a converse, we have
(2.7) Chung-Fuchs theorem. Suppose d = 1. If the weak law of large
numbers holds in the form Sn /n → 0 in probability, then Sn is recurrent.
Proof Let un (x) = P (|Sn | < x) for x > 0. (2.5) implies
Σ_{n=0}^{∞} un (1) ≥ (1/2m) Σ_{n=0}^{∞} un (m) ≥ (1/2m) Σ_{n=0}^{Am} un (n/A)
for any A < ∞, since un (x) ≥ 0 and is increasing in x. By hypothesis un (n/A) →
1, so letting m → ∞ and noticing the right-hand side is A/2 times the average
of the first Am terms,
Σ_{n=0}^{∞} un (1) ≥ A/2
Since A is arbitrary, the sum must be ∞, and the desired conclusion follows
from (2.6).
(2.8) Theorem. If Sn is a random walk in R² and Sn /n^{1/2} ⇒ a nondegenerate
normal distribution then Sn is recurrent.
Remark. The conclusion is also true if the limit is degenerate, but in that case
the random walk is essentially one (or zero) dimensional, and the result follows
from the Chung-Fuchs theorem.
Proof Let u(n, m) = P (‖Sn ‖ < m). (2.5) implies
Σ_{n=0}^{∞} u(n, 1) ≥ (4m²)^{−1} Σ_{n=0}^{∞} u(n, m)
If m/√n → c then
u(n, m) → ∫_{[−c,c]²} n(x) dx
where n(x) is the density of the limiting normal distribution. If we use ρ(c)
to denote the right-hand side and let n = [θm²], it follows that u([θm²], m) →
ρ(θ^{−1/2} ). If we write
m^{−2} Σ_{n=0}^{∞} u(n, m) = ∫_0^{∞} u([θm²], m) dθ
let m → ∞, and use Fatou's lemma, we get
lim inf_{m→∞} (4m²)^{−1} Σ_{n=0}^{∞} u(n, m) ≥ 4^{−1} ∫_0^{∞} ρ(θ^{−1/2} ) dθ
Since the normal density is positive and continuous at 0,
ρ(c) = ∫_{[−c,c]²} n(x) dx ∼ n(0)(2c)²
as c → 0. So ρ(θ^{−1/2} ) ∼ 4n(0)/θ as θ → ∞, the integral diverges, and backtracking
to the first inequality in the proof it follows that Σ_{n=0}^{∞} u(n, 1) = ∞,
proving (2.8).
We come now to the promised necessary and suﬃcient condition for recurrence. Here ϕ = E exp(it · Xj ) is the ch.f. of one step of the random walk.
(2.9) Theorem. Let δ > 0. Sn is recurrent if and only if
∫_{(−δ,δ)^d} Re ( 1 / (1 − ϕ(y)) ) dy = ∞
We will prove a weaker result:
(2.9′) Theorem. Let δ > 0. Sn is recurrent if and only if
sup_{r<1} ∫_{(−δ,δ)^d} Re ( 1 / (1 − rϕ(y)) ) dy = ∞
Remark. Half of the work needed to get (2.9) from (2.9′) is trivial:
0 ≤ Re ( 1 / (1 − rϕ(y)) ) → Re ( 1 / (1 − ϕ(y)) )  as r → 1
so Fatou’s lemma shows that if the integral is inﬁnite, the walk is recurrent.
The other direction is rather difficult: (2.9′) is in Chung and Fuchs (1951), but
a proof of (2.9) had to wait for Ornstein (1969) and Stone (1969) to solve the
problem independently. Their proofs use a trick to reduce to the case where
the increments have a density and then a second trick to deal with that case, so
we will not give the details here. The reader can consult either of the sources
cited or Port and Stone (1969), where the result is demonstrated for random
walks on Abelian groups.
Proof of (2.9′) The first ingredient in the solution is the
(2.10) Parseval relation. Let µ and ν be probability measures on R^d with
ch.f.'s ϕ and ψ. Then
∫ ψ(t) µ(dt) = ∫ ϕ(x) ν(dx)
Proof Since e^{it·x} is bounded, Fubini's theorem implies
∫ ψ(t) µ(dt) = ∫∫ e^{it·x} ν(dx) µ(dt) = ∫∫ e^{it·x} µ(dt) ν(dx) = ∫ ϕ(x) ν(dx)
Our second ingredient is a little calculus.
(2.11) Lemma. If |x| ≤ π/3 then 1 − cos x ≥ x²/4.
Proof It suffices to prove the result for x > 0. If z ≤ π/3 then cos z ≥ 1/2, so
sin y = ∫_0^y cos z dz ≥ y/2
and hence
1 − cos x = ∫_0^x sin y dy ≥ ∫_0^x (y/2) dy = x²/4
From Example 3.5 in Chapter 2, we see that the density
(δ − |x|)/δ²  when |x| ≤ δ,  0 otherwise
has ch.f. 2(1 − cos δt)/(δt)². Let µn denote the distribution of Sn . Using (2.11)
(note π/3 ≥ 1) and then (2.10), we have
P (‖Sn ‖ < 1/δ) ≤ 4^d ∫ Π_{i=1}^{d} ( (1 − cos(δt_i )) / (δt_i )² ) µn (dt)
= 2^d ∫_{(−δ,δ)^d} Π_{i=1}^{d} ( (δ − |x_i |)/δ² ) ϕ^n (x) dx
Our next step is to sum from 0 to ∞. To be able to interchange the sum and
the integral, we first multiply by r^n where r < 1:
Σ_{n=0}^{∞} r^n P (‖Sn ‖ < 1/δ) ≤ 2^d ∫_{(−δ,δ)^d} Π_{i=1}^{d} ( (δ − |x_i |)/δ² ) · 1/(1 − rϕ(x)) dx
Symmetry dictates that the integral on the right is real, so we can take the real
part without affecting its value. Letting r ↑ 1 and using (δ − |x|)/δ ≤ 1,
Σ_{n=0}^{∞} P (‖Sn ‖ < 1/δ) ≤ (2/δ)^d sup_{r<1} ∫_{(−δ,δ)^d} Re ( 1/(1 − rϕ(x)) ) dx
and using (2.6) gives half of (2.9′).
To prove the other direction, we begin by noting that Example 3.8 from
Chapter 2 shows that the density
(1 − cos(x/δ)) / (πx²/δ)
has ch.f. 1 − |δt| when |t| ≤ 1/δ, 0 otherwise. Using 1 ≥ Π_{i=1}^{d} (1 − δ|x_i |) and
then (2.10),
P (‖Sn ‖ < 1/δ) ≥ ∫_{(−1/δ,1/δ)^d} Π_{i=1}^{d} (1 − δ|x_i |) µn (dx)
= ∫ Π_{i=1}^{d} ( (1 − cos(t_i /δ)) / (πt_i²/δ) ) ϕ^n (t) dt
Multiplying by r^n and summing gives
Σ_{n=0}^{∞} r^n P (‖Sn ‖ < 1/δ) ≥ ∫ Π_{i=1}^{d} ( (1 − cos(t_i /δ)) / (πt_i²/δ) ) · 1/(1 − rϕ(t)) dt
The last integral is real, so its value is unaffected if we integrate only the real
part of the integrand. If we do this and apply (2.11), we get
Σ_{n=0}^{∞} r^n P (‖Sn ‖ < 1/δ) ≥ (4πδ)^{−d} ∫_{(−δ,δ)^d} Re ( 1/(1 − rϕ(t)) ) dt
Letting r ↑ 1 and using (2.6) now completes the proof of (2.9′).
We will now consider some examples. Our goal in d = 1 and d = 2 is to
convince you that the conditions in (2.7) and (2.8) are close to the best possible.
d = 1. Consider the symmetric stable laws that have ch.f. ϕ(t) = exp(−|t|^α ).
To avoid using facts that we have not proved, we will obtain our conclusions
from (2.9′). It is not hard to use that form of the criterion in this case since
1 − rϕ(t) ↓ 1 − exp(−|t|^α ) as r ↑ 1
1 − exp(−|t|^α ) ∼ |t|^α as t → 0
From this, it follows that the corresponding random walk is transient for α < 1
and recurrent for α ≥ 1. The case α > 1 is covered by (2.7) since these random
walks have mean 0. The result for α = 1 is new because the Cauchy distribution
does not satisfy Sn /n → 0 in probability. The random walks with α < 1 are
interesting because (1.2) implies (see Exercise 1.1)
−∞ = lim inf Sn < lim sup Sn = ∞
but P (|Sn | < M i.o.) = 0 for any M < ∞.
Remark. The stable law examples are misleading in one respect. Shepp (1964)
has proved that recurrent random walks may have arbitrarily large tails. To be
precise, given a function ε(x) ↓ 0 as x ↑ ∞, there is a recurrent random walk
with P (|X1 | ≥ x) ≥ ε(x) for large x.
d = 2. Let α < 2, and let ϕ(t) = exp(−|t|^α ) where |t| = (t_1² + t_2²)^{1/2} . ϕ
is the characteristic function of a random vector (X1 , X2 ) that has two nice
properties:
(i) the distribution of (X1 , X2 ) is invariant under rotations,
(ii) X1 and X2 have symmetric stable laws with index α.
Again, 1 − rϕ(t) ↓ 1 − exp(−|t|^α ) as r ↑ 1 and 1 − exp(−|t|^α ) ∼ |t|^α as t → 0.
Changing to polar coordinates and noticing
∫_0^δ dx 2πx · x^{−α} < ∞
when 1 − α > −1 shows the random walks with ch.f. exp(−|t|^α ), α < 2, are
transient. When p < α, we have E|X1 |^p < ∞ by Exercise 7.5 in Chapter 2, so
these examples show that (2.8) is reasonably sharp.
d ≥ 3. The integral ∫_0^δ dx x^{d−1} x^{−2} < ∞, so if a random walk is recurrent
in d ≥ 3, its ch.f. must → 1 faster than t². In Exercise 3.19 of Chapter 2,
we observed that (in one dimension) if ϕ(r) = 1 + o(r²) then ϕ ≡ 1. By
considering ϕ(rθ) where r is real and θ is a fixed vector, the last conclusion
generalizes easily to R^d , d > 1, and suggests that once we exclude walks that
stay on a plane through 0, no three-dimensional random walks are recurrent.
A random walk in R³ is truly three-dimensional if the distribution of
X1 has P (X1 · θ ≠ 0) > 0 for all θ ≠ 0.
(2.12) Theorem. No truly threedimensional random walk is recurrent.
Proof We will deduce the result from (2.9 ). We begin with some arithmetic.
If z is complex, the conjugate of 1 − z is 1 − z , so
¯
1
1−z
¯
=
1−z
1 − z  2 and Re 1
Re (1 − z )
=
1−z
1 − z  2 If z = a + bi with a ≤ 1, then using the previous formula and dropping the b2
from the denominator
Re 1−a
1
1
=
≤
2 + b2
1−z
(1 − a)
1−a Taking z = rϕ(t) and supposing for the second inequality that 0 ≤ Re ϕ(t) ≤ 1,
we have
(a) Re 1
1
1
≤
≤
1 − rϕ(t)
Re (1 − rϕ(t))
Re (1 − ϕ(t)) The last calculation shows that it is enough to estimate
Re (1 − ϕ(t)) = {1 − cos(x · t)}µ(dx) ≥
x·t<π/3  x · t 2
µ(dx)
4 by (2.11). Writing t = ρθ where θ ∈ S = {x : x = 1} gives
(b) Re (1 − ϕ(ρθ)) ≥ ρ2
4 x · θ2 µ(dx)
x·θ <π/3ρ Fatou’s lemma implies that if we let ρ → 0 and θ(ρ) → θ, then
(c) x · θ(ρ)2 µ(dx) ≥ lim inf
ρ→0 x · θ2 µ(dx) > 0 x·θ (ρ)<π/3ρ I claim this implies that for ρ < ρ0
(d) x · θ2 µ(dx) = C > 0 inf θ ∈S x·θ <π/3ρ 193 194 Chapter 3 Random Walks
To get the last conclusion, observe that if it is false, then for ρ = 1/n there is
a θₙ so that
$$\int_{|x\cdot\theta_n|<n\pi/3}(x\cdot\theta_n)^2\,\mu(dx)\le 1/n$$
All the θₙ lie in S, a compact set, so if we pick a convergent subsequence we
contradict (c). Combining (b) and (d) gives
$$\mathrm{Re}\,(1-\varphi(\rho\theta))\ge C\rho^2/4$$
Using the last result and (a), then changing to polar coordinates, we see that if
δ is small (so Re ϕ(y) ≥ 0 on (−δ, δ)^d)
$$\int_{(-\delta,\delta)^d}\mathrm{Re}\,\frac{1}{1-r\varphi(y)}\,dy\le\int_0^{\delta\sqrt d}d\rho\,\rho^{d-1}\int d\theta\,\frac{1}{\mathrm{Re}\,(1-\varphi(\rho\theta))}\le C\int_0^1d\rho\,\rho^{d-3}<\infty$$
when d > 2, so the desired result follows from (2.9).
Remark. The analysis becomes much simpler when we consider random walks
on Z^d. The inversion formula given in Exercise 3.1 of Chapter 2 implies
$$P(S_n=0)=(2\pi)^{-d}\int_{(-\pi,\pi)^d}\varphi^n(t)\,dt$$
Multiplying by r^n and summing gives
$$\sum_{n=0}^\infty r^nP(S_n=0)=(2\pi)^{-d}\int_{(-\pi,\pi)^d}\frac{1}{1-r\varphi(t)}\,dt$$
In the case of simple random walk in d = 3, ϕ(t) = (1/3)Σ_{j=1}^{3} cos tⱼ is real, and
$$\frac{1}{1-r\varphi(t)}\uparrow\frac{1}{1-\varphi(t)}\ \text{ when }\varphi(t)>0\qquad\qquad 0\le\frac{1}{1-r\varphi(t)}\le1\ \text{ when }\varphi(t)\le0$$
So, using the monotone and bounded convergence theorems,
$$\sum_{n=0}^\infty P(S_n=0)=(2\pi)^{-3}\int_{(-\pi,\pi)^3}\left(1-\frac13\sum_{j=1}^3\cos t_j\right)^{-1}dt$$
This integral was first evaluated by Watson in 1939 in terms of elliptic integrals,
which could be found in tables. Glasser and Zucker (1977) showed that it equals
$$(\sqrt6/32\pi^3)\,\Gamma(1/24)\,\Gamma(5/24)\,\Gamma(7/24)\,\Gamma(11/24)=1.516386059137\ldots$$
so it follows from the remark after the proof of (2.3) that the probability that
simple random walk in d = 3 returns to 0 is
$$\pi_3=1-(1.516386\ldots)^{-1}=.340537329544\ldots$$
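The closed form can be checked numerically. The following sketch (ours, not part of the text) evaluates the Glasser–Zucker product with the standard library and recovers the return probability quoted above.

```python
import math

# Glasser-Zucker closed form for sum_{n>=0} P(S_n = 0),
# simple random walk on Z^3
g = math.gamma
watson = (math.sqrt(6) / (32 * math.pi ** 3)) * \
    g(1 / 24) * g(5 / 24) * g(7 / 24) * g(11 / 24)

# The expected number of visits to 0 is 1/(1 - p) where p is the
# return probability, so p = 1 - 1/watson.
p_return = 1 - 1 / watson

print(watson, p_return)
assert abs(watson - 1.516386059137) < 1e-6
assert abs(p_return - 0.340537329544) < 1e-6
```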
For numerical results in 4 ≤ d ≤ 9, see Kondo and Hara (1987).

*3.3. Visits to 0, Arcsine Laws
In the last section, we took a broad look at the recurrence of random walks.
In this section, we will take a deep look at one example: simple random walk
(on Z). To steal a line from Chung, “We shall treat this by combinatorial
methods as an antidote to the analytic skulduggery above.” The developments
here follow Chapter III of Feller, vol. I. To facilitate discussion, we will think
of the sequence S1 , S2 , . . . , Sn as being represented by a polygonal line with
segments (k − 1, Sk−1 ) → (k, Sk ). A path is a polygonal line that is a possible
outcome of simple random walk. To count the number of paths from (0,0) to
(n, x), it is convenient to introduce a and b deﬁned by: a = (n + x)/2 is the
number of positive steps in the path and b = (n − x)/2 is the number of negative
steps. Notice that n = a + b and x = a − b. If −n ≤ x ≤ n and n − x is even,
the a and b deﬁned above are nonnegative integers, and the number of paths
from (0,0) to (n, x) is
$$(*)\qquad N_{n,x}=\binom{n}{a}$$
Otherwise, the number of paths is 0.
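The counting formula is easy to sanity-check by brute force over all 2ⁿ paths; the short sketch below (an illustration of ours, not the book's) compares (∗) with direct enumeration.

```python
from itertools import product
from math import comb

def N(n, x):
    """Number of paths from (0,0) to (n,x), formula (*)."""
    if x < -n or x > n or (n - x) % 2:
        return 0
    return comb(n, (n + x) // 2)  # a = (n + x)/2 positive steps

def N_brute(n, x):
    # enumerate all sequences of +-1 steps and count those ending at x
    return sum(1 for steps in product([1, -1], repeat=n)
               if sum(steps) == x)

for n in range(7):
    for x in range(-n - 1, n + 2):
        assert N(n, x) == N_brute(n, x)
print("formula (*) agrees with enumeration for n <= 6")
```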
(3.1) Reﬂection principle. If x, y > 0 then the number of paths from (0, x)
to (n, y ) that are 0 at some time is equal to the number of paths from (0, −x)
to (n, y ).
Proof Suppose (0, s₀), (1, s₁), . . . , (n, sₙ) is a path from (0, x) to (n, y). Let
K = inf{k : sₖ = 0}. Let s′ₖ = −sₖ for k ≤ K and s′ₖ = sₖ for K ≤ k ≤ n.
Then (k, s′ₖ), 0 ≤ k ≤ n, is a path from (0, −x) to (n, y). Conversely, if
(0, t₀), (1, t₁), . . . , (n, tₙ) is a path from (0, −x) to (n, y), then it must cross 0.
Let K = inf{k : tₖ = 0}. Let t′ₖ = −tₖ for k ≤ K and t′ₖ = tₖ for K ≤ k ≤ n.
Then (k, t′ₖ), 0 ≤ k ≤ n, is a path from (0, x) to (n, y) that is 0 at time K.
The last two observations set up a one-to-one correspondence between the two
classes of paths, so their numbers must be equal.
From (3.1) we get a result first proved in 1878.
(3.2) The Ballot Theorem. Suppose that in an election candidate A gets
α votes and candidate B gets β votes where β < α. The probability that
throughout the counting A always leads B is (α − β)/(α + β).
Proof Let x = α − β, n = α + β. Clearly, there are as many such outcomes
as there are paths from (1,1) to (n, x) that are never 0. The reflection principle
implies that the number of paths from (1,1) to (n, x) that are 0 at some time
is equal to the number of paths from (1,−1) to (n, x), so by (∗) the number of
paths from (1,1) to (n, x) that are never 0 is
$$N_{n-1,x-1}-N_{n-1,x+1}=\binom{n-1}{\alpha-1}-\binom{n-1}{\alpha}=\frac{(n-1)!}{(\alpha-1)!\,(n-\alpha)!}-\frac{(n-1)!}{\alpha!\,(n-\alpha-1)!}=\frac{n!}{\alpha!\,(n-\alpha)!}\cdot\frac{\alpha-(n-\alpha)}{n}=N_{n,x}\cdot\frac{\alpha-\beta}{\alpha+\beta}$$
since n = α + β, proving (3.2).
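For small elections the ballot theorem can be verified by listing every counting order. A quick sketch of ours (not part of the text):

```python
from itertools import permutations
from fractions import Fraction

def leads_throughout(order):
    # does A strictly lead B after every vote counted?
    a = b = 0
    for v in order:
        a += v == 'A'
        b += v == 'B'
        if a <= b:
            return False
    return True

def lead_probability(alpha, beta):
    """P(A leads throughout), by enumeration of distinct orders."""
    orders = set(permutations('A' * alpha + 'B' * beta))
    good = sum(leads_throughout(o) for o in orders)
    return Fraction(good, len(orders))

# compare with (alpha - beta)/(alpha + beta)
assert lead_probability(3, 2) == Fraction(1, 5)
assert lead_probability(4, 1) == Fraction(3, 5)
assert lead_probability(5, 3) == Fraction(1, 4)
print("ballot theorem verified for small cases")
```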
Using the ballot theorem, we can compute the distribution of the time to
hit 0 for simple random walk.
(3.3) Lemma. P(S₁ ≠ 0, . . . , S₂ₙ ≠ 0) = P(S₂ₙ = 0).
Proof P(S₁ > 0, . . . , S₂ₙ > 0) = Σ_{r=1}^∞ P(S₁ > 0, . . . , S₂ₙ₋₁ > 0, S₂ₙ = 2r).
From the proof of (3.2), we see that the number of paths from (0,0) to (2n, 2r)
that are never 0 at positive times (= the number of paths from (1,1) to (2n, 2r)
that are never 0) is
$$N_{2n-1,2r-1}-N_{2n-1,2r+1}$$
If we let p_{n,x} = P(Sₙ = x), then this implies
$$P(S_1>0,\dots,S_{2n-1}>0,\ S_{2n}=2r)=\tfrac12\left(p_{2n-1,2r-1}-p_{2n-1,2r+1}\right)$$
Summing from r = 1 to ∞ gives
$$P(S_1>0,\dots,S_{2n}>0)=\tfrac12\,p_{2n-1,1}=\tfrac12\,P(S_{2n}=0)$$
Symmetry implies P(S₁ < 0, . . . , S₂ₙ < 0) = (1/2)P(S₂ₙ = 0), and the proof is
complete.
Let R = inf{m ≥ 1 : Sₘ = 0}. Combining (3.3) with (1.4) from Chapter 2
gives
$$\text{(3.4)}\qquad P(R>2n)=P(S_{2n}=0)\sim\pi^{-1/2}n^{-1/2}$$
Since P(R > x)/P(|R| > x) = 1, it follows from (7.7) in Chapter 2 that R is
in the domain of attraction of the stable law with α = 1/2 and κ = 1. This
implies that if Rₙ is the time of the nth return to 0, then Rₙ/n² ⇒ Y, the
indicated stable law. In Example 7.2 in Chapter 2, we considered τ = T₁ where
Tₓ = inf{n : Sₙ = x}. Since S₁ ∈ {−1, 1} and T₁ =ᵈ T₋₁, R =ᵈ 1 + T₁, and it
follows that Tₙ/n² ⇒ Y, the same stable law. In Example 6.6 of Chapter 7, we
will use this observation to find the density of the limit.
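The tail behavior in (3.4) can be seen numerically by computing P(S₂ₙ = 0) = C(2n,n)4⁻ⁿ exactly (an illustration of ours, not the book's):

```python
from math import comb, pi, sqrt

def p_zero(n):
    """P(S_{2n} = 0) = C(2n, n) / 4^n for simple random walk."""
    return comb(2 * n, n) / 4 ** n

# the ratio P(R > 2n) / (pi^{-1/2} n^{-1/2}) should approach 1
for n in [10, 100, 1000]:
    print(n, p_zero(n) * sqrt(pi * n))

assert abs(p_zero(1000) * sqrt(pi * 1000) - 1) < 1e-3
```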
This completes our discussion of visits to 0. We turn now to the arcsine
laws. The ﬁrst one concerns
$$L_{2n}=\sup\{m\le2n:S_m=0\}$$
It is remarkably easy to compute the distribution of L₂ₙ.
(3.5) Lemma. Let u₂ₘ = P(S₂ₘ = 0). Then P(L₂ₙ = 2k) = u₂ₖu₂ₙ₋₂ₖ.
Proof P(L₂ₙ = 2k) = P(S₂ₖ = 0, S₂ₖ₊₁ ≠ 0, . . . , S₂ₙ ≠ 0), so the desired
result follows from (3.3).
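The lemma pins down the whole distribution of L₂ₙ, so the values u₂ₖu₂ₙ₋₂ₖ must sum to 1 over k = 0, . . . , n; this can be confirmed in exact arithmetic (our check, not the book's):

```python
from fractions import Fraction
from math import comb

def u(two_m):
    """u_{2m} = P(S_{2m} = 0) as an exact fraction."""
    m = two_m // 2
    return Fraction(comb(2 * m, m), 4 ** m)

for n in range(1, 12):
    total = sum(u(2 * k) * u(2 * n - 2 * k) for k in range(n + 1))
    assert total == 1
print("P(L_{2n} = 2k) = u_{2k} u_{2n-2k} sums to 1 for n <= 11")
```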
(3.6) Arcsine law for the last visit to 0. For 0 < a < b < 1,
$$P(a\le L_{2n}/2n\le b)\to\int_a^b\pi^{-1}(x(1-x))^{-1/2}\,dx$$
To see the reason for the name, substitute y = x^{1/2}, dy = (1/2)x^{−1/2} dx in the
integral to obtain
$$\int_{\sqrt a}^{\sqrt b}\frac2\pi\,(1-y^2)^{-1/2}\,dy=\frac2\pi\left\{\arcsin(\sqrt b)-\arcsin(\sqrt a)\right\}$$
Since L₂ₙ is the time of the last zero before 2n, it is surprising that the
answer is symmetric about 1/2. The symmetry of the limit distribution implies
$$P(L_{2n}/2n\le1/2)\to1/2$$
In gambling terms, if two people were to bet $1 on a coin ﬂip every day of the
year, then with probability 1/2, one of the players will be ahead from July 1 to
the end of the year, an event that would undoubtedly cause the other player to
complain about his bad luck.
The next result deals directly with the amount of time one player is ahead.
Proof of (3.6) From the asymptotic formula for u₂ₙ, it follows that if k/n → x
then
$$nP(L_{2n}=2k)\to\pi^{-1}(x(1-x))^{-1/2}$$
To get from this to (3.6), we let 2naₙ = the smallest even integer ≥ 2na,
let 2nbₙ = the largest even integer ≤ 2nb, and let fₙ(x) = nP(L₂ₙ = 2k) for
2k/2n ≤ x < 2(k + 1)/2n, so we can write
$$P(a\le L_{2n}/2n\le b)=\sum_{k=na_n}^{nb_n}P(L_{2n}=2k)=\int_{a_n}^{b_n+1/n}f_n(x)\,dx$$
Our first result implies that uniformly on compact sets
$$f_n(x)\to f(x)=\pi^{-1}(x(1-x))^{-1/2}$$
The uniformity of the convergence implies
$$\sup_{a_n\le x\le b_n+1/n}f_n(x)\to\sup_{a\le x\le b}f(x)<\infty$$
if 0 < a ≤ b < 1, so the bounded convergence theorem gives
$$\int_{a_n}^{b_n+1/n}f_n(x)\,dx\to\int_a^bf(x)\,dx$$
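For moderate n the exact distribution of L₂ₙ/2n is already close to the arcsine limit; for instance with a = 1/4, b = 3/4 the limit is (2/π){arcsin√(3/4) − arcsin√(1/4)} = 1/3. A numerical sketch (ours):

```python
from math import comb, asin, pi, sqrt

def u(m):
    # u_{2m} = P(S_{2m} = 0)
    return comb(2 * m, m) / 4 ** m

n = 500
a, b = 0.25, 0.75
# exact P(a <= L_{2n}/2n <= b) from Lemma (3.5)
exact = sum(u(k) * u(n - k) for k in range(n + 1) if a <= k / n <= b)
limit = (2 / pi) * (asin(sqrt(b)) - asin(sqrt(a)))

print(exact, limit)
assert abs(limit - 1 / 3) < 1e-12
assert abs(exact - limit) < 0.01
```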
(3.7) Theorem. Let π₂ₙ be the number of segments (k − 1, Sₖ₋₁) → (k, Sₖ)
that lie above the axis (i.e., in {(x, y) : y ≥ 0}), and let u₂ₘ = P(S₂ₘ = 0). Then
$$P(\pi_{2n}=2k)=u_{2k}u_{2n-2k}$$
and consequently, if 0 < a < b < 1,
$$P(a\le\pi_{2n}/2n\le b)\to\int_a^b\pi^{-1}(x(1-x))^{-1/2}\,dx$$
Remark. Since π₂ₙ =ᵈ L₂ₙ, the second conclusion follows from the proof of
(3.6). The reader should note that the limiting density π^{−1}(x(1−x))^{−1/2} has
a minimum at x = 1/2 and → ∞ as x → 0 or 1. An equal division of steps
between the positive and negative sides is therefore the least likely possibility,
and completely one-sided divisions have the highest probability.
Proof Let β₂ₖ,₂ₙ denote the probability of interest. We will prove β₂ₖ,₂ₙ =
u₂ₖu₂ₙ₋₂ₖ by induction. When n = 1, it is clear that
$$\beta_{0,2}=\beta_{2,2}=1/2=u_0u_2$$
For a general n, first suppose k = n. From the proof of (3.3), we have
$$\begin{aligned}\tfrac12u_{2n}&=P(S_1>0,\dots,S_{2n}>0)\\&=P(S_1=1,\ S_2-S_1\ge0,\dots,S_{2n}-S_1\ge0)\\&=\tfrac12P(S_1\ge0,\dots,S_{2n-1}\ge0)\\&=\tfrac12P(S_1\ge0,\dots,S_{2n}\ge0)=\tfrac12\beta_{2n,2n}\end{aligned}$$
The next to last equality follows from the observation that if S₂ₙ₋₁ ≥ 0 then
S₂ₙ₋₁ ≥ 1, and hence S₂ₙ ≥ 0.
The last computation proves the result for k = n. Since β₀,₂ₙ = β₂ₙ,₂ₙ,
the result is also true when k = 0. Suppose now that 1 ≤ k ≤ n − 1. In this
case, if R is the time of the first return to 0, then R = 2m with 0 < m < n.
Letting f₂ₘ = P(R = 2m) and breaking things up according to whether the
first excursion was on the positive or negative side gives
$$\beta_{2k,2n}=\frac12\sum_{m=1}^{k}f_{2m}\beta_{2k-2m,2n-2m}+\frac12\sum_{m=1}^{n-k}f_{2m}\beta_{2k,2n-2m}$$
Using the induction hypothesis, it follows that
$$\beta_{2k,2n}=\frac12u_{2n-2k}\sum_{m=1}^{k}f_{2m}u_{2k-2m}+\frac12u_{2k}\sum_{m=1}^{n-k}f_{2m}u_{2n-2k-2m}$$
By considering the time of the first return to 0, we see that
$$u_{2k}=\sum_{m=1}^{k}f_{2m}u_{2k-2m}\qquad\qquad u_{2n-2k}=\sum_{m=1}^{n-k}f_{2m}u_{2n-2k-2m}$$
and the desired result follows.
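The induction can be mechanized: solve for f₂ₘ from the first-return identity u₂ₖ = Σ f₂ₘu₂ₖ₋₂ₘ, run the excursion recursion for β₂ₖ,₂ₙ, and compare with u₂ₖu₂ₙ₋₂ₖ in exact arithmetic. A sketch of that check (ours, not the book's):

```python
from fractions import Fraction
from math import comb

NMAX = 8
# u[m] stands for u_{2m} = P(S_{2m} = 0)
u = [Fraction(comb(2 * m, m), 4 ** m) for m in range(NMAX + 1)]

# recover f[m] = f_{2m} = P(R = 2m) from u_{2k} = sum f_{2m} u_{2k-2m}
f = [Fraction(0)] * (NMAX + 1)
for k in range(1, NMAX + 1):
    f[k] = u[k] - sum(f[m] * u[k - m] for m in range(1, k))

# beta[(k, n)] stands for beta_{2k,2n}; boundary cases from the proof
beta = {(0, 1): Fraction(1, 2), (1, 1): Fraction(1, 2)}
for n in range(2, NMAX + 1):
    for k in range(n + 1):
        if k == 0 or k == n:
            beta[(k, n)] = u[n]  # beta_{2n,2n} = u_{2n}
        else:
            beta[(k, n)] = (
                sum(f[m] * beta[(k - m, n - m)] for m in range(1, k + 1))
                + sum(f[m] * beta[(k, n - m)] for m in range(1, n - k + 1))
            ) / 2

for n in range(1, NMAX + 1):
    for k in range(n + 1):
        assert beta[(k, n)] == u[k] * u[n - k]
print("beta_{2k,2n} = u_{2k} u_{2n-2k} verified up to n =", NMAX)
```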
Our derivation of (3.7) relied heavily on special properties of simple random
walk. There is a closely related result due to E. Sparre-Andersen that is valid
for very general random walks. However, notice that hypothesis (ii) in (3.8)
below excludes simple random walk.
(3.8) Theorem. Let νₙ = |{k : 1 ≤ k ≤ n, Sₖ > 0}|. Then
(i) P(νₙ = k) = P(νₖ = k)P(νₙ₋ₖ = 0)
(ii) If the distribution of X₁ is symmetric and P(Sₘ = 0) = 0 for all m ≥ 1,
then
$$P(\nu_n=k)=u_{2k}u_{2n-2k}$$
where u₂ₘ = 2^{−2m}\binom{2m}{m} is the probability simple random walk is 0 at time 2m.
(iii) Under the hypotheses of (ii), for 0 < a < b < 1,
$$P(a\le\nu_n/n\le b)\to\int_a^b\pi^{-1}(x(1-x))^{-1/2}\,dx$$
b π −1 (x(1 − x))−1/2 dx P (a ≤ νn /n ≤ b) → for 0 < a < b < 1 a Proof Taking things in reverse order, (iii) is an immediate consequence of
(ii) and the proof of (3.6). Our next step is to show (ii) follows from (i) by
induction. When n = 1, our assumptions imply P (ν1 = 0) = 1/2 = u0 u2 . If
n > 1 and 1 < k < n, then (i) and the induction hypothesis imply
P (νn = k ) = u2k u0 · u0 u2n−2k = u2k u2n−2k
since u0 = 1. To handle the cases k = 0 and k = n, we note that (3.5) implies
n u2k u2n−2k = 1
k=0
n We have k=0 P (νn = k ) = 1 and our assumptions imply P (νn = 0) = P (νn =
n), so these probabilities must be equal to u0 u2n .
The proof of (i) is tricky and requires careful deﬁnitions since we are not
supposing X1 is symmetric or that P (Sm = 0) = 0. Let νn = {k : 1 ≤ k ≤ n,
Sk ≤ 0} = n − νn .
Mn = max Sj n = min{j : 0 ≤ j ≤ n, Sj = Mn } Mn = min Sj n = max{j : 0 ≤ j ≤ n, Sj = Mn } 0≤j ≤n
0≤j ≤n The ﬁrst symmetry is straightforward. 3.3 Visits to 0, Arcsine Laws
(3.9) Lemma. (ℓₙ, Sₙ) and (n − ℓ̄ₙ, Sₙ) have the same distribution.
Proof If we let Tₖ = Sₙ − Sₙ₋ₖ = Xₙ + · · · + Xₙ₋ₖ₊₁, then Tₖ, 0 ≤ k ≤ n, has
the same distribution as Sₖ, 0 ≤ k ≤ n. Clearly,
$$\max_{0\le k\le n}T_k=S_n-\min_{0\le k\le n}S_{n-k}$$
and the sets of k at which the extrema are attained are the same.
The second symmetry is much less obvious.
(3.10) Lemma. (ℓₙ, Sₙ) and (νₙ, Sₙ) have the same distribution, and
(ℓ̄ₙ, Sₙ) and (ν̄ₙ, Sₙ) have the same distribution.
Remark. (i) follows from (3.10) and the trivial observation
$$P(\ell_n=k)=P(\ell_k=k)\,P(\ell_{n-k}=0)$$
so once (3.10) is established, the proof of (3.8) will be complete.
Proof of (3.10) When n = 1, {ℓ₁ = 0} = {S₁ ≤ 0} = {ν₁ = 0}, and
{ℓ̄₁ = 0} = {S₁ > 0} = {ν̄₁ = 0}. We shall prove the general case by induction,
supposing that both statements have been proved when n is replaced by n − 1.
Let
$$G(y)=P(\ell_{n-1}=k,\ S_{n-1}\le y)\qquad\qquad H(y)=P(\nu_{n-1}=k,\ S_{n-1}\le y)$$
On {Sₙ ≤ 0}, we have ℓₙ = ℓₙ₋₁ and νₙ₋₁ = νₙ, so if F(y) = P(X₁ ≤ y) then
for x ≤ 0
$$(*)\qquad P(\ell_n=k,\ S_n\le x)=\int F(x-y)\,dG(y)=\int F(x-y)\,dH(y)=P(\nu_n=k,\ S_n\le x)$$
On {Sₙ > 0}, we have ℓ̄ₙ = ℓ̄ₙ₋₁ and ν̄ₙ₋₁ = ν̄ₙ, so repeating the last computation
shows that for x ≥ 0
$$P(\bar\ell_n=n-k,\ S_n>x)=P(\bar\nu_n=n-k,\ S_n>x)$$
Since (ℓₙ, Sₙ) has the same distribution as (n − ℓ̄ₙ, Sₙ) and νₙ = n − ν̄ₙ, it
follows that for x ≥ 0
$$P(\ell_n=k,\ S_n>x)=P(\nu_n=k,\ S_n>x)$$
Setting x = 0 in the last result and in (∗) and adding gives
$$P(\ell_n=k)=P(\nu_n=k)$$
Subtracting the last two equations and combining the result with (∗) gives
$$P(\ell_n=k,\ S_n\le x)=P(\nu_n=k,\ S_n\le x)$$
for all x. Since (ℓₙ, Sₙ) has the same distribution as (n − ℓ̄ₙ, Sₙ) and νₙ = n − ν̄ₙ,
it follows that
$$P(\bar\ell_n=n-k,\ S_n>x)=P(\bar\nu_n=n-k,\ S_n>x)$$
for all x. This completes the proof of (3.10) and hence of (3.8).

*3.4. Renewal Theory
Let ξ1 , ξ2 , . . . be i.i.d. positive random variables with distribution F and deﬁne
a sequence of times by T0 = 0, and Tk = Tk−1 + ξk for k ≥ 1. As explained in
Section 1.7, we think of ξi as the lifetime of the ith light bulb, and Tk is the
time the k th bulb burns out. A second interpretation from Section 2.6 is that
Tk is the time of arrival of the k th customer. To have a neutral terminology,
we will refer to the Tk as renewals. The term renewal refers to the fact that the
process “starts afresh” at Tk , i.e., {Tk+j − Tk , j ≥ 1} has the same distribution
as {Tj , j ≥ 1}.
Departing slightly from the notation in Sections 1.7 and 2.6, we let Nt =
inf {k : Tk > t}. Nt is the number of renewals in [0, t], counting the renewal at
time 0. In Chapter 1, see (7.3), we showed that
(4.1) Theorem. As t → ∞, Nt /t → 1/µ a.s. where µ = Eξi ∈ (0, ∞] and
1/∞ = 0.
Our ﬁrst result concerns the asymptotic behavior of U (t) = ENt .
(4.2) Theorem. As t → ∞, U (t)/t → 1/µ.
Proof We will apply Wald's equation to the stopping time N_t. The first step
is to show that P(ξᵢ > 0) > 0 implies EN_t < ∞. To do this, pick δ > 0 so that
P(ξᵢ > δ) = ε > 0 and pick K so that Kδ ≥ t. Since K consecutive ξᵢ's that
are all > δ will make Tₙ > t, we have
$$P(N_t>mK)\le(1-\varepsilon^K)^m$$
and EN_t < ∞. If µ < ∞, applying Wald's equation now gives
$$\mu\,EN_t=ET_{N_t}\ge t$$
so U(t) ≥ t/µ. The last inequality is trivial when µ = ∞, so it holds in general.
Turning to the upper bound, we observe that if P(ξᵢ ≤ c) = 1, then
repeating the last argument shows µEN_t = ET_{N_t} ≤ t + c, and the result holds
for bounded distributions. If we let ξ̄ᵢ = ξᵢ ∧ c and define T̄ₙ and N̄_t in the
obvious way, then
$$EN_t\le E\bar N_t\le(t+c)/E(\bar\xi_i)$$
Letting t → ∞ and then c → ∞ gives lim sup_{t→∞} EN_t/t ≤ 1/µ, and the proof
is complete.
Exercise 4.1. Show that t/E (ξi ∧ t) ≤ U (t) ≤ 2t/E (ξi ∧ t).
Exercise 4.2. Deduce (4.2) from (4.1) by showing lim sup_{t→∞} E(N_t/t)² < ∞.
Hint: Use a comparison like the one in the proof of (4.1).
Exercise 4.3. Customers arrive at times of a Poisson process with rate 1. If
the server is occupied, they leave. (Think of a public telephone or prostitute.)
If not, they enter service and require a service time with a distribution F that
has mean µ. Show that the times at which customers enter service are a renewal
process with mean µ + 1, and use (4.1) to conclude that the asymptotic fraction
of customers served is 1/(µ + 1).
To take a closer look at when the renewals occur, we let
$$U(A)=\sum_{n=0}^\infty P(T_n\in A)$$
U is called the renewal measure. We absorb the old definition, U(t) = EN_t,
into the new one by regarding U(t) as shorthand for U([0, t]). This should not
cause problems since U(t) is the distribution function for the renewal measure.
The asymptotic behavior of U(t) depends upon whether the distribution F is
arithmetic, i.e., concentrated on {δ, 2δ, 3δ, . . .} for some δ > 0, or nonarithmetic, i.e., not arithmetic. We will treat the first case in Chapter 5 as an
application of Markov chains, so we will restrict our attention to the second
case here.
(4.3) Blackwell's renewal theorem. If F is nonarithmetic, then U([t, t+h]) →
h/µ as t → ∞.
We will prove the result in the case µ < ∞ by "coupling," following Lindvall
(1977) and Athreya, McDonald, and Ney (1978). To set the stage for the
proof, we need a definition and some preliminary computations. If T₀ ≥ 0 is
independent of ξ₁, ξ₂, . . . and has distribution G, then Tₖ = Tₖ₋₁ + ξₖ, k ≥ 1,
defines a delayed renewal process, and G is the delay distribution. If we
let N_t = inf{k : Tₖ > t} as before and set V(t) = EN_t, then breaking things
down according to the value of T₀ gives
$$\text{(4.4)}\qquad V(t)=\int_0^tU(t-s)\,dG(s)$$
The last integral, and all similar expressions below, is intended to include the
contribution of any mass G has at 0. If we let U(r) = 0 for r < 0, then the last
equation can be written as V = U ∗ G, where ∗ denotes convolution.
Applying similar reasoning to U gives
$$\text{(4.5)}\qquad U(t)=1+\int_0^tU(t-s)\,dF(s)$$
or, introducing convolution notation, U = 1_{[0,∞)} + U ∗ F. Convolving each
side with G (and recalling G ∗ U = U ∗ G) gives
$$\text{(4.6)}\qquad V=G*U=G+V*F$$
We know U(t) ∼ t/µ. Our next step is to find a G so that V(t) = t/µ. Plugging
what we want into (4.6) gives
$$t/\mu=G(t)+\int_0^t\frac{t-y}{\mu}\,dF(y)\qquad\text{or}\qquad G(t)=t/\mu-\int_0^t\frac{t-y}{\mu}\,dF(y)$$
The integration-by-parts formula is
$$\int_0^tK(y)\,dH(y)=H(t)K(t)-H(0)K(0)-\int_0^tH(y)\,dK(y)$$
If we let H(y) = (y − t)/µ and K(y) = 1 − F(y), then
$$\frac1\mu\int_0^t1-F(y)\,dy=\frac t\mu-\int_0^t\frac{t-y}{\mu}\,dF(y)$$
so we have
$$\text{(4.7)}\qquad G(t)=\frac1\mu\int_0^t1-F(y)\,dy$$
It is comforting to note that µ = ∫_{[0,∞)} 1 − F(y) dy, so the last formula defines
a probability distribution. When the delay distribution G is the one given in
(4.7), we call the result the stationary renewal process. Something very
special happens when F(t) = 1 − exp(−λt), t ≥ 0, where λ > 0 (i.e., the renewal
process is a rate λ Poisson process). In this case, µ = 1/λ, so G(t) = F(t).
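Formula (4.7) is easy to work out in small cases; for uniform(0,1) interarrivals, µ = 1/2 and G(t) = 2t − t² on [0,1]. The seeded sketch below (an illustration of ours, not the text's) samples the delay from this G and checks that the stationary process has EN_t = t/µ:

```python
import random

random.seed(42)

def delay():
    # stationary delay for uniform(0,1) interarrivals:
    # G(t) = 2t - t^2, so G^{-1}(u) = 1 - sqrt(1 - u)
    return 1 - (1 - random.random()) ** 0.5

def N_t(t):
    """inf{k : T_k > t} for the delayed (stationary) renewal process."""
    T = delay()                 # T_0 has the delay distribution (4.7)
    k = 0
    while T <= t:
        T += random.random()    # uniform(0,1) interarrival
        k += 1
    return k

t, reps = 10.0, 20000
avg = sum(N_t(t) for _ in range(reps)) / reps
print(avg)  # V(t) = t/mu = 2t = 20 exactly for the stationary process
assert abs(avg - 2 * t) < 0.2
```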
Proof of (4.3) for µ < ∞ Let Tₙ be a renewal process (with T₀ = 0) and
T′ₙ an independent stationary renewal process. Our first goal is to find
J and K so that |T_J − T′_K| < ε and the increments {T_{J+i} − T_J, i ≥ 1} and
{T′_{K+i} − T′_K, i ≥ 1} are i.i.d. sequences independent of what has come before.
Let η₁, η₂, . . . and η′₁, η′₂, . . . be i.i.d., independent of Tₙ and T′ₙ, and taking
the values 0 and 1 with probability 1/2 each. Let νₙ = η₁ + · · · + ηₙ and
ν′ₙ = 1 + η′₁ + · · · + η′ₙ, Sₙ = T_{νₙ}, and S′ₙ = T′_{ν′ₙ}. The increments of Sₙ − S′ₙ
are 0 with probability at least 1/4, and the support of their distribution is
symmetric and contains the support of the ξₖ, so if the distribution of the ξₖ is
nonarithmetic, the random walk Sₙ − S′ₙ is irreducible. Since the increments of
Sₙ − S′ₙ have mean 0, N = inf{n : |Sₙ − S′ₙ| < ε} has P(N < ∞) = 1, and we
can let J = ν_N and K = ν′_N. Let
$$T''_n=\begin{cases}T_n&\text{if }n\le J\\T_J+T'_{K+(n-J)}-T'_K&\text{if }n>J\end{cases}$$
In other words, the increments T″_{J+i} − T″_J are the same as T′_{K+i} − T′_K for i ≥ 1.
It is easy to see from the construction that T″ₙ and Tₙ have the same
distribution. If we let
$$N''[s,t]=|\{n:T''_n\in[s,t]\}|\qquad\text{and}\qquad N'[s,t]=|\{n:T'_n\in[s,t]\}|$$
be the number of renewals in [s, t] in the two processes, then on {T_J ≤ t}
$$N''[t,t+h]=N'[t+T'_K-T_J,\ t+h+T'_K-T_J]\ \begin{cases}\ge N'[t+\varepsilon,\,t+h-\varepsilon]\\\le N'[t-\varepsilon,\,t+h+\varepsilon]\end{cases}$$
To relate the expected numbers of renewals in the two processes, we observe
that even if we condition on the location of all the renewals in [0, s], the expected
number of renewals in [s, s + t] is at most U(t), since the worst thing that could
happen is to have a renewal at time s. Combining the last two observations,
we see that if ε < h/2 (so [t + ε, t + h − ε] has positive length)
$$U([t,t+h])=EN''[t,t+h]\ge E(N'[t+\varepsilon,t+h-\varepsilon];\ T_J\le t)\ge\frac{h-2\varepsilon}{\mu}-P(T_J>t)\,U(h)$$
since EN′[t + ε, t + h − ε] = (h − 2ε)/µ and {T_J > t} is determined by the renewals
of T in [0, t] and the renewals of T′ in [0, t + ε]. For the other direction, we
observe
$$U([t,t+h])\le E(N'[t-\varepsilon,t+h+\varepsilon];\ T_J\le t)+E(N''[t,t+h];\ T_J>t)\le\frac{h+2\varepsilon}{\mu}+P(T_J>t)\,U(h)$$
The desired result now follows from the fact that P(T_J > t) → 0 and ε < h/2
is arbitrary.
Remark. In the first edition, we followed Athreya, McDonald, and Ney too
closely and repeated their mistaken claim that the coupling can be done taking
J = K = inf{k : |Tₖ − T′ₖ| < ε}. To see that this is not correct, suppose that the
interrenewal times have P(ξⱼ = 1) = P(ξⱼ = 1 + π) = 1/2. This distribution
is nonarithmetic, but ξⱼ − ξ′ⱼ is concentrated on {−π, 0, π}, so Tₖ − T′ₖ ∈
−T′₀ + πZ and we cannot couple unless T′₀ is within ε of a multiple of π. This
problem was pointed out to us by Torgny Lindvall. The remedy used above is
different from the one in his (1977) proof.
Proof of (4.3) for µ = ∞ In this case, there is no stationary renewal process,
so we have to resort to other methods. Let
$$\beta=\limsup_{t\to\infty}U(t,t+1]=\lim_{k\to\infty}U(t_k,t_k+1]$$
for some sequence t_k → ∞. We want to prove that β = 0, for then by addition
the previous conclusion holds with 1 replaced by any integer n and, by monotonicity, with n replaced by any h < n, and this gives us the result in (4.3). Fix
i and let
$$a_{k,j}=\int_{(j-1,j]}U(t_k-y,\ t_k+1-y]\,dF^{i*}(y)$$
By considering the location of T_i we get
$$\text{(a)}\qquad \lim_{k\to\infty}\sum_{j=1}^\infty a_{k,j}=\lim_{k\to\infty}\int U(t_k-y,\ t_k+1-y]\,dF^{i*}(y)=\beta$$
Since β is the lim sup, we must have
$$\text{(b)}\qquad \limsup_{k\to\infty}a_{k,j}\le\beta\cdot P(T_i\in(j-1,j])$$
We want to conclude from (a) and (b) that
$$\text{(c)}\qquad \liminf_{k\to\infty}a_{k,j}\ge\beta\cdot P(T_i\in(j-1,j])$$
To do this, we observe that by considering the location of the first renewal in
(j − 1, j],
$$\text{(d)}\qquad 0\le a_{k,j}\le U(1)\,P(T_i\in(j-1,j])$$
(c) is trivial when β = 0, so we can suppose β > 0. To argue by contradiction, suppose there exist j₀ and ε > 0 so that
$$\liminf_{k\to\infty}a_{k,j_0}\le\beta\cdot\{P(T_i\in(j_0-1,j_0])-\varepsilon\}$$
Pick kₙ → ∞ so that
$$a_{k_n,j_0}\to\beta\cdot\{P(T_i\in(j_0-1,j_0])-\varepsilon\}$$
Using (d), we can pick J ≥ j₀ so that
$$\limsup_{n\to\infty}\sum_{j=J+1}^\infty a_{k_n,j}\le U(1)\sum_{j=J+1}^\infty P(T_i\in(j-1,j])\le\beta\varepsilon/2$$
Now an easy argument shows
$$\limsup_{n\to\infty}\sum_{j=1}^Ja_{k_n,j}\le\sum_{j=1}^J\limsup_{n\to\infty}a_{k_n,j}\le\beta\left\{\sum_{j=1}^JP(T_i\in(j-1,j])-\varepsilon\right\}$$
by (b) and our assumption. Adding the last two results shows
$$\limsup_{n\to\infty}\sum_{j=1}^\infty a_{k_n,j}\le\beta(1-\varepsilon/2)$$
which contradicts (a), and proves (c).
Now, if j − 1 < y ≤ j, we have
$$U(t_k-y,\ t_k+1-y]\le U(t_k-j,\ t_k+2-j]$$
so using (c) it follows that for j with P(T_i ∈ (j − 1, j]) > 0, we must have
$$\liminf_{k\to\infty}U(t_k-j,\ t_k+2-j]\ge\beta$$
Summing over i, we see that the last conclusion is true when U(j − 1, j] > 0.
The support of U is closed under addition. (If x is in the support of F^{m∗}
and y is in the support of F^{n∗}, then x + y is in the support of F^{(m+n)∗}.) We have
assumed F is nonarithmetic, so U(j − 1, j] > 0 for j ≥ j₀. Letting rₖ = tₖ − j₀
and considering the location of the last renewal in [0, rₖ] and the index of the
Tᵢ gives
$$1=\sum_{i=0}^\infty\int_0^{r_k}(1-F(r_k-y))\,dF^{i*}(y)=\int_0^{r_k}(1-F(r_k-y))\,dU(y)\ge\sum_{n=1}^\infty(1-F(2n))\,U(r_k-2n,\ r_k+2-2n]$$
Since lim inf_{k→∞} U(rₖ − 2n, rₖ + 2 − 2n] ≥ β and
$$\sum_{n=0}^\infty(1-F(2n))\ge\mu/2=\infty$$
β must be 0, and the proof is complete.
Remark. Following Lindvall (1977), we have based the proof for µ = ∞ on
part of Feller's (1961) proof of the discrete renewal theorem (i.e., for arithmetic
distributions). See Freedman (1971b), p. 22–25, for an account of Feller's proof.
Purists can find a proof that does everything by coupling in Thorisson (1987).
Our next topic is the renewal equation: H = h + H ∗ F. Two cases we
have seen in (4.5) and (4.6) are:
Example 4.1. h ≡ 1: U(t) = 1 + ∫₀ᵗ U(t − s) dF(s)
Example 4.2. h(t) = G(t): V(t) = G(t) + ∫₀ᵗ V(t − s) dF(s)
The last equation is valid for an arbitrary delay distribution. If we let G
be the distribution in (4.7) and subtract the last two equations, we get
Example 4.3. H(t) = U(t) − t/µ satisfies the renewal equation with h(t) =
(1/µ)∫ₜ^∞ 1 − F(s) ds.
Last but not least, we have an example that is a typical application of the
renewal equation.
Example 4.4. Let x > 0 be fixed, and let H(t) = P(T_{N(t)} − t > x). By
considering the value of T₁, we get
$$H(t)=(1-F(t+x))+\int_0^tH(t-s)\,dF(s)$$
The examples above should provide motivation for:
(4.8) Theorem. If h is bounded, then the function
$$H(t)=\int_0^th(t-s)\,dU(s)$$
is the unique solution of the renewal equation that is bounded on bounded
intervals.
Proof Let Uₙ(A) = Σ_{m=0}^n P(Tₘ ∈ A) and
$$H_n(t)=\int_0^th(t-s)\,dU_n(s)=\sum_{m=0}^n(h*F^{m*})(t)$$
Here, F^{m∗} is the distribution of Tₘ, and we have extended the definition of h
by setting h(r) = 0 for r < 0. From the last expression, it should be clear that
$$H_{n+1}=h+H_n*F$$
The fact that U(t) < ∞ implies U(t) − Uₙ(t) → 0. Since h is bounded,
$$|H_n(t)-H(t)|\le\|h\|_\infty\,|U(t)-U_n(t)|$$
and Hₙ(t) → H(t) uniformly on bounded intervals. To estimate the convolution, we note that
$$|H_n*F(t)-H*F(t)|\le\sup_{s\le t}|H_n(s)-H(s)|\le\|h\|_\infty\,|U(t)-U_n(t)|$$
since U − Uₙ = Σ_{m=n+1}^∞ F^{m∗} is increasing in t. Letting n → ∞ in H_{n+1} =
h + Hₙ ∗ F, we see that H is a solution of the renewal equation that is bounded
on bounded intervals.
To prove uniqueness, we observe that if H₁ and H₂ are two solutions, then
K = H₁ − H₂ satisfies K = K ∗ F. If K is bounded on bounded intervals,
iterating gives K = K ∗ F^{n∗} → 0 as n → ∞, so H₁ = H₂.
The proof of (4.8) is valid when F(∞) = P(ξᵢ < ∞) < 1. In this case,
we have a terminating renewal process: after a geometric number of trials
with mean 1/(1 − F(∞)), Tₙ = ∞. This "trivial case" has some interesting
applications.
Example 4.5. Pedestrian delay. A chicken wants to cross a road (we won't
ask why) on which the traffic is a Poisson process with rate λ. She needs
one unit of time with no arrival to safely cross the road. Let M = inf{t ≥ 0 :
there are no arrivals in (t, t + 1]} be the waiting time until she starts to cross the
street. By considering the time of the first arrival, we see that H(t) = P(M ≤ t)
satisfies
$$H(t)=e^{-\lambda}+\int_0^1H(t-y)\,\lambda e^{-\lambda y}\,dy$$
Comparing with Example 4.1 and using (4.8), we see that
$$H(t)=e^{-\lambda}\sum_{n=0}^\infty F^{n*}(t)$$
where F is the defective distribution with density λe^{−λy} on (0, 1). We could
have gotten this answer without renewal theory by noting
$$P(M\le t)=\sum_{n=0}^\infty P(T_n\le t,\ T_{n+1}=\infty)$$
The last representation allows us to compute the mean of M. Let µ be the
mean of the interarrival time given that it is < 1, and note that the lack of
memory property of the exponential distribution implies
$$\mu(1-e^{-\lambda})=\int_0^\infty x\lambda e^{-\lambda x}\,dx-\int_1^\infty x\lambda e^{-\lambda x}\,dx=\frac1\lambda-\left(1+\frac1\lambda\right)e^{-\lambda}$$
Then, by considering the number of renewals in our terminating renewal process,
$$EM=\sum_{n=0}^\infty e^{-\lambda}(1-e^{-\lambda})^n\,n\mu=(e^\lambda-1)\mu$$
since if X is a geometric random variable with success probability e^{−λ}, then
EM = µE(X − 1).
Example 4.6. Cramér's estimates of ruin. Consider an insurance company
that collects money at rate c and experiences i.i.d. claims at the arrival times
of a Poisson process N_t with rate 1. If its initial capital is x, its wealth at time
t is
$$W_x(t)=x+ct-\sum_{m=1}^{N_t}Y_m$$
Here Y₁, Y₂, . . . are i.i.d. with distribution G and mean µ. Let
$$R(x)=P(W_x(t)\ge0\text{ for all }t)$$
be the probability of never going bankrupt starting with capital x. By considering the time and size of the first claim,
$$\text{(a)}\qquad R(x)=\int_0^\infty e^{-s}\int_0^{x+cs}R(x+cs-y)\,dG(y)\,ds$$
This does not look much like a renewal equation, but with some ingenuity it
can be transformed into one. Changing variables t = x + cs,
$$R(x)e^{-x/c}=\int_x^\infty e^{-t/c}\int_0^tR(t-y)\,dG(y)\,\frac{dt}{c}$$
Differentiating with respect to x and then multiplying by e^{x/c},
$$R'(x)=\frac1cR(x)-\frac1c\int_0^xR(x-y)\,dG(y)$$
Integrating x from 0 to w,
$$\text{(b)}\qquad R(w)-R(0)=\frac1c\int_0^wR(x)\,dx-\frac1c\int_0^w\int_0^xR(x-y)\,dG(y)\,dx$$
Interchanging the order of integration in the double integral, letting
$$S(w)=\int_0^wR(x)\,dx$$
using dG = −d(1 − G), and then integrating by parts,
$$-\frac1c\int_0^w\int_y^wR(x-y)\,dx\,dG(y)=-\frac1c\int_0^wS(w-y)\,dG(y)=\frac1c\int_0^wS(w-y)\,d(1-G)(y)=\frac1c\left\{-S(w)+\int_0^w(1-G(y))\,R(w-y)\,dy\right\}$$
Plugging this into (b), we finally have a renewal equation:
$$\text{(c)}\qquad R(w)=R(0)+\int_0^wR(w-y)\,\frac{1-G(y)}{c}\,dy$$
It took some cleverness to arrive at the last equation, but it is straightforward
to analyze. First, we dismiss a trivial case. If µ > c,
$$\frac1t\left(ct-\sum_{m=1}^{N_t}Y_m\right)\to c-\mu<0\quad\text{a.s.}$$
so R(x) ≡ 0. When µ < c,
$$F(x)=\int_0^x\frac{1-G(y)}{c}\,dy$$
is a defective probability distribution with F(∞) = µ/c. Our renewal equation
can be written as
$$\text{(d)}\qquad R=R(0)+R*F$$
so comparing with Example 4.1 and using (4.8) tells us R(w) = R(0)U(w).
To complete the solution, we have to compute the constant R(0). Letting
w → ∞ and noticing R(w) → 1 and U(w) → (1 − F(∞))^{−1} = (1 − µ/c)^{−1}, we have
R(0) = 1 − µ/c.
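R(0) = 1 − µ/c can be spot-checked by simulation. A seeded sketch of ours (not part of the text) with exponential(mean 1) claims and c = 2: survival from initial capital 0 should occur with probability ≈ 1/2. We declare survival once wealth reaches a high level, since bankruptcy after that point has negligible probability.

```python
import random

random.seed(11)
c, mu = 2.0, 1.0   # premium rate and mean claim size; claims at rate 1
SAFE = 40.0        # wealth level after which ruin is essentially impossible

def survives():
    w = 0.0        # initial capital x = 0
    while w < SAFE:
        w += c * random.expovariate(1.0)    # premiums until next claim
        w -= random.expovariate(1.0 / mu)   # claim, exponential mean mu
        if w < 0:
            return False
    return True

reps = 40000
frac = sum(survives() for _ in range(reps)) / reps
print(frac)  # compare with R(0) = 1 - mu/c = 0.5
assert abs(frac - (1 - mu / c)) < 0.02
```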
The basic fact about solutions of the renewal equation (in the nonterminating case) is:
(4.9) The renewal theorem. If F is nonarithmetic and h is directly Riemann
integrable, then as t → ∞
$$H(t)\to\frac1\mu\int_0^\infty h(s)\,ds$$
Intuitively, this holds since (4.8) implies
$$H(t)=\int_0^th(t-s)\,dU(s)$$
and (4.3) implies dU(s) → ds/µ as s → ∞. We will define directly Riemann
integrable in a minute. We will start doing the proof and then figure out what
we need to assume.
Proof Suppose
$$h(s)=\sum_{k=0}^\infty a_k1_{[k\delta,(k+1)\delta)}(s)$$
where Σ_{k=0}^∞ |aₖ| < ∞. Since U([t, t + δ]) ≤ U([0, δ]) < ∞, it follows easily from
(4.3) that
$$\int_0^th(t-s)\,dU(s)=\sum_{k=0}^\infty a_kU((t-(k+1)\delta,\ t-k\delta])\to\frac1\mu\sum_{k=0}^\infty a_k\delta$$
(Pick K so that Σ_{k≥K}|aₖ| ≤ ε/{2U([0, δ])}, and then T so that |aₖ| · |U((t − (k + 1)δ, t − kδ]) − δ/µ| ≤ ε/2K for t ≥ T and 0 ≤ k < K.)
If h is an arbitrary function on [0, ∞), we let
$$\bar I_\delta=\delta\sum_{k=0}^\infty\sup\{h(x):x\in[k\delta,(k+1)\delta)\}\qquad\quad \underline I_\delta=\delta\sum_{k=0}^\infty\inf\{h(x):x\in[k\delta,(k+1)\delta)\}$$
be upper and lower Riemann sums approximating the integral of h over [0, ∞).
Comparing h with the obvious upper and lower bounds that are constant on
[kδ, (k + 1)δ) and using the result for the special case,
$$\frac{\underline I_\delta}{\mu}\le\liminf_{t\to\infty}\int_0^th(t-s)\,dU(s)\le\limsup_{t\to\infty}\int_0^th(t-s)\,dU(s)\le\frac{\bar I_\delta}{\mu}$$
If Ī_δ and I_δ both approach the same finite limit I as δ → 0, then h is said to
be directly Riemann integrable, and it follows that
$$\int_0^th(t-s)\,dU(s)\to I/\mu$$
Remark. The word "direct" in the name refers to the fact that while the
Riemann integral over [0, ∞) is usually defined as the limit of integrals over
[0, a], we are approximating the integral over [0, ∞) directly.
In checking the new hypothesis in (4.9), the following result is useful.
(4.10) Lemma. If h(x) ≥ 0 is decreasing with h(0) < ∞ and ∫₀^∞ h(x) dx < ∞,
then h is directly Riemann integrable.
Proof Because h is decreasing, Ī_δ = Σ_{k=0}^∞ δh(kδ) and I_δ = Σ_{k=0}^∞ δh((k + 1)δ).
So
$$\bar I_\delta\ge\int_0^\infty h(x)\,dx\ge\underline I_\delta=\bar I_\delta-h(0)\delta$$
proving the desired result.
The last result suffices for all our applications, so we leave it to the reader
to do
Exercise 4.4. If h ≥ 0 is continuous, then h is directly Riemann integrable if
and only if Ī_δ < ∞ for some δ > 0 (and hence for all δ > 0).
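For a concrete feel, take h(x) = e^{−x}: the upper and lower sums in the proof of (4.10) are geometric series, Ī_δ = δ/(1 − e^{−δ}) and I_δ = δe^{−δ}/(1 − e^{−δ}), and both tend to 1 = ∫h as δ → 0. A numeric sketch of ours:

```python
import math

def riemann_sums(delta, terms=10000):
    """Upper/lower sums of (4.10) for the decreasing h(x) = exp(-x)."""
    upper = sum(delta * math.exp(-k * delta) for k in range(terms))
    lower = sum(delta * math.exp(-(k + 1) * delta) for k in range(terms))
    return upper, lower

for delta in [1.0, 0.1, 0.01]:
    up, lo = riemann_sums(delta)
    # closed forms: delta/(1-e^{-delta}) and delta e^{-delta}/(1-e^{-delta})
    assert abs(up - delta / (1 - math.exp(-delta))) < 1e-6
    assert abs(up - lo - delta) < 1e-9   # I-bar minus I-under = h(0) * delta
    print(delta, up, lo)
```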
Returning now to our examples, we skip the first two because, in those
cases, h(t) → 1 as t → ∞, so h is not integrable in any sense.
Example 4.3, part II. h(t) = (1/µ)∫_{[t,∞)} 1 − F(s) ds. h is decreasing, h(0) = 1,
and
$$\int_0^\infty h(t)\,dt=\frac1\mu\int_0^\infty\int_t^\infty1-F(s)\,ds\,dt=\frac1\mu\int_0^\infty\int_0^s1-F(s)\,dt\,ds=\frac1\mu\int_0^\infty s(1-F(s))\,ds=E(\xi_i^2)/2\mu$$
So, if ν ≡ E(ξᵢ²) < ∞, it follows from (4.10), (4.9), and the formula in Example
4.3 that
$$0\le U(t)-t/\mu\to\nu/2\mu^2\qquad\text{as }t\to\infty$$
When the renewal process is a rate λ Poisson process, i.e., P(ξᵢ > t) = e^{−λt},
N(t) − 1 has a Poisson distribution with mean λt, so U(t) = 1 + λt. According
to Feller, Vol. II (1971), p. 385, if the ξᵢ are uniform on (0,1), then
$$U(t)=\sum_{k=0}^n(-1)^ke^{t-k}\,\frac{(t-k)^k}{k!}\qquad\text{for }n\le t\le n+1$$
As he says, the exact expression "reveals little about the nature of U. The
asymptotic formula 0 ≤ U(t) − 2t → 2/3 is much more interesting."
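Feller's formula and the asymptotic 0 ≤ U(t) − 2t → 2/3 can be checked directly (our numerical sketch, not part of the text):

```python
import math

def U(t):
    """Renewal function for uniform(0,1) interarrivals (Feller, Vol. II)."""
    n = int(t)
    return sum((-1) ** k * math.exp(t - k) * (t - k) ** k / math.factorial(k)
               for k in range(n + 1))

assert abs(U(0.5) - math.exp(0.5)) < 1e-12   # for 0 <= t <= 1, U(t) = e^t
for t in [5.0, 10.0, 15.0]:
    print(t, U(t) - 2 * t)                    # approaches 2/3
assert abs(U(10.0) - 20.0 - 2 / 3) < 1e-5
```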
Example 4.4, part II. h(t) = 1 − F(t + x). Again, h is decreasing, but this time h(0) ≤ 1 and the integral of h is finite when µ = E(ξi) < ∞. Applying (4.10) and (4.9) now gives

P(T_{N(t)} − t > x) → (1/µ) ∫_0^∞ h(s) ds = (1/µ) ∫_x^∞ (1 − F(t)) dt

so (when µ < ∞) the distribution of the residual waiting time T_{N(t)} − t converges to the delay distribution that produces the stationary renewal process. This fact also follows from our proof of (4.3).
This fact also follows from our proof of (4.3).
Using the method employed to study Example 4.4, one can analyze various other aspects of the asymptotic behavior of renewal processes. To avoid repeating ourselves, we adopt the following convention: We assume throughout that F is nonarithmetic, and in problems where the mean appears we assume it is finite.
Exercise 4.5. Let At = t − T_{N(t)−1} be the “age” at time t, i.e., the amount of time since the last renewal. If we fix x > 0 then H(t) = P(At > x) satisfies the renewal equation

H(t) = (1 − F(t)) · 1_{(x,∞)}(t) + ∫_0^t H(t − s) dF(s)

so P(At > x) → (1/µ) ∫_{(x,∞)} (1 − F(t)) dt, which is the limit distribution for the residual lifetime Bt = T_{N(t)} − t.

Remark. The last result can be derived from Example 4.4 by noting that if t > x then P(At ≥ x) = P(B_{t−x} > x) = P(no renewal in (t − x, t]). To check the placement of the strict inequality, recall N_t = inf{k : T_k > t} so we always have A_s ≥ 0 and B_s > 0.
Exercise 4.6. Use the renewal equation in the last problem and (4.8) to conclude that if T is a rate λ Poisson process, then At has the same distribution as ξi ∧ t.
Exercise 4.7. Let At = t − T_{N(t)−1} and Bt = T_{N(t)} − t. Show that

P(At > x, Bt > y) → (1/µ) ∫_{x+y}^∞ (1 − F(t)) dt
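As an illustration of the limit in Exercise 4.7 (a simulation we add, not from the text), take ξ exponential with rate λ, so µ = 1/λ and the limit equals e^{−λ(x+y)}: asymptotically the age and residual life are independent exponentials. The parameter values below are arbitrary.

```python
import math
import random

random.seed(7)
lam, t, trials = 1.0, 30.0, 100_000
x, y = 0.5, 0.7
hits = 0
for _ in range(trials):
    T = 0.0
    while True:
        nxt = T + random.expovariate(lam)   # next renewal time
        if nxt > t:
            age, residual = t - T, nxt - t  # A_t and B_t
            break
        T = nxt
    hits += (age > x and residual > y)
# limit (1/µ)∫_{x+y}^∞ (1 − F(s)) ds = e^{−λ(x+y)} for exponential ξ
target = math.exp(-lam * (x + y))
assert abs(hits / trials - target) < 0.01
```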
x+y Exercise 4.8. Alternating renewal process. Let ξ1 , ξ2 , . . . > 0 be i.i.d. with
distribution F1 and let η1, η2, . . . > 0 be i.i.d. with distribution F2. Let T0 = 0
and for k ≥ 1 let Sk = Tk−1 + ξk and Tk = Sk + ηk . In words, we have a machine
that works for an amount of time ξk , breaks down, and then requires ηk units
of time to be repaired. Let F = F1 ∗ F2 and let H (t) be the probability the
machine is working at time t. Show that if F is nonarithmetic then as t → ∞
H (t) → µ1 /(µ1 + µ2 )
where µi is the mean of Fi .
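A simulation sketch of Exercise 4.8 (our addition, with arbitrary choices F1 exponential with mean 2 and F2 uniform on (0,1)): rather than estimating H(t) at one large t, we check the long-run fraction of time the machine is up, which converges to the same limit µ1/(µ1 + µ2) by a renewal-reward argument.

```python
import random

random.seed(1)
T_end = 200_000.0
t = up_time = 0.0
while t < T_end:
    work = random.expovariate(0.5)      # ξk ~ F1 = Exp(rate 1/2), µ1 = 2
    repair = random.uniform(0.0, 1.0)   # ηk ~ F2 = Uniform(0,1), µ2 = 1/2
    up_time += min(work, T_end - t)     # clip the last working stretch
    t += work + repair
frac = up_time / T_end
assert abs(frac - 2.0 / 2.5) < 0.02    # µ1/(µ1 + µ2) = 0.8
```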
Exercise 4.9. Write a renewal equation for H (t) = P ( number of renewals in
[0, t] is odd) and use the renewal theorem to show that H (t) → 1/2. Note: This
is a special case of the previous exercise.
Exercise 4.10. Renewal densities. Show that if F(t) has a directly Riemann integrable density function f(t), then V = U − 1_{[0,∞)} has a density v that satisfies

v(t) = f(t) + ∫_0^t v(t − s) dF(s)

Use the renewal theorem to conclude that if f is directly Riemann integrable then v(t) → 1/µ as t → ∞.
Finally, we have an example that would have been given right after (4.1)
but was delayed because we had not yet deﬁned a delayed renewal process.
Example 4.7. Patterns in coin tossing. Let Xn, n ≥ 1 take values H and T with probability 1/2 each. Let T0 = 0 and Tm = inf{n > T_{m−1} : (Xn, . . . , X_{n+k−1}) = (i1, . . . , ik)} where (i1, . . . , ik) is some pattern of heads and tails. It is easy to see that the Tj form a delayed renewal process, i.e., tj = Tj − T_{j−1} are independent for j ≥ 1 and identically distributed for j ≥ 2. To see that the distribution of t1 may be different, let (i1, i2, i3) = (H, H, H). In this case, P(t1 = 1) = 1/8, P(t2 = 1) = 1/2.
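A quick Monte Carlo check of Example 4.7 for the pattern (H, H, H) (our addition): occurrence start positions in a long toss sequence form the delayed renewal process, the mean gap between successive occurrences should be 2³ = 8 (see Exercise 4.11 below), and the fraction of gaps equal to 1 should be near P(t2 = 1) = 1/2.

```python
import random

random.seed(3)
n = 1_000_000
xs = [random.random() < 0.5 for _ in range(n)]  # True = heads
# positions i where the pattern (X_i, X_{i+1}, X_{i+2}) = (H, H, H) starts
times = [i for i in range(n - 2) if xs[i] and xs[i + 1] and xs[i + 2]]
gaps = [b - a for a, b in zip(times, times[1:])]  # the t_j, j ≥ 2
assert abs(sum(gaps) / len(gaps) - 8.0) < 0.1        # E t_j = 2^3 = 8
assert abs(sum(g == 1 for g in gaps) / len(gaps) - 0.5) < 0.01  # P(t2 = 1)
```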
Exercise 4.11. (i) Show that for any pattern of length k, Etj = 2^k for j ≥ 2. (ii) Compute Et1 when the pattern is HH, and when it is HT. Hint: For HH, observe

Et1 = 2P(HH) + P(HT)(Et1 + 2) + P(T)(Et1 + 1)

4 Martingales

A martingale Xn can be thought of as the fortune at time n of a player who
is betting on a fair game; submartingales (supermartingales) as the outcome
of betting on a favorable (unfavorable) game. There are two basic facts about
martingales. The ﬁrst is that you cannot make money betting on them (see
(2.7)), and in particular if you choose to stop playing at some bounded time N
then your expected winnings EXN are equal to your initial fortune X0 . (We
are supposing for the moment that X0 is not random.) Our second fact, (2.10),
concerns submartingales. To use a heuristic we learned from Mike Brennan, “They are the stochastic analogues of nondecreasing sequences and so if they are bounded above (to be precise, sup_n EXn^+ < ∞) they converge almost surely.”
As the material in Section 4.3 shows, this result has diverse applications. Later
sections give suﬃcient conditions for martingales to converge in Lp , p > 1
(Section 4.4) and in L1 (Section 4.5); consider martingales indexed by n ≤
0 (Section 4.6); and give suﬃcient conditions for EXN = EX0 to hold for
unbounded stopping times (Section 4.7). The last result is quite useful for
studying the behavior of random walks and other systems.

4.1. Conditional Expectation
We begin with a deﬁnition that is important for this chapter and the next
one. After giving the deﬁnition, we will consider several examples to explain
it. Given are a probability space (Ω, Fo, P), a σ-field F ⊂ Fo, and a random variable X ∈ Fo with E|X| < ∞. We define the conditional expectation of X given F, E(X|F), to be any random variable Y that has

(i) Y ∈ F, i.e., is F-measurable
(ii) for all A ∈ F, ∫_A X dP = ∫_A Y dP

Any Y satisfying (i) and (ii) is said to be a version of E(X|F). The first thing
to be settled is that the conditional expectation exists and is unique. We tackle
the second claim ﬁrst but start with a technical point.
If Y satisfies (i) and (ii), then it is integrable.
Proof Letting A = {Y > 0} ∈ F, using (ii) twice, and then adding

∫_A Y dP = ∫_A X dP ≤ ∫_A |X| dP
∫_{A^c} −Y dP = ∫_{A^c} −X dP ≤ ∫_{A^c} |X| dP

So we have E|Y| ≤ E|X|.
Uniqueness. If Y′ also satisfies (i) and (ii) then

∫_A Y dP = ∫_A Y′ dP for all A ∈ F

Taking A = {Y − Y′ ≥ ε > 0}, we see

0 = ∫_A X − X dP = ∫_A Y − Y′ dP ≥ εP(A)

so P(A) = 0. Since this holds for all ε we have Y ≤ Y′ a.s., and interchanging the roles of Y and Y′, we have Y = Y′ a.s. Technically, all equalities such as Y = E(X|F) should be written as Y = E(X|F) a.s., but we have ignored this point in previous chapters and will continue to do so.
Exercise 1.1. Generalize the last argument to show that if X1 = X2 on B ∈ F then E(X1|F) = E(X2|F) a.s. on B.
Existence. To start, we recall ν is said to be absolutely continuous with respect to µ (abbreviated ν ≪ µ) if µ(A) = 0 implies ν(A) = 0, and we use (8.6) from the Appendix:

Radon-Nikodym Theorem. Let µ and ν be σ-finite measures on (Ω, F). If ν ≪ µ, there is a function f ∈ F so that for all A ∈ F

∫_A f dµ = ν(A)

f is usually denoted dν/dµ and called the Radon-Nikodym derivative.
The last theorem easily gives the existence of conditional expectation. Suppose first that X ≥ 0. Let µ = P and

ν(A) = ∫_A X dP for A ∈ F

The dominated convergence theorem implies ν is a measure (see Exercise 5.8 in the Appendix) and the definition of the integral implies ν ≪ µ. The Radon-Nikodym derivative dν/dµ ∈ F and for any A ∈ F has

∫_A X dP = ν(A) = ∫_A (dν/dµ) dP

Taking A = Ω, we see that dν/dµ ≥ 0 is integrable, and we have shown that dν/dµ is a version of E(X|F).

To treat the general case now, write X = X⁺ − X⁻, let Y1 = E(X⁺|F) and Y2 = E(X⁻|F). Now Y1 − Y2 ∈ F is integrable, and for all A ∈ F we have

∫_A X dP = ∫_A X⁺ dP − ∫_A X⁻ dP = ∫_A Y1 dP − ∫_A Y2 dP = ∫_A (Y1 − Y2) dP

This shows Y1 − Y2 is a version of E(X|F) and completes the proof.

a. Examples
Intuitively, we think of F as describing the information we have at our disposal: for each A ∈ F, we know whether or not A has occurred. E(X|F) is then our “best guess” of the value of X given the information we have. Some examples should help to clarify this and connect E(X|F) with other definitions of conditional expectation.
Example 1.1. If X ∈ F, then E(X|F) = X; i.e., if we know X then our “best guess” is X itself. Since X always satisfies (ii), the only thing that can keep X from being E(X|F) is condition (i). A special case of this example is X = c, where c is a constant.
Example 1.2. At the other extreme from perfect information is no information. Suppose X is independent of F, i.e., for all B ∈ R and A ∈ F

P({X ∈ B} ∩ A) = P(X ∈ B)P(A)

We claim that, in this case, E(X|F) = EX; i.e., if you don’t know anything about X, then the best guess is the mean EX. To check the definition, note that EX ∈ F so (i) holds. To verify (ii), we observe that if A ∈ F then since X and 1_A ∈ F are independent, (4.8) in Chapter 1 implies

∫_A X dP = E(X 1_A) = EX E1_A = ∫_A EX dP
The reader should note that here and in what follows the game is “guess and
verify.” We come up with a formula for the conditional expectation and then
check that it satisﬁes (i) and (ii).
Example 1.3. In this example, we relate the new definition of conditional expectation to the first one taught in an undergraduate probability course. Suppose Ω1, Ω2, . . . is a finite or infinite partition of Ω into disjoint sets, each of which has positive probability, and let F = σ(Ω1, Ω2, . . .) be the σ-field generated by these sets. Then

E(X|F) = E(X; Ωi)/P(Ωi) on Ωi

In words, the information in F tells us which element of the partition our outcome lies in and, given this information, the best guess for X is the average value of X over Ωi. To prove our guess is correct, observe that the proposed formula is constant on each Ωi, so it is measurable with respect to F. To verify (ii), it is enough to check the equality for A = Ωi, but this is trivial:

∫_{Ωi} [E(X; Ωi)/P(Ωi)] dP = E(X; Ωi) = ∫_{Ωi} X dP

A degenerate but important special case is F = {∅, Ω}, the trivial σ-field. In this case, E(X|F) = EX.
To continue the connection with undergraduate notions, let

P(A|G) = E(1_A|G)
P(A|B) = P(A ∩ B)/P(B)

and observe that in the last example P(A|F) = P(A|Ωi) on Ωi.
Exercise 1.2. Bayes’ formula. Let G ∈ G and show that

P(G|A) = ∫_G P(A|G) dP / ∫_Ω P(A|G) dP

When G is the σ-field generated by a partition, this reduces to the usual Bayes’ formula

P(Gi|A) = P(A|Gi)P(Gi) / Σ_j P(A|Gj)P(Gj)
The definition of conditional expectation given a σ-field contains conditioning on a random variable as a special case. We define

E(X|Y) = E(X|σ(Y))

where σ(Y) is the σ-field generated by Y.
Example 1.4. To continue making connection with definitions of conditional expectation from undergraduate probability, suppose X and Y have joint density f(x, y), i.e.,

P((X, Y) ∈ B) = ∫_B f(x, y) dx dy for B ∈ R²

and suppose for simplicity that ∫ f(x, y) dx > 0 for all y. We claim that in this case, if E|g(X)| < ∞ then E(g(X)|Y) = h(Y), where

h(y) = ∫ g(x)f(x, y) dx / ∫ f(x, y) dx

To “guess” this formula, note that treating the probability densities P(Y = y) as if they were real probabilities

P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y) = f(x, y) / ∫ f(x, y) dx

so, integrating against the conditional probability density, we have

E(g(X)|Y = y) = ∫ g(x)P(X = x|Y = y) dx

To “verify” the proposed formula now, observe h(Y) ∈ σ(Y) so (i) holds. To check (ii), observe that if A ∈ σ(Y) then A = {ω : Y(ω) ∈ B} for some B ∈ R, so

E(h(Y); A) = ∫_B h(y) ∫ f(x, y) dx dy = ∫_B ∫ g(x)f(x, y) dx dy = E(g(X)1_B(Y)) = E(g(X); A)

Remark. To drop the assumption that ∫ f(x, y) dx > 0, define h by

h(y) = ∫ g(x)f(x, y) dx / ∫ f(x, y) dx when ∫ f(x, y) dx > 0

(i.e., h can be anything where ∫ f(x, y) dx = 0), and observe this is enough for the proof.
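A numerical check of Example 1.4 (our addition): take the joint density f(x, y) = x + y on [0,1]² and g(x) = x, so h(y) = (1/3 + y/2)/(1/2 + y), and verify both a value of h and the identity E h(Y) = E g(X) = 7/12 (an instance of (1.1f) below) by midpoint quadrature.

```python
# f(x, y) = x + y on [0,1]² integrates to 1; take g(x) = x.
N = 500
dx = 1.0 / N
grid = [(i + 0.5) * dx for i in range(N)]   # midpoint rule nodes

def h(y):
    """h(y) = ∫ x f(x,y) dx / ∫ f(x,y) dx, computed numerically."""
    num = sum(x * (x + y) for x in grid) * dx
    den = sum((x + y) for x in grid) * dx
    return num / den

fY = lambda y: sum((x + y) for x in grid) * dx   # marginal density of Y
EhY = sum(h(y) * fY(y) for y in grid) * dx
assert abs(EhY - 7 / 12) < 1e-4                  # E h(Y) = E g(X) = 7/12
assert abs(h(0.5) - (1 / 3 + 0.25) / 1.0) < 1e-5 # closed form at y = 1/2
```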
Example 1.5. Suppose X and Y are independent. Let ϕ be a function with E|ϕ(X, Y)| < ∞ and let g(x) = E(ϕ(x, Y)). We will now show that

E(ϕ(X, Y)|X) = g(X)

Proof It is clear that g(X) ∈ σ(X). To check (ii), note that if A ∈ σ(X) then A = {X ∈ C}, so using (3.9) and (4.6) in Chapter 1, then the definition of g, and (3.9) in Chapter 1 again,

∫_A ϕ(X, Y) dP = E{ϕ(X, Y)1_C(X)} = ∫∫ ϕ(x, y)1_C(x) ν(dy) µ(dx) = ∫ 1_C(x)g(x) µ(dx) = ∫_A g(X) dP
A Example 1.6. Borel’s paradox. Let X be a randomly chosen point on the
earth, let θ be its longitude, and ϕ be its latitude. It is customary to take
θ ∈ [0, 2π ) and ϕ ∈ (−π/2, π/2] but we can equally well take θ ∈ [0, π ) and
ϕ ∈ (−π, π ]. In words, the new longitude speciﬁes the great circle on which the
point lies and then ϕ gives the angle.
At first glance it might seem that if X is uniform on the globe then θ and the angle ϕ on the great circle should both be uniform over their possible values. θ is uniform but ϕ is not. The paradox completely evaporates once we realize that in the new or in the traditional formulation ϕ is independent of θ, so the conditional distribution is the unconditional one, which is not uniform since there is more room near the equator than near the North Pole.

b. Properties
Conditional expectation has many of the same properties that ordinary expectation does.

(1.1a) Linearity. E(aX + Y|F) = aE(X|F) + E(Y|F)

Proof We need to check that the right-hand side is a version of the left. It clearly is F-measurable. To check (ii), we observe that if A ∈ F then by linearity of the integral and the defining properties of E(X|F) and E(Y|F),

∫_A {aE(X|F) + E(Y|F)} dP = a ∫_A E(X|F) dP + ∫_A E(Y|F) dP = a ∫_A X dP + ∫_A Y dP = ∫_A (aX + Y) dP
(1.1b) Monotonicity. If X ≤ Y then E(X|F) ≤ E(Y|F).

Proof For A ∈ F,

∫_A E(X|F) dP = ∫_A X dP ≤ ∫_A Y dP = ∫_A E(Y|F) dP

Letting A = {E(X|F) − E(Y|F) ≥ ε > 0}, we see that the indicated set has probability 0 for all ε > 0.

Exercise 1.3. Prove Chebyshev’s inequality: if a > 0 then

P(|X| ≥ a|F) ≤ a⁻²E(X²|F)
(1.1c) Monotone convergence theorem. If Xn ≥ 0 and Xn ↑ X with EX < ∞ then E(Xn|F) ↑ E(X|F).

Proof Let Yn = X − Xn. It suffices to show that E(Yn|F) ↓ 0. Since Yn ↓, (1.1b) implies Zn ≡ E(Yn|F) ↓ a limit Z∞. If A ∈ F then

∫_A Zn dP = ∫_A Yn dP

Letting n → ∞, noting Yn ↓ 0, and using the dominated convergence theorem gives that ∫_A Z∞ dP = 0 for all A ∈ F, so Z∞ ≡ 0.

Remark. By applying the last result to Y1 − Yn, we see that if Yn ↓ Y and we have E|Y1|, E|Y| < ∞, then E(Yn|F) ↓ E(Y|F).
Exercise 1.4. Suppose X ≥ 0 and EX = ∞. (There is nothing to prove when EX < ∞.) Show there is a unique F-measurable Y with 0 ≤ Y ≤ ∞ so that

∫_A X dP = ∫_A Y dP for all A ∈ F

Hint: Let XM = X ∧ M, YM = E(XM|F), and let M → ∞.
(1.1d) Jensen’s inequality. If ϕ is convex and E|X|, E|ϕ(X)| < ∞ then

ϕ(E(X|F)) ≤ E(ϕ(X)|F)

Proof If ϕ is linear, the result is trivial, so we will suppose ϕ is not linear. We do this so that if we let S = {(a, b) : a, b ∈ Q, ax + b ≤ ϕ(x) for all x}, then ϕ(x) = sup{ax + b : (a, b) ∈ S}. See the proof of (3.2) in Chapter 1 for more details. If ϕ(x) ≥ ax + b then (1.1b) and (1.1a) imply

E(ϕ(X)|F) ≥ aE(X|F) + b a.s.

Taking the sup over (a, b) ∈ S gives

E(ϕ(X)|F) ≥ ϕ(E(X|F)) a.s.

Remark. Here we have written a.s. by the inequalities to stress that there is an exceptional set for each a, b, so we have to take the sup over a countable set.
Exercise 1.5. Imitate the proof in the remark after (5.2) in the Appendix to prove the conditional Cauchy-Schwarz inequality

E(XY|G)² ≤ E(X²|G)E(Y²|G)
(1.1e) Conditional expectation is a contraction in L^p, p ≥ 1.

Proof (1.1d) implies |E(X|F)|^p ≤ E(|X|^p|F). Taking expected values gives

E(|E(X|F)|^p) ≤ E(E(|X|^p|F)) = E|X|^p

In the last equality, we have used an identity that is an immediate consequence of the definition:

(1.1f) E(E(Y|F)) = E(Y)

Proof Use property (ii) with A = Ω.

Conditional expectation also has properties, like (1.1f), that have no analogue for “ordinary” expectation.
(1.2) Theorem. If F1 ⊂ F2 then (i) E(E(X|F1)|F2) = E(X|F1) and (ii) E(E(X|F2)|F1) = E(X|F1).

In words, the smaller σ-field always wins. As the proof will show, the first equality is trivial. The second is easy to prove, but in combination with (1.3) is a powerful tool for computing conditional expectations. I have seen it used several times to prove results that are false.

Proof Once we notice that E(X|F1) ∈ F2, (i) follows from Example 1.1. To prove (ii), notice that E(X|F1) ∈ F1, and if A ∈ F1 ⊂ F2 then

∫_A E(X|F1) dP = ∫_A X dP = ∫_A E(X|F2) dP
Exercise 1.6. Give an example on Ω = {a, b, c} in which

E(E(X|F1)|F2) ≠ E(E(X|F2)|F1)
The next result shows that for conditional expectation with respect to F, random variables X ∈ F are like constants: they can be brought outside the “integral.”

(1.3) Theorem. If X ∈ F and E|Y|, E|XY| < ∞ then E(XY|F) = XE(Y|F).

Proof The right-hand side ∈ F, so we have to check (ii). To do this, we use the usual four-step procedure. First, suppose X = 1_B with B ∈ F. In this case, if A ∈ F

∫_A 1_B E(Y|F) dP = ∫_{A∩B} E(Y|F) dP = ∫_{A∩B} Y dP = ∫_A 1_B Y dP

so (ii) holds. The last result extends to simple X by linearity. If X, Y ≥ 0, let Xn be simple random variables that ↑ X, and use the monotone convergence theorem to conclude that

∫_A XE(Y|F) dP = ∫_A XY dP

To prove the result in general, split X and Y into their positive and negative parts.
Exercise 1.7. Show that when E|X|, E|Y|, and E|XY| are finite, each statement implies the next one, and give examples with X, Y ∈ {−1, 0, 1} a.s. that show the reverse implications are false: (i) X and Y are independent, (ii) E(Y|X) = EY, (iii) E(XY) = EX EY.
(1.4) Theorem. Suppose EX² < ∞. E(X|F) is the variable Y ∈ F that minimizes the “mean square error” E(X − Y)².

Remark. This result gives a “geometric interpretation” of E(X|F). L²(Fo) = {Y ∈ Fo : EY² < ∞} is a Hilbert space, and L²(F) is a closed subspace. In this case, E(X|F) is the projection of X onto L²(F), i.e., the point in the subspace closest to X.

Proof We begin by observing that if Z ∈ L²(F), then (1.3) implies

ZE(X|F) = E(ZX|F)

(E|XZ| < ∞ by the Cauchy-Schwarz inequality.) Taking expected values gives

E(ZE(X|F)) = E(E(ZX|F)) = E(ZX)

or, rearranging,

E[Z(X − E(X|F))] = 0 for Z ∈ L²(F)

If Y ∈ L²(F) and Z = E(X|F) − Y then

E(X − Y)² = E{X − E(X|F) + Z}² = E{X − E(X|F)}² + EZ²

since the cross-product term vanishes. From the last formula, it is easy to see E(X − Y)² is minimized when Z = 0.
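On a partition σ-field, (1.4) can be seen concretely: an F-measurable Y is constant on each block, the mean square error splits block by block, and the per-block minimizer is the block average, i.e., E(X|F). The snippet below (our sketch, reusing the six-point space of Example 1.3 and comparing against a grid of integer-valued candidates) illustrates this.

```python
p = [1 / 6] * 6
X = [0, 1, 4, 9, 16, 25]          # X(ω) = ω² on Ω = {0,…,5}
blocks = [[0, 1, 2], [3, 4, 5]]   # F = σ(Ω1, Ω2)

def mse(cs):
    """E(X − Y)² when Y takes the constant value cs[i] on block i."""
    return sum(p[w] * (X[w] - cs[i]) ** 2
               for i, B in enumerate(blocks) for w in B)

cond = [sum(X[w] for w in B) / len(B) for B in blocks]  # E(X|F): block averages
best_integer = min(mse([a, b])
                   for a in range(-10, 30) for b in range(-10, 30))
assert mse(cond) <= best_integer + 1e-12  # E(X|F) beats every such candidate
```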
Exercise 1.8. Show that if G ⊂ F and EX² < ∞ then

E({X − E(X|F)}²) + E({E(X|F) − E(X|G)}²) = E({X − E(X|G)}²)

Dropping the second term on the left, we get an inequality that says, geometrically, the larger the subspace the closer the projection is, or, statistically, more information means a smaller mean square error. An important special case occurs when G = {∅, Ω}.
Exercise 1.9. Let var(X|F) = E(X²|F) − E(X|F)². Show that

var(X) = E(var(X|F)) + var(E(X|F))

Exercise 1.10. Let Y1, Y2, . . . be i.i.d. with mean µ and variance σ², N an independent positive integer valued r.v. with EN² < ∞ and X = Y1 + · · · + YN. Show that var(X) = σ²EN + µ²var(N). To understand and help remember the formula, think about the two special cases in which N or Y is constant.

Exercise 1.11. Show that if X and Y are random variables with E(Y|G) = X and EY² = EX² < ∞, then X = Y a.s.

Exercise 1.12. The result in the last exercise implies that if EY² < ∞ and E(Y|G) has the same distribution as Y, then E(Y|G) = Y a.s. Prove this under the assumption E|Y| < ∞. Hint: The trick is to prove that sgn(X) = sgn(E(X|G)) a.s., and then take X = Y − c to get the desired result.

*c. Regular Conditional Probabilities
Let (Ω, F, P) be a probability space, X : (Ω, F) → (S, S) a measurable map, and G a σ-field ⊂ F. µ : Ω × S → [0, 1] is said to be a regular conditional distribution for X given G if

(i) For each A, ω → µ(ω, A) is a version of P(X ∈ A|G).
(ii) For a.e. ω, A → µ(ω, A) is a probability measure on (S, S).

When S = Ω and X is the identity map, µ is called a regular conditional probability.
Exercise 1.13. Continuation of Example 1.4. Suppose X and Y have a joint density f(x, y) > 0. Let

µ(y, A) = ∫_A f(x, y) dx / ∫ f(x, y) dx

Show that µ(Y(ω), A) is a r.c.d. for X given σ(Y).
Regular conditional distributions are useful because they allow us to simultaneously compute the conditional expectation of all functions of X and to
generalize properties of ordinary expectation in a more straightforward way.
Exercise 1.14. Let µ(ω, A) be a r.c.d. for X given F, and let f : (S, S) → (R, R) have E|f(X)| < ∞. Start with simple functions and show that

E(f(X)|F) = ∫ µ(ω, dx)f(x) a.s.
H¨lder inequality from the unconditional one, i.e., show that if p, q ∈ (1, ∞)
o
with 1/p + 1/q = 1 then
E (XY G ) ≤ E (X p G )1/p E (Y q G )1/q
Unfortunately, r.c.d.’s do not always exist. The first example was due to Dieudonné (1948). See Doob (1953), p. 624, or Faden (1985) for more recent developments. Without going into the details of the example, it is easy to see the source of the problem. If A1, A2, . . . are disjoint, then (1.1a) and (1.1c) imply

P(X ∈ ∪n An|G) = Σn P(X ∈ An|G) a.s.
but if S contains enough countable collections of disjoint sets, the exceptional sets may pile up. Fortunately,

(1.6) Theorem. r.c.d.’s exist if (S, S) is nice.

Proof By definition, there is a 1-1 map ϕ : S → R so that ϕ and ϕ⁻¹ are measurable. Using monotonicity (1.1b) and throwing away a countable collection of null sets, we find there is a set Ωo with P(Ωo) = 1 and a family of random variables G(q, ω), q ∈ Q so that q → G(q, ω) is nondecreasing and ω → G(q, ω) is a version of P(ϕ(X) ≤ q|G). Let F(x, ω) = inf{G(q, ω) : q > x}. The notation may remind the reader of the proof of (2.5) in Chapter 2. The argument given there shows F is a distribution function. Since G(qn, ω) ↓ F(x, ω), the remark after (1.1c) implies that F(x, ω) is a version of P(ϕ(X) ≤ x|G).

Now, for each ω ∈ Ωo, there is a unique measure ν(ω, ·) on (R, R) so that ν(ω, (−∞, x]) = F(x, ω). To check that for each B ∈ R, ν(ω, B) is a version of P(ϕ(X) ∈ B|G), we observe that the class of B for which this statement is true (this includes the measurability of ω → ν(ω, B)) is a λ-system that contains all sets of the form (a1, b1] ∪ · · · ∪ (ak, bk] where −∞ ≤ ai < bi ≤ ∞, so the desired result follows from the π-λ theorem. To extract the desired r.c.d., notice that if A ∈ S and B = ϕ(A), then B = (ϕ⁻¹)⁻¹(A) ∈ R, and set µ(ω, A) = ν(ω, B).
The following generalization of (1.6) will be needed in Section 5.1.

Exercise 1.16. Suppose X and Y take values in a nice space (S, S) and G = σ(Y). Imitate the proof of (1.6) to show that there is a function µ : S × S → [0, 1] so that

(i) for each A, µ(Y(ω), A) is a version of P(X ∈ A|G)
(ii) for a.e. ω, A → µ(Y(ω), A) is a probability measure on (S, S).

4.2. Martingales, Almost Sure Convergence
In this section we will define martingales and their cousins supermartingales and submartingales, and take the first steps in developing their theory. Let Fn be a filtration, i.e., an increasing sequence of σ-fields. A sequence Xn is said to be adapted to Fn if Xn ∈ Fn for all n. If Xn is a sequence with

(i) E|Xn| < ∞,
(ii) Xn adapted to Fn,
(iii) E(Xn+1|Fn) = Xn for all n,
then X is said to be a martingale (with respect to Fn ). If in the last deﬁnition, = is replaced by ≤ or ≥, then X is said to be a supermartingale or
submartingale, respectively.
Example 2.1. Consider the successive tosses of a fair coin and let ξn = 1 if the nth toss is heads and ξn = −1 if the nth toss is tails. Let Xn = ξ1 + · · · + ξn and Fn = σ(ξ1, . . . , ξn) for n ≥ 1, X0 = 0 and F0 = {∅, Ω}. I claim that Xn, n ≥ 0, is a martingale with respect to Fn. To prove this, we observe that Xn ∈ Fn, E|Xn| < ∞, and ξn+1 is independent of Fn, so using the linearity of conditional expectation, (1.1a), and Example 1.2,

E(Xn+1|Fn) = E(Xn|Fn) + E(ξn+1|Fn) = Xn + Eξn+1 = Xn

Note that, in this example, Fn = σ(X1, . . . , Xn) and Fn is the smallest filtration that Xn is adapted to. In what follows, when the filtration is not mentioned, we will take Fn = σ(X1, . . . , Xn).
Exercise 2.1. Suppose Xn , n ≥ 1, is a martingale w.r.t. Gn and let Fn =
σ (X1 , . . . , Xn ). Then Gn ⊃ Fn and Xn is a martingale w.r.t. Fn .
If the coin tosses considered above have P(ξn = 1) ≤ 1/2 then the computation just completed shows E(Xn+1|Fn) ≤ Xn, i.e., Xn is a supermartingale. In this case, Xn corresponds to betting on an unfavorable game, so there is nothing “super” about a supermartingale. The name comes from the fact that if f is superharmonic (i.e., f has continuous derivatives of order ≤ 2 and ∂²f/∂x1² + · · · + ∂²f/∂xd² ≤ 0), then

f(x) ≥ (1/|B(0, r)|) ∫_{B(x,r)} f(y) dy

where B(x, r) = {y : |x − y| ≤ r} is the ball of radius r, and |B(0, r)| is the volume of the ball of radius r.
Exercise 2.2. Suppose f is superharmonic on Rd . Let ξ1 , ξ2 , . . . be i.i.d. uniform on B (0, 1), and deﬁne Sn by Sn = Sn−1 + ξn for n ≥ 1 and S0 = x. Show
that Xn = f (Sn ) is a supermartingale.
Our first result is an immediate consequence of the definition of a supermartingale. We could take the conclusion of (2.1) as the definition of supermartingale, but then the definition would be harder to check.

(2.1) Theorem. If Xn is a supermartingale then for n > m, E(Xn|Fm) ≤ Xm.
Proof The definition gives the result for n = m + 1. Suppose n = m + k with k ≥ 2. By (1.2),

E(Xm+k|Fm) = E(E(Xm+k|Fm+k−1)|Fm) ≤ E(Xm+k−1|Fm)

by the definition and (1.1b). The desired result now follows by induction.

(2.2) Corollary. (i) If Xn is a submartingale then for n > m, E(Xn|Fm) ≥ Xm. (ii) If Xn is a martingale then for n > m, E(Xn|Fm) = Xm.

Proof To prove (i), note that −Xn is a supermartingale and use (1.1a). For (ii), observe that Xn is a supermartingale and a submartingale.
Remark. The idea in the proof of (2.2) can be used many times below. To keep
from repeating ourselves, we will just state the result for either supermartingales
or submartingales and leave it to the reader to translate the result for the other
two.
(2.3) Theorem. If Xn is a martingale w.r.t. Fn and ϕ is a convex function with E|ϕ(Xn)| < ∞ for all n then ϕ(Xn) is a submartingale w.r.t. Fn.

Proof By Jensen’s inequality and the definition

E(ϕ(Xn+1)|Fn) ≥ ϕ(E(Xn+1|Fn)) = ϕ(Xn)

(2.4) Corollary. Let p ≥ 1. If Xn is a martingale w.r.t. Fn and E|Xn|^p < ∞ for all n, then |Xn|^p is a submartingale w.r.t. Fn.

(2.5) Theorem. If Xn is a submartingale w.r.t. Fn and ϕ is an increasing convex function with E|ϕ(Xn)| < ∞ for all n, then ϕ(Xn) is a submartingale w.r.t. Fn.

Proof By Jensen’s inequality and the assumptions

E(ϕ(Xn+1)|Fn) ≥ ϕ(E(Xn+1|Fn)) ≥ ϕ(Xn)

Exercise 2.3. Give an example of a submartingale Xn so that Xn² is a supermartingale. Hint: Xn does not have to be random.

(2.6) Corollary. (i) If Xn is a submartingale then (Xn − a)⁺ is a submartingale. (ii) If Xn is a supermartingale then Xn ∧ a is a supermartingale.
Let Fn , n ≥ 0 be a ﬁltration. Hn , n ≥ 1 is said to be a predictable
sequence if Hn ∈ Fn−1 for all n ≥ 1. In words, the value of Hn may be
predicted (with certainty) from the information available at time n − 1. In this
section, we will be thinking of Hn as the amount of money a gambler will bet
at time n. This can be based on the outcomes at times 1, . . . , n − 1 but not on
the outcome at time n!
Once we start thinking of Hn as a gambling system, it is natural to ask
how much money we would make if we used it. For concreteness, let us suppose
that the game consists of ﬂipping a coin and that for each dollar you bet you
win one dollar when the coin comes up heads and lose your dollar when the coin
comes up tails. Let Xn be the net amount of money you would have won at
time n if you had bet one dollar each time. If you bet according to a gambling
system H then your winnings at time n would be

(H · X)n = Σ_{m=1}^{n} Hm(Xm − Xm−1)

since Xm − Xm−1 = +1 or −1 when the mth toss results in a win or loss, respectively.
Let ξm = Xm − Xm−1. A famous gambling system called the “martingale” is defined by H1 = 1 and, for n ≥ 2, Hn = 2Hn−1 if ξn−1 = −1 and Hn = 1 if ξn−1 = 1. In words, we double our bet when we lose, so that if we lose k times and then win, our net winnings will be −1 − 2 − · · · − 2^{k−1} + 2^k = 1. This system seems to provide us with a “sure thing” as long as P(ξm = 1) > 0. However, the next result says there is no system for beating an unfavorable game.
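Before proving this, one can check by exhaustive enumeration (our addition, not the book’s) that over a bounded horizon the doubling system earns nothing on average in a fair game, even though it wins with high probability. This is E(H · X)n = 0 in miniature; the horizon of 10 tosses is an arbitrary choice.

```python
from itertools import product

def winnings(seq):
    """Play the doubling ('martingale') system against a ±1 sequence."""
    bet, w = 1.0, 0.0
    for xi in seq:
        w += bet * xi
        bet = 1.0 if xi == 1 else 2.0 * bet  # reset after a win, double after a loss
    return w

n = 10
outcomes = [winnings(seq) for seq in product((1, -1), repeat=n)]
exp = sum(outcomes) / 2 ** n
assert abs(exp) < 1e-9                       # expected winnings are exactly 0
assert sum(w > 0 for w in outcomes) / 2 ** n > 0.6  # yet you usually win
```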
(2.7) Theorem. Let Xn, n ≥ 0, be a supermartingale. If Hn ≥ 0 is predictable and each Hn is bounded then (H · X)n is a supermartingale.

Proof Using the fact that conditional expectation is linear, (H · X)n ∈ Fn, Hn ∈ Fn−1, and (1.3), we have

E((H · X)n+1|Fn) = (H · X)n + E(Hn+1(Xn+1 − Xn)|Fn) = (H · X)n + Hn+1E((Xn+1 − Xn)|Fn) ≤ (H · X)n

since E((Xn+1 − Xn)|Fn) ≤ 0 and Hn+1 ≥ 0.

Remark. The same result is obviously true for submartingales and for martingales (in the last case, without the restriction Hn ≥ 0).
The notion of a stopping time, introduced in Section 3.1, is closely related
to the concept of a gambling system. Recall that a random variable N is said to
be a stopping time if {N = n} ∈ Fn for all n < ∞. If you think of N as the
time a gambler stops gambling, then the condition above says that the decision
to stop at time n must be measurable with respect to the information he has
at that time. If we let Hn = 1{N ≥n} , then {N ≥ n} = {N ≤ n − 1}c ∈ Fn−1 ,
so Hn is predictable, and it follows from (2.7) that (H · X )n = XN ∧n − X0 is
a supermartingale. Since the constant sequence Yn = X0 is a supermartingale
and the sum of two supermartingales is also, we have:
(2.8) Corollary. If N is a stopping time and Xn is a supermartingale, then
XN ∧n is a supermartingale.
Although you cannot make money with gambling systems, you can prove
theorems with them. Suppose Xn , n ≥ 0, is a submartingale. Let a < b, let
N0 = −1, and for k ≥ 1 let
N2k−1 = inf {m > N2k−2 : Xm ≤ a}
N2k = inf {m > N2k−1 : Xm ≥ b}
The Nj are stopping times and {N2k−1 < m ≤ N2k } = {N2k−1 ≤ m − 1} ∩
{N2k ≤ m − 1}c ∈ Fm−1 , so
Hm = 1 if N2k−1 < m ≤ N2k for some k, and Hm = 0 otherwise

defines a predictable sequence. X(N2k−1) ≤ a and X(N2k) ≥ b, so between
times N2k−1 and N2k , Xm crosses from below a to above b. Hm is a gambling
system that tries to take advantage of these “upcrossings.” In stock market
terms, we buy when Xm ≤ a and sell when Xm ≥ b, so every time an upcrossing
is completed, we make a proﬁt of ≥ (b − a). Finally, Un = sup{k : N2k ≤ n} is
the number of upcrossings completed by time n.
(2.9) The upcrossing inequality. If Xm , m ≥ 0, is a submartingale then
(b − a)EUn ≤ E (Xn − a)+ − E (X0 − a)+
Proof Let Ym = a + (Xm − a)+ . By (2.6), Ym is a submartingale. Clearly, it
upcrosses [a, b] the same number of times that Xm does, and we have (b−a)Un ≤
(H · Y )n , since each upcrossing results in a proﬁt ≥ (b − a) and a ﬁnal incomplete
upcrossing (if there is one) makes a nonnegative contribution to the righthand
side. Let Km = 1 − Hm . Clearly, Yn − Y0 = (H · Y )n + (K · Y )n , and it follows
from (2.7) that E (K · Y )n ≥ E (K · Y )0 = 0 so E (H · Y )n ≤ E (Yn − Y0 ), proving
(2.9).
We have proved the result in its classical form, even though this is a little
misleading. The key fact is that E (K · X )n ≥ 0, i.e., no matter how hard
you try you can’t lose money betting on a submartingale. From the upcrossing
inequality, we easily get
(2.10) The martingale convergence theorem. If Xn is a submartingale with sup EXn⁺ < ∞ then as n → ∞, Xn converges a.s. to a limit X with E|X| < ∞.

Proof Since (X − a)⁺ ≤ X⁺ + |a|, (2.9) implies that

EUn ≤ (|a| + EXn⁺)/(b − a)

As n ↑ ∞, Un ↑ U, the number of upcrossings of [a, b] by the whole sequence, so if sup EXn⁺ < ∞ then EU < ∞ and hence U < ∞ a.s. Since the last conclusion holds for all rational a and b,
∪_{a,b∈Q} {lim inf Xn < a < b < lim sup Xn}

has probability 0, and hence lim sup Xn = lim inf Xn a.s., i.e., lim Xn exists a.s. Fatou’s lemma guarantees EX⁺ ≤ lim inf EXn⁺ < ∞, so X < ∞ a.s. To see X > −∞, we observe that

EXn⁻ = EXn⁺ − EXn ≤ EXn⁺ − EX0

(since Xn is a submartingale), so another application of Fatou’s lemma shows

EX⁻ ≤ lim inf_{n→∞} EXn⁻ ≤ sup_n EXn⁺ − EX0 < ∞

and completes the proof.
Remark. To prepare for the proof of (6.1), the reader should note that we
have shown that if the number of upcrossings of (a, b) by Xn is ﬁnite for all
a, b ∈ Q, then the limit of Xn exists.
An important special case of (2.10) is

(2.11) Corollary. If Xn ≥ 0 is a supermartingale then as n → ∞, Xn → X
a.s. and EX ≤ EX0.

Proof  Yn = −Xn ≤ 0 is a submartingale with EYn^+ = 0. Since EX0 ≥ EXn,
the inequality follows from Fatou's lemma.

In the next section, we will give several applications of the last two results.
We close this one by giving two “counterexamples.”
Example 2.2. The ﬁrst shows that the assumptions of (2.11) (and hence those
of (2.10)) do not guarantee convergence in L1 . Let Sn be a symmetric simple
random walk with S0 = 1, i.e., Sn = Sn−1 + ξn where ξ1 , ξ2 , . . . are i.i.d. with
P (ξi = 1) = P (ξi = −1) = 1/2. Let N = inf {n : Sn = 0} and let Xn = SN ∧n .
(2.8) implies that Xn is a nonnegative martingale. (2.11) implies Xn converges
to a limit X∞ < ∞ that must be ≡ 0, since convergence to k > 0 is impossible.
(If Xn = k > 0 then Xn+1 = k ± 1.) Since EXn = EX0 = 1 for all n and
X∞ = 0, convergence cannot occur in L1 .
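A quick simulation illustrates the point of Example 2.2: the mean of Xn = S_{N∧n} stays at 1 for every n, while almost every path is absorbed at 0, so the a.s. limit is 0 and convergence cannot hold in L1. A sketch (the parameters and function name are ours):

```python
import random

def stopped_walk(n_steps, rng):
    """One path of the symmetric simple random walk with S0 = 1 stopped at
    N = inf{n : Sn = 0}; returns X_n = S_{N ∧ n_steps}."""
    s = 1
    for _ in range(n_steps):
        if s == 0:        # absorbed: X_n = S_{N∧n} stays at 0 forever
            break
        s += rng.choice((-1, 1))
    return s

rng = random.Random(0)
n, reps = 500, 5000
samples = [stopped_walk(n, rng) for _ in range(reps)]
print(sum(samples) / reps)                   # close to 1 = EX_n for every n
print(sum(x == 0 for x in samples) / reps)   # most paths have already hit 0
```

The sample mean stays near 1 even though the vast majority of paths are already frozen at 0: the expectation is carried by a few rare paths that have wandered far above 1, which is exactly why the convergence is a.s. but not in L1.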
Example 2.2 is an important counterexample to keep in mind as you read
the rest of this chapter. The next two are not as important.
Example 2.3. We will now give an example of a martingale with Xk → 0
in probability but not a.s. Let X0 = 0. When Xk−1 = 0, let Xk = 1 or −1,
each with probability 1/2k, and Xk = 0 with probability 1 − 1/k. When
Xk−1 ≠ 0, let Xk = kXk−1 with probability 1/k and = 0 with probability
1 − 1/k. From the construction, P(Xk = 0) = 1 − 1/k, so Xk → 0 in
probability. On the other hand, the second Borel–Cantelli lemma implies
P(Xk = 0 for all k ≥ K) = 0, and values in (−1, 1) − {0} are impossible, so
Xk does not converge to 0 a.s.
Exercise 2.4. Give an example of a martingale Xn with Xn → −∞ a.s.
Hint: Let Xn = ξ1 + · · · + ξn , where the ξi are independent (but not identically
distributed) with Eξi = 0.
Our ﬁnal result is useful in reducing questions about submartingales to
questions about martingales.
(2.12) Doob’s decomposition. Any submartingale Xn , n ≥ 0, can be written
in a unique way as Xn = Mn + An , where Mn is a martingale and An is a
predictable increasing sequence with A0 = 0.
Proof  We want Xn = Mn + An, E(Mn | Fn−1) = Mn−1, and An ∈ Fn−1. So
we must have

    E(Xn | Fn−1) = E(Mn | Fn−1) + E(An | Fn−1)
                 = Mn−1 + An = Xn−1 − An−1 + An

and it follows that

    (a) An − An−1 = E(Xn | Fn−1) − Xn−1
    (b) Mn = Xn − An

Now A0 = 0 and M0 = X0 by assumption, so we have An and Mn defined for
all time, and we have proved uniqueness. To check that our recipe works, we
observe that An − An−1 ≥ 0 since Xn is a submartingale, and induction shows
An ∈ Fn−1. To see that Mn is a martingale, we use (b), An ∈ Fn−1, and (a):

    E(Mn | Fn−1) = E(Xn − An | Fn−1)
                 = E(Xn | Fn−1) − An = Xn−1 − An−1 = Mn−1

Exercise 2.5. Let Xn = Σ_{m≤n} 1_{Bm} and suppose Bn ∈ Fn. What is the
Doob decomposition for Xn?

Exercises
2.6. Let ξ1, ξ2, . . . be independent with Eξi = 0 and var(ξm) = σm^2 < ∞, and
let Sn = ξ1 + · · · + ξn and sn^2 = Σ_{m=1}^n σm^2. Then Sn^2 − sn^2 is a
martingale.

2.7. If ξ1, ξ2, . . . are independent and have Eξi = 0 then

    Xn^{(k)} = Σ_{1≤i1<···<ik≤n} ξi1 · · · ξik

is a martingale. When k = 2 and Sn = ξ1 + · · · + ξn, 2Xn^{(2)} = Sn^2 − Σ_{m≤n} ξm^2.

2.8. Generalize (2.6) by showing that if Xn and Yn are submartingales w.r.t. Fn
then Xn ∨ Yn is also.
2.9. Let Y1, Y2, . . . be nonnegative i.i.d. random variables with EYm = 1 and
P(Ym = 1) < 1. (i) Show that Xn = Π_{m≤n} Ym defines a martingale. (ii) Use
(2.11) and an argument by contradiction to show Xn → 0 a.s. (iii) Use the
strong law of large numbers to conclude (1/n) log Xn → c < 0.
2.10. Suppose yn > −1 for all n and Σ |yn| < ∞. Show that Π_{m=1}^∞ (1 + ym)
exists.
m=1 (1 + ym ) 2.11. Let Xn and Yn be positive integrable and adapted to Fn . Suppose
E (Xn+1 Fn ) ≤ (1 + Yn )Xn
with
Yn < ∞ a.s. Prove that Xn converges a.s. to a ﬁnite limit by ﬁnding a
closely related supermartingale to which (2.11) can be applied.
2.12. Use the random walks in Exercise 2.2 to conclude that in d ≤ 2, nonnegative
superharmonic functions must be constant. The example f(x) = |x|^{2−d}
shows this is false in d > 2.
2.13. The switching principle. Suppose Xn^1 and Xn^2 are supermartingales
with respect to Fn, and N is a stopping time so that X_N^1 ≥ X_N^2. Then

    Yn = Xn^1 1_{(N>n)} + Xn^2 1_{(N≤n)}  is a supermartingale.

Since N + 1 is a stopping time, this implies that

    Zn = Xn^1 1_{(N≥n)} + Xn^2 1_{(N<n)}  is a supermartingale.

2.14. Dubins' inequality. For every positive supermartingale Xn, n ≥ 0, the
number of upcrossings U of [a, b] satisfies

    P(U ≥ k) ≤ (a/b)^k E min(X0/a, 1)

To prove this, we let N0 = −1 and for j ≥ 1 let
N2j −1 = inf {m > N2j −2 : Xm ≤ a}
N2j = inf {m > N2j −1 : Xm ≥ b}
Let Yn = 1 for 0 ≤ n < N1 and for j ≥ 1

    Yn = (b/a)^{j−1} (Xn/a)  for N2j−1 ≤ n < N2j
    Yn = (b/a)^j             for N2j ≤ n < N2j+1

(i) Use the switching principle in the previous exercise and induction to show
that Zn^j = Y_{n∧Nj} is a supermartingale. (ii) Use EY_{n∧N2k} ≤ EY0 and let
n → ∞ to get Dubins' inequality.

4.3. Examples
In this section, we will apply the martingale convergence theorem to generalize
the second Borel–Cantelli lemma and to study Polya's urn scheme, Radon–Nikodym
derivatives, and branching processes. The four topics are independent
of each other and are taken up in the order indicated.

a. Bounded Increments
Our ﬁrst result shows that martingales with bounded increments either converge
or oscillate between +∞ and −∞.
(3.1) Theorem. Let X1, X2, . . . be a martingale with |Xn+1 − Xn| ≤ M < ∞.
Let
C = {lim Xn exists and is ﬁnite}
D = {lim sup Xn = +∞ and lim inf Xn = −∞}
Then P(C ∪ D) = 1.
Proof Since Xn −X0 is a martingale, we can without loss of generality suppose
that X0 = 0. Let 0 < K < ∞ and let N = inf {n : Xn ≤ −K }. Xn∧N is a
martingale with Xn∧N ≥ −K − M a.s. so applying (2.11) to Xn∧N + K + M
shows lim Xn exists on {N = ∞}. Letting K → ∞, we see that the limit
exists on {lim inf Xn > −∞}. Applying the last conclusion to −Xn , we see
that lim Xn exists on {lim sup Xn < ∞} and the proof is complete.
Exercise 3.1. Let Xn, n ≥ 0, be a submartingale with sup Xn < ∞. Let
ξn = Xn − Xn−1 and suppose E(sup ξn^+) < ∞. Show that Xn converges a.s.

Exercise 3.2. Give an example of a martingale Xn with supn |Xn| < ∞ and
P(Xn = a i.o.) = 1 for a = −1, 0, 1. This example shows that it is not enough
to have sup |Xn+1 − Xn| < ∞ in (3.1).
Exercise 3.3. (Assumes familiarity with finite state Markov chains.) Fine-tune
the example from the previous problem so that P(Xn = 0) → 1 − 2p and
P (Xn = −1), P (Xn = 1) → p, where p is your favorite number in (0, 1), i.e.,
you are asked to do this for one value of p that you may choose. This example
shows that a martingale can converge in distribution without converging a.s. (or
in probability).
Exercise 3.4. Let Xn and Yn be positive integrable and adapted to Fn.
Suppose E(Xn+1 | Fn) ≤ Xn + Yn, with Σ Yn < ∞ a.s. Prove that Xn converges
a.s. to a finite limit. Hint: Let N = inf{k : Σ_{m=1}^k Ym > M}, and stop your
supermartingale at time N.
(3.2) Corollary. Second Borel–Cantelli lemma, II. Let Fn, n ≥ 0 be a
filtration with F0 = {∅, Ω} and An, n ≥ 1 a sequence of events with An ∈ Fn.
Then

    {An i.o.} = { Σ_{n=1}^∞ P(An | Fn−1) = ∞ }

Proof  If we let X0 = 0 and

    Xn = Σ_{m=1}^n ( 1_{Am} − P(Am | Fm−1) )  for n ≥ 1

then Xn is a martingale with |Xn − Xn−1| ≤ 1. Using the notation of (3.1) we
have:

    on C,  Σ_{n=1}^∞ 1_{An} = ∞  if and only if  Σ_{n=1}^∞ P(An | Fn−1) = ∞

    on D,  Σ_{n=1}^∞ 1_{An} = ∞  and  Σ_{n=1}^∞ P(An | Fn−1) = ∞

Since P(C ∪ D) = 1, the result follows.
Exercise 3.5. Let pm ∈ [0, 1). Use the Borel–Cantelli lemmas to show that
Π_{m=1}^∞ (1 − pm) = 0 if and only if Σ_{m=1}^∞ pm = ∞.

Exercise 3.6. Show that Σ_{n=2}^∞ P(An | ∩_{m=1}^{n−1} Am^c) = ∞ implies
P(∩_{m=1}^∞ Am^c) = 0.

b. Polya's Urn Scheme
m=1 m b. Polya’s Urn Scheme
An urn contains r red and g green balls. At each time we draw a ball out, then
replace it, and add c more balls of the color drawn. Let Xn be the fraction of
green balls after the nth draw. To check that Xn is a martingale, note that if
there are i red balls and j green balls at time n, then
    Xn+1 = (j + c)/(i + j + c)  with probability j/(i + j)
    Xn+1 = j/(i + j + c)        with probability i/(i + j)

and we have

    ((j + c)/(i + j + c)) · (j/(i + j)) + (j/(i + j + c)) · (i/(i + j))
        = (j + c + i) j / ((i + j + c)(i + j)) = j/(i + j)
Since Xn ≥ 0, (2.11) implies that Xn → X∞ a.s. To compute the distribution
of the limit, we observe (a) the probability of getting green on the first
m draws then red on the next ℓ = n − m draws is

    g/(g + r) · (g + c)/(g + r + c) · · · (g + (m − 1)c)/(g + r + (m − 1)c)
        · r/(g + r + mc) · · · (r + (ℓ − 1)c)/(g + r + (n − 1)c)

and (b) any other outcome of the first n draws with m green balls drawn and
ℓ red balls drawn has the same probability, since the denominator remains the
same and the numerator is permuted. Consider the special case c = 1, g = 1,
r = 1. Let Gn be the number of green balls after the nth draw has been
completed and the new ball has been added. It follows from (a) and (b) that

    P(Gn = m + 1) = C(n, m) · m!(n − m)!/(n + 1)! = 1/(n + 1)
so X∞ has a uniform distribution on (0,1).
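A Monte Carlo check of the uniform limit (a sketch, not from the text; the parameters and function name are ours): simulate the c = 1, g = 1, r = 1 urn many times and look at the quartiles of the fraction of green balls.

```python
import random

def polya_fraction(n, g=1, r=1, c=1, rng=random):
    """Run n draws of Polya's urn and return the final fraction of green balls."""
    for _ in range(n):
        if rng.random() < g / (g + r):  # draw green with probability g/(g+r)
            g += c                      # replace it and add c green balls
        else:
            r += c
    return g / (g + r)

rng = random.Random(1)
xs = sorted(polya_fraction(500, rng=rng) for _ in range(4000))
# With g = r = c = 1 the limit X∞ is uniform on (0,1), so the empirical
# quartiles should sit near 0.25, 0.5, 0.75.
print([round(xs[k * len(xs) // 4], 2) for k in (1, 2, 3)])
```

Changing the defaults to g = 2, r = 1 skews the quartiles toward 1, matching the beta(2, 1) density 2x mentioned below.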
If we suppose that c = 1, g = 2, and r = 1, then

    P(Gn = m + 2) = C(n, m) · 2(m + 1)!(n − m)!/(n + 2)!
                  = 2(m + 1)/((n + 1)(n + 2)) → 2x

if n → ∞ and m/n → x. In general, the distribution of X∞ has density

    Γ((g + r)/c) / (Γ(g/c)Γ(r/c)) · x^{g/c − 1} (1 − x)^{r/c − 1}

This is the beta distribution with parameters g/c and r/c. In Example 4.5
we will see that the limit behavior changes drastically if, in addition to the c
balls of the color chosen, we always add one ball of the opposite color.

c. Radon–Nikodym Derivatives
Let µ be a finite measure and ν a probability measure on (Ω, F). Let Fn ↑ F
be σ-fields (i.e., σ(∪Fn) = F). Let µn and νn be the restrictions of µ and ν to
Fn.

(3.3) Theorem. Suppose µn ≪ νn for all n. Let Xn = dµn/dνn and let
X = lim sup Xn. Then

    µ(A) = ∫_A X dν + µ(A ∩ {X = ∞})

Remark. µr(A) ≡ ∫_A X dν is a measure ≪ ν. Since (2.11) implies ν(X =
∞) = 0, µs(A) ≡ µ(A ∩ {X = ∞}) is singular w.r.t. ν. Thus µ = µr + µs
gives the Lebesgue decomposition of µ (see (8.5) in the Appendix), and X∞ =
dµr/dν, ν-a.s. Here and in the proof we need to keep track of the measure to
which the a.s. refers.
Proof  As the reader can probably anticipate:

(3.4) Lemma. Xn (defined on (Ω, F, ν)) is a martingale w.r.t. Fn.

Proof  We observe that, by definition, Xn ∈ Fn. Let A ∈ Fn. Since Xn ∈ Fn
and νn is the restriction of ν to Fn

    ∫_A Xn dν = ∫_A Xn dνn

Using the definition of Xn and Exercise 8.7 in the Appendix

    ∫_A Xn dνn = µn(A) = µ(A)

the last equality holding since A ∈ Fn and µn is the restriction of µ to Fn. If
A ∈ Fm−1 ⊂ Fm, using the last result for n = m and n = m − 1 gives

    ∫_A Xm dν = µ(A) = ∫_A Xm−1 dν

so E(Xm | Fm−1) = Xm−1.
Since Xn is a nonnegative martingale, (2.11) implies that Xn → X ν-a.s.
We want to check that the equality in the theorem holds. Dividing µ(A) by
µ(Ω), we can without loss of generality suppose µ is a probability measure. Let
ρ = (µ + ν)/2 and ρn = (µn + νn)/2 = the restriction of ρ to Fn. Let Yn = dµn/dρn,
Zn = dνn/dρn. Yn, Zn ≥ 0 and Yn + Zn = 2 (by Exercise 8.6 in the Appendix),
so Yn and Zn are bounded martingales with limits Y and Z. As the reader can
probably guess,

    (∗) Y = dµ/dρ    Z = dν/dρ

It suffices to prove the first equality. From the proof of (3.4), if A ∈ Fm ⊂ Fn

    µ(A) = ∫_A Yn dρ → ∫_A Y dρ

by the bounded convergence theorem. The last computation shows that

    µ(A) = ∫_A Y dρ  for all A ∈ G = ∪m Fm

G is a π-system, so the π − λ theorem implies the equality is valid for all
A ∈ F = σ(G) and (∗) is proved.
It follows from Exercises 8.8 and 8.9 in the Appendix that Xn = Yn/Zn. At
this point, the reader can probably leap to the conclusion that X = Y/Z. To get
there carefully, note Y + Z = 2 ρ-a.s., so ρ(Y = 0, Z = 0) = 0. Having ruled out
0/0, we have X = Y/Z ρ-a.s. (Recall X ≡ lim sup Xn.) Let W = (1/Z) · 1_{(Z>0)}.
Using (∗), then 1 = ZW + 1_{(Z=0)}, we have

    (a) µ(A) = ∫_A Y dρ = ∫_A Y W Z dρ + ∫_A 1_{(Z=0)} Y dρ

Now (∗) implies dν = Z dρ, and it follows from the definitions that

    Y W = X 1_{(Z>0)} = X  ν-a.s.

the second equality holding since ν({Z = 0}) = 0. Combining things, we have

    (b) ∫_A Y W Z dρ = ∫_A X dν

To handle the other term, we note that (∗) implies dµ = Y dρ, and it follows
from the definitions that {X = ∞} = {Z = 0} µ-a.s., so

    (c) ∫_A 1_{(Z=0)} Y dρ = ∫_A 1_{(X=∞)} dµ

Combining (a), (b), and (c) gives the desired result.
Example 3.1. Suppose Fn = σ(Ik,n : 0 ≤ k < Kn) where for each n,
Ik,n is a partition of Ω, and the (n + 1)th partition is a refinement of the
nth. In this case, the condition µn ≪ νn is: ν(Ik,n) = 0 implies µ(Ik,n) = 0,
and the martingale Xn = µ(Ik,n)/ν(Ik,n) on Ik,n is an approximation to the
Radon–Nikodym derivative. For a concrete example, consider Ω = [0, 1), Ik,n =
[k 2^{−n}, (k + 1) 2^{−n}) for 0 ≤ k < 2^n, and ν = Lebesgue measure.
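For this concrete example, the approximating martingale can be computed directly. A sketch (the choice of µ is our own illustration, not from the text) with dµ = 2x dx on [0, 1): the value of Xn on the dyadic interval containing x converges to the density 2x.

```python
def mu(lo, hi):
    """µ([lo, hi)) for the measure with density f(x) = 2x w.r.t. Lebesgue."""
    return hi * hi - lo * lo

def X_n(x, n):
    """Martingale approximation to dµ/dν on the dyadic interval I_{k,n}
    containing x: X_n = µ(I_{k,n}) / ν(I_{k,n}), with ν = Lebesgue measure."""
    k = int(x * 2**n)                  # index of [k 2^-n, (k+1) 2^-n) containing x
    lo, hi = k / 2**n, (k + 1) / 2**n
    return mu(lo, hi) / (hi - lo)

for n in (1, 4, 10):
    print(n, X_n(0.3, n))   # approaches the density 2 * 0.3 = 0.6
```

Here X_n(x, n) equals lo + hi, the average of 2x over the interval, so the convergence to 2x as the partition refines is visible after a few levels.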
Exercise 3.7. Check by direct computation that the Xn in Example 3.1 is a
martingale. Show that if we drop the condition µn ≪ νn and set Xn = 0 when
ν(Ik,n) = 0, then E(Xn+1 | Fn) ≤ Xn.

Exercise 3.8. Apply (3.3) to Example 3.1 to get a “probabilistic” proof of the
Radon–Nikodym theorem. To be precise, suppose F is countably generated
(i.e., there is a sequence of sets An so that F = σ(An : n ≥ 1)) and show that
if µ and ν are σ-finite measures and µ ≪ ν, then there is a function g so that
µ(A) = ∫_A g dν.
Remark. Before you object to this as circular reasoning (the Radon–Nikodym
theorem was used to define conditional expectation!), observe that the conditional
expectations that are needed for Example 3.1 have elementary definitions.
Kakutani dichotomy for infinite product measures. Let µ and ν be
measures on sequence space (R^N, R^N) that make the coordinates ξn(ω) = ωn
independent. Let Fn(x) = µ(ξn ≤ x), Gn(x) = ν(ξn ≤ x). Suppose Fn ≪ Gn
and let qn = dFn/dGn. Let Fn = σ(ξm : m ≤ n), let µn and νn be the
restrictions of µ and ν to Fn, and let Xn = dµn/dνn. (3.3) implies that Xn → X
ν-a.s. The convergence of the infinite product is a tail event, so the Kolmogorov
0–1 law implies µ(X < ∞) ∈ {0, 1} and it follows from (3.3) that either µ ≪ ν
or µ ⊥ ν. The next result gives a concrete criterion for which of the two
alternatives occurs.

(3.5) Theorem. µ ≪ ν or µ ⊥ ν, according as Π_{m=1}^∞ ∫ √qm dGm > 0 or = 0.

Proof  Jensen's inequality and Exercise 8.7 in the Appendix imply

    ( ∫ √qm dGm )^2 ≤ ∫ qm dGm = ∫ dFm = 1

so the infinite product of the integrals is well defined and ≤ 1. Let
    Xn = Π_{m≤n} qm(ωm)

as above, and recall that Xn → X ν-a.s. If the infinite product is 0 then

    ∫ Xn^{1/2} dν = Π_{m=1}^n ∫ √qm dGm → 0

Fatou's lemma implies

    ∫ X^{1/2} dν ≤ lim inf_{n→∞} ∫ Xn^{1/2} dν = 0

so X = 0 ν-a.s., and (3.3) implies µ ⊥ ν. To prove the other direction, let
Yn = Xn^{1/2}. Now ∫ qm dGm = 1, so if we use E to denote expected value with
respect to ν, then EYm^2 = EXm = 1, so

    E(Yn+k − Yn)^2 = E(Xn+k + Xn − 2 Xn^{1/2} Xn+k^{1/2})
                   = 2 ( 1 − Π_{m=n+1}^{n+k} ∫ √qm dGm )

Now |a − b| = |a^{1/2} − b^{1/2}| · (a^{1/2} + b^{1/2}), so using Cauchy–Schwarz
and the fact (a + b)^2 ≤ 2a^2 + 2b^2 gives

    E|Xn+k − Xn| = E( |Yn+k − Yn| (Yn+k + Yn) )
        ≤ ( E(Yn+k − Yn)^2 )^{1/2} ( E(Yn+k + Yn)^2 )^{1/2}
        ≤ ( 4 E(Yn+k − Yn)^2 )^{1/2}
From the last two equations, it follows that if the infinite product is > 0, then
Xn converges to X in L^1, so P(X = ∞) = 0 and the desired result follows from
(3.3).

To complete the proof without using that result, note that for any A ∈ Fm
and n ≥ m we have

    µn(A) = ∫_A Xn dν → ∫_A X dν

since |E(Xn 1A) − E(X 1A)| ≤ E|Xn − X|. Using the π − λ theorem now gives
the desired result.
For the next three exercises, suppose Fn, Gn are concentrated on {0, 1} and
have Fn(0) = 1 − αn, Gn(0) = 1 − βn.

Exercise 3.9. (i) Use (3.5) to find a necessary and sufficient condition for
µ ≪ ν. (ii) Suppose that 0 < ε ≤ αn, βn ≤ 1 − ε < 1. Show that in this case
the condition is simply Σ (αn − βn)^2 < ∞.

Exercise 3.10. Show that if Σ αn < ∞ and Σ βn = ∞ in the previous
exercise then µ ⊥ ν. This shows that the condition Σ (αn − βn)^2 < ∞ is not
sufficient for µ ≪ ν in general.

Exercise 3.11. Suppose 0 < αn, βn < 1. Show that Σ |αn − βn| < ∞ is
sufficient for µ ≪ ν in general.

d. Branching Processes
Let ξ_i^n, i, n ≥ 0, be i.i.d. nonnegative integer-valued random variables. Define
a sequence Zn, n ≥ 0 by Z0 = 1 and

    Zn+1 = ξ_1^{n+1} + · · · + ξ_{Zn}^{n+1}  if Zn > 0
    Zn+1 = 0                                 if Zn = 0

Zn is called a Galton–Watson process. The idea behind the definitions is
that Zn is the number of people in the nth generation, and each member of the
nth generation gives birth independently to an identically distributed number
of children. pk = P(ξ_i^n = k) is called the offspring distribution.

(3.6) Lemma. Let Fn = σ(ξ_i^m : i ≥ 1, 1 ≤ m ≤ n) and µ = Eξ_i^m ∈ (0, ∞).
Then Zn/µ^n is a martingale w.r.t. Fn.
Proof  Clearly, Zn ∈ Fn.

    E(Zn+1 | Fn) = Σ_{k=1}^∞ E(Zn+1 1_{{Zn=k}} | Fn)

by (1.1a) and (1.1c). On {Zn = k}, Zn+1 = ξ_1^{n+1} + · · · + ξ_k^{n+1}, so the sum is

    Σ_{k=1}^∞ E((ξ_1^{n+1} + · · · + ξ_k^{n+1}) 1_{{Zn=k}} | Fn)
        = Σ_{k=1}^∞ 1_{{Zn=k}} E(ξ_1^{n+1} + · · · + ξ_k^{n+1} | Fn)

by (1.3). Since each ξ_j^{n+1} is independent of Fn, the last expression

    = Σ_{k=1}^∞ 1_{{Zn=k}} kµ = µZn

Dividing both sides by µ^{n+1} now gives the desired result.

Remark. The reader should notice that in the proof of (3.6) we broke things
down according to the value of Zn to get rid of the random index. A simpler
way of doing the last argument (that we will use in the future) is to use Exercise
1.1 to conclude that on {Zn = k}

    E(Zn+1 | Fn) = E(ξ_1^{n+1} + · · · + ξ_k^{n+1} | Fn) = kµ = µZn

Zn/µ^n is a nonnegative martingale, so (2.11) implies Zn/µ^n → a limit a.s.
We begin by identifying cases when the limit is trivial.

(3.7) Theorem. If µ < 1 then Zn = 0 for all n sufficiently large, so Zn/µ^n → 0.

Proof  E(Zn/µ^n) = E(Z0) = 1, so E(Zn) = µ^n. Now Zn ≥ 1 on {Zn > 0}, so

    P(Zn > 0) ≤ E(Zn ; Zn > 0) = E(Zn) = µ^n → 0

exponentially fast if µ < 1.
The last answer should be intuitive. If each individual on the average gives
birth to less than one child, the species will die out. The next result shows that
after we exclude the trivial case in which each individual has exactly one child,
the same result holds when µ = 1.

(3.8) Theorem. If µ = 1 and P(ξ_i^m = 1) < 1 then Zn = 0 for all n sufficiently
large.

Proof  When µ = 1, Zn is itself a nonnegative martingale. Since Zn is integer
valued and by (2.11) converges to an a.s. finite limit Z∞, we must have Zn = Z∞
for large n. If P(ξ_i^m = 1) < 1 and k > 0 then P(Zn = k for all n ≥ N) = 0 for
any N, so we must have Z∞ ≡ 0.
When µ ≤ 1, the limit of Zn/µ^n is 0 because the branching process dies
out. Our next step is to show:

(3.9) Theorem. If µ > 1 then P(Zn > 0 for all n) > 0.

Proof  For s ∈ [0, 1], let ϕ(s) = Σ_{k≥0} pk s^k where pk = P(ξ_i^m = k). ϕ is
the generating function for the offspring distribution pk. Differentiating and
referring to (9.2) in the Appendix for the justification gives for s < 1

    ϕ′(s) = Σ_{k=1}^∞ k pk s^{k−1} ≥ 0
    ϕ″(s) = Σ_{k=2}^∞ k(k − 1) pk s^{k−2} ≥ 0

So ϕ is increasing and convex, and lim_{s↑1} ϕ′(s) = Σ_{k=1}^∞ k pk = µ. Our
interest in ϕ stems from the following facts.

(a) If θm = P(Zm = 0) then θm = Σ_{k=0}^∞ pk (θm−1)^k = ϕ(θm−1).

Proof of (a)  If Z1 = k, an event with probability pk, then Zm = 0 if and only
if all k families die out in the remaining m − 1 units of time, an independent
event with probability θ_{m−1}^k. Summing over the disjoint possibilities for each k
gives the desired result.
(b) If ϕ′(1) = µ > 1 there is a unique ρ < 1 so that ϕ(ρ) = ρ.

Proof of (b)  ϕ(0) ≥ 0, ϕ(1) = 1, and ϕ′(1) > 1, so ϕ(1 − ε) < 1 − ε for small
ε. The last two observations imply the existence of a fixed point. See Figure
4.3.1. To see it is unique, observe that µ > 1 implies pk > 0 for some k > 1, so
ϕ″(θ) > 0 for θ > 0. Since ϕ is strictly convex, it follows that if ρ < 1 is a fixed
point, then ϕ(x) < x for x ∈ (ρ, 1).
(c) As m ↑ ∞, θm ↑ ρ.

Proof of (c)  θ0 = 0, ϕ(ρ) = ρ, and ϕ is increasing, so induction implies θm
is increasing and θm ≤ ρ. Let θ∞ = lim θm. Taking limits in θm = ϕ(θm−1), we
see θ∞ = ϕ(θ∞). Since θ∞ ≤ ρ, it follows that θ∞ = ρ.

Combining (a)–(c) shows P(Zn = 0 for some n) = lim θn = ρ < 1 and proves
(3.9).
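The iteration in (c) gives a practical way to compute ρ. A sketch (the offspring distribution below is our own illustrative choice, not from the text): iterate θm = ϕ(θm−1) starting from θ0 = 0, which by (c) increases to the fixed point.

```python
def phi(s, p):
    """Generating function φ(s) = Σ p_k s^k of the offspring distribution p."""
    return sum(pk * s**k for k, pk in enumerate(p))

def extinction_prob(p, m=200):
    """Iterate θ_m = φ(θ_{m−1}) from θ_0 = 0; by (c), θ_m increases to ρ."""
    theta = 0.0
    for _ in range(m):
        theta = phi(theta, p)
    return theta

# assumed example: p0 = 1/4, p1 = 1/4, p2 = 1/2, so µ = 5/4 > 1
p = [0.25, 0.25, 0.5]
print(extinction_prob(p))   # the smaller root ρ of φ(s) = s; here ρ = 1/2
```

For this choice φ(s) = s reduces to 2s² − 3s + 1 = 0, with roots 1/2 and 1, and the iteration picks out the smaller root, in line with (b) and (c); a subcritical distribution (µ ≤ 1) would drive the iteration to 1 instead.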
The last result shows that when µ > 1, the limit of Zn/µ^n has a chance of
being nonzero. The best result on this question is due to Kesten and Stigum:

(3.10) Theorem. W = lim Zn/µ^n is not ≡ 0 if and only if Σ pk k log k < ∞.

For a proof, see Athreya and Ney (1972), p. 24–29. In the next section, we will
show that Σ k^2 pk < ∞ is sufficient for a nontrivial limit.

Exercise 3.12. Show that if P(lim Zn/µ^n = 0) < 1 then it is = ρ and hence
{lim Zn/µ^n > 0} = {Zn > 0 for all n} a.s.
Exercise 3.13. Galton and Watson, who invented the process that bears their
names, were interested in the survival of family names. Suppose each family
has exactly 3 children, but coin flips determine their sex. In the 1800s, only
male children kept the family name, so following the male offspring leads to a
branching process with p0 = 1/8, p1 = 3/8, p2 = 3/8, p3 = 1/8. Compute the
probability ρ that the family name will die out when Z0 = 1.

4.4. Doob's Inequality, Convergence in L^p

We begin by proving a consequence of (2.8).
(4.1) Theorem. If Xn is a submartingale and N is a stopping time with
P(N ≤ k) = 1 then

    EX0 ≤ EXN ≤ EXk

Remark. Let Sn be a simple random walk with S0 = 1 and let N = inf{n :
Sn = 0}. (See Example 2.2 for more details.) ES0 = 1 > 0 = ESN, so the first
inequality need not hold for unbounded stopping times. In Section 4.7 we will
give conditions that guarantee EX0 ≤ EXN for unbounded N.
Proof  (2.8) implies XN∧n is a submartingale, so it follows that

    EX0 = EXN∧0 ≤ EXN∧k = EXN

To prove the other inequality, let Kn = 1_{{N<n}} = 1_{{N≤n−1}}. Kn is predictable,
so (2.7) implies (K · X)n = Xn − XN∧n is a submartingale, and it follows that

    EXk − EXN = E(K · X)k ≥ E(K · X)0 = 0
Exercise 4.1. Show that if j ≤ k then E (Xj ; N = j ) ≤ E (Xk ; N = j ) and
sum over j to get a second proof of EXN ≤ EXk .
Exercise 4.2. Generalize the proof of (4.1) to show that if Xn is a submartingale and M ≤ N are stopping times with P (N ≤ k ) = 1 then EXM ≤ EXN .
Exercise 4.3. Use the stopping times from Exercise 1.7 in Chapter 3 to
strengthen the conclusion of the previous exercise to E(XN | FM) ≥ XM.
We will see below that (4.1) is very useful. The ﬁrst indication of this is:
(4.2) Doob's inequality. Let Xm be a submartingale, X̄n = max_{0≤m≤n} Xm^+,
λ > 0, and A = {X̄n ≥ λ}. Then

    λP(A) ≤ E(Xn 1A) ≤ EXn^+

Proof  Let N = inf{m : Xm ≥ λ or m = n}. Since XN ≥ λ on A,
    λP(A) ≤ E(XN 1A) ≤ E(Xn 1A)

The first inequality above holds since XN ≥ λ on A. The second follows from
the fact that (4.1) implies EXN ≤ EXn and we have XN = Xn on A^c. The
second inequality in (4.2) is trivial.
Example 4.1. If we let Sn = ξ1 + · · · + ξn where the ξm are independent and
have Eξm = 0 and σm^2 = Eξm^2 < ∞, then (2.3) implies Xn = Sn^2 is a
submartingale. If we let λ = x^2 and apply (4.2) to Xn, we get Kolmogorov's
inequality ((8.2) in Chapter 1):

    P( max_{1≤m≤n} |Sm| ≥ x ) ≤ x^{−2} var(Sn)

Using martingales, one can also prove a lower bound on the maximum that can
be used instead of the central limit theorem in our proof of the necessity of the
conditions in the three series theorem. (See Example 4.7 in Chapter 2.)
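A Monte Carlo sanity check of Kolmogorov's inequality (a sketch; the walk and all parameters are our own choices), using i.i.d. ±1 steps so that var(Sn) = n:

```python
import random

def max_abs_partial_sum(n, rng):
    """max over 1 <= m <= n of |S_m| for a walk with i.i.d. ±1 steps."""
    s, best = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        best = max(best, abs(s))
    return best

rng = random.Random(2)
n, x, reps = 400, 50.0, 5000
freq = sum(max_abs_partial_sum(n, rng) >= x for _ in range(reps)) / reps
bound = n / x**2          # x^{-2} var(S_n), with var(S_n) = n here
print(freq, "<=", bound)  # the empirical frequency respects the bound
```

The bound is far from tight at this x (the true probability is an order of magnitude smaller), which is typical: Kolmogorov's inequality trades sharpness for complete generality in the variance.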
Exercise 4.4. Suppose in addition to the conditions introduced above that
|ξm| ≤ K and let sn^2 = Σ_{m≤n} σm^2. Exercise 2.6 implies that Sn^2 − sn^2 is a
martingale. Use this and (4.1) to conclude

    P( max_{1≤m≤n} |Sm| ≤ x ) ≤ (x + K)^2 / var(Sn)

Exercise 4.5. Let Xn be a martingale with X0 = 0 and EXn^2 < ∞. Show
that

    P( max_{1≤m≤n} Xm ≥ λ ) ≤ EXn^2 / (EXn^2 + λ^2)

Hint: Use the fact that (Xn + c)^2 is a submartingale and optimize over c.

Remark. Some readers may recognize the resemblance to the one-sided Chebyshev
bound in Exercise 3.6 of Chapter 1. Taking n = 1 and choosing an appropriate
distribution for X1 shows that the inequality in Exercise 4.5 is sharp
also.
Integrating the inequality in (4.2) gives:

(4.3) L^p maximum inequality. If Xn is a submartingale then for 1 < p < ∞,

    E(X̄n^p) ≤ (p/(p − 1))^p E(Xn^+)^p

Consequently, if Yn is a martingale and Yn* = max_{0≤m≤n} |Ym|,

    E(Yn*)^p ≤ (p/(p − 1))^p E(|Yn|^p)
E Y n p ≤ E ( Y n p ) Proof The second inequality follows by applying the ﬁrst to Xn = Yn . To
prove the ﬁrst we will, for reasons that will become clear in a moment, work
¯
¯
¯
¯
with Xn ∧ M rather than Xn . Since {Xn ∧ M ≥ λ} is always {Xn ≥ λ} or ∅,
this does not change the application of (4.2). Using (5.7) in Chapter 1, (4.2),
Fubini’s theorem, and a little calculus gives
∞ ¯
E ((Xn ∧ M )p ) = ¯
pλp−1 P (Xn ∧ M ≥ λ) dλ
0
∞ pλp−1 λ−1 ≤ +
Xn 1(Xn ∧M ≥λ) dP
¯ 0
¯
Xn ∧ M = +
Xn pλp−2 dλ dP
0 = p
p−1 +¯
Xn (Xn ∧ M )p−1 dP dλ Section 4.4 Doob’s Inequality, Convergence in Lp
If we let q = p/(p − 1) be the exponent conjugate to p and apply Hölder's
inequality, (3.3) in Chapter 1, we see that the above

    ≤ q ( E(Xn^+)^p )^{1/p} ( E(X̄n ∧ M)^p )^{1/q}

If we divide both sides of the last inequality by (E(X̄n ∧ M)^p)^{1/q}, we get

    E((X̄n ∧ M)^p) ≤ (p/(p − 1))^p E(Xn^+)^p

Letting M → ∞ and using the monotone convergence theorem gives the desired
result.
Exercise 4.6. Another fix for the integrability problem in (4.3). Note
that if E(Xn^+)^p = ∞, there is nothing to show, and then show E(Xn^+)^p < ∞
implies E(X̄n^p) < ∞.
Example 4.2. (4.3) is false when p = 1. Again, the counterexample is
provided by Example 2.2. Let Sn be a simple random walk starting from S0 = 1,
N = inf{n : Sn = 0}, and Xn = SN∧n. (4.1) implies EXn = ESN∧n = ES0 = 1
for all n. Using (1.7) in Chapter 3 with a = −1, b = M − 1 we have

    P( max_m Xm < M ) = (M − 1)/M

so E(max_m Xm) = Σ_{M=1}^∞ P(max_m Xm ≥ M) = Σ_{M=1}^∞ 1/M = ∞. The
monotone convergence theorem implies that E max_{m≤n} Xm ↑ ∞ as n ↑ ∞.
The next result gives an extension of (4.3) to p = 1. Since this is not one
of the most important results, the proof is left to the reader.

(4.4) Theorem. Let Xn be a submartingale and log^+ x = max(log x, 0).

    E X̄n ≤ (1 − e^{−1})^{−1} { 1 + E(Xn^+ log^+(Xn^+)) }

Remark. The last result is almost the best possible condition for sup |Xn| ∈
L^1. Gundy has shown that if Xn is a positive martingale that has Xn+1 ≤ CXn
and EX0 log^+ X0 < ∞, then E(sup Xn) < ∞ implies sup E(Xn log^+ Xn) < ∞.
For a proof, see Neveu (1975), p. 71–73.
Exercise 4.7. Prove (4.4) by carrying out the following steps: (i) Imitate the
proof of (4.3) but use the trivial bound P(A) ≤ 1 for λ ≤ 1 to show

    E(X̄n ∧ M) ≤ 1 + ∫ Xn^+ log(X̄n ∧ M) dP

(ii) Use calculus to show

    a log b ≤ a log a + b/e ≤ a log^+ a + b/e

From (4.3), we get the following:
(4.5) L^p convergence theorem. If Xn is a martingale with sup E|Xn|^p < ∞
where p > 1, then Xn → X a.s. and in L^p.

Proof  (EXn^+)^p ≤ (E|Xn|)^p ≤ E|Xn|^p, so it follows from the martingale
convergence theorem (2.10) that Xn → X a.s. The second conclusion in (4.3)
implies

    E( sup_{0≤m≤n} |Xm|^p ) ≤ (p/(p − 1))^p E|Xn|^p

Letting n → ∞ and using the monotone convergence theorem implies sup |Xn| ∈
L^p. Since |Xn − X|^p ≤ (2 sup |Xn|)^p, it follows from the dominated convergence
theorem that E|Xn − X|^p → 0.

The most important special case of the results in this section occurs when
p = 2. To treat this case, the next two results are useful.

(4.6) Orthogonality of martingale increments. Let Xn be a martingale
with EXn^2 < ∞ for all n. If m ≤ n and Y ∈ Fm has EY^2 < ∞ then

    E((Xn − Xm)Y) = 0

Proof  The Cauchy–Schwarz inequality implies E|(Xn − Xm)Y| < ∞. Using
(1.1f), (1.3), and the definition of a martingale,

    E((Xn − Xm)Y) = E[E((Xn − Xm)Y | Fm)] = E[Y E(Xn − Xm | Fm)] = 0

(4.7) Conditional variance formula. If Xn is a martingale with EXn^2 < ∞
for all n,

    E((Xn − Xm)^2 | Fm) = E(Xn^2 | Fm) − Xm^2

Remark. This is the conditional analogue of E(X − EX)^2 = EX^2 − (EX)^2
and is proved in exactly the same way.

Proof  Using the linearity of conditional expectation and then (1.3), we have

    E(Xn^2 − 2Xn Xm + Xm^2 | Fm) = E(Xn^2 | Fm) − 2Xm E(Xn | Fm) + Xm^2
        = E(Xn^2 | Fm) − 2Xm^2 + Xm^2 = E(Xn^2 | Fm) − Xm^2

Exercise 4.8. Let Xn and Yn be martingales with EXn^2 < ∞ and EYn^2 < ∞.

    EXn Yn − EX0 Y0 = Σ_{m=1}^n E((Xm − Xm−1)(Ym − Ym−1))

The next two results generalize (8.3) and (8.7) from Chapter 1. Let Xn, n ≥ 0,
be a martingale and let ξn = Xn − Xn−1 for n ≥ 1.

Exercise 4.9. If EX0^2 < ∞ and Σ_{m=1}^∞ Eξm^2 < ∞ then Xn → X∞ a.s. and
in L^2.

Exercise 4.10. If bm ↑ ∞ and Σ_{m=1}^∞ Eξm^2/bm^2 < ∞ then Xn/bn → 0 a.s.
In particular, if Eξn^2 ≤ K < ∞ and Σ_{m=1}^∞ bm^{−2} < ∞ then Xn/bn → 0 a.s.

Example 4.3. Branching processes. We continue the study begun at
the end of the last section. Using the notation introduced there, we suppose
µ = E(ξ_i^m) > 1 and var(ξ_i^m) = σ^2 < ∞. Let Xn = Zn/µ^n. Taking m = n − 1
in (4.7) and rearranging, we have

    E(Xn^2 | Fn−1) = Xn−1^2 + E((Xn − Xn−1)^2 | Fn−1)

To compute the second term, we observe

    E((Xn − Xn−1)^2 | Fn−1) = E((Zn/µ^n − Zn−1/µ^{n−1})^2 | Fn−1)
        = µ^{−2n} E((Zn − µZn−1)^2 | Fn−1)

It follows from Exercise 1.1 that on {Zn−1 = k},

    E((Zn − µZn−1)^2 | Fn−1) = E(( Σ_{i=1}^k ξ_i^n − µk )^2 | Fn−1) = kσ^2 = Zn−1 σ^2

Combining the last three equations gives

    EXn^2 = EXn−1^2 + E(Zn−1 σ^2/µ^{2n}) = EXn−1^2 + σ^2/µ^{n+1}

since E(Zn−1/µ^{n−1}) = EZ0 = 1. Now EX0^2 = 1, so EX1^2 = 1 + σ^2/µ^2, and
induction gives

    EXn^2 = 1 + σ^2 Σ_{k=2}^{n+1} µ^{−k}

This shows sup EXn^2 < ∞, so Xn → X in L^2, and hence EXn → EX. EXn = 1
for all n, so EX = 1 and X is not ≡ 0. It follows from Exercise 3.12 that
{X > 0} = {Zn > 0 for all n}.

* Square Integrable Martingales
For the rest of this section, we will suppose

    Xn is a martingale with X0 = 0 and EXn^2 < ∞ for all n

(2.4) implies Xn^2 is a submartingale. It follows from Doob's decomposition
(2.12) that we can write Xn^2 = Mn + An, where Mn is a martingale, and from
formulas in (2.12) and (4.7) that

    An = Σ_{m=1}^n ( E(Xm^2 | Fm−1) − Xm−1^2 ) = Σ_{m=1}^n E((Xm − Xm−1)^2 | Fm−1)

An is called the increasing process associated with Xn. An can be thought
of as a path by path measurement of the variance at time n, and A∞ = lim An
as the total variance in the path. (4.9) and (4.10) describe the behavior of the
martingale on {A∞ < ∞} and {A∞ = ∞}, respectively. The key to the proof of
the first result is the following:
(4.8) Theorem. E(sup_m |Xm|^2) ≤ 4EA∞.

Proof  Applying the L^2 maximum inequality (4.3) to Xn gives

    E( sup_{0≤m≤n} |Xm|^2 ) ≤ 4EXn^2 = 4EAn

since EXn^2 = EMn + EAn and EMn = EM0 = EX0^2 = 0. Using the monotone
convergence theorem now gives the desired result.

(4.9) Theorem. lim_{n→∞} Xn exists and is finite a.s. on {A∞ < ∞}.

Proof  Let a > 0. Since An+1 ∈ Fn, N = inf{n : An+1 > a^2} is a stopping
time. Applying (4.8) to XN∧n and noticing AN∧n ≤ a^2 gives

    E( sup_n |XN∧n|^2 ) ≤ 4a^2

so the L^2 convergence theorem (4.5) implies that lim XN∧n exists and is finite
a.s. Since a is arbitrary, the desired result follows.

The next result is a variation on the theme of Exercise 4.10.
(4.10) Theorem. Let f ≥ 1 be increasing with ∫_0^∞ f(t)^{−2} dt < ∞. Then
Xn/f(An) → 0 a.s. on {A∞ = ∞}.

Proof  Hm = f(Am)^{−1} is bounded and predictable, so (2.7) implies

    Yn ≡ (H · X)n = Σ_{m=1}^n (Xm − Xm−1)/f(Am)

is a martingale. If Bn is the increasing process associated with Yn then

    Bn+1 − Bn = E((Yn+1 − Yn)^2 | Fn)
        = E( (Xn+1 − Xn)^2/f(An+1)^2 | Fn )
        = (An+1 − An)/f(An+1)^2

since f(An+1) ∈ Fn. Our hypotheses on f imply that

    Σ_{n=0}^∞ (An+1 − An)/f(An+1)^2 ≤ Σ_{n=0}^∞ ∫_{[An,An+1)} f(t)^{−2} dt < ∞

so it follows from (4.9) that Yn → Y∞, and the desired conclusion follows from
Kronecker's lemma, (8.1) in Chapter 1.
√
Example 4.4. Let > 0 and f (t) = t(log t)1/2+ ∨ 1. Then f satisﬁes
the hypotheses of (4.10). Let ξ1 , ξ2 , . . . be independent with Eξm = 0 and
2
2
Eξm = σm . In this case, Xn = ξ1 + · · · + ξn is a square integrable martingale
∞
2
2
2
with An = σ1 + · · · + σn , so if i=1 σi = ∞, (4.10) implies Xn /f (An ) → 0
generalizing (8.7) in Chapter 1.
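Example 4.4 can be watched numerically. The following Python sketch is an illustration of ours, not part of the text; the ±1 step distribution, the seed, and ε = 0.1 are arbitrary choices. With σ_m^2 = 1 we have A_n = n, and (4.10) says |X_n|/f(A_n) should be small once n is large.

```python
import math
import random

random.seed(0)

def f(t, eps=0.1):
    # f(t) = sqrt(t) (log t)^{1/2+eps} from Example 4.4, floored at 1
    if t <= 1.0:
        return 1.0
    return max(1.0, math.sqrt(t) * math.log(t) ** (0.5 + eps))

# Independent +/-1 steps: E xi_m = 0 and sigma_m^2 = 1, so A_n = n and
# A_infinity = infinity; (4.10) then gives X_n / f(A_n) -> 0 a.s.
n_steps = 200_000
X = 0.0
for _ in range(n_steps):
    X += random.choice((-1.0, 1.0))

final_ratio = abs(X) / f(float(n_steps))
```

Since |X_n| is typically of order √n while f(n) grows faster by the logarithmic factor, the ratio is well below 1 at this horizon.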
From (4.10) we get a result due to Dubins and Freedman (1965) that extends (6.6) from Chapter 1 and (3.2) above.

(4.11) Second Borel-Cantelli Lemma, III. Suppose B_n is adapted to F_n and let p_n = P(B_n | F_{n-1}). Then

    ( Σ_{m=1}^n 1_{B_m} ) / ( Σ_{m=1}^n p_m ) → 1  a.s. on { Σ_{m=1}^∞ p_m = ∞ }

Proof  Define a martingale by X_0 = 0 and X_n - X_{n-1} = 1_{B_n} - P(B_n | F_{n-1}) for n ≥ 1, so that we have

    ( Σ_{m=1}^n 1_{B_m} ) / ( Σ_{m=1}^n p_m ) - 1 = X_n / Σ_{m=1}^n p_m

Chapter 4  Martingales

The increasing process associated with X_n has

    A_n - A_{n-1} = E((X_n - X_{n-1})^2 | F_{n-1}) = E((1_{B_n} - p_n)^2 | F_{n-1}) = p_n - p_n^2 ≤ p_n

On {A_∞ < ∞}, X_n → a finite limit by (4.9), so on {A_∞ < ∞} ∩ { Σ_m p_m = ∞ }

    X_n / Σ_{m=1}^n p_m → 0

{A_∞ = ∞} = { Σ_m p_m(1 - p_m) = ∞ } ⊂ { Σ_m p_m = ∞ }, so on {A_∞ = ∞} the desired conclusion follows from (4.10) with f(t) = t ∨ 1.

Remark. The trivial example B_n = Ω for all n shows we may have A_∞ < ∞ and Σ p_m = ∞ a.s.
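(4.11) is easy to test by simulation. In the sketch below (our illustration, not from the text), the events are taken independent, so that P(B_m | F_{m-1}) reduces to the unconditional probability p_m; the divergent choice p_m = m^{-1/2} and the seed are arbitrary.

```python
import random

random.seed(1)

# Independent events B_m with P(B_m) = p_m = m^{-1/2}. Independence makes
# p_m also the conditional probability in (4.11), and sum p_m = infinity,
# so the ratio (number of B_m that occur) / (sum of p_m) should tend to 1.
n = 200_000
hits = 0
expected = 0.0
for m in range(1, n + 1):
    p = m ** -0.5
    expected += p
    hits += random.random() < p

ratio = hits / expected   # expected ~ 2 sqrt(n), so fluctuations are small
```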
Example 4.5. Bernard Friedman's urn. Consider a variant of Polya's urn (see Section 4.3) in which we add a balls of the color drawn and b balls of the opposite color, where a ≥ 0 and b > 0. We will show that if we start with g green balls and r red balls, where g, r > 0, then the fraction of green balls g_n → 1/2. Let G_n and R_n be the number of green and red balls after the nth draw is completed. Let B_n be the event that the nth ball drawn is green, and let D_n be the number of green balls drawn in the first n draws. It follows from (4.11) that

(⋆)    D_n / Σ_{m=1}^n g_{m-1} → 1  a.s. on { Σ_{m=1}^∞ g_{m-1} = ∞ }

which always holds since g_m ≥ g/(g + r + (a + b)m). At this point, the argument breaks into three cases.

Case 1. a = b = c. In this case, the result is trivial since we always add c balls of each color.

Case 2. a > b. We begin with the observation

(∗)    g_{n+1} = G_{n+1}/(G_{n+1} + R_{n+1}) = (g + aD_n + b(n - D_n))/(g + r + n(a + b))

If limsup_{n→∞} g_n ≤ x then (⋆) implies limsup_{n→∞} D_n/n ≤ x and (since a > b)

    limsup_{n→∞} g_{n+1} ≤ (ax + b(1 - x))/(a + b) = (b + (a - b)x)/(a + b)

The right-hand side is a linear function with slope < 1 and fixed point at 1/2, so starting with the trivial upper bound x = 1 and iterating we conclude that limsup g_n ≤ 1/2. Interchanging the roles of red and green shows liminf_{n→∞} g_n ≥ 1/2, and the result follows.

Case 3. a < b. The result is easier to believe in this case, since we are adding more balls of the type not drawn, but it is a little harder to prove. The trouble is that when b > a and D_n ≤ xn, the right-hand side of (∗) is maximized by taking D_n = 0, so we also need to use the fact that if r_n is the fraction of red balls, then

    r_{n+1} = R_{n+1}/(G_{n+1} + R_{n+1}) = (r + bD_n + a(n - D_n))/(g + r + n(a + b))

Combining this with the formula for g_{n+1}, it follows that if

    limsup_{n→∞} g_n ≤ x  and  limsup_{n→∞} r_n ≤ y

then

    limsup_{n→∞} g_n ≤ (a(1 - y) + by)/(a + b) = (a + (b - a)y)/(a + b)

and

    limsup_{n→∞} r_n ≤ (bx + a(1 - x))/(a + b) = (a + (b - a)x)/(a + b)

Starting with the trivial bounds x = 1, y = 1 and iterating (observe the two upper bounds are always the same), we conclude as in Case 2 that both limsups are ≤ 1/2.

Remark. B. Friedman (1949) considered a number of different urn models. The result above is due to Freedman (1965), who proved the result by different methods. The proof above is due to Ornstein and comes from a remark in Freedman's paper.
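A simulation of the urn makes the contrast with Polya's urn visible. The following sketch is ours, not from the text; the parameters a = 2, b = 1, the lopsided start (10 green, 1 red), and the seed are arbitrary choices.

```python
import random

random.seed(2)

def friedman_fraction(a, b, g, r, draws):
    # Bernard Friedman's urn: after each draw, add `a` balls of the drawn
    # color and `b` balls of the opposite color.
    green, red = float(g), float(r)
    for _ in range(draws):
        if random.random() < green / (green + red):
            green += a
            red += b
        else:
            red += a
            green += b
    return green / (green + red)

# With b > 0 the green fraction approaches 1/2 even from a lopsided start;
# contrast Polya's urn (b = 0), whose limiting fraction is random.
frac = friedman_fraction(a=2, b=1, g=10, r=1, draws=50_000)
```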
(4.8) came from using (4.3). If we use (4.2) instead, we get a slightly better result.

(4.12) Theorem. E(sup_n |X_n|) ≤ 3EA_∞^{1/2}.

Proof  As in the proof of (4.9), we let a > 0 and let N = inf{n : A_{n+1} > a^2}. This time, however, our starting point is

    P(sup_m |X_m| > a) ≤ P(N < ∞) + P(sup_m |X_{N∧m}| > a)

P(N < ∞) = P(A_∞ > a^2). To bound the second term, we apply (4.2) to X_{N∧m}^2 with λ = a^2 to get

    P(sup_{m≤n} |X_{N∧m}| > a) ≤ a^{-2} EX_{N∧n}^2 = a^{-2} EA_{N∧n} ≤ a^{-2} E(A_∞ ∧ a^2)

Letting n → ∞ in the last inequality, substituting the result in the first one, and integrating gives

    ∫_0^∞ P(sup_m |X_m| > a) da ≤ ∫_0^∞ P(A_∞ > a^2) da + ∫_0^∞ a^{-2} E(A_∞ ∧ a^2) da

Since P(A_∞ > a^2) = P(A_∞^{1/2} > a), the first integral is EA_∞^{1/2}. For the second, we use (5.7) from Chapter 1 (in the first and fourth steps), Fubini's theorem, and calculus to get

    ∫_0^∞ a^{-2} E(A_∞ ∧ a^2) da = ∫_0^∞ a^{-2} ∫_0^{a^2} P(A_∞ > b) db da
        = ∫_0^∞ P(A_∞ > b) ∫_{√b}^∞ a^{-2} da db
        = ∫_0^∞ b^{-1/2} P(A_∞ > b) db = 2EA_∞^{1/2}

Exercise 4.11. Let ξ_1, ξ_2, ... be i.i.d. with Eξ_i = 0 and Eξ_i^2 < ∞. Let S_n = ξ_1 + ··· + ξ_n. (4.1) implies that for any stopping time N, ES_{N∧n} = 0. Use (4.12) to conclude that if EN^{1/2} < ∞ then ES_N = 0.

Remark. Let ξ_i in Exercise 4.11 take the values ±1 with equal probability, and let T = inf{n : S_n = -1}. Since S_T = -1 does not have mean 0, it follows that ET^{1/2} = ∞. If we recall from (3.4) in Chapter 3 that P(T > t) ∼ Ct^{-1/2}, we see that the result in Exercise 4.11 is almost the best possible.
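The heavy tail of T described in the Remark shows up clearly in simulation. The sketch below is ours, not from the text; the truncation horizon and seed are arbitrary. Almost every path hits -1, yet because P(T > t) decays only like t^{-1/2}, the truncated mean E(T ∧ horizon) is far larger than the typical hitting time.

```python
import random

random.seed(3)

# T = inf{n : S_n = -1} for the symmetric +/-1 walk of the Remark.
trials, horizon = 2000, 10_000
hit = 0
total_time = 0
for _ in range(trials):
    s, t = 0, 0
    while s != -1 and t < horizon:
        s += random.choice((-1, 1))
        t += 1
    hit += (s == -1)
    total_time += t

frac_hit = hit / trials                 # close to 1: the walk does reach -1
mean_stopped_time = total_time / trials # E(T ^ horizon): large, heavy tail
```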
4.5. Uniform Integrability, Convergence in L^1

In this section, we will give necessary and sufficient conditions for a martingale to converge in L^1. The key to this is the following definition. A collection of random variables X_i, i ∈ I, is said to be uniformly integrable if

    lim_{M→∞} ( sup_{i∈I} E(|X_i|; |X_i| > M) ) = 0

If we pick M large enough so that the sup is < 1, it follows that

    sup_{i∈I} E|X_i| ≤ M + 1 < ∞

This remark will be useful several times below.
A trivial example of a uniformly integrable family is a collection of random variables that are dominated by an integrable random variable, i.e., |X_i| ≤ Y where EY < ∞. Our first result gives an interesting example that shows that uniformly integrable families can be very large.

(5.1) Theorem. Given a probability space (Ω, F_o, P) and an X ∈ L^1, the collection {E(X|F) : F is a σ-field ⊂ F_o} is uniformly integrable.

Proof  If A_n is a sequence of sets with P(A_n) → 0, then the dominated convergence theorem implies E(|X|; A_n) → 0. From the last result, it follows that if ε > 0, we can pick δ > 0 so that if P(A) ≤ δ then E(|X|; A) ≤ ε. (If not, there are sets A_n with P(A_n) ≤ 1/n and E(|X|; A_n) > ε, a contradiction.) Pick M large enough so that E|X|/M ≤ δ. Jensen's inequality and the definition of conditional expectation imply

    E(|E(X|F)|; |E(X|F)| > M) ≤ E(E(|X| |F); |E(X|F)| > M) = E(|X|; |E(X|F)| > M)

since {|E(X|F)| > M} ∈ F. Using Chebyshev's inequality and recalling the definition of M, we have

    P{|E(X|F)| > M} ≤ E|E(X|F)|/M ≤ E|X|/M ≤ δ

So, by the choice of δ, we have

    E(|E(X|F)|; |E(X|F)| > M) ≤ ε

Since ε was arbitrary and the bound is uniform in F, the collection is uniformly integrable.

A common way to check uniform integrability is to use:
Exercise 5.1. Let ϕ ≥ 0 be any function with ϕ(x)/x → ∞ as x → ∞, e.g., ϕ(x) = x^p with p > 1 or ϕ(x) = x log^+ x. If Eϕ(|X_i|) ≤ C for all i ∈ I, then {X_i : i ∈ I} is uniformly integrable.
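Exercise 5.1 can be checked numerically on two concrete families (our illustration, not from the text; the families and the truncation levels M are arbitrary choices). The family X_n = n with probability 1/n has unbounded second moments and is not uniformly integrable, while Y_n = √n with probability 1/n satisfies the criterion with ϕ(x) = x^2, since Eϕ(|Y_n|) = 1.

```python
def tail_expectation(value, prob, M):
    # E(X; X > M) for a r.v. equal to `value` with probability `prob`, else 0
    return value * prob if value > M else 0.0

ns = range(1, 100_001)

# X_n = n w.p. 1/n: E(X_n; X_n > M) = 1 whenever n > M, so the sup over n
# does not go to 0 as M grows -- not uniformly integrable.
sup_tail_X = {M: max(tail_expectation(n, 1.0 / n, M) for n in ns)
              for M in (10, 100)}

# Y_n = sqrt(n) w.p. 1/n: E(Y_n^2) = 1 for all n, and indeed
# E(Y_n; Y_n > M) <= 1/M uniformly in n, matching Exercise 5.1.
sup_tail_Y = {M: max(tail_expectation(n ** 0.5, 1.0 / n, M) for n in ns)
              for M in (10, 100)}
```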
The relevance of uniform integrability to convergence in L^1 is explained by:

(5.2) Theorem. If X_n → X in probability then the following are equivalent:
(i) {X_n : n ≥ 0} is uniformly integrable.
(ii) X_n → X in L^1.
(iii) E|X_n| → E|X| < ∞.

Proof  (i) implies (ii). Let

    ϕ_M(x) = M if x ≥ M,  x if |x| ≤ M,  -M if x ≤ -M

The triangle inequality implies

    |X_n - X| ≤ |X_n - ϕ_M(X_n)| + |ϕ_M(X_n) - ϕ_M(X)| + |ϕ_M(X) - X|

Since |ϕ_M(Y) - Y| = (|Y| - M)^+ ≤ |Y|1_{(|Y|>M)}, taking expected values gives

    E|X_n - X| ≤ E|ϕ_M(X_n) - ϕ_M(X)| + E(|X_n|; |X_n| > M) + E(|X|; |X| > M)

(6.4) in Chapter 1 implies that ϕ_M(X_n) → ϕ_M(X) in probability, so the first term → 0 by the bounded convergence theorem. (See Exercise 6.3 in Chapter 1.) If ε > 0 and M is large, uniform integrability implies that the second term is ≤ ε. To bound the third term, we observe that uniform integrability implies sup E|X_n| < ∞, so Fatou's lemma (in the form given in Exercise 6.2 in Chapter 1) implies E|X| < ∞, and by making M larger we can make the third term ≤ ε. Combining the last three facts shows limsup E|X_n - X| ≤ 2ε. Since ε is arbitrary, this proves (ii).

(ii) implies (iii) Jensen's inequality implies

    | E|X_n| - E|X| | ≤ E| |X_n| - |X| | ≤ E|X_n - X| → 0

(iii) implies (i) Let ψ_M(x) = x on [0, M - 1], ψ_M = 0 on [M, ∞), and let ψ_M be linear on [M - 1, M]. The dominated convergence theorem implies that if M is large, E|X| - Eψ_M(|X|) ≤ ε/2. As in the first part of the proof, the bounded convergence theorem implies Eψ_M(|X_n|) → Eψ_M(|X|), so using (iii) we get that if n ≥ n_0

    E(|X_n|; |X_n| > M) ≤ E|X_n| - Eψ_M(|X_n|) ≤ E|X| - Eψ_M(|X|) + ε/2 < ε

By choosing M larger, we can make E(|X_n|; |X_n| > M) ≤ ε for 0 ≤ n < n_0, so X_n is uniformly integrable.

We are now ready to state the main theorems of this section. We have already done all the work, so the proofs are short.
(5.3) Theorem. For a submartingale, the following are equivalent:
(i) It is uniformly integrable.
(ii) It converges a.s. and in L1 .
(iii) It converges in L1 .
Proof  (i) implies (ii) Uniform integrability implies sup E|X_n| < ∞, so the martingale convergence theorem implies X_n → X a.s., and (5.2) implies X_n → X in L^1.

(ii) implies (iii) Trivial.

(iii) implies (i) X_n → X in L^1 implies X_n → X in probability (see (5.3) in Chapter 1), so this follows from (5.2).
Before proving the analogue of (5.3) for martingales, we will isolate two
parts of the argument that will be useful later.
(5.4) Lemma. If integrable random variables X_n → X in L^1 then

    E(X_n; A) → E(X; A)

Proof  |EX_m1_A - EX1_A| ≤ E|X_m1_A - X1_A| ≤ E|X_m - X| → 0

(5.5) Lemma. If a martingale X_n → X in L^1 then X_n = E(X|F_n).

Proof  The martingale property implies that if m > n, E(X_m|F_n) = X_n, so if A ∈ F_n, E(X_n; A) = E(X_m; A). (5.4) implies E(X_m; A) → E(X; A), so we have E(X_n; A) = E(X; A) for all A ∈ F_n. Recalling the definition of conditional expectation, it follows that X_n = E(X|F_n).
(5.6) Theorem. For a martingale, the following are equivalent:
(i) It is uniformly integrable.
(ii) It converges a.s. and in L1 .
(iii) It converges in L1 .
(iv) There is an integrable random variable X so that X_n = E(X|F_n).

Proof  (i) implies (ii) Since martingales are also submartingales, this follows from (5.3).

(ii) implies (iii) Trivial.
(iii) implies (iv) Follows from (5.5).
(iv) implies (i) This follows from (5.1).
The next result is related to (5.5) but goes in the other direction.

(5.7) Theorem. Suppose F_n ↑ F_∞, i.e., F_n is an increasing sequence of σ-fields and F_∞ = σ(∪_n F_n). As n → ∞,

    E(X|F_n) → E(X|F_∞)  a.s. and in L^1

Proof  The first step is to note that if m > n then (1.2) implies

    E(E(X|F_m)|F_n) = E(X|F_n)

so Y_n = E(X|F_n) is a martingale. (5.1) implies that Y_n is uniformly integrable, so (5.6) implies that Y_n converges a.s. and in L^1 to a limit Y_∞. The definition of Y_n and (5.5) imply E(X|F_n) = Y_n = E(Y_∞|F_n), and hence

    ∫_A X dP = ∫_A Y_∞ dP  for all A ∈ F_n

Since X and Y_∞ are integrable, and ∪_n F_n is a π-system, the π-λ theorem implies that the last result holds for all A ∈ F_∞. Since Y_∞ ∈ F_∞, it follows that Y_∞ = E(X|F_∞).
Exercise 5.2. Let Z_1, Z_2, ... be i.i.d. with E|Z_i| < ∞, let θ be an independent r.v. with finite mean, and let Y_i = Z_i + θ. If Z_i is normal(0,1), then in statistical terms we have a sample from a normal population with variance 1 and unknown mean. The distribution of θ is called the prior distribution, and P(θ ∈ · | Y_1, ..., Y_n) is called the posterior distribution after n observations. Show that E(θ|Y_1, ..., Y_n) → θ a.s.

In the next two exercises, Ω = [0, 1), I_{k,n} = [k2^{-n}, (k + 1)2^{-n}), and F_n = σ(I_{k,n} : 0 ≤ k < 2^n).

Exercise 5.3. f is said to be Lipschitz continuous if |f(t) - f(s)| ≤ K|t - s| for 0 ≤ s, t < 1. Show that X_n = (f((k + 1)2^{-n}) - f(k2^{-n}))/2^{-n} on I_{k,n} defines a martingale, X_n → X_∞ a.s. and in L^1, and

    f(b) - f(a) = ∫_a^b X_∞(ω) dω

Exercise 5.4. Suppose f is integrable on [0, 1). Show that E(f|F_n) is a step function and → f in L^1. From this it follows immediately that if ε > 0, there is a step function g on [0, 1] with ∫ |f - g| dx < ε. This approximation is much simpler than the bare-hands approach we used in Exercise 4.3 of the Appendix, but of course we are using a lot of machinery.
An immediate consequence of (5.7) is:

(5.8) Lévy's 0-1 law. If F_n ↑ F_∞ and A ∈ F_∞ then E(1_A|F_n) → 1_A a.s.

To steal a line from Chung: "The reader is urged to ponder over the meaning of this result and judge for himself whether it is obvious or incredible." We will now argue for the two points of view.

"It is obvious." 1_A ∈ F_∞, and F_n ↑ F_∞, so our best guess of 1_A given the information in F_n should approach 1_A (the best guess given F_∞).

"It is incredible." Let X_1, X_2, ... be independent and suppose A ∈ T, the tail σ-field. For each n, A is independent of F_n, so E(1_A|F_n) = P(A). As n → ∞, the left-hand side converges to 1_A a.s., so P(A) = 1_A a.s., and it follows that P(A) ∈ {0, 1}, i.e., we have proved Kolmogorov's 0-1 law.

The last argument may not show that (5.8) is "too unusual or improbable to be possible," but this and other applications of (5.8) below show that it is a very useful result.
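Lévy's 0-1 law can be watched along a single sample path. The sketch below is our illustration, not from the text; the choice A = {S_101 > 0} for a fair ±1 walk and the seed are arbitrary. Since A ∈ F_101, the exact conditional probabilities E(1_A | F_n), computed from the binomial law of the remaining steps, must march to the indicator of A.

```python
import math
import random

random.seed(4)

def prob_positive(s, remaining):
    # P(s + sum of `remaining` independent fair +/-1 steps > 0), computed
    # exactly: k up-steps give a displacement of 2k - remaining.
    total = 0.0
    for k in range(remaining + 1):
        if s + 2 * k - remaining > 0:
            total += math.comb(remaining, k) / 2.0 ** remaining
    return total

N = 101                        # odd, so P(S_N = 0) = 0 and 1_A is 0 or 1
s = 0
conditional = [prob_positive(0, N)]   # E(1_A | F_0) = P(A) = 1/2
for n in range(N):
    s += random.choice((-1, 1))
    conditional.append(prob_positive(s, N - 1 - n))

indicator = 1.0 if s > 0 else 0.0
```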
Exercise 5.5. Let X_n be r.v.'s taking values in [0, ∞). Let D = {X_n = 0 for some n ≥ 1} and assume

    P(D|X_1, ..., X_n) ≥ δ(x) > 0  a.s. on {X_n ≤ x}

Use (5.8) to conclude that P(D ∪ {lim_n X_n = ∞}) = 1.

Exercise 5.6. Let Z_n be a branching process with offspring distribution p_k (see the end of Section 4.3 for definitions). Use the last result to show that if p_0 > 0 then P(lim_n Z_n = 0 or ∞) = 1.

Exercise 5.7. Let X_n ∈ [0, 1] be adapted to F_n. Let α, β > 0 with α + β = 1 and suppose

    P(X_{n+1} = α + βX_n|F_n) = X_n    P(X_{n+1} = βX_n|F_n) = 1 - X_n

Show P(lim_n X_n = 0 or 1) = 1, and that if X_0 = θ then P(lim_n X_n = 1) = θ.

A more technical consequence of (5.7) is:
(5.9) Dominated convergence theorem for conditional expectations. Suppose Y_n → Y a.s. and |Y_n| ≤ Z for all n, where EZ < ∞. If F_n ↑ F_∞ then

    E(Y_n|F_n) → E(Y|F_∞)  a.s.

Proof  Let W_N = sup{|Y_n - Y_m| : n, m ≥ N}. W_N ≤ 2Z, so EW_N < ∞. Using monotonicity (1.1b) and applying (5.7) to W_N gives

    limsup_{n→∞} E(|Y_n - Y| | F_n) ≤ lim_{n→∞} E(W_N|F_n) = E(W_N|F_∞)

The last result is true for all N, and W_N ↓ 0 as N ↑ ∞, so (1.1c) implies E(W_N|F_∞) ↓ 0, and Jensen's inequality gives us

    |E(Y_n|F_n) - E(Y|F_n)| ≤ E(|Y_n - Y| | F_n) → 0  a.s. as n → ∞

(5.7) implies E(Y|F_n) → E(Y|F_∞) a.s. The desired result follows from the last two conclusions and the triangle inequality.

Exercise 5.8. Show that if F_n ↑ F_∞ and Y_n → Y in L^1 then E(Y_n|F_n) → E(Y|F_∞) in L^1.
Example 5.1. Suppose X_1, X_2, ... are uniformly integrable and → X a.s. (5.2) implies X_n → X in L^1, and combining this with Exercise 5.8 shows E(X_n|F) → E(X|F) in L^1. We will now show that E(X_n|F) need not converge a.s. Let Y_1, Y_2, ... and Z_1, Z_2, ... be independent r.v.'s with

    P(Y_n = 1) = 1/n    P(Y_n = 0) = 1 - 1/n
    P(Z_n = n) = 1/n    P(Z_n = 0) = 1 - 1/n

Let X_n = Y_nZ_n. P(X_n > 0) = 1/n^2, so the Borel-Cantelli lemma implies X_n → 0 a.s. E(X_n; |X_n| ≥ 1) = n/n^2 = 1/n → 0, so X_n is uniformly integrable. Let F = σ(Y_1, Y_2, ...). Then

    E(X_n|F) = Y_nE(Z_n|F) = Y_nEZ_n = Y_n

Since Σ_n P(Y_n = 1) = ∞ and the Y_n are independent, the second Borel-Cantelli lemma implies Y_n = 1 infinitely often; thus Y_n → 0 in L^1 but not a.s., and the same is true for E(X_n|F).

4.6. Backwards Martingales
A backwards martingale (some authors call them reversed) is a martingale indexed by the negative integers, i.e., X_n, n ≤ 0, adapted to an increasing sequence of σ-fields F_n with

    E(X_{n+1}|F_n) = X_n  for n ≤ -1

Because the σ-fields decrease as n ↓ -∞, the convergence theory for backwards martingales is particularly simple.

(6.1) Theorem. X_{-∞} = lim_{n→-∞} X_n exists a.s. and in L^1.

Proof  Let U_n be the number of upcrossings of [a, b] by X_{-n}, ..., X_0. The upcrossing inequality (2.9) implies (b - a)EU_n ≤ E(X_0 - a)^+. Letting n → ∞ and using the monotone convergence theorem, we have EU_∞ < ∞, so by the remark after the proof of (2.10), the limit exists a.s. The martingale property implies X_n = E(X_0|F_n), so (5.1) implies X_n is uniformly integrable and (5.2) tells us that the convergence occurs in L^1.

Exercise 6.1. Show that if X_0 ∈ L^p the convergence in (6.1) occurs in L^p.
The next result identifies the limit in (6.1).

(6.2) Theorem. If X_{-∞} = lim_{n→-∞} X_n and F_{-∞} = ∩_n F_n, then X_{-∞} = E(X_0|F_{-∞}).

Proof  Clearly, X_{-∞} ∈ F_{-∞}. X_n = E(X_0|F_n), so if A ∈ F_{-∞} ⊂ F_n then

    ∫_A X_n dP = ∫_A X_0 dP

(6.1) and (5.4) imply E(X_n; A) → E(X_{-∞}; A), so

    ∫_A X_{-∞} dP = ∫_A X_0 dP

for all A ∈ F_{-∞}, proving the desired conclusion.

The next result is (5.7) backwards.

(6.3) Theorem. If F_n ↓ F_{-∞} as n ↓ -∞ (i.e., F_{-∞} = ∩_n F_n), then

    E(Y|F_n) → E(Y|F_{-∞})  a.s. and in L^1

Proof  X_n = E(Y|F_n) is a backwards martingale, so (6.1) and (6.2) imply that as n ↓ -∞, X_n → X_{-∞} a.s. and in L^1, where

    X_{-∞} = E(X_0|F_{-∞}) = E(E(Y|F_0)|F_{-∞}) = E(Y|F_{-∞})

Exercise 6.2. Prove the backwards analogue of (5.9): Suppose Y_n → Y_{-∞} a.s. as n → -∞ and |Y_n| ≤ Z a.s. where EZ < ∞. If F_n ↓ F_{-∞}, then E(Y_n|F_n) → E(Y_{-∞}|F_{-∞}) a.s.
Even though the convergence theory for backwards martingales is easy, there are some nice applications. For the rest of the section, we return to the special space utilized in Section 3.1, so we can use the definitions given there. That is, we suppose

    Ω = {(ω_1, ω_2, ...) : ω_i ∈ S}    F = S × S × ...    X_n(ω) = ω_n

Let E_n be the σ-field generated by events that are invariant under permutations that leave n + 1, n + 2, ... fixed, and let E = ∩_n E_n be the exchangeable σ-field.
Example 6.1. The strong law of large numbers. Let ξ_1, ξ_2, ... be i.i.d. with E|ξ_i| < ∞. Let S_n = ξ_1 + ··· + ξ_n, let X_{-n} = S_n/n, and let

    F_{-n} = σ(S_n, S_{n+1}, S_{n+2}, ...) = σ(S_n, ξ_{n+1}, ξ_{n+2}, ...)

To compute E(X_{-n}|F_{-n-1}), we observe that if j, k ≤ n + 1, symmetry implies E(ξ_j|F_{-n-1}) = E(ξ_k|F_{-n-1}), so

    E(ξ_{n+1}|F_{-n-1}) = (1/(n+1)) Σ_{k=1}^{n+1} E(ξ_k|F_{-n-1}) = (1/(n+1)) E(S_{n+1}|F_{-n-1}) = S_{n+1}/(n+1)

Since X_{-n} = (S_{n+1} - ξ_{n+1})/n, it follows that

    E(X_{-n}|F_{-n-1}) = E(S_{n+1}/n|F_{-n-1}) - E(ξ_{n+1}/n|F_{-n-1})
        = S_{n+1}/n - S_{n+1}/(n(n+1)) = S_{n+1}/(n+1) = X_{-n-1}

The last computation shows X_{-n} is a backwards martingale, so it follows from (6.1) and (6.2) that lim_{n→∞} S_n/n = E(X_{-1}|F_{-∞}). Since F_{-n} ⊂ E_n, F_{-∞} ⊂ E. The Hewitt-Savage 0-1 law ((1.1) in Chapter 3) says E is trivial, so we have

    lim_{n→∞} S_n/n = E(X_{-1}) = Eξ_1  a.s.

Example 6.2. The Ballot Theorem. Let {ξ_j, 1 ≤ j ≤ n} be i.i.d. nonnegative integer-valued r.v.'s, let S_k = ξ_1 + ··· + ξ_k, and let G = {S_j < j for 1 ≤ j ≤ n}. Then

(6.4)    P(G|S_n) = (1 - S_n/n)^+
Remark. To explain the name, let ξ_1, ξ_2, ..., ξ_n be i.i.d. and take values 0 or 2 with probability 1/2 each. Interpreting the 0's and 2's as votes for candidates A and B, respectively, we see that

    G = {A leads B throughout the counting}

so P(G | B gets r votes) = (1 - 2r/n)^+, the result in (3.2) in Chapter 3.

Proof  The result is trivial when S_n ≥ n, so suppose S_n < n. Computations in Example 6.1 show that X_{-j} = S_j/j is a martingale w.r.t. F_{-j} = σ(S_j, ..., S_n). Let T = inf{k ≥ -n : X_k ≥ 1} and set T = -1 if the set is ∅. We claim that X_T = 1 on G^c. To check this, note that since the ξ_i are nonnegative integers, if S_{j+1} < j + 1 then S_j ≤ S_{j+1} ≤ j; hence the largest j with S_j ≥ j must have S_j = j, i.e., X_T = 1. Since G ⊂ {T = -1} and S_1 < 1 implies S_1 = 0, we have X_T = 0 on G. Noting F_{-n} = σ(S_n) and using Exercise 4.3, we see that on {S_n < n}

    P(G^c|S_n) = E(X_T|F_{-n}) = X_{-n} = S_n/n
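(6.4) can be verified by exact enumeration for small n (our illustration, not from the text; the case n = 12, r = 4 is an arbitrary small example). We count the orderings of the ballots in which A stays strictly ahead, i.e., S_j < j for all j.

```python
from fractions import Fraction
from itertools import combinations

def lead_probability(n, r):
    # Exact P(A leads throughout the count) when B's r votes (the xi = 2
    # ballots) occupy a uniformly random r-subset of the n positions.
    good = total = 0
    for b_positions in combinations(range(n), r):
        total += 1
        b_set = set(b_positions)
        s, ok = 0, True
        for j in range(1, n + 1):
            if j - 1 in b_set:
                s += 2
            if s >= j:          # S_j >= j: A is not strictly ahead
                ok = False
                break
        good += ok
    return Fraction(good, total)

n, r = 12, 4
exact = lead_probability(n, r)
predicted = Fraction(n - 2 * r, n)   # (1 - 2r/n)^+ from (6.4), since S_n = 2r
```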
Example 6.3. Hewitt-Savage 0-1 law. If X_1, X_2, ... are i.i.d. and A ∈ E then P(A) ∈ {0, 1}.

The key to the new proof is:

(6.5) Lemma. Suppose X_1, X_2, ... are i.i.d. and let

    A_n(ϕ) = (1/(n)_k) Σ_i ϕ(X_{i_1}, ..., X_{i_k})

where the sum is over all sequences of distinct integers 1 ≤ i_1, ..., i_k ≤ n and

    (n)_k = n(n - 1) ··· (n - k + 1)

is the number of such sequences. If ϕ is bounded,

    A_n(ϕ) → Eϕ(X_1, ..., X_k)  a.s.

Proof  A_n(ϕ) ∈ E_n, so

    A_n(ϕ) = E(A_n(ϕ)|E_n) = (1/(n)_k) Σ_i E(ϕ(X_{i_1}, ..., X_{i_k})|E_n) = E(ϕ(X_1, ..., X_k)|E_n)

since all the terms in the sum are the same. (6.3) with F_{-m} = E_m for m ≥ 1 implies that

    E(ϕ(X_1, ..., X_k)|E_n) → E(ϕ(X_1, ..., X_k)|E)

We want to show that the limit is E(ϕ(X_1, ..., X_k)). The first step is to observe that there are k(n - 1)_{k-1} terms in A_n(ϕ) involving X_1, and ϕ is bounded, so if we let Σ_{1∈i} denote the sum over sequences that contain 1,

    (1/(n)_k) | Σ_{1∈i} ϕ(X_{i_1}, ..., X_{i_k}) | ≤ (k(n - 1)_{k-1}/(n)_k) sup|ϕ| → 0

This shows that

    E(ϕ(X_1, ..., X_k)|E) ∈ σ(X_2, X_3, ...)

Repeating the argument for 2, 3, ..., k shows

    E(ϕ(X_1, ..., X_k)|E) ∈ σ(X_{k+1}, X_{k+2}, ...)

Intuitively, if the conditional expectation of a r.v. is independent of the r.v., then

(a)    E(ϕ(X_1, ..., X_k)|E) = E(ϕ(X_1, ..., X_k))

To show this, we prove:

(b) If EX^2 < ∞ and E(X|G) ∈ F with X independent of F, then E(X|G) = EX.

Proof  Let Y = E(X|G) and note that (1.1e) implies EY^2 ≤ EX^2 < ∞. By independence, EXY = EX · EY = (EY)^2 since EY = EX. From the geometric interpretation of conditional expectation, (1.4), E((X - Y)Y) = 0, so EY^2 = EXY = (EY)^2 and var(Y) = EY^2 - (EY)^2 = 0.

(a) holds for all bounded ϕ, so E is independent of G_k = σ(X_1, ..., X_k). Since this holds for all k, and ∪_k G_k is a π-system that contains Ω, (4.1) in Chapter 1 implies E is independent of σ(∪_k G_k) ⊃ E, and we get the usual 0-1 law punch line: if A ∈ E, it is independent of itself, and hence P(A) = P(A ∩ A) = P(A)P(A), i.e., P(A) ∈ {0, 1}.
Example 6.4. de Finetti's Theorem. A sequence X_1, X_2, ... is said to be exchangeable if for each n and permutation π of {1, ..., n}, (X_1, ..., X_n) and (X_{π(1)}, ..., X_{π(n)}) have the same distribution.

(6.6) de Finetti's Theorem. If X_1, X_2, ... are exchangeable then conditional on E, X_1, X_2, ... are independent and identically distributed.

Proof  Repeating the first calculation in the proof of (6.5) and using the notation introduced there shows that for any exchangeable sequence

    A_n(ϕ) = E(A_n(ϕ)|E_n) = (1/(n)_k) Σ_i E(ϕ(X_{i_1}, ..., X_{i_k})|E_n) = E(ϕ(X_1, ..., X_k)|E_n)

since all the terms in the sum are the same. Again, (6.3) implies that

(6.7)    A_n(ϕ) → E(ϕ(X_1, ..., X_k)|E)

This time, however, E may be nontrivial, so we cannot hope to show that the limit is E(ϕ(X_1, ..., X_k)).

Let f and g be bounded functions on R^{k-1} and R, respectively. If we let I_{n,k} be the set of all sequences of distinct integers 1 ≤ i_1, ..., i_k ≤ n, then

    (n)_{k-1} A_n(f) · nA_n(g) = Σ_{i∈I_{n,k-1}} f(X_{i_1}, ..., X_{i_{k-1}}) · Σ_m g(X_m)
        = Σ_{i∈I_{n,k}} f(X_{i_1}, ..., X_{i_{k-1}}) g(X_{i_k}) + Σ_{i∈I_{n,k-1}} Σ_{j=1}^{k-1} f(X_{i_1}, ..., X_{i_{k-1}}) g(X_{i_j})

If we let ϕ(x_1, ..., x_k) = f(x_1, ..., x_{k-1}) g(x_k), note that

    (n)_{k-1} n/(n)_k = n/(n - k + 1)   and   (n)_{k-1}/(n)_k = 1/(n - k + 1)

and rearrange, we have

    A_n(ϕ) = (n/(n - k + 1)) A_n(f) A_n(g) - (1/(n - k + 1)) Σ_{j=1}^{k-1} A_n(ϕ_j)

where ϕ_j(x_1, ..., x_{k-1}) = f(x_1, ..., x_{k-1}) g(x_j). Applying (6.7) to ϕ, f, g, and all the ϕ_j gives

    E(f(X_1, ..., X_{k-1}) g(X_k)|E) = E(f(X_1, ..., X_{k-1})|E) · E(g(X_k)|E)

It follows by induction that

    E( Π_{j=1}^k f_j(X_j) | E ) = Π_{j=1}^k E(f_j(X_j)|E)
When the X_i take values in a nice space, there is a regular conditional distribution for (X_1, X_2, ...) given E, and the sequence can be represented as a mixture of i.i.d. sequences. Hewitt and Savage (1956) call the sequence presentable in this case. Because of the usual measure-theoretic problems, the last result is not valid when the X_i take values in an arbitrary measurable space. See Dubins and Freedman (1979) and Freedman (1980) for counterexamples.

The simplest special case of (6.6) occurs when the X_i ∈ {0, 1}. In this case

(6.8) Theorem. If X_1, X_2, ... are exchangeable and take values in {0, 1} then there is a probability distribution F on [0, 1] so that

    P(X_1 = 1, ..., X_k = 1, X_{k+1} = 0, ..., X_n = 0) = ∫_0^1 θ^k (1 - θ)^{n-k} dF(θ)

(6.8) is useful for people concerned about the foundations of statistics (see Section 3.7 of Savage (1972)), since from the palatable assumption of symmetry one gets the powerful conclusion that the sequence is a mixture of i.i.d. sequences. (6.8) has been proved in a variety of different ways. See Feller, Vol. II (1971), p. 228-229 for a proof that is related to the moment problem. Diaconis and Freedman (1980) have a nice proof that starts with the trivial observation that the distribution of a finite exchangeable sequence X_m, 1 ≤ m ≤ n, has the form p_0H_{0,n} + ··· + p_nH_{n,n}, where H_{m,n} is "drawing without replacement from an urn with m ones and n - m zeros." If m → ∞ and m/n → p then H_{m,n} approaches product measure with density p. (6.8) follows easily from this, and one can get bounds on the rate of convergence.
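The conclusion of (6.8) can be checked exactly for Polya's urn, the classical exchangeable 0-1 sequence. The sketch below is our illustration, not from the text; that the urn started with one ball of each color has Uniform(0,1) mixing measure is a standard fact which we assume here rather than prove. Exact rational arithmetic verifies both exchangeability and the beta-integral formula.

```python
from fractions import Fraction
from math import factorial

def polya_sequence_prob(seq, g=1, r=1):
    # Exact probability that Polya's urn (start: g green = "1", r red = "0";
    # return the drawn ball plus one more of its color) produces `seq`.
    p = Fraction(1)
    for x in seq:
        if x == 1:
            p *= Fraction(g, g + r)
            g += 1
        else:
            p *= Fraction(r, g + r)
            r += 1
    return p

# Exchangeability: any reordering of a sequence has the same probability.
p_perm_1 = polya_sequence_prob([1, 1, 0, 0])
p_perm_2 = polya_sequence_prob([0, 1, 1, 0])

# For the (1,1) urn the mixing measure in (6.8) is Uniform(0,1), so
# P(k ones then n-k zeros) = int_0^1 t^k (1-t)^(n-k) dt = k!(n-k)!/(n+1)!
n, k = 8, 3
urn_prob = polya_sequence_prob([1] * k + [0] * (n - k))
beta_integral = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))
```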
Exercises

6.3. Prove directly from the definition that if X_1, X_2, ... ∈ {0, 1} are exchangeable then

    P(X_1 = 1, ..., X_k = 1 | S_n = m) = (n-k choose n-m) / (n choose m)

6.4. If X_1, X_2, ... ∈ R are exchangeable with EX_i^2 < ∞ then E(X_1X_2) ≥ 0.

6.5. Use the first few lines of the proof of (6.5) to conclude that if X_1, X_2, ... are i.i.d. with EX_i = µ and var(X_i) = σ^2 < ∞ then

    (n choose 2)^{-1} Σ_{1≤i<j≤n} (X_i - X_j)^2 → 2σ^2

4.7. Optional Stopping Theorems
In this section, we will prove a number of results that allow us to conclude that if X_n is a submartingale and M ≤ N are stopping times, then EX_M ≤ EX_N. Example 2.2 shows that this is not always true, but Exercise 4.2 shows it is true if N is bounded, so our attention will be focused on the case of unbounded N.

(7.1) Theorem. If X_n is a uniformly integrable submartingale then for any stopping time N, X_{N∧n} is uniformly integrable.

Proof  X_n^+ is a submartingale, so (4.1) implies EX_{N∧n}^+ ≤ EX_n^+. Since X_n^+ is uniformly integrable, it follows from the remark after the definition that

    sup_n EX_{N∧n}^+ ≤ sup_n EX_n^+ < ∞

Using the martingale convergence theorem (2.10) now gives X_{N∧n} → X_N a.s. (here X_∞ = lim_n X_n) and E|X_N| < ∞. With this established, the rest is easy. We write

    E(|X_{N∧n}|; |X_{N∧n}| > K) = E(|X_N|; |X_N| > K, N ≤ n) + E(|X_n|; |X_n| > K, N > n)

Since E|X_N| < ∞ and X_n is uniformly integrable, if K is large then each term is < ε/2.

From the last computation in the proof of (7.1), we get:

(7.2) Corollary. If E|X_N| < ∞ and X_n1_{(N>n)} is uniformly integrable, then X_{N∧n} is uniformly integrable.
From (7.1), we immediately get:
(7.3) Theorem. If Xn is a uniformly integrable submartingale then for any
stopping time N ≤ ∞, we have EX0 ≤ EXN ≤ EX∞ , where X∞ = lim Xn .
Proof (4.1) implies EX0 ≤ EXN ∧n ≤ EXn . Letting n → ∞ and observing
that (7.1) and (5.3) imply XN ∧n → XN and Xn → X∞ in L1 gives the desired
result.
From (7.3), we get the following useful corollary.
(7.4) The Optional Stopping Theorem. If L ≤ M are stopping times and Y_{M∧n} is a uniformly integrable submartingale, then EY_L ≤ EY_M and

    Y_L ≤ E(Y_M|F_L)

Proof  Use the inequality EX_N ≤ EX_∞ in (7.3) with X_n = Y_{M∧n} and N = L. To prove the second result, let A ∈ F_L and note that

    N = L on A,  N = M on A^c

defines a stopping time by Exercise 1.7 in Chapter 3. Using the first result now shows EY_N ≤ EY_M. Since N = M on A^c, it follows from the last inequality and the definition of conditional expectation that

    E(Y_L; A) ≤ E(Y_M; A) = E(E(Y_M|F_L); A)

Taking A = A_ε = {Y_L - E(Y_M|F_L) > ε}, we conclude P(A_ε) = 0 for all ε > 0, and the desired result follows.

The last result is the one we use the most (usually the first inequality with L = 0). (7.2) is useful in checking the hypothesis. A typical application is the following generalization of Wald's equation, (1.6) in Chapter 3.
(7.5) Theorem. Suppose X_n is a submartingale and E(|X_{n+1} - X_n| | F_n) ≤ B a.s. If N is a stopping time with EN < ∞ then X_{N∧n} is uniformly integrable and hence EX_N ≥ EX_0.

Remark. As usual, using the last result twice shows that if X_n is a martingale then EX_N = EX_0. To recover Wald's equation, let S_n be a random walk, let µ = E(S_n - S_{n-1}), and apply the martingale result to X_n = S_n - nµ.

Proof  We begin by observing that

    |X_{N∧n}| ≤ |X_0| + Σ_{m=0}^∞ |X_{m+1} - X_m| 1_{(N>m)}

To prove uniform integrability, it suffices to show that the right-hand side has finite expectation, for then |X_{N∧n}| is dominated by an integrable r.v. Now {N > m} ∈ F_m, so

    E(|X_{m+1} - X_m|; N > m) = E(E(|X_{m+1} - X_m| | F_m); N > m) ≤ BP(N > m)

and E( Σ_{m=0}^∞ |X_{m+1} - X_m| 1_{(N>m)} ) ≤ B Σ_{m=0}^∞ P(N > m) = B·EN < ∞.
Before we delve further into applications, we pause to prove one last stopping theorem that does not require uniform integrability.

(7.6) Theorem. If X_n is a nonnegative supermartingale and N ≤ ∞ is a stopping time, then EX_0 ≥ EX_N, where X_∞ = lim X_n, which exists by (2.11).

Proof  By (4.1), EX_0 ≥ EX_{N∧n}. The monotone convergence theorem implies

    E(X_N; N < ∞) = lim_{n→∞} E(X_N; N ≤ n)

and Fatou's lemma implies

    E(X_N; N = ∞) ≤ liminf_{n→∞} E(X_n; N > n)

Adding the last two lines and using our first observation,

    EX_N ≤ liminf_{n→∞} EX_{N∧n} ≤ EX_0

Exercise 7.1. If X_n ≥ 0 is a supermartingale then P(sup X_n > λ) ≤ EX_0/λ.
Applications to random walks. For the rest of the section, including all the exercises below, ξ_1, ξ_2, ... are i.i.d., S_n = ξ_1 + ··· + ξ_n, and F_n = σ(ξ_1, ..., ξ_n).

Example 7.1. Asymmetric simple random walk refers to the special case in which P(ξ_i = 1) = p and P(ξ_i = -1) = q ≡ 1 - p.

(a) Suppose 0 < p < 1 and let ϕ(x) = {(1 - p)/p}^x. Then ϕ(S_n) is a martingale.

Proof  Since S_n and ξ_{n+1} are independent, Example 1.5 implies that on {S_n = m},

    E(ϕ(S_{n+1})|F_n) = p·((1-p)/p)^{m+1} + (1-p)·((1-p)/p)^{m-1} = {(1-p) + p}·((1-p)/p)^m = ϕ(S_n)

(b) If we let T_x = inf{n : S_n = x} then for a < 0 < b

    P(T_a < T_b) = (ϕ(b) - ϕ(0))/(ϕ(b) - ϕ(a))

Proof  Let N = T_a ∧ T_b. The first step is to check that N < ∞ a.s. This can be done using a slight modification of the argument in Example 1.5 of Chapter 3 or with the following argument. Since ϕ(S_{N∧n}) is bounded, it is uniformly integrable and (5.6) implies lim_n ϕ(S_{N∧n}) exists a.s. and in L^1. Since convergence to an interior point of (a, b) is impossible, we must have N < ∞ a.s. and

    ϕ(0) = Eϕ(S_N) = P(T_a < T_b)ϕ(a) + P(T_b < T_a)ϕ(b)

Using P(T_a < T_b) + P(T_b < T_a) = 1 and solving gives the indicated result.
(c) Suppose 1/2 < p < 1. If a < 0 then P(min_n S_n ≤ a) = P(T_a < ∞) = {(1 - p)/p}^{-a}. If b > 0 then P(T_b < ∞) = 1.

Proof  Letting b → ∞ and noting ϕ(b) → 0 gives the first result, since T_a < ∞ if and only if T_a < T_b for some b. For the second, note that ϕ(a) → ∞ as a → -∞.

(d) Suppose 1/2 < p < 1. If b > 0 then ET_b = b/(2p - 1).

Proof  X_n = S_n - (p - q)n is a martingale. Since T_b ∧ n is a bounded stopping time, (4.1) implies

    0 = E(S_{T_b∧n} - (p - q)(T_b ∧ n))

Now b ≥ S_{T_b∧n} ≥ min_m S_m and (c) implies E(inf_m S_m) > -∞, so the dominated convergence theorem implies ES_{T_b∧n} → ES_{T_b} = b as n → ∞. The monotone convergence theorem implies E(T_b ∧ n) ↑ ET_b, so we have b = (p - q)ET_b.

Remark. The reader should study the technique in the proof of (d) because it is useful in a number of situations (e.g., the exercises below). We apply (4.1) to the bounded stopping time T_b ∧ n, then let n → ∞, and use appropriate convergence theorems. Here this is an alternative to showing ET_b < ∞ in order to check that X_{T_b∧n} is uniformly integrable.
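Formulas (b) and (d) are easy to confront with simulation. The sketch below is our illustration, not from the text; the parameters p = 0.6, a = -20, b = 5 and the seed are arbitrary, with a taken low so that T_a is almost never reached and the mean exit time approximates ET_b = b/(2p - 1).

```python
import random

random.seed(5)

p, a, b = 0.6, -20, 5
phi = lambda x: ((1 - p) / p) ** x
# From (b): P(T_b < T_a) = 1 - P(T_a < T_b) = (phi(0) - phi(a)) / (phi(b) - phi(a))
predicted_hit_b = (phi(0) - phi(a)) / (phi(b) - phi(a))

trials = 5000
hits_b = 0
total_steps = 0
for _ in range(trials):
    s, t = 0, 0
    while a < s < b:
        s += 1 if random.random() < p else -1
        t += 1
    hits_b += (s == b)
    total_steps += t

frac_b = hits_b / trials
mean_exit_time = total_steps / trials   # ~ ET_b from (d), since a is rarely hit
```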
Exercise 7.2. Let S_n be an asymmetric simple random walk with p > 1/2, and let σ^2 = 1 - (p - q)^2. Use the fact that X_n = (S_n - (p - q)n)^2 - σ^2 n is a martingale to show var(T_1) = (1 - (p - q)^2)/(p - q)^3.

Exercise 7.3. Let S_n be a symmetric simple random walk starting at 0, and let T = inf{n : S_n ∉ (-a, a)} where a is an integer. (i) Use the fact that S_n^2 - n is a martingale to show that ET = a^2. (ii) Find constants b and c so that Y_n = S_n^4 - 6nS_n^2 + bn^2 + cn is a martingale, and use this to compute ET^2.
Exercise 7.4. Suppose ξi is not constant. Let ϕ(θ) = E exp(θξ1) < ∞ for θ ∈ (−δ, δ), and let ψ(θ) = log ϕ(θ). (i) Xn^θ = exp(θSn − nψ(θ)) is a martingale. (ii) ψ is strictly convex. (iii) Show E Xn^θ → 0 and conclude that Xn^θ → 0 a.s.

Exercise 7.5. Let Sn be an asymmetric simple random walk with p ≥ 1/2. Let T1 = inf{n : Sn = 1}. Use the martingale of Exercise 7.4 to conclude (i) if θ > 0 then 1 = e^θ E ϕ(θ)^{−T1}, where ϕ(θ) = pe^θ + qe^{−θ} and q = 1 − p. (ii) Set pe^θ + qe^{−θ} = 1/s and then solve for x = e^{−θ} to get

E s^{T1} = (1 − {1 − 4pqs^2}^{1/2})/2qs
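The generating function identity of Exercise 7.5 can likewise be checked numerically. The sketch below (illustrative values p = 0.6 and s = 0.9, my own choices) compares a Monte Carlo estimate of E s^{T1} with the closed form:

```python
import math
import random

def sample_T1(p, rng):
    """Time for an asymmetric simple random walk started at 0 to first hit 1."""
    s, t = 0, 0
    while s < 1:
        s += 1 if rng.random() < p else -1
        t += 1
    return t

def gen_fn_mc(p, s, trials=20000, seed=7):
    """Monte Carlo estimate of E s^{T1}."""
    rng = random.Random(seed)
    return sum(s ** sample_T1(p, rng) for _ in range(trials)) / trials

def gen_fn_closed(p, s):
    """The closed form (1 - sqrt(1 - 4pq s^2)) / (2qs) from Exercise 7.5."""
    q = 1 - p
    return (1 - math.sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)

mc = gen_fn_mc(0.6, 0.9)
closed = gen_fn_closed(0.6, 0.9)
```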
Exercise 7.6. Suppose ϕ(θo) = E exp(θo ξ1) = 1 for some θo < 0 and ξi is not constant. It follows from the result in Exercise 7.4 that Xn = exp(θo Sn) is a martingale. Let T = inf{n : Sn ∉ (a, b)} and Yn = X_{n∧T}. Use (7.4) to conclude that E XT = 1 and P(ST ≤ a) ≤ exp(−θo a).
Exercise 7.7. Let Sn be the total assets of an insurance company at the end of year n. In year n, premiums totaling c > 0 are received and claims ζn are paid, where ζn is Normal(µ, σ^2) and µ < c. To be precise, if ξn = c − ζn then Sn = S_{n−1} + ξn. The company is ruined if its assets drop to 0 or less. Show that if S0 > 0 is nonrandom, then

P(ruin) ≤ exp(−2(c − µ)S0/σ^2)

Exercise 7.8. Let Zn be a branching process with offspring distribution pk, defined in part d of Section 4.3, and let ϕ(θ) = Σ_k pk θ^k. Suppose ρ < 1 has ϕ(ρ) = ρ. Show that ρ^{Zn} is a martingale and use this to conclude P(Zn = 0 for some n ≥ 1 | Z0 = x) = ρ^x.

5 Markov Chains

The main object of study in this chapter is (temporally homogeneous) Markov
chains on a countable state space S. That is, a sequence of r.v.'s Xn, n ≥ 0, with

P(Xn+1 = j | Fn) = p(Xn, j)

where Fn = σ(X0, . . . , Xn), p(i, j) ≥ 0 and Σ_j p(i, j) = 1. The theory focuses on the asymptotic behavior of p^n(i, j) ≡ P(Xn = j | X0 = i). The basic results are that

(5.2)  lim_{n→∞} (1/n) Σ_{m=1}^{n} p^m(i, j)  exists always

and under a mild assumption called aperiodicity:

(5.5)  lim_{n→∞} p^n(i, j)  exists

In nice situations, i.e., Xn is irreducible and positive recurrent, the limits in (5.2) and (5.5) are a probability distribution that is independent of the starting state i. In words, the chain converges to equilibrium as n → ∞. One of the attractions of Markov chain theory is that these powerful conclusions come out of assumptions that are satisfied in a large number of cases.

5.1. Definitions and Examples
We begin with a very general definition and then gradually specialize to the situation described in the introductory paragraph. Let (S, S) be a measurable space. A sequence of random variables taking values in S is said to be a Markov chain with respect to a filtration Fn if Xn ∈ Fn and for all B ∈ S

P(Xn+1 ∈ B | Fn) = P(Xn+1 ∈ B | Xn)

In words, given the present, the rest of the past is irrelevant for predicting the location of Xn+1. As usual, we turn to an example to illustrate the definition.
Example 1.1. Random walk. Let X0, ξ1, ξ2, . . . ∈ R^d be independent and let Xn = X0 + ξ1 + · · · + ξn. Xn is a Markov chain with respect to Fn = σ(X0, X1, . . . , Xn). (As in the case of a martingale, this is the smallest filtration we can get away with and the one we will usually use.) To prove our claim, we want to show that if µj is the distribution of ξj, then

(∗)  P(Xn+1 ∈ B | Fn) = µ_{n+1}(B − Xn) = P(Xn+1 ∈ B | Xn)

To prove the first equality, we note that Xn and ξn+1 are independent and use Example 1.5 in Chapter 4. The second equality follows from:

(1.1) Lemma. If F ⊂ G and E(X | G) ∈ F, then E(X | F) = E(X | G).

Proof Since F ⊂ G, (1.2) in Chapter 4 implies

E(X | F) = E(E(X | G) | F) = E(X | G)

since E(X | G) ∈ F.
Our next goal is to take the rather abstract object deﬁned at the beginning of the section and, without loss of generality, turn it into something more
concrete and easier to work with. We begin with two deﬁnitions:
A function p : S × S → R is said to be a transition probability if:
(i) For each x ∈ S , A → p(x, A) is a probability measure on (S, S ).
(ii) For each A ∈ S , x → p(x, A) is a measurable function.
We say Xn is a Markov chain (w.r.t. Fn) with transition probabilities pn if

P(Xn+1 ∈ B | Fn) = pn(Xn, B)
When (S, S ) is nice, supposing the existence of a transition probability
entails no loss of generality, since the Markov property asserts that the conditional expectation w.r.t. Fn is the same as w.r.t. σ (Xn ), and then Exercise 1.16
in Chapter 4 implies the existence of a transition probability. Conversely, if we
suppose (S, S ) is a nice space, then given a sequence of transition probabilities
pn and an initial distribution µ on (S, S ), we can deﬁne a consistent set of
ﬁnite dimensional distributions by
(1.2)  P(Xj ∈ Bj, 0 ≤ j ≤ n) = ∫_{B0} µ(dx0) ∫_{B1} p0(x0, dx1) · · · ∫_{Bn} p_{n−1}(x_{n−1}, dxn)
Since we have assumed that (S, S ) is nice, Kolmogorov’s theorem allows us to
construct a probability measure Pµ on sequence space (S {0,1,...} , S {0,1,...} ) so
that the coordinate maps Xn (ω ) = ωn have the desired distributions.
Notation. When µ = δx , a point mass at x, we use Px as an abbreviation for
Pδx . The measures Px are the basic objects because, once they are deﬁned, we
can deﬁne the Pµ (even for inﬁnite measures µ) by
Pµ(A) = ∫ µ(dx) Px(A)

Our next step is to check that Xn is a Markov chain (with respect to Fn =
σ (X0 , X1 , . . . , Xn )). To prove this, we let A = {X0 ∈ B0 , X1 ∈ B1 , . . . , Xn ∈
Bn }, Bn+1 = B , and observe that using the deﬁnition of the integral, the
definition of A, and the definition of Pµ

∫_A 1(Xn+1 ∈ B) dPµ = Pµ(A, Xn+1 ∈ B)
= Pµ(X0 ∈ B0, X1 ∈ B1, . . . , Xn ∈ Bn, Xn+1 ∈ B)
= ∫_{B0} µ(dx0) ∫_{B1} p0(x0, dx1) · · · ∫_{Bn} p_{n−1}(x_{n−1}, dxn) pn(xn, B_{n+1})

We would like to assert that the last expression is

= ∫_A pn(Xn, B) dPµ

To do this, replace pn(xn, B_{n+1}) by a general function f(xn). If f is an indicator function, the desired equality is true. Linearity implies that it is valid for simple functions, and the bounded convergence theorem implies that it is valid for bounded measurable f, e.g., f(x) = pn(x, B_{n+1}).
The collection of sets for which

∫_A 1(Xn+1 ∈ B) dPµ = ∫_A pn(Xn, B) dPµ

holds is a λ-system, and the collection for which it has been proved is a π-system, so it follows from the π − λ theorem, (4.2) in Chapter 1, that the equality is true for all A ∈ Fn. This shows that

P(Xn+1 ∈ B | Fn) = pn(Xn, B)

and it follows from (1.1) that Xn is a Markov chain with transition probabilities pn.
At this point, we have shown that given a sequence of transition probabilities and an initial distribution, we can construct a Markov chain. Conversely,
(1.3) Theorem. If Xn is a Markov chain with transition probabilities pn and
initial distribution µ, then the ﬁnite dimensional distributions are given by
(1.2).
Proof Our first step is to show that if Xn has transition probability pn then for any bounded measurable f

(1.4)  E(f(Xn+1) | Fn) = ∫ pn(Xn, dy) f(y)

The desired conclusion is a consequence of the next result. Let H = the collection of bounded functions for which the identity holds.
(1.5) Monotone class theorem. Let A be a π-system that contains Ω and let H be a collection of real-valued functions that satisfies: (i) If A ∈ A, then 1A ∈ H. (ii) If f, g ∈ H, then f + g and cf ∈ H for any real number c. (iii) If fn ∈ H are nonnegative and increase to a bounded function f, then f ∈ H. Then H contains all bounded functions measurable with respect to σ(A).

Proof The assumption Ω ∈ A, (ii), and (iii) imply that G = {A : 1A ∈ H} is a λ-system, so by (i) and the π − λ theorem ((4.2) in Chapter 1) G ⊃ σ(A). (ii) implies H contains all simple functions, and (iii) implies that H contains all bounded measurable functions.
Returning to our main topic, we observe that familiar properties of conditional expectation and (1.4) imply

E( ∏_{m=0}^{n} fm(Xm) ) = E E( ∏_{m=0}^{n} fm(Xm) | F_{n−1} )
= E( ∏_{m=0}^{n−1} fm(Xm) E(fn(Xn) | F_{n−1}) )
= E( ∏_{m=0}^{n−1} fm(Xm) ∫ p_{n−1}(X_{n−1}, dy) fn(y) )

The last integral is a bounded measurable function of X_{n−1}, so it follows by induction that if µ is the distribution of X0, then

(1.6)  E( ∏_{m=0}^{n} fm(Xm) ) = ∫ µ(dx0) f0(x0) ∫ p0(x0, dx1) f1(x1) · · · ∫ p_{n−1}(x_{n−1}, dxn) fn(xn)

that is, the finite dimensional distributions coincide with those in (1.2).
With (1.3) established, it follows that we can describe a Markov chain by
giving a sequence of transition probabilities pn . Having done this, we can and
will suppose that the random variables Xn are the coordinate maps (Xn (ω ) =
ωn ) on sequence space
(Ωo , F ) = (S {0,1,...} , S {0,1,...} )
We choose this representation because it gives us two advantages in investigating
the Markov chain: (i) For each initial distribution µ we have a measure Pµ
deﬁned by (1.2) that makes Xn a Markov chain with Pµ (X0 ∈ A) = µ(A). (ii)
We have the shift operators θn deﬁned in Section 3.1: (θn ω )(m) = ωm+n .
At this point, we have achieved our aim announced earlier in the section of
taking the abstract deﬁnition and turning it into something easier to work with.
Now we will make one further simplification, this time with a loss of generality,
and restrict our attention to the temporally homogeneous case, in which
the transition probability does not depend on n.
Having decided on the framework in which we will investigate things, we
can ﬁnally give some more examples. In each case, S is a countable set and
S = all subsets of S. Let p(i, j) ≥ 0 and suppose Σ_j p(i, j) = 1 for all i. Intuitively, p(i, j) = P(Xn+1 = j | Xn = i). From p(i, j) we can define a transition probability by

p(i, A) = Σ_{j∈A} p(i, j)

We will now give five concrete examples that will be our constant companions
as the story unfolds. In each case we will just give the transition probability
since it is enough to describe the Markov chain.
Example 1.2. Branching processes. S = {0, 1, 2, . . .} and

p(i, j) = P( Σ_{m=1}^{i} ξm = j )

where ξ1, ξ2, . . . are i.i.d. nonnegative integer-valued random variables. In words
each of the i individuals at time n (or in generation n) gives birth to an independent and identically distributed number of oﬀspring. Here and in the next
four examples, we take the approach that the chain is defined by giving its
transition probability. To make the connection with our earlier discussion of
branching processes, do:
Exercise 1.1. Let Zn be the process deﬁned in part d of Section 4.3. Check
that Zn is a Markov chain with the indicated transition probability.
Example 1.3. Renewal chain. S = {0, 1, 2, . . .}, fk ≥ 0, and Σ_{k=1}^{∞} fk = 1.

p(0, j) = f_{j+1}    for j ≥ 0
p(i, i − 1) = 1      for i ≥ 1
p(i, j) = 0          otherwise

To explain the definition, let ξ1, ξ2, . . . be i.i.d. with P(ξm = j) = fj, let T0 = i0
and for k ≥ 1 let Tk = Tk−1 + ξk . Tk is the time of the k th arrival in a renewal
process that has its ﬁrst arrival at time i0 . Let
Ym = 1 if m ∈ {T0, T1, T2, . . .},  Ym = 0 otherwise

and let Xn = inf{m − n : m ≥ n, Ym = 1}. Ym = 1 if a renewal occurs at time
m, and Xn is the amount of time until the ﬁrst renewal ≥ n. It is clear that
if Xn = i > 0 then Xn+1 = i − 1. When Xn = 0, we have TNn = n, where
Nn = inf {k : Tk ≥ n} is a stopping time, so (1.3) of Chapter 3 implies ξNn +1
is independent of σ (X0 , ξ1 , . . . , ξNn ) ⊃ σ (X0 , . . . , Xn ). We have p(0, j ) = fj +1
since ξNn +1 = j + 1 implies Xn+1 = j .
Example 1.4. M/G/1 queue. In this model, customers arrive according to
a Poisson process with rate λ. (M is for Markov and refers to the fact that in a
Poisson process the number of arrivals in disjoint time intervals is independent.)
Each customer requires an independent amount of service with distribution F .
(G is for general service distribution. 1 indicates that there is one server.)
Let Xn be the number of customers waiting in the queue at the time the nth
customer enters service. To be precise, when X0 = x, the chain starts with x
people waiting in line and customer 0 just beginning her service.
The ﬁrst paragraph is for motivation only. To deﬁne our Markov chain Xn ,
let
ak = ∫_0^∞ e^{−λt} (λt)^k/k! dF(t)
be the probability that k customers arrive during a service time. Let ξ1 , ξ2 , . . .
be i.i.d. with P (ξi = k − 1) = ak . We think of ξi as the net number of customers
to arrive during the ith service time, subtracting one for the customer who
completed service, so we deﬁne Xn by
(∗)  X_{n+1} = (Xn + ξ_{n+1})^+
The positive part only takes eﬀect when Xn = 0 and ξn+1 = −1 and reﬂects the
fact that when the queue has size 0 and no one arrives during the service time
the next queue size is 0, since we do not start counting until the next customer
arrives and then the queue length will be 0. It is easy to see that the sequence
deﬁned in (∗) is a Markov chain with transition probability
p(0, 0) = a0 + a1
p(j, j − 1 + k) = ak   if j ≥ 1 or k > 1

The formula for ak is rather complicated, and its exact form is not important,
so we will simplify things by assuming only that ak > 0 for all k ≥ 0 and Σ_{k≥0} ak = 1.
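The text keeps the service distribution F general. As an illustration only, if one assumes F is exponential with rate µ (an assumption not made in the text), the integral defining ak can be evaluated in closed form and yields a geometric distribution. The sketch below checks that closed form against direct numerical integration:

```python
import math

lam, mu = 1.0, 2.0   # arrival rate, and rate of the assumed exponential service law

def a_k_closed(k):
    """Closed form of a_k when F is exponential(mu): a geometric distribution."""
    return (mu / (lam + mu)) * (lam / (lam + mu)) ** k

def a_k_numeric(k, T=40.0, n=40000):
    """Trapezoid rule for a_k = int_0^inf e^{-lam t} (lam t)^k / k! dF(t)."""
    dt = T / n
    total = 0.0
    for i in range(n + 1):
        t = i * dt
        f = math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
        f *= mu * math.exp(-mu * t)          # dF(t) = mu e^{-mu t} dt
        total += (0.5 if i in (0, n) else 1.0) * f * dt
    return total
```

With λ = 1 and µ = 2 this gives ak = (2/3)(1/3)^k, consistent with the assumption ak > 0 for all k.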
Example 1.5. Ehrenfest chain. S = {0, 1, . . . , r}
p(k, k + 1) = (r − k )/r
p(k, k − 1) = k/r
p(i, j) = 0 otherwise
In words, there is a total of r balls in two urns; k in the ﬁrst and r − k in
the second. We pick one of the r balls at random and move it to the other
urn. Ehrenfest used this to model the division of air molecules between two
chambers (of equal size and shape) that are connected by a small hole. For an
interesting account of this chain, see Kac (1947a).
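A property of the Ehrenfest chain worth previewing (it is not established at this point in the text, but is classical and easy to verify numerically): the Binomial(r, 1/2) distribution π is stationary, i.e., πp = π. A sketch with r = 6:

```python
from math import comb

r = 6
# transition probabilities of the Ehrenfest chain on {0, 1, ..., r}
p = [[0.0] * (r + 1) for _ in range(r + 1)]
for k in range(r + 1):
    if k < r:
        p[k][k + 1] = (r - k) / r   # a ball moves into the first urn
    if k > 0:
        p[k][k - 1] = k / r         # a ball moves out of the first urn

pi = [comb(r, k) / 2 ** r for k in range(r + 1)]   # Binomial(r, 1/2) weights
pi_p = [sum(pi[j] * p[j][k] for j in range(r + 1)) for k in range(r + 1)]
```

The vector pi_p reproduces pi to machine precision, in line with the intuition that each of the r balls is equally likely to be in either urn in equilibrium.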
Example 1.6. Birth and death chains. S = {0, 1, 2, . . .}. These chains are
defined by the restriction p(i, j) = 0 when |i − j| > 1. The fact that these
processes cannot jump over any integers makes it particularly easy to compute
things for them.
That should be enough examples for the moment. We conclude this section
with some simple calculations. For a Markov chain on a countable state space,
(1.2) says

Pµ(Xk = ik, 0 ≤ k ≤ n) = µ(i0) ∏_{m=1}^{n} p(i_{m−1}, i_m)

When n = 1,

Pµ(X1 = j) = Σ_i µ(i) p(i, j) = µp(j)

i.e., the product of the row vector µ with the matrix p. When n = 2,

Pi(X2 = k) = Σ_j p(i, j) p(j, k) = p^2(i, k)

i.e., the second power of the matrix p. Combining the two formulas and generalizing,

Pµ(Xn = j) = Σ_i µ(i) p^n(i, j) = µp^n(j)
Exercises
1.2. Suppose S = {1, 2, 3} and

p =
 .1   0  .9
 .7  .3   0
  0  .4  .6

Compute p^2(1, 2) and p^3(2, 3) by considering the different ways to get from 1
to 2 in two steps and from 2 to 3 in three steps.
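The path-counting asked for in Exercise 1.2 can be cross-checked by multiplying the matrix directly, as a sketch (0-based indices, so the book's p^2(1, 2) is entry [0][1] of p^2 and p^3(2, 3) is entry [1][2] of p^3):

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# states relabeled 0, 1, 2 (the book's 1, 2, 3)
p = [[.1,  0, .9],
     [.7, .3,  0],
     [ 0, .4, .6]]

p2 = mat_mul(p, p)
p3 = mat_mul(p2, p)
# p2[0][1] and p3[1][2] should agree with the path-by-path computation
```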
1.3. Suppose S = {0, 1} and

p =
 1−α    α
   β  1−β

Use induction to show that

Pµ(Xn = 0) = β/(α + β) + (1 − α − β)^n ( µ(0) − β/(α + β) )

1.4. Let ξ0, ξ1, . . . be i.i.d. ∈ {H, T}, taking each value with probability 1/2.
Show that Xn = (ξn, ξn+1) is a Markov chain and compute its transition probability p. What is p^2?
1.5. Brother-sister mating. In this scheme, two animals are mated, and
among their direct descendants two individuals of opposite sex are selected at
random. These animals are mated and the process continues. Suppose each
individual can be one of three genotypes AA, Aa, aa, and suppose that the
type of the oﬀspring is determined by selecting a letter from each parent. With
these rules, the pair of genotypes in the nth generation is a Markov chain with
six states:
AA, AA AA, Aa AA, aa Aa, Aa Aa, aa aa, aa Compute its transition probability.
1.6. Let ξ1, ξ2, . . . be i.i.d. ∈ {1, 2, . . . , N}, taking each value with probability 1/N. Show that Xn = {ξ1, . . . , ξn} is a Markov chain and compute its
transition probability.
1.7. Let ξ1 , ξ2 , . . . be i.i.d. ∈ {−1, 1}, taking each value with probability 1/2.
Let S0 = 0, Sn = ξ1 + · · · + ξn and Xn = max{Sm : 0 ≤ m ≤ n}. Show that Xn
is not a Markov chain.
1.8. Let θ, U1 , U2 , ... be independent and uniform on (0, 1). Let Xi = 1 if
Ui ≤ θ, = −1 if Ui > θ, and let Sn = X1 + · · · + Xn . In words, we ﬁrst pick
θ according to the uniform distribution and then ﬂip a coin with probability
θ of heads to generate a random walk. (i) Compute P(Xn+1 = 1 | X1, . . . , Xn) and (ii) conclude Sn is a temporally inhomogeneous Markov chain. This is due to the fact that “Sn is a sufficient statistic for estimating θ.” The answer to (i) is the estimator of a Bayesian who starts at time n with a uniform prior on [0, 1].

5.2. Extensions of the Markov Property
If Xn is a Markov chain with transition probability p, then by deﬁnition,
P(Xn+1 ∈ B | Fn) = p(Xn, B)
In this section, we will prove two extensions of the last equality in which
{Xn+1 ∈ B } is replaced by a bounded function of the future, h(Xn , Xn+1 , . . .),
and n is replaced by a stopping time N . These results, especially the second,
will be the keys to developing the theory of Markov chains.
As mentioned in Section 5.1, we can and will suppose that the Xn are the
coordinate maps on sequence space
(Ωo , F ) = (S {0,1,...} , S {0,1,...} )
Fn = σ (X0 , X1 , . . . , Xn ), and for each initial distribution µ we have a measure
Pµ deﬁned by (1.2) that makes Xn a Markov chain with Pµ (X0 ∈ A) = µ(A).
Deﬁne the shift operators θn : Ωo → Ωo by (θn ω )(m) = ω (m + n).
(2.1) The Markov property. Let Y : Ωo → R be bounded and measurable.
Eµ(Y ◦ θn | Fn) = E_{Xn} Y

Remark. Here the subscript µ on the left-hand side indicates that the conditional expectation is taken with respect to Pµ. The right-hand side is the function ϕ(x) = Ex Y evaluated at x = Xn. To make the connection with the introduction of this section, let

Y(ω) = h(ω0, ω1, . . .)
We denote the function by Y , a letter usually used for random variables, because
that’s exactly what Y is, a measurable function deﬁned on our probability space
Ωo .
Proof We begin by proving the result in a special case and then use the π − λ
and monotone class theorems to get the general result. Let A = {ω : ω0 ∈
A0 , . . . , ωm ∈ Am } and g0 , . . . , gn be bounded and measurable. Applying (1.6)
with fk = 1Ak for k < m, fm = 1Am g0 , and fk = gk−m for m < k ≤ m + n
gives

Eµ( ∏_{k=0}^{n} gk(X_{m+k}); A ) = ∫_{A0} µ(dx0) ∫_{A1} p(x0, dx1) · · · ∫_{Am} p(x_{m−1}, dxm) g0(xm)
    · ∫ p(xm, dx_{m+1}) g1(x_{m+1}) · · · ∫ p(x_{m+n−1}, dx_{m+n}) gn(x_{m+n})

= Eµ( E_{Xm}( ∏_{k=0}^{n} gk(Xk) ); A )

The collection of sets for which the last formula holds is a λ-system, and the collection for which it has been proved is a π-system, so using the π − λ theorem,
(4.2) in Chapter 1, shows that the last identity holds for all A ∈ Fm .
Fix A ∈ Fm and let H be the collection of bounded measurable Y for which
(∗) Eµ (Y ◦ θm ; A) = Eµ (EXm Y ; A) The last computation shows that (∗) holds when
Y (ω ) = gk (ωk )
0≤k≤n To ﬁnish the proof, we will apply the monotone class theorem (1.5). Let A be
the collection of sets of the form {ω : ω0 ∈ A0 , . . . , ωk ∈ Ak }. A is a π system,
so taking gk = 1Ak shows (i) of (1.5) holds. H clearly has properties (ii) and
(iii), so (1.5) implies that H contains the bounded functions measurable w.r.t
σ (A), and the proof is complete.
Exercise 2.1. Use the Markov property to show that if A ∈ σ (X0 , . . . , Xn )
and B ∈ σ(Xn, Xn+1, . . .), then for any initial distribution µ

Pµ(A ∩ B | Xn) = Pµ(A | Xn) Pµ(B | Xn)

In words, the past and future are conditionally independent given the present. Hint: Write the left-hand side as Eµ(Eµ(1A 1B | Fn) | Xn).
The next two results illustrate the use of (2.1). We will see many other
applications below.
(2.2) Chapman-Kolmogorov equation.

Px(X_{m+n} = z) = Σ_y Px(Xm = y) Py(Xn = z)

Proof Px(X_{n+m} = z) = Ex(Px(X_{n+m} = z | Fm)) = Ex(P_{Xm}(Xn = z)) by the Markov property (2.1) since 1_{(Xn = z)} ◦ θm = 1_{(X_{n+m} = z)}.
(2.3) Theorem. Let Xn be a Markov chain and suppose

P( ∪_{m=n+1}^{∞} {Xm ∈ Bm} | Xn ) ≥ δ > 0 on {Xn ∈ An}

Then P({Xn ∈ An i.o.} − {Xn ∈ Bn i.o.}) = 0.
Remark. To quote Chung, “The intuitive meaning of the preceding theorem
has been given by Doeblin as follows: if the chance of a pedestrian’s getting
run over is greater than δ > 0 each time he crosses a certain street, then he will
not be crossing it indeﬁnitely (since he will be killed ﬁrst)!”
Proof Let Λn = {Xn+1 ∈ Bn+1} ∪ {Xn+2 ∈ Bn+2} ∪ . . ., Λ = ∩Λn = {Xn ∈ Bn i.o.}, and Γ = {Xn ∈ An i.o.}. Let Fn = σ(X0, X1, . . . , Xn) and F∞ = σ(∪Fn). Using the Markov property and the dominated convergence theorem for conditional expectations, (5.9) in Chapter 4,

E(1_{Λn} | Xn) = E(1_{Λn} | Fn) → E(1_Λ | F∞) = 1_Λ

On Γ, the left-hand side is ≥ δ i.o. This is only possible if Γ ⊂ Λ.
Exercise 2.2. A state a is called absorbing if Pa (X1 = a) = 1. Let D =
{Xn = a for some n ≥ 1} and let h(x) = Px (D). (i) Use (2.3) to conclude that
h(Xn ) → 0 a.s. on Dc . Here a.s. means Pµ a.s. for any initial distribution µ.
(ii) Obtain the result in Exercise 5.5 in Chapter 4 as a special case.
We are now ready for our second extension of the Markov property. Recall
N is said to be a stopping time if {N = n} ∈ Fn . As in Chapter 3, let
FN = {A : A ∩ {N = n} ∈ Fn for all n}
be the information known at time N , and let
θN ω = θn ω on {N = n},  θN ω = ∆ on {N = ∞}

where ∆ is an extra point that we add to Ωo. In (2.4) and its applications, we
will explicitly restrict our attention to {N < ∞}, so the reader does not have
to worry about the second part of the deﬁnition of θN .
(2.4) Strong Markov property. Suppose that for each n, Yn : Ωo → R is measurable and |Yn| ≤ M for all n. Then

Eµ(YN ◦ θN | FN) = E_{XN} YN on {N < ∞}

where the right-hand side is ϕ(x, n) = Ex Yn evaluated at x = XN, n = N.
Proof Let A ∈ FN.

Eµ(YN ◦ θN; A ∩ {N < ∞}) = Σ_{n=0}^{∞} Eµ(Yn ◦ θn; A ∩ {N = n})

Since A ∩ {N = n} ∈ Fn, using (2.1) now converts the right side into

Σ_{n=0}^{∞} Eµ(E_{Xn} Yn; A ∩ {N = n}) = Eµ(E_{XN} YN; A ∩ {N < ∞})

Remark. The reader should notice that the proof is trivial. All we do is break
things down according to the value of N , replace N by n, apply the Markov
property (2.1), and reverse the process. This is the standard technique for
proving results about stopping times.
The next example illustrates the use of (2.4), and explains why we want to
allow the Y that we apply to the shifted path to depend on n.
(2.5) Reﬂection principle. Let ξ1 , ξ2 , . . . be independent and identically distributed with a distribution that is symmetric about 0. Let Sn = ξ1 + · · · + ξn .
If a > 0 then

P( sup_{m≤n} Sm > a ) ≤ 2 P(Sn > a)
Remark. First, a trivial comment: The strictness of the inequality is not
important. If the result holds for >, it holds for ≥ and vice versa.
A second more important one: We do the proof in two steps because that
is how formulas like this are derived in practice. First, one computes intuitively
and then ﬁgures out how to extract the desired formula from (2.4).
Proof in words First note that if Z has a distribution that is symmetric about 0, then

P(Z ≥ 0) ≥ P(Z > 0) + (1/2) P(Z = 0) = 1/2

If we let N = inf{m ≤ n : Sm > a} (with inf ∅ = ∞), then on {N < ∞}, Sn − SN is independent of SN and has P(Sn − SN ≥ 0) ≥ 1/2. So

P(Sn > a) ≥ (1/2) P(N ≤ n)

Formal Proof Let Ym(ω) = 1 if m ≤ n and ω_{n−m} > a, Ym(ω) = 0 otherwise.
The deﬁnition of Ym is chosen so that (YN ◦ θN )(ω ) = 1 if ωn > a (and hence
N ≤ n), and = 0 otherwise. The strong Markov property implies
E0(YN ◦ θN | FN) = E_{SN} YN on {N < ∞} = {N ≤ n}

To evaluate the right-hand side, we note that if y > a, then
Ey Ym = Py (Sn−m > a) ≥ Py (Sn−m ≥ y ) ≥ 1/2
So integrating over {N ≤ n} and using the deﬁnition of conditional expectation
gives
(1/2) P(N ≤ n) ≤ E0(E0(YN ◦ θN | FN); N ≤ n) = E0(YN ◦ θN; N ≤ n)
since {N ≤ n} ∈ FN . Recalling that YN ◦ θN = 1{Sn >a} , the last quantity
= E0 (1{Sn >a} ; N ≤ n) = P0 (Sn > a)
since {Sn > a} ⊂ {N ≤ n}.
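For symmetric simple random walk the inequality can be verified exactly by enumerating all 2^n paths; the sketch below (n = 12 and a = 3 are arbitrary choices of mine) computes both sides:

```python
from itertools import product, accumulate

def reflection_check(n, a):
    """Exact P(max_{m<=n} S_m > a) and 2 P(S_n > a) for +-1 steps, by enumeration."""
    total = 2 ** n
    hit = ended_above = 0
    for steps in product((-1, 1), repeat=n):
        partial = list(accumulate(steps))   # S_1, ..., S_n
        hit += max(partial) > a
        ended_above += partial[-1] > a
    return hit / total, 2 * ended_above / total

lhs, rhs = reflection_check(12, 3)
```

The enumeration confirms lhs ≤ rhs; for this walk the two sides differ only by a boundary term coming from paths that end exactly at a + 1.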
Exercises
The next five exercises concern the hitting times

τA = inf{n ≥ 0 : Xn ∈ A},  τy = τ{y}
TA = inf{n ≥ 1 : Xn ∈ A},  Ty = T{y}

To keep the two definitions straight, note that the symbol τ is smaller than T. Some of the results below are valid for a general S, but for simplicity we will suppose throughout that S is countable.
2.3. First entrance decomposition. Let Ty = inf{n ≥ 1 : Xn = y}. Show that

p^n(x, y) = Σ_{m=1}^{n} Px(Ty = m) p^{n−m}(y, y)

2.4. Show that Σ_{m=0}^{n} Px(Xm = x) ≥ Σ_{m=k}^{n+k} Px(Xm = x).

2.5. Suppose that S − C is finite and for each x ∈ S − C, Px(τC < ∞) > 0. Then there is an N < ∞ and an ε > 0 so that Py(τC > kN) ≤ (1 − ε)^k.
2.6. Let h(x) = Px (τA < τB ). Suppose A ∩ B = ∅, S − (A ∪ B ) is ﬁnite, and
Px(τA∪B < ∞) > 0 for all x ∈ S − (A ∪ B). (i) Show that

(∗)  h(x) = Σ_y p(x, y) h(y)  for x ∉ A ∪ B

(ii) Show that if h satisfies (∗) then h(X(n ∧ τA∪B)) is a martingale. (iii) Use
this and Exercise 2.5 to conclude that h(x) = Px (τA < τB ) is the only solution
of (∗) that is 1 on A and 0 on B.
2.7. Let Xn be a Markov chain with S = {0, 1, . . . , N } and suppose that Xn is
a martingale and Px (τ0 ∧ τN < ∞) > 0 for all x. (i) Show that 0 and N are
absorbing states, i.e., p(0, 0) = p(N, N ) = 1. (ii) Show Px (τN < τ0 ) = x/N.
2.8. Genetics chains. Suppose S = {0, 1, . . . , N} and consider

(i)  p(i, j) = C(N, j) (i/N)^j (1 − i/N)^{N−j}
(ii) p(i, j) = C(2i, j) C(2N − 2i, N − j) / C(2N, N)

where C(n, k) denotes the binomial coefficient. Show that these chains satisfy the hypotheses of Exercise 2.7.
2.9. In the brother-sister mating described in Exercise 1.5, (AA, AA) and (aa, aa) are
absorbing states. Show that the number of A’s in the pair is a martingale and
use this to compute the probability of getting absorbed in AA, AA starting from
each of the states.
2.10. Let τA = inf {n ≥ 0 : Xn ∈ A} and g (x) = Ex τA . Suppose that S − A is
ﬁnite and for each x ∈ S − A, Px (τA < ∞) > 0. (i) Show that
(∗)  g(x) = 1 + Σ_y p(x, y) g(y)  for x ∉ A
(ii) Show that if g satisﬁes (∗), g (X (n ∧ τA )) + n ∧ τA is a martingale. (iii) Use
this to conclude that g (x) = Ex τA is the only solution of (∗) that is 0 on A.
2.11. Let ξ0 , ξ1 , . . . be i.i.d. ∈ {H, T }, taking each value with probability 1/2,
and let Xn = (ξn , ξn+1 ) be the Markov chain from Exercise 1.4. Let N1 =
inf {n ≥ 0 : (ξn , ξn+1 ) = (H, H )}. Use the results in the last exercise to compute
EN1 . [No, there is no missing subscript on E , but you will need to ﬁrst compute
g (x).]
2.12. Consider the Markov chain on {1, 2, . . . , N } with pij = 1/(i − 1) when
j < i, p11 = 1 and pij = 0 otherwise. We claim that
Ek T1 = 1 + 1/2 + · · · + 1/(k − 1)
Prove this by (i) using Exercise 2.10, OR (ii) letting Ij = 1 if Xn visits j, noticing that if X0 = k, T1 = I1 + · · · + I_{k−1} where I1, I2, . . . , I_{k−1} are independent.

5.3. Recurrence and Transience
In this section and the next two, we will consider only Markov chains on a countable state space. Let Ty^0 = 0, and for k ≥ 1, let

Ty^k = inf{n > Ty^{k−1} : Xn = y}

Ty^k is the time of the kth return to y. The reader should note that Ty^1 > 0, so any visit at time 0 does not count. We adopt this convention so that if we let Ty = Ty^1 and ρxy = Px(Ty < ∞), then

(3.1) Theorem. Px(Ty^k < ∞) = ρxy ρyy^{k−1}.

Intuitively, in order to make k visits to y, we first have to go from x to y and then return k − 1 times to y.
Proof When k = 1, the result is trivial, so we suppose k ≥ 2. Let Y(ω) = 1 if ωn = y for some n ≥ 1, Y(ω) = 0 otherwise. If N = Ty^{k−1} then Y ◦ θN = 1 if Ty^k < ∞. The strong Markov property (2.4) implies

Ex(Y ◦ θN | FN) = E_{XN} Y on {N < ∞}

On {N < ∞}, XN = y, so the right-hand side is Py(Ty < ∞) = ρyy, and it follows that

Px(Ty^k < ∞) = Ex(Y ◦ θN; N < ∞) = Ex(Ex(Y ◦ θN | FN); N < ∞)
= Ex(ρyy; N < ∞) = ρyy Px(Ty^{k−1} < ∞)
The result now follows by induction.
A state y is said to be recurrent if ρyy = 1 and transient if ρyy < 1. If y is recurrent then (3.1) implies Py(Ty^k < ∞) = 1 for all k, so Py(Xn = y i.o.) = 1.

Exercise 3.1. Suppose y is recurrent and for k ≥ 0, let Rk = Ty^k be the time of the kth return to y, and for k ≥ 1 let rk = Rk − R_{k−1} be the kth interarrival time. Use the strong Markov property to conclude that under Py, the vectors vk = (rk, X_{R_{k−1}}, . . . , X_{Rk − 1}), k ≥ 1 are i.i.d.

If y is transient and we let N(y) = Σ_{n=1}^{∞} 1(Xn = y) be the number of visits to y at positive times, then

(3.2)  Ex N(y) = Σ_{k=1}^{∞} Px(N(y) ≥ k) = Σ_{k=1}^{∞} Px(Ty^k < ∞) = Σ_{k=1}^{∞} ρxy ρyy^{k−1} = ρxy/(1 − ρyy) < ∞

Combining the last computation with our result for recurrent states gives a
result that generalizes (2.2) from Chapter 3.
(3.3) Theorem. y is recurrent if and only if Ey N (y ) = ∞.
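For transient y, (3.2) gives Ex N(y) = ρxy/(1 − ρyy), which is easy to illustrate numerically. The two-state chain below is my own example, not the text's: state 0 is absorbing, state 1 has ρ11 = 1/2, so E1 N(1) should equal 1:

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# state 0 absorbing, state 1 transient with rho_11 = 1/2
p = [[1.0, 0.0],
     [0.5, 0.5]]

visits = 0.0
pn = [row[:] for row in p]
for _ in range(200):          # E_1 N(1) = sum over n >= 1 of p^n(1, 1)
    visits += pn[1][1]
    pn = mat_mul(pn, p)

rho = 0.5
predicted = rho / (1 - rho)   # formula (3.2) with x = y = 1: equals 1 here
```

Truncating the sum at 200 terms costs only about 2^{-200}, so the agreement is essentially exact.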
Exercise 3.2. Let a ∈ S, fn = Pa(Ta = n), and un = Pa(Xn = a). (i) Show that un = Σ_{1≤m≤n} fm u_{n−m}. (ii) Let u(s) = Σ_{n≥0} un s^n, f(s) = Σ_{n≥1} fn s^n, and show u(s) = 1/(1 − f(s)). Setting s = 1 gives (3.2) for x = y = a.
Exercise 3.3. Consider asymmetric simple random walk on Z, i.e., we have
p(i, i + 1) = p, p(i, i − 1) = q = 1 − p. In this case,
p^{2m}(0, 0) = C(2m, m) p^m q^m  and  p^{2m+1}(0, 0) = 0

(i) Use the Taylor series expansion for h(x) = (1 − x)^{−1/2} to show u(s) = (1 − 4pqs^2)^{−1/2} and use the last exercise to conclude f(s) = 1 − (1 − 4pqs^2)^{1/2}.
(ii) Set s = 1 to get the probability the random walk will return to 0 and check
that this is the same as the answer given in part (c) of Example 7.1 of Chapter
4.
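The identities in Exercise 3.3 can be checked numerically at s = 1 (the choice p = 0.7 is arbitrary). Summing the series for u(1) term by term via the ratio C(2m + 2, m + 1)/C(2m, m) = (2m + 2)(2m + 1)/(m + 1)^2 keeps everything in floating point:

```python
from math import sqrt

p_, q_ = 0.7, 0.3

# u(1) = sum over m of C(2m, m) (pq)^m, accumulated via the term ratio
u1, term = 0.0, 1.0
for m in range(300):
    u1 += term
    term *= p_ * q_ * (2 * m + 2) * (2 * m + 1) / ((m + 1) ** 2)

u1_closed = (1 - 4 * p_ * q_) ** -0.5      # should match u1
return_prob = 1 - sqrt(1 - 4 * p_ * q_)    # f(1): chance of ever returning to 0
```

Since sqrt(1 − 4pq) = |p − q|, the return probability comes out to 1 − |p − q|, in agreement with part (c) of Example 7.1 in Chapter 4.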
The next result shows that recurrence is contagious.
(3.4) Theorem. If x is recurrent and ρxy > 0 then y is recurrent and ρyx = 1.
Proof We will ﬁrst show ρyx = 1 by showing that if ρxy > 0 and ρyx < 1
then ρxx < 1. Let K = inf {k : pk (x, y ) > 0}. There is a sequence y1 , . . . , yK −1
so that
p(x, y1 )p(y1 , y2 ) · · · p(yK −1 , y ) > 0
Since K is minimal, yi = x for 1 ≤ i ≤ K − 1. If ρyx < 1, we have
Px (Tx = ∞) ≥ p(x, y1 )p(y1 , y2 ) · · · p(yK −1 , y )(1 − ρyx ) > 0
a contradiction. So ρyx = 1.
To prove that y is recurrent, observe that ρyx > 0 implies there is an L so
that p^L(y, x) > 0. Now

p^{L+n+K}(y, y) ≥ p^L(y, x) p^n(x, x) p^K(x, y)

Summing over n, we see

Σ_{n=1}^{∞} p^{L+n+K}(y, y) ≥ p^L(y, x) p^K(x, y) Σ_{n=1}^{∞} p^n(x, x) = ∞
n=1 so (3.3) implies y is recurrent.
Exercise 3.4. Use the strong Markov property to show that ρxz ≥ ρxy ρyz .
The next fact will help us identify recurrent states in examples. First we
need two deﬁnitions. C is closed if x ∈ C and ρxy > 0 implies y ∈ C . The
name comes from the fact that if C is closed and x ∈ C then Px (Xn ∈ C ) = 1
for all n. D is irreducible if x, y ∈ D implies ρxy > 0.
(3.5) Theorem. Let C be a ﬁnite closed set. Then C contains a recurrent
state. If C is irreducible then all states in C are recurrent.
Proof In view of (3.4), it suﬃces to prove the ﬁrst claim. Suppose it is false.
Then for all y ∈ C , ρyy < 1 and Ex N (y ) = ρxy /(1 − ρyy ), but this is ridiculous
since it implies
∞ > Σ_{y∈C} Ex N(y) = Σ_{y∈C} Σ_{n=1}^{∞} p^n(x, y) = Σ_{n=1}^{∞} Σ_{y∈C} p^n(x, y) = Σ_{n=1}^{∞} 1 = ∞

The first inequality follows from the fact that C is finite and the last equality
from the fact that C is closed.
To illustrate the use of the last result consider:

Example 3.1. Let Xn be a Markov chain with transition matrix

      1    2    3    4    5    6
 1    0    1    0    0    0    0
 2   .4   .6    0    0    0    0
 3   .3    0   .4   .2   .1    0
 4    0    0    0   .3   .7    0
 5    0    0    0   .5    0   .5
 6    0    0    0   .8    0   .2

Looking at the matrix, we see that:
(i) ρ34 > 0 and ρ43 = 0 so 3 must be transient, or we would contradict (3.4).
(ii) {1, 2} and {4, 5, 6} are irreducible closed sets, so (3.5) implies these states
are recurrent.
The last reasoning can be used to identify transient and recurrent states
when S is ﬁnite since for x ∈ S either: (i) there is a y with ρxy > 0 and ρyx = 0
and x must be transient, or (ii) ρxy > 0 implies ρyx > 0 . In case (ii), Exercise
3.4 implies Cx = {y : ρxy > 0} is an irreducible closed set. (If y, z ∈ Cx then
ρyz ≥ ρyx ρxz > 0. If ρyw > 0 then ρxw ≥ ρxy ρyw > 0, so w ∈ Cx .) So (3.5)
implies x is recurrent.
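The case analysis just described can be automated for a finite chain: compute the relation ρxy > 0 as the transitive closure of the one-step graph, then apply the criterion that x is recurrent iff every state it reaches leads back to it (valid because S is finite). A sketch using the matrix of Example 3.1, 0-indexed so the book's state 3 is index 2:

```python
P = [
    [0, 1, 0, 0, 0, 0],
    [.4, .6, 0, 0, 0, 0],
    [.3, 0, .4, .2, .1, 0],
    [0, 0, 0, .3, .7, 0],
    [0, 0, 0, .5, 0, .5],
    [0, 0, 0, .8, 0, .2],
]
n = len(P)

# reach[x][y] holds iff rho_xy > 0: Warshall's transitive closure of the graph
reach = [[P[x][y] > 0 for y in range(n)] for x in range(n)]
for k in range(n):
    for i in range(n):
        for j in range(n):
            reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

# finite S: x is recurrent iff everything x reaches can reach x back
recurrent = [x for x in range(n)
             if all(not reach[x][y] or reach[y][x] for y in range(n))]
transient = [x for x in range(n) if x not in recurrent]
```

This reports index 2 (the book's state 3) as the only transient state, recovering the classification of Example 3.1.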
Exercise 3.5. Show that in the Ehrenfest chain (Example 1.5), all states are
recurrent.
Example 3.1 motivates the following:
(3.6) Decomposition theorem. Let R = {x : ρxx = 1} be the recurrent
states of a Markov chain. R can be written as ∪i Ri , where each Ri is closed
and irreducible.
Remark. This result shows that for the study of recurrent states we can,
without loss of generality, consider a single irreducible closed set.
Proof If x ∈ R let Cx = {y : ρxy > 0}. By (3.4), Cx ⊂ R, and if y ∈ Cx then
ρyx > 0. From this it follows easily that either Cx ∩Cy = ∅ or Cx = Cy . To prove
the last claim, suppose Cx ∩ Cy = ∅. If z ∈ Cx ∩ Cy then ρxy ≥ ρxz ρzy > 0, so if
w ∈ Cy we have ρxw ≥ ρxy ρyw > 0 and it follows that Cx ⊃ Cy . Interchanging
the roles of x and y gives Cy ⊃ Cx , and we have proved our claim. If we
let Ri be a listing of the sets that appear as some Cx , we have the desired
decomposition.
The rest of this section is devoted to examples. Speciﬁcally we concentrate
on the question: How do we tell whether a state is recurrent or transient?
Reasoning based on (3.4) works occasionally when S is inﬁnite.
Example 3.2. Branching process. If the probability of no children is positive then ρk0 > 0 and ρ0k = 0 for k ≥ 1, so (3.4) implies all states k ≥ 1 are
transient. The state 0 has p(0, 0) = 1 and is recurrent. It is called an absorbing state to reﬂect the fact that once the chain enters 0, it remains there for
all time.
If S is inﬁnite and irreducible, all that (3.4) tells us is that either all the
states are recurrent or all are transient, and we are left to ﬁgure out which case
occurs.
Example 3.3. Renewal chain. Since p(i, i − 1) = 1 for i ≥ 1, it is clear
that ρi0 = 1 for all i ≥ 1 and hence also for i = 0, i.e., 0 is recurrent. If we
recall that p(0, j ) = fj +1 and suppose that {k : fk > 0} is unbounded, then
ρ0i > 0 for all i and all states are recurrent. If K = sup{k : fk > 0} < ∞ then
{0, 1, . . . , K − 1} is an irreducible closed set of recurrent states and all states
k ≥ K are transient.
Example 3.4. Birth and death chains on {0, 1, 2, . . .}. Let

p(i, i + 1) = p_i    p(i, i − 1) = q_i    p(i, i) = r_i

where q_0 = 0. Let N = inf{n : Xn = 0}. To analyze this example, we are going
to define a function ϕ so that ϕ(X_{N∧n}) is a martingale. We start by setting
ϕ(0) = 0 and ϕ(1) = 1. For the martingale property to hold when Xn = k ≥ 1,
we must have

ϕ(k) = p_k ϕ(k + 1) + r_k ϕ(k) + q_k ϕ(k − 1)

Using r_k = 1 − (p_k + q_k), we can rewrite the last equation as

q_k (ϕ(k) − ϕ(k − 1)) = p_k (ϕ(k + 1) − ϕ(k))    or    ϕ(k + 1) − ϕ(k) = (q_k/p_k)(ϕ(k) − ϕ(k − 1))

Here and in what follows, we suppose that p_k, q_k > 0 for k ≥ 1; otherwise, the
chain is not irreducible. Since ϕ(1) − ϕ(0) = 1, iterating the last result gives

ϕ(m + 1) − ϕ(m) = ∏_{j=1}^{m} q_j/p_j   for m ≥ 1    and    ϕ(n) = Σ_{m=0}^{n−1} ∏_{j=1}^{m} q_j/p_j   for n ≥ 1

if we interpret the product as 1 when m = 0. Let Tc = inf{n ≥ 1 : Xn = c}.
Now I claim that:
(3.7) Theorem. If a < x < b then

Px(Ta < Tb) = (ϕ(b) − ϕ(x))/(ϕ(b) − ϕ(a))    and    Px(Tb < Ta) = (ϕ(x) − ϕ(a))/(ϕ(b) − ϕ(a))

Proof If we let T = Ta ∧ Tb then ϕ(X_{n∧T}) is a bounded martingale and
T < ∞ a.s. by Exercise 2.5, so ϕ(x) = Ex ϕ(XT ) by (7.3) in Chapter 4. Since
XT ∈ {a, b} a.s.,
ϕ(x) = ϕ(a)Px (Ta < Tb ) + ϕ(b)[1 − Px (Ta < Tb )]
and solving gives the indicated formula.
Remark. The answer and the proof should remind the reader of Example 1.5
in Chapter 3 and Exercise 7.3 in Chapter 4. To help remember the formula,
observe that for any α and β , if we let ψ (x) = αϕ(x) + β then ψ (Xn∧T ) is
also a martingale and the answer we get using ψ must be the same. The last
observation explains why the answer is a ratio of diﬀerences. To help remember
which one, observe that the answer is 1 if x = a and 0 if x = b.
Letting a = 0 and b = M in (3.7) gives
Px (T0 > TM ) = ϕ(x)/ϕ(M )
Letting M → ∞ and observing that TM ≥ M − x, Px a.s. we have proved:
(3.8) Theorem. 0 is recurrent if and only if ϕ(M) → ∞ as M → ∞, i.e.,

ϕ(∞) ≡ Σ_{m=0}^{∞} ∏_{j=1}^{m} q_j/p_j = ∞

If ϕ(∞) < ∞ then Px(T0 = ∞) = ϕ(x)/ϕ(∞).
We will now see what (3.8) says about some concrete cases.
Example 3.5. Asymmetric simple random walk. Suppose p_j = p and
q_j = 1 − p for j ≥ 1. In this case,

ϕ(n) = Σ_{m=0}^{n−1} ((1 − p)/p)^m

From (3.8), it follows that 0 is recurrent if and only if p ≤ 1/2, and if p > 1/2,
then

Px(T0 < ∞) = (ϕ(∞) − ϕ(x))/ϕ(∞) = ((1 − p)/p)^x
Exercise 3.6. A gambler is playing roulette and betting $1 on black each
time. The probability she wins $1 is 18/38, and the probability she loses $1
is 20/38. (i) Calculate the probability that starting with $20 she reaches $40
before losing her money. (ii) Use the fact that Xn + 2n/38 is a martingale to
calculate E (T40 ∧ T0 ).
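A sketch of the computation (the numerical answers here are computed, not quoted from the text): part (i) uses (3.7) with ϕ(n) = Σ_{m<n} (q/p)^m, and part (ii) uses optional stopping on the martingale Xn + 2n/38.

```python
# Exercise 3.6 by formula: p = 18/38 per bet, start x = 20, target b = 40.
p, q = 18 / 38, 20 / 38
r = q / p                               # = 10/9

def phi(n):                             # phi(n) = sum_{m < n} (q/p)^m
    return sum(r ** m for m in range(n))

x, b = 20, 40
p_win = phi(x) / phi(b)                 # (i) Px(T40 < T0), since phi(0) = 0

# (ii) optional stopping on Xn + (2/38) n at tau = T40 ∧ T0:
#      x = E X_tau + (2/38) E tau  and  E X_tau = b * p_win
e_tau = (x - b * p_win) * 38 / 2
```

This gives p_win ≈ 0.108 and E(T40 ∧ T0) ≈ 298 plays.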
Example 3.6. To probe the boundary between recurrence and transience,
suppose p_j = 1/2 + ε_j where ε_j ∼ C j^{−α} as j → ∞, and q_j = 1 − p_j. A little
arithmetic shows

q_j/p_j = (1/2 − ε_j)/(1/2 + ε_j) = 1 − 2ε_j/(1/2 + ε_j) ≈ 1 − 4C j^{−α}   for large j

Case 1: α > 1. It is easy to show that if 0 < δ_j < 1, then ∏_j (1 − δ_j) > 0 if and
only if Σ_j δ_j < ∞ (see Exercise 3.5 in Chapter 4), so if α > 1, ∏_{j≤k} q_j/p_j ↓
a positive limit, and 0 is recurrent.

Case 2: α < 1. Using the fact that log(1 − δ) ∼ −δ as δ → 0, we see that

log ∏_{j=1}^{k} q_j/p_j ∼ − Σ_{j=1}^{k} 4C j^{−α} ∼ −(4C/(1 − α)) k^{1−α}

as k → ∞, so for k ≥ K,

∏_{j=1}^{k} q_j/p_j ≤ exp(−2C k^{1−α}/(1 − α))   and   Σ_{k=0}^{∞} ∏_{j=1}^{k} q_j/p_j < ∞

and hence 0 is transient.

Case 3: α = 1. Repeating the argument for Case 2 shows

log ∏_{j=1}^{k} q_j/p_j ∼ −4C log k

So, if C > 1/4, 0 is transient, and if C < 1/4, 0 is recurrent. The case C = 1/4
can go either way.
Example 3.7. M/G/1 queue. Let µ = Σ_k k a_k be the mean number of
customers that arrive during one service time. We will now show that if µ > 1,
the chain is transient (i.e., all states are), but if µ ≤ 1, it is recurrent. For the
case µ > 1, we observe that if ξ1 , ξ2 , . . . are i.i.d. with P (ξm = j ) = aj +1 for
j ≥ −1 and Sn = ξ1 + · · · + ξn , then X0 + Sn and Xn behave the same until
time N = inf {n : X0 + Sn = 0}. When µ > 1, Eξm = µ − 1 > 0, so Sn → ∞
a.s., and inf Sn > −∞ a.s. It follows from the last observation that if x is large,
Px (N < ∞) < 1, and the chain is transient.
To deal with the case µ ≤ 1, we observe that it follows from arguments
in the last paragraph that Xn∧N is a supermartingale. Let T = inf {n : Xn ≥
M }. Since Xn∧N is a nonnegative supermartingale, using the optional stopping
theorem, (7.6) in Chapter 4, at time τ = T ∧ N , and observing Xτ ≥ M on
{T < N }, Xτ = 0 on {N < T } gives
x ≥ M Px (T < N )
Letting M → ∞ shows Px (N < ∞) = 1, so the chain is recurrent.
Remark. There is another way of seeing that the M/G/1 queue is transient
when µ > 1. If we consider the customers that arrive during a person’s service
time to be her children, then we get a branching process. Results in Section
4.3 imply that when µ ≤ 1 the branching process dies out with probability one
(i.e., the queue becomes empty), so the chain is recurrent. When µ > 1, (3.9)
in Chapter 4 implies Px(T0 < ∞) = ρ^x, where ρ is the smallest fixed point of
the function

ϕ(θ) = Σ_{k=0}^{∞} a_k θ^k

The next result encapsulates the techniques we used for birth and death
chains and the M/G/1 queue.
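Before stating it, here is a quick numerical check of the remark's fixed-point description, under the assumption (made only for this illustration) that the number of arrivals per service time is Poisson(µ) with µ > 1:

```python
import math

mu = 1.5                               # mean arrivals per service, mu > 1

def gen(theta):
    # generating function phi(theta) = sum_k a_k theta^k; for a_k
    # Poisson(mu) this is exp(mu (theta - 1))
    return math.exp(mu * (theta - 1))

rho = 0.0                              # iterate theta -> phi(theta) from 0;
for _ in range(500):                   # this converges to the smallest
    rho = gen(rho)                     # fixed point in [0, 1]
```

Here ρ ≈ 0.417 < 1, so Px(T0 < ∞) = ρ^x < 1 and the chain is transient, in agreement with Example 3.7.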
(3.9) Theorem. Suppose S is irreducible, and ϕ is a nonnegative function
with Ex ϕ(X1) ≤ ϕ(x) for x ∉ F, a finite set, and ϕ(x) → ∞ as x → ∞, i.e.,
{x : ϕ(x) ≤ M} is finite for any M < ∞. Then the chain is recurrent.
Proof Let τ = inf {n > 0 : Xn ∈ F }. Our assumptions imply that Yn =
ϕ(Xn∧τ ) is a supermartingale. Let TM = inf {n > 0 : Xn ∈ F or ϕ(Xn ) > M }.
Since {x : ϕ(x) ≤ M } is ﬁnite and the chain is irreducible, TM < ∞ a.s. Using
(7.6) in Chapter 4 now, we see that
ϕ(x) ≥ Ex ϕ(XTM ) ≥ M Px (TM < τ )
since ϕ(XTM ) ≥ M when TM < τ . Letting M → ∞, we see that Px (τ < ∞) = 1
for all x ∈ F . So Py (Xn ∈ F i.o.) = 1 for all y ∈ S , and since F is ﬁnite,
/
Py (Xn = z i.o.) = 1 for some z ∈ F. 295 296 Chapter 5 Markov Chains
Exercise 3.7. Show that if we replace "ϕ(x) → ∞" by "ϕ(x) → 0" in the last
theorem and assume that ϕ(x) > 0 for x ∉ F, then we can conclude that the
chain is transient.
Exercise 3.8. Let Xn be a birth and death chain with pj − 1/2 ∼ C/j as
j → ∞ and qj = 1 − pj . (i) Show that if we take C < 1/4 then we can pick
α > 0 so that ϕ(x) = xα satisﬁes the hypotheses of (3.9). (ii) Show that when
C > 1/4, we can take α < 0 and apply Exercise 3.7.
Remark. An advantage of the method of Exercise 3.8 over that of Example 3.6
is that it applies if we assume Px (X1 − x ≤ M ) = 1 and Ex (X1 − x) ∼ 2C/x.
Exercise 3.9. f is said to be superharmonic if f(x) ≥ Σ_y p(x, y) f(y), or
equivalently if f(Xn) is a supermartingale. Suppose p is irreducible. Show that p
is recurrent if and only if every nonnegative superharmonic function is constant.
Exercise 3.10. M/M/∞ queue. Consider a telephone system with an infinite
number of lines. Let Xn = the number of lines in use at time n, and suppose

X_{n+1} = Σ_{m=1}^{X_n} ξ_{n,m} + Y_{n+1}

where the ξ_{n,m} are i.i.d. with P(ξ_{n,m} = 1) = p and P(ξ_{n,m} = 0) = 1 − p, and
Yn is an independent i.i.d. sequence of Poisson mean λ r.v.'s. In words, for each
conversation we flip a coin with probability p of heads to see if it continues for
another minute. Meanwhile, a Poisson mean λ number of conversations start
between time n and n + 1. Use (3.9) with ϕ(x) = x to show that the chain is
recurrent for any p < 1.

5.4. Stationary Measures
A measure µ is said to be a stationary measure if

Σ_x µ(x) p(x, y) = µ(y)

The last equation says Pµ(X1 = y) = µ(y). Using the Markov property and
induction, it follows that Pµ (Xn = y ) = µ(y ) for all n ≥ 1. If µ is a probability
measure, we call µ a stationary distribution, and it represents a possible
equilibrium for the chain. That is, if X0 has distribution µ then so does Xn
for all n ≥ 1. If we stretch our imagination a little, we can also apply this
interpretation when µ is an infinite measure. (When the total mass is finite,
we can divide by µ(S ) to get a stationary distribution.) Before getting into the
theory, we consider some examples.
Example 4.1. Random walk. S = Z^d. p(x, y) = f(y − x), where f(z) ≥ 0
and Σ_z f(z) = 1. In this case, µ(x) ≡ 1 is a stationary measure, since

Σ_x µ(x) p(x, y) = Σ_x f(y − x) = 1

A transition probability that has Σ_x p(x, y) = 1 is called doubly stochastic.
This is obviously a necessary and sufficient condition for µ(x) ≡ 1 to be a
stationary measure.
Example 4.2. Asymmetric simple random walk. S = Z,

p(x, x + 1) = p    p(x, x − 1) = q = 1 − p

By the last example, µ(x) ≡ 1 is a stationary measure. When p ≠ q, µ(x) =
(p/q)^x is a second one. To check this, we observe that

Σ_x µ(x) p(x, y) = µ(y + 1) p(y + 1, y) + µ(y − 1) p(y − 1, y)
                 = (p/q)^{y+1} q + (p/q)^{y−1} p = (p/q)^y [p + q] = (p/q)^y
Example 4.3. The Ehrenfest chain. S = {0, 1, . . . , r},

p(k, k + 1) = (r − k)/r    p(k, k − 1) = k/r

In this case, µ(x) = 2^{−r} (r choose x) is a stationary distribution. One can check this
without pencil and paper by observing that µ corresponds to flipping r coins
to determine which urn each ball is to be placed in, and the transitions of the
chain correspond to picking a coin at random and turning it over. Alternatively,
you can pick up your pencil and check that

µ(k + 1) p(k + 1, k) + µ(k − 1) p(k − 1, k) = µ(k)
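The pencil check can be delegated to a few lines of code (the state-space size r = 6 is an arbitrary choice):

```python
from math import comb

r = 6
def p(k, j):                                 # Ehrenfest transition probability
    if j == k + 1:
        return (r - k) / r
    if j == k - 1:
        return k / r
    return 0.0

mu = [comb(r, x) / 2 ** r for x in range(r + 1)]       # binomial measure
mup = [sum(mu[x] * p(x, y) for x in range(r + 1)) for y in range(r + 1)]
stationary = all(abs(a - b) < 1e-12 for a, b in zip(mup, mu))
```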
Example 4.4. Birth and death chains. S = {0, 1, 2, . . .},

p(x, x + 1) = p_x    p(x, x) = r_x    p(x, x − 1) = q_x

with q_0 = 0 and p(i, j) = 0 otherwise. In this case, there is the measure

µ(x) = ∏_{k=1}^{x} p_{k−1}/q_k

which has

µ(x) p(x, x + 1) = p_x ∏_{k=1}^{x} p_{k−1}/q_k = µ(x + 1) p(x + 1, x)

Since p(x, y) = 0 when |x − y| > 1, it follows that
(4.1)    µ(x) p(x, y) = µ(y) p(y, x)   for all x, y

Summing over x gives

Σ_x µ(x) p(x, y) = µ(y) Σ_x p(y, x) = µ(y)

so (4.1) is stronger than being a stationary measure. (4.1) asserts that the
amount of mass that moves from x to y in one jump is exactly the same as the
amount that moves from y to x. A measure µ that satisﬁes (4.1) is said to be a
reversible measure. Since Examples 4.2 and 4.3 are birth and death chains,
they have reversible measures. In Example 4.1 (random walks), µ(x) ≡ 1 is a
reversible measure if and only if p(x, y ) = p(y, x).
The next exercise explains the name “reversible.”
Exercise 4.1. Let µ be a stationary measure and suppose X0 has “distribution” µ. Then Ym = Xn−m , 0 ≤ m ≤ n is a Markov chain with initial measure
µ and transition probability
q (x, y ) = µ(y )p(y, x)/µ(x)
q is called the dual transition probability. If µ is a reversible measure then
q = p.
Example 4.5. Random walks on graphs. A graph is described by giving
a countable set of vertices S and an adjacency matrix aij that has aij = 1 if
i and j are adjacent and 0 otherwise. To have an undirected graph with no
loops, we suppose aij = aji and aii = 0. If we suppose that

µ(i) = Σ_j aij < ∞   and let   p(i, j) = aij/µ(i)

then p is a transition probability that corresponds to picking an edge at random
and jumping to the other end. It is clear from the deﬁnition that
µ(i) p(i, j) = aij = aji = µ(j) p(j, i)
so µ is a reversible measure for p. A little thought reveals that if we assume
only that

aij = aji ≥ 0,   µ(i) = Σ_j aij < ∞   and   p(i, j) = aij/µ(i)
j the same conclusion is valid. This is the most general example because if µ is
a reversible measure for p, we can let aij = µ(i)p(i, j ).
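A sketch of the computation for a small, arbitrarily chosen graph, checking both the detailed balance condition (4.1) and the stationarity that follows from it:

```python
# Random walk on a 4-vertex graph: mu(i) = degree(i) is reversible.
a = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 0, 0]]
n = len(a)
mu = [sum(row) for row in a]                 # mu(i) = sum_j a_ij = degree
p = [[a[i][j] / mu[i] for j in range(n)] for i in range(n)]

# detailed balance (4.1): mu(i) p(i,j) = a_ij = a_ji = mu(j) p(j,i)
balanced = all(abs(mu[i] * p[i][j] - mu[j] * p[j][i]) < 1e-12
               for i in range(n) for j in range(n))
# summing (4.1) over i shows mu is a stationary measure
stationary = all(abs(sum(mu[i] * p[i][j] for i in range(n)) - mu[j]) < 1e-12
                 for j in range(n))
```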
Reviewing the last ﬁve examples might convince you that most chains
have reversible measures. This is a false impression. The M/G/1 queue has
no reversible measures because if x > y + 1, p(x, y ) = 0 but p(y, x) > 0. The
renewal chain has similar problems.
(4.2) Theorem. Suppose p is irreducible. A necessary and sufficient condition
for the existence of a reversible measure is that (i) p(x, y) > 0 implies p(y, x) >
0, and (ii) for any loop x0, x1, . . . , xn = x0 with ∏_{1≤i≤n} p(x_i, x_{i−1}) > 0,

∏_{i=1}^{n} p(x_{i−1}, x_i)/p(x_i, x_{i−1}) = 1
Proof To prove the necessity of this cycle condition, due to Kolmogorov,
we note that irreducibility implies that any stationary measure has µ(x) > 0
for all x, so (4.1) implies (i) holds. To check (ii), note that (4.1) implies that
for the sequences considered above

∏_{i=1}^{n} p(x_{i−1}, x_i)/p(x_i, x_{i−1}) = ∏_{i=1}^{n} µ(x_i)/µ(x_{i−1}) = 1

To prove sufficiency, fix a ∈ S, set µ(a) = 1, and if x0 = a, x1, . . . , xn = x is
a sequence with ∏_{1≤i≤n} p(x_i, x_{i−1}) > 0 (irreducibility implies such a sequence
will exist), we let

µ(x) = ∏_{i=1}^{n} p(x_{i−1}, x_i)/p(x_i, x_{i−1})

The cycle condition guarantees that the last definition is independent of the
path. To check (4.1) now, observe that if p(y, x) > 0, then adding x_{n+1} = y to
the end of a path to x we have

µ(x) p(x, y)/p(y, x) = µ(y)
Only special chains have reversible measures, but as the next result shows,
many Markov chains have stationary measures.
(4.3) Theorem. Let x be a recurrent state, and let T = inf{n ≥ 1 : Xn = x}.
Then

µx(y) = Ex Σ_{n=0}^{T−1} 1_{(Xn = y)} = Σ_{n=0}^{∞} Px(Xn = y, T > n)

defines a stationary measure.
Proof This is called the "cycle trick." The proof in words is simple: µx(y) is
the expected number of visits to y in {0, . . . , T − 1}, while µx p(y) ≡ Σ_z µx(z) p(z, y)
is the expected number of visits to y in {1, . . . , T}, which is = µx(y) since
XT = X0 = x. To translate this intuition into a proof, let p̄n(x, y) = Px(Xn =
y, T > n) and use Fubini's theorem to get

Σ_y µx(y) p(y, z) = Σ_{n=0}^{∞} Σ_y p̄n(x, y) p(y, z)

Case 1. z ≠ x.

Σ_y p̄n(x, y) p(y, z) = Σ_y Px(Xn = y, T > n, X_{n+1} = z)
                     = Px(T > n + 1, X_{n+1} = z) = p̄_{n+1}(x, z)

so

Σ_{n=0}^{∞} Σ_y p̄n(x, y) p(y, z) = Σ_{n=0}^{∞} p̄_{n+1}(x, z) = µx(z)

since p̄0(x, z) = 0.

Case 2. z = x.

Σ_y p̄n(x, y) p(y, x) = Σ_y Px(Xn = y, T > n, X_{n+1} = x) = Px(T = n + 1)

so

Σ_{n=0}^{∞} Σ_y p̄n(x, y) p(y, x) = Σ_{n=0}^{∞} Px(T = n + 1) = 1 = µx(x)

since Px(T = 0) = 0.
Remark. If x is transient, then we have µx p(z) ≤ µx(z), with equality for all
z ≠ x.
Technical Note. To show that we are not cheating, we should prove that
µx (y ) < ∞ for all y . First, observe that µx p = µx implies µx pn = µx for all
n ≥ 1, and µx (x) = 1, so if pn (y, x) > 0 then µx (y ) < ∞. Since the last result
is true for all n, we see that µx (y ) < ∞ whenever ρyx > 0, but this is good
enough. By (3.4), when x is recurrent ρxy > 0 implies ρyx > 0, and it follows
from the argument above that µx (y ) < ∞. If ρxy = 0 then µx (y ) = 0.
Exercise 4.2. (i) Use the construction in the proof of (4.3) to show that
µ(j) = Σ_{k≥j} f_{k+1} defines a stationary measure for the renewal chain (Example
1.3). (ii) Show that in this case the dual Markov chain defined in Exercise 4.1
represents the age of the item in use at time n, i.e., the amount of time since
the last renewal ≤ n.
(4.3) allows us to construct a stationary measure for each closed set of
recurrent states. Conversely, we have:
(4.4) Theorem. If p is irreducible and recurrent (i.e., all states are) then the
stationary measure is unique up to constant multiples.
Proof Let ν be a stationary measure and let a ∈ S. Then

ν(z) = Σ_y ν(y) p(y, z) = ν(a) p(a, z) + Σ_{y≠a} ν(y) p(y, z)

Using the last identity to replace ν(y) on the right-hand side,

ν(z) = ν(a) p(a, z) + Σ_{y≠a} ν(a) p(a, y) p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x) p(x, y) p(y, z)
     = ν(a) Pa(X1 = z) + ν(a) Pa(X1 ≠ a, X2 = z) + Pν(X0 ≠ a, X1 ≠ a, X2 = z)

Continuing in the obvious way, we get

ν(z) = ν(a) Σ_{m=1}^{n} Pa(Xk ≠ a, 1 ≤ k < m, Xm = z) + Pν(Xj ≠ a, 0 ≤ j < n, Xn = z)

The last term is ≥ 0. Letting n → ∞ gives ν(z) ≥ ν(a) µa(z), where µa is the
measure defined in (4.3) for x = a. It follows from (4.3) that µa is a stationary
measure with µa(a) = 1. (Here we are summing from 1 to T rather than
from 0 to T − 1.) To turn the ≥ in the last equation into =, we observe

ν(a) = Σ_x ν(x) p^n(x, a) ≥ ν(a) Σ_x µa(x) p^n(x, a) = ν(a) µa(a) = ν(a)

Since ν(x) ≥ ν(a) µa(x) and the left and right-hand sides are equal, we must
have ν(x) = ν(a) µa(x) whenever p^n(x, a) > 0. Since p is irreducible, it follows
that ν(x) = ν(a) µa(x) for all x ∈ S, and the proof is complete.
(4.3) and (4.4) make a good team. (4.3) gives us a formula for a stationary
measure we call µx, and (4.4) shows it is unique up to constant multiples.
Together they allow us to derive a lot of formulas.
Exercise 4.3. Show that if p is irreducible and recurrent then
µx ( y ) µy ( z ) = µx ( z )
Exercise 4.4. Use (4.3) and (4.4) to show that for simple random walk, the
expected number of visits to k between successive visits to 0 is 1 for all k .
Exercise 4.5. Let wxy = Px (Ty < Tx ). Show that
µx (y ) = wxy /wyx
and use this to prove the result in the last exercise.
Exercise 4.6. Another proof of (4.4). Suppose p is irreducible and recurrent,
and let µ be the stationary measure constructed in (4.3). µ(x) > 0 for all
x, and

q(x, y) = µ(y) p(y, x)/µ(x) ≥ 0

defines a "dual" transition probability. (See Exercise 4.1.) (i) Show that q is
irreducible and recurrent. (ii) Suppose ν(y) ≥ Σ_x ν(x) p(x, y) (i.e., ν is an
excessive measure) and let h(x) = ν(x)/µ(x). Verify that

h(y) ≥ Σ_x q(y, x) h(x)

and use Exercise 3.9 to conclude that h is constant, i.e., ν = cµ.
Remark. The last result is stronger than (4.4) since it shows that in the
recurrent case any excessive measure is a constant multiple of one stationary
measure. The remark after the proof of (4.3) shows that if p is irreducible and
transient, there is an excessive measure for each x ∈ S. Section 5.4 Stationary Measures
Having examined the existence and uniqueness of stationary measures, we
turn our attention now to stationary distributions, i.e., probability measures
π with πp = π . Stationary measures may exist for transient chains, e.g., random
walks in d ≥ 3, but
(4.5) Theorem. If there is a stationary distribution then all states y that have
π (y ) > 0 are recurrent.
Proof Since πp^n = π, Fubini's theorem implies

Σ_x π(x) Σ_{n=1}^{∞} p^n(x, y) = Σ_{n=1}^{∞} π(y) = ∞

when π(y) > 0. Using (3.2) now gives

∞ = Σ_x π(x) ρxy/(1 − ρyy) ≤ 1/(1 − ρyy)

since ρxy ≤ 1 and π is a probability measure. So ρyy = 1.
(4.6) Theorem. If p is irreducible and has stationary distribution π, then

π(x) = 1/Ex Tx

Remark. Recycling Chung's quote regarding (5.8) in Chapter 4, we note that
the proof will make π(x) = 1/Ex Tx obvious, but it seems incredible that

Σ_x (1/Ex Tx) p(x, y) = 1/Ey Ty

Proof Irreducibility implies π(x) > 0, so all states are recurrent by (4.5). From
(4.3),

µx(y) = Σ_{n=0}^{∞} Px(Xn = y, Tx > n)

defines a stationary measure with µx(x) = 1, and Fubini's theorem implies

Σ_y µx(y) = Σ_{n=0}^{∞} Px(Tx > n) = Ex Tx

By (4.4), the stationary measure is unique up to constant multiples, so π(x) =
µx(x)/Ex Tx. Since µx(x) = 1 by definition, the desired result follows.
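A numerical illustration of (4.6) on an arbitrary 3-state chain: the stationary distribution (found by power iteration) matches the reciprocals of the mean return times (solved by fixed-point iteration of the hitting-time equations).

```python
p = [[.5, .5,  0],
     [.2, .3, .5],
     [.4, .4, .2]]
n = len(p)

def mean_return(x):
    # m[y] = Ey(time to reach x), using m(y) = 1 + sum_{z != x} p(y,z) m(z);
    # evaluated at y = x this is the mean return time Ex Tx
    m = [0.0] * n
    for _ in range(20000):
        m = [1 + sum(p[y][z] * m[z] for z in range(n) if z != x)
             for y in range(n)]
    return m[x]

pi = [1 / n] * n                             # stationary dist by iteration
for _ in range(5000):
    pi = [sum(pi[z] * p[z][y] for z in range(n)) for y in range(n)]

matches = all(abs(pi[x] - 1 / mean_return(x)) < 1e-8 for x in range(n))
```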
If a state x has Ex Tx < ∞, it is said to be positive recurrent. A recurrent
state with Ex Tx = ∞ is said to be null recurrent. (5.1) will explain these
names. The next result helps us identify positive recurrent states.
(4.7) Theorem. If p is irreducible then the following are equivalent:
(i) Some x is positive recurrent.
(ii) There is a stationary distribution.
(iii) All states are positive recurrent.
Proof (i) implies (ii): If x is positive recurrent then

π(y) = Σ_{n=0}^{∞} Px(Xn = y, Tx > n)/Ex Tx

defines a stationary distribution.
(ii) implies (iii) (4.6) implies π (y ) = 1/Ey Ty , and irreducibility tells us
π (y ) > 0 for all y , so Ey Ty < ∞.
(iii) implies (i) Trivial.
Exercise 4.7. Renewal chain. Show that an irreducible renewal chain (Example
1.3) is positive recurrent (i.e., all the states are) if and only if µ = Σ_k k f_k < ∞.
Exercise 4.8. Suppose p is irreducible and positive recurrent. Then Ex Ty < ∞
for all x, y.
Exercise 4.9. Suppose p is irreducible and has a stationary measure µ with
Σ_x µ(x) = ∞. Then p is not positive recurrent.
(4.7) shows that being positive recurrent is a class property. If it holds for
one state in an irreducible set, then it is true for all. Turning to our examples,
since µ(x) ≡ 1 is a stationary measure, Exercise 4.9 implies that random walks
(Examples 4.1 and 4.2) are never positive recurrent. The Ehrenfest chain (Example 4.3) is positive recurrent. To see this note that the state space is ﬁnite,
so there is a stationary distribution and the conclusion follows from (4.7).
Birth and death chains (Example 4.4) have a stationary distribution if and
only if

Σ_x ∏_{k=1}^{x} p_{k−1}/q_k < ∞
By (3.8), the chain is recurrent if and only if

Σ_{m=0}^{∞} ∏_{j=1}^{m} q_j/p_j = ∞

When p_j = p and q_j = 1 − p for j ≥ 1, there is a stationary distribution
if and only if p < 1/2, and the chain is transient when p > 1/2. In Section
3, we probed the boundary between recurrence and transience by looking at
examples with p_j = 1/2 + ε_j, where ε_j ∼ C j^{−α} as j → ∞ and C, α ∈ (0, ∞).
Since ε_j ≥ 0, and hence p_{j−1}/q_j ≥ 1 for large j, none of these chains have
stationary distributions. If we look at chains with p_j = 1/2 − ε_j, then all we
have done is interchange the roles of p and q, and results from the last section
imply that the chain is positive recurrent when α < 1, or α = 1 and C > 1/4.
Random walks on graphs (Example 4.5) are irreducible if and only if the
graph is connected. Since µ(i) ≥ 1 in the connected case, we have positive
recurrence if and only if the graph is ﬁnite.
Exercise 4.10. Compute the expected number of moves it takes a knight to
return to its initial position if it starts in a corner of the chessboard, assuming
there are no other pieces on the board, and each time it chooses a move at
random from its legal moves. (Note: A chessboard is {0, 1, . . . , 7}^2. A knight's
move is L-shaped: two steps in one direction followed by one step in a perpendicular direction.)
Example 4.6. M/G/1 queue. Let µ = Σ_k k a_k be the mean number of
customers that arrive during one service time. In Example 3.7, we showed that
the chain is recurrent if and only if µ ≤ 1. We will now show that the chain
is positive recurrent if and only if µ < 1. First, suppose that µ < 1. When
Xn > 0, the chain behaves like a random walk that has jumps with mean µ − 1,
so if N = inf {n ≥ 0 : Xn = 0} then XN ∧n − (µ − 1)(N ∧ n) is a martingale. If
X0 = x > 0 then the martingale property implies
x = Ex XN ∧n + (1 − µ)Ex (N ∧ n) ≥ (1 − µ)Ex (N ∧ n)
since XN ∧n ≥ 0, and it follows that Ex N ≤ x/(1 − µ).
To prove that there is equality, observe that Xn decreases by at most one
each time and for x ≥ 1, Ex Tx−1 = E1 T0 , so Ex N = cx. To identify the
constant, observe that
E1 N = 1 + Σ_{k=0}^{∞} a_k Ek N

so c = 1 + µc and c = 1/(1 − µ). If X0 = 0 then p(0, 0) = a0 + a1 and
p(0, k − 1) = ak for k ≥ 2. By considering what happens on the ﬁrst jump, we 305 306 Chapter 5 Markov Chains
see that (the first term may look wrong, but recall k − 1 = 0 when k = 1)

E0 T0 = 1 + Σ_{k=1}^{∞} a_k (k − 1)/(1 − µ) = 1 + (µ − (1 − a0))/(1 − µ) = a0/(1 − µ) < ∞

This shows that the chain is positive recurrent if µ < 1. To prove the converse,
observe that the arguments above show that if E0 T0 < ∞ then Ek N < ∞ for
all k , Ek N = ck , and c = 1/(1 − µ), which is impossible if µ ≥ 1.
The last result when combined with (4.4) and (4.6) allows us to conclude
that the stationary distribution has π (0) = (1 − µ)/a0 . This may not seem like
much, but the equations in πp = π are:
π (0) = π (0)(a0 + a1 ) + π (1)a0
π (1) = π (0)a2 + π (1)a1 + π (2)a0
π (2) = π (0)a3 + π (1)a2 + π (2)a1 + π (3)a0
or, in general, for j ≥ 1,

π(j) = Σ_{i=0}^{j+1} π(i) a_{j+1−i}

The equations have a "triangular" form, so knowing π(0), we can solve for
π (1), π (2), . . . The ﬁrst expression,
π (1) = π (0)(1 − (a0 + a1 ))/a0
is simple, but the formulas get progressively messier, and there is no nice closed
form solution.
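The triangular solve is short in code. A sketch, assuming (purely for illustration) that the number of arrivals per service is Poisson(µ) with µ = 0.5:

```python
import math

mu = 0.5
a = [math.exp(-mu) * mu ** k / math.factorial(k) for k in range(40)]

N = 25
pi = [0.0] * N
pi[0] = (1 - mu) / a[0]                      # pi(0) = (1 - mu)/a0
pi[1] = pi[0] * (1 - (a[0] + a[1])) / a[0]   # the "first expression"
for j in range(1, N - 1):
    # pi(j) = sum_{i=0}^{j+1} pi(i) a_{j+1-i}; solve for pi(j+1)
    s = sum(pi[i] * a[j + 1 - i] for i in range(j + 1))
    pi[j + 1] = (pi[j] - s) / a[0]

total = sum(pi)                              # should be close to 1
```

With µ = 0.5 the probabilities decay quickly, so 25 terms already sum to essentially 1.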
Example 4.7. M/M/∞ queue. In this chain, introduced in Exercise 3.10,

X_{n+1} = Σ_{m=1}^{X_n} ξ_{n,m} + Y_{n+1}

where the ξ_{n,m} are i.i.d. Bernoulli with mean p and Y_{n+1} is an independent
Poisson with mean λ. It follows from properties of the Poisson distribution that
if Xn is Poisson with mean µ, then X_{n+1} is Poisson with mean µp + λ. Setting
µ = µp + λ, we find that a Poisson distribution with mean µ = λ/(1 − p) is a
stationary distribution.
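A numerical check of this claim, with arbitrary parameters p = 0.6 and λ = 1, truncating the infinite state space at a level where the Poisson tail is negligible:

```python
import math

p_keep, lam = 0.6, 1.0
m = lam / (1 - p_keep)                        # stationary mean, here 2.5

def pois(mean, k):
    return math.exp(-mean) * mean ** k / math.factorial(k)

def trans(x, y):
    # X_{n+1} = Binomial(x, p_keep) survivors + Poisson(lam) new calls
    return sum(math.comb(x, s) * p_keep ** s * (1 - p_keep) ** (x - s)
               * pois(lam, y - s) for s in range(min(x, y) + 1))

N = 60                                        # truncation level
mu = [pois(m, x) for x in range(N)]
resid = max(abs(sum(mu[x] * trans(x, y) for x in range(N)) - mu[y])
            for y in range(10))               # residual of mu p = mu
```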
There is a general result that handles Examples 4.6 and 4.7 and is useful in
a number of other situations. This will be developed in the next two exercises. Section 5.5 Asymptotic Behavior
Exercise 4.11. Let Xn ≥ 0 be a Markov chain and suppose Ex X1 ≤ x − ε for
x > K, where ε > 0. Let Yn = Xn + εn and τ = inf{n : Xn ≤ K}. Then Yn∧τ is a
positive supermartingale, and the optional stopping theorem implies Ex τ ≤ x/ε.
Exercise 4.12. Suppose that Xn has state space {0, 1, 2, . . .}, the conditions
of the last exercise hold when K = 0, and E0 X1 < ∞. Then 0 is positive
recurrent. We leave it to the reader to formulate and prove a similar result
when K > 0.
To close the section, we will give a selfcontained proof of
(4.8) Theorem. If p is irreducible and has a stationary distribution π then any
other stationary measure is a multiple of π.
Remark. This result is a consequence of (4.5) and (4.4), but we ﬁnd the
method of proof amusing.
Proof Since p is irreducible, π(x) > 0 for all x. Let ϕ be a concave function
that is bounded on (0, ∞), e.g., ϕ(x) = x/(x + 1). Define the entropy of µ by

E(µ) = Σ_y ϕ(µ(y)/π(y)) π(y)

The reason for the name will become clear during the proof.

E(µp) = Σ_y ϕ( Σ_x µ(x) p(x, y)/π(y) ) π(y)
      = Σ_y ϕ( Σ_x (µ(x)/π(x)) · (π(x) p(x, y)/π(y)) ) π(y)
      ≥ Σ_y Σ_x ϕ(µ(x)/π(x)) (π(x) p(x, y)/π(y)) π(y)

since ϕ is concave and, for each fixed y, ν(x) = π(x) p(x, y)/π(y) is a probability
distribution. Since the π(y)'s cancel and Σ_y p(x, y) = 1, the last expression
= E(µ), and we have shown E(µp) ≥ E(µ), i.e., the entropy of an arbitrary
initial measure µ is increased by an application of p.

If p(x, y) > 0 for all x and y, and µp = µ, it follows that µ(x)/π(x) must
be constant, for otherwise there would be strict inequality in the application
of Jensen's inequality. To get from the last special case to the general result,
observe that if p is irreducible

p̄(x, y) = Σ_{n=1}^{∞} 2^{−n} p^n(x, y) > 0   for all x, y

and µp = µ implies µp̄ = µ.

5.5. Asymptotic Behavior
In this section, we will investigate the asymptotic behavior of Xn and p^n(x, y).

a. Convergence Theorems
If y is transient, Σ_n p^n(x, y) < ∞, so p^n(x, y) → 0 as n → ∞. To deal with
the recurrent states, we let

Nn(y) = Σ_{m=1}^{n} 1_{(Xm = y)}

be the number of visits to y by time n.
(5.1) Theorem. Suppose y is recurrent. For any x ∈ S, as n → ∞,

Nn(y)/n → (1/Ey Ty) 1_{(Ty < ∞)}   Px a.s.

Here 1/∞ = 0.
Proof Suppose first that we start at y. Let R(k) = min{n ≥ 1 : Nn(y) = k} =
the time of the kth return to y, and let tk = R(k) − R(k − 1), where R(0) = 0.
Since we have assumed X0 = y, t1, t2, . . . are i.i.d. and the strong law of large
numbers implies

R(k)/k → Ey Ty   Py a.s.

Since R(Nn(y)) ≤ n < R(Nn(y) + 1),

R(Nn(y))/Nn(y) ≤ n/Nn(y) < [R(Nn(y) + 1)/(Nn(y) + 1)] · [(Nn(y) + 1)/Nn(y)]

Letting n → ∞, and recalling Nn(y) → ∞ a.s. since y is recurrent, we have

n/Nn(y) → Ey Ty   Py a.s.

To generalize now to x ≠ y, observe that if Ty = ∞ then Nn(y) = 0 for all n
and hence

Nn(y)/n → 0   on {Ty = ∞}

The strong Markov property implies that conditional on {Ty < ∞}, t2, t3, . . .
are i.i.d. and have Px(tk = n) = Py(Ty = n), so

R(k)/k = t1/k + (t2 + · · · + tk)/k → 0 + Ey Ty   Px a.s. on {Ty < ∞}

Repeating the proof for the case x = y shows

Nn(y)/n → 1/Ey Ty   Px a.s. on {Ty < ∞}

and combining this with the result for {Ty = ∞} completes the proof.
Remark. (5.1) should help explain the terms positive and null recurrent. If
we start from x, then in the ﬁrst case the asymptotic fraction of time spent at
x is positive and in the second case it is 0.
Since 0 ≤ Nn(y)/n ≤ 1, it follows from the bounded convergence theorem
that Ex Nn(y)/n → Ex(1_{(Ty < ∞)}/Ey Ty), so

(5.2)    (1/n) Σ_{m=1}^{n} p^m(x, y) → ρxy/Ey Ty

The last result was proved for recurrent y but also holds for transient y, since
in that case Ey Ty = ∞ and the limit is 0, since Σ_m p^m(x, y) < ∞.
(5.2) shows that the sequence pn (x, y ) always converges in the Cesaro sense.
The next example shows that pn (x, y ) need not converge.
Example 5.1.

p = 0 1      p^2 = 1 0      p^3 = p, p^4 = p^2, . . .
    1 0           0 1

A similar problem also occurs in the Ehrenfest chain. In that case, if X0 is
even, then X1 is odd, X2 is even, . . . so pn (x, x) = 0 unless n is even. It is easy
to construct examples with pn (x, x) = 0 unless n is a multiple of 3 or 17 or . . .
(5.5) below will show that this “periodicity” is the only thing that can
prevent the convergence of the pn (x, y ). First, we need a deﬁnition and two
preliminary results. Let x be a recurrent state, let Ix = {n ≥ 1 : pn (x, x) > 0},
and let dx be the greatest common divisor of Ix . dx is called the period of x.
The ﬁrst result says that the period is a class property.
(5.3) Lemma. If ρxy > 0 then dy = dx . 309 310 Chapter 5 Markov Chains
Proof Let K and L be such that p^K(x, y) > 0 and p^L(y, x) > 0. (x is
recurrent, so ρyx > 0.) Since

p^{K+L}(y, y) ≥ p^L(y, x) p^K(x, y) > 0

dy divides K + L, abbreviated dy | (K + L). Let n be such that p^n(x, x) > 0. Then

p^{K+n+L}(y, y) ≥ p^L(y, x) p^n(x, x) p^K(x, y) > 0

so dy | (K + n + L), and hence dy | n. Since n ∈ Ix is arbitrary, dy | dx. Interchanging
the roles of y and x gives dx | dy, and hence dx = dy.
The next result implies that Ix ⊃ {m · dx : m ≥ m0}. (Apply (5.4) to p^{dx}.)
(5.4) Lemma. If dx = 1 then pm (x, x) > 0 for m ≥ m0 .
Proof by example Suppose 4, 7 ∈ Ix. Since p^{m+n}(x, x) ≥ p^m(x, x) p^n(x, x), Ix
is closed under addition, i.e., if m, n ∈ Ix then m + n ∈ Ix. A little calculation
shows that in the example

Ix ⊃ {4, 7, 8, 11, 12, 14, 15, 16, 18, 19, 20, 21, . . .}

so the result is true with m0 = 18. (Once Ix contains four consecutive integers,
it will contain all the rest.)
Proof Our first goal is to prove that Ix contains two consecutive integers.
Let n0, n0 + k ∈ Ix. If k = 1, we are done. If not, then since the greatest
common divisor of Ix is 1, there is an n1 ∈ Ix so that k is not a divisor of
n1. Write n1 = mk + r with 0 < r < k. Since Ix is closed under addition,
(m + 1)(n0 + k) and (m + 1)n0 + n1 are both in Ix. Their difference is

(m + 1)k − n1 = k − r < k

Repeating the last argument (at most k times), we eventually arrive at a pair
of consecutive integers N, N + 1 ∈ Ix. It is now easy to show that the result
holds for m0 = N². Let m ≥ N² and write m − N² = kN + r with 0 ≤ r < N.
Then

m = r + N² + kN = r(1 + N) + (N − r + k)N ∈ Ix
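The number-theoretic fact is easy to experiment with. The sketch below closes {4, 7} under addition and locates the threshold m0 (here 18, as claimed):

```python
def closure(gens, limit):
    """All sums of the generators that are <= limit."""
    s = set()
    frontier = set(g for g in gens if g <= limit)
    while frontier:
        s |= frontier
        frontier = {a + b for a in s for b in gens if a + b <= limit} - s
    return s

I = closure({4, 7}, 100)
missing = [m for m in range(1, 101) if m not in I]
# the largest gap is 17, so I contains every m >= 18
```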
(5.5) Convergence theorem. Suppose p is irreducible, aperiodic (i.e., all
states have dx = 1), and has stationary distribution π . Then, as n → ∞,
pn (x, y ) → π (y ). Section 5.5 Asymptotic Behavior
Proof Let S² = S × S. Define a transition probability p̄ on S² by

p̄((x1, y1), (x2, y2)) = p(x1, x2) p(y1, y2)

i.e., each coordinate moves independently. Our first step is to check that p̄ is irreducible. This may seem like a silly thing to do first, but this is the only step that requires aperiodicity. Since p is irreducible, there are K, L so that pK(x1, x2) > 0 and pL(y1, y2) > 0. From (5.4) it follows that if M is large, pL+M(x2, x2) > 0 and pK+M(y2, y2) > 0, so

p̄K+L+M((x1, y1), (x2, y2)) > 0

Our second step is to observe that since the two coordinates are independent, π̄(a, b) = π(a)π(b) defines a stationary distribution for p̄, and (4.5) implies that for p̄ all states are recurrent. Let (Xn, Yn) denote the chain on S², and let T be the first time that this chain hits the diagonal {(y, y) : y ∈ S}. Let T(x,x) be the hitting time of (x, x). Since p̄ is irreducible and recurrent, T(x,x) < ∞ a.s., and hence T < ∞ a.s.

The final step is to observe that on {T ≤ n}, the two coordinates Xn and Yn have the same distribution. By considering the time and place of the first intersection and then using the Markov property,

P(Xn = y, T ≤ n) = Σ_{m=1}^n Σ_x P(T = m, Xm = x, Xn = y)
= Σ_{m=1}^n Σ_x P(T = m, Xm = x) P(Xn = y | Xm = x)
= Σ_{m=1}^n Σ_x P(T = m, Ym = x) P(Yn = y | Ym = x)
= P(Yn = y, T ≤ n)
To finish up, we observe that

P(Xn = y) = P(Yn = y, T ≤ n) + P(Xn = y, T > n)
≤ P(Yn = y) + P(Xn = y, T > n)

and similarly, P(Yn = y) ≤ P(Xn = y) + P(Yn = y, T > n). So

|P(Xn = y) − P(Yn = y)| ≤ P(Xn = y, T > n) + P(Yn = y, T > n)

and summing over y gives

Σ_y |P(Xn = y) − P(Yn = y)| ≤ 2P(T > n)

If we let X0 = x and let Y0 have the stationary distribution π, then Yn has distribution π, and it follows that

Σ_y |pn(x, y) − π(y)| ≤ 2P(T > n) → 0

proving the desired result. If we recall the definition of the total variation distance given in Section 2.6, the last conclusion can be written as

‖pn(x, ·) − π(·)‖ ≤ P(T > n) → 0

At first glance, it may seem strange to prove the convergence theorem by running independent copies of the chain. An approach that is slightly more complicated but explains better what is happening is to define
q((x1, y1), (x2, y2)) =
    p(x1, x2) p(y1, y2)   if x1 ≠ y1
    p(x1, x2)             if x1 = y1, x2 = y2
    0                     otherwise

In words, the two coordinates move independently until they hit and then move together. It is easy to see from the definition that each coordinate is a copy of the original process. If T′ is the hitting time of the diagonal for the new chain (X′n, Y′n), then X′n = Y′n on {T′ ≤ n}, so it is clear that

Σ_y |P(X′n = y) − P(Y′n = y)| ≤ 2P(X′n ≠ Y′n) = 2P(T′ > n)

On the other hand, T and T′ have the same distribution, so P(T′ > n) → 0, and the conclusion follows as before. The technique used in the last proof is called coupling. Generally, this term refers to building two sequences Xn and Yn on the same space to conclude that Xn converges in distribution by showing P(Xn ≠ Yn) → 0 or, more generally, that for some metric ρ, ρ(Xn, Yn) → 0 in probability.
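The proof is easy to animate numerically. The sketch below is my own illustration (the 3-state chain is an arbitrary choice, not from the text): it runs two independent copies from different starting states and records the coupling time T, whose tail controls Σ_y |P(Xn = y) − P(Yn = y)| ≤ 2P(T > n).

```python
import random

# An arbitrary irreducible, aperiodic chain on {0, 1, 2} (illustrative choice)
P = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.5, 0.3], 2: [0.3, 0.2, 0.5]}

def step(state, rng):
    return rng.choices([0, 1, 2], weights=P[state])[0]

def coupling_time(x, y, rng, cap=10_000):
    """Run two independent copies from x and y until they first agree."""
    n = 0
    while x != y and n < cap:
        x, y = step(x, rng), step(y, rng)
        n += 1
    return n

rng = random.Random(0)
times = [coupling_time(0, 2, rng) for _ in range(2000)]
# The empirical tail P(T > n) decays quickly, so the two laws merge quickly.
print(sum(t > 20 for t in times) / len(times))
```

Replacing the chain P with any other irreducible aperiodic matrix gives the same qualitative picture; only the decay rate of P(T > n) changes.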
Having completed the proof of (5.5), we pause to show that coupling can
be fun and it doesn’t always happen.
Example 5.2. A coupling card trick. The following demonstration used by
E.B. Dynkin in his probability class is a variation of a card trick that appeared
in Scientiﬁc American. The instructor asks a student to write 100 random
digits from 0 to 9 on the blackboard. Another student chooses one of the ﬁrst
10 numbers and does not tell the instructor. If that digit is 7, say, she counts 7
places along the list, notes the digit at that location, and continues the process.
If the digit is 0 she counts 10. A possible sequence is underlined on the list
below:
3 4 7 8 2 3 7 5 6 1 6 4 6 5 7 8 3 1 5 3 0 7 9 2 3 ...
The trick is that, without knowing the student’s ﬁrst digit, the instructor can
point to her ﬁnal stopping position. To this end, he picks the ﬁrst digit and
forms his own sequence in the same manner as the student and announces his
stopping position. He makes an error if the coupling time is larger than 100.
Numerical computations done by one of Dynkin’s graduate students show that
the probability of error is approximately 0.026.
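A simulation of the demonstration (my own sketch; conventions as described above, with 0 counted as 10 and the instructor starting from the first digit) estimates the failure probability:

```python
import random

def final_position(digits, start):
    """Follow the chain: from position i, jump ahead by digits[i] (0 counts as 10);
    stop at the last position from which the next count would run off the list."""
    i = start
    while True:
        jump = digits[i] if digits[i] != 0 else 10
        if i + jump >= len(digits):
            return i
        i += jump

def trick_fails(rng):
    digits = [rng.randrange(10) for _ in range(100)]
    secret = rng.randrange(10)           # the student's hidden starting position
    return final_position(digits, secret) != final_position(digits, 0)

rng = random.Random(1)
trials = 4000
rate = sum(trick_fails(rng) for _ in range(trials)) / trials
print(rate)  # a small failure probability, of the same order as the 0.026 quoted above
```

The trick works for the same reason the convergence theorem does: the two digit-chains are coupled, and they almost always meet before the list runs out.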
Example 5.3. There is a transition probability that is irreducible, aperiodic, and recurrent but has the property that two independent particles need not meet. Let Sn be a modified two-dimensional simple random walk that stays where it is with probability 1/5 and jumps to each of its four neighbors with probability 1/5 each. Let T = inf{n ≥ 1 : Sn = (0, 0)} and fk = P(T = k). Let p be the transition probability with

p(0, j) = fj+1,  p(i, i − 1) = 1,  p(i, j) = 0 otherwise

p is the renewal chain corresponding to the distribution f, so p is recurrent. fk > 0 for all k, so p is irreducible and aperiodic. Let S_n^1 and S_n^2 be independent copies of the random walk starting from S_0^1 = S_0^2, and let X_n^i = inf{m − n : m ≥ n, S_m^i = (0, 0)}. From our discussion of the renewal chain in Example 1.3, it follows that X_n^1 and X_n^2 are independent Markov chains with transition probability p. It is easy to see that if X_n^1 = X_n^2 = j then X_{n+j}^1 = X_{n+j}^2 = 0, and at this time the four-dimensional random walk (S_n^1, S_n^2) is at (0, 0, 0, 0). Since

P((S_n^1, S_n^2) = (0, 0, 0, 0) for some n ≥ 1) < 1

we have proved our claim.
(5.5) applies almost immediately to the examples considered in Section 1. The M/G/1 queue has ak > 0 for all k ≥ 0, so if µ < 1, Px(Xn = y) → π(y) for any x, y ≥ 0. The same result holds for the renewal chain, provided {k : fk > 0} is unbounded (for irreducibility) and its greatest common divisor is 1. In this case, Px(Xn = 0) → π(0) = 1/ν, where ν = Σ_k k fk is the mean time between renewals.
Exercise 5.1. Historically, the first chain for which (5.5) was proved was the Bernoulli–Laplace model of diffusion. Suppose two urns, which we will call left and right, have m balls each; b (which we will assume is ≤ m) balls are black, and 2m − b are white. At each time, we pick one ball from each urn and interchange them. Let the state at time n be the number of black balls in the left urn. Compute the transition probability for this chain, find its stationary distribution, and use (5.5) to conclude that the chain approaches equilibrium as n → ∞.
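For checking an answer numerically, here is an illustrative sketch (the names and the m = 5, b = 4 instance are my choices). It encodes the swap dynamics and verifies πp = π for the hypergeometric distribution, which one can check is the stationary law:

```python
from math import comb

def transition(m, b):
    """Bernoulli-Laplace chain: the state j is the number of black balls in the
    left urn; one ball is drawn from each urn and the two are interchanged."""
    states = range(max(0, b - m), min(b, m) + 1)
    P = {}
    for j in states:
        down = (j / m) * ((m - (b - j)) / m)   # black leaves left, white enters
        up = ((m - j) / m) * ((b - j) / m)     # white leaves left, black enters
        P[j] = {j - 1: down, j + 1: up, j: 1 - down - up}
    return states, P

def hypergeometric(m, b):
    """Candidate stationary law: pick the m left-urn balls at random from all 2m."""
    states, _ = transition(m, b)
    return {j: comb(b, j) * comb(2 * m - b, m - j) / comb(2 * m, m) for j in states}

m, b = 5, 4
states, P = transition(m, b)
pi = hypergeometric(m, b)
for j in states:                    # verify pi P = pi
    flow = sum(pi[i] * P[i].get(j, 0) for i in states)
    assert abs(flow - pi[j]) < 1e-12
print("pi P = pi holds for m=5, b=4")
```

In fact the chain satisfies detailed balance with this π, which is the cleanest way to do the pencil-and-paper verification.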
And now for something completely diﬀerent:
Example 5.4. Shuﬄing cards. The state of a deck of n cards can be
represented by a permutation, π (i) giving the location of the ith card. Consider
the following method of mixing the deck up. The top card is removed and
inserted under one of the n − 1 cards that remain. I claim that by following
the bottom card of the deck we can see that it takes about n log n moves to
mix up the deck. This card stays at the bottom until the ﬁrst time (T1 ) a
card is inserted below it. It is easy to see that when the k th card is inserted
below the original bottom card (at time Tk ), all k ! arrangements of the cards
below are equally likely, so at time τn = Tn−1 + 1 all n! arrangements are
equally likely. If we let T0 = 0 and tk = Tk − Tk−1 for 1 ≤ k ≤ n − 1, then
these r.v.’s are independent, and tk has a geometric distribution with success
probability k/(n − 1). These waiting times are the same as the ones in the
coupon collector’s problem (Example 5.3 in Chapter 1), so τn /(n log n) → 1 in
probability as n → ∞. For more on card shuﬄing, see Aldous and Diaconis
(1986).
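The description of τn translates into a few lines of code. The following sketch (my own illustration) simulates τn by tracking only how many cards sit below the original bottom card:

```python
import random
from math import log

def time_to_mix(n, rng):
    """Top-to-random shuffle: return tau_n = T_{n-1} + 1, where T_k is the first
    time k cards sit below the original bottom card."""
    below, t = 0, 0
    while below < n - 1:
        t += 1
        # the removed top card goes under one of the n-1 remaining cards; it
        # lands below the original bottom card with probability (below+1)/(n-1)
        if rng.random() < (below + 1) / (n - 1):
            below += 1
    return t + 1

rng = random.Random(2)
n = 200
samples = [time_to_mix(n, rng) for _ in range(50)]
avg = sum(samples) / len(samples)
print(avg / (n * log(n)))  # close to 1, by the coupon-collector asymptotics
```

Note that the waiting times here are geometric with success probabilities 1/(n−1), 2/(n−1), . . ., exactly as in the text.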
Example 5.5. Random walk on the hypercube. Consider {0, 1}d as a graph with edges connecting each pair of points that differ in only one coordinate. Let Xn be a random walk on {0, 1}d that stays put with probability 1/2 and jumps to one of its d neighbors with probability 1/2d each. Let Yn be another copy of the chain in which Y0 (and hence Yn, n ≥ 1) is uniformly distributed on {0, 1}d. We construct a coupling of Xn and Yn by letting U1, U2, . . . be uniform on {1, 2, . . . , 2d}. At time n, the jth coordinates of X and Y are each set equal to 1 if Un = 2j − 1 and are each set equal to 0 if Un = 2j. The other coordinates are unchanged. Let Td = inf{m : {U1, . . . , Um} = {1, 2, . . . , 2d}}. When n ≥ Td, Xn = Yn. Results for the coupon collector's problem (Example 5.3 in Chapter 1) show that Td/(d log d) → 1 in probability as d → ∞.

*b. Periodic Case
(5.6) Lemma. Suppose p is irreducible, recurrent, and all states have period d. Fix x ∈ S, and for each y ∈ S, let Ky = {n ≥ 1 : pn(x, y) > 0}. (i) There is an ry ∈ {0, 1, . . . , d − 1} so that if n ∈ Ky then n = ry mod d, i.e., the difference n − ry is a multiple of d. (ii) Let Sr = {y : ry = r} for 0 ≤ r < d. If y ∈ Si, z ∈ Sj, and pn(y, z) > 0, then n = (j − i) mod d. (iii) S0, S1, . . . , Sd−1 are irreducible classes for pd, and all states have period 1.

Proof (i) Let m(y) be such that pm(y)(y, x) > 0. If n ∈ Ky then pn+m(y)(x, x) is positive, so d | (n + m(y)). Let ry = (d − m(y)) mod d. (ii) Let m, n be such that pn(y, z) > 0 and pm(x, y) > 0. Since pn+m(x, z) > 0, it follows from (i) that n + m = j mod d. Since m = i mod d, the result follows. The irreducibility in (iii) follows immediately from (ii). The aperiodicity follows from the definition of the period as the g.c.d. of {n : pn(x, x) > 0}.
A partition of the state space S0 , S1 , . . . , Sd−1 satisfying (ii) in (5.6) is
called a cyclic decomposition of the state space. Except for the choice of the
set to put ﬁrst, it is unique. (Pick an x ∈ S . It lies in some Sj , but once the
value of j is known, irreducibility and (ii) allow us to calculate all the sets.)
Exercise 5.2. Find the decomposition for the Markov chain with transition
probability
     1    2    3    4    5    6    7
1    0    0    0   .5   .5    0    0
2   .3    0    0    0    0    0   .7
3    0    0    0    0    0    0    1
4    0    0    1    0    0    0    0
5    0    0    1    0    0    0    0
6    0    1    0    0    0    0    0
7    0    0    0   .4    0   .6    0
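To check an answer to this exercise by machine, one can use the standard distance-mod-period algorithm (an illustrative sketch; only the positions of the nonzero entries of the matrix matter):

```python
from math import gcd
from collections import deque

# positive entries of the transition matrix above: state -> possible next states
succ = {1: [4, 5], 2: [1, 7], 3: [7], 4: [3], 5: [3], 6: [2], 7: [4, 6]}

def cyclic_decomposition(succ, x=1):
    """Return S_0, ..., S_{d-1} with x in S_0, for an irreducible chain."""
    dist, queue = {x: 0}, deque([x])
    while queue:                       # BFS distances from x
        u = queue.popleft()
        for v in succ[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    d = 0                              # period = gcd of edge discrepancies
    for u in succ:
        for v in succ[u]:
            d = gcd(d, abs(dist[u] + 1 - dist[v]))
    classes = [set() for _ in range(d)]
    for u, k in dist.items():
        classes[k % d].add(u)
    return classes

print(cyclic_decomposition(succ))
```

The period d comes out as the gcd of the lengths of the loops through the fixed state, and each class Sr collects the states reachable in r steps mod d, exactly as in (5.6).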
(5.7) Convergence theorem, periodic case. Suppose p is irreducible, has a stationary distribution π, and all states have period d. Let x ∈ S, and let S0, S1, . . . , Sd−1 be the cyclic decomposition of the state space with x ∈ S0. If y ∈ Sr then

lim_{m→∞} pmd+r(x, y) = π(y)d

Proof If y ∈ S0 then using (iii) in (5.6) and applying (5.5) to pd shows that

lim_{m→∞} pmd(x, y) exists

To identify the limit, we note that (5.2) implies

(1/n) Σ_{m=1}^n pm(x, y) → π(y)

and (ii) of (5.6) implies pm(x, y) = 0 unless d | m, so the limit in the first display must be π(y)d. If y ∈ Sr with 1 ≤ r < d then

pmd+r(x, y) = Σ_{z∈Sr} pr(x, z) pmd(z, y)

Since y, z ∈ Sr, it follows from the first case in the proof that pmd(z, y) → π(y)d as m → ∞. pmd(z, y) ≤ 1 and Σ_z pr(x, z) = 1, so (5.7) follows from the dominated convergence theorem.

*c. Tail σ-field
Let Fn = σ(Xn+1, Xn+2, . . .) and T = ∩n Fn be the tail σ-field. The next result is due to Orey. The proof we give is from Blackwell and Freedman (1964).

(5.8) Theorem. Suppose p is irreducible, recurrent, and all states have period d. Then

T = σ({X0 ∈ Sr} : 0 ≤ r < d)

Remark. To be precise, if µ is any initial distribution and A ∈ T, then there is an r so that A = {X0 ∈ Sr} Pµ-a.s.
Proof We build up to the general result in three steps.

Case 1. Suppose P(X0 = x) = 1. Let T0 = 0 and, for n ≥ 1, let Tn = inf{m > Tn−1 : Xm = x} be the time of the nth return to x. Let

Vn = (X(Tn−1), . . . , X(Tn − 1))

The vectors Vn are i.i.d. by Exercise 3.1, and the tail σ-field is contained in the exchangeable field of the Vn, so the Hewitt–Savage 0–1 law ((1.1) in Chapter 3, proved there for r.v.'s taking values in a general measurable space) implies that T is trivial in this case.
Case 2. Suppose that the initial distribution is concentrated on one cyclic class, say S0. If A ∈ T then Px(A) ∈ {0, 1} for each x by Case 1. If Px(A) = 0 for all x ∈ S0 then Pµ(A) = 0. Suppose Py(A) > 0, and hence = 1, for some y ∈ S0. Let z ∈ S0. Since pd is irreducible and aperiodic on S0, there is an n so that pn(z, y) > 0 and pn(y, y) > 0. If we write 1A = 1B ∘ θn then the Markov property implies

1 = Py(A) = Ey(Ey(1B ∘ θn | Fn)) = Ey(E_{Xn} 1B)

so Py(B) = 1. Another application of the Markov property gives

Pz(A) = Ez(E_{Xn} 1B) ≥ pn(z, y) > 0

so Pz(A) = 1, and since z ∈ S0 is arbitrary, Pµ(A) = 1.
General Case. From Case 2, we see that P(A | X0 = y) ≡ 1 or ≡ 0 on each cyclic class. This implies that either {X0 ∈ Sr} ⊂ A or {X0 ∈ Sr} ∩ A = ∅ Pµ-a.s. Conversely, it is clear that {X0 ∈ Sr} = {Xnd ∈ Sr i.o.} ∈ T, and the proof is complete.
The next result will help us identify the tail σ-field in transient examples.

(5.9) Theorem. Suppose X0 has initial distribution µ. The equations

h(Xn, n) = Eµ(Z | Fn)  and  Z = lim_{n→∞} h(Xn, n)

set up a 1–1 correspondence between bounded Z ∈ T and bounded space-time harmonic functions, i.e., bounded h : S × {0, 1, . . .} → R so that h(Xn, n) is a martingale.

Proof Let Z ∈ T, write Z = Yn ∘ θn, and let h(x, n) = Ex Yn. Then

Eµ(Z | Fn) = Eµ(Yn ∘ θn | Fn) = h(Xn, n)

by the Markov property, so h(Xn, n) is a martingale. Conversely, if h(Xn, n) is a bounded martingale, using (2.10) and (5.6) from Chapter 4 shows h(Xn, n) → Z ∈ T as n → ∞, and h(Xn, n) = Eµ(Z | Fn).
Exercise 5.3. A random variable Z with Z = Z ∘ θ, and hence Z = Z ∘ θn for all n, is called invariant. Show that there is a 1–1 correspondence between bounded invariant random variables and bounded harmonic functions. We will have more to say about invariant r.v.'s in Section 6.1.
Example 5.6. Simple random walk in d dimensions. We begin by constructing a coupling for this process. Let i1, i2, . . . be i.i.d. uniform on {1, . . . , d}. Let ξ1, ξ2, . . . and η1, η2, . . . be i.i.d. uniform on {−1, 1}. Let ej be the jth unit vector. Construct a coupled pair of d-dimensional simple random walks by

Xn = Xn−1 + e(in)ξn
Yn = Yn−1 + e(in)ξn   if the in-th coordinates of Xn−1 and Yn−1 agree
Yn = Yn−1 + e(in)ηn   if they do not

In words, the coordinate that changes is always the same in the two walks, and once they agree in one coordinate, future movements in that direction are the same. It is easy to see that if X_0^i − Y_0^i is even for 1 ≤ i ≤ d, then the two random walks will hit with probability one.

Let L0 = {z ∈ Zd : z¹ + · · · + zᵈ is even} and L1 = Zd − L0. Although we have only defined the notion for the recurrent case, it should be clear that L0, L1 is the cyclic decomposition of the state space for simple random walk. If Sn ∈ Li then Sn+1 ∈ L1−i, and p² is irreducible on each Li. To couple two random walks starting from x, y ∈ Li, let them run independently until the first time all the coordinate differences are even, and then use the last coupling. In the remaining case, x ∈ L0, y ∈ L1, coupling is impossible.
The next result should explain our interest in coupling two d-dimensional simple random walks.

(5.10) Theorem. For d-dimensional simple random walk,

T = σ({X0 ∈ Li}, i = 0, 1)

Proof Let x, y ∈ Li, and let Xn, Yn be a realization of the coupling defined above for X0 = x and Y0 = y. Let h(x, n) be a bounded space-time harmonic function. The martingale property implies h(x, 0) = Ex h(Xn, n). If |h| ≤ C, it follows from the coupling that

|h(x, 0) − h(y, 0)| = |Eh(Xn, n) − Eh(Yn, n)| ≤ 2C P(Xn ≠ Yn) → 0

so h(x, 0) is constant on L0 and on L1. Applying the last result to h′(x, m) = h(x, n + m), we see that h(x, n) = a(i, n) for x ∈ Li. The martingale property implies a(i, n) = a(1 − i, n + 1), and the desired result follows from (5.9).
Example 5.7. Ornstein's coupling. Let p(x, y) = f(y − x) be the transition probability for an irreducible aperiodic random walk on Z. To prove that the tail σ-field is trivial, pick M large enough so that the random walk generated by the probability distribution fM, with fM(x) = cM f(x) for |x| ≤ M and fM(x) = 0 for |x| > M, is irreducible and aperiodic. Let Z1, Z2, . . . be i.i.d. with distribution f and let W1, W2, . . . be i.i.d. with distribution fM. Let Xn = Xn−1 + Zn for n ≥ 1. If Xn−1 = Yn−1, we set Xn = Yn. Otherwise, we let

Yn = Yn−1 + Zn   if |Zn| > M
Yn = Yn−1 + Wn   if |Zn| ≤ M

In words, the big jumps are taken in parallel and the small jumps are independent. The recurrence of one-dimensional random walks with mean 0 implies P(Xn ≠ Yn) → 0. Repeating the proof of (5.10), we see that T is trivial.
The tail σ-field in (5.10) is essentially the same as in (5.8). To get a more interesting T, we look at:
Example 5.8. Random walk on a tree. To facilitate definitions, we will consider the system as a random walk on a group with 3 generators a, b, c that have a² = b² = c² = e, the identity element. To form the random walk, let ξ1, ξ2, . . . be i.i.d. with P(ξn = x) = 1/3 for x = a, b, c, and let Xn = Xn−1 ξn. (This is equivalent to a random walk on the tree in which each vertex has degree 3, but the algebraic formulation is convenient for computations.) Let Ln be the length of the word Xn when it has been reduced as much as possible, with Ln = 0 if Xn = e. The reduction can be done as we go along: if the last letter of Xn−1 is the same as ξn, we erase it; otherwise we add the new letter. It is easy to see that Ln is a Markov chain with a transition probability that has p(0, 1) = 1 and

p(j, j − 1) = 1/3,  p(j, j + 1) = 2/3  for j ≥ 1

As n → ∞, Ln → ∞. From this, it follows easily that the word Xn has a limit in the sense that the ith letter X_n^i stays the same for large n. Let X∞ be the limiting word, i.e., X_∞^i = lim X_n^i. Clearly T ⊃ σ(X_∞^i, i ≥ 1), but it is easy to see that this is not all. If S0 = the words of even length, and S1 = S0^c, then Xn ∈ Si implies Xn+1 ∈ S1−i, so {X0 ∈ S0} ∈ T. Can the reader prove that we have now found all of T? As Fermat once said, "I have a proof but it won't fit in the margin."
Remark. This time the solution does not involve elliptic curves but uses "h-paths." See Furstenberg (1970) or decode the following: "Condition on the exit point (the infinite word). Then the resulting RW is an h-process, which moves closer to the boundary with probability 2/3 and farther with probability 1/3 (1/6 each to the two possibilities). Two such random walks couple, provided they have the same parity." The quote is from Robin Pemantle, who says he consulted Itai Benjamini and Yuval Peres.
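The reduce-as-you-go description is easy to code. The sketch below (my own illustration) simulates the walk and exhibits the drift Ln/n → 2/3 − 1/3 = 1/3 implied by the transition probabilities above:

```python
import random

def multiply(word, letter):
    """Right-multiply a reduced word by a generator (a, b, c with g*g = e)."""
    if word and word[-1] == letter:
        return word[:-1]          # the last letter cancels
    return word + letter

rng = random.Random(3)
word = ""
steps = 30_000
for _ in range(steps):
    word = multiply(word, rng.choice("abc"))

print(len(word) / steps)  # close to the drift 1/3
```

From any nonempty reduced word exactly one of the three generators cancels, which is where p(j, j − 1) = 1/3 and p(j, j + 1) = 2/3 come from.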
Exercises

5.4. M/G/1 queue. Let ξ1, ξ2, . . . be i.i.d. with P(ξm = k) = ak+1 for k ≥ −1, where the ak are defined by the formula in Example 1.4. In particular, it follows that aj > 0 for all j ≥ 0. Let Sn = x + ξ1 + · · · + ξn, where x ≥ 0, and let

Xn = Sn + (min_{m≤n} Sm)⁻

(i) Show that Xn has the same distribution as the M/G/1 queue (Example 1.4) starting from X0 = x. (ii) Use this to conclude that if µ = Σ_k k ak < 1, then as n → ∞

(1/n) |{m ≤ n : Xm−1 = 0, ξm = −1}| → (1 − µ)  a.s.

This gives a roundabout way of getting the result π(0) = (1 − µ)/a0 proved in Example 4.6.
5.5. Strong law for additive functionals. Suppose p is irreducible and has stationary distribution π. Let f be a function with Σ |f(y)| π(y) < ∞. Let T_x^k be the time of the kth return to x. (i) Show that

V_k^f = f(X(T_x^k)) + · · · + f(X(T_x^{k+1} − 1)),  k ≥ 1,  are i.i.d.

with E|V_k^f| < ∞. (ii) Let Kn = inf{k : T_x^k ≥ n} and show that

(1/n) Σ_{m=1}^{Kn} V_m^f → E V_1^f / Ex T_x^1 = Σ_y f(y) π(y)  Pµ-a.s.

(iii) Show that max_{1≤m≤n} V_m^{|f|}/n → 0 and conclude

(1/n) Σ_{m=1}^n f(Xm) → Σ_y f(y) π(y)  Pµ-a.s.

for any initial distribution µ.
5.6. Central limit theorem for additive functionals. Suppose, in addition to the conditions in Exercise 5.5, that Σ f(y) π(y) = 0 and Ex(V_k^{|f|})² < ∞. (i) Use the random index central limit theorem (Exercise 4.6 in Chapter 2) to conclude that for any initial distribution µ

(1/√n) Σ_{m=1}^{Kn} V_m^f ⇒ cχ  under Pµ

(ii) Show that max_{1≤m≤n} V_m^{|f|}/√n → 0 in probability and conclude

(1/√n) Σ_{m=1}^n f(Xm) ⇒ cχ  under Pµ
5.7. Ratio Limit Theorems. (5.1) does not say much in the null recurrent case. To get a more informative limit theorem, suppose that y is recurrent and m is the (unique up to constant multiples) stationary measure on Cy = {z : ρyz > 0}. Let Nn(z) = |{m ≤ n : Xm = z}|. Break up the path at successive returns to y and show that Nn(z)/Nn(y) → m(z)/m(y) Px-a.s. for all x, z ∈ Cy. Note that n → Nn(z) is increasing, so this is much easier than the previous problem.
5.8. We got (5.2) from (5.1) by taking expected value. This does not work for the ratio in the previous exercise, so we need another approach. Suppose z ≠ y. (i) Let p̄n(x, z) = Px(Xn = z, Ty > n) and decompose pm(x, z) according to the value of J = sup{j ∈ [1, m) : Xj = y} to get

Σ_{m=1}^n pm(x, z) = Σ_{m=1}^n p̄m(x, z) + Σ_{j=1}^{n−1} pj(x, y) Σ_{k=1}^{n−j} p̄k(y, z)

(ii) Show that

Σ_{m=1}^n pm(x, z) / Σ_{m=1}^n pm(x, y) → m(z)/m(y)

5.9. Show that if S is finite and p is irreducible and aperiodic, then there is an m so that pm(x, y) > 0 for all x, y.
5.10. Show that if S is finite, p is irreducible and aperiodic, and T is the coupling time defined in the proof of (5.5), then P(T > n) ≤ C rⁿ for some r < 1 and C < ∞. So the convergence to equilibrium occurs exponentially rapidly in this case. Hint: First consider the case in which p(x, y) > 0 for all x and y, and reduce the general case to this one by looking at a power of p.
5.11. For any transition matrix p, define

αn = sup_{i,j} (1/2) Σ_k |pn(i, k) − pn(j, k)|

The 1/2 is there because for any i and j we can define r.v.'s X and Y so that P(X = k) = pn(i, k), P(Y = k) = pn(j, k), and

P(X ≠ Y) = (1/2) Σ_k |pn(i, k) − pn(j, k)|

Show that αm+n ≤ αn αm. Here you may find that the coupling interpretation helps keep you from getting lost in the algebra.
Remark. Using (9.1) in Chapter 1, we can conclude that

(1/n) log αn → inf_{m≥1} (1/m) log αm

so if αm < 1 for some m, it approaches 0 exponentially fast.

*5.6. General State Space
In this section, we will generalize the results from the last three sections to a
collection of Markov chains with uncountable state space called Harris chains.
The developments here are motivated by three ideas. First, the proofs in the last
two sections work if there is one point in the state space that the chain hits with
probability one. (Think, for example, about the construction of the stationary
measure in (4.3).) Second, a recurrent Harris chain can be modiﬁed to contain
such a point. Third, the collection of Harris chains is a comfortable level of
generality; broad enough to contain a large number of interesting examples, yet
restrictive enough to allow for a rich theory.
We say that a Markov chain Xn is a Harris chain if we can find sets A, B ∈ S, a function q with q(x, y) ≥ ε > 0 for x ∈ A, y ∈ B, and a probability measure ρ concentrated on B so that:

(i) If τA = inf{n ≥ 0 : Xn ∈ A}, then Pz(τA < ∞) > 0 for all z ∈ S.

(ii) If x ∈ A and C ⊂ B, then p(x, C) ≥ ∫_C q(x, y) ρ(dy).

To explain the definition we turn to some examples:
Example 6.1. Countable state space. If S is countable and there is a point a with ρxa > 0 for all x (a condition slightly weaker than irreducibility), then we can take A = {a}, B = {b} where b is any state with p(a, b) > 0, ρ = δb the point mass at b, and q(a, b) = p(a, b).

Conversely, if S is countable and (A′, B′) is a pair for which (i) and (ii) hold, then we can without loss of generality reduce B′ to a single point b. Having done this, if we set A = {b}, pick c so that p(b, c) > 0, and set B = {c}, then (i) and (ii) hold with A and B both singletons.
Example 6.2. Chains with continuous densities. Suppose Xn ∈ Rd is a Markov chain with a transition probability that has p(x, dy) = p(x, y) dy, where (x, y) → p(x, y) is continuous. Pick (x0, y0) so that p(x0, y0) > 0. Let A and B be open sets around x0 and y0 that are small enough so that p(x, y) ≥ ε > 0 on A × B. If we let ρ(C) = |B ∩ C|/|B|, where |B| is the Lebesgue measure of B, then (ii) holds. If (i) holds, then Xn is a Harris chain.

For concrete examples, consider:

(a) Diffusion processes are a large class of examples that lie outside the scope of this book, but are too important to ignore. When things are nice, specifically, if the generator of X has Hölder continuous coefficients satisfying suitable growth conditions (see the Appendix of Dynkin (1965)), then Px(X1 ∈ dy) = p(x, y) dy, and p satisfies the conditions above.
(b) ARMAPs. Let ξ1, ξ2, . . . be i.i.d. and Vn = θVn−1 + ξn. Vn is called an autoregressive moving average process, or armap for short. We call Vn a smooth armap if the distribution of ξn has a continuous density g. In this case p(x, dy) = g(y − θx) dy with (x, y) → g(y − θx) continuous.

In analyzing the behavior of armaps there are a number of cases to consider, depending on the nature of the support of ξn. We call Vn a simple armap if the density function of ξn is positive at all points in R. In this case we can take A = B = [−1/2, 1/2] with ρ = the restriction of Lebesgue measure.
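As a concrete illustration (my own, with standard normal ξn and θ = 1/2): when |θ| < 1 the variance recursion Var(Vn) = θ²Var(Vn−1) + 1 has the fixed point 1/(1 − θ²), so the chain settles into a centered normal stationary law with that variance.

```python
import random

theta = 0.5

v = 0.0                         # exact second-moment recursion from V_0 = 0
for _ in range(100):
    v = theta ** 2 * v + 1.0
print(v)                        # converges to 1/(1 - theta^2) = 4/3

rng = random.Random(5)          # Monte Carlo check of the stationary variance
V, samples = 0.0, []
for _ in range(50_000):
    V = theta * V + rng.gauss(0, 1)
    samples.append(V)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(var)                      # close to 4/3
```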
(c) The discrete Ornstein–Uhlenbeck process is a special case of (a) and (b). Let ξ1, ξ2, . . . be i.i.d. standard normals and let Vn = θVn−1 + ξn. The Ornstein–Uhlenbeck process is a diffusion process {Vt, t ∈ [0, ∞)} that models the velocity of a particle suspended in a liquid. See, e.g., Breiman (1968), Section 16.1. Looking at Vt at integer times (and dividing by a constant to make the variance 1) gives a Markov chain with the indicated distributions.
Example 6.3. GI/G/1 queue, or storage model. Let ξ1 , ξ2 , . . . be i.i.d.
and deﬁne Wn inductively by Wn = (Wn−1 + ξn )+ . If P (ξn < 0) > 0 then
we can take A = B = {0} and (i) and (ii) hold. To explain the ﬁrst name in
the title, consider a queueing system in which customers arrive at times of a
renewal process, i.e., at times 0 = T0 < T1 < T2 . . . with ζn = Tn − Tn−1 , n ≥ 1
i.i.d. Let ηn , n ≥ 0, be the amount of service time the nth customer requires
and let ξn = ηn−1 − ζn . I claim that Wn is the amount of time the nth customer
has to wait to enter service. To see this, notice that the (n − 1)th customer
adds ηn−1 to the server’s workload, and if the server is busy at all times in
[Tn−1 , Tn ), he reduces his workload by ζn . If Wn−1 + ηn−1 < ζn then the server
has enough time to ﬁnish his work and the next arriving customer will ﬁnd an
empty queue.
The second name in the title refers to the fact that Wn can be used to model
the contents of a storage facility. For an intuitive description, consider water
reservoirs. We assume that rain storms occur at times of a renewal process
{Tn : n ≥ 1}, that the nth rainstorm contributes an amount of water ηn , and
that water is consumed at constant rate c. If we let ζn = Tn − Tn−1 as before,
and ξn = ηn−1 − cζn , then Wn gives the amount of water in the reservoir just
before the nth rainstorm.
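The recursion Wn = (Wn−1 + ξn)+ is trivial to run. A toy sketch (the service and interarrival times are invented for illustration):

```python
def waiting_times(services, interarrivals):
    """Lindley recursion W_n = (W_{n-1} + xi_n)^+ with xi_n = eta_{n-1} - zeta_n."""
    W = [0.0]                              # customer 0 finds an empty system
    for eta_prev, zeta in zip(services, interarrivals):
        W.append(max(W[-1] + eta_prev - zeta, 0.0))
    return W

# customer n-1 needs eta_{n-1} units of service; customer n arrives zeta_n later
print(waiting_times([3, 1, 4], [2, 2, 2]))  # -> [0.0, 1.0, 0.0, 2.0]
```

In the run above, customer 1 waits because customer 0's long service spills over, customer 2 finds the server free, and customer 3 waits behind customer 2's long service, matching the verbal description of the workload.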
History Lesson. Doeblin was the first to prove results for Markov chains on a general state space. He supposed that there was an n so that pn(x, C) ≥ ε ρ(C) for all x ∈ S and C ⊂ S. See Doob (1953), Section V.5, for an account of his results. Harris (1956) generalized Doeblin's result by observing that it was enough to have a set A so that (i) holds and the chain viewed on A (Yk = X(T_A^k), where T_A^k = inf{n > T_A^{k−1} : Xn ∈ A} and T_A^0 = 0) satisfies Doeblin's condition. Our formulation, as well as most of the proofs in this section, follows Athreya and Ney (1978). For a nice description of the "traditional approach," see Revuz (1984).
Given a Harris chain on (S, S), we will construct a Markov chain X̄n with transition probability p̄ on (S̄, S̄), where S̄ = S ∪ {α} and S̄ = {B, B ∪ {α} : B ∈ S}. The aim, as advertised earlier, is to manufacture a point α that the process hits with probability 1 in the recurrent case.

If x ∈ S − A:  p̄(x, C) = p(x, C) for C ∈ S
If x ∈ A:  p̄(x, {α}) = ε and p̄(x, C) = p(x, C) − ε ρ(C) for C ∈ S
If x = α:  p̄(α, D) = ∫ ρ(dx) p̄(x, D) for D ∈ S̄

Intuitively, X̄n = α corresponds to Xn being distributed on B according to ρ. Here and in what follows, we will reserve A and B for the special sets that occur in the definition and use C and D for generic elements of S. We will often simplify notation by writing p̄(x, α) instead of p̄(x, {α}), µ(α) instead of µ({α}), etc.
Our next step is to prove three technical lemmas, (6.1)–(6.3), that will help us develop the theory below. Define a transition probability v by

v(x, {x}) = 1 if x ∈ S,   v(α, C) = ρ(C)

In words, v leaves mass in S alone but returns the mass at α to S and distributes it according to ρ.
(6.1) Lemma. v p̄ = p̄ and p̄ v = p.

Proof Before giving the proof, we would like to remind the reader that measures multiply the transition probability on the left, i.e., in the first case we want to show µv p̄ = µp̄. If we first make a transition according to v and then one according to p̄, this amounts to one transition according to p̄, since only mass at α is affected by v, and

p̄(α, D) = ∫ ρ(dx) p̄(x, D)

The second equality also follows easily from the definition. In words, if p̄ acts first and then v, then v returns the mass at α to where it came from.

From (6.1), it follows easily that we have:
(6.2) Lemma. Let Yn be an inhomogeneous Markov chain with p2k = v and p2k+1 = p̄. Then X̄n = Y2n is a Markov chain with transition probability p̄, and Xn = Y2n+1 is a Markov chain with transition probability p.
(6.2) shows that there is an intimate relationship between the asymptotic behavior of Xn and of X̄n. To quantify this, we need a definition. If f is a bounded measurable function on S, let f̄ = vf, i.e., f̄(x) = f(x) for x ∈ S and f̄(α) = ∫ f dρ.

(6.3) Lemma. If µ is a probability measure on (S, S), then

Eµ f(Xn) = Eµ f̄(X̄n)

Proof Observe that if Xn and X̄n are constructed as in (6.2), and P(X̄0 ∈ S) = 1, then X̄0 = X0 and Xn is obtained from X̄n by making a transition according to v.
(6.1)–(6.3) will allow us to obtain results for Xn from those for X̄n. We turn now to the task of generalizing the results of Sections 5.3–5.5 to X̄n. To facilitate comparison with the results for countable state space, we will break this section into four subsections, the first three of which correspond to Sections 5.3–5.5. In the fourth subsection, we take an in-depth look at Example 6.3. Before developing the theory, we will give one last example that explains why some of the statements are messy.

Example 6.4. Perverted O.U. process. Take the discrete O.U. process of Example 6.2 and modify the transition probability at the integers x ≥ 2 so that

p(x, {x + 1}) = 1 − x⁻²
p(x, A) = x⁻² |A|  for A ⊂ (0, 1)

p is the transition probability of a Harris chain, but

P2(Xn = n + 2 for all n) > 0

I can sympathize with the reader who thinks that such chains will not arise "in applications," but it seems easier (and better) to adapt the theory to include them than to modify the assumptions to exclude them.

a. Recurrence and Transience
We begin with the dichotomy between recurrence and transience. Let R = inf{n ≥ 1 : X̄n = α}. If Pα(R < ∞) = 1 then we call the chain recurrent; otherwise we call it transient. Let R1 = R and, for k ≥ 2, let Rk = inf{n > Rk−1 : X̄n = α} be the time of the kth return to α. The strong Markov property implies Pα(Rk < ∞) = Pα(R < ∞)^k, so Pα(X̄n = α i.o.) = 1 in the recurrent case and = 0 in the transient case. It is easy to generalize (3.3) to the current setting.

Exercise 6.1. X̄n is recurrent if and only if Σ_{n=1}^∞ p̄n(α, α) = ∞.

The next result generalizes (3.4).
(6.4) Theorem. Let λ(C) = Σ_{n=1}^∞ 2⁻ⁿ p̄n(α, C). In the recurrent case, if λ(C) > 0 then Pα(X̄n ∈ C i.o.) = 1. For λ-a.e. x, Px(R < ∞) = 1.

Proof The first conclusion follows from (2.3). For the second, let D = {x : Px(R < ∞) < 1} and observe that if p̄n(α, D) > 0 for some n, then

Pα(X̄m = α i.o.) ≤ ∫ p̄n(α, dx) Px(R < ∞) < 1

which contradicts Pα(X̄m = α i.o.) = 1 in the recurrent case; hence p̄n(α, D) = 0 for all n, i.e., λ(D) = 0.
Remark. Example 6.4 shows that we cannot expect to have Px(R < ∞) = 1 for all x. To see that, even when the state space is countable, we need not hit every point starting from α, do

Exercise 6.2. If Xn is a recurrent Harris chain on a countable state space, then S can only have one irreducible set of recurrent states but may have a nonempty set of transient states. For a concrete example, consider a branching process in which the probability of no children p0 > 0, and set A = B = {0}.
Exercise 6.3. Suppose X̄_n is a recurrent Harris chain. Show that if (A′, B′) is another pair satisfying the conditions of the definition, then (6.4) implies P_α(X̄_n ∈ A′ i.o.) = 1, so the recurrence or transience does not depend on the choice of (A, B).

As in Section 6.3, we need special methods to determine whether an example is recurrent or transient.
Exercise 6.4. In the GI/G/1 queue, the waiting time Wn and the random
walk Sn = X0 + ξ1 + · · · + ξn agree until N = inf {n : Sn < 0}, and at this
time WN = 0. Use this observation to show that Example 6.3 is recurrent when
Eξn ≤ 0 and transient when Eξn > 0.
Exercise 6.5. Let V_n be a simple smooth armap with E|ξ_i| < ∞. Show that if θ < 1 then E_x|V_1| ≤ |x| for |x| ≥ M. Use this and ideas from the proof of (3.9) to show that the chain is recurrent in this case.
Exercise 6.6. Let Vn be an armap (not necessarily smooth or simple) and
suppose θ > 1. Let γ ∈ (1, θ) and observe that if x > 0 then Px (V1 < γx) ≤
C/((θ − γ )x), so if x is large, Px (Vn ≥ γ n x for all n) > 0.
Remark. In the case θ = 1 the chain Vn discussed in the last two exercises is
a random walk with mean 0 and hence recurrent.
Exercise 6.7. In the discrete O.U. process, X_{n+1} is normal with mean θX_n and variance 1. What happens to the recurrence and transience if instead Y_{n+1} is normal with mean 0 and variance β²|Y_n|?

b. Stationary Measures
(6.5) Theorem. In the recurrent case, there is a stationary measure.

Proof  Let R = inf{n ≥ 1 : X̄_n = α}, and let

μ̄(C) = E_α ( ∑_{n=0}^{R−1} 1_{{X̄_n ∈ C}} ) = ∑_{n=0}^∞ P_α(X̄_n ∈ C, R > n)

Repeating the proof of (4.3) shows that μ̄p̄ = μ̄. If we let μ = μ̄v then it follows from (6.1) that μ̄vp = μ̄p̄v = μ̄v, so μp = μ.

Exercise 6.8. Let G_{k,δ} = {x : p̄^k(x, α) ≥ δ}. Show that μ̄(G_{k,δ}) ≤ 2k/δ and use this to conclude that μ̄, and hence μ, is σ-finite.

Exercise 6.9. Let λ be the measure defined in (6.4). Show that μ̄ ≪ λ and λ ≪ μ̄.

Exercise 6.10. Let V_n be an armap (not necessarily smooth or simple) with θ < 1 and E log⁺|ξ_n| < ∞. Show that ∑_{m≥0} θ^m ξ_m converges a.s. and defines a stationary distribution for V_n.
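Exercise 6.10 can be sanity-checked numerically (an illustrative sketch, not part of the text; the standard normal noise is my choice). For N(0, 1) noise, the stationary law ∑_{m≥0} θ^m ξ_m is normal with mean 0 and variance 1/(1 − θ²), so a long trajectory of V_n = θV_{n−1} + ξ_n should match that variance:

```python
import random

random.seed(0)
theta = 0.5
# Iterate V_n = theta*V_{n-1} + xi_n with standard normal noise; the
# stationary law sum_{m>=0} theta^m xi_m is then N(0, 1/(1 - theta^2)).
v, sq, count = 0.0, 0.0, 0
for n in range(200_000):
    v = theta * v + random.gauss(0.0, 1.0)
    if n >= 1_000:           # discard burn-in so V_n is near stationarity
        sq += v * v
        count += 1

var = sq / count
print(var, 1 / (1 - theta**2))   # both close to 1.333
```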
To investigate uniqueness of the stationary measure, we begin with:

(6.6) Lemma. If ν is a σ-finite stationary measure for p, then ν(A) < ∞ and ν̄ = νp̄ is a stationary measure for p̄ with ν̄(α) < ∞.

Proof  We will first show that ν(A) < ∞. If ν(A) = ∞ then part (ii) of the definition implies ν(C) = ∞ for all sets C with ρ(C) > 0. If B = ∪_i B_i with ν(B_i) < ∞ then ρ(B_i) = 0 by the last observation and ρ(B) = 0 by countable subadditivity, a contradiction. So ν(A) < ∞ and ν̄(α) = νp̄(α) = ν(A) < ∞. Using the fact that νp = ν, we find

νp̄(C) = ν(C) − ν(A)ρ(B ∩ C)

the last subtraction being well-defined since ν(A) < ∞, and it follows that ν̄v = ν. To check ν̄p̄ = ν̄, we observe that (6.1) and the last result imply ν̄p̄ = ν̄vp̄ = νp̄ = ν̄.
(6.7) Theorem. Suppose p is recurrent. If ν is a σ-finite stationary measure then ν = ν̄(α)μ, where μ is the measure constructed in the proof of (6.5).

Proof  By (6.6), it suffices to prove that if ν̄ is a stationary measure for p̄ with ν̄(α) < ∞ then ν̄ = ν̄(α)μ̄. Repeating the proof of (4.4) with a = α, it is easy to show that ν̄(C) ≥ ν̄(α)μ̄(C). Continuing to compute as in that proof:

ν̄(α) = ∫ ν̄(dx) p̄^n(x, α) ≥ ν̄(α) ∫ μ̄(dx) p̄^n(x, α) = ν̄(α)μ̄(α) = ν̄(α)

Let S_n = {x : p̄^n(x, α) > 0}. By assumption, ∪_n S_n = S. If ν̄(D) > ν̄(α)μ̄(D) for some D, then ν̄(D ∩ S_n) > ν̄(α)μ̄(D ∩ S_n) for some n, and it follows that ν̄(α) > ν̄(α), a contradiction.

c. Convergence Theorem
We say that a recurrent Harris chain Xn is aperiodic if g.c.d. {n ≥ 1 :
pn (α, α) > 0} = 1. This occurs, for example, if we can take A = B in the
deﬁnition for then p(α, α) > 0.
(6.8) Theorem. Let X_n be an aperiodic recurrent Harris chain with stationary distribution π. If P_x(R < ∞) = 1 then as n → ∞,

‖p^n(x, ·) − π(·)‖ → 0

Note. Here ‖ · ‖ denotes the total variation distance between the measures. (6.4) guarantees that π-a.e. x satisfies the hypothesis.
Proof  In view of (6.3), it suffices to prove the result for p̄. We begin by observing that the existence of a stationary probability measure and the uniqueness result in (6.7) imply that the measure constructed in (6.5) has E_α R = μ̄(S̄) < ∞. As in the proof of (5.5), we let X̄_n and Ȳ_n be independent copies of the chain with initial distributions δ_x and π, respectively, and let τ = inf{n ≥ 0 : X̄_n = Ȳ_n = α}. For m ≥ 0, let S_m (resp. T_m) be the times at which X̄_n (resp. Ȳ_n) visits α for the (m + 1)th time. S_m − T_m is a random walk with mean 0 steps, so M = inf{m ≥ 1 : S_m = T_m} < ∞ a.s., and it follows that this is true for τ as well. The computations in the proof of (5.5) show |P(X̄_n ∈ C) − P(Ȳ_n ∈ C)| ≤ P(τ > n). Since this is true for all C, ‖p̄^n(x, ·) − π(·)‖ ≤ P(τ > n), and the proof is complete.
Exercise 6.11. Use Exercise 6.1 and imitate the proof of (4.5) to show that a
Harris chain with a stationary distribution must be recurrent.
Exercise 6.12. Show that an armap with θ < 1 and E log⁺|ξ_n| < ∞ converges in distribution as n → ∞. Hint: Recall the construction of π in Exercise 6.10.

d. GI/G/1 Queue

For the rest of the section, we will concentrate on the GI/G/1 queue. Let ξ_1, ξ_2, . . . be i.i.d., let W_n = (W_{n−1} + ξ_n)⁺, and let S_n = ξ_1 + · · · + ξ_n. Recall ξ_n = η_{n−1} − ζ_n, where the η's are service times and the ζ's are the interarrival times, and suppose Eξ_n < 0 so that Exercise 6.11 implies there is a stationary distribution.
Exercise 6.13. Let m_n = min(S_0, S_1, . . . , S_n), where S_n is the random walk defined above. (i) Show that S_n − m_n =_d W_n. (ii) Let ξ′_m = ξ_{n+1−m} for 1 ≤ m ≤ n and let S′_m = ξ′_1 + · · · + ξ′_m. Show that S_n − m_n = max(S′_0, S′_1, . . . , S′_n). (iii) Conclude that as n → ∞ we have W_n ⇒ M ≡ max(S_0, S_1, S_2, . . .).
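Behind part (i) is the pathwise identity obtained by unrolling the Lindley recursion from W_0 = 0: W_n = max(0, ξ_n, ξ_n + ξ_{n−1}, . . .) = S_n − min(S_0, . . . , S_n). A small numerical check (an illustrative sketch; the Gaussian step law with negative mean is my choice):

```python
import random

random.seed(2)
# Unrolling W_n = max(W_{n-1} + xi_n, 0) from W_0 = 0 gives the
# pathwise identity W_n = S_n - min(S_0, ..., S_n).
xi = [random.gauss(-0.1, 1.0) for _ in range(1_000)]
w, s, m = 0.0, 0.0, 0.0
for x in xi:
    w = max(w + x, 0.0)      # Lindley recursion
    s += x                   # partial sum S_n
    m = min(m, s)            # running minimum m_n
    assert abs(w - (s - m)) < 1e-9

print("Lindley recursion matches S_n - m_n pathwise")
```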
Explicit formulas for the distribution of M are in general diﬃcult to obtain. However, this can be done if either the arrival or service distribution is
exponential. One reason for this is:
Exercise 6.14. Suppose X , Y ≥ 0 are independent and P (X > x) = e−λx .
Show that P (X − Y > x) = ae−λx , where a = P (X − Y > 0).
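The claim in Exercise 6.14 is easy to test by simulation (an illustrative sketch; the uniform law for Y is my choice, any nonnegative Y works):

```python
import math
import random

random.seed(1)
lam = 2.0
n = 200_000
# X exponential(lam), Y uniform on [0, 1]: the exercise predicts
# P(X - Y > x) = a * exp(-lam * x) for x >= 0, where a = P(X - Y > 0).
diffs = [random.expovariate(lam) - random.random() for _ in range(n)]
a = sum(d > 0 for d in diffs) / n
for x in (0.5, 1.0):
    tail = sum(d > x for d in diffs) / n
    print(x, tail, a * math.exp(-lam * x))   # empirical vs predicted
```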
Example 6.5. Exponential service time. Suppose P(η_n > x) = e^{−βx} and Eζ_n > Eη_n. Let T = inf{n : S_n > 0} and L = S_T, setting L = −∞ if T = ∞. The lack of memory property of the exponential distribution implies that P(L > x) = re^{−βx}, where r = P(T < ∞). To compute the distribution of the maximum, M, let T_1 = T and let T_k = inf{n > T_{k−1} : S_n > S_{T_{k−1}}} for k ≥ 2. (1.3) in Chapter 3 implies that if T_k < ∞ then S(T_{k+1}) − S(T_k) =_d L and is independent of S(T_k). Using this and breaking things down according to the value of K = inf{k : L_{k+1} = −∞}, we see that for x > 0, M has density function

P(M = x) = ∑_{k=1}^∞ r^k (1 − r) e^{−βx} β^k x^{k−1}/(k − 1)! = βr(1 − r) e^{−βx(1−r)}

To complete the calculation, we need to calculate r. To do this, let

ϕ(θ) = E exp(θξ_n) = E exp(θη_{n−1}) E exp(−θζ_n)

which is finite for 0 < θ < β since ζ_n ≥ 0 and η_{n−1} has an exponential distribution. It is easy to see that

ϕ′(0) = Eξ_n < 0    and    lim_{θ↑β} ϕ(θ) = ∞

so there is a θ ∈ (0, β) with ϕ(θ) = 1. Exercise 7.6 in Chapter 4 implies exp(θS_n) is a martingale. (4.1) in Chapter 4 implies 1 = E exp(θS_{T∧n}). Letting n → ∞ and noting that S_T on {T = n} has an exponential distribution and S_n → −∞ on {T = ∞}, we have

1 = r ∫_0^∞ e^{θx} βe^{−βx} dx = rβ/(β − θ)
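The root of ϕ(θ) = 1 and the value of r can be found numerically. In the M/M/1 special case sketched below (service rate β, exponential interarrivals with rate γ < β; these concrete choices are mine, not the text's), ϕ(θ) = β/(β − θ) · γ/(γ + θ), the root is θ = β − γ, and 1 = rβ/(β − θ) gives r = γ/β, which a simulation of P(T < ∞) confirms:

```python
import random

random.seed(3)
beta, gamma = 3.0, 2.0   # service rate beta, arrival rate gamma < beta

# phi(theta) = E e^{theta*eta} * E e^{-theta*zeta}
#            = beta/(beta-theta) * gamma/(gamma+theta) for 0 < theta < beta
def phi(theta):
    return beta / (beta - theta) * gamma / (gamma + theta)

# Bisection for the root of phi(theta) = 1 in (0, beta);
# analytically the root is theta = beta - gamma = 1.
lo, hi = 1e-9, beta - 1e-9
for _ in range(200):
    mid = (lo + hi) / 2
    if phi(mid) < 1.0:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2
r = gamma / beta         # from 1 = r*beta/(beta - theta)

# Monte Carlo estimate of r = P(T < infinity) = P(max_n S_n > 0).
hits = 0
for _ in range(20_000):
    s = 0.0
    for _ in range(300):  # S_n drifts to -infinity, so 300 steps suffice here
        s += random.expovariate(beta) - random.expovariate(gamma)
        if s > 0:
            hits += 1
            break
print(theta, r, hits / 20_000)   # ~1.0, 0.667, and an estimate near 0.667
```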
Example 6.6. Poisson arrivals. Suppose P(ζ_n > x) = e^{−αx} and Eζ_n > Eη_n. Let S̄_n = −S_n. Reversing time as in (ii) of Exercise 6.13, we see (for n ≥ 1)

P( max_{0≤k<n} S̄_k < S̄_n ∈ A ) = P( min_{1≤k≤n} S̄_k > 0, S̄_n ∈ A )

Let ψ_n(A) be the common value of the last two expressions and let ψ(A) = ∑_{n≥0} ψ_n(A). ψ_n(A) is the probability the random walk reaches a new maximum (or ladder height, see Example 1.4 in Chapter 3) in A at time n, so ψ(A) is the expected number of ladder points in A, with ψ({0}) = 1. Letting the random walk take one more step,

P( min_{1≤k≤n} S̄_k > 0, S̄_{n+1} ≤ x ) = ∫ F(x − z) dψ_n(z)

The last identity is valid for n = 0 if we interpret the left-hand side as F(x). Let τ = inf{n ≥ 1 : S̄_n ≤ 0} and x ≤ 0. Integrating by parts on the right-hand side and then summing over n ≥ 0 gives

(6.9)    P(S̄_τ ≤ x) = ∑_{n=0}^∞ P( min_{1≤k≤n} S̄_k > 0, S̄_{n+1} ≤ x ) = ∫_{y≤x} ψ[0, x − y] dF(y)

The limit y ≤ x comes from the fact that ψ((−∞, 0)) = 0.
Let ξ̄_n = S̄_n − S̄_{n−1} = −ξ_n. Exercise 6.14 implies P(ξ̄_n > x) = ae^{−αx}. Let T̄ = inf{n : S̄_n > 0}. Since E ξ̄_n > 0, P(T̄ < ∞) = 1. Let J = S̄_{T̄}. As in the previous example, P(J > x) = e^{−αx}. Let V_n = J_1 + · · · + J_n. V_n is a rate α Poisson process, so ψ[0, x − y] = 1 + α(x − y) for x − y ≥ 0. Using (6.9) now and integrating by parts gives

(6.10)    P(S̄_τ ≤ x) = ∫_{y≤x} (1 + α(x − y)) dF(y) = F(x) + α ∫_{−∞}^x F(y) dy    for x ≤ 0

Since P(S̄_n = 0) = 0 for n ≥ 1, −S̄_τ has the same distribution as S_T, where T = inf{n : S_n > 0}. Combining this with part (ii) of Exercise 6.13 gives a "formula" for P(M > x). Straightforward but somewhat tedious calculations show that if B(s) = E exp(−sη_n), then

E exp(−sM) = (1 − α · Eη)s / (s − α + αB(s))

a result known as the Pollaczek–Khintchine formula. The computations we omitted can be found in Billingsley (1979) on p. 277 or several times in Feller,
Vol. II (1971).

6 Ergodic Theorems

X_n, n ≥ 0, is said to be a stationary sequence if for each k ≥ 1 it has the same
distribution as the shifted sequence Xn+k , n ≥ 0. The basic fact about these
sequences, called the ergodic theorem, is that if E|f(X_0)| < ∞ then

(2.1)    lim_{n→∞} (1/n) ∑_{m=0}^{n−1} f(X_m)    exists a.s.

If X_n is ergodic (a generalization of the notion of irreducibility for Markov
chains) then the limit is Ef (X0 ). Sections 6.1 and 6.2 develop the theory
needed to prove the ergodic theorem. The remaining ﬁve sections of this chapter
develop various complements and, with the exception of Section 6.7, which gives
applications of a result proved in Section 6.6, can be read in any order. In
Section 6.3, we apply the ergodic theorem to study the recurrence of stationary
sequences. In Section 6.4, we study “mixing,” an asymptotic independence
property stronger than ergodicity. In Section 6.5, we discuss entropy and give
a proof of the Shannon–McMillan–Breiman theorem for X_n's taking values in
a ﬁnite set. In Section 6.6, we prove the subadditive ergodic theorem. As ﬁve
examples in Section 6.6 and four applications in Section 6.7 should indicate,
this result is a useful generalization of the ergodic theorem. 6.1. Deﬁnitions and Examples
X0 , X1 , . . . is said to be a stationary sequence if for every k , the sequence
Xk , Xk+1 , . . . has the same distribution, i.e., for each n, (X0 , . . . , Xn ) and
(Xk , . . . , Xk+n ) have the same distribution. We begin by giving four examples that will be our constant companions.
Example 1.1. X0 , X1 , . . . are i.i.d.
Example 1.2. Let X_n be a Markov chain with transition probability p(x, A) and stationary distribution π, i.e., π(A) = ∫ π(dx) p(x, A). If X_0 has distribution π then X_0, X_1, . . . is a stationary sequence. A special case to keep in mind for counterexamples is the chain with state space S = {0, 1} and transition probability p(x, {1 − x}) = 1. In this case, the stationary distribution has π(0) = π(1) = 1/2 and (X_0, X_1, . . .) = (0, 1, 0, 1, . . .) or (1, 0, 1, 0, . . .) with probability 1/2 each.
Example 1.3. Rotation of the circle. Let Ω = [0, 1), F = Borel subsets,
P = Lebesgue measure. Let θ ∈ (0, 1), and for n ≥ 0, let Xn (ω ) = (ω + nθ)
mod 1, where x mod 1 = x − [x], [x] being the greatest integer ≤ x. To see the
reason for the name, map [0, 1) into C by x → exp(2πix). This example is a
special case of the last one. Let p(x, {y }) = 1 if y = (x + θ) mod 1.
To make new examples from old, we can use:
(1.1) Theorem. If X0 , X1 , . . . is a stationary sequence and g : R{0,1,...} → R
is measurable then Yk = g (Xk , Xk+1 , . . .) is a stationary sequence.
Proof If x ∈ R{0,1,...} , let gk (x) = g (xk , xk+1 , . . .), and if B ∈ R{0,1,...} let
A = {x : (g0 (x), g1 (x), . . .) ∈ B } To check stationarity now, we observe:
P (ω : (Y0 , Y1 , . . .) ∈ B ) = P (ω : (X0 , X1 , . . .) ∈ A)
= P (ω : (Xk , Xk+1 , . . .) ∈ A)
= P (ω : (Yk , Yk+1 , . . .) ∈ B )
Example 1.4. Bernoulli shift. Ω = [0, 1), F = Borel subsets, P = Lebesgue
measure. Y0 (ω ) = ω and for n ≥ 1, let Yn (ω ) = (2 Yn−1 (ω )) mod 1. This
example is a special case of (1.1). Let X_0, X_1, . . . be i.i.d. with P(X_i = 0) = P(X_i = 1) = 1/2, and let g(x) = ∑_{i=0}^∞ x_i 2^{−(i+1)}. The name comes from the
fact that multiplying by 2 shifts the X ’s to the left. This example is also a
special case of Example 1.2. Let p(x, {y }) = 1 if y = (2x) mod 1.
Examples 1.3 and 1.4 are special cases of the following situation.
Example 1.5. Let (Ω, F , P ) be a probability space. A measurable map ϕ :
Ω → Ω is said to be measure preserving if P (ϕ−1 A) = P (A) for all A ∈
F . Let ϕn be the nth iterate of ϕ deﬁned inductively by ϕn = ϕ(ϕn−1 ) for
n ≥ 1, where ϕ0 (ω ) = ω . We claim that if X ∈ F , then Xn (ω ) = X (ϕn ω )
deﬁnes a stationary sequence. To check this, let B ∈ Rn+1 and A = {ω :
(X0 (ω ), . . . , Xn (ω )) ∈ B }. Then
P((X_k, . . . , X_{k+n}) ∈ B) = P(ϕ^k ω ∈ A) = P(ω ∈ A) = P((X_0, . . . , X_n) ∈ B)
The last example is more than an important example. In fact, it is the only
example! If Y0 , Y1 , . . . is a stationary sequence taking values in a nice space,
Kolmogorov’s extension theorem, (7.1) in the Appendix, allows us to construct
a measure P on sequence space (S {0,1,...} , S {0,1,...} ), so that the sequence
Xn (ω ) = ωn has the same distribution as that of {Yn , n ≥ 0}. If we let ϕ
be the shift operator, i.e., ϕ(ω0 , ω1 , . . .) = (ω1 , ω2 , . . .), and let X (ω ) = ω0 , then
ϕ is measure preserving and Xn (ω ) = X (ϕn ω ).
In some situations, e.g., in the proof of (3.3) below, it is useful to observe:
(1.2) Theorem. Any stationary sequence {Xn , n ≥ 0} can be embedded in a
twosided stationary sequence {Yn : n ∈ Z}.
Proof We observe that
P (Y−m ∈ A0 , . . . , Yn ∈ Am+n ) = P (X0 ∈ A0 , . . . , Xm+n ∈ Am+n ) is a consistent set of ﬁnite dimensional distributions, so a trivial generalization
of the Kolmogorov extension theorem implies there is a measure P on (S Z , S Z )
so that the variables Yn (ω ) = ωn have the desired distributions.
In view of the observations above, it suﬃces to give our deﬁnitions and
prove our results in the setting of Example 1.5. Thus, our basic set up consists
of
(Ω, F , P )
ϕ
Xn ( ω ) = X ( ϕ n ω ) a probability space
a map that preserves P
where X is a random variable We will now give some important deﬁnitions. Here and in what follows we
assume ϕ is measurepreserving. A set A ∈ F is said to be invariant if
ϕ−1 A = A. (Here, as usual, two sets are considered to be equal if their symmetric diﬀerence has probability 0.) Some authors call A almost invariant if
P (A∆ϕ−1 (A)) = 0. We call such sets invariant and call B invariant in the
strict sense if B = ϕ−1 (B ).
Exercise 1.1. Show that the class of invariant events I is a σ ﬁeld, and X ∈ I
if and only if X is invariant, i.e., X ◦ ϕ = X a.s.
Exercise 1.2. (i) Let A be any set and let B = ∪_{n=0}^∞ ϕ^{−n}(A). Show ϕ^{−1}(B) ⊂ B. (ii) Let B be any set with ϕ^{−1}(B) ⊂ B and let C = ∩_{n=0}^∞ ϕ^{−n}(B). Show that ϕ^{−1}(C) = C. (iii) Show that A is almost invariant if and only if there is a C invariant in the strict sense with P(A∆C) = 0.
A measurepreserving transformation on (Ω, F , P ) is said to be ergodic if
I is trivial, i.e., for every A ∈ I , P (A) ∈ {0, 1}. If ϕ is not ergodic then the
space can be split into two sets A and Ac , each having positive measure so that
ϕ(A) = A and ϕ(Ac ) = Ac . In words, ϕ is not “irreducible.”
To investigate further the meaning of ergodicity, we turn to our examples.
For an i.i.d. sequence (Example 1.1), we begin by observing that if Ω = R^{0,1,...} and ϕ is the shift operator, then an invariant set A has {ω : ω ∈ A} = {ω : ϕω ∈ A} ∈ σ(X_1, X_2, . . .). Iterating gives

A ∈ ∩_{n=1}^∞ σ(X_n, X_{n+1}, . . .) = T, the tail σ-field

so I ⊂ T. For an i.i.d. sequence, Kolmogorov's 0–1 law implies T is trivial, so I is trivial and the sequence is ergodic (i.e., when the corresponding measure is put on sequence space Ω = R^{0,1,...} the shift is).
Turning to Markov chains (Example 1.2), suppose the state space S is
countable and the stationary distribution has π (x) > 0 for all x ∈ S . By (4.5)
and (3.6) in Chapter 5, all states are recurrent, and we can write S = ∪Ri , where
the Ri are disjoint irreducible closed sets. If X0 ∈ Ri then with probability
one, Xn ∈ Ri for all n ≥ 1 so {ω : X0 (ω ) ∈ Ri } ∈ I . The last observation
shows that if the Markov chain is not irreducible then the sequence is not
ergodic. To prove the converse, observe that if A ∈ I then 1_A ∘ θ_n = 1_A, where θ_n(ω_0, ω_1, . . .) = (ω_n, ω_{n+1}, . . .). So if we let F_n = σ(X_0, . . . , X_n), the shift invariance of 1_A and the Markov property imply

E_π(1_A | F_n) = E_π(1_A ∘ θ_n | F_n) = h(X_n)
where h(x) = E_x 1_A. Lévy's 0–1 law implies that the left-hand side converges to 1_A as n → ∞. If X_n is irreducible and recurrent then for any y ∈ S, the right-hand side = h(y) i.o., so either h(x) ≡ 0 or h(x) ≡ 1, and P_π(A) ∈ {0, 1}.
This example also shows that I and T may be diﬀerent. When the transition
probability p is irreducible I is trivial, but if all the states have period d > 1,
T is not. In (5.8) of Chapter 5, we showed that if S0 , . . . , Sd−1 is the cyclic
decomposition of S , then T = σ ({X0 ∈ Sr } : 0 ≤ r < d).
Exercise 1.3. Give an example of an ergodic measure preserving transformation T on (Ω, F , P ) so that T 2 is not ergodic.
Rotation of the circle (Example 1.3) is not ergodic if θ = m/n where m < n are positive integers. If B is a Borel subset of [0, 1/n) and

A = ∪_{k=0}^{n−1} (B + k/n)
then A is invariant. Conversely, if θ is irrational, then ϕ is ergodic. To prove
this, we need a fact from Fourier analysis. If f is a measurable function on [0, 1) with ∫ f²(x) dx < ∞, then f can be written as f(x) = ∑_k c_k e^{2πikx}, where the equality is in the sense that as K → ∞

∑_{k=−K}^K c_k e^{2πikx} → f(x)    in L²[0, 1)

and this is possible for only one choice of the coefficients c_k = ∫ f(x) e^{−2πikx} dx.
Now

f(ϕ(x)) = ∑_k c_k e^{2πik(x+θ)} = ∑_k (c_k e^{2πikθ}) e^{2πikx}

The uniqueness of the coefficients c_k implies that f(ϕ(x)) = f(x) if and only if c_k(e^{2πikθ} − 1) = 0 for all k. If θ is irrational, this implies c_k = 0 for k ≠ 0, so f is constant. Applying the last result to f = 1_A with A ∈ I shows that A = ∅ or [0, 1) a.s.
Exercise 1.4. A direct proof of ergodicity. (i) Show that if θ is irrational, x_n = nθ mod 1 is dense in [0, 1). Hint: All the x_n are distinct, so for any N < ∞, |x_n − x_m| ≤ 1/N for some m < n ≤ N. (ii) Use Exercise 3.1 in the Appendix to show that if A is a Borel set with |A| > 0, then for any δ > 0 there is an interval J = [a, b) so that |A ∩ J| > (1 − δ)|J|. (iii) Combine this with (i) to conclude P(A) = 1.
Finally, the Bernoulli shift (Example 1.4) is ergodic. To prove this, we recall that the stationary sequence Y_n(ω) = ϕ^n(ω) can be represented as

Y_n = ∑_{m=0}^∞ 2^{−(m+1)} X_{n+m}

where X_0, X_1, . . . are i.i.d. with P(X_k = 1) = P(X_k = 0) = 1/2, and use the following fact:
(1.3) Theorem. Let g : R{0,1,...} → R be measurable. If X0 , X1 , . . . is an
ergodic stationary sequence, then Yk = g (Xk , Xk+1 , . . .) is ergodic.
Proof Suppose X0 , X1 , . . . is deﬁned on sequence space with Xn (ω ) = ωn .
If B has {ω : (Y0 , Y1 , . . .) ∈ B } = {ω : (Y1 , Y2 , . . .) ∈ B } then A = {ω :
(Y0 , Y1 , . . .) ∈ B } is shift invariant.
Remark. The proofs of (1.1) and (1.3) generalize easily to functions g : R^Z → R of a two-sided stationary sequence X_n, n ∈ Z. An example of the use of this generalization is the following: Let ξ_n, n ∈ Z, be i.i.d. with Eξ_n = 0 and Eξ_n² < ∞, and let c_n, n ∈ Z, be constants with ∑ c_n² < ∞. (8.3) in Chapter 1 implies that X_n = ∑_m c_{n−m} ξ_m, n ≥ 0, converges a.s.; generalizations of (1.1) and (1.3) imply that it is stationary and ergodic.
Exercises
1.5. Use Fourier analysis as in Example 1.3 to prove that Example 1.4 is ergodic.
1.6. Continued fractions. Let ϕ(x) = 1/x − [1/x] for x ∈ (0, 1) and A(x) = [1/x], where [1/x] = the largest integer ≤ 1/x. a_n = A(ϕ^n x), n = 0, 1, 2, . . . gives the continued fraction representation of x, i.e.,

x = 1/(a_0 + 1/(a_1 + 1/(a_2 + 1/ . . .)))

Show that ϕ preserves μ(A) = (1/log 2) ∫_A dx/(1 + x) for A ⊂ (0, 1).

Remark. In his (1959) monograph, Kac claimed that it was "entirely trivial"
to check that ϕ is ergodic but retracted his claim in a later footnote. We leave
it to the reader to construct a proof or look up the answer in Ryll-Nardzewski (1951). Chapter 9 of Lévy (1937) is devoted to this topic and is still interesting reading today.
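The objects in Exercise 1.6 are easy to explore numerically (an illustrative sketch, not part of the text): iterating ϕ recovers the partial quotients, and sampling from the Gauss measure and applying ϕ leaves the distribution unchanged, which is the invariance the exercise asks for.

```python
import math
import random

def gauss_map(x):
    # phi(x) = 1/x - [1/x], the continued-fraction shift on (0, 1)
    return 1.0 / x - math.floor(1.0 / x)

def digits(x, n):
    # a_k = [1/x] evaluated along the orbit of the Gauss map
    out = []
    for _ in range(n):
        out.append(math.floor(1.0 / x))
        x = gauss_map(x)
    return out

# sqrt(2) - 1 = 1/(2 + 1/(2 + ...)), so every partial quotient is 2
print(digits(math.sqrt(2) - 1, 10))

# Invariance of the Gauss measure: sample x with P(x <= t) = log(1+t)/log 2
# (inverse CDF is 2^u - 1) and check that phi(x) has the same law.
random.seed(4)
xs = [2 ** random.random() - 1 for _ in range(100_000)]
t = 0.5
emp = sum(gauss_map(x) <= t for x in xs) / len(xs)
print(emp, math.log(1 + t) / math.log(2))   # both close to 0.585
```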
1.7. Independent blocks. Let X_1, X_2, . . . be a stationary sequence. Let n < ∞ and let Y_1, Y_2, . . . be a sequence so that (Y_{nk+1}, . . . , Y_{n(k+1)}), k ≥ 0, are i.i.d. and (Y_1, . . . , Y_n) =_d (X_1, . . . , X_n). Finally, let ν be uniformly distributed on {1, 2, . . . , n} and let Z_m = Y_{ν+m} for m ≥ 1. Show that Z is stationary and ergodic.

6.2. Birkhoff's Ergodic Theorem
Throughout this section, ϕ is a measurepreserving transformation on (Ω, F , P ).
We begin by proving a result that is usually referred to as:
(2.1) The ergodic theorem. For any X ∈ L¹,

(1/n) ∑_{m=0}^{n−1} X(ϕ^m ω) → E(X|I)    a.s. and in L¹

This result, due to Birkhoff (1931), is sometimes called the pointwise or individual ergodic theorem because of the a.s. convergence in the conclusion. When the sequence is ergodic, the limit is the mean EX. In this case, if we take X = 1_A, it follows that the asymptotic fraction of time ϕ^m ω ∈ A is P(A).
The proof we give is based on an odd integration inequality due to Yosida
and Kakutani (1939).
(2.2) Maximal ergodic lemma. Let X_j(ω) = X(ϕ^j ω), S_k(ω) = X_0(ω) + · · · + X_{k−1}(ω), and M_k(ω) = max(0, S_1(ω), . . . , S_k(ω)). Then E(X; M_k > 0) ≥ 0.

Proof  We follow Garsia (1965). The proof is not intuitive, but none of the steps are difficult. If j ≤ k then M_k(ϕω) ≥ S_j(ϕω), so adding X(ω) gives

X(ω) + M_k(ϕω) ≥ X(ω) + S_j(ϕω) = S_{j+1}(ω)

and rearranging we have

X(ω) ≥ S_{j+1}(ω) − M_k(ϕω)    for j = 1, . . . , k

Trivially, X(ω) ≥ S_1(ω) − M_k(ϕω), since S_1(ω) = X(ω) and M_k(ϕω) ≥ 0. Therefore

E(X(ω); M_k > 0) ≥ ∫_{M_k > 0} max(S_1(ω), . . . , S_k(ω)) − M_k(ϕω) dP = ∫_{M_k > 0} M_k(ω) − M_k(ϕω) dP

Now M_k(ω) = 0 and M_k(ϕω) ≥ 0 on {M_k > 0}^c, so the last expression is

≥ ∫ M_k(ω) − M_k(ϕω) dP = 0

since ϕ is measure preserving.
Proof of (2.1)  E(X|I) is invariant under ϕ (see Exercise 1.1), so letting X′ = X − E(X|I), we can assume without loss of generality that E(X|I) = 0. Let X̄ = lim sup S_n/n, let ε > 0, and let D = {ω : X̄(ω) > ε}. Our goal is to prove that P(D) = 0. X̄(ϕω) = X̄(ω), so D ∈ I. Let

X*(ω) = (X(ω) − ε) 1_D(ω)
S*_n(ω) = X*(ω) + · · · + X*(ϕ^{n−1} ω)
M*_n(ω) = max(0, S*_1(ω), . . . , S*_n(ω))
F_n = {M*_n > 0}        F = ∪_n F_n = { sup_{k≥1} S*_k/k > 0 }

Since X*(ω) = (X(ω) − ε) 1_D(ω) and D = {lim sup S_k/k > ε}, it follows that

F = { sup_{k≥1} S_k/k > ε } ∩ D = D

(2.2) implies that E(X*; F_n) ≥ 0. Since E|X*| ≤ E|X| + ε < ∞, the dominated convergence theorem implies E(X*; F_n) → E(X*; F), and it follows that
E(X*; F) ≥ 0. The last conclusion looks innocent, but F = D ∈ I, so it implies

0 ≤ E(X*; D) = E(X − ε; D) = E(E(X|I); D) − εP(D) = −εP(D)

since E(X|I) = 0. The last inequality implies that

0 = P(D) = P(lim sup S_n/n > ε)

and since ε > 0 is arbitrary, it follows that lim sup S_n/n ≤ 0. Applying the last result to −X shows that S_n/n → 0 a.s.
To prove that convergence occurs in L¹, let

X_M(ω) = X(ω) 1_{(|X(ω)|≤M)}    and    X′_M(ω) = X(ω) − X_M(ω)

The part of the ergodic theorem we have proved implies

(1/n) ∑_{m=0}^{n−1} X_M(ϕ^m ω) → E(X_M|I)    a.s.

Since X_M is bounded, the bounded convergence theorem implies

E | (1/n) ∑_{m=0}^{n−1} X_M(ϕ^m ω) − E(X_M|I) | → 0

To handle X′_M, we observe

E | (1/n) ∑_{m=0}^{n−1} X′_M(ϕ^m ω) | ≤ (1/n) ∑_{m=0}^{n−1} E|X′_M(ϕ^m ω)| = E|X′_M|

and E|E(X′_M|I)| ≤ E E(|X′_M| | I) = E|X′_M|. So

E | (1/n) ∑_{m=0}^{n−1} X′_M(ϕ^m ω) − E(X′_M|I) | ≤ 2E|X′_M|

and it follows that

lim sup_{n→∞} E | (1/n) ∑_{m=0}^{n−1} X(ϕ^m ω) − E(X|I) | ≤ 2E|X′_M|

As M → ∞, E|X′_M| → 0 by the dominated convergence theorem, so we have completed the proof of (2.1).
Exercise 2.1. Show that if X ∈ Lp with p > 1 then the convergence in (2.1)
occurs in Lp .
Exercise 2.2. (i) Show that if g_n(ω) → g(ω) a.s. and E(sup_k |g_k(ω)|) < ∞, then

lim_{n→∞} (1/n) ∑_{m=0}^{n−1} g_m(ϕ^m ω) = E(g|I)    a.s.

(ii) Show that if we suppose only that g_n → g in L¹, we get L¹ convergence.
Before turning to examples, we would like to prove a useful result that is
a simple consequence of (2.2):
(2.3) Wiener's maximal inequality. Let X_j(ω) = X(ϕ^j ω), S_k(ω) = X_0(ω) + · · · + X_{k−1}(ω), A_k(ω) = S_k(ω)/k, and D_k = max(A_1, . . . , A_k). If α > 0 then

P(D_k > α) ≤ α^{−1} E|X|

Proof  Let B = {D_k > α}. Applying (2.2) to X′ = X − α, with X′_j(ω) = X′(ϕ^j ω), S′_k = X′_0(ω) + · · · + X′_{k−1}, and M′_k = max(0, S′_1, . . . , S′_k), we conclude that E(X′; M′_k > 0) ≥ 0. Since {M′_k > 0} = {D_k > α} ≡ B, it follows that

E|X| ≥ ∫_B X dP ≥ ∫_B α dP = αP(B)

Exercise 2.3. Use (2.3) and the truncation argument at the end of the proof of (2.1) to conclude that if (2.1) holds for bounded r.v.'s, then it holds whenever E|X| < ∞.
Our next step is to see what (2.1) says about our examples.
Example 2.1. i.i.d. sequences. Since I is trivial, the ergodic theorem implies that

(1/n) ∑_{m=0}^{n−1} X_m → EX_0    a.s. and in L¹

The a.s. convergence is the strong law of large numbers.

Remark. We can prove the L¹ convergence in the law of large numbers without invoking the ergodic theorem. To do this, note that

(1/n) ∑_{m=1}^n X_m⁺ → EX⁺  a.s.    and    E( (1/n) ∑_{m=1}^n X_m⁺ ) = EX⁺

and use (5.2) in Chapter 4 to conclude that (1/n) ∑_{m=1}^n X_m⁺ → EX⁺ in L¹. A similar result for the negative part and the triangle inequality now give the desired result.

Example 2.2. Markov chains. Let X_n be an irreducible Markov chain on a countable state space that has a stationary distribution π. Let f be a function with

∑_x |f(x)| π(x) < ∞

In Section 1, we showed that I is trivial, so applying the ergodic theorem to f(X_0(ω)) gives

(1/n) ∑_{m=0}^{n−1} f(X_m) → ∑_x f(x) π(x)    a.s. and in L¹

For another proof, see Exercise 5.5 in Chapter 5.
Example 2.3. Rotation of the circle. Ω = [0, 1), ϕ(ω) = (ω + θ) mod 1. Suppose that θ ∈ (0, 1) is irrational, so Example 1.3 implies that I is trivial. If we set X(ω) = 1_A(ω), with A a Borel subset of [0, 1), then the ergodic theorem implies

(1/n) ∑_{m=0}^{n−1} 1_{(ϕ^m ω ∈ A)} → |A|    a.s.

where |A| denotes the Lebesgue measure of A. The last result for ω = 0 is usually called Weyl's equidistribution theorem, although Bohl and Sierpinski should also get credit. For the history and a nonprobabilistic proof, see Hardy and Wright (1959), p. 390–393.
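The conclusion of Example 2.3 is easy to watch numerically (an illustrative sketch; the angle θ = √2 mod 1 and the interval [0.2, 0.7) are my choices):

```python
import math

theta = math.sqrt(2) % 1        # an irrational rotation angle
a, b = 0.2, 0.7                 # the interval A = [a, b), |A| = 0.5
n = 100_000
# fraction of the orbit of 0 under x -> x + theta mod 1 that lands in A
count = sum(1 for m in range(n) if a <= (m * theta) % 1 < b)
print(count / n, b - a)         # the two numbers should be close
```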
To recover the number theoretic result, we will now show that:
(2.4) Theorem. If A = [a, b) then the exceptional set is ∅.

Proof  Let A_k = [a + 1/k, b − 1/k). If b − a > 2/k, the ergodic theorem implies

(1/n) ∑_{m=0}^{n−1} 1_{A_k}(ϕ^m ω) → b − a − 2/k

for ω ∈ Ω_k with P(Ω_k) = 1. Let G = ∩Ω_k, where the intersection is over integers k with b − a > 2/k. P(G) = 1 so G is dense in [0, 1). If x ∈ [0, 1) and ω_k ∈ G with |ω_k − x| < 1/k, then ϕ^m ω_k ∈ A_k implies ϕ^m x ∈ A, so

lim inf_{n→∞} (1/n) ∑_{m=0}^{n−1} 1_A(ϕ^m x) ≥ b − a − 2/k

for all large enough k. Noting that k is arbitrary and applying similar reasoning to A^c shows

(1/n) ∑_{m=0}^{n−1} 1_A(ϕ^m x) → b − a
Example 2.4. Benford's law. As Gelfand first observed, the equidistribution theorem says something interesting about 2^m. Let θ = log₁₀ 2, 1 ≤ k ≤ 9, and A_k = [log₁₀ k, log₁₀(k + 1)), where log₁₀ y is the logarithm of y to the base 10. Taking x = 0 in the last result, we have

(1/n) ∑_{m=0}^{n−1} 1_{A_k}(ϕ^m 0) → log₁₀((k + 1)/k)

A little thought reveals that the first digit of 2^m is k if and only if mθ mod 1 ∈ A_k. Taking k = 1, for example, we have shown that the asymptotic fraction of time 1 is the first digit of 2^m is log₁₀ 2 = .3010.
The limit distribution on {1, . . . , 9} is called Benford’s (1938) law, although
it was discovered by Newcomb (1881). As Raimi (1976) explains, in many tables
the observed frequency with which k appears as a ﬁrst digit is approximately
log₁₀((k + 1)/k). He mentions powers of two as an example. Two other data sets that fit this very well are (i) the street addresses of the first 342 persons in American Men of Science, 1938, and (ii) the kilowatt hours of 1243 electric
bills for October 1969 from Honiara in the British Solomon Islands. We leave
it to the reader to ﬁgure out why Benford’s law should appear in the last two
situations.
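The frequencies in Example 2.4 can be checked directly, using the observation above that the first digit of 2^m is determined by mθ mod 1 (an illustrative sketch):

```python
import math

# First digit of 2^m is k iff m*log10(2) mod 1 lands in [log10 k, log10(k+1)),
# i.e. the first digit is int(10 ** frac(m * log10(2))).
theta = math.log10(2)
n = 100_000
counts = [0] * 10
for m in range(1, n + 1):
    counts[int(10 ** ((m * theta) % 1))] += 1

for k in range(1, 10):
    print(k, counts[k] / n, math.log10((k + 1) / k))  # empirical vs Benford
```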
Example 2.5. Bernoulli shift. Ω = [0, 1), ϕ(ω) = (2ω) mod 1. Let i_1, . . . , i_k ∈ {0, 1}, let r = i_1 2^{−1} + · · · + i_k 2^{−k}, and let X(ω) = 1 if r ≤ ω < r + 2^{−k}. In words, X(ω) = 1 if the first k digits of the binary expansion of ω are i_1, . . . , i_k. The ergodic theorem implies that

(1/n) ∑_{m=0}^{n−1} X(ϕ^m ω) → 2^{−k}    a.s.
i.e., in almost every ω ∈ [0, 1) the pattern i1 , . . . , ik occurs with its expected
frequency. Since there are only a countable number of patterns of ﬁnite length,
it follows that almost every ω ∈ [0, 1) is normal, i.e., all patterns occur with
their expected frequency. This is the binary version of Borel’s (1909) normal
number theorem.

6.3. Recurrence
In this section, we will study the recurrence properties of stationary sequences. Our first result is an application of the ergodic theorem. Let X_1, X_2, . . . be a stationary sequence taking values in R^d, let S_k = X_1 + · · · + X_k, let A = {S_k ≠ 0 for all k ≥ 1}, and let R_n = |{S_1, . . . , S_n}| be the number of points visited at time n. Kesten, Spitzer, and Whitman, see Spitzer (1964), p. 40, proved the next result when the X_i are i.i.d. In that case, I is trivial, so the limit is P(A).

(3.1) Theorem. As n → ∞, R_n/n → E(1_A|I) a.s.

Proof  Suppose X_1, X_2, . . . are constructed on (R^d)^{0,1,...} with X_n(ω) = ω_n, and let ϕ be the shift operator. It is clear that

R_n ≥ ∑_{m=1}^n 1_A(ϕ^m ω)

since the right-hand side = |{m : 1 ≤ m ≤ n, S_ℓ ≠ S_m for all ℓ > m}|. Using the ergodic theorem now gives

lim inf_{n→∞} R_n/n ≥ E(1_A|I)    a.s.
n→∞ > m}. Using a.s. To prove the opposite inequality, let Ak = {S1 = 0, S2 = 0, . . . , Sk = 0}. It is
clear that
n−k 1Ak (ϕm ω ) Rn ≤ k +
m=1 since the sum on the righthand side = {m : 1 ≤ m ≤ n − k, S = Sm for
m < ≤ m + k }. Using the ergodic theorem now gives
lim sup Rn /n ≤ E (1Ak I )
n→∞ As k ↑ ∞, Ak ↓ A, so the monotone convergence theorem for conditional
expectations, (1.1c) in Chapter 4, implies
E (1Ak I ) ↓ E (1A I ) as k ↑ ∞ 343 344 Chapter 6 Ergodic Theorems
and the proof is complete.
Exercise 3.1. Let g_n = P(S_1 ≠ 0, . . . , S_n ≠ 0) for n ≥ 1 and g_0 = 1. Show that ER_n = ∑_{m=1}^n g_{m−1}.
From (3.1), we get a result about the recurrence of random walks with
stationary increments that is (for integer-valued random walks) a generalization of the Chung–Fuchs theorem, (2.7) in Chapter 3.
(3.2) Theorem. Let X_1, X_2, . . . be a stationary sequence taking values in Z with E|X_i| < ∞. Let S_n = X_1 + · · · + X_n, and let A = {S_1 ≠ 0, S_2 ≠ 0, . . .}. (i) If E(X_1|I) = 0 then P(A) = 0. (ii) If P(A) = 0 then P(S_n = 0 i.o.) = 1.

Remark. In words, mean zero implies recurrence. The condition E(X_1|I) = 0 is needed to rule out trivial examples that have mean 0 but are a combination of a sequence with positive and negative means, e.g., P(X_n = 1 for all n) = P(X_n = −1 for all n) = 1/2.
Proof of (i)  If E(X_1|I) = 0 then the ergodic theorem implies S_n/n → 0 a.s. Now

lim sup_{n→∞} max_{1≤k≤n} S_k/n = lim sup_{n→∞} max_{K≤k≤n} S_k/n ≤ sup_{k≥K} S_k/k

for any K, and the right-hand side ↓ 0 as K ↑ ∞. The last conclusion leads easily to

lim_{n→∞} max_{1≤k≤n} |S_k| / n = 0

Since

R_n ≤ 1 + 2 max_{1≤k≤n} |S_k|

it follows that R_n/n → 0, and (3.1) implies P(A) = 0.
Proof of (ii)  Let Fj = {Si ≠ 0 for i < j, Sj = 0} and Gj,k = {S_{j+i} − Sj ≠ 0 for i < k, S_{j+k} − Sj = 0}. P(A) = 0 implies that Σ_k P(Fk) = 1. Stationarity implies P(Gj,k) = P(Fk), and for fixed j the Gj,k are disjoint, so ∪k Gj,k = Ω a.s. It follows that

Σ_k P(Fj ∩ Gj,k) = P(Fj)   and   Σ_{j,k} P(Fj ∩ Gj,k) = 1

On Fj ∩ Gj,k, Sj = 0 and S_{j+k} = 0, so we have shown P(Sn = 0 at least two times) = 1. Repeating the last argument shows P(Sn = 0 at least k times) = 1 for all k, and the proof is complete.

Section 6.3 Recurrence
Exercise 3.2. Imitate the proof of (i) in (3.2) to show that if, in addition to the hypotheses of (3.2), we assume P(Xi > 1) = 0, EXi > 0, and the sequence Xi is ergodic, then P(A) = EXi.
Remark. You have proved the last result twice for asymmetric simple random
walk (Exercise 1.13 in Chapter 3, Exercise 7.3 in Chapter 4). For more general
random walks, the conclusion is new. It is interesting to note that we can use
martingale theory to prove a result for random walks that do not skip over
integers on the way down.
Exercise 3.3. Suppose X1, X2, ... are i.i.d. integer-valued with P(Xi < −1) = 0 and EXi > 0. If P(Xi = −1) > 0 there is a unique θ < 0 so that E exp(θXi) = 1. Let Sn = X1 + ··· + Xn and N = inf{n : Sn < 0}. Show that exp(θSn) is a martingale and use the optional stopping theorem to conclude that P(N < ∞) = e^θ.
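To see the conclusion of Exercise 3.3 concretely, here is a sketch for one distribution of our own choosing, P(Xi = 1) = 2/3 and P(Xi = −1) = 1/3 (so EXi = 1/3 > 0): bisection solves E exp(θXi) = 1, and the ruin probability P(N < ∞) is computed independently as the minimal fixed point of h = 1/3 + (2/3)h², the first-step decomposition of the event of ever moving one level down.

```python
import math

# Hypothetical concrete case: P(X = 1) = 2/3, P(X = -1) = 1/3.
def mgf(theta):
    return (2 / 3) * math.exp(theta) + (1 / 3) * math.exp(-theta)

# mgf(-5) > 1 and mgf(t) < 1 just below 0 (slope EX > 0 at 0), so bisect.
lo, hi = -5.0, -1e-12
for _ in range(100):
    mid = (lo + hi) / 2
    if mgf(mid) > 1:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2      # should be -log 2, i.e. e^theta = 1/2

# P(ever step below 0): down immediately (prob 1/3), or up and then
# descend two levels (prob 2/3 * h^2).  Iterate to the minimal root.
h = 0.0
for _ in range(200):
    h = 1 / 3 + (2 / 3) * h * h
```

Both computations give 1/2, matching P(N < ∞) = e^θ for this walk.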
Extending the reasoning in the proof of part (ii) of (3.2) gives a result of Kac (1947b). Let X0, X1, ... be a stationary sequence taking values in (S, S). Let A ∈ S, let T0 = 0, and for n ≥ 1, let Tn = inf{m > T_{n−1} : Xm ∈ A} be the time of the nth return to A.

(3.3) Theorem. If P(Xn ∈ A at least once) = 1, then under P(· | X0 ∈ A), tn = Tn − T_{n−1} is a stationary sequence with E(T1 | X0 ∈ A) = 1/P(X0 ∈ A).

Remark. If Xn is an irreducible Markov chain on a countable state space S starting from its stationary distribution π, and A = {x}, then (3.3) says Ex Tx = 1/π(x), which is (4.6) in Chapter 5. (3.3) extends that result to an arbitrary A ⊂ S and drops the assumption that Xn is a Markov chain.
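The Markov chain case Ex Tx = 1/π(x) is easy to confirm numerically. The 3-state transition matrix below is an arbitrary choice of ours, not an example from the text; the mean return time is computed by solving the usual first-passage linear system and compared with 1/π(x):

```python
import numpy as np

# An arbitrary irreducible 3-state transition matrix (rows sum to 1).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Mean return time to state 0: E_0 T_0 = 1 + sum_y P(0, y) h(y), where
# h(y) = E_y(hitting time of 0) solves (I - Q) h = 1 over states != 0.
others = [1, 2]
Q = P[np.ix_(others, others)]
h = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
ET0 = 1.0 + P[0, others] @ h
```

The identity holds to machine precision, with no Monte Carlo error involved.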
Proof  We first show that under P(· | X0 ∈ A), t1, t2, ... is stationary. To cut down on ...'s, we will only show that

P(t1 = m, t2 = n | X0 ∈ A) = P(t2 = m, t3 = n | X0 ∈ A)

It will be clear that the same proof works for any finite-dimensional distribution. Our first step is to extend {Xn, n ≥ 0} to a two-sided stationary sequence {Xn, n ∈ Z} using (1.2). Let Ck = {X_{−1} ∉ A, ..., X_{−k+1} ∉ A, X_{−k} ∈ A}. Then

(∪_{k=1}^K Ck)^c = {Xk ∉ A for −K ≤ k ≤ −1}

The last event has the same probability as {Xk ∉ A for 1 ≤ k ≤ K}, so letting K → ∞, we get P(∪_{k=1}^∞ Ck) = 1. To prove the desired stationarity, we let Ij,k = {i ∈ [j, k] : Xi ∈ A} and observe that

P(t2 = m, t3 = n, X0 ∈ A) = Σ_{ℓ=1}^∞ P(X0 ∈ A, t1 = ℓ, t2 = m, t3 = n)
= Σ_{ℓ=1}^∞ P(I_{0,ℓ+m+n} = {0, ℓ, ℓ+m, ℓ+m+n})
= Σ_{ℓ=1}^∞ P(I_{−ℓ,m+n} = {−ℓ, 0, m, m+n})
= Σ_{ℓ=1}^∞ P(Cℓ, X0 ∈ A, t1 = m, t2 = n)

using stationarity in the middle step. Since the Cℓ are disjoint and their union has probability 1, the last sum is P(X0 ∈ A, t1 = m, t2 = n), and dividing by P(X0 ∈ A) gives the desired identity.

To complete the proof, we compute

E(t1 | X0 ∈ A) = Σ_{k=1}^∞ P(t1 ≥ k | X0 ∈ A) = P(X0 ∈ A)^{−1} Σ_{k=1}^∞ P(t1 ≥ k, X0 ∈ A)
= P(X0 ∈ A)^{−1} Σ_{k=1}^∞ P(Ck) = 1/P(X0 ∈ A)

since by stationarity P(t1 ≥ k, X0 ∈ A) = P(Ck), the Ck are disjoint, and their union has probability 1.
In the next two exercises, we continue to use the notation of (3.3).

Exercise 3.4. Show that if P(Xn ∈ A at least once) = 1 and A ∩ B = ∅ then

E( Σ_{1≤m≤T1} 1_{(Xm ∈ B)} | X0 ∈ A ) = P(X0 ∈ B)/P(X0 ∈ A)

When A = {x} and Xn is a Markov chain, this is the "cycle trick" for defining a stationary measure. See (4.3) in Chapter 5.
Exercise 3.5. Consider the special case in which Xn ∈ {0, 1}, and let P̄ = P(· | X0 = 1). Here A = {1} and so T1 = inf{m > 0 : Xm = 1}. Show that P(T1 = n) = P̄(T1 ≥ n)/Ē T1. When t1, t2, ... are i.i.d., this reduces to the formula for the first waiting time in a stationary renewal process.
In checking the hypotheses of Kac's theorem, a result Poincaré proved in 1899 is useful. First, we need a definition. Let TA = inf{n ≥ 1 : ϕ^n(ω) ∈ A}.
(3.4) Theorem. Suppose ϕ : Ω → Ω preserves P, that is, P ∘ ϕ^{−1} = P. (i) TA < ∞ a.s. on A, that is, P(ω ∈ A, TA = ∞) = 0. (ii) {ϕ^n(ω) ∈ A i.o.} ⊃ A a.s. (iii) If ϕ is ergodic and P(A) > 0, then P(ϕ^n(ω) ∈ A i.o.) = 1.

Remark. Note that in (i) and (ii) we assume only that ϕ is measure-preserving. Extrapolating from Markov chain theory, the conclusions can be "explained" by noting that: (i) the existence of a stationary distribution implies the sequence is recurrent, and (ii) since we start in A we do not have to assume irreducibility. Conclusion (iii) is, of course, a consequence of the ergodic theorem, but as the self-contained proof below indicates, it is a much simpler fact.

Proof  Let B = {ω ∈ A, TA = ∞}. A little thought shows that if ω ∈ ϕ^{−m}B then ϕ^m(ω) ∈ A, but ϕ^n(ω) ∉ A for n > m, so the ϕ^{−m}B are pairwise disjoint. The fact that ϕ is measure-preserving implies P(ϕ^{−m}B) = P(B), so we must have P(B) = 0 (or P would have infinite mass). To prove (ii), note that for any k, ϕ^k is measure-preserving, so (i) implies

0 = P(ω ∈ A, ϕ^{nk}(ω) ∉ A for all n ≥ 1) ≥ P(ω ∈ A, ϕ^m(ω) ∉ A for all m ≥ k)

Since the last probability is 0 for all k, (ii) follows. Finally, for (iii), note that B ≡ {ω : ϕ^n(ω) ∈ A i.o.} is invariant and ⊃ A a.s. by (ii), so P(B) > 0, and it follows from ergodicity that P(B) = 1.

*6.4. Mixing
A measure-preserving transformation ϕ on (Ω, F, P) is called mixing if for all measurable sets A and B

(4.1)   lim_{n→∞} P(A ∩ ϕ^{−n}B) = P(A)P(B)

A sequence Xn, n ≥ 0, is said to be mixing if the corresponding shift on sequence space is. To see that mixing implies ergodicity, observe that if A is invariant, then taking B = A in (4.1) gives P(A) = P(A)², i.e., P(A) ∈ {0, 1}. In the other direction, ergodicity implies

(1/n) Σ_{m=0}^{n−1} 1_B(ϕ^m ω) → P(B)   a.s.

so integrating over A and using the bounded convergence theorem gives

(4.2)   (1/n) Σ_{m=0}^{n−1} P(A ∩ ϕ^{−m}B) → P(A)P(B)
i.e., (4.1) holds in the Cesàro sense. To see that (4.2) is in fact equivalent to ergodicity, we note that the ergodic theorem implies

(1/n) Σ_{m=0}^{n−1} 1_B(ϕ^m ω) → E(1_B | I)   a.s.

Integrating over A using the bounded convergence theorem and using (4.2), we have

∫_A E(1_B | I) dP = P(A)P(B)

Since this holds for all A, E(1_B | I) = P(B). Since that holds for all B, I must be trivial.
As usual when we meet a new concept, we turn to our examples to see what it means. To handle Examples 1.1 and 1.2, we use a general result. For this result, we suppose that ϕ is the shift operator on sequence space, and Xn(ω) = ωn. Let Fn = σ(Xn, X_{n+1}, ...) and T = ∩n Fn.

(4.3) Theorem. If T is trivial then ϕ is mixing and

(∗)   lim_{n→∞} sup_B |P(A ∩ ϕ^{−n}B) − P(A)P(B)| = 0

Conversely, if (∗) holds, T is trivial.

Proof  Let C = ϕ^{−n}B ∈ Fn.

|P(A ∩ C) − P(A)P(C)| = | ∫_C (1_A − P(A)) dP |
= | ∫_C (P(A | Fn) − P(A)) dP |
≤ ∫ |P(A | Fn) − P(A)| dP → 0

since P(A | Fn) → P(A) in L¹ by (6.3) in Chapter 4.

Conversely, if A ∈ T has P(A) ∈ (0, 1) and we write A = ϕ^{−n}Bn (which is possible since A ∈ Fn), then P(Bn) = P(A), so

P(A ∩ ϕ^{−n}Bn) − P(A)P(Bn) = P(A) − P(A)² > 0

and (∗) is false.
Combining (4.3), the definition of mixing, and (4.2), we have

T is trivial   iff   sup_B |P(A ∩ ϕ^{−n}B) − P(A)P(B)| → 0   for all A
ϕ is mixing   iff   P(A ∩ ϕ^{−n}B) − P(A)P(B) → 0   for all A, B
ϕ is ergodic   iff   (1/n) Σ_{m=0}^{n−1} P(A ∩ ϕ^{−m}B) → P(A)P(B)   for all A, B

Example 4.1. i.i.d. sequences are mixing since T is trivial.
Example 4.2. Markov chains. Suppose the state space S is countable and p is irreducible with stationary distribution π. If p is aperiodic, T is trivial by (5.8) in Chapter 5, and the sequence is mixing. Conversely, if p has period d > 1 and cyclic decomposition S0, S1, ..., S_{d−1}, then

lim_{n→∞} P(X0 ∈ S0, X_{nd+1} ∈ S0) = 0 ≠ π(S0)²

so the sequence is not mixing.
Example 4.3. Rotation of the circle. Let Ω = [0, 1) and ϕ(ω) = (ω + θ) mod 1, where θ is irrational. To prove that this transformation is not mixing, we begin by observing that nθ mod 1 is dense in [0, 1). (Use Exercise 1.3, or recall that (2.4) shows for intervals the exceptional set in the ergodic theorem is ∅.) Since nθ mod 1 is dense, there is a sequence nk → ∞ with nkθ mod 1 → 1/2. Let A = B = [0, 1/3). If k is large then A ∩ ϕ^{−nk}B = ∅, so P(A ∩ ϕ^{−nk}B) does not converge to P(A)P(B) = 1/9, and the transformation is not mixing.
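Both halves of the picture — failure of (4.1) along a subsequence, but Cesàro convergence as in (4.2) — can be seen numerically for this example. The helper below (our own sketch) computes the measure of A ∩ ϕ^{−n}B exactly for intervals on the circle; we take θ to be the golden-ratio rotation:

```python
import math

A_LEN = 1 / 3          # A = B = [0, 1/3)

def overlap(shift, a=A_LEN):
    """Lebesgue measure of [0, a) intersected with [d, d + a) mod 1."""
    d = shift % 1.0
    piece1 = max(0.0, min(d + a, 1.0, a) - d)       # unwrapped part [d, min(d+a, 1))
    piece2 = min(a, max(0.0, d + a - 1.0))          # wrapped-around part [0, d+a-1)
    return piece1 + piece2

theta = (math.sqrt(5) - 1) / 2                      # irrational rotation angle
N = 100000
# phi^{-n} B = B - n*theta (mod 1), so shift the interval by -n*theta.
vals = [overlap(-n * theta) for n in range(1, N + 1)]
cesaro = sum(vals) / N
```

The sequence `vals` keeps returning to 0 (no mixing), while its running average approaches P(A)P(B) = 1/9 by unique ergodicity of the rotation.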
To treat our last example, we will use:

(4.4) Lemma. Let A be a π-system. If

(⋆)   lim_{n→∞} P(A ∩ ϕ^{−n}B) = P(A)P(B)

holds for A, B ∈ A, then (⋆) holds for A, B ∈ σ(A).

Proof  As the reader can probably guess, we are going to use the π−λ theorem, (4.2) in Chapter 1. Fix A ∈ A and let L be the collection of B for which (⋆) holds. By assumption, L ⊃ A. To begin to check the assumptions of the π−λ theorem, we note that Ω ∈ L. If B1 ⊃ B2 are in L, then ϕ^{−n}(B1 − B2) = ϕ^{−n}B1 − ϕ^{−n}B2, so

P(A ∩ ϕ^{−n}(B1 − B2)) = P(A ∩ ϕ^{−n}B1) − P(A ∩ ϕ^{−n}B2)

and

lim_{n→∞} P(A ∩ ϕ^{−n}(B1 − B2)) = P(A)P(B1) − P(A)P(B2) = P(A)P(B1 − B2)

To finish the proof that L is a λ-system, let Bk ∈ L with Bk ↑ B. Then

|P(A ∩ ϕ^{−n}B) − P(A ∩ ϕ^{−n}Bk)| ≤ P(ϕ^{−n}B) − P(ϕ^{−n}Bk) = P(B) − P(Bk)

since ϕ is measure-preserving, and the right-hand side is small uniformly in n when k is large, so B ∈ L.

At this point, we have checked that L is a λ-system, so the π−λ theorem implies that L ⊃ σ(A). In other words, if A ∈ A then (⋆) holds for B ∈ σ(A). Fixing B ∈ σ(A), the reader can now repeat the last argument to show that (⋆) holds for A, B ∈ σ(A). Indeed, the proof is simpler since you do not have to worry about what ϕ^{−n} does.
Example 4.4. Bernoulli shift. Ω = [0, 1), ϕ(ω) = 2ω mod 1. Let Fn = σ([k2^{−n}, (k+1)2^{−n}) : 0 ≤ k < 2^n) and A = ∪n Fn, a π-system. If A ∈ Fn and B ∈ A, then A and ϕ^{−m}B are independent for m ≥ n, i.e., P(A ∩ ϕ^{−m}B) = P(A)P(ϕ^{−m}B) = P(A)P(B) since ϕ is measure-preserving. So (4.4) implies ϕ is mixing.
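The independence claim can be verified exactly, because ϕ^m acts exactly on a sufficiently fine dyadic grid. The following check (ours) compares the measure of A ∩ ϕ^{−m}B with P(A)P(B) in rational arithmetic for a pair of dyadic intervals:

```python
from fractions import Fraction

def meas(nA, kA, nB, kB, m):
    """Exact measure of A ∩ phi^{-m}B for phi(w) = 2w mod 1, where
    A = [kA/2^nA, (kA+1)/2^nA) and B = [kB/2^nB, (kB+1)/2^nB)."""
    N = 2 ** (m + nA + nB)         # grid so fine that phi^m maps cells into cells
    count = 0
    for j in range(N):             # cell j is [j/N, (j+1)/N)
        in_A = (j * 2 ** nA) // N == kA
        jm = (j * 2 ** m) % N      # phi^m shifts the binary expansion m places
        in_B = (jm * 2 ** nB) // N == kB
        count += in_A and in_B
    return Fraction(count, N)

# A = [1/4, 1/2) is in F_2, B = [5/8, 3/4) is in F_3; |A||B| = 1/32.
prod = Fraction(1, 2 ** 2) * Fraction(1, 2 ** 3)
```

For m ≥ n = 2 the measure equals |A||B| exactly; for m = 0 these two particular intervals are disjoint, so the product formula fails, showing the restriction m ≥ n is needed.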
Up to this point, all our examples either have a trivial tail σ-field (Examples 4.1, 4.2, and 4.4) or are not mixing (Example 4.3). To complete the picture, we close the section with some examples that are mixing but have nontrivial tail fields.

Example 4.5. Let Xn, n ∈ Z, be i.i.d. with P(Xn = 1) = P(Xn = 0) = 1/2 and let

Zn(ω) = Σ_{m=0}^∞ 2^{−(m+1)} X_{n−m}(ω)

Zn is mixing, but T = σ(Xn : n ∈ Z).

Proof  Since all the Xk, k ≤ n, can be recovered by expanding Zn in its binary representation, the claim about T is clear. To prove that Zn is mixing, we use (4.4) with A = the sets of the form {ω : Zi(ω) ∈ Gi, 0 ≤ i ≤ k}, where the Gi are open intervals. Now

Zn = Σ_{m=0}^{n−k−1} 2^{−(m+1)} X_{n−m} + 2^{−(n−k)} Zk

so if Fk = σ(Xj : j ≤ k) and G0 is an open interval,

P(Zn ∈ G0 | Fk) = μ_{n−k−1}(G0 − 2^{−(n−k)} Zk)

where μM is the distribution of Σ_{0≤m≤M} 2^{−(m+1)} X_{−m} and G0 − c = {x − c : x ∈ G0}.

Thinking about the binary digits of a number chosen at random from [0, 1), it is easy to see that as M → ∞, μM ⇒ μ∞, the uniform distribution on (0, 1). Since the uniform distribution has no atoms, for a.e. ω

μ_{n−k−1}(G0 − 2^{−(n−k)} Zk) → μ∞(G0)

Integrating over A ∈ Fk and using the bounded convergence theorem and the definition of conditional expectation,

P(Zn ∈ G0, A) → μ∞(G0) P(A)

Letting B = {ω : Z0(ω) ∈ G0}, we have shown

P(A ∩ ϕ^{−n}B) → P(A)P(B)   for A ∈ Fk

The last argument generalizes easily to B = {ω : Zi(ω) ∈ Gi, 0 ≤ i ≤ k}. This checks (4.4) and completes the proof.
Our last example, borrowed from Lasota and Mackey (1985), is named for its famous cousin: geodesic flow on a compact Riemannian manifold M with negative curvature, e.g., the two-holed torus with a suitable metric. For more on these examples, see Anosov (1963), (1967).

Example 4.6. Anosov map. Ω = [0, 1)², ϕ(x, y) = (x + y, x + 2y) mod 1. By drawing a picture, it is easy to see that ϕ maps Ω one-to-one onto itself. Since the Jacobian

J = det ( 1  1
          1  2 ) = 1

ϕ preserves Lebesgue measure. Since ϕ is invertible, the entire sequence can be recovered from one term, and T is far from trivial. We will now show that ϕ is mixing. To begin, we observe that an induction argument shows

ϕ^n(x, y) = (a_{2n−2} x + a_{2n−1} y, a_{2n−1} x + a_{2n} y)   (mod 1)

where the an are the Fibonacci numbers given by a0 = a1 = 1 and a_{n+1} = an + a_{n−1} for n ≥ 1. (To help check this, note that a_{2n−2} + 2a_{2n−1} + a_{2n} = a_{2n} + a_{2n+1} and then subtract a_{2n} from each side.) Let

f(x, y) = exp(2πi(px + qy))    g(x, y) = exp(2πi(rx + sy))

Since ∫_{[0,1]} exp(2πikx) dx = 0 unless k = 0, it follows that

∫_0^1 ∫_0^1 f(x, y) g(ϕ^n(x, y)) dx dy = 0

unless
(a)   r a_{2n−2} + s a_{2n−1} + p = 0   and   r a_{2n−1} + s a_{2n} + q = 0

Now the difference equation b_{n+1} − bn − b_{n−1} = 0 has a two-parameter family of solutions given by

bn = C1 ((1+√5)/2)^n + C2 ((1−√5)/2)^n

so the Fibonacci numbers are given by

(b)   an = (1/√5) ((1+√5)/2)^{n+1} − (1/√5) ((1−√5)/2)^{n+1}   for n ≥ 0

To see this, note that bn = λ^n is a solution of b_{n+1} − bn − b_{n−1} = 0 if and only if λ² − λ − 1 = 0, and any solution is determined by the values of b0 and b1.
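Both the matrix-power formula for ϕ^n and the closed form (b) are easy to confirm with exact integer arithmetic; the short check below (ours) also illustrates the golden-ratio limit used next:

```python
import math

# a_0 = a_1 = 1, a_{n+1} = a_n + a_{n-1}
a = [1, 1]
for _ in range(60):
    a.append(a[-1] + a[-2])

def matmul(A, B):
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

# The matrix of phi is M = [[1, 1], [1, 2]];
# check M^n = [[a_{2n-2}, a_{2n-1}], [a_{2n-1}, a_{2n}]].
M = [[1, 1], [1, 2]]
Mn = [[1, 0], [0, 1]]
powers_ok = True
for n in range(1, 15):
    Mn = matmul(Mn, M)
    powers_ok = powers_ok and Mn == [[a[2*n - 2], a[2*n - 1]], [a[2*n - 1], a[2*n]]]

# Binet's formula (b).
phi = (1 + math.sqrt(5)) / 2
psi = (1 - math.sqrt(5)) / 2
def binet(n):
    return (phi ** (n + 1) - psi ** (n + 1)) / math.sqrt(5)
```

The ratio a_{2n}/a_{2n−1} converges to (1+√5)/2 very quickly, since the ψ-term in (b) dies off geometrically.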
From (b), we see that

lim_{n→∞} a_{2n}/a_{2n−1} = lim_{n→∞} a_{2n−1}/a_{2n−2} = (1+√5)/2

Since (1+√5)/2 is irrational, (a) cannot hold for infinitely many n unless p = q = r = s = 0. The last result implies that if

(c)   f(x, y) = Σ_{j=1}^k aj exp(2πi(pj x + qj y))    g(x, y) = Σ_{j=1}^k bj exp(2πi(rj x + sj y))

(with all the frequency pairs (pj, qj) and (rj, sj) ≠ (0, 0)), then as n → ∞

∫_0^1 ∫_0^1 f(x, y) g(ϕ^n(x, y)) dx dy → 0
To finish the proof, we observe:

(4.5) Lemma. Since the f and g for which (c) holds are dense in

L²_0 = { f : ∫_0^1 ∫_0^1 f(x, y)² dx dy < ∞, ∫_0^1 ∫_0^1 f(x, y) dx dy = 0 }

it follows that ϕ is mixing.

Proof  Let

⟨h, k⟩ = ∫_0^1 ∫_0^1 h(x, y) k(x, y) dx dy

and ‖h‖₂ = ⟨h, h⟩^{1/2}. Adding and subtracting ⟨h, g ∘ ϕ^n⟩ and then using the Cauchy–Schwarz inequality gives

|⟨f, g ∘ ϕ^n⟩ − ⟨h, k ∘ ϕ^n⟩| ≤ |⟨f, g ∘ ϕ^n⟩ − ⟨h, g ∘ ϕ^n⟩| + |⟨h, g ∘ ϕ^n⟩ − ⟨h, k ∘ ϕ^n⟩|
≤ ‖f − h‖₂ ‖g ∘ ϕ^n‖₂ + ‖h‖₂ ‖(g − k) ∘ ϕ^n‖₂
= ‖f − h‖₂ ‖g‖₂ + ‖h‖₂ ‖g − k‖₂

Suppose now that h = 1_A − P(A), k = 1_B − P(B) ∈ L²_0, and pick f and g so that (c) holds and ‖h − f‖₂, ‖k − g‖₂ are < ε. The last inequality and ⟨f, g ∘ ϕ^n⟩ → 0 imply

lim sup_{n→∞} |P(A ∩ ϕ^{−n}B) − P(A)P(B)| = lim sup_{n→∞} |⟨h, k ∘ ϕ^n⟩| ≤ (‖k‖₂ + ε)ε + ‖h‖₂ ε

and since ε is arbitrary the proof is complete.

*6.5. Entropy
Throughout this section, we will suppose that Xn, n ∈ Z, is an ergodic stationary sequence taking values in a finite set S. Let

p(x0, ..., x_{n−1}) = P(X0 = x0, ..., X_{n−1} = x_{n−1})

and

p(xn | x_{n−1}, ..., x0) = P(Xn = xn | X_{n−1} = x_{n−1}, ..., X0 = x0)

whenever the conditioning event has positive probability. Define random variables

p(X0, ..., Xn)   and   p(Xn | X_{n−1}, ..., X0)

by setting xj = Xj(ω) in the corresponding definitions. Since

P(p(X0, ..., Xn) = 0) = 0

the conditional probability makes sense a.s.
(5.1) The Shannon–McMillan–Breiman theorem asserts that

−(1/n) log p(X0, ..., X_{n−1}) → H   a.s.

where H = lim_{n→∞} E{−log p(Xn | X_{n−1}, ..., X0)} is the entropy rate of Xn.

Remark. The three names indicate the evolution of the theorem: Shannon (1948), McMillan (1953), Breiman (1957). Our proof follows Algoet and Cover (1988).
Proof  The first step is to prove that the limit defining H exists. To do this, we observe that if Fn = σ(X_{−1}, ..., X_{−n}) and

Yn = p(x0 | X_{−1}, ..., X_{−n}) = P(X0 = x0 | Fn)

then Yn is a bounded martingale, so as n → ∞

Yn → Y∞ ≡ p(x0 | X_{−1}, X_{−2}, ...)   a.s. and in L¹

Since S is finite and x log x is bounded on [0, 1], it follows that

(a)   Hk ≡ E(−log p(X0 | X_{−1}, ..., X_{−k}))
= E( −Σ_x p(x | X_{−1}, ..., X_{−k}) log p(x | X_{−1}, ..., X_{−k}) )
→ E( −Σ_x p(x | X_{−1}, X_{−2}, ...) log p(x | X_{−1}, X_{−2}, ...) ) ≡ H

as k → ∞.
Our second step is to find something related to the quantity of interest that converges to H. Elementary conditional probabilities give

p(X0, ..., X_{n−1} | X_{−1}, ..., X_{−k}) = Π_{m=0}^{n−1} p(Xm | X_{m−1}, ..., X_{−k})

and (5.7) in Chapter 4 implies

p(X0, ..., X_{n−1} | X_{−1}, X_{−2}, ...) ≡ lim_{k→∞} p(X0, ..., X_{n−1} | X_{−1}, ..., X_{−k})

So we have

p(X0, ..., X_{n−1} | X_{−1}, X_{−2}, ...) = Π_{m=0}^{n−1} p(Xm | X_{m−1}, X_{m−2}, ...)

Taking logs gives

(b)   −(1/n) log p(X0, ..., X_{n−1} | X_{−1}, X_{−2}, ...) = −(1/n) Σ_{m=0}^{n−1} log p(Xm | X_{m−1}, X_{m−2}, ...)
→ E(−log p(X0 | X_{−1}, ...)) = H

by the ergodic theorem applied to F(ω) = −log p(ω0 | ω_{−1}, ...).
For the other side of our sandwich, we define the k-step Markovian approximation

p^k(X0, ..., X_{n−1}) = p(X0, ..., X_{k−1}) Π_{m=k}^{n−1} p(Xm | X_{m−1}, ..., X_{m−k})

for k < n, and observe that another application of the ergodic theorem gives

(c)   −(1/n) log p^k(X0, ..., X_{n−1}) → E(−log p(X0 | X_{−1}, ..., X_{−k})) = Hk

To put the sandwich together, we let

A^k_n = p^k(X0, ..., X_{n−1})    Bn = p(X0, ..., X_{n−1})    Cn = p(X0, ..., X_{n−1} | X_{−1}, X_{−2}, ...)
W¹_n = A^k_n / Bn    W²_n = Bn / Cn

and note that

(d)   −(1/n) log A^k_n = −(1/n) log W¹_n − (1/n) log Bn
(e)   −(1/n) log Bn = −(1/n) log W²_n − (1/n) log Cn

To get from (b) and (c) to the desired result, we use:
(f) Lemma. If Wn ≥ 0 and EWn ≤ 1, then lim sup_{n→∞} n^{−1} log Wn ≤ 0 a.s.

Proof  If ε > 0 then

P(n^{−1} log Wn ≥ ε) = P(Wn ≥ e^{εn}) ≤ e^{−εn}

by Chebyshev's inequality. Σn e^{−εn} < ∞, so the Borel–Cantelli lemma implies that lim sup n^{−1} log Wn ≤ ε. ε is arbitrary, so the proof is complete.
i
To check that EWn ≤ 1, we observe that 1
EWn =
x pk (x0 , . . . , xn−1 )
p(x0 , ..., xn−1 ) = 1
p(x0 , . . . , xn−1 ) For the second, we observe that
E p(X0 , . . . , Xn−1 )
p(X−k , . . . , X−1 )p(X0 , . . . , Xn−1 )
=E
p(X0 , . . . , Xn−1 X−1 , . . . , X−k )
p(X−k , . . . , Xn−1 ) and x p(x−k , . . . , x−1 )p(x0 , . . . , xn−1 ) = 1.
2
Letting k → ∞ and using Fatou’s lemma gives EWn ≤ 1. Applying (f)
now and using (c), (d), (e), and (b) gives Hk ≥ lim sup n−1 log p(X0 , . . . , Xn−1 )
n→∞ ≥ lim inf n−1 log p(X0 , . . . , Xn−1 ) ≥ H
n→∞ (a) implies Hk → H as k → ∞, and the proof of (5.1) is complete.
Example 5.1. X0, X1, ... are i.i.d. In this case,

p(X0, ..., X_{n−1}) = Π_{m=0}^{n−1} p(Xm)

so the strong law of large numbers implies

−(1/n) log p(X0, ..., X_{n−1}) → −E log p(X0) = −Σ_x p(x) log p(x)

For a concrete example, suppose P(Xi = 0) = 2/3 and P(Xi = 1) = 1/3. In this case,

H = (2/3) log(3/2) + (1/3) log 3 ≈ .6365
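A short seeded simulation (ours) illustrates (5.1) for this distribution: the average −n^{−1} log p(X0, ..., X_{n−1}) settles near H ≈ .6365.

```python
import math
import random

p = {0: 2 / 3, 1: 1 / 3}
H = -sum(q * math.log(q) for q in p.values())   # entropy rate in nats

random.seed(12345)
n = 200000
log_p = 0.0
for _ in range(n):
    x = 0 if random.random() < 2 / 3 else 1     # draw X_m
    log_p += math.log(p[x])                     # accumulate log p(X_0,...,X_{m})
empirical = -log_p / n
```

The standard deviation of the average is about 0.0007 at this n, so the agreement with H is already quite close.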
Example 5.2. X0, X1, ... is a Markov chain. In this case, H = H1, so

H = E{−log p(X0 | X_{−1})} = Σ_{x,y} π(x) p(x, y) {−log p(x, y)}

where p is the transition probability and π is the stationary distribution. For a concrete example, suppose

p = ( .5  .5
       1   0 )       π = (2/3, 1/3)

In this case, H = (2/3) log 2 ≈ .4621.

Comparing the values of H in the two examples, we see that although the two sequences have the same marginal distributions, the i.i.d. sequence is "more random" than the Markov chain. To explain what the values of H tell us about the stationary sequence, we state a result that is merely a reformulation of (5.1).
(5.2) Asymptotic equipartition property. Let a = |S|. Given ε > 0, there is an N so that for n ≥ N, the a^n possible outcomes of (X0, ..., X_{n−1}) can be divided into two classes: (i) a class Bn with probability < ε, and (ii) a class Gn in which each outcome (x0, x1, ..., x_{n−1}) has

| −H − n^{−1} log p(x0, ..., x_{n−1}) | < ε

Since the outcomes in Gn have probability exp(−(H ± ε)n), it follows that the number of outcomes in Gn is exp((H ± ε)n). A trivial but informative special case occurs when Xn is an i.i.d. sequence with P(Xn = x) = 1/a for x ∈ S. In this case, H = log a, so we need almost all of the a^n outcomes to capture 1 − ε of the probability mass. Since all outcomes have equal probability, the last conclusion should not be too surprising.
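For the i.i.d. distribution of Example 5.1, the counting claim can be checked directly, since an outcome's probability depends only on its number of 1's. The sketch below (ours, with ε = 0.05) counts the class Gn by summing binomial coefficients:

```python
import math

n, eps = 2000, 0.05
H = (2 / 3) * math.log(3 / 2) + (1 / 3) * math.log(3)

# An outcome with k ones has -n^{-1} log p = (k/n) log 3 + ((n-k)/n) log(3/2).
def rate(k):
    return (k / n) * math.log(3) + ((n - k) / n) * math.log(3 / 2)

good_ks = [k for k in range(n + 1) if abs(rate(k) - H) < eps]

count = sum(math.comb(n, k) for k in good_ks)            # |G_n|
mass = sum(math.exp(math.log(math.comb(n, k))
                    + k * math.log(1 / 3) + (n - k) * math.log(2 / 3))
           for k in good_ks)                             # P(G_n)
growth = math.log(count) / n                             # should be H ± eps
```

Since each outcome in Gn has probability between exp(−(H+ε)n) and exp(−(H−ε)n), the bound growth < H + ε is automatic once the total probability is ≤ 1, and growth > H − ε (up to a vanishing correction) follows as soon as the mass of Gn is close to 1.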
Breiman's proof.  If we let gn(ω) = −log p(ω0 | ω_{−1}, ..., ω_{−n}), then we can use Exercise 2.2 to prove (5.1). To check E(sup gn) < ∞, let

Ak = { gk > λ, sup_{j<k} gj ≤ λ }

and observe that

Σ_j P(Aj) = Σ_j Σ_{ω∈Aj} p(ω_{−j}, ..., ω0) ≤ e^{−λ} Σ_j Σ_{ω∈Aj} p(ω_{−j}, ..., ω_{−1}) ≤ a e^{−λ}

where a = |S|, since for each fixed ω0 the cylinder sets involved are disjoint, so the sum is ≤ 1. Thus P(sup_n gn > λ) ≤ a e^{−λ}, which is integrable in λ.

To apply the result in Exercise 2.2, we note

(1/n) Σ_{m=0}^{n−1} gm(ϕ^m ω) = −(1/n) Σ_{m=0}^{n−1} log p(ωm | ω_{m−1}, ..., ω0) = −(1/n) log p(ω0, ..., ω_{n−1})

and we have assumed Xn is ergodic, so

E(g | I) = E(−log p(ω0 | ω_{−1}, ω_{−2}, ...)) = H

*6.6. A Subadditive Ergodic Theorem
In this section we will prove Liggett's (1985) version of Kingman's (1968)

(6.1) Subadditive ergodic theorem. Suppose X_{m,n}, 0 ≤ m < n, satisfy:
(i) X_{0,m} + X_{m,n} ≥ X_{0,n}
(ii) {X_{nk,(n+1)k}, n ≥ 1} is a stationary sequence for each k.
(iii) The distribution of {X_{m,m+k}, k ≥ 1} does not depend on m.
(iv) EX⁺_{0,1} < ∞ and for each n, EX_{0,n} ≥ γ0 n, where γ0 > −∞. Then
(a) lim_{n→∞} EX_{0,n}/n = inf_m EX_{0,m}/m ≡ γ
(b) X = lim_{n→∞} X_{0,n}/n exists a.s. and in L¹, so EX = γ.
(c) If all the stationary sequences in (ii) are ergodic, then X = γ a.s.

Remark. Kingman assumed (iv), but instead of (i)–(iii) he assumed that X_{ℓ,m} + X_{m,n} ≥ X_{ℓ,n} for all ℓ < m < n and that the distribution of {X_{m+k,n+k}, 0 ≤ m < n} does not depend on k. In two of the four applications in Section 6.7, these stronger conditions do not hold.
Before giving the proof, which is somewhat lengthy, we will consider several examples for motivation. Since the validity of (ii) and (iii) in each case is clear, we will only check (i) and (iv). The first example shows that (6.1) contains the ergodic theorem as a special case.

Example 6.1. Stationary sequences. Suppose ξ1, ξ2, ... is a stationary sequence with E|ξk| < ∞, and let X_{m,n} = ξ_{m+1} + ··· + ξn. Then X_{0,n} = X_{0,m} + X_{m,n}, and (iv) holds.
Example 6.2. Range of random walk. Suppose ξ1, ξ2, ... is a stationary sequence and let Sn = ξ1 + ··· + ξn. Let X_{m,n} = |{S_{m+1}, ..., Sn}|. It is clear that X_{0,m} + X_{m,n} ≥ X_{0,n}, and 0 ≤ X_{0,n} ≤ n, so (iv) holds. Applying (6.1) now gives X_{0,n}/n → X a.s. and in L¹, but it does not tell us what the limit is.

Exercise 6.1. Suppose ξ1, ξ2, ... is ergodic in Example 6.2. Use (c) and (a) of (6.1) to conclude that |{S1, ..., Sn}|/n → P(no return to 0).
Example 6.3. Records from "improving populations," Ballerini and Resnick (1985, 1987). Let ξ1, ξ2, ... be a stationary sequence and let ζn = ξn + cn, where c > 0. Let

X_{m,n} = |{ℓ : m < ℓ ≤ n and ζℓ > ζk for all m < k < ℓ}|

It is clear that X_{0,m} + X_{m,n} ≥ X_{0,n}, and 0 ≤ X_{0,n} ≤ n, so (iv) holds. Applying (6.1) now gives that X_{0,n}/n → X a.s. and in L¹. To identify the limit, extend ξn, n ≥ 0, to {ξn : n ∈ Z} using (1.2), and let ζn = ξn + cn, n ∈ Z. Let Ym = 1 if ζm > ζk for all k < m; Ym = 0 otherwise. An easy extension of (1.1) implies Ym is a stationary sequence, so the ergodic theorem implies

(Y1 + ··· + Yn)/n → Y   a.s. and in L¹

To see that X = Y, observe that X ≥ Y but

EX_{0,n}/n = (1/n) Σ_{m=1}^n P(ζ0 > ζk for 0 > k > −m) → EY

as n → ∞, so EX = EY and hence X = Y a.s.
The analysis in the last paragraph could be applied to identify the limit in Example 6.2. If we let Ym = 1 when Sk ≠ Sm for all k > m, then we get a stationary sequence that has the same limit as X_{0,n}/n. (This was the key to the proof in Section 6.3.) Kingman's original proof shows that we can always identify the limit by this method. That is, we can write

X_{m,n} = Σ_{k=m+1}^n Yk + Z_{m,n}

where Ym, m ≥ 1, is a stationary sequence, and Z_{m,n} ≥ 0 is a subadditive process with EZ_{0,n}/n → 0. We will prove (6.1) without proving Kingman's "decomposition theorem."
Example 6.4. Longest common subsequences. Let X1, X2, X3, ... and Y1, Y2, Y3, ... be ergodic stationary sequences. Let L_{m,n} = max{K : X_{i_k} = Y_{j_k} for 1 ≤ k ≤ K, where m < i1 < i2 < ... < iK ≤ n and m < j1 < j2 < ... < jK ≤ n}. It is clear that

L_{0,m} + L_{m,n} ≥ L_{0,n}

so X_{m,n} = −L_{m,n} is subadditive. 0 ≤ L_{0,n} ≤ n, so (iv) holds. Applying (6.1) now, we conclude that

L_{0,n}/n → γ = sup_{m≥1} E(L_{0,m}/m)   a.s.
m≥1 Exercise 6.2. Suppose that in the last exercise X1 , X2 , . . . and Y1 , Y2 , . . . are
i.i.d. and take the values 0 and 1 with probability 1/2 each. (a) Compute
EL1 and EL2 /2 to get lower bounds on γ . (b) Show γ < 1 by computing
the expected number of i and j sequences of length K = an with the desired
property.
Remark. Chvatal and Sankoﬀ (1975) have shown .727273 ≤ γ ≤ .866595
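L_{0,n} is computable by the classical dynamic program, so γ can be estimated by simulation. The sketch below (ours) uses one seeded pair of fair-coin sequences; for n = 1000 the observed ratio should already fall comfortably inside the Chvátal–Sankoff bounds:

```python
import random

def lcs_length(x, y):
    """Classic O(len(x) * len(y)) longest-common-subsequence DP."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

random.seed(7)
n = 1000
x = [random.randint(0, 1) for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]
ratio = lcs_length(x, y) / n
```

Since changing one coordinate changes L_{0,n} by at most 1, the ratio concentrates tightly around its mean, which increases to γ by superadditivity.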
Our final example shows that the convergence in (a) of (6.1) may occur arbitrarily slowly.

Example 6.5. Suppose X_{m,m+k} = f(k) ≥ 0, where f(k)/k is decreasing. Then

X_{0,n} = f(n) = m (f(n)/n) + (n − m)(f(n)/n) ≤ m (f(m)/m) + (n − m)(f(n − m)/(n − m)) = X_{0,m} + X_{m,n}

so (i) holds, and since EX_{0,n}/n = f(n)/n, the convergence in (a) is as slow as the decrease of f(n)/n.
n−m X0,n = f (n) = m The examples above should provide enough motivation for now. In Section
6.7, we will give four more applications of (6.1).
Proof of (6.1) There are four steps. The ﬁrst, second, and fourth date back
to Kingman (1968). The half dozen proofs of subadditive ergodic theorems that
exist all do the crucial third step in a diﬀerent way. Here we use the approach
of S. Leventhal (1988), who in turn based his proof on Katznelson and Weiss
(1982).
Step 1. The first thing to check is that E|X_{0,n}| ≤ Cn. To do this, we note that (i) implies X⁺_{0,n} ≤ X⁺_{0,m} + X⁺_{m,n}. Repeatedly using the last inequality and invoking (iii) gives EX⁺_{0,n} ≤ nEX⁺_{0,1} < ∞. Since |x| = 2x⁺ − x, it follows from (iv) that

E|X_{0,n}| ≤ 2EX⁺_{0,n} − EX_{0,n} ≤ Cn < ∞

Let an = EX_{0,n}. (i) and (iii) imply that

(6.2)   am + a_{n−m} ≥ an
From this, it follows easily that

(6.3)   an/n → inf_{m≥1} am/m ≡ γ

To prove this, we observe that the liminf is clearly ≥ γ, so all we have to show is that the limsup is ≤ am/m for any m. The last fact is easy, for if we write n = km + ℓ with 0 ≤ ℓ < m, then repeated use of (6.2) gives an ≤ k am + aℓ. Dividing by n = km + ℓ gives

an/n ≤ ( km/(km + ℓ) ) · (am/m) + aℓ/n

Letting n → ∞ and recalling 0 ≤ ℓ < m gives (6.3) and proves (a) in (6.1).

Remark. Chvátal and Sankoff (1975) attribute (6.3) to Fekete (1923).
Step 2. Making repeated use of (i), we get

X_{0,n} ≤ X_{0,km} + X_{km,n}
X_{0,n} ≤ X_{0,(k−1)m} + X_{(k−1)m,km} + X_{km,n}

and so on until the first term on the right is X_{0,m}. Dividing by n = km + ℓ then gives

(6.4)   X_{0,n}/n ≤ ( (X_{0,m} + ··· + X_{(k−1)m,km})/k ) · ( k/(km + ℓ) ) + X_{km,n}/n

Using (ii) and the ergodic theorem now gives that

(X_{0,m} + ··· + X_{(k−1)m,km})/k → Am   a.s. and in L¹

where Am = E(X_{0,m} | Im) and the subscript indicates that Im is the shift-invariant σ-field for the sequence {X_{(k−1)m,km}, k ≥ 1}. The exact formula for the limit is not important, but we will need to know later that EAm = EX_{0,m}.

If we fix ℓ and let ε > 0, then (iii) implies

Σ_{k=1}^∞ P(X_{km,km+ℓ} > (km + ℓ)ε) ≤ Σ_{k=1}^∞ P(X_{0,ℓ} > kε) < ∞

since EX⁺_{0,ℓ} < ∞ by the result at the beginning of Step 1, so by the Borel–Cantelli lemma the last term in (6.4) tends to 0 a.s. The last two observations imply

(6.5)   X̄ ≡ lim sup_{n→∞} X_{0,n}/n ≤ Am/m

Taking expected values now gives EX̄ ≤ E(X_{0,m}/m), and taking the infimum over m, we have EX̄ ≤ γ. Note that if all the stationary sequences in (ii) are ergodic, we have X̄ ≤ γ.

Remark. If (i)–(iii) hold, EX⁺_{0,1} < ∞, and inf EX_{0,m}/m = −∞, then it follows from the last argument that X_{0,n}/n → −∞ a.s. as n → ∞.

Step 3. The next step is to let
X̲ = lim inf_{n→∞} X_{0,n}/n

and show that EX̲ ≥ γ. Since ∞ > EX_{0,1} ≥ γ ≥ γ0 > −∞, and we have shown in Step 2 that EX̄ ≤ γ, it will follow that X̲ = X̄, i.e., the limit of X_{0,n}/n exists a.s. Let

X̲m = lim inf_{n→∞} X_{m,m+n}/n

(i) implies

X_{0,m+n} ≤ X_{0,m} + X_{m,m+n}

Dividing both sides by n and letting n → ∞ gives X̲ ≤ X̲m a.s. However, (iii) implies that X̲m and X̲ have the same distribution, so X̲ = X̲m a.s.

Let ε > 0 and let Z = ε + (X̲ ∨ −M). Since X̲ ≤ X̄ and EX̄ ≤ γ < ∞ by Step 2, E|Z| < ∞. Let

Y_{m,n} = X_{m,n} − (n − m)Z

Y satisfies (i)–(iv), since Z_{m,n} = −(n − m)Z does, and has

(6.6)   Y̲ ≡ lim inf_{n→∞} Y_{0,n}/n ≤ −ε

Let Tm = min{n ≥ 1 : Y_{m,m+n} ≤ 0}. (iii) implies Tm =_d T0 and

E(Y_{m,m+1}; Tm > N) = E(Y_{0,1}; T0 > N)

(6.6) implies that P(T0 < ∞) = 1, so we can pick N large enough so that

E(Y_{0,1}; T0 > N) ≤ ε

Let

Sm = Tm on {Tm ≤ N},   Sm = 1 on {Tm > N}

This is not a stopping time, but there is nothing special about stopping times for a stationary sequence! Let

ξm = 0 on {Tm ≤ N},   ξm = Y_{m,m+1} on {Tm > N}

Since Y(m, m + Tm) ≤ 0 always, and on {Tm > N} we have Sm = 1 and Y_{m,m+1} > 0, it follows that Y(m, m + Sm) ≤ ξm and ξm ≥ 0. Let R0 = 0, and for k ≥ 1, let Rk = R_{k−1} + S(R_{k−1}). Let K = max{k : Rk ≤ n}. From (i), it follows that

Y(0, n) ≤ Y(R0, R1) + ··· + Y(R_{K−1}, RK) + Y(RK, n)

Since ξm ≥ 0 and n − RK ≤ N, the last quantity is

≤ Σ_{m=0}^{n−1} ξm + Σ_{j=1}^N |Y_{n−j,n−j+1}|

Here we have used (i) on Y(RK, n). Dividing both sides by n, taking expected values, and letting n → ∞ gives

lim sup_{n→∞} EY_{0,n}/n ≤ Eξ0 ≤ E(Y_{0,1}; T0 > N) ≤ ε

It follows from (a) and the definition of Y_{0,n} that

γ = lim_{n→∞} EX_{0,n}/n ≤ 2ε + E(X̲ ∨ −M)

Since ε > 0 and M are arbitrary, it follows that EX̲ ≥ γ, and Step 3 is complete.
Step 4. It only remains to prove convergence in L¹. Let Γm = Am/m be the limit in (6.5), recall EΓm = E(X_{0,m}/m), and let Γ = inf_m Γm. Observing that |z| = 2z⁺ − z (consider the two cases z ≥ 0 and z < 0), we can write

E|X_{0,n}/n − Γ| = 2E(X_{0,n}/n − Γ)⁺ − E(X_{0,n}/n − Γ) ≤ 2E(X_{0,n}/n − Γ)⁺

since

E(X_{0,n}/n) ≥ γ = inf_m EΓm ≥ EΓ

Using the trivial inequality (x + y)⁺ ≤ x⁺ + y⁺ and noticing Γm ≥ Γ now gives

E(X_{0,n}/n − Γ)⁺ ≤ E(X_{0,n}/n − Γm)⁺ + E(Γm − Γ)

Now EΓm → γ as m → ∞ and EΓ ≥ EX̄ ≥ EX̲ ≥ γ by Steps 2 and 3, so EΓ = γ, and it follows that E(Γm − Γ) is small if m is large. To bound the other term, observe that (i) implies

E(X_{0,n}/n − Γm)⁺ ≤ E( (X(0, m) + ··· + X((k−1)m, km))/(km + ℓ) − Γm )⁺ + E( X(km, n)⁺/n )

The second term = E(X⁺_{0,ℓ})/n → 0 as n → ∞. For the first, we observe y⁺ ≤ |y|, and the ergodic theorem implies

E| (X(0, m) + ··· + X((k−1)m, km))/k − Γm | → 0

so the proof of (6.1) is complete.

*6.7. Applications
In this section, we will give four applications of our subadditive ergodic theorem (6.1). These examples are independent of each other and can be read in any order. In the last two, we encounter situations to which Liggett's version applies but Kingman's version does not.

Example 7.1. Products of random matrices. Suppose A1, A2, ... is a stationary sequence of k × k matrices with positive entries and let α_{m,n}(i, j) = (A_{m+1} ··· An)(i, j), i.e., the entry in row i and column j of the product. It is clear that

α_{0,m}(1, 1) α_{m,n}(1, 1) ≤ α_{0,n}(1, 1)

so if we let X_{m,n} = −log α_{m,n}(1, 1), then X_{0,m} + X_{m,n} ≥ X_{0,n}. To check (iv), we observe that

Π_{m=1}^n Am(1, 1) ≤ α_{0,n}(1, 1) ≤ k^{n−1} Π_{m=1}^n sup_{i,j} Am(i, j)

or, taking logs,

−Σ_{m=1}^n log Am(1, 1) ≥ X_{0,n} ≥ −(n log k) − Σ_{m=1}^n log sup_{i,j} Am(i, j)

So if E log Am(1, 1) > −∞ then EX⁺_{0,1} < ∞, and if E log sup_{i,j} Am(i, j) < ∞ then EX_{0,n} ≥ γ0 n. If we observe that

P( log sup_{i,j} Am(i, j) ≥ x ) ≤ Σ_{i,j} P( log Am(i, j) ≥ x )

we see that it is enough to assume that

(∗)   E|log Am(i, j)| < ∞ for all i, j

When (∗) holds, applying (6.1) gives X_{0,n}/n → X a.s. Using the strict positivity of the entries, it is easy to improve that result to

(7.1)   (1/n) log α_{0,n}(i, j) → −X   a.s. for all i, j

a result first proved by Furstenberg and Kesten (1960).
The key to the proof above was the fact that α_{0,n}(1, 1) is supermultiplicative. An alternative approach is to let

‖A‖ = max_i Σ_j |A(i, j)| = max{ ‖xA‖₁ : ‖x‖₁ = 1 }

where (xA)_j = Σ_i x_i A(i, j) and ‖x‖₁ = |x1| + ··· + |xk|. From the second definition, it is clear that ‖AB‖ ≤ ‖A‖ · ‖B‖, so if we let

β_{m,n} = ‖A_{m+1} ··· An‖

and Y_{m,n} = log β_{m,n}, then Y_{m,n} is subadditive. It is easy to use (7.1) to show that

(1/n) log ‖A1 ··· An‖ → −X   a.s.

where X is the limit of X_{0,n}/n. To see the advantage in having two proofs of the same result, we observe that if A1, A2, ... is an i.i.d. sequence, then X is constant, and we can get upper and lower bounds by observing

sup_{m≥1} (E log α_{0,m}(1, 1))/m = −X = inf_{m≥1} (E log β_{0,m})/m
Remark. Oseledec (1968) proved a result which gives the asymptotic behavior of all of the eigenvalues of the products A1 ··· An. As Raghunathan (1979) and Ruelle (1979) have observed, this result can also be obtained from (6.1). See Krengel (1985) or the papers cited for details. Furstenberg and Kesten (1960) and later Ishitani (1977) have proved central limit theorems:

(log α_{0,n}(1, 1) − μn)/n^{1/2} ⇒ σχ

where χ has the standard normal distribution. For more about products of random matrices, see Cohen, Kesten, and Newman (1985).
Example 7.2. Increasing sequences in random permutations. Let π be a permutation of {1, 2, …, n} and let ℓ(π) be the length of the longest increasing sequence in π. That is, ℓ(π) is the largest k for which there are integers i_1 < i_2 < … < i_k so that π(i_1) < π(i_2) < … < π(i_k). Hammersley (1970) attacked this problem by putting a rate one Poisson process in the plane and, for s < t ∈ [0, ∞), letting Y_{s,t} denote the length of the longest increasing path lying in the square R_{s,t} with vertices (s, s), (s, t), (t, t), and (t, s). That is, Y_{s,t} is the largest k for which there are points (x_i, y_i) in the Poisson process with s < x_1 < … < x_k < t and s < y_1 < … < y_k < t. It is clear that Y_{0,m} + Y_{m,n} ≤ Y_{0,n}. Applying (6.1) to −Y_{0,n} shows

    Y_{0,n}/n → γ ≡ sup_{m≥1} E Y_{0,m}/m  a.s.

For each k, {Y_{nk,(n+1)k}, n ≥ 0} is i.i.d., so the limit is constant. We will show that γ < ∞ in Exercise 7.2.
To get from the result about the Poisson process back to the random permutation problem, let τ(n) be the smallest value of t for which there are n points in R_{0,t}. Let the n points in R_{0,τ(n)} be written as (x_i, y_i) where 0 < x_1 < x_2 < … < x_n ≤ τ(n), and let π_n be the unique permutation of {1, 2, …, n} so that y_{π_n(1)} < y_{π_n(2)} < … < y_{π_n(n)}. It is clear that Y_{0,τ(n)} = ℓ(π_n). An easy argument shows:

(7.2) Lemma. τ(n)/√n → 1 a.s.

Proof  Let S_n be the number of points in R_{0,√n}. The S_n − S_{n−1} are independent Poisson r.v.'s with mean 1, so the strong law of large numbers implies S_n/n → 1 a.s. If ε > 0 then for large n, S_{n(1−ε)} < n < S_{n(1+ε)} and hence (1 − ε)n ≤ τ(n)² ≤ (1 + ε)n.
It follows from (7.2) and the monotonicity of m → Y_{0,m} that

    n^{−1/2} ℓ(π_n) → γ  a.s.

Hammersley (1970) has a proof that π/2 ≤ γ ≤ e, and Kingman (1973) shows that 1.59 < γ < 2.49. See Exercises 7.2 and 7.3. Subsequent work on the random permutation problem, see Logan and Shepp (1977) and Vershik and Kerov (1977), has shown that γ = 2.
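The convergence n^{−1/2} ℓ(π_n) → 2 is easy to see in simulation. The sketch below computes ℓ(π) by patience sorting in O(n log n) time; the function name `lis_length` and the parameters are illustrative choices, and the tolerance in the final check is generous because fluctuations of ℓ(π_n) are only of order n^{1/6}.

```python
import bisect, math, random

def lis_length(perm):
    """Length of the longest increasing subsequence via patience sorting:
    tails[i] is the smallest possible last element of an increasing
    subsequence of length i+1 seen so far."""
    tails = []
    for x in perm:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

# Small deterministic check (longest increasing subsequence is 1,2,5,6,8).
assert lis_length([3, 1, 4, 2, 5, 9, 7, 6, 8]) == 5

# Monte Carlo check that l(pi_n)/sqrt(n) is near gamma = 2.
random.seed(1)
n = 10_000
perm = list(range(n))
random.shuffle(perm)
ratio = lis_length(perm) / math.sqrt(n)
assert 1.7 < ratio < 2.3
```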
Exercise 7.1. Given a rate one Poisson process in [0, ∞) × [0, ∞), let (X_1, Y_1) be the point that minimizes x + y. Let (X_2, Y_2) be the point in [X_1, ∞) × [Y_1, ∞) that minimizes x + y, and so on. Use this construction to show that γ ≥ (8/π)^{1/2} > 1.59.

Exercise 7.2. Let π_n be a random permutation of {1, …, n} and let J_k^n be the number of subsets of {1, …, n} of size k so that the associated π_n(j) form an increasing subsequence. Compute E J_k^n and take k ∼ αn^{1/2} to conclude γ ≤ e.

Remark. Kingman improved this by observing that if ℓ(π_n) ≥ ℓ then J_k^n ≥ (ℓ choose k). Using this with the bound on E J_k^n and taking ℓ ∼ βn^{1/2} and k ∼ αn^{1/2}, he showed γ < 2.49.

Example 7.3. Age-dependent branching processes. This is a variation of
the branching process introduced in Section 4.3 in which each individual lives
for an amount of time with distribution F before producing k oﬀspring with
probability pk . The description of the process is completed by supposing that
the process starts with one individual in generation 0 who is born at time 0,
and when this particle dies, its oﬀspring start independent copies of the original
process.
Suppose p0 = 0, let X0,m be the birth time of the ﬁrst member of generation m, and let Xm,n be the time lag necessary for that individual to have an
oﬀspring in generation n. In case of ties, pick an individual at random from
those in generation m born at time X0,m . It is clear that X0,n ≤ X0,m + Xm,n .
Since X0,n ≥ 0, (iv) holds if we assume F has ﬁnite mean. Applying (6.1) now,
it follows that
X0,n /n → γ a.s.
The limit is constant because the sequences {Xnk,(n+1)k , n ≥ 0} are i.i.d.
Remark. The inequality X_{ℓ,m} + X_{m,n} ≥ X_{ℓ,n} is false when ℓ > 0, because if we call i_m the individual that determines the value of X_{m,n} for n > m, then i_m may not be a descendant of i_ℓ.

As usual, one has to use other methods to identify the constant. Let t_1, t_2, … be i.i.d. with distribution F, let T_n = t_1 + ··· + t_n, and let μ = ∑_k k p_k. Let Z_n(an) be the number of individuals in generation n born by time an. Each
individual in generation n has probability P (Tn ≤ an) to be born by time an,
and the times are independent of the oﬀspring numbers so
    E Z_n(an) = E E(Z_n(an) | Z_n) = E(Z_n P(T_n ≤ an)) = μ^n P(T_n ≤ an)
By results in Section 1.9, n−1 log P (Tn ≤ an) → −c(a) as n → ∞. If log µ −
c(a) < 0 then Chebyshev's inequality and the Borel-Cantelli lemma imply
P (Zn (an) ≥ 1 i.o.) = 0. Conversely, if EZn (an) > 1 for some n, then we
can deﬁne a supercritical branching process Ym that consists of the oﬀspring
in generation mn that are descendants of individuals in Ym−1 in generation
(m − 1)n that are born less than an units of time after their parents. This
shows that with positive probability, X0,mn ≤ mna for all m. Combining the
last two observations with the fact that c(a) is strictly increasing gives
    γ = inf{a : log μ − c(a) > 0}

The last result is from Biggins (1977). See his (1978) and (1979) papers for
extensions and reﬁnements. Kingman (1975) has an approach to the problem
via martingales:
Exercise 7.3. Let ϕ(θ) = E exp(−θt_i) and

    Y_n = (μϕ(θ))^{−n} ∑_{i=1}^{Z_n} exp(−θT_n(i))

where the sum is over individuals in generation n and T_n(i) is the ith person's birth time. Show that Y_n is a nonnegative martingale and use this to conclude that if exp(−θa)/μϕ(θ) > 1, then P(X_{0,n} ≤ an) → 0. A little thought reveals that this bound is the same as the answer in the last exercise.
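As a concrete instance of the formula γ = inf{a : log μ − c(a) > 0}, suppose each individual has exactly two offspring (p_2 = 1, so μ = 2) and F is exponential with mean 1. The lower-tail Cramér rate from Section 1.9 is then c(a) = a − 1 − log a for 0 < a ≤ 1, so γ solves log 2 = a − 1 − log a; the sketch below finds the root by bisection. The specific choice of F and μ here is an illustrative assumption, not part of the text.

```python
import math

def c(a):
    # lower-tail large deviations rate for the mean of i.i.d. Exp(1)
    # random variables: c(a) = a - 1 - log a for 0 < a <= 1
    return a - 1 - math.log(a)

# gamma = inf{a : log mu - c(a) > 0} with mu = 2: solve c(a) = log 2.
# c is strictly decreasing on (0, 1], from +infinity down to 0.
lo, hi = 1e-9, 1.0
for _ in range(200):
    mid = (lo + hi) / 2
    if c(mid) > math.log(2):
        lo = mid          # mid is still below gamma
    else:
        hi = mid
gamma = (lo + hi) / 2
assert abs(c(gamma) - math.log(2)) < 1e-9
assert 0.2 < gamma < 0.3  # gamma is about 0.232
```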
Example 7.4. First passage percolation. Consider Z^d as a graph with edges connecting each x, y ∈ Z^d with ‖x − y‖ = 1. Assign an independent nonnegative random variable τ(e) to each edge that represents the time required to traverse the edge going in either direction. If e is the edge connecting x and y, let τ(x, y) = τ(y, x) = τ(e). If x_0 = x, x_1, …, x_n = y is a path from x to y, i.e., a sequence with ‖x_m − x_{m−1}‖ = 1 for 1 ≤ m ≤ n, we define the travel time for the path to be τ(x_0, x_1) + ··· + τ(x_{n−1}, x_n). Define the passage time from x to y, t(x, y) = the infimum of the travel times over all paths from x to y. Let u = (1, 0, …, 0) ∈ Z^d and let X_{m,n} = t(mu, nu).

Clearly X_{0,m} + X_{m,n} ≥ X_{0,n}. Since X_{0,n} ≥ 0, if Eτ(x, y) < ∞ then (iv) holds, and (6.1) implies that X_{0,n}/n → X a.s. To see that the limit is constant, enumerate the edges in some order e_1, e_2, … and observe that X is measurable with respect to the tail σ-field of the i.i.d. sequence τ(e_1), τ(e_2), …
Remark. It is not hard to see that the assumption of finite first moment can be weakened. If τ has distribution F with

    (∗)  ∫_0^∞ (1 − F(x))^{2d} dx < ∞

i.e., the minimum of 2d independent copies has finite mean, then by finding 2d disjoint paths from 0 to u = (1, 0, …, 0), one concludes that E t(0, u) < ∞ and (6.1) can be applied. The condition (∗) is also necessary for X_{0,n}/n to converge to a finite limit. If (∗) fails and Y_n is the minimum of τ(e) over all the edges from nu, then

    lim sup_{n→∞} X_{0,n}/n ≥ lim sup_{n→∞} Y_n/n = ∞  a.s.

Above we considered the point-to-point passage time. A second object of interest is the point-to-line passage time:
an = inf {t(0, x) : x1 = n}
Unfortunately, it does not seem to be possible to embed this sequence in a subadditive family. To see the difficulty, let t̄(0, x) be the infimum of travel times over paths from 0 to x that lie in {y : y_1 ≥ 0}, let

    ā_m = inf{ t̄(0, x) : x_1 = m }

and let x_m be a point at which the infimum is achieved. We leave to the reader the highly nontrivial task of proving that such a point exists; see Smythe and Wierman (1978) for a proof. If we let ā_{m,n} be the infimum of travel times over all paths that start at x_m, stay in {y : y_1 ≥ m}, and end on {y : y_1 = n}, then ā_{m,n} is independent of ā_m and

    ā_m + ā_{m,n} ≥ ā_n

The last inequality is true without the half-space restriction, but the independence is not, and without the half-space restriction we cannot get the stationarity properties needed to apply (6.1).

Remark. The family ā_{m,n} is another example where ā_{ℓ,m} + ā_{m,n} ≥ ā_{ℓ,n} need not hold for ℓ > 0.

A second approach to limit theorems for a_m is to prove a result about the set of points that can be reached by time t: ξ_t = {x : t(0, x) ≤ t}. Cox and Durrett (1981) have shown
(7.3) Theorem. For any passage time distribution F with F(0) = 0, there is a convex set A so that for any ε > 0 we have, with probability one,

    ξ_t ⊂ (1 + ε)tA for all t sufficiently large, and |ξ_t^c ∩ (1 − ε)tA ∩ Z^d|/t^d → 0 as t → ∞

Ignoring the boring details of how to state things precisely, the last result says ξ_t/t → A a.s. It implies that a_n/n → γ a.s., where γ = 1/ sup{x_1 : x ∈ A}. (Use the convexity and reflection symmetry of A.) When the distribution has finite mean (or satisfies the weaker condition in the remark above), γ is the limit of t(0, nu)/n. Without any assumptions, t(0, nu)/n → γ in probability. For more details, see the paper cited above. Kesten (1986) and (1987) are good sources for more about first-passage percolation.
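Passage times on a finite box can be computed exactly with Dijkstra's algorithm, which gives a quick way to look at t(0, nu)/n in simulation. The helper names below are illustrative; restricting paths to a finite box only biases t upward slightly, and the final bound uses the pathwise fact that t(0, nu) is at most the sum of the weights on the straight-line path.

```python
import heapq, random

def passage_time(source, target, width, height, weight):
    """Dijkstra for t(source, target) on the grid {0..width} x {0..height};
    weight(x, y) gives the (symmetric) travel time of edge {x, y}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if x == target:
            return d
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        i, j = x
        for y in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= y[0] <= width and 0 <= y[1] <= height:
                nd = d + weight(x, y)
                if nd < dist.get(y, float("inf")):
                    dist[y] = nd
                    heapq.heappush(heap, (nd, y))
    return float("inf")

# Sanity check: with unit weights, t(0, x) is the L1 distance.
assert passage_time((0, 0), (5, 3), 20, 20, lambda x, y: 1.0) == 8.0

# i.i.d. Exp(1) edge weights; each undirected edge gets one cached weight.
random.seed(2)
cache = {}
def w(x, y):
    e = (min(x, y), max(x, y))
    if e not in cache:
        cache[e] = random.expovariate(1.0)
    return cache[e]

n = 30
t = passage_time((0, 0), (n, 0), 60, 40, w)
assert 0.0 < t / n < 3.0  # t is at most the straight-line travel time
```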
Exercise 7.4. Oriented first-passage percolation. Consider a graph with vertices {(m, n) ∈ Z² : m + n is even and n ≤ 0}, and oriented edges connecting (m, n) to (m + 1, n − 1) and (m, n) to (m − 1, n − 1). Assign i.i.d. exponential mean one r.v.'s to each edge. Thinking of the number on edge e as giving the time it takes water to travel down the edge, define t(m, n) = the time at which the fluid first reaches (m, n), and a_n = inf{t(m, −n)}. Show that as n → ∞, a_n/n converges to a limit γ a.s.

Exercise 7.5. Continuing with the setup in the last exercise: (i) Show γ ≤ 1/2 by considering a_1. (ii) Get a positive lower bound on γ by looking at the expected number of paths down to {(m, −n) : −n ≤ m ≤ n} with passage time ≤ an and using results from Section 1.9.
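Because all edges in Exercise 7.4 are oriented downward, t(m, n) can be computed row by row with a simple dynamic program rather than Dijkstra. The sketch below is an illustrative implementation (the helper name `oriented_an` is an assumption): each vertex in a row is reached only from the row above, so the minimum over incoming edges is exact.

```python
import random

def oriented_an(n, weight):
    """a_n = min over m of t(m, -n): fluid starts at (0, 0) and oriented
    edges go from (m, k) down to (m - 1, k - 1) and (m + 1, k - 1);
    weight(u, v) is the travel time of edge (u, v)."""
    t = {0: 0.0}  # first-passage times in the current row, keyed by m
    for k in range(n):
        nxt = {}
        for m, tm in t.items():
            for m2 in (m - 1, m + 1):
                cand = tm + weight((m, -k), (m2, -k - 1))
                if cand < nxt.get(m2, float("inf")):
                    nxt[m2] = cand
        t = nxt
    return min(t.values())

# With all weights equal to 1 every path to row -n uses n edges, so a_n = n.
assert oriented_an(10, lambda u, v: 1.0) == 10.0

# i.i.d. Exp(1) weights, each directed edge sampled once.
random.seed(3)
cache = {}
def w(u, v):
    if (u, v) not in cache:
        cache[(u, v)] = random.expovariate(1.0)
    return cache[(u, v)]

n = 200
ratio = oriented_an(n, w) / n
# a_n > 0, and pathwise a_n is at most the sum of n weights down one side
assert 0.0 < ratio < 2.0
```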
Remark. If we replace the graph in Exercise 7.4 by a binary tree, then we get a problem equivalent to the first birth problem (Example 7.3) for p_2 = 1, P(t_i > x) = e^{−x}. In that case, the lower bound obtained by the methods of part (ii) of Exercise 7.5 was sharp, but in this case it is not.

7 Brownian Motion

Brownian motion is a process of tremendous practical and theoretical significance. It originated (a) as a model of the phenomenon observed by Robert
Brown in 1828 that “pollen grains suspended in water perform a continual
swarming motion,” and (b) in Bachelier’s (1900) work as a model of the stock
market. These are just two of many systems that Brownian motion has been
used to model. On the theoretical side, Brownian motion is a Gaussian Markov
process with stationary independent increments. It lies in the intersection of
three important classes of processes and is a fundamental example in each theory.
The ﬁrst part of this chapter develops properties of Brownian motion. In
Section 7.1, we deﬁne Brownian motion and investigate continuity properties of
its paths. In Section 7.2, we prove the Markov property and a related 0-1 law.
In Section 7.3, we deﬁne stopping times and prove the strong Markov property.
In Section 7.4, we take a close look at the zero set of Brownian motion. In
Section 7.5, we introduce some martingales associated with Brownian motion
and use them to compute the distribution of T and BT for some stopping times
T.
The second part of this chapter applies Brownian motion to some of the
problems considered in Chapters 1 and 2. In Section 7.6, we embed random
walks into Brownian motion to prove Donsker's theorem, a far-reaching generalization of the central limit theorem. In Section 7.7, we extend Donsker's theorem to martingales satisfying "Lindeberg-Feller conditions" and to weakly
dependent stationary sequences. In Section 7.8, we show that the discrepancy
between the empirical distribution and the true distribution when suitably magniﬁed converges to Brownian bridge. In Section 7.9, we prove laws of the iterated
logarithm for Brownian motion and random walks with ﬁnite variance. The last
three sections depend on Section 7.6 but are independent of each other and can
be read in any order.

7.1. Definition and Construction
A one-dimensional Brownian motion is a real-valued process B_t, t ≥ 0, that has the following properties:

(a) If t_0 < t_1 < … < t_n then B(t_0), B(t_1) − B(t_0), …, B(t_n) − B(t_{n−1}) are independent.

(b) If s, t ≥ 0 then

    P(B(s + t) − B(s) ∈ A) = ∫_A (2πt)^{−1/2} exp(−x²/2t) dx

(c) With probability one, t → B_t is continuous.

(a) says that B_t has independent increments. (b) says that the increment B(s + t) − B(s) has a normal distribution with mean 0 and variance t. (c) is self-explanatory.
Thinking of Brown's pollen grain, (c) is certainly reasonable. (a) and (b)
can be justiﬁed by noting that the movement of the pollen grain is due to the net
eﬀect of the bombardment of millions of water molecules, so by the central limit
theorem, the displacement in any one interval should have a normal distribution,
and the displacements in two disjoint intervals should be independent.
Two immediate consequences of the deﬁnition that will be useful many
times are:
(1.1) Translation invariance. {B_t − B_0, t ≥ 0} is independent of B_0 and has the same distribution as a Brownian motion with B_0 = 0.

Proof  Let A_1 = σ(B_0) and let A_2 be the collection of events of the form {B(t_1) − B(t_0) ∈ A_1, …, B(t_n) − B(t_{n−1}) ∈ A_n}. The A_i are π-systems that are independent, so the desired result follows from (4.2) in Chapter 1.
(1.2) The Brownian scaling relation. If B_0 = 0 then for any t > 0,

    {B_{st}, s ≥ 0} =_d {t^{1/2} B_s, s ≥ 0}

To be precise, the two families of r.v.'s have the same finite dimensional distributions, i.e., if s_1 < … < s_n then

    (B_{s_1 t}, …, B_{s_n t}) =_d (t^{1/2} B_{s_1}, …, t^{1/2} B_{s_n})

Proof  To check this when n = 1, we note that t^{1/2} times a normal with mean 0 and variance s is a normal with mean 0 and variance st. The result for n > 1 follows from independent increments.
A second equivalent definition of Brownian motion starting from B_0 = 0, which we will occasionally find useful, is that B_t, t ≥ 0, is a real-valued process satisfying

(a′) B(t) is a Gaussian process (i.e., all its finite dimensional distributions are multivariate normal).
(b′) EB_s = 0 and EB_sB_t = s ∧ t.
(c′) With probability one, t → B_t is continuous.

It is easy to see that (a) and (b) imply (a′). To get (b′) from (a) and (b), suppose s < t and write

    EB_sB_t = E(B_s²) + E(B_s(B_t − B_s)) = s

The converse is even easier. (a′) and (b′) specify the finite dimensional distributions of B_t, which by the last calculation must agree with the ones defined in (a) and (b).
The first question that must be addressed in any treatment of Brownian motion is, "Is there a process with these properties?" The answer is "Yes," of course, or this chapter would not exist. For pedagogical reasons, we will pursue an approach that leads to a dead end and then retreat a little to rectify the difficulty. Fix an x ∈ R and for each 0 < t_1 < … < t_n, define a measure on R^n by

    μ_{x,t_1,…,t_n}(A_1 × … × A_n) = ∫_{A_1} dx_1 ··· ∫_{A_n} dx_n ∏_{m=1}^n p_{t_m − t_{m−1}}(x_{m−1}, x_m)

where A_i ∈ R, x_0 = x, t_0 = 0, and

    p_t(a, b) = (2πt)^{−1/2} exp(−(b − a)²/2t)

From the formula above, it is easy to see that for fixed x the family μ is a consistent set of finite dimensional distributions (f.d.d.'s), that is, if {s_1, …, s_{n−1}} ⊂ {t_1, …, t_n} and t_j ∉ {s_1, …, s_{n−1}} then

    μ_{x,s_1,…,s_{n−1}}(A_1 × ··· × A_{n−1}) = μ_{x,t_1,…,t_n}(A_1 × ··· × A_{j−1} × R × A_j × ··· × A_{n−1})

This is clear when j = n. To check the equality when 1 ≤ j < n, it is enough to show that

    ∫ p_{t_j − t_{j−1}}(x, y) p_{t_{j+1} − t_j}(y, z) dy = p_{t_{j+1} − t_{j−1}}(x, z)
By translation invariance, we can without loss of generality assume x = 0, but
all this says is that the sum of independent normals with mean 0 and variances
tj − tj −1 and tj +1 − tj has a normal distribution with mean 0 and variance
tj +1 − tj −1 .
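The Chapman-Kolmogorov identity just verified can also be spot-checked numerically; because the Gaussian kernels decay so fast, an equally spaced Riemann sum over a wide interval is extremely accurate. The particular values s = t = 1, x = 0, z = 2 below are arbitrary illustrative choices.

```python
import math

def p(t, a, b):
    # Brownian transition density p_t(a, b)
    return math.exp(-(b - a) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)

# Check  integral of p_1(0, y) p_1(y, 2) dy  =  p_2(0, 2)
# by a Riemann sum over y in [-12, 14].
h = 0.01
lhs = sum(p(1, 0, y) * p(1, y, 2) * h
          for y in [-12 + h * i for i in range(int(26 / h))])
rhs = p(2, 0, 2)
assert abs(lhs - rhs) < 1e-6
```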
With the consistency of f.d.d.’s veriﬁed, we get our ﬁrst construction of
Brownian motion:
(1.3) Theorem. Let Ω_o = {functions ω : [0, ∞) → R} and let F_o be the σ-field generated by the finite dimensional sets {ω : ω(t_i) ∈ A_i for 1 ≤ i ≤ n}, where A_i ∈ R. For each x ∈ R, there is a unique probability measure ν_x on (Ω_o, F_o) so that ν_x{ω : ω(0) = x} = 1 and when 0 < t_1 < … < t_n

    (∗)  ν_x{ω : ω(t_i) ∈ A_i} = μ_{x,t_1,…,t_n}(A_1 × ··· × A_n)

This follows from a generalization of Kolmogorov's extension theorem, (7.1) in the Appendix. We will not bother with the details since at this point we are at the dead end referred to above. If C = {ω : t → ω(t) is continuous} then C ∉ F_o, that is, C is not a measurable set. The easiest way of proving C ∉ F_o is to do:

Exercise 1.1. A ∈ F_o if and only if there is a sequence of times t_1, t_2, … in [0, ∞) and a B ∈ R^{{1,2,…}} so that A = {ω : (ω(t_1), ω(t_2), …) ∈ B}. In words, all events in F_o depend on only countably many coordinates.

The above problem is easy to solve. Let Q_2 = {m2^{−n} : m, n ≥ 0} be the dyadic rationals. If Ω_q = {ω : Q_2 → R} and F_q is the σ-field generated by the finite dimensional sets, then enumerating the rationals q_1, q_2, … and applying Kolmogorov's extension theorem shows that we can construct a probability ν_x on (Ω_q, F_q) so that ν_x{ω : ω(0) = x} = 1 and (∗) in (1.3) holds when the t_i ∈ Q_2. To extend B_t to a process defined on [0, ∞), we will show:
(1.4) Theorem. Let T < ∞ and x ∈ R. νx assigns probability one to paths
ω : Q2 → R that are uniformly continuous on Q2 ∩ [0, T ].
Remark. It will take quite a bit of work to prove (1.4). Before taking on that task, we will attend to the last measure theoretic detail: we tidy things up by moving our probability measures to (C, C), where C = {continuous ω : [0, ∞) → R} and C is the σ-field generated by the coordinate maps t → ω(t). To do this, we observe that the map ψ that takes a uniformly continuous point in Ω_q to its unique continuous extension in C is measurable, and we set

    P_x = ν_x ∘ ψ^{−1}

Our construction guarantees that B_t(ω) = ω_t has the right finite dimensional distributions for t ∈ Q_2. Continuity of paths and a simple limiting argument show that this is true when t ∈ [0, ∞). Finally, the reader should note that, as in the case of Markov chains, we have one set of random variables B_t(ω) = ω(t), and a family of probability measures P_x, x ∈ R, so that under P_x, B_t is a Brownian motion with P_x(B_0 = x) = 1.
Proof of (1.4)  By (1.1) and (1.2), we can without loss of generality suppose B_0 = 0 and prove the result for T = 1. In this case, part (b) of the definition and the scaling relation (1.2) imply

    E_0(B_t − B_s)^4 = E_0 B_{t−s}^4 = C(t − s)²

where C = E_0 B_1^4 < ∞. From the last observation, we get the desired uniform continuity by using the following result due to Kolmogorov.
(1.5) Theorem. Suppose E|X_s − X_t|^β ≤ K|t − s|^{1+α} where α, β > 0. If γ < α/β then with probability one there is a constant C(ω) so that

    |X(q) − X(r)| ≤ C|q − r|^γ  for all q, r ∈ Q_2 ∩ [0, 1]

Proof  Let γ < α/β, η > 0, I_n = {(i, j) : 0 ≤ i ≤ j ≤ 2^n, 0 < j − i ≤ 2^{nη}}, and G_n = {|X(j2^{−n}) − X(i2^{−n})| ≤ ((j − i)2^{−n})^γ for all (i, j) ∈ I_n}. Since a^β P(|Y| > a) ≤ E|Y|^β, we have

    P(G_n^c) ≤ ∑_{(i,j)∈I_n} ((j − i)2^{−n})^{−βγ} E|X(j2^{−n}) − X(i2^{−n})|^β

Using our assumption and then noticing the number of (i, j) ∈ I_n is ≤ 2^n 2^{nη}, we see that the right-hand side is (for the second step note α − βγ > 0)

    ≤ K ∑_{(i,j)∈I_n} ((j − i)2^{−n})^{−βγ+1+α} ≤ K 2^n 2^{nη} (2^{nη} 2^{−n})^{−βγ+1+α} = K 2^{−nλ}

where λ = (1 − η)(1 + α − βγ) − (1 + η). Since γ < α/β, we can pick η small enough so that λ > 0. To complete the proof now, we will show:
(1.6) Lemma. Let A = 3 · 2^{(1−η)γ}/(1 − 2^{−γ}). On H_N = ∩_{n=N}^∞ G_n we have

    |X(q) − X(r)| ≤ A|q − r|^γ  for q, r ∈ Q_2 ∩ [0, 1] with |q − r| < 2^{−(1−η)N}

(1.6) implies (1.5)  The trivial inequality

    P(H_N^c) ≤ ∑_{n=N}^∞ P(G_n^c) ≤ K ∑_{n=N}^∞ 2^{−nλ} = K 2^{−Nλ}/(1 − 2^{−λ})

implies that almost surely |X(q) − X(r)| ≤ A|q − r|^γ for q, r ∈ Q_2 with |q − r| < δ(ω). To extend this to q, r ∈ Q_2 ∩ [0, 1], let s_0 = q < s_1 < … < s_n = r with |s_i − s_{i−1}| < δ(ω) and use the triangle inequality to conclude |X(q) − X(r)| ≤ C(ω)|q − r|^γ where C(ω) = A(1 + δ(ω)^{−1}).
Proof of (1.6)  Let q, r ∈ Q_2 ∩ [0, 1] with 0 < r − q < 2^{−(1−η)N}. Pick m ≥ N so that

    2^{−(m+1)(1−η)} ≤ r − q < 2^{−m(1−η)}

and write

    q = i2^{−m} − 2^{−q(1)} − ··· − 2^{−q(k)},   r = j2^{−m} + 2^{−r(1)} + ··· + 2^{−r(ℓ)}

where m < r(1) < ··· < r(ℓ) and m < q(1) < ··· < q(k). Now 0 < r − q < 2^{−m(1−η)}, so (j − i) < 2^{mη}, and it follows that on H_N

    (a)  |X(i2^{−m}) − X(j2^{−m})| ≤ ((2^{mη})2^{−m})^γ

On H_N, it follows from the triangle inequality that

    (b)  |X(q) − X(i2^{−m})| ≤ ∑_{h=1}^k (2^{−q(h)})^γ ≤ ∑_{h=m}^∞ (2^{−γ})^h ≤ C_γ 2^{−γm}

where C_γ = 1/(1 − 2^{−γ}) > 1. Repeating the last computation shows

    (c)  |X(r) − X(j2^{−m})| ≤ C_γ 2^{−γm}

Combining (a)-(c) gives |X(q) − X(r)| ≤ 3C_γ 2^{−γm(1−η)} ≤ 3C_γ 2^{(1−η)γ}|r − q|^γ since 2^{−m(1−η)} ≤ 2^{1−η}(r − q). This completes the proof of (1.6) and hence of (1.5) and (1.4).
The scaling relation (1.2) implies

    E|B_t − B_s|^{2m} = C_m|t − s|^m  where C_m = E|B_1|^{2m}

so using (1.5) with β = 2m, α = m − 1 and letting m → ∞ gives a result of Wiener (1923).

(1.7) Theorem. Brownian paths are Hölder continuous with exponent γ for any γ < 1/2.
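The moment bound that drives (1.7) is easy to confirm by simulation: B_t − B_s is √(t − s) times a standard normal Z, and E Z^4 = 3, so E|B_t − B_s|^4 = 3(t − s)². The sample size and tolerance below are illustrative choices, set so the check passes with many standard deviations to spare.

```python
import math, random

# Monte Carlo check of E|B_t - B_s|^4 = 3 (t - s)^2.
random.seed(4)
N = 100_000
t_minus_s = 0.37
m4 = sum((math.sqrt(t_minus_s) * random.gauss(0, 1)) ** 4
         for _ in range(N)) / N
# sd of the estimator of E Z^4 is about sqrt(96/N) ~ 0.03, so 0.3 is safe
assert abs(m4 - 3 * t_minus_s ** 2) < 0.3 * t_minus_s ** 2
```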
It is easy to show:

(1.8) Theorem. With probability one, Brownian paths are not Lipschitz continuous (and hence not differentiable) at any point.

Remark. The nondifferentiability of Brownian paths was discovered by Paley, Wiener, and Zygmund (1933). Paley died in 1933 at the age of 26 in a skiing accident while the paper was in press. The proof we are about to give is due to Dvoretsky, Erdős, and Kakutani (1961).

Proof  Fix a constant C < ∞ and let A_n = {ω : there is an s ∈ [0, 1] so that |B_t − B_s| ≤ C|t − s| when |t − s| ≤ 3/n}. For 1 ≤ k ≤ n − 2, let

    Y_{k,n} = max{ |B((k+j)/n) − B((k+j−1)/n)| : j = 0, 1, 2 }
    B_n = { at least one Y_{k,n} ≤ 5C/n }

The triangle inequality implies A_n ⊂ B_n. The worst case is s = 1. We pick k = n − 2 and observe

    |B((n−3)/n) − B((n−2)/n)| ≤ |B((n−3)/n) − B(1)| + |B(1) − B((n−2)/n)| ≤ C(3/n + 2/n)

Using A_n ⊂ B_n and the scaling relation (1.2) gives

    P(A_n) ≤ P(B_n) ≤ n P(|B(1/n)| ≤ 5C/n)³ = n P(|B(1)| ≤ 5C/n^{1/2})³ ≤ n{(10C/n^{1/2}) · (2π)^{−1/2}}³

since exp(−x²/2) ≤ 1. Letting n → ∞ shows P(A_n) → 0. Noticing n → A_n is increasing shows P(A_n) = 0 for all n and completes the proof.
Exercise 1.2. Looking at the proof of (1.8) carefully shows that if γ > 5/6 then B_t is not Hölder continuous with exponent γ at any point in [0, 1]. Show, by considering k increments instead of 3, that the last conclusion is true for all γ > 1/2 + 1/k.

The next result is more evidence that the sample paths of Brownian motion behave locally like √t.

Exercise 1.3. Fix t and let Δ_{m,n} = B(tm2^{−n}) − B(t(m−1)2^{−n}). Compute

    E( ∑_{m≤2^n} Δ_{m,n}² − t )²

and use Borel-Cantelli to conclude that ∑_{m≤2^n} Δ_{m,n}² → t a.s. as n → ∞.
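The convergence in Exercise 1.3 is visible at quite modest resolutions. In the sketch below the increments Δ_{m,n} are sampled directly as independent N(0, dt) variables; with 2^{14} increments the sum of squares has mean t and standard deviation √(2 dt t), about 0.01 here, so the stated tolerance is a very safe illustrative choice.

```python
import math, random

# Simulate B on a dyadic grid and check sum of squared increments -> t.
random.seed(5)
t = 1.0
n = 14                      # 2^14 increments of size dt
dt = t / 2 ** n
qv = sum(random.gauss(0, math.sqrt(dt)) ** 2 for _ in range(2 ** n))
assert abs(qv - t) < 0.1    # sd of qv is roughly 0.01
```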
Remark. The last result is true if we consider a sequence of partitions Π_1 ⊂ Π_2 ⊂ … with mesh → 0. See Freedman (1971a) p. 42-46. However, the true quadratic variation, defined as the sup over all partitions, is ∞.

7.2. Markov Property, Blumenthal's 0-1 Law
Intuitively, the Markov property says "if s ≥ 0 then B(t + s) − B(s), t ≥ 0, is a Brownian motion that is independent of what happened before time s." The first step in making this into a precise statement is to explain what we mean by "what happened before time s." The first thing that comes to mind is

    F_s^o = σ(B_r : r ≤ s)

For reasons that will become clear as we go along, it is convenient to replace F_s^o by

    F_s^+ = ∩_{t>s} F_t^o

The fields F_s^+ are nicer because they are right continuous:

    ∩_{t>s} F_t^+ = ∩_{t>s} (∩_{u>t} F_u^o) = ∩_{u>s} F_u^o = F_s^+

In words, the F_s^+ allow us an "infinitesimal peek at the future," i.e., A ∈ F_s^+ if it is in F_{s+ε}^o for any ε > 0. If f(u) > 0 for all u > 0, then the random variable

    lim sup_{t↓s} |B_t − B_s| / f(t − s)

is measurable with respect to F_s^+ but not F_s^o. We will see below that there are no interesting examples, i.e., F_s^+ and F_s^o are the same (up to null sets).
To state the Markov property, we need some notation. Recall that we have a family of measures P_x, x ∈ R, on (C, C) so that under P_x, B_t(ω) = ω(t) is a Brownian motion starting at x. For s ≥ 0, we define the shift transformation θ_s : C → C by

    (θ_s ω)(t) = ω(s + t)  for t ≥ 0

In words, we cut off the part of the path before time s and then shift the path so that time s becomes time 0.

(2.1) Markov property. If s ≥ 0 and Y is bounded and C measurable, then for all x ∈ R

    E_x(Y ∘ θ_s | F_s^+) = E_{B_s} Y
where the right-hand side is the function ϕ(x) = E_x Y evaluated at x = B_s.

Proof  By the definition of conditional expectation, what we need to show is that

    (∗)  E_x(Y ∘ θ_s; A) = E_x(E_{B_s}Y; A)  for all A ∈ F_s^+

We will begin by proving the result for a carefully chosen special case and then use the monotone class theorem (MCT) to get the general case. Suppose Y(ω) = ∏_{1≤m≤n} f_m(ω(t_m)), where 0 < t_1 < … < t_n and the f_m are bounded and measurable. Let 0 < h < t_1, let 0 < s_1 < … < s_k ≤ s + h, and let A = {ω : ω(s_j) ∈ A_j, 1 ≤ j ≤ k}, where A_j ∈ R for 1 ≤ j ≤ k. From the definition of Brownian motion, it follows that

    E_x(Y ∘ θ_s; A) = ∫_{A_1} dx_1 p_{s_1}(x, x_1) ∫_{A_2} dx_2 p_{s_2−s_1}(x_1, x_2) ··· ∫_{A_k} dx_k p_{s_k−s_{k−1}}(x_{k−1}, x_k) ∫ dy p_{s+h−s_k}(x_k, y) ϕ(y, h)

where

    ϕ(y, h) = ∫ dy_1 p_{t_1−h}(y, y_1)f_1(y_1) ··· ∫ dy_n p_{t_n−t_{n−1}}(y_{n−1}, y_n)f_n(y_n)

For more details, see the proof of (1.6) in Chapter 5, which applies without change here. Using that identity on the right-hand side, we have

    (∗∗)  E_x(Y ∘ θ_s; A) = E_x(ϕ(B_{s+h}, h); A)

The last equality holds for all finite dimensional sets A, so the π − λ theorem, (4.1) in Chapter 1, implies that it is valid for all A ∈ F_{s+h}^o ⊃ F_s^+.

It is easy to see by induction on n that

    ψ(y_1) = f_1(y_1) ∫ dy_2 p_{t_2−t_1}(y_1, y_2)f_2(y_2) ··· ∫ dy_n p_{t_n−t_{n−1}}(y_{n−1}, y_n)f_n(y_n)

is bounded and measurable. Letting h ↓ 0 and using the dominated convergence theorem shows that if x_h → x, then

    ϕ(x_h, h) = ∫ dy_1 p_{t_1−h}(x_h, y_1)ψ(y_1) → ϕ(x, 0)

as h ↓ 0. Using (∗∗) and the bounded convergence theorem now gives

    E_x(Y ∘ θ_s; A) = E_x(ϕ(B_s, 0); A)

for all A ∈ F_s^+. This shows that (∗) holds for Y = ∏_{1≤m≤n} f_m(ω(t_m)) with the f_m bounded and measurable.

The desired conclusion now follows from the monotone class theorem, (1.5) in Chapter 5. Let H = the collection of bounded functions for which (∗) holds. H clearly has properties (ii) and (iii). Let A be the collection of sets of the form {ω : ω(t_j) ∈ A_j}, where A_j ∈ R. The special case treated above shows (i)
holds and the desired conclusion follows. The next two exercises give typical applications of the Markov property.
In Section 7.4, we will use these equalities to compute the distributions of L
and R.
Exercise 2.1. Let T_0 = inf{s > 0 : B_s = 0} and let R = inf{t > 1 : B_t = 0}. R is for right or return. Use the Markov property at time 1 to get

    (2.2)  P_x(R > 1 + t) = ∫ p_1(x, y) P_y(T_0 > t) dy

Exercise 2.2. Let T_0 = inf{s > 0 : B_s = 0} and let L = sup{t ≤ 1 : B_t = 0}. L is for left or last. Use the Markov property at time 0 < t < 1 to conclude

    (2.3)  P_0(L ≤ t) = ∫ p_t(0, y) P_y(T_0 > 1 − t) dy

The reader will see many applications of the Markov property below, so we turn our attention now to a "triviality" that has surprising consequences.
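Before moving on, (2.3) can be spot-checked numerically. Assuming the reflection-principle formula P_y(T_0 > s) = P(|B_s| < |y|) = erf(|y|/√(2s)), which is a standard fact not derived until later, the right side of (2.3) should equal the arcsine-law value P_0(L ≤ t) = (2/π) arcsin √t. The grid and the value t = 1/4 below are illustrative choices.

```python
import math

def p(t, a, b):
    return math.exp(-(b - a) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)

def no_zero(y, s):
    # P_y(T_0 > s) = P(|B_s| < |y|), by the reflection principle
    return math.erf(abs(y) / math.sqrt(2 * s))

# Riemann sum for the right side of (2.3), compared with the arcsine law.
t = 0.25
h = 0.001
lhs = sum(p(t, 0, y) * no_zero(y, 1 - t) * h
          for y in [-8 + h * i for i in range(int(16 / h))])
rhs = (2 / math.pi) * math.asin(math.sqrt(t))   # equals 1/3 here
assert abs(lhs - rhs) < 1e-3
```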
Since

    E_x(Y ∘ θ_s | F_s^+) = E_{B(s)} Y ∈ F_s^o

it follows from (1.1) in Chapter 5 that

    E_x(Y ∘ θ_s | F_s^+) = E_x(Y ∘ θ_s | F_s^o)

From the last equation, it is a short step to:
(2.4) Theorem. If Z ∈ C is bounded then for all s ≥ 0 and x ∈ R,

    E_x(Z | F_s^+) = E_x(Z | F_s^o)
Proof  As in the proof of (2.1), it suffices to prove the result when

    Z = ∏_{m=1}^n f_m(B(t_m))

and the f_m are bounded and measurable. In this case, Z can be written as X(Y ∘ θ_s), where X ∈ F_s^o and Y is C measurable, so

    E_x(Z | F_s^+) = X E_x(Y ∘ θ_s | F_s^+) = X E_{B_s} Y ∈ F_s^o

and the proof is complete.

If we let Z ∈ F_s^+, then (2.4) implies Z = E_x(Z | F_s^o) ∈ F_s^o, so the two σ-fields are the same up to null sets. At first glance, this conclusion is not exciting. The fun starts when we take s = 0 in (2.4) to get:
(2.5) Blumenthal's 0-1 law. If A ∈ F_0^+ then for all x ∈ R, P_x(A) ∈ {0, 1}.

Proof  Using A ∈ F_0^+, (2.4), and the fact that F_0^o = σ(B_0) is trivial under P_x gives

    1_A = E_x(1_A | F_0^+) = E_x(1_A | F_0^o) = P_x(A)   P_x-a.s.

This shows that the indicator function 1_A is a.s. equal to the number P_x(A), and the result follows.
In words, the last result says that the germ field, F_0^+, is trivial. This result is very useful in studying the local behavior of Brownian paths.

(2.6) Theorem. If τ = inf{t ≥ 0 : B_t > 0} then P_0(τ = 0) = 1.
Proof  P_0(τ ≤ t) ≥ P_0(B_t > 0) = 1/2 since the normal distribution is symmetric about 0. Letting t ↓ 0, we conclude

    P_0(τ = 0) = lim_{t↓0} P_0(τ ≤ t) ≥ 1/2

so it follows from (2.5) that P_0(τ = 0) = 1.

Since Brownian motion must hit (0, ∞) immediately starting from 0, by symmetry it must also hit (−∞, 0) immediately. Since t → B_t is continuous, this forces:

(2.7) Theorem. If T_0 = inf{t > 0 : B_t = 0} then P_0(T_0 = 0) = 1.
A corollary of (2.7) is:
Exercise 2.3. If a < b, then with probability one there is a local maximum of
Bt in (a, b). So the set of local maxima of Bt is almost surely a dense set.
Another typical application of (2.5) is:
Exercise 2.4. (i) Suppose f(t) > 0 for all t > 0. Use (2.5) to conclude that lim sup_{t↓0} B(t)/f(t) = c, P_0 a.s., where c ∈ [0, ∞] is a constant. (ii) Show that if f(t) = √t then c = ∞, so with probability one Brownian paths are not Hölder continuous of order 1/2 at 0.

Remark. Let H_γ(ω) be the set of times at which the path ω ∈ C is Hölder continuous of order γ. (1.7) shows that P(H_γ = [0, ∞)) = 1 for γ < 1/2. Exercise 1.2 shows that P(H_γ = ∅) = 1 for γ > 1/2. The last exercise shows P(t ∈ H_{1/2}) = 0 for each t, but B. Davis (1983) has shown P(H_{1/2} ≠ ∅) = 1.
(2.5) concerns the behavior of Bt as t → 0. By using a trick, we can use
this result to get information about the behavior as t → ∞.
(2.8) Theorem. If Bt is a Brownian motion starting at 0, then so is the process
deﬁned by X0 = 0 and Xt = tB (1/t) for t > 0.
Proof  Here we will check the second definition of Brownian motion. To do this, we note: (i) If 0 < t_1 < … < t_n, then (X(t_1), …, X(t_n)) has a multivariate normal distribution with mean 0. (ii) EX_s = 0 and if s < t then

    E(X_s X_t) = st E(B(1/s)B(1/t)) = s

For (iii) we note that X is clearly continuous at t ≠ 0. To handle t = 0, we note that X has the same finite dimensional distributions as Brownian motion and then use (1.4).

Direct proof of continuity. We begin by observing that the strong law of large numbers implies B_n/n → 0 as n → ∞ through the integers. To handle values in between integers, we note that Kolmogorov's inequality, (8.2) in Chapter 1, implies

    P( sup_{0<k≤2^m} |B(n + k2^{−m}) − B_n| > n^{2/3} ) ≤ n^{−4/3} E(B_{n+1} − B_n)²

Letting m → ∞, we have

    P( sup_{u∈[n,n+1]} |B_u − B_n| > n^{2/3} ) ≤ n^{−4/3}

Since ∑_n n^{−4/3} < ∞, the Borel-Cantelli lemma implies B_u/u → 0 as u → ∞. Taking u = 1/t, we have X_t → 0 as t → 0.
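The covariance computation in step (ii) can be confirmed by simulation: for s < t, sample B(1/t) and then B(1/s) as B(1/t) plus an independent increment, and average X_s X_t. The sample size and tolerance are illustrative; with 200,000 samples the standard error is about 0.002, far below the stated bound.

```python
import math, random

# Monte Carlo check of E(X_s X_t) = s for X_t = t B(1/t), s < t.
random.seed(6)
s, t = 0.5, 1.0
N = 200_000
acc = 0.0
for _ in range(N):
    b_inv_t = math.sqrt(1 / t) * random.gauss(0, 1)
    b_inv_s = b_inv_t + math.sqrt(1 / s - 1 / t) * random.gauss(0, 1)
    acc += (s * b_inv_s) * (t * b_inv_t)
est = acc / N
assert abs(est - s) < 0.05
```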
(2.8) allows us to relate the behavior of B_t as t → ∞ and as t → 0. Combining this idea with Blumenthal's 0-1 law leads to a very useful result. Let

    F_t = σ(B_s : s ≥ t) = the future at time t
    T = ∩_{t≥0} F_t = the tail σ-field
(2.9) Theorem. If A ∈ T then either Px (A) ≡ 0 or Px (A) ≡ 1.
Remark. Notice that this is stronger than the conclusion of Blumenthal's 0-1 law (2.5). The examples A = {ω : ω(0) ∈ D} show that for A in the germ σ-field F_0^+, the value of P_x(A), 1_D(x) in this case, may depend on x.

Proof  Since the tail σ-field of B is the same as the germ σ-field for X, it follows that P_0(A) ∈ {0, 1}. To improve this to the conclusion given, observe that A ∈ F_1, so 1_A can be written as 1_D ∘ θ_1. Applying the Markov property gives

    P_x(A) = E_x(1_D ∘ θ_1) = E_x(E_x(1_D ∘ θ_1 | F_1)) = E_x(E_{B_1} 1_D)
           = ∫ (2π)^{−1/2} exp(−(y − x)²/2) P_y(D) dy

Taking x = 0, we see that if P_0(A) = 0, then P_y(D) = 0 for a.e. y with respect to Lebesgue measure, and using the formula again shows P_x(A) = 0 for all x. To handle the case P_0(A) = 1, observe that A^c ∈ T and P_0(A^c) = 0, so the last result implies P_x(A^c) = 0 for all x.
The next result is a typical application of (2.9).
(2.10) Theorem. Let Bt be a one-dimensional Brownian motion starting at 0. Then with probability 1,

lim sup_{t→∞} B_t/√t = ∞   and   lim inf_{t→∞} B_t/√t = −∞

Proof. Let K < ∞. By Exercise 6.6 in Chapter 1 and scaling,

P0(B_n/√n ≥ K i.o.) ≥ lim sup_{n→∞} P0(B_n ≥ K√n) = P0(B_1 ≥ K) > 0

so the 0–1 law (2.9) implies the probability is 1. Since K is arbitrary, this proves the first result. The second one follows from symmetry.
From (2.10), translation invariance, and the continuity of Brownian paths
it follows that we have:
(2.11) Theorem. Let Bt be a onedimensional Brownian motion and let A =
∩n {Bt = 0 for some t ≥ n}. Then Px (A) = 1 for all x.
In words, onedimensional Brownian motion is recurrent. For any starting point
x, it will return to 0 “inﬁnitely often,” i.e., there is a sequence of times tn ↑ ∞
so that Btn = 0. We have to be careful with the interpretation of the phrase in
quotes since, starting from 0, Bt will hit 0 infinitely many times by time ε for any ε > 0.
Last rites. With our discussion of Blumenthal's 0–1 law complete, the distinction between F_s^o and F_s^+ is no longer important, so we will make one final improvement in our σ-fields and remove the superscripts. Let

N_x = {A : A ⊂ D with Px(D) = 0}
F_s^x = σ(F_s^+ ∪ N_x)
F_s = ∩_x F_s^x

N_x are the null sets and F_s^x are the completed σ-fields for Px. Since we do not want the filtration to depend on the initial state, we take the intersection of all the σ-fields. The reader should note that it follows from the definition that the F_s are right continuous.

7.3. Stopping Times, Strong Markov Property
Generalizing a deﬁnition in Section 3.1, we call a random variable S taking
values in [0, ∞] a stopping time if for all t ≥ 0, {S < t} ∈ Ft . In the last
deﬁnition, we have obviously made a choice between {S < t} and {S ≤ t}. This
makes a big diﬀerence in discrete time but none in continuous time (for a right
continuous ﬁltration Ft ) :
If {S ≤ t} ∈ Ft then {S < t} = ∪n {S ≤ t − 1/n} ∈ Ft .
If {S < t} ∈ Ft then {S ≤ t} = ∩n {S < t + 1/n} ∈ Ft .
The ﬁrst conclusion requires only that t → Ft is increasing. The second relies
on the fact that t → Ft is right continuous. (3.2) and (3.3) below show that
when checking something is a stopping time, it is nice to know that the two
deﬁnitions are equivalent.
(3.1) Theorem. If G is an open set and T = inf {t ≥ 0 : Bt ∈ G} then T is a
stopping time.
Proof Since G is open and t → Bt is continuous, {T < t} = ∪q<t {Bq ∈ G},
where the union is over all rational q , so {T < t} ∈ Ft . Here we need to use the
rationals to get a countable union, and hence a measurable set.
(3.2) Theorem. If Tn is a sequence of stopping times and Tn ↓ T then T is a
stopping time.
Proof. {T < t} = ∪_n {T_n < t}.

(3.3) Theorem. If Tn is a sequence of stopping times and Tn ↑ T then T is a
stopping time.
Proof. {T ≤ t} = ∩_n {T_n ≤ t}.

(3.4) Theorem. If K is a closed set and T = inf{t ≥ 0 : Bt ∈ K} then T is a
stopping time.
Proof Let Gn = ∪{(x − 1/n, x + 1/n) : x ∈ K } and let Tn = inf {t ≥ 0 :
Bt ∈ Gn }. Since Gn is open, it follows from (3.1) that Tn is a stopping time.
I claim that as n ↑ ∞, Tn ↑ T . To prove this, notice that T ≥ Tn for all n, so
lim Tn ≤ T. To prove T ≤ lim Tn, we can suppose that Tn ↑ t < ∞. Since B(Tn) ∈ Ḡn for all n and B(Tn) → B(t), it follows that B(t) ∈ K and T ≤ t.
Exercise 3.1. Let S be a stopping time and let Sn = ([2^n S] + 1)/2^n where [x] = the largest integer ≤ x. That is,

Sn = (m + 1)2^{-n}  if m2^{-n} ≤ S < (m + 1)2^{-n}

In words, we stop at the first time of the form k2^{-n} after S (i.e., > S). From
the verbal description, it should be clear that Sn is a stopping time. Prove that
it is.
Exercise 3.2. If S and T are stopping times, then S ∧ T = min{S, T },
S ∨ T = max{S, T }, and S + T are also stopping times. In particular, if t ≥ 0,
then S ∧ t, S ∨ t, and S + t are stopping times.
Exercise 3.3. Let Tn be a sequence of stopping times. Show that

sup_n Tn,  inf_n Tn,  lim sup_n Tn,  lim inf_n Tn

are stopping times.
(3.1) and (3.4) will take care of all the hitting times we will consider. Our
next goal is to state and prove the strong Markov property. To do this, we need
to generalize two deﬁnitions from Section 3.1. Given a nonnegative random
variable S (ω ) we deﬁne the random shift θS , which “cuts oﬀ the part of ω
before S (ω ) and then shifts the path so that time S (ω ) becomes time 0.” In
symbols, we set
(θ_S ω)(t) = ω(S(ω) + t)  on {S < ∞};   (θ_S ω)(t) = Δ  on {S = ∞}

where Δ is an extra point we add to C. As in Section 5.2, we will usually
explicitly restrict our attention to {S < ∞}, so the reader does not have to
worry about the second half of the deﬁnition.
The second quantity FS , “the information known at time S ,” is a little
more subtle. Imitating the discrete time deﬁnition from Section 3.1, we let
FS = {A : A ∩ {S ≤ t} ∈ Ft for all t ≥ 0}
In words, this makes the reasonable demand that the part of A that lies in
{S ≤ t} should be measurable with respect to the information available at time
t. Again we have made a choice between ≤ t and < t, but as in the case of
stopping times, this makes no diﬀerence, and it is useful to know that the two
deﬁnitions are equivalent.
Exercise 3.4. Show that when Ft is right continuous, the last deﬁnition is
unchanged if we replace {S ≤ t} by {S < t}.
For practice with the deﬁnition of FS , do:
Exercise 3.5. Let S be a stopping time, let A ∈ FS , and let
R = S on A,  R = ∞ on A^c

Show that R is a stopping time.
Exercise 3.6. Let S and T be stopping times.
(i) {S < t}, {S > t}, {S = t} are in FS .
(ii) {S < T }, {S > T }, and {S = T } are in FS (and in FT ).
Most of the properties of FN derived in Section 3.1 carry over to continuous
time. Two that will be useful below are:
(3.5) Theorem. If S ≤ T are stopping times then FS ⊂ FT .
Proof. If A ∈ FS then A ∩ {T ≤ t} = (A ∩ {S ≤ t}) ∩ {T ≤ t} ∈ Ft.

(3.6) Theorem. If Tn ↓ T are stopping times then FT = ∩ F(Tn).
Proof (3.5) implies F (Tn ) ⊃ FT for all n. To prove the other inclusion, let
A ∈ ∩F(Tn). Since A ∩ {Tn < t} ∈ Ft and Tn ↓ T, it follows that A ∩ {T < t} ∈ Ft.
The last result allows us to prove something that is obvious from the verbal
deﬁnition.
Exercise 3.7. BS ∈ FS , i.e., the value of BS is measurable with respect to
the information known at time S ! To prove this, let Sn = ([2n S ] + 1)/2n be the
stopping times deﬁned in Exercise 3.1. Show B (Sn ) ∈ FSn , then let n → ∞
and use (3.6).
We are now ready to state the strong Markov property, which says that
the Markov property holds at stopping times.
(3.7) Strong Markov property. Let (s, ω) → Y_s(ω) be bounded and R × C measurable. If S is a stopping time, then for all x ∈ R

Ex(Y_S ∘ θ_S | F_S) = E_{B(S)} Y_S   on {S < ∞}

where the right-hand side is the function ϕ(x, t) = Ex Y_t evaluated at x = B(S), t = S.
Proof. We first prove the result under the assumption that there is a sequence of times tn ↑ ∞ so that Px(S < ∞) = Σ_n Px(S = tn). In this case, the proof is basically the same as the proof of (2.4) in Chapter 5. We break things down according to the value of S, apply the Markov property, and put the pieces back together. If we let Zn = Y_{tn}(ω) and A ∈ FS, then

Ex(Y_S ∘ θ_S ; A ∩ {S < ∞}) = Σ_{n=1}^∞ Ex(Z_n ∘ θ_{tn} ; A ∩ {S = tn})

Now if A ∈ FS, A ∩ {S = tn} = (A ∩ {S ≤ tn}) − (A ∩ {S ≤ tn−1}) ∈ F_{tn}, so it follows from the Markov property that the above sum is

= Σ_{n=1}^∞ Ex(E_{B(tn)} Z_n ; A ∩ {S = tn}) = Ex(E_{B(S)} Y_S ; A ∩ {S < ∞})
To prove the result in general, we let Sn = ([2^n S] + 1)/2^n be the stopping time defined in Exercise 3.1. To be able to let n → ∞, we restrict our attention to Y's of the form

(∗)  Y_s(ω) = f_0(s) Π_{m=1}^n f_m(ω(t_m))

where 0 < t_1 < . . . < t_n and f_0, . . . , f_n are bounded and continuous. If f is bounded and continuous then the dominated convergence theorem implies that

x → ∫ dy p_t(x, y) f(y)

is continuous. From this and induction, it follows that

ϕ(x, s) = Ex Y_s = f_0(s) ∫ dy_1 p_{t_1}(x, y_1) f_1(y_1) · · · ∫ dy_n p_{t_n − t_{n−1}}(y_{n−1}, y_n) f_n(y_n)

is bounded and continuous.
Having assembled the necessary ingredients, we can now complete the
proof. Let A ∈ FS . Since S ≤ Sn , (3.5) implies A ∈ F (Sn ). Applying the special case of (3.7) proved above to Sn and observing that {Sn < ∞} = {S < ∞}
gives
Ex (YSn ◦ θSn ; A ∩ {S < ∞}) = Ex (ϕ(B (Sn ), Sn ); A ∩ {S < ∞})
Now, as n → ∞, Sn ↓ S , B (Sn ) → B (S ), ϕ(B (Sn ), Sn ) → ϕ(B (S ), S ) and
YSn ◦ θSn → YS ◦ θS
so the bounded convergence theorem implies that (3.7) holds when Y has the
form given in (∗).
To complete the proof now, we will apply the monotone class theorem. As in the proof of (2.1), we let H be the collection of Y for which

Ex(Y_S ∘ θ_S ; A) = Ex(E_{B(S)} Y_S ; A)   for all A ∈ FS

and it is easy to see that (ii) and (iii) hold. This time, however, we take A to be the sets of the form A = G_0 × {ω : ω(s_j) ∈ G_j, 1 ≤ j ≤ k}, where the G_j are open sets. To verify (i), we note that if K_j = G_j^c and f_j^n(x) = 1 ∧ nρ(x, K_j), where ρ(x, K) = inf{|x − y| : y ∈ K}, then the f_j^n are continuous functions with f_j^n ↑ 1_{G_j} as n ↑ ∞. The facts that

Y_s^n(ω) = f_0^n(s) Π_{j=1}^k f_j^n(ω(s_j)) ∈ H

and (iii) holds for H imply that 1_A ∈ H. This verifies (i) in the monotone class theorem and completes the proof of (3.7).
The next section is devoted to applications of the strong Markov property.

7.4. Maxima and Zeros
In this section, we will use the strong Markov property to derive properties
of the zero set {t : Bt = 0}, the hitting times Ta = inf {t : Bt = a} and
max0≤s≤t Bs . This is the tip of an iceberg that is bigger than the one the
Titanic ran into.
Example 4.1. Zeros of Brownian motion. Let Rt = inf {u > t : Bu = 0}
and let T0 = inf {u > 0 : Bu = 0}. Now (2.10) implies Px (Rt < ∞) = 1, so
B (Rt ) = 0 and the strong Markov property and (2.7) imply
Px(T_0 ∘ θ_{Rt} > 0 | F_{Rt}) = P0(T_0 > 0) = 0
Taking expected value of the last equation, we see that
Px (T0 ◦ θRt > 0 for some rational t) = 0
From this, it follows that if a point u ∈ Z (ω ) ≡ {t : Bt (ω ) = 0} is isolated
on the left (i.e., there is a rational t < u so that (t, u) ∩ Z (ω ) = ∅), then it is,
with probability one, a decreasing limit of points in Z (ω ). This shows that the
closed set Z (ω ) has no isolated points and hence must be uncountable. For the
last step, see Hewitt and Stromberg (1965), page 72.
If we let |Z(ω)| denote the Lebesgue measure of Z(ω) then Fubini's theorem implies

Ex |Z(ω) ∩ [0, T]| = ∫_0^T Px(Bt = 0) dt = 0

So Z(ω) is a set of measure zero.
The last four observations show that Z is like the Cantor set that is obtained by removing (1/3, 2/3) from [0, 1] and then repeatedly removing the
middle third from the intervals that remain. The Cantor set is bigger however.
Its Hausdorﬀ dimension is log 2/ log 3, while Z has dimension 1/2.
Example 4.2. Hitting times have independent increments.
(4.1) Theorem. Under P0, {Ta, a ≥ 0} has stationary independent increments.

Proof. The first step is to notice that if 0 < a < b then Tb ∘ θ_{Ta} = Tb − Ta, so if f is bounded and measurable, the strong Markov property (3.7) and translation invariance imply

E0(f(Tb − Ta) | F_{Ta}) = E0(f(Tb) ∘ θ_{Ta} | F_{Ta}) = Ea f(Tb) = E0 f(T_{b−a})
To show that the increments are independent, let a_0 < a_1 < . . . < a_n, and let f_i, 1 ≤ i ≤ n, be bounded and measurable. Conditioning on F_{T_{a_{n−1}}} and using the preceding calculation, we have

E0 Π_{i=1}^n f_i(T_{a_i} − T_{a_{i−1}})
  = E0 [ Π_{i=1}^{n−1} f_i(T_{a_i} − T_{a_{i−1}}) · E0(f_n(T_{a_n} − T_{a_{n−1}}) | F_{T_{a_{n−1}}}) ]
  = E0 [ Π_{i=1}^{n−1} f_i(T_{a_i} − T_{a_{i−1}}) ] · E0 f_n(T_{a_n − a_{n−1}})

By induction, it follows that

E0 Π_{i=1}^n f_i(T_{a_i} − T_{a_{i−1}}) = Π_{i=1}^n E0 f_i(T_{a_i − a_{i−1}})

which implies the desired conclusion.
The scaling relation (1.2) implies

(4.2)  Ta =d a^2 T1

Combining (4.1) and (4.2), and using (7.15) from Chapter 2, we see that Ta has
a stable law with index α = 1/2. Since Ta ≥ 0, the skewness parameter κ = 1.
For a derivation that does not rely on the fact that (7.13) gives all the stable
laws, combine Example 6.6 below with Exercise 7.8 in Chapter 2.
Without knowing the theory mentioned in the previous paragraph, it is
easy to determine the Laplace transform ϕa (λ) = E0 exp(−λTa ). To do this,
we start by observing that (4.1) implies
ϕx(λ) ϕy(λ) = ϕ_{x+y}(λ)
It follows easily from this that
(4.3)  ϕa(λ) = exp(−a c(λ))

Proof. Let c(λ) = − log ϕ1(λ), so (4.3) holds when a = 1. Using the previous
identity with x = y = 2−m and induction gives the result for a = 2−m , m ≥ 1.
Then, letting x = k 2−m and y = 2−m we get the result for a = (k + 1)2−m with
k ≥ 1. Finally, to extend to a ∈ [0, ∞), note that a → ϕa (λ) is decreasing.
To identify c(λ), we observe that (4.2) implies

E0 exp(−Ta) = E0 exp(−a^2 T1)

so a c(1) = c(a^2), i.e., c(λ) = c√λ with c = c(1). In the next section (see Exercise 5.2), we will show

(4.4)  E0(exp(−λTa)) = exp(−a√(2λ))

so c = √2.
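The Laplace transform (4.4) can also be checked against the density of Ta given in (4.8) below by direct numerical integration. This sketch (a verification script, not part of the text; a = λ = 1 is an arbitrary choice) computes ∫_0^∞ e^{−λs} (2πs³)^{−1/2} a e^{−a²/2s} ds and compares it with exp(−a√(2λ)).

```python
import math
import numpy as np

a, lam = 1.0, 1.0

# density of T_a from (4.8): (2 pi s^3)^(-1/2) * a * exp(-a^2 / (2s))
s = np.linspace(1e-8, 60.0, 1_200_000)
ds = s[1] - s[0]
density = a * np.exp(-a**2 / (2 * s)) / np.sqrt(2 * math.pi * s**3)

# Riemann sum for the Laplace transform E0 exp(-lam * T_a)
transform = float(np.sum(np.exp(-lam * s) * density) * ds)
exact = math.exp(-a * math.sqrt(2 * lam))
print(round(transform, 4), round(exact, 4))
```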
Our next goal is to compute the distribution of the hitting times Ta . This
application of the strong Markov property shows why we want to allow the
function Y that we apply to the shifted path to depend on the stopping time
S.
Example 4.3. Reﬂection principle. Let a > 0 and let Ta = inf {t : Bt = a}.
Then
(4.5)  P0(Ta < t) = 2 P0(Bt ≥ a)

Intuitive proof. We observe that if Bs hits a at some time s < t, then the
strong Markov property implies that Bt − B (Ta ) is independent of what happened before time Ta . The symmetry of the normal distribution and Pa (Bu =
a) = 0 for u > 0 then imply
(4.6)  P0(Ta < t, Bt > a) = (1/2) P0(Ta < t)

Rearranging the last equation and using {Bt > a} ⊂ {Ta < t} gives

P0(Ta < t) = 2 P0(Ta < t, Bt > a) = 2 P0(Bt > a)
Proof To make the intuitive proof rigorous, we only have to prove (4.6). To
extract this from the strong Markov property (3.7), we let
Y_s(ω) = 1 if s < t and ω(t − s) > a;  Y_s(ω) = 0 otherwise

We do this so that if we let S = inf{s < t : Bs = a} with inf ∅ = ∞, then

Y_S(θ_S ω) = 1 if S < t and Bt > a, and = 0 otherwise

and the strong Markov property implies

E0(Y_S ∘ θ_S | F_S) = ϕ(B_S, S)   on {S < ∞} = {Ta < t}

where ϕ(x, s) = Ex Y_s. B_S = a on {S < ∞} and ϕ(a, s) = 1/2 if s < t, so taking expected values gives

P0(Ta < t, Bt > a) = E0(Y_S ∘ θ_S ; S < ∞)
  = E0(E0(Y_S ∘ θ_S | F_S) ; S < ∞) = E0(1/2 ; Ta < t)
which proves (4.6).
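The identity (4.5) is easy to test by simulation. The following sketch (illustrative code with an assumed grid size, not part of the text) approximates a Brownian path on [0, 1] by scaled partial sums of Gaussian increments and compares the empirical frequency of {max_{s≤1} Bs ≥ a} with 2 P0(B1 ≥ a) from the normal tail.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
a = 1.0
n_steps, n_paths = 1500, 15000

# exact Gaussian increments on a grid of [0, 1]
steps = rng.standard_normal((n_paths, n_steps)) * math.sqrt(1.0 / n_steps)
path_max = np.max(np.cumsum(steps, axis=1), axis=1)

empirical = float(np.mean(path_max >= a))   # estimates P0(T_a < 1)
exact = math.erfc(a / math.sqrt(2))         # 2 * P0(B_1 >= a) = 0.317...
print(round(empirical, 3), round(exact, 3))
```

The discrete maximum slightly undershoots the continuous one (the path can cross the level between grid points), so the empirical value sits a little below the limit.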
Exercise 4.1. (i) Generalize the proof of (4.6) to conclude that if u < v ≤ a then

(4.7)  P0(Ta < t, u < Bt < v) = P0(2a − v < Bt < 2a − u)

(ii) Let Mt = max_{0≤s≤t} Bs. Use (i) to derive the joint density

P0(Mt = a, Bt = x) = (2(2a − x)/√(2πt^3)) e^{−(2a−x)^2/2t}

Using (4.5), we can compute the probability density of Ta. We begin by noting that

P0(Ta ≤ t) = 2 P0(Bt ≥ a) = 2 ∫_a^∞ (2πt)^{−1/2} exp(−x^2/2t) dx

then change variables x = t^{1/2} a / s^{1/2} to get

P0(Ta ≤ t) = 2 ∫_t^0 (2πt)^{−1/2} exp(−a^2/2s) (−t^{1/2} a / 2s^{3/2}) ds

(4.8)      = ∫_0^t (2πs^3)^{−1/2} a exp(−a^2/2s) ds

Using the last formula, we can compute:
Example 4.4. The distribution of L = sup{t ≤ 1 : Bt = 0}. By (2.3),

P0(L ≤ s) = ∫_{−∞}^∞ p_s(0, x) Px(T_0 > 1 − s) dx
  = 2 ∫_0^∞ (2πs)^{−1/2} exp(−x^2/2s) ∫_{1−s}^∞ (2πr^3)^{−1/2} x exp(−x^2/2r) dr dx
  = (1/π) ∫_{1−s}^∞ (sr^3)^{−1/2} ∫_0^∞ x exp(−x^2 (r + s)/2rs) dx dr
  = (1/π) ∫_{1−s}^∞ (sr^3)^{−1/2} rs/(r + s) dr

Our next step is to let t = s/(r + s) to convert the integral over r ∈ [1 − s, ∞) into one over t ∈ [0, s]. dt = −s/(r + s)^2 dr, so to make the calculations easier we first rewrite the integral as

  = (1/π) ∫_{1−s}^∞ ((r + s)^2/rs)^{1/2} · s/(r + s)^2 dr

and then change variables to get

(4.9)  P0(L ≤ s) = (1/π) ∫_0^s (t(1 − t))^{−1/2} dt = (2/π) arcsin(√s)

The arcsin may remind the reader of the limit theorem for L_{2n} = sup{m ≤ 2n :
Sm = 0} given in (3.6) of Chapter 3. We will see in Section 7.6 that our new
result is a consequence of the old one.
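The arcsine law (4.9) is already visible in moderate-length random walks. This sketch (illustrative parameters, not part of the text) records the last sign change of a simple random walk and compares the empirical CDF of L_n/n at 1/2 with (2/π) arcsin(√(1/2)) = 1/2.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, n_walks = 500, 10000

steps = rng.choice([-1, 1], size=(n_walks, n))
s = np.cumsum(steps, axis=1)

# largest m with S_{m-1} * S_m <= 0 (S_0 = 0, so m = 1 always qualifies)
prev = np.concatenate([np.zeros((n_walks, 1), dtype=int), s[:, :-1]], axis=1)
change = (prev * s) <= 0
last_zero = n - 1 - np.argmax(change[:, ::-1], axis=1)  # index of last True

frac = (last_zero + 1) / n
empirical = float(np.mean(frac <= 0.5))
exact = (2 / math.pi) * math.asin(math.sqrt(0.5))       # = 0.5
print(round(empirical, 3), round(exact, 3))
```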
Exercise 4.2. Show that R = inf {t > 1 : Bt = 0} has probability density
P0(R = 1 + t) = 1/(π t^{1/2} (1 + t))
Example 4.5. L´vy’s modulus of continuity
e
osc(δ ) = sup{Bs − Bt  : s, t ∈ [0, 1], t − s < δ }
(4.10) Theorem. With probability 1,
lim sup osc(δ )/(δ log(1/δ ))1/2 ≤ 6
δ →0 393 394 Chapter 7 Brownian Motion
Remark. The constant 6 is not the best possible because the end of the proof
is sloppy. L´vy (1937) showed
e
lim sup osc(δ )/(δ log(1/δ ))1/2 = √
2 δ →0 See McKean (1969), p. 1416, or Itˆ and McKean (1965), p. 3638, where a
o
sharper result due to Chung, Erd¨s and Sirao (1959) is proved. In contrast, if
o
we look at the behavior at a single point, (9.5) below shows
lim sup Bt / 2t log log(1/t) = 1 a.s.
t→0 Proof Let Im,n = [m2−n , (m + 1)2−n ], and ∆m,n = sup{Bt − B (m2−n ) : t ∈
I_{m,n}}. From (4.5) and the scaling relation, it follows that

P(Δ_{m,n} ≥ a2^{−n/2}) ≤ 4 P(B(2^{−n}) ≥ a2^{−n/2}) = 4 P(B(1) ≥ a) ≤ 4 exp(−a^2/2)

by (1.3) in Chapter 1 if a ≥ 1. If ε > 0, b = 2(1 + ε)(log 2), and a_n = (bn)^{1/2}, then the last result implies

P(Δ_{m,n} ≥ a_n 2^{−n/2} for some m ≤ 2^n) ≤ 2^n · 4 exp(−bn/2) = 4 · 2^{−εn}

so the Borel–Cantelli lemma implies that if n ≥ N(ω), Δ_{m,n} ≤ (bn)^{1/2} 2^{−n/2}. Now if s ∈ I_{m,n}, s < t and t − s < 2^{−n}, then t ∈ I_{m,n} or I_{m+1,n}. I claim that in either case the triangle inequality implies

|Bt − Bs| ≤ 3(bn)^{1/2} 2^{−n/2}

To see this, note that the worst case is t ∈ I_{m+1,n}, but even in this case

|Bt − Bs| ≤ |Bt − B((m + 1)2^{−n})| + |B((m + 1)2^{−n}) − B(m2^{−n})| + |B(m2^{−n}) − Bs|

It follows from the last estimate that for 2^{−(n+1)} ≤ δ < 2^{−n}

osc(δ) ≤ 3(bn)^{1/2} 2^{−n/2} ≤ 3(b log_2(1/δ))^{1/2} (2δ)^{1/2} = 6((1 + ε)δ log(1/δ))^{1/2}

Recall b = 2(1 + ε) log 2 and observe exp((log 2)(log_2 1/δ)) = 1/δ.

7.5. Martingales
At the end of Section 4.7 we used martingales to study the hitting times of
random walks. The same methods can be used on Brownian motion once we
prove:
(5.1) Theorem. Let Xt be a right continuous martingale adapted to a right
continuous ﬁltration. If T is a bounded stopping time, then EXT = EX0 .
Proof. Let n be an integer so that P(T ≤ n − 1) = 1. As in the proof of the strong Markov property, let Tm = ([2^m T] + 1)/2^m. Y_k^m = X(k2^{−m}) is a martingale with respect to F_k^m = F(k2^{−m}), and Sm = 2^m Tm is a stopping time for (Y_k^m, F_k^m), so by Exercise 4.3 in Chapter 4

X(Tm) = Y_{Sm}^m = E(Y_{n2^m}^m | F_{Sm}^m) = E(Xn | F(Tm))

As m ↑ ∞, X(Tm) → X(T) by right continuity and F(Tm) ↓ F(T) by (3.6), so it follows from (6.3) in Chapter 4 that

X(T) = E(Xn | F(T))

Taking expected values now gives EX(T) = EXn = EX0, since Xn is a martingale.
(5.2) Theorem. Bt is a martingale w.r.t. the σ ﬁelds Ft deﬁned in Section 3.
Note: We will use these σ ﬁelds in (5.4), (5.6), and (5.8) but will not mention
them explicitly in the statements.
Proof. The Markov property implies that

Ex(Bt | Fs) = E_{Bs}(B_{t−s}) = Bs

since symmetry implies Ey Bu = y for all u ≥ 0.
From (5.2), it follows immediately that we have:
(5.3) Theorem. If a < x < b then Px (Ta < Tb ) = (b − x)/(b − a).
Proof Let T = Ta ∧ Tb . (2.10) implies that T < ∞ a.s. Using (5.1) and
(5.2), it follows that x = Ex B (T ∧ t). Letting t → ∞ and using the bounded
convergence theorem, it follows that
x = a Px(Ta < Tb) + b(1 − Px(Ta < Tb))
Solving for Px (Ta < Tb ) now gives the desired result.
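(5.3) has an exact discrete analogue for simple random walk (optional stopping applies verbatim there), which makes it easy to check by simulation. In this sketch (not part of the text) the integer barriers are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, x = -3, 2, 0        # barriers a < x < b
n_runs = 20000

hits_a = 0
for _ in range(n_runs):
    s = x
    while a < s < b:      # walk until the walk exits (a, b)
        s += rng.choice([-1, 1])
    hits_a += (s == a)

empirical = hits_a / n_runs
exact = (b - x) / (b - a)  # = 2/5 by the analogue of (5.3)
print(round(empirical, 3), exact)
```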
Example 5.1. Optimal doubling in Backgammon (Keeler and Spencer
(1975)). In our idealization, backgammon is a Brownian motion starting at 1/2
run until it hits 1 or 0, and Bt is the probability you will win given the events
up to time t. Initially, the “doubling cube” sits in the middle of the board and
either player can “double,” that is, tell the other player to play on for twice
the stakes or give up and pay the current wager. If a player accepts the double
(i.e., decides to play on), she gets possession of the doubling cube and is the
only one who can oﬀer the next double.
A doubling strategy is given by two numbers b < 1/2 < a, i.e., oﬀer a
double when Bt ≥ a and give up if the other player doubles and Bt < b. It is
not hard to see that for the optimal strategy b∗ = 1 − a∗ and that when Bt = b∗
accepting and giving up must have the same payoﬀ. If you accept when your
probability of winning is b∗ , then you lose 2 dollars when your probability hits
0 but you win 2 dollars when your probability of winning hits a∗ , since at that
moment you can double and the other player gets the same payoﬀ if they give
up or play on. If giving up or playing on at b∗ is to have the same payoﬀ, we
must have

−1 = (b∗/a∗) · 2 + ((a∗ − b∗)/a∗) · (−2)

Writing b∗ = c and a∗ = 1 − c and solving, we have

−(1 − c) = 2c − 2(1 − 2c),  i.e.,  1 = 5c

so b∗ = 1/5 and a∗ = 4/5.
(5.4) Theorem. Bt^2 − t is a martingale.

Proof.

Ex(Bt^2 | Fs) = Ex(Bs^2 + 2Bs(Bt − Bs) + (Bt − Bs)^2 | Fs)
  = Bs^2 + 2Bs Ex(Bt − Bs | Fs) + Ex((Bt − Bs)^2 | Fs)
  = Bs^2 + 0 + (t − s)

since Bt − Bs is independent of Fs and has mean 0 and variance t − s.
(5.5) Theorem. Let T = inf{t : Bt ∉ (a, b)}, where a < 0 < b. Then

E0 T = −ab

Proof. (5.1) and (5.4) imply E0(B^2(T ∧ t)) = E0(T ∧ t). Letting t → ∞, the monotone convergence theorem gives E0(T ∧ t) ↑ E0 T. Using the bounded convergence theorem and (5.3), we have

E0 B^2(T ∧ t) → E0 B_T^2 = a^2 · b/(b − a) + b^2 · (−a)/(b − a) = ab · (a − b)/(b − a) = −ab
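The identity E0 T = −ab also holds exactly for the simple random walk exit time from integer barriers (apply optional stopping to S_n^2 − n), so a quick simulation can confirm it. This sketch (illustrative, with arbitrary barrier values) estimates the mean exit time.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = -3, 2
n_runs = 20000

total_steps = 0
for _ in range(n_runs):
    s, steps = 0, 0
    while a < s < b:          # run until the walk leaves (a, b)
        s += rng.choice([-1, 1])
        steps += 1
    total_steps += steps

mean_exit = total_steps / n_runs
print(round(mean_exit, 2))    # close to -a*b = 6
```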
(5.6) Theorem. exp(θBt − θ^2 t/2) is a martingale.

Proof.

Ex(exp(θBt) | Fs) = exp(θBs) E(exp(θ(Bt − Bs)) | Fs) = exp(θBs) exp(θ^2 (t − s)/2)

since Bt − Bs is independent of Fs and has a normal distribution with mean 0 and variance t − s.

(5.7) Theorem. Let T = inf{t : Bt ∉ (−a, a)}. Then

E0 exp(−λT) = 1/cosh(a√(2λ))

Proof. (5.1) and (5.6) imply that 1 = E0 exp(θB(T ∧ t) − θ^2 (T ∧ t)/2). Letting t → ∞ and using the bounded convergence theorem gives

1 = E0 exp(θB_T − θ^2 T/2)

Symmetry implies that P(B_T = a) = P(B_T = −a) = 1/2 and that B(T) is independent of T, so

1 = cosh(θa) E0 exp(−θ^2 T/2)

Setting θ = √(2λ) now gives the desired result. If you don't like the "symmetry implies that," then do:
Exercise 5.1. Derive (5.7) by showing that exp(−θ2 t/2) cosh(θBt ) is a martingale.
(5.8) Theorem. Bt^3 − 3tBt, Bt^4 − 6tBt^2 + 3t^2, . . . are martingales.

Proof. The result in (5.6) can be written as

E(exp(θBt − θ^2 t/2); A) = E(exp(θBs − θ^2 s/2); A)   for A ∈ Fs

Let f(x, t, θ) = exp(θx − θ^2 t/2), let f_k(x, t, θ) be the kth derivative w.r.t. θ, and let h_k(x, t) = f_k(x, t, 0). Differentiating the last identity k times (referring to Section A.9 for the justification) and setting θ = 0, we see that h_k(Bt, t) is a martingale. We have seen h_1(Bt, t) and h_2(Bt, t), but the other ones are new:

k    f_k(x, t, θ)                                     Martingale h_k(Bt, t)
1    (x − θt) f(x, t, θ)                              Bt
2    {(x − θt)^2 − t} f(x, t, θ)                      Bt^2 − t
3    {(x − θt)^3 − 3t(x − θt)} f(x, t, θ)             Bt^3 − 3tBt
4    {(x − θt)^4 − 6t(x − θt)^2 + 3t^2} f(x, t, θ)    Bt^4 − 6tBt^2 + 3t^2

(5.9) Theorem. Let T = inf{t : Bt ∉ (−a, a)}. Then
E T^2 = 5a^4/3

Proof. (5.1) and (5.8) imply E(B(T ∧ t)^4 − 6(T ∧ t) B(T ∧ t)^2) = −3 E(T ∧ t)^2. From (5.5), we know that ET = a^2 < ∞. Letting t → ∞, using the dominated convergence theorem on the left-hand side and the monotone convergence theorem on the right, gives

a^4 − 6a^2 ET = −3 E(T^2)

Plugging in ET = a^2 gives the desired result.
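The θ-derivatives in the proof of (5.8) can be generated mechanically. The sketch below (assuming the sympy library is available; not part of the text) differentiates f(x, t, θ) = exp(θx − θ²t/2) and recovers the polynomials h_k in the table, including the coefficients 6t and 3t² used in the proof of (5.9).

```python
import sympy as sp

x, t, theta = sp.symbols('x t theta')
f = sp.exp(theta * x - theta**2 * t / 2)

# h_k(x, t) = k-th derivative of f w.r.t. theta, evaluated at theta = 0
h = [sp.expand(sp.diff(f, theta, k).subs(theta, 0)) for k in range(5)]

print(h[3])  # matches the table entry Bt^3 - 3t*Bt
print(h[4])  # matches the table entry Bt^4 - 6t*Bt^2 + 3t^2
```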
Exercises
5.2. The point of this exercise is to get information about the amount of time it takes Brownian motion with drift −b, Xt ≡ Bt − bt, to hit level a. Let τ = inf{t : Bt = a + bt}, where a > 0. (i) Use the martingale exp(θBt − θ^2 t/2) with θ = b + (b^2 + 2λ)^{1/2} to show

E0 exp(−λτ) = exp(−a{b + (b^2 + 2λ)^{1/2}})

Setting b = 0 gives formula (4.4) as promised. (ii) Letting λ → 0 gives

P0(τ < ∞) = exp(−2ab)
5.3. Let σ = inf{t : Bt ∉ (a, b)} and let λ > 0. (i) Use the strong Markov property to show

Ex exp(−λTa) = Ex(e^{−λσ}; Ta < Tb) + Ex(e^{−λσ}; Tb < Ta) Eb exp(−λTa)

(ii) Interchange the roles of a and b to get a second equation, use (4.4), and solve to get

Ex(e^{−λσ}; Ta < Tb) = sinh(√(2λ)(b − x))/sinh(√(2λ)(b − a))
Ex(e^{−λσ}; Tb < Ta) = sinh(√(2λ)(x − a))/sinh(√(2λ)(b − a))
5.4. If U = inf{t : Bt ∉ (a, b)}, where a < 0 < b and a ≠ −b, then U and B_U^2 are not independent, so we cannot calculate EU^2 as we did in the proof of (5.9). Use the Cauchy–Schwarz inequality to estimate E(U B_U^2) and conclude EU^2 ≤ C E(B_U^4), where C is independent of a and b.

5.5. Let u(t, x) be a function that satisfies

(∗)  ∂u/∂t + (1/2) ∂^2 u/∂x^2 = 0   with   |∂^2 u/∂x^2 (t, x)| ≤ C_T exp(x^2/(t + ε))   for t ≤ T

Show that u(t, Bt) is a martingale by checking

∂/∂t p_t(x, y) = (1/2) ∂^2/∂y^2 p_t(x, y)

interchanging ∂/∂t and ∫, and then integrating by parts twice to show

∂/∂t Ex u(t, Bt) = ∫ ∂/∂t (p_t(x, y) u(t, y)) dy = 0

Examples of functions that satisfy (∗) are exp(θx − θ^2 t/2), x, x^2 − t, . . .
5.6. Find a martingale of the form Bt^6 − atBt^4 + bt^2 Bt^2 − ct^3 and use it to compute the third moment of T = inf{t : Bt ∉ (−ε, ε)}. Note: You can differentiate exp(θx − θ^2 t/2) six times, but it is easier to get a, b, and c from the last exercise.
5.7. Show that (1 + t)^{−1/2} exp(Bt^2/2(1 + t)) is a martingale and use this to conclude that lim sup_{t→∞} Bt/(2t log t)^{1/2} ≤ 1 a.s.

7.6. Donsker's Theorem
Let X1 , X2 , . . . be i.i.d. with EX = 0 and EX 2 = 1, and let Sn = X1 + · · · + Xn .
In this section, we will show that as n → ∞, S (nt)/n1/2 , 0 ≤ t ≤ 1 converges
in distribution to Bt a Brownian motion starting from B0 = 0. We will say
precisely what the last sentence means below. The key to its proof is:
(6.1) Skorokhod’s representation theorem. If EX = 0 and EX 2 < ∞
then there is a stopping time T for Brownian motion so that BT =d X and
ET = EX 2 .
Remark. The Brownian motion in the statement and all the Brownian motions
in this section have B0 = 0.
Proof. Suppose first that X is supported on {a, b}, where a < 0 < b. Since EX = 0, we must have

P(X = a) = b/(b − a)    P(X = b) = −a/(b − a)

If we let T = T_{a,b} = inf{t : Bt ∉ (a, b)} then (5.3) implies B_T =d X and (5.5) tells us that

ET = −ab = E B_T^2
To treat the general case, we will write F(x) = P(X ≤ x) as a mixture of two-point distributions with mean 0. Let

c = ∫_{−∞}^0 (−u) dF(u) = ∫_0^∞ v dF(v)

If ϕ is bounded and ϕ(0) = 0, then

c ∫ ϕ(x) dF(x) = ∫_0^∞ v dF(v) ∫_{−∞}^0 ϕ(u) dF(u) + ∫_{−∞}^0 (−u) dF(u) ∫_0^∞ ϕ(v) dF(v)
  = ∫_0^∞ dF(v) ∫_{−∞}^0 dF(u) (vϕ(u) − uϕ(v))

So we have

∫ ϕ(x) dF(x) = c^{−1} ∫_0^∞ dF(v) ∫_{−∞}^0 dF(u) (v − u) { (v/(v − u)) ϕ(u) + (−u/(v − u)) ϕ(v) }

The last equation gives the desired mixture. If we let (U, V) ∈ R^2 have

P{(U, V) = (0, 0)} = F({0})

(6.2)  P((U, V) ∈ A) = c^{−1} ∫∫_{(u,v)∈A} dF(u) dF(v) (v − u)

for A ⊂ (−∞, 0) × (0, ∞) and define probability measures by μ_{0,0}({0}) = 1 and

μ_{u,v}({u}) = v/(v − u)    μ_{u,v}({v}) = −u/(v − u)    for u < 0 < v

then

∫ ϕ(x) dF(x) = E ∫ ϕ(x) μ_{U,V}(dx)

We proved the last formula when ϕ(0) = 0, but it is easy to see that it is true in general. Letting ϕ ≡ 1 in the last equation shows that the measure defined in (6.2) has total mass 1.
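Because everything in the mixture is explicit, (6.2) can be checked exactly for a discrete distribution. The sketch below (a verification script, not part of the text; the five-point distribution is an arbitrary mean-zero choice) confirms that the total mass is 1, that mixing the two-point measures reproduces the atoms of F, and that E(−UV) = EX², in agreement with the computation that follows.

```python
from fractions import Fraction

# an arbitrary mean-zero test distribution (atoms and weights are made up)
p = {-2: Fraction(2, 10), -1: Fraction(2, 10), 0: Fraction(2, 10),
     1: Fraction(3, 10), 3: Fraction(1, 10)}
assert sum(x * q for x, q in p.items()) == 0

c = sum(-u * q for u, q in p.items() if u < 0)

# weight that (6.2) puts on the two-point measure mu_{u,v}
w = {(u, v): p[u] * p[v] * (v - u) / c for u in p for v in p if u < 0 < v}

total_mass = p[0] + sum(w.values())

# mixing the mu_{u,v} recovers the atoms: sum_v w(u,v) * v/(v-u) = p(u)
recovered = {u: sum(wt * v / (v - u) for (uu, v), wt in w.items() if uu == u)
             for u in p if u < 0}

e_neg_uv = sum(wt * (-u * v) for (u, v), wt in w.items())  # E(-UV)
ex2 = sum(x * x * q for x, q in p.items())                 # EX^2

print(total_mass, e_neg_uv, ex2)
```

Since the script works in exact rational arithmetic, the three identities hold with equality, not just up to rounding.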
From the calculations above it follows that if we have (U, V) with distribution given in (6.2) and an independent Brownian motion defined on the same space, then B(T_{U,V}) =d X. Sticklers for detail will notice that T_{U,V} is not a stopping time for Bt since (U, V) is independent of the Brownian motion. This is not a serious problem since if we condition on U = u and V = v, then T_{u,v} is a stopping time, and this is good enough for all the calculations below. For instance, to compute E(T_{U,V}) we observe

E(T_{U,V}) = E{E(T_{U,V} | (U, V))} = E(−UV)

by (5.5). (6.2) implies
E(−UV) = c^{−1} ∫_0^∞ dF(v) v ∫_{−∞}^0 dF(u) (−u)(v − u)
  = ∫_0^∞ dF(v) c^{−1} v { v ∫_{−∞}^0 dF(u)(−u) + ∫_{−∞}^0 dF(u) u^2 }

since

c = ∫_0^∞ v dF(v) = ∫_{−∞}^0 (−u) dF(u)

Using the second expression for c now gives

E(T_{U,V}) = E(−UV) = ∫_{−∞}^0 u^2 dF(u) + ∫_0^∞ v^2 dF(v) = EX^2
Exercise 6.1. Use Exercise 5.4 to conclude that E (TU,V ) ≤ CEX 4 . Remark. One can embed distributions in Brownian motion without adding
random variables to the probability space: See Dubins (1968), Root (1969), or
Sheu (1986).
From (6.1), it is only a small step to:
(6.3) Theorem. Let X1 , X2 , . . . be i.i.d. with a distribution F , which has mean
0 and variance 1, and let Sn = X1 + . . . + Xn . There is a sequence of stopping
times T0 = 0, T1 , T2 , . . . such that Sn =d B (Tn ) and Tn − Tn−1 are independent
and identically distributed.
Proof Let (U1 , V1 ), (U2 , V2 ), . . . be i.i.d. and have distribution given in (6.2)
and let Bt be an independent Brownian motion. Let T0 = 0, and for n ≥ 1, let
Tn = inf {t ≥ Tn−1 : Bt − B (Tn−1 ) ∈ (Un , Vn )}
As a corollary of (6.3), we get:
(6.4) Central limit theorem. Under the hypotheses of (6.3), Sn/√n ⇒ χ, where χ has the standard normal distribution.

Proof. If we let Wn(t) = B(nt)/√n =d Bt by Brownian scaling, then

Sn/√n =d B(Tn)/√n = Wn(Tn/n)

The weak law of large numbers implies that Tn/n → 1 in probability. It should be clear from this that Sn/√n ⇒ B1. To fill in the details, let ε > 0, pick δ so that

P(|Bt − B1| > ε for some t ∈ (1 − δ, 1 + δ)) < ε/2

then pick N large enough so that for n ≥ N, P(|Tn/n − 1| > δ) < ε/2. The last two estimates imply that for n ≥ N

P(|Wn(Tn/n) − Wn(1)| > ε) < ε

Since ε is arbitrary, it follows that Wn(Tn/n) − Wn(1) → 0 in probability.
Applying the converging together lemma, Exercise 2.10 in Chapter 2, with
Xn = Wn (1) and Zn = Wn (Tn /n), the desired result follows.
Our next goal is to prove a strengthening of (6.4) that allows us to obtain
limit theorems for functionals of {Sm : 0 ≤ m ≤ n}, e.g., max_{0≤m≤n} Sm or |{m ≤ n : Sm > 0}|. Let C[0, 1] = {continuous ω : [0, 1] → R}. When equipped with the norm ‖ω‖ = sup{|ω(s)| : s ∈ [0, 1]}, C[0, 1] becomes a
complete separable metric space. To ﬁt C [0, 1] into the framework of Section
2.9, we want our measures deﬁned on B = the σ ﬁeld generated by the open
sets. Fortunately,
(6.5) Lemma. B is the same as C, the σ-field generated by the finite dimensional sets {ω : ω(ti) ∈ Ai}.

Proof. Observe that if ξ is a given continuous function

{ω : ‖ω − ξ‖ ≤ r − 1/n} = ∩_q {ω : |ω(q) − ξ(q)| ≤ r − 1/n}

where the intersection is over all rationals q in [0, 1]. Letting n → ∞ shows {ω : ‖ω − ξ‖ < r} ∈ C and B ⊂ C. To prove the reverse inclusion, observe that if the Ai are open, the finite dimensional set {ω : ω(ti) ∈ Ai} is open, so the π−λ theorem implies B ⊃ C.
A sequence of probability measures µn on C [0, 1] is said to converge
weakly to a limit µ if for all bounded continuous functions ϕ : C [0, 1] → R,
∫ ϕ dµn → ∫ ϕ dµ. Let N be the nonnegative integers and let

S(u) = S_k  if u = k ∈ N,  with S linear on each interval [k, k + 1], k ∈ N

We will prove:
(6.6) Donsker's theorem. Under the hypotheses of (6.3), S(n·)/√n ⇒ B(·), i.e., the associated measures on C[0, 1] converge weakly.

To motivate ourselves for the proof we will begin by extracting several corollaries. The key to each one is a consequence of (9.1) in Chapter 2.

(6.7) Theorem. If ψ : C[0, 1] → R has the property that it is continuous P0 a.s. then

(∗)  ψ(S(n·)/√n) ⇒ ψ(B(·))

Example 6.1. Let ψ(ω) = ω(1). In this case, ψ : C[0, 1] → R is continuous
and (∗) is the central limit theorem.
Example 6.2. Let ψ (ω ) = max{ω (t) : 0 ≤ t ≤ 1}. Again, ψ : C [0, 1] → R is
continuous. This time (∗) says
max_{0≤m≤n} Sm/√n ⇒ M1 ≡ max_{0≤t≤1} Bt

To complete the picture, we observe that by (4.5) the distribution of the right-hand side is

P0(M1 ≥ a) = P0(Ta ≤ 1) = 2 P0(B1 ≥ a)
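Example 6.2 is easy to watch converge. The sketch below (illustrative parameters, not part of the text) simulates simple random walks and compares P(max_{m≤n} Sm ≥ a√n) with the limit 2 P0(B1 ≥ a) from (4.5).

```python
import math
import numpy as np

rng = np.random.default_rng(5)
a, n, n_walks = 1.0, 400, 40000

steps = rng.choice([-1, 1], size=(n_walks, n))
max_s = np.max(np.cumsum(steps, axis=1), axis=1)

empirical = float(np.mean(max_s >= a * math.sqrt(n)))
limit = math.erfc(a / math.sqrt(2))   # 2 * P0(B1 >= a) = 0.317...
print(round(empirical, 3), round(limit, 3))
```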
Exercise 6.2. Suppose Sn is a one-dimensional simple random walk and let

Rn = 1 + max_{m≤n} Sm − min_{m≤n} Sm

be the number of points visited by time n. Show that Rn/√n ⇒ a limit.
Example 6.3. Let ψ(ω) = sup{t ≤ 1 : ω(t) = 0}. This time, ψ is not continuous, for if ω_ε has ω_ε(0) = 0, ω_ε(1/3) = 1, ω_ε(2/3) = ε, ω_ε(1) = 2, and is linear on each interval [j/3, (j + 1)/3], then ψ(ω_0) = 2/3 but ψ(ω_ε) = 0 for ε > 0.
It is easy to see that if ψ(ω) < 1 and ω(t) takes positive and negative values in each interval (ψ(ω) − δ, ψ(ω)), δ > 0, then ψ is continuous at ω. By arguments in Example 4.1, the last set has P0 measure 1. (If the zero at ψ(ω) were isolated on the left, it would not be isolated on the right.) It follows that

sup{m ≤ n : Sm−1 · Sm ≤ 0}/n ⇒ L = sup{t ≤ 1 : Bt = 0}

The distribution of L is given in (4.9). The last result shows that the arcsine law, (3.4) in Chapter 3, proved for simple random walks, holds whenever the mean is 0 and the variance is finite.
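The arcsine limit above can be sketched by simulation (a hypothetical illustration; the constants are arbitrary): P(L ≤ t) = (2/π) arcsin(√t), which equals 1/2 at t = 1/2, so the last sign change of a simple random walk should fall in the first half of the time interval about half the time.

```python
import math
import random

random.seed(1)

def last_zero_fraction(n):
    # fraction of time at which the last sign change S_{m-1} * S_m <= 0 occurs
    s = prev = last = 0
    for m in range(1, n + 1):
        prev, s = s, s + random.choice((-1, 1))
        if prev * s <= 0:
            last = m
    return last / n

n, reps = 500, 3000
emp = sum(last_zero_fraction(n) <= 0.5 for _ in range(reps)) / reps
exact = (2 / math.pi) * math.asin(math.sqrt(0.5))  # = 1/2
```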
Example 6.4. Let ψ(ω) = |{t ∈ [0,1] : ω(t) > a}|, where |·| denotes Lebesgue measure. The point ω ≡ a shows that ψ is not continuous, but it is easy to see that ψ is continuous at paths ω with |{t ∈ [0,1] : ω(t) = a}| = 0. Fubini's theorem implies that

E0 |{t ∈ [0,1] : Bt = a}| = ∫_0^1 P0(Bt = a) dt = 0

so ψ is continuous P0-a.s. With a little work, (∗) implies

|{m ≤ n : Sm > a√n}|/n ⇒ |{t ∈ [0,1] : Bt > a}|

Proof  Application of (∗) gives that for any a,

|{t ∈ [0,1] : S(nt) > a√n}| ⇒ |{t ∈ [0,1] : Bt > a}|
To convert this into a result about |{m ≤ n : Sm > a√n}|, we note that on {max_{m≤n} |Xm| ≤ ε√n}, which by Chebyshev's inequality has a probability → 1, we have

|{t ∈ [0,1] : S(nt) > (a + ε)√n}| ≤ (1/n)|{m ≤ n : Sm > a√n}| ≤ |{t ∈ [0,1] : S(nt) > (a − ε)√n}|

Combining this with the first conclusion of the proof and using the fact that b → |{t ∈ [0,1] : Bt > b}| is continuous at b = a with probability one, one arrives easily at the desired conclusion.
To compute the distribution of |{t ∈ [0,1] : Bt > 0}|, observe that we proved in (3.8) of Chapter 3 that if Sn =d −Sn and P(Sm = 0) = 0 for all m ≥ 1, e.g., if the Xi have a symmetric continuous distribution, then the left-hand side converges to the arcsine law, so the right-hand side has that distribution and is the limit for any random walk with mean 0 and finite variance. The last argument uses an idea called the "invariance principle" that originated with Erdős and Kac (1946, 1947): the asymptotic behavior of functionals of Sn should be the same as long as the central limit theorem applies. Our final application is from the original paper of Donsker (1951). Erdős and Kac (1946) give the limit distribution for the case k = 2.
Example 6.5. Let ψ(ω) = ∫_[0,1] ω(t)^k dt, where k > 0 is an integer. ψ is continuous, so applying (6.7) gives

∫_0^1 (S(nt)/√n)^k dt ⇒ ∫_0^1 Bt^k dt

To convert this into a result about the original sequence, we begin by observing that if x < y with y − x ≤ ε and |x|, |y| ≤ M, then

|x^k − y^k| ≤ ∫_x^y k|z|^{k−1} dz ≤ εkM^{k−1}

From this, it follows that on

Gn(M) = { max_{m≤n} |Xm| ≤ M^{−(k+2)}√n, max_{m≤n} |Sm| ≤ M√n }

we have

| ∫_0^1 (S(nt)/√n)^k dt − n^{−1−k/2} Σ_{m=1}^n Sm^k | ≤ 1/((k+1)M)

For fixed M, it follows from Chebyshev's inequality, Example 6.2, and (2.4) in Chapter 2 that

lim inf_{n→∞} P(Gn(M)) ≥ P( max_{0≤t≤1} |Bt| < M )

The right-hand side is close to 1 if M is large, so

∫_0^1 (S(nt)/√n)^k dt − n^{−1−k/2} Σ_{m=1}^n Sm^k → 0

in probability, and it follows from the converging together lemma (Exercise 2.10 in Chapter 2) that

n^{−1−k/2} Σ_{m=1}^n Sm^k ⇒ ∫_0^1 Bt^k dt

It is remarkable that the last result holds under the assumptions EXi = 0 and EXi² = 1 alone, i.e., we do not need to assume that E|Xi|^k < ∞.
Exercise 6.3. When k = 1, the last result says that if X1, X2, ... are i.i.d. with EXi = 0 and EXi² = 1, then

n^{−3/2} Σ_{m=1}^n (n + 1 − m) Xm ⇒ ∫_0^1 Bt dt

(i) Show that the right-hand side has a normal distribution with mean 0 and variance 1/3. (ii) Deduce this result from the Lindeberg-Feller theorem.
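Part (i) can be previewed by direct computation: since the Xm are uncorrelated with variance 1, the variance of the left-hand side is n^{−3} Σ_{m=1}^n (n+1−m)² = n^{−3} Σ_{j=1}^n j², which tends to 1/3, matching Var(∫_0^1 Bt dt) = 1/3. A sketch (the choice n = 10000 is arbitrary):

```python
# Variance of n^{-3/2} * sum_{m=1}^{n} (n+1-m) X_m for i.i.d. X_m with
# EX = 0, EX^2 = 1: the terms are uncorrelated, so the variance is
# n^{-3} * sum_{j=1}^{n} j^2 = (n+1)(2n+1)/(6 n^2) -> 1/3.
n = 10_000
var = sum((n + 1 - m) ** 2 for m in range(1, n + 1)) / n ** 3
```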
Proof of (6.6)  To simplify the proof and prepare for generalizations in the next section, let Xn,m, 1 ≤ m ≤ n, be a triangular array of random variables, let Sn,m = Xn,1 + · · · + Xn,m, and suppose Sn,m = B(τ^n_m). Let Sn,(u) = Sn,m if u = m ∈ {0, 1, ..., n}, with Sn,(u) linear for u ∈ [m − 1, m] when m ∈ {1, ..., n}.

(6.8) Lemma. If τ^n_[ns] → s in probability for each s ∈ [0,1], then

‖Sn,(n·) − B(·)‖ → 0 in probability

To make the connection with the original problem, let Xn,m = Xm/√n and define τ^n_1, ..., τ^n_n so that (Sn,1, ..., Sn,n) =d (B(τ^n_1), ..., B(τ^n_n)). If T1, T2, ... are the stopping times defined in (6.3), Brownian scaling implies τ^n_m =d Tm/n, so the hypothesis of (6.8) is satisfied.
Proof  The fact that B has continuous paths (and hence uniformly continuous paths on [0,1]) implies that if ε > 0, then there is a δ > 0 so that 1/δ is an integer and

(a)  P( |Bt − Bs| < ε for all 0 ≤ s, t ≤ 1 with |t − s| < 2δ ) > 1 − ε

The hypothesis of (6.8) implies that if n ≥ Nδ then

P( |τ^n_[nkδ] − kδ| < δ for k = 1, 2, ..., 1/δ ) ≥ 1 − ε

Since m → τ^n_m is increasing, it follows that if s ∈ ((k − 1)δ, kδ)

τ^n_[ns] − s ≥ τ^n_[n(k−1)δ] − kδ
τ^n_[ns] − s ≤ τ^n_[nkδ] − (k − 1)δ

so if n ≥ Nδ,

(b)  P( sup_{0≤s≤1} |τ^n_[ns] − s| < 2δ ) ≥ 1 − ε

When the events in (a) and (b) occur

(c)  |Sn,m − B_{m/n}| < ε for all m ≤ n

To deal with t = (m + θ)/n with 0 < θ < 1, we observe that

|Sn,(nt) − Bt| ≤ (1 − θ)|Sn,m − B_{m/n}| + θ|Sn,m+1 − B_{(m+1)/n}| + (1 − θ)|B_{m/n} − Bt| + θ|B_{(m+1)/n} − Bt|

Using (c) on the first two terms and (a) on the last two, we see that if n ≥ Nδ and 1/n < 2δ, then ‖Sn,(n·) − B(·)‖ < 2ε with probability ≥ 1 − 2ε. Since ε is arbitrary, the proof of (6.8) is complete.
To get (6.6) now, we have to show:

(6.9) Lemma. If ϕ is bounded and continuous, then Eϕ(Sn,(n·)) → Eϕ(B(·)).

Proof  For fixed ε > 0, let Gδ = {ω : if ‖ω′ − ω‖ < δ then |ϕ(ω′) − ϕ(ω)| < ε}. Since ϕ is continuous, Gδ ↑ C[0,1] as δ ↓ 0. Let ∆ = ‖Sn,(n·) − B(·)‖. The desired result now follows from (6.8) and the trivial inequality

|Eϕ(Sn,(n·)) − Eϕ(B(·))| ≤ ε + (2 sup |ϕ(ω)|){P(B(·) ∈ G^c_δ) + P(∆ ≥ δ)}
To accommodate our final example, we need a trivial generalization of (6.6). Let C[0,∞) = {continuous ω : [0,∞) → R} and let C[0,∞) be the σ-field generated by the finite dimensional sets. Given a probability measure µ on C[0,∞), there is a corresponding measure πM µ on C[0,M] = {continuous ω : [0,M] → R} (with C[0,M] the σ-field generated by the finite dimensional sets) obtained by "cutting off the paths at time M." Let (ψM ω)(t) = ω(t) for t ∈ [0,M] and let πM µ = µ ◦ ψM^{−1}. We say that a sequence of probability measures µn on C[0,∞) converges weakly to µ if for all M, πM µn converges weakly to πM µ on C[0,M], the last concept being defined by a trivial extension of the definitions for M = 1. With these definitions, it is easy to conclude:

(6.10) Theorem. S(n·)/√n ⇒ B(·), i.e., the associated measures on C[0,∞) converge weakly.

Proof  By definition, all we have to show is that weak convergence occurs on C[0,M] for all M < ∞. The proof of (6.6) works in the same way when 1 is replaced by M.
Example 6.6. Let Nn = inf{m : Sm ≥ √n} and T1 = inf{t : Bt ≥ 1}. Since ψ(ω) = T1(ω) ∧ 1 is continuous P0-a.s. on C[0,1] and the distribution of T1 is continuous, it follows from (6.7) that for 0 < t < 1

P(Nn ≤ nt) → P(T1 ≤ t)

Repeating the last argument with 1 replaced by M and using (6.10) shows that the last conclusion holds for all t.

*7.7. CLT's for Dependent Variables

In this section, we will prove central limit theorems for some dependent sequences. First, by embedding martingales in Brownian motion, we will prove a Lindeberg-Feller theorem for martingales, (7.3). Then, using an idea of Gordin (1969), we will use the result for martingales to get a CLT for stationary sequences, (7.6). The condition in (7.6) may look difficult to check, but we show that it is implied by the usual mixing conditions. We begin with the central limit theorems for:

a. Martingales
(7.1) Theorem. If Sn is a martingale with S0 = 0 and Bt is a Brownian motion, there is a sequence of stopping times 0 = T0 ≤ T1 ≤ T2 ≤ ... for the Brownian motion so that

(S0, S1, ..., Sk) =d (B(T0), B(T1), ..., B(Tk)) for all k ≥ 0

Remark. This is due to Strassen (1967). See his Theorem 4.3.

Proof  We include S0 = 0 = B(T0) only for the sake of starting the induction argument. Suppose we have (S0, ..., Sk−1) =d (B(T0), ..., B(Tk−1)) for some k ≥ 1. The strong Markov property implies that {B(Tk−1 + t) − B(Tk−1) : t ≥ 0} is a Brownian motion that is independent of F(Tk−1). Let µk(s0, ..., sk−1; ·) be a regular conditional distribution of Sk − Sk−1 given Sj = sj, 0 ≤ j ≤ k − 1, that is, for each Borel set A

P(Sk − Sk−1 ∈ A | S0, ..., Sk−1) = µk(S0, ..., Sk−1; A)

By Exercises 1.16 and 1.14 in Chapter 4, this exists, and we have

0 = E(Sk − Sk−1 | S0, ..., Sk−1) = ∫ x µk(S0, ..., Sk−1; dx)

so the mean of the conditional distribution is 0 almost surely. Using (6.1) now, we see that for almost every Ŝ ≡ (S0, ..., Sk−1), there is a stopping time τŜ (for {B(Tk−1 + t) − B(Tk−1) : t ≥ 0}) so that

B(Tk−1 + τŜ) − B(Tk−1) =d µk(S0, ..., Sk−1; ·)

If we let Tk = Tk−1 + τŜ, then (S0, ..., Sk) =d (B(T0), ..., B(Tk)) and the result follows by induction.

Remark. While the details of the proof are fresh in the reader's mind, we would like to observe that if E(Sk − Sk−1)² < ∞, then

E(τŜ | S0, ..., Sk−1) = ∫ x² µk(S0, ..., Sk−1; dx)

since Bt² − t is a martingale and τŜ is the exit time from a randomly chosen interval (Sk−1 + Uk, Sk−1 + Vk).

Our first step toward the promised Lindeberg-Feller theorem is to prove a result of Freedman (1971a). We say that Xn,m, Fn,m, 1 ≤ m ≤ n, is a martingale difference array if Xn,m ∈ Fn,m and E(Xn,m | Fn,m−1) = 0 for 1 ≤ m ≤ n, where Fn,0 = {∅, Ω}. Let

Vn,k = Σ_{1≤m≤k} E(X²n,m | Fn,m−1)

(7.2) Theorem. Suppose {Xn,m, Fn,m} is a martingale difference array and let Sn,m = Xn,1 + · · · + Xn,m. If (i) |Xn,m| ≤ εn for all m, with εn → 0, and (ii) for each t, Vn,[nt] → t in probability, then Sn,(n·) ⇒ B(·).
Let N = {0, 1, 2, ...}. Here and throughout the section, Sn,(n·) denotes the linear interpolation of Sn,m defined by Sn,(u) = Sn,k if u = k ∈ N, with Sn,(u) linear for u ∈ [k, k+1] when k ∈ N, and B(·) is a Brownian motion with B0 = 0.
Proof  (ii) implies Vn,n → 1 in probability. By stopping each sequence at the first time Vn,k > 2 and setting the later Xn,m = 0, we can suppose without loss of generality that Vn,n ≤ 2 + ε²n for all n. By (7.1), we can find stopping times Tn,1, ..., Tn,n so that (Sn,1, ..., Sn,n) =d (B(Tn,1), ..., B(Tn,n)). By (6.8), it suffices to show that Tn,[nt] → t in probability for each t ∈ [0,1]. To do this, we let tn,m = Tn,m − Tn,m−1 (with Tn,0 = 0) and observe that, by the remark after the proof of (7.1), E(tn,m | Fn,m−1) = E(X²n,m | Fn,m−1). The last observation and hypothesis (ii) imply

Σ_{m=1}^{[nt]} E(tn,m | Fn,m−1) → t in probability

To get from this to Tn,[nt] → t in probability, we observe

E( Σ_{m=1}^{[nt]} tn,m − E(tn,m | Fn,m−1) )² = E Σ_{m=1}^{[nt]} {tn,m − E(tn,m | Fn,m−1)}²

by the orthogonality of martingale increments, (4.6) in Chapter 4. Now

E({tn,m − E(tn,m | Fn,m−1)}² | Fn,m−1) ≤ E(t²n,m | Fn,m−1) ≤ C E(X⁴n,m | Fn,m−1) ≤ C ε²n E(X²n,m | Fn,m−1)

by Exercise 5.4 and assumption (i). Summing over m, taking expected values, and recalling that we have assumed Vn,n ≤ 2 + ε²n, it follows that

E( Σ_{m=1}^{[nt]} tn,m − E(tn,m | Fn,m−1) )² ≤ C ε²n E Vn,n → 0

Unscrambling the definitions, we have shown E(Tn,[nt] − Vn,[nt])² → 0, so Chebyshev's inequality implies P(|Tn,[nt] − Vn,[nt]| > ε) → 0, and using (ii) now completes the proof.
Remark. We can get rid of assumption (ii) in (7.2) by putting the martingale on its "natural time scale." Let Xn,m, Fn,m, 1 ≤ m < ∞, be a martingale difference array with |Xn,m| ≤ εn, where εn → 0, and suppose that Vn,m → ∞ as m → ∞. Define Bn(t) by requiring Bn(Vn,m) = Sn,m, and let Bn(t) be linear on each interval [Vn,m−1, Vn,m]. A generalization of the proof of (7.2) shows that Bn(·) ⇒ B(·). See Freedman (1971a), p. 89–93, for details.
With (7.2) established, a truncation argument gives us:

(7.3) Lindeberg-Feller theorem for martingales. Suppose Xn,m, Fn,m, 1 ≤ m ≤ n, is a martingale difference array. If (i) Vn,[nt] → t in probability for all t ∈ [0,1], and (ii) for all ε > 0,

Σ_{m≤n} E(X²n,m 1(|Xn,m| > ε) | Fn,m−1) → 0 in probability,

then Sn,(n·) ⇒ B(·).

Remark. Literally dozens of papers have been written proving results like this. See Hall and Heyde (1980) for some history. Here we follow Durrett and Resnick (1978).
Proof  The first step is to truncate so that we can apply (7.2). Let

V̂n(ε) = Σ_{m=1}^n E(X²n,m 1(|Xn,m| > ε) | Fn,m−1)

(a) If εn → 0 slowly enough, then εn^{−2} V̂n(εn) → 0 in probability.

Remark. The εn^{−2} in front is so that we can conclude

Σ_{m≤n} P(|Xn,m| > εn | Fn,m−1) → 0 in probability

Proof of (a)  Let Nm be chosen so that P(m² V̂n(1/m) > 1/m) ≤ 1/m for n ≥ Nm. Let εn = 1/m for n ∈ [Nm, Nm+1) and εn = 1 if n < N1. If δ > 0 and 1/m < δ, then for n ∈ [Nm, Nm+1)

P( εn^{−2} V̂n(εn) > δ ) ≤ P( m² V̂n(1/m) > 1/m ) ≤ 1/m

Let X̂n,m = Xn,m 1(|Xn,m| > εn), X̄n,m = Xn,m 1(|Xn,m| ≤ εn), and

X̃n,m = X̄n,m − E(X̄n,m | Fn,m−1)
Our next step is to show:

(b) If we define S̃n,(n·) in the obvious way, then (7.2) implies S̃n,(n·) ⇒ B(·).

Proof of (b)  Since |X̃n,m| ≤ 2εn, we only have to check (ii) in (7.2). To do this, we observe that the conditional variance formula, (4.7) in Chapter 4, implies

E(X̃²n,m | Fn,m−1) = E(X̄²n,m | Fn,m−1) − E(X̄n,m | Fn,m−1)²

For the first term, we observe

E(X̄²n,m | Fn,m−1) = E(X²n,m | Fn,m−1) − E(X̂²n,m | Fn,m−1)

For the second, we observe that E(Xn,m | Fn,m−1) = 0 implies

E(X̄n,m | Fn,m−1)² = E(X̂n,m | Fn,m−1)² ≤ E(X̂²n,m | Fn,m−1)

by Jensen's inequality, so it follows from (a) and (i) that

Σ_{m=1}^{[nt]} E(X̃²n,m | Fn,m−1) → t for all t ∈ [0,1]
E (Xn,m Fn,m−1 ) → t for all t ∈ [0, 1] m=1 Having proved (b), it remains to estimate the diﬀerence between Sn,(n·)
˜
and Sn,(n·) . On {Xn,m  ≤ n for all 1 ≤ m ≤ n}, we have
n ˜
Sn,(n·) − Sn,(n·) ≤ (c) ¯
E (Xn,m Fn,m−1 )
m=1 To handle the righthand side, we observe
n n ¯
E (Xn,m Fn,m−1 ) =
m=1 ˆ
E (Xn,m Fn,m−1 )
m=1
n ˆ
E (Xn,m Fn,m−1 ) ≤ (d) m=1
n ≤ ˆ2
E (Xn,m Fn,m−1 ) → 0 −1
n
m=1 in probability by (a). To complete the proof now, it suﬃces to show
(e) P (Xn,m  > n for some m ≤ n) → 0 ˜
for with (d) and (c) this implies Sn,(n·) − Sn,(n·) → 0 in probability. The
˜
proof of (7.2) constructs a Brownian motion with Sn,(n·) − B (·) → 0, so the
desired result follows from the triangle inequality and (6.9). 7.7 CLT’s for Dependent Variables
To prove (e), we will use Lemma 3.5 of Dvoretzky (1972).

(f) If Am ∈ Gm for 1 ≤ m ≤ n, then for any nonnegative δ ∈ G0,

P(∪_{m=1}^n Am | G0) ≤ δ + P( Σ_{m=1}^n P(Am | Gm−1) > δ | G0 )

Proof of (f)  We proceed by induction. When n = 1, the conclusion says

P(A1 | G0) ≤ δ + P( P(A1 | G0) > δ | G0 )

This is obviously true on Ω− ≡ {P(A1 | G0) ≤ δ} and also on Ω+ ≡ {P(A1 | G0) > δ} ∈ G0, since on Ω+

P( P(A1 | G0) > δ | G0 ) = 1 ≥ P(A1 | G0)

To prove the result for n sets, observe that by the last argument the inequality is trivial on Ω+. Let Bm = Am ∩ Ω−. Since Ω− ∈ G0 ⊂ Gm−1, P(Bm | Gm−1) = P(Am | Gm−1) on Ω−. (See Exercise 1.1 in Chapter 4.) Applying the result for n − 1 sets with γ = δ − P(B1 | G0) ≥ 0,

P(∪_{m=2}^n Bm | G1) ≤ γ + P( Σ_{m=2}^n P(Bm | Gm−1) > γ | G1 )

Taking conditional expectation with respect to G0 and noting γ ∈ G0,

P(∪_{m=2}^n Bm | G0) ≤ γ + P( Σ_{m=1}^n P(Bm | Gm−1) > δ | G0 )

Now ∪_{2≤m≤n} Bm = (∪_{2≤m≤n} Am) ∩ Ω−, and another use of Exercise 1.1 from Chapter 4 shows

Σ_{1≤m≤n} P(Bm | Gm−1) = Σ_{1≤m≤n} P(Am | Gm−1) on Ω−

So, on Ω−,

P(∪_{m=2}^n Am | G0) ≤ δ − P(A1 | G0) + P( Σ_{m=1}^n P(Am | Gm−1) > δ | G0 )

The result now follows from

P(∪_{m=1}^n Am | G0) ≤ P(A1 | G0) + P(∪_{m=2}^n Am | G0)

To see this, let C = ∪_{2≤m≤n} Am, observe that 1_{A1∪C} ≤ 1_{A1} + 1_C, and use the monotonicity of conditional expectations.
Proof of (e)  Let Am = {|Xn,m| > εn}, Gm = Fn,m, and let δ be a positive number. (f) implies

P(|Xn,m| > εn for some m ≤ n) ≤ δ + P( Σ_{m=1}^n P(|Xn,m| > εn | Fn,m−1) > δ )

To estimate the right-hand side, we observe that "Chebyshev's inequality" (Exercise 1.3 in Chapter 4) implies

Σ_{m=1}^n P(|Xn,m| > εn | Fn,m−1) ≤ εn^{−2} Σ_{m=1}^n E(X̂²n,m | Fn,m−1) → 0

so lim sup_{n→∞} P(|Xn,m| > εn for some m ≤ n) ≤ δ. Since δ is arbitrary, the proof of (e), and hence of (7.3), is complete.
For applications, it is useful to have a result for a single sequence.

(7.4) Martingale central limit theorem. Suppose Xn, Fn, n ≥ 1, is a martingale difference sequence and let Vk = Σ_{1≤n≤k} E(X²n | Fn−1). If (i) Vk/k → σ² > 0 in probability, and (ii)

n^{−1} Σ_{m≤n} E(X²m 1(|Xm| > ε√n)) → 0,

then S(n·)/√n ⇒ σB(·).

Proof  Let Xn,m = Xm/(σ√n) and Fn,m = Fm. Changing notation and letting k = [nt], our first assumption becomes (i) of (7.3). To check (ii) of (7.3), observe that

E Σ_{m=1}^n E(X²n,m 1(|Xn,m| > ε) | Fn,m−1) = σ^{−2} n^{−1} Σ_{m=1}^n E(X²m 1(|Xm| > εσ√n)) → 0
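A hypothetical illustration of (7.4), with our own toy choice of martingale: take Xn = ξn · a(ξn−1), where the ξn are i.i.d. ±1 and a(+1) = 1, a(−1) = c. Then E(Xn | Fn−1) = 0, and Vk/k → σ² = (1 + c²)/2 by the law of large numbers, so Sn/√n should have mean near 0 and variance near σ²:

```python
import math
import random

random.seed(2)

c = 2.0                    # hypothetical scale: a(+1) = 1, a(-1) = c
sigma2 = (1 + c * c) / 2   # limit of V_k / k by the law of large numbers

def s_over_sqrt_n(n):
    # one sample of S_n / sqrt(n) for the martingale X_n = xi_n * a(xi_{n-1})
    prev, s = 1, 0.0
    for _ in range(n):
        xi = random.choice((-1, 1))
        s += xi * (1.0 if prev == 1 else c)
        prev = xi
    return s / math.sqrt(n)

reps, n = 3000, 400
vals = [s_over_sqrt_n(n) for _ in range(reps)]
mean = sum(vals) / reps
var = sum(v * v for v in vals) / reps
```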
σ n) ) →0 m=1 Exercise 7.1. Chaindependent random variables. (Kielson and Wishart
(1964), O’Brien (1974)). Let ζn be a Markov chain with ﬁnite state space S ,
irreducible transition probability p(i, j ), and stationary distribution πi . The
random variables {Xn , n ≥ 1} are said to be chaindependent if for Gn =
σ (ζm , Xm , m ≤ n) and n ≥ 0 we have
P (ζn+1 = j, Xn+1 ≤ xGn ) = p(ζn , j )Hζn (x) 7.7 CLT’s for Dependent Variables
where H1 , . . . , Hn are given distribution functions. Intuitively, ζn is the “environment,” which may be in one of a ﬁnite number of states, and the distribution
of the random variable we observe depends upon the value of ζn . Show that
2
if x dHi = 0 and x2 dHi = σi < ∞, then Xn is a martingale diﬀerence
sequence and
M n−1/2 S(n·) ⇒ σB (·) where σ2 = 2
πm σm
m=1 The unappetizing assumption
7.3. x dHi = 0 for all i will be dropped in Exercise b. Stationary Sequences
We begin by considering the martingale case.
(7.5) Theorem. Suppose Xn , n ∈ Z, is an ergodic stationary sequence of square
2
integrable martingale diﬀerences, i.e., σ 2 = EXn < ∞ and E (Xn Fn−1 ) = 0,
where Fn = σ (Xm , m ≤ n). Let Sn = X1 + · · · + Xn . Then
S(n·) /n1/2 ⇒ σB (·)
Remark. This result was discovered independently by Billingsley (1961) and
Ibragimov (1963).
2
Proof un ≡ E (Xn Fn−1 ) can be written as ϕ(Xn−1 , Xn−2 , . . .), so (1.3) in
Chapter 6 implies un is stationary and ergodic, and the ergodic theorem implies
n n −1 2
um → Eu0 = EX0 a.s. m=1 The last conclusion shows that (i) of (7.4) holds. To verify (ii), we observe
n n−1 2
E (Xm 1(Xm > √)
n) 2
= E (X0 1(X0 > √)
n) →0 m=1 by the dominated convergence theorem.
We come now to the promised central limit theorem for stationary sequences. Our proof is based on Scott (1973), but the key is an idea of Gordin (1969). In some applications of this result (see e.g., Exercise 7.2), it is convenient to use σ-fields richer than Fm = σ(Xn, n ≤ m). So we adopt abstract conditions that satisfy the demands of martingale theory and ergodic theory:

(i) n → Fn is increasing, and Xn ∈ Fn
(ii) θ^{−1} Fm = Fm+1

(7.6) Theorem. Suppose Xn, n ∈ Z, is an ergodic stationary sequence with EXn = 0. Let Fn satisfy (i), (ii), and

Σ_{n≥1} ‖E(X0 | F−n)‖2 < ∞

where ‖Y‖2 = (EY²)^{1/2}. Let Sn = X1 + · · · + Xn. Then S(n·)/√n ⇒ σB(·), where

σ² = EX²0 + 2 Σ_{n=1}^∞ EX0Xn

and the series in the definition converges absolutely.
Proof  Suppose Xn, n ∈ Z, is defined on sequence space (R^Z, R^Z, P) with Xn(ω) = ωn, and let (θ^n ω)(m) = ω(m + n). Let

Hn = {Y ∈ Fn with EY² < ∞}
Kn = {Y ∈ Hn with E(YZ) = 0 for all Z ∈ Hn−1}

Geometrically, H0 ⊃ H−1 ⊃ H−2 ⊃ ... is a sequence of subspaces of L², and Kn is the orthogonal complement of Hn−1 in Hn. If Y is a random variable, let (θ^n Y)(ω) = Y(θ^n ω). Generalizing from the example Y = f(X−j, ..., Xk), which has θ^n Y = f(Xn−j, ..., Xn+k), it is easy to see that if Y ∈ Hk then θ^n Y ∈ Hk+n, and hence if Y ∈ Kj then θ^n Y ∈ Kn+j.

If X0 happened to be in K0, we would be happy, since then Xn = θ^n X0 ∈ Kn for all n, and taking Z = 1A ∈ Hn−1 we would have E(Xn 1A) = 0 for all A ∈ Fn−1 and hence E(Xn | Fn−1) = 0. The next best thing to having X0 ∈ K0 is to have

(∗)  X0 = Y0 + Z0 − θZ0 with Y0 ∈ K0 and Z0 ∈ L²,

for then if we let

Sn = Σ_{m=1}^n Xm = Σ_{m=1}^n θ^m X0  and  Tn = Σ_{m=1}^n θ^m Y0

then Sn = Tn + θZ0 − θ^{n+1} Z0. The θ^m Y0 are a stationary ergodic martingale difference sequence (ergodicity follows from (1.3) in Chapter 6), so (7.5) implies

T(n·)/√n ⇒ σB(·) where σ² = EY0²

To get rid of the other term, we observe θZ0/√n → 0 a.s. and

P( sup_{1≤m≤n} |θ^{m+1} Z0| > ε√n ) ≤ n P(|Z0| > ε√n) ≤ ε^{−2} E(Z0²; |Z0| > ε√n) → 0

by dominated convergence. To solve (∗) formally, we let

Z0 = Σ_{j=0}^∞ E(Xj | F−1)
θZ0 = Σ_{j=0}^∞ E(Xj+1 | F0)
Y0 = Σ_{j=0}^∞ {E(Xj | F0) − E(Xj | F−1)}

and check that

Y0 + Z0 − θZ0 = E(X0 | F0) = X0

To justify the last calculation, we need to know that the series in the definitions of Z0 and Y0 converge. Our assumption and shift invariance imply

Σ_{j=0}^∞ ‖E(Xj | F−1)‖2 < ∞

so the triangle inequality implies that the series for Z0 converges in L². Since

‖E(Xj | F0) − E(Xj | F−1)‖2 ≤ ‖E(Xj | F0)‖2

(the left-hand side is the projection onto K0 ⊂ H0), the series for Y0 also converges in L². Putting the pieces together, we have shown (7.5) with σ² = EY0². To get the indicated formula for σ², observe that conditioning and the Cauchy-Schwarz inequality give

E|X0Xn| = |E(X0 E(Xn | F0))| ≤ ‖X0‖2 ‖E(Xn | F0)‖2

Shift invariance implies ‖E(Xn | F0)‖2 = ‖E(X0 | F−n)‖2, so the series Σ EX0Xn converges absolutely. Now

ES²n = Σ_{j=1}^n Σ_{k=1}^n EXjXk = n EX²0 + 2 Σ_{m=1}^n (n − m) EX0Xm

From this, it follows easily that

n^{−1} ES²n → EX²0 + 2 Σ_{m=1}^∞ EX0Xm

To finish the proof, let Tn = Σ_{m=1}^n θ^m Y0, observe σ² = EY0², and

n^{−1} E(Sn − Tn)² = n^{−1} E(θZ0 − θ^{n+1} Z0)² ≤ 4 EZ0²/n → 0

since (a − b)² ≤ 2a² + 2b².

We turn now to examples. In the first one, it is trivial to check the hypothesis of (7.6).
Example 7.1. M-dependent sequences. Let Xn, n ∈ Z, be a stationary sequence with EXn = 0, EX²n < ∞, and suppose that {Xj, j ≤ 0} and {Xk, k > M} are independent. In this case, E(X0 | F−n) = 0 for n > M, so (7.6) implies S(n·)/√n ⇒ σB(·), where

σ² = EX²0 + 2 Σ_{m=1}^M EX0Xm

Remark. For a bare-hands approach to this problem, see Theorem 7.3.1 in Chung (1974).
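As a hypothetical numerical check of the variance formula, take the 1-dependent sequence Xn = ξn + θ ξn+1 with the ξi i.i.d. standard normal (our own choice of example, not one from the text): then EX0² = 1 + θ², EX0X1 = θ, so σ² = (1 + θ)², and the sample variance of Sn/√n should be close to that value:

```python
import math
import random

random.seed(3)

theta = 0.7
sigma2 = (1 + theta) ** 2   # EX0^2 + 2 EX0X1 = (1 + theta^2) + 2*theta

def s_over_sqrt_n(n):
    # S_n / sqrt(n) for the 1-dependent sequence X_m = xi_m + theta*xi_{m+1}
    xi = [random.gauss(0, 1) for _ in range(n + 1)]
    s = sum(xi[m] + theta * xi[m + 1] for m in range(n))
    return s / math.sqrt(n)

reps, n = 3000, 300
vals = [s_over_sqrt_n(n) for _ in range(reps)]
var = sum(v * v for v in vals) / reps
```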
Exercise 7.2. Consider the special case of Example 7.1 in which ξi, i ∈ Z, are i.i.d. and take values H and T with equal probability. Let ηn = f(ξn, ξn+1), where f(H,T) = 1 and f(i,j) = 0 otherwise. Sn = η1 + ... + ηn counts the number of head runs in ξ1, ..., ξn+1. Apply (7.6) with Fn = σ(ξm : m ≤ n + 1) to show that there are constants µ and σ so that (Sn − nµ)/σn^{1/2} ⇒ χ. What is the random variable Y0 constructed in the proof of (7.6) in this case?
Example 7.2. Markov chains. Let ζn, n ∈ Z, be an irreducible Markov chain on a countable state space S in which each ζn has the stationary distribution π. Let Xn = f(ζn), where Σ f(x)π(x) = 0 and Σ f(x)²π(x) < ∞. Results in Chapter 6 imply that Xn is an ergodic stationary sequence. If we let F−n = σ(ζm, m ≤ −n), then

E(X0 | F−n) = Σ_y p^n(ζ−n, y) f(y)

where p^n(x,y) is the n step transition probability, so

‖E(X0 | F−n)‖²_2 = Σ_x π(x) ( Σ_y p^n(x,y) f(y) )²

When f is bounded, we can use Σ f(x)π(x) = 0 to get the following bound:

‖E(X0 | F−n)‖²_2 ≤ ‖f‖²_∞ Σ_x π(x) ‖p^n(x,·) − π(·)‖²

where ‖f‖∞ = sup |f(x)| and ‖·‖ is the total variation norm.

When S is finite, all f are bounded. If the chain is aperiodic, Exercise 5.6 in Chapter 5 implies that

sup_x ‖p^n(x,·) − π(·)‖ ≤ C e^{−εn}

and the hypothesis of (7.6) is satisfied. To see that the limiting variance σ² may be 0 in this case, consider the modification of Exercise 7.1 in which Xn = f(ξn, ξn+1) with f(H,T) = 1, f(T,H) = −1, and f(H,H) = f(T,T) = 0. In this case, Σ_{m=1}^n f(ξm, ξm+1) ∈ {−1, 0, 1}, so there is no central limit theorem.
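The degeneracy in the last example can be seen directly: the sum counts HT transitions minus TH transitions, which telescopes to a quantity depending only on the first and last symbols. A quick hypothetical sketch:

```python
import random

random.seed(5)
f = {("H", "T"): 1, ("T", "H"): -1, ("H", "H"): 0, ("T", "T"): 0}

def partial_sum(n):
    # sum_{m=1}^{n} f(xi_m, xi_{m+1}) for an i.i.d. fair H/T sequence
    xi = [random.choice("HT") for _ in range(n + 1)]
    return sum(f[(xi[m], xi[m + 1])] for m in range(n))

# the partial sums only ever take the values -1, 0, 1
vals = {partial_sum(200) for _ in range(500)}
```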
Exercise 7.3. Consider the setup of Exercise 7.1 and allow ∫x dHi = µi to be different from 0, but assume (without loss of generality) that Σ_i µi πi = 0. Use (7.6) to conclude that S(n·)/√n ⇒ σB(·).
Our last example is simple to treat directly, but will help us evaluate the strength of the conditions in (7.6) and in later theorems.

Example 7.3. Moving average process. Suppose

Xm = Σ_{k≥0} ck ξm−k  where  Σ_{k≥0} c²k < ∞

and the ξi, i ∈ Z, are i.i.d. with Eξi = 0 and Eξi² = 1. If F−n = σ(ξm; m ≤ −n), then

‖E(X0 | F−n)‖2 = ‖ Σ_{k≥n} ck ξ−k ‖2 = ( Σ_{k≥n} c²k )^{1/2}

If, for example, ck = (1 + k)^{−p}, then ‖E(X0 | F−n)‖2 ∼ n^{(1/2)−p}, and (7.6) applies if p > 3/2.
Remark. Theorem 5.3 in Hall and Heyde (1980) shows that

Σ_{j≥0} ‖E(Xj | F0) − E(Xj | F−1)‖2 < ∞

is sufficient for a central limit theorem. Using this result shows that Σ |ck| < ∞ is sufficient for the central limit theorem in Example 7.3.

The condition in the improved result is close to the best possible. Suppose the ξi take values 1 and −1 with equal probability, and let ck = (1 + k)^{−p} where 1/2 < p < 1. The Lindeberg-Feller theorem can be used to show that Sn/n^{3/2−p} ⇒ σχ. To check the normalization (which was wrong in the first edition), note that

Σ_{m=1}^n Xm = Σ_{j≤n} an,j ξj

If j ≥ 0 then an,j = Σ_{i=0}^{n−j} ci ≈ (n − j)^{1−p}, so using the terms with 0 ≤ j ≤ n/2, the variance is at least of order n^{3−2p}. Further details are left to the reader.
The last three examples show that in many cases it is easy to verify the hypothesis of (7.6) directly. To connect (7.6) with other results in the literature, we will introduce two sufficient conditions phrased in terms of:

c. Mixing Properties

In each case, we will first give an estimate on covariances and then state a central limit theorem. Let

α(G, H) = sup{ |P(A ∩ B) − P(A)P(B)| : A ∈ G, B ∈ H }

If α = 0, G and H are independent, so α measures the dependence between the two σ-fields.

(7.7) Lemma. Let p, q, r ∈ (1, ∞] with 1/p + 1/q + 1/r = 1, and suppose X ∈ G and Y ∈ H have E|X|^p, E|Y|^q < ∞. Then

|EXY − EXEY| ≤ 8 ‖X‖p ‖Y‖q (α(G, H))^{1/r}

Here, we interpret x^0 = 1 for x > 0 and 0^0 = 0.

Proof  If α = 0, X and Y are independent and the result is true, so we can suppose α > 0. We build up to the result in three steps, starting with the case r = ∞.

(a)  |EXY − EXEY| ≤ 2 ‖X‖p ‖Y‖q
Proof of (a)  Hölder's inequality, (5.3) in the Appendix, implies E|XY| ≤ ‖X‖p ‖Y‖q, and Jensen's inequality implies

‖X‖p ‖Y‖q ≥ E|X| E|Y| ≥ |EX EY|

so the result follows from the triangle inequality.

(b)  |EXY − EXEY| ≤ 4 ‖X‖∞ ‖Y‖∞ α(G, H)

Proof of (b)  Let η = sgn{E(Y | G) − EY} ∈ G. Since EXY = E(X E(Y | G)),

|EXY − EXEY| = |E(X{E(Y | G) − EY})| ≤ ‖X‖∞ E|E(Y | G) − EY| = ‖X‖∞ E(η{E(Y | G) − EY}) = ‖X‖∞ {E(ηY) − Eη EY}

Applying the last argument with the roles of the σ-fields interchanged (i.e., with X = Y and Y = η) gives

|E(Yη) − EY Eη| ≤ ‖Y‖∞ |E(ζη) − Eζ Eη|

where ζ = sgn{E(η | H) − Eη} ∈ H. Now η = 1A − 1B and ζ = 1C − 1D, so

|E(ζη) − Eζ Eη| = |P(A ∩ C) − P(B ∩ C) − P(A ∩ D) + P(B ∩ D) − P(A)P(C) + P(B)P(C) + P(A)P(D) − P(B)P(D)| ≤ 4α(G, H)

Combining the last three displays gives the desired result.

(c)  |EXY − EXEY| ≤ 6 ‖X‖p ‖Y‖∞ α(G, H)^{1−1/p}

Proof of (c)  Let C = α^{−1/p} ‖X‖p, X1 = X 1(|X| ≤ C), and X2 = X − X1. Then

|EXY − EXEY| ≤ |EX1Y − EX1EY| + |EX2Y − EX2EY| ≤ 4αC ‖Y‖∞ + 2 ‖Y‖∞ E|X2|

by (b) and (a). Now

E|X2| ≤ C^{−(p−1)} E(|X|^p 1(|X| > C)) ≤ C^{−p+1} E|X|^p

Combining the last two inequalities and using the definition of C gives

|EXY − EXEY| ≤ 4 α^{1−1/p} ‖X‖p ‖Y‖∞ + 2 ‖Y‖∞ α^{1−1/p} ‖X‖p ≤ 6 α^{1−1/p} ‖X‖p ‖Y‖∞

which is the desired result.

Finally, to prove (7.7), let C = α^{−1/q} ‖Y‖q, Y1 = Y 1(|Y| ≤ C), and Y2 = Y − Y1. Then

|EXY − EXEY| ≤ |EXY1 − EXEY1| + |EXY2 − EXEY2| ≤ 6C ‖X‖p α^{1−1/p} + 2 ‖X‖p ‖Y2‖θ

where θ = (1 − 1/p)^{−1}, by (c) and (a). Now

E|Y2|^θ ≤ C^{−(q−θ)} E(|Y|^q 1(|Y| > C)) ≤ C^{−q+θ} E|Y|^q

Taking the 1/θ root of each side and recalling the definition of C,

‖Y2‖θ ≤ C^{−(q−θ)/θ} ‖Y‖q^{q/θ} ≤ α^{(q−θ)/θq} ‖Y‖q

so we have

|EXY − EXEY| ≤ 6 α^{1−1/p−1/q} ‖X‖p ‖Y‖q + 2 α^{1/θ−1/q} ‖X‖p ‖Y‖q ≤ 8 ‖X‖p ‖Y‖q α^{1/r}

proving (7.7).
Remark. The last proof is from Appendix III of Hall and Heyde (1980). They attribute (b) to Ibragimov (1962), and (c) and (7.7) to Davydov (1968).

Combining (7.6) and (7.7) gives:

(7.8) Theorem. Suppose Xn, n ∈ Z, is an ergodic stationary sequence with EXn = 0 and E|X0|^{2+δ} < ∞. Let α(n) = α(F−n, σ(X0)), where F−n = σ(Xm : m ≤ −n), and suppose

Σ_{n=1}^∞ α(n)^{δ/(2(2+δ))} < ∞

If Sn = X1 + · · · + Xn, then S(n·)/√n ⇒ σB(·), where

σ² = EX²0 + 2 Σ_{n=1}^∞ EX0Xn
n=1 Remark. Let α(n) = α(F−n , F0 ), where F0 = σ (Xk , k ≥ 0). When α(n) ↓
¯
¯
0, the sequence is called strong mixing. Rosenblatt (1956) introduced the
concept as a condition under which the central limit theorem for stationary 7.7 CLT’s for Dependent Variables
√
sequences could be obtained. Ibragimov (1962) proved Sn / n ⇒ σχ where
2
σ 2 = limn→∞ ESn /n under the assumption
∞ α(n)δ/(2+δ) < ∞
¯
n=1 See Ibragimov and Linnik (1971), Theorem 18.5.3, or Hall and Heyde (1980),
Corollary 5.1, for a proof.
Proof  To use (7.7) to estimate the quantity in (7.6), we begin with

(7.9)  ‖E(X | F)‖2 = sup{ E(XY) : Y ∈ F, ‖Y‖2 = 1 }

Proof of (7.9)  If Y ∈ F with ‖Y‖2 = 1, then using a by now familiar property of conditional expectation and the Cauchy-Schwarz inequality

EXY = E(E(XY | F)) = E(Y E(X | F)) ≤ ‖E(X | F)‖2 ‖Y‖2

Equality holds when Y = E(X | F)/‖E(X | F)‖2.

Letting p = 2 + δ and q = 2 in (7.7), noticing

1/r = 1 − 1/p − 1/q = 1 − 1/(2+δ) − 1/2 = δ/(2(2+δ))

and recalling EX0 = 0, shows that if Y ∈ F−n

|EX0Y| ≤ 8 ‖X0‖_{2+δ} ‖Y‖2 α(n)^{δ/(2(2+δ))}

Combining this with (7.9) gives

‖E(X0 | F−n)‖2 ≤ 8 ‖X0‖_{2+δ} α(n)^{δ/(2(2+δ))}

and it follows that the hypotheses of (7.6) are satisfied.
In the M-dependent case (Example 7.1), α(n) = 0 for n > M, so (7.8) applies. As for Markov chains (Example 7.2), in this case

α(n) = sup_{A,B} |P(X−n ∈ A, X0 ∈ B) − π(A)π(B)| = sup_{A,B} | Σ_{x∈A, y∈B} π(x)(p^n(x,y) − π(y)) | ≤ Σ_x π(x) ‖p^n(x,·) − π(·)‖

so the hypothesis of (7.8) can be checked if we know enough about the rate of convergence to equilibrium.
Finally, to see how good the conditions in (7.8) are, we consider the special case of Example 7.3 in which the ξi are i.i.d. standard normals. Let

ρ(G, H) = sup{ |corr(X, Y)| : X ∈ G, Y ∈ H }

where

corr(X, Y) = (EXY − EXEY) / (‖X − EX‖2 ‖Y − EY‖2)

Clearly, α(G, H) ≤ ρ(G, H). Kolmogorov and Rozanov (1960) have shown that when G and H are generated by Gaussian random variables, ρ(G, H) ≤ 2π α(G, H). They proved this by relating ρ(G, H) to the angle between L²(G) and L²(H). Using the geometric interpretation, we see that if |ck| is decreasing, then α(n) ≤ ᾱ(n) ≤ C|cn|, so (7.8) requires Σ |cn|^{(1/2)−ε} < ∞, but Ibragimov's result applies if Σ |cn|^{1−ε} < ∞. As Exercise 7.9 (or direct computation) shows, the central limit theorem is valid if Σ |cn| < ∞.
Our second mixing concept is more restrictive (and asymmetric) because we divide by P(B). Let

β(G, H) = sup{ |P(A | B) − P(A)| : A ∈ G, B ∈ H with P(B) > 0 }

Clearly, β(G, H) ≥ α(G, H), and β(G, H) = 0 implies that G and H are independent. The analogue of (7.7) for β is:

(7.10) Lemma. Let p, q ∈ (1, ∞) with 1/p + 1/q = 1. Suppose X ∈ G and Y ∈ H have E|X|^p, E|Y|^q < ∞. Then

|EXY − EXEY| ≤ 2 β(G, H)^{1/p} ‖X‖p ‖Y‖q

and the resulting convergence theorem is:
(7.11) Theorem. Suppose Xn, n ∈ Z, is an ergodic stationary sequence with EXn = 0 and EX²0 < ∞. Let β(n) = β(F−n, σ(X0)), where F−n = σ(Xm : m ≤ −n), and suppose

Σ_{n=1}^∞ β(n)^{1/2} < ∞

If Sn = X1 + · · · + Xn, then S(n·)/√n ⇒ σB(·), where σ² = EX²0 + 2 Σ_{n=1}^∞ EX0Xn.

Remark. Let β̄(n) = β(F−n, F̄0), where F̄0 = σ(X0, X1, ...). If β̄(n) ↓ 0 as n → ∞, the sequence Xn is said to be uniformly mixing. Billingsley (1968), see Theorem 20.1 on p. 174, gives (7.11) under the assumption Σ β̄(n)^{1/2} < ∞. A proof of (7.10) can be found on p. 170–171 of his book, and it is a simple exercise to deduce (7.11) from (7.10) using (7.6).
We will not enter into the details of these proofs, since they do not give significant results for our examples. In the case of Markov chains (Example 7.2),

β(n) = sup_x ‖p^n(x,·) − π(·)‖

The chain is clearly not uniformly mixing if β(n) = 1 for all n. Conversely, if β(N) < 1 − ε for some N, iterating gives

sup_x ‖p^{kN+j}(x,·) − π(·)‖ ≤ (1 − ε)^k for 0 ≤ j < N

so (7.11) applies; but by using Example 7.2, one can check the assumptions of (7.6) directly.

Things are even worse for Example 7.3, moving average processes. It is easy to see that if ck ≠ 0 for all k and the distribution of the ξk is unbounded, then β(n) = 1 for all n. The last result is a little discouraging, but there is at least one interesting example:

Example 7.4. The continued fraction transformation (see Exercise 1.6 in Chapter 6) is uniformly mixing with β̄(n) = aρ^n, where ρ < 1. See Chapter 9 of Lévy (1937). For more on this example, see Billingsley (1968), p. 192–194.
*7.8. Empirical Distributions, Brownian Bridge

Let X_1, X_2, . . . be i.i.d. with distribution F. In Section 1.7, we showed that
with probability one, the empirical distribution

F̂_n(x) = (1/n) |{m ≤ n : X_m ≤ x}|

converges uniformly to F(x). In this section, we will investigate the rate of
convergence when F is continuous. We impose this restriction so we can reduce
to the case of a uniform distribution on (0,1) by setting Y_n = F(X_n). (See
Exercise 1.9 in Chapter 1.) Since x → F(x) is nondecreasing and continuous,
and no observations land in intervals of constancy of F, it is easy to see that if
we let

Ĝ_n(y) = (1/n) |{m ≤ n : Y_m ≤ y}|
then

sup_x |F̂_n(x) − F(x)| = sup_{0<y<1} |Ĝ_n(y) − y|

For the rest of the section, then, we will assume Y_1, Y_2, . . . is i.i.d. uniform on
(0,1). To be able to apply Donsker's theorem, we will transform the problem.
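The reduction to the uniform case is easy to watch numerically. The sketch below (illustrative only; the Exp(1) distribution and sample size are arbitrary choices) computes the maximum discrepancy both from the raw data with F and from the transformed data Y_m = F(X_m), evaluating at the order statistics where the extremes occur:

```python
import math, random

random.seed(0)
F = lambda x: 1 - math.exp(-x)            # a continuous distribution function
xs = [random.expovariate(1.0) for _ in range(200)]
ys = sorted(F(x) for x in xs)             # Y_m = F(X_m) are uniform on (0,1)
n = len(ys)

# sup |G_n(y) - y| computed at the jumps of the empirical distribution
d_plus  = max(m / n - ys[m - 1] for m in range(1, n + 1))
d_minus = max(ys[m - 1] - (m - 1) / n for m in range(1, n + 1))
sup_unif = max(d_plus, d_minus)

# The same statistic computed directly from the X's and F; since F is
# continuous and monotone the two quantities agree.
xs_sorted = sorted(xs)
d1 = max(m / n - F(xs_sorted[m - 1]) for m in range(1, n + 1))
d2 = max(F(xs_sorted[m - 1]) - (m - 1) / n for m in range(1, n + 1))
sup_orig = max(d1, d2)
```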
Put the observations Y_1, . . . , Y_n in increasing order: U_1^n < U_2^n < . . . < U_n^n. I
claim that

(8.2)   sup_{0<y<1} (Ĝ_n(y) − y) = max_{1≤m≤n} ( m/n − U_m^n )

        inf_{0<y<1} (Ĝ_n(y) − y) = min_{1≤m≤n} ( (m − 1)/n − U_m^n )

since the sup occurs at a jump of Ĝ_n and the inf right before a jump. We will
show that

D_n ≡ n^{1/2} sup_{0<y<1} |Ĝ_n(y) − y|

has a limit, so the extra −1/n in the inf does not make any difference.
Our third and final maneuver is to give a special construction of the order
statistics U_1^n < U_2^n < . . . < U_n^n. Let W_1, W_2, . . . be i.i.d. with P(W_i > t) = e^{−t},
and let Z_n = W_1 + · · · + W_n.

(8.3) Lemma. {U_k^n : 1 ≤ k ≤ n} =_d {Z_k/Z_{n+1} : 1 ≤ k ≤ n}

Proof  We change variables v = r(t), where v_i = t_i/t_{n+1} for i ≤ n and v_{n+1} = t_{n+1}. The inverse function is
s(v) = (v_1 v_{n+1}, . . . , v_n v_{n+1}, v_{n+1})

which has matrix of partial derivatives ∂s_i/∂v_j given by

   v_{n+1}    0     . . .     0      v_1
      0    v_{n+1}  . . .     0      v_2
      .       .       .       .       .
      0       0     . . .  v_{n+1}   v_n
      0       0     . . .     0       1

The determinant of this matrix is v_{n+1}^n, so if we let W = (V_1, . . . , V_{n+1}) =
r(Z_1, . . . , Z_{n+1}), the change of variables formula implies W has joint density

f_W(v_1, . . . , v_n, v_{n+1}) = { Π_{m=1}^n λ e^{−λ v_{n+1}(v_m − v_{m−1})} } · λ e^{−λ v_{n+1}(1 − v_n)} · v_{n+1}^n
To find the joint density of V = (V_1, . . . , V_n), we simplify the preceding formula
and integrate out the last coordinate to get

f_V(v_1, . . . , v_n) = ∫_0^∞ λ^{n+1} v_{n+1}^n e^{−λ v_{n+1}} dv_{n+1} = n!

for 0 < v_1 < v_2 < . . . < v_n < 1, which is the desired joint density.
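A quick Monte Carlo sanity check of (8.3) (purely illustrative; n, the replication count, and the tolerance are arbitrary) compares E U_k^n = k/(n + 1) for sorted uniforms against the exponential-ratio construction Z_k/Z_{n+1}:

```python
import random

random.seed(2)
n, reps = 5, 20000
mean_direct = [0.0] * n   # E U^n_k estimated from sorted uniforms
mean_ratio  = [0.0] * n   # E Z_k / Z_{n+1} from exponential partial sums

for _ in range(reps):
    u = sorted(random.random() for _ in range(n))
    # Z_1 < ... < Z_{n+1}: partial sums of mean-one exponentials
    z, s = [], 0.0
    for _ in range(n + 1):
        s += random.expovariate(1.0)
        z.append(s)
    for k in range(n):
        mean_direct[k] += u[k] / reps
        mean_ratio[k] += z[k] / z[n] / reps
# Both constructions should give E U^n_k = k/(n+1) = k/6 for k = 1..5.
```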
We turn now to the limit law for D_n. As argued above, it suffices to
consider

D_n = n^{1/2} max_{1≤m≤n} | Z_m/Z_{n+1} − m/n |

    = (n/Z_{n+1}) · max_{1≤m≤n} | Z_m/n^{1/2} − (m/n) · Z_{n+1}/n^{1/2} |

    = (n/Z_{n+1}) · max_{1≤m≤n} | (Z_m − m)/n^{1/2} − (m/n) · (Z_{n+1} − n)/n^{1/2} |

If we let

B_n(t) = (Z_m − m)/n^{1/2}   if t = m/n with m ∈ {0, 1, . . . , n},
         linear on [(m − 1)/n, m/n]

then

D_n = (n/Z_{n+1}) max_{0≤t≤1} | B_n(t) − t ( B_n(1) + (Z_{n+1} − Z_n)/n^{1/2} ) |

The strong law of large numbers implies Z_{n+1}/n → 1 a.s., so the first factor will disappear in the limit. To find the limit of the second, we observe
that Donsker's theorem, (6.6), implies B_n(·) ⇒ B(·), a Brownian motion, and
computing second moments shows

(Z_{n+1} − Z_n)/n^{1/2} → 0   in probability

ψ(ω) = max_{0≤t≤1} |ω(t) − tω(1)| is a continuous function from C[0, 1] to R, so
it follows from Donsker's theorem that:

(8.4) Theorem. D_n ⇒ max_{0≤t≤1} |B_t − tB_1|, where B_t is a Brownian motion
starting at 0.
Remark. Doob (1949) suggested this approach to deriving results of Kolmogorov and Smirnov, which was later justiﬁed by Donsker (1952). Our proof
follows Breiman (1968).
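As an illustrative numerical check on (8.4) (not part of the text; the sample size n = 200, the replication count, and the reference point 0.83, close to the approximate median 0.8276 of the limit law, are all arbitrary choices), one can simulate D_n via the order-statistic formula (8.2):

```python
import random

random.seed(4)

def d_n(n):
    """D_n = sqrt(n) * sup_y |G_n(y) - y|, computed via (8.2)."""
    u = sorted(random.random() for _ in range(n))
    d_plus = max(m / n - u[m - 1] for m in range(1, n + 1))
    d_minus = max(u[m - 1] - (m - 1) / n for m in range(1, n + 1))
    return n ** 0.5 * max(d_plus, d_minus)

samples = [d_n(200) for _ in range(2000)]
# The limit law puts roughly half its mass below 0.83, so this fraction
# should be near 1/2 for moderately large n.
frac = sum(1 for d in samples if d <= 0.83) / len(samples)
```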
To identify the distribution of the limit in (8.4), we will first prove

(8.5)   {B_t − tB_1, 0 ≤ t ≤ 1} =_d {B_t, 0 ≤ t ≤ 1 | B_1 = 0}

a process we will denote by B_t^0 and call the Brownian bridge. The event
B_1 = 0 has probability 0, but it is easy to see what the conditional probability
should mean. If 0 = t_0 < t_1 < . . . < t_n < t_{n+1} = 1, x_0 = 0, x_{n+1} = 0, and
x_1, . . . , x_n ∈ R, then

(8.6)   P(B(t_1) = x_1, . . . , B(t_n) = x_n | B(1) = 0) = (1/p_1(0, 0)) Π_{m=1}^{n+1} p_{t_m − t_{m−1}}(x_{m−1}, x_m)

where p_t(x, y) = (2πt)^{−1/2} exp(−(y − x)^2/2t).
Proof of (8.5)  Formula (8.6) shows that the f.d.d.'s of B_t^0 are multivariate
normal and have mean 0. Since B_t − tB_1 also has this property, it suffices to
show that the covariances are equal. We begin with the easier computation. If
s < t, then

(8.7)   E((B_s − sB_1)(B_t − tB_1)) = s − st − st + st = s(1 − t)

For the other process, P(B_s^0 = x, B_t^0 = y) is

(2π)^{1/2} · exp(−x^2/2s)/(2πs)^{1/2} · exp(−(y − x)^2/2(t − s))/(2π(t − s))^{1/2} · exp(−y^2/2(1 − t))/(2π(1 − t))^{1/2}

= (2π)^{−1} (s(t − s)(1 − t))^{−1/2} exp(−(ax^2 + 2bxy + cy^2)/2)

where

a = 1/s + 1/(t − s) = t/(s(t − s))
b = −1/(t − s)
c = 1/(t − s) + 1/(1 − t) = (1 − s)/((t − s)(1 − t))

Recalling the discussion at the end of Section 2.9 and noticing

( t/(s(t − s))          −1/(t − s)            )^{−1}   =   ( s(1 − s)   s(1 − t) )
( −1/(t − s)      (1 − s)/((t − s)(1 − t))    )            ( s(1 − t)   t(1 − t) )

(multiply the matrices!) shows (8.5) holds.
Our final step in investigating the limit distribution of D_n is to compute the
distribution of max_{0≤t≤1} B_t^0. To do this, we compute P_x(T_a ∧ T_b > t, B_t ∈ A)
where a < x < b and A ⊂ (a, b). We begin by observing that

(∗)   P_x(T_a ∧ T_b > t, B_t ∈ A) = P_x(B_t ∈ A) − P_x(T_a < T_b, T_a < t, B_t ∈ A)
                                     − P_x(T_b < T_a, T_b < t, B_t ∈ A)

If we let ρ_a(y) = 2a − y be reflection through a and observe that {T_a < T_b} is
F(T_a) measurable, then it follows from the proof of (4.6) that

P_x(T_a < T_b, T_a < t, B_t ∈ A) = P_x(T_a < T_b, B_t ∈ ρ_a A)

where ρ_a A = {ρ_a(y) : y ∈ A}. To get rid of the T_a < T_b, we observe that

P_x(T_a < T_b, B_t ∈ ρ_a A) = P_x(B_t ∈ ρ_a A) − P_x(T_b < T_a, B_t ∈ ρ_a A)

Noticing that B_t ∈ ρ_a A and T_b < T_a imply T_b < t, and using the reflection
principle again gives

P_x(T_b < T_a, B_t ∈ ρ_a A) = P_x(T_b < T_a, B_t ∈ ρ_b ρ_a A)
= P_x(B_t ∈ ρ_b ρ_a A) − P_x(T_a < T_b, B_t ∈ ρ_b ρ_a A)

Repeating the last two calculations n more times gives

P_x(T_a < T_b, B_t ∈ ρ_a A) = Σ_{m=0}^n { P_x(B_t ∈ ρ_a(ρ_b ρ_a)^m A) − P_x(B_t ∈ (ρ_b ρ_a)^{m+1} A) }
                              + P_x(T_a < T_b, B_t ∈ (ρ_b ρ_a)^{n+1} A)

Each pair of reflections pushes A further away from 0, so letting n → ∞ shows

P_x(T_a < T_b, B_t ∈ ρ_a A) = Σ_{m=0}^∞ { P_x(B_t ∈ ρ_a(ρ_b ρ_a)^m A) − P_x(B_t ∈ (ρ_b ρ_a)^{m+1} A) }

Interchanging the roles of a and b gives

P_x(T_b < T_a, B_t ∈ ρ_b A) = Σ_{m=0}^∞ { P_x(B_t ∈ ρ_b(ρ_a ρ_b)^m A) − P_x(B_t ∈ (ρ_a ρ_b)^{m+1} A) }

Combining the last two expressions with (∗), and using ρ_c^{−1} = ρ_c and
(ρ_a ρ_b)^{−1} = ρ_b^{−1} ρ_a^{−1}, gives

P_x(T_a ∧ T_b > t, B_t ∈ A) = Σ_{n=−∞}^∞ { P_x(B_t ∈ (ρ_b ρ_a)^n A) − P_x(B_t ∈ ρ_a(ρ_b ρ_a)^n A) }
To prepare for applications, let A = (u, v) where a < u < v < b, notice that
ρ_b ρ_a(y) = y + 2(b − a), and change variables in the second sum to get

(8.8)   P_x(T_a ∧ T_b > t, u < B_t < v)
        = Σ_{n=−∞}^∞ { P_x(u + 2n(b − a) < B_t < v + 2n(b − a))
                        − P_x(2b − v + 2n(b − a) < B_t < 2b − u + 2n(b − a)) }

Letting u = y − ε, v = y + ε, dividing both sides by 2ε, and letting ε → 0 (leaving
it to the reader to check that the dominated convergence theorem applies) gives

(8.9)   P_x(T_a ∧ T_b > t, B_t = y) = Σ_{n=−∞}^∞ { P_x(B_t = y + 2n(b − a)) − P_x(B_t = 2b − y + 2n(b − a)) }
where the probabilities are interpreted as density functions.
Setting x = y = 0, t = 1, and dividing by (2π)^{−1/2} = P_0(B_1 = 0), we get
a result for the Brownian bridge B_t^0:

(8.10)   P_0( a < min_{0≤t≤1} B_t^0 < max_{0≤t≤1} B_t^0 < b ) = Σ_{n=−∞}^∞ { e^{−(2n(b−a))^2/2} − e^{−(2b+2n(b−a))^2/2} }

Taking a = −b, we have

(8.11)   P_0( max_{0≤t≤1} |B_t^0| < b ) = 1 + 2 Σ_{m=1}^∞ (−1)^m e^{−2m^2 b^2}

This formula gives the distribution of the Kolmogorov–Smirnov statistic, which
can be used to test if an i.i.d. sequence X1 , . . . , Xn has distribution F . To
do this, we transform the data to F (Xn ) and look at the maximum discrepancy between the empirical distribution and the uniform. (8.11) tells us the
distribution of the error when the Xi have distribution F .
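The series in (8.11) is easy to evaluate numerically. The sketch below (illustrative; the function name, truncation length, and reference points are our own choices) computes the limiting distribution function and the comparison threshold one would use in a 5% test:

```python
import math

def ks_cdf(b, terms=100):
    """P(max_t |B^0_t| < b) = 1 + 2 * sum_{m>=1} (-1)^m exp(-2 m^2 b^2),
    the series from (8.11), truncated after `terms` summands."""
    if b <= 0:
        return 0.0
    return 1 + 2 * sum((-1) ** m * math.exp(-2 * m * m * b * b)
                       for m in range(1, terms + 1))

# Reject "the data come from F" at level 5% when sqrt(n) * sup|F_n - F|
# exceeds the 95% point of this distribution, roughly 1.36.
```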
(8.10) gives the joint distribution of the maximum and minimum of Brownian bridge. In theory, one can let a → −∞ in this formula to ﬁnd the distribution of the maximum, but in practice it is easier to start over again.
Exercise 8.1. Use Exercise 4.1 and the reasoning that led to (8.10) to conclude

P( max_{0≤t≤1} B_t^0 > b ) = exp(−2b^2)

*7.9. Laws of the Iterated Logarithm
Our first goal is to show:

(9.1) LIL for Brownian motion.  lim sup_{t→∞} B_t/(2t log log t)^{1/2} = 1  a.s.

Here LIL is short for "law of the iterated logarithm," a name that refers to the
log log t in the denominator. Once (9.1) is established, we can use the Skorokhod
representation to prove the analogous result for random walks with mean 0 and
finite variance. The key to the proof of (9.1) is (4.5):

(9.2)   P_0( max_{0≤s≤1} B_s > a ) = P_0(T_a ≤ 1) = 2 P_0(B_1 ≥ a)

To identify the asymptotic behavior of the right-hand side of (9.2) as a → ∞,
we use (1.4) from Chapter 1:
(9.3)   ∫_x^∞ exp(−y^2/2) dy ≤ (1/x) exp(−x^2/2)

(9.4)   ∫_x^∞ exp(−y^2/2) dy ∼ (1/x) exp(−x^2/2)   as x → ∞

where f(x) ∼ g(x) means f(x)/g(x) → 1 as x → ∞. The last result and
Brownian scaling imply that

P_0(B_t > (tf(t))^{1/2}) ∼ κ f(t)^{−1/2} exp(−f(t)/2)

where κ = (2π)^{−1/2} is a constant that we will try to ignore below. The last
result implies that if ε > 0, then

Σ_{n=1}^∞ P_0(B_n > (nf(n))^{1/2})   < ∞   when f(n) = (2 + ε) log n
                                      = ∞   when f(n) = (2 − ε) log n

and hence by the Borel–Cantelli lemma that

lim sup_{n→∞} B_n/(2n log n)^{1/2} ≤ 1   a.s.

To replace log n by log log n, we have to look along exponentially growing sequences. Let t_n = α^n, where α > 1.
P_0( max_{t_n ≤ s ≤ t_{n+1}} B_s > (t_n f(t_n))^{1/2} ) ≤ P_0( max_{0≤s≤t_{n+1}} B_s/t_{n+1}^{1/2} > (f(t_n)/α)^{1/2} )
                                                       ≤ 2κ (f(t_n)/α)^{−1/2} exp(−f(t_n)/2α)

by (9.2) and (9.3). If f(t) = 2α^2 log log t, then

log log t_n = log(n log α) = log n + log log α

so exp(−f(t_n)/2α) ≤ C_α n^{−α}, where C_α is a constant that depends only on α,
and hence

Σ_{n=1}^∞ P_0( max_{t_n ≤ s ≤ t_{n+1}} B_s > (t_n f(t_n))^{1/2} ) < ∞

Since t → (tf(t))^{1/2} is increasing and α > 1 is arbitrary, it follows that

(a)   lim sup_{t→∞} B_t/(2t log log t)^{1/2} ≤ 1

To prove the other half of (9.1), again let t_n = α^n, but this time α will be large,
since to get independent events, we will look at
P_0( B(t_{n+1}) − B(t_n) > (t_{n+1} f(t_{n+1}))^{1/2} ) = P_0( B_1 > (β f(t_{n+1}))^{1/2} )

where β = t_{n+1}/(t_{n+1} − t_n) = α/(α − 1) > 1. The last quantity is

≥ (κ/2) (β f(t_{n+1}))^{−1/2} exp(−β f(t_{n+1})/2)

if n is large, by (9.4). If f(t) = (2/β^2) log log t, then log log t_n = log n + log log α,
so

exp(−β f(t_{n+1})/2) ≥ C_α n^{−1/β}

where C_α is a constant that depends only on α, and hence

Σ_{n=1}^∞ P_0( B(t_{n+1}) − B(t_n) > (t_{n+1} f(t_{n+1}))^{1/2} ) = ∞

Since the events in question are independent, it follows from the second Borel–Cantelli lemma that

(b)   B(t_{n+1}) − B(t_n) > ((2/β^2) t_{n+1} log log t_{n+1})^{1/2}   i.o.

From (a), we get

(c)   lim sup_{n→∞} B(t_n)/(2t_n log log t_n)^{1/2} ≤ 1

Since t_n = t_{n+1}/α and t → log log t is increasing, combining (b) and (c), and
recalling β = α/(α − 1), gives

lim sup_{n→∞} B(t_{n+1})/(2t_{n+1} log log t_{n+1})^{1/2} ≥ (α − 1)/α − α^{−1/2}

Letting α → ∞ now gives the desired lower bound, and the proof of (9.1) is
complete.
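The proof leans repeatedly on the Gaussian tail estimates (9.3) and (9.4). A quick numerical check (illustrative only) uses the identity ∫_x^∞ e^{−y²/2} dy = √(π/2) · erfc(x/√2), available through the standard library's complementary error function:

```python
import math

def upper_tail(x):
    """Integral of exp(-y^2/2) from x to infinity, via erfc."""
    return math.sqrt(math.pi / 2) * math.erfc(x / math.sqrt(2))

# (9.3): the tail never exceeds exp(-x^2/2)/x.
for x in (2.0, 4.0, 8.0):
    assert upper_tail(x) <= math.exp(-x * x / 2) / x

# (9.4): the ratio tends to 1; at x = 8 it is already within about 2%.
ratio = upper_tail(8.0) / (math.exp(-32.0) / 8.0)
```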
Exercise 9.1. Let t_k = exp(e^k). Show that

lim sup_{k→∞} B(t_k)/(2t_k log log log t_k)^{1/2} = 1   a.s.

(2.8) implies that X_t = tB(1/t) is a Brownian motion. Changing variables
and using (9.1), we conclude

(9.5)   lim sup_{t→0} B_t/(2t log log(1/t))^{1/2} = 1   a.s.

To take a closer look at the local behavior of Brownian paths, we note that
Blumenthal's 0–1 law (2.6) implies P_0(B_t < h(t) for all t sufficiently small) ∈
{0, 1}. h is said to belong to the upper class if the probability is 1, and to the lower
class if it is 0.

(9.6) Kolmogorov's test. If h(t) ↑ and t^{−1/2} h(t) ↓, then h is upper or lower
class according as

∫_0^1 t^{−3/2} h(t) exp(−h^2(t)/2t) dt   converges or diverges

The first proof of this was given by Petrovsky (1935). Recalling (4.1), we
see that the integrand is the probability of hitting h(t) at time t. To see what
(9.6) says, define lg_k(t) = log(lg_{k−1}(t)) for k ≥ 2 and t > a_k = exp(a_{k−1}),
where lg_1(t) = log(t) and a_1 = 0. A little calculus shows that when n ≥ 4,
h(t) = { 2t ( lg_2(1/t) + (3/2) lg_3(1/t) + Σ_{m=4}^{n−1} lg_m(1/t) + (1 + ε) lg_n(1/t) ) }^{1/2}

is upper or lower class according as ε > 0 or ε ≤ 0.

Approximating h from above by piecewise constant functions, it is easy to
show that if the integral in (9.6) converges, h(t) is an upper class function. The
proof of the other direction is much more difficult; see Motoo (1959) or Section
4.12 of Itô and McKean (1965).
Turning to random walk, we will prove a result due to Hartman and Wintner (1941):

(9.7) Theorem. If X_1, X_2, . . . are i.i.d. with EX_i = 0 and EX_i^2 = 1, then

lim sup_{n→∞} S_n/(2n log log n)^{1/2} = 1

Proof  By (6.3), we can write S_n = B(T_n) with T_n/n → 1 a.s. As in the proof
of Donsker's theorem, this is all we will use in the argument below. (9.7) will
follow from (9.1) once we show

(a)   (S_{[t]} − B_t)/(t log log t)^{1/2} → 0   a.s.

To do this, we begin by observing that if ε > 0 and t ≥ t_0(ω),

(b)   T_{[t]} ∈ [t/(1 + ε), t(1 + ε)]

To estimate S_{[t]} − B_t, we let M(t) = sup{ |B(s) − B(t)| : t/(1 + ε) ≤ s ≤ t(1 + ε) }.
To control the last quantity, we let t_k = (1 + ε)^k and notice that if t_k ≤ t ≤ t_{k+1},

M(t) ≤ sup{ |B(s) − B(t_{k−1})| : t_{k−1} ≤ s ≤ t_{k+2} } + |B(t) − B(t_{k−1})|
     ≤ 2 sup{ |B(s) − B(t_{k−1})| : t_{k−1} ≤ s ≤ t_{k+2} }

Noticing that t_{k+2} − t_{k−1} = δ t_{k−1}, where δ = (1 + ε)^3 − 1, scaling implies

P( max_{t_{k−1} ≤ s ≤ t_{k+2}} |B(s) − B(t_{k−1})| > (3δ t_{k−1} log log t_{k−1})^{1/2} )
= P( max_{0≤r≤1} |B(r)| > (3 log log t_{k−1})^{1/2} )
≤ 2κ (3 log log t_{k−1})^{−1/2} exp(−3 log log t_{k−1}/2)

by a now familiar application of (9.2) and (9.3). Summing over k and using (b)
gives

lim sup_{t→∞} (S_{[t]} − B_t)/(t log log t)^{1/2} ≤ (3δ)^{1/2}

If we recall δ = (1 + ε)^3 − 1 and let ε ↓ 0, (a) follows and the proof of (9.7) is
complete.

Remark. Since the proof of (9.7) only requires S_n = B(T_n) with T_n/n → 1
a.s., the conclusion generalizes to many of the dependent sequences considered
in Section 7.7.
Exercise 9.2. Show that if E|X_i|^α = ∞ for some α < 2, then

lim sup_{n→∞} |X_n|/n^{1/α} = ∞   a.s.

so the law of the iterated logarithm fails.
Strassen (1965) has shown an exact converse: if (9.7) holds, then EX_i = 0
and EX_i^2 = 1. Another one of his contributions to this subject is
(9.8) Strassen's (1964) invariance principle. Let X_1, X_2, . . . be i.i.d. with
EX_i = 0 and EX_i^2 = 1, let S_n = X_1 + · · · + X_n, and let S(n·) be the usual
linear interpolation. The limit set (i.e., the collection of limits of convergent
subsequences) of

Z_n(·) = (2n log log n)^{−1/2} S(n·)   for n ≥ 3

is K = { f : f(x) = ∫_0^x g(y) dy with ∫_0^1 g(y)^2 dy ≤ 1 }.

Jensen's inequality implies f(1)^2 ≤ ∫_0^1 g(y)^2 dy ≤ 1, with equality if and
only if f(t) = t, so (9.8) contains (9.7) as a special case and provides some
information about how the large value of S_n came about.
Exercise 9.3. Give a direct proof that, under the hypotheses of (9.8), the
limit set of {Sn /(2n log log n)1/2 } is [−1, 1].
History Lesson. The LIL for sums of independent r.v.'s grew from the early
efforts of Hausdorff (1913) and Hardy and Littlewood (1914) to determine the
rate of convergence in Borel's theorem about normal numbers (see Example 2.4
in Chapter 6). The LIL for Bernoulli variables was reached in five steps:

Hausdorff (1913)               S_n = o(n^{1/2+ε})
Hardy and Littlewood (1914)    S_n = O((n log n)^{1/2})
Steinhaus (1922)               lim sup S_n/(2n log n)^{1/2} ≤ 1
Khintchine (1923)              S_n = O((n log log n)^{1/2})
Khintchine (1924)              lim sup S_n/((n/2) log log n)^{1/2} = 1

We have n/2 instead of 2n in the last result since Bernoulli variables have variance 1/4. The LIL was proved for bounded independent r.v.'s by Kolmogorov
(1929), who did not assume the summands were identically distributed. As
mentioned earlier, (9.7) is due to Hartman and Wintner (1941). Strassen (1964)
gave the proof (of (9.8) and hence of (9.7)) using Skorokhod imbedding. The version above follows the treatment in Breiman (1968). For generalizations of (9.7)
to martingales and processes with stationary increments, see Heyde and Scott
(1973), Hall and Heyde (1976), (1980).
There are versions of Kolmogorov's test for random walks. We say that c_n
belongs to the upper class U if P(S_n > c_n i.o.) = 0, and to the lower class L
otherwise. For Bernoulli variables, P. Lévy (1937) showed

(2t log log t + a t log log log t)^{1/2}   ∈ U  if a > 3,   ∈ L  if a ≤ 1

Kolmogorov claimed, and later Erdős (1942) proved, that c_n ∈ U or L according as

Σ_{n=1}^∞ n^{−3/2} c_n exp(−c_n^2/2n) < ∞ or = ∞

Feller (1943) gave a general version of the test for independent variables without
assuming they all had the same distribution. His conditions are stronger than
finite variance in the i.i.d. case.

Appendix: Measure Theory
This Appendix gives a complete treatment of the results from measure theory
that we will need.

A.1. Lebesgue–Stieltjes Measures

To prove the existence of Lebesgue measure (and some related more general
measures), we will use the Carathéodory extension theorem, (1.1). To state
that result, we need several definitions in addition to the ones given in Section
1 of Chapter 1. A collection A of subsets of Ω is called an algebra (or field) if
A, B ∈ A implies A^c and A ∪ B are in A. Since A ∩ B = (A^c ∪ B^c)^c, it follows
that A ∩ B ∈ A. Obviously, a σ-algebra is an algebra. Two cases in which the
converse is false are:
Example 1.1. Ω = Z = the integers, A = the collection of A ⊂ Z so that A
or A^c is finite.

Example 1.2. Ω = R, A = the collection of sets of the form

∪_{i=1}^k (a_i, b_i]   where −∞ ≤ a_i < b_i ≤ ∞

Exercise 1.1. (i) Show that if F_1 ⊂ F_2 ⊂ . . . are σ-algebras, then ∪_i F_i is an
algebra. (ii) Give an example to show that ∪_i F_i need not be a σ-algebra.

Exercise 1.2. A set A ⊂ {1, 2, . . .} is said to have asymptotic density θ if

lim_{n→∞} |A ∩ {1, 2, . . . , n}|/n = θ

Let A be the collection of sets for which the asymptotic density exists. Is A a
σ-algebra? an algebra?
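Before tackling the exercise, it may help to compute a density or two. A minimal sketch (the function name and the example set are our own choices):

```python
def asymptotic_density(indicator, n):
    """|A ∩ {1,...,n}| / n for a set A given by its indicator function.
    The asymptotic density, when it exists, is the limit as n grows."""
    return sum(1 for k in range(1, n + 1) if indicator(k)) / n

# Multiples of 3 have asymptotic density 1/3.
d = asymptotic_density(lambda k: k % 3 == 0, 30000)
```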
By a measure on an algebra A, we mean a set function µ with

(i) µ(A) ≥ µ(∅) = 0 for all A ∈ A, and
(ii) if A_i ∈ A are disjoint and their union is in A, then

µ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i)

The italicized clause is unnecessary if A is a σ-algebra, so in that case the new
definition coincides with the old one. The next exercise generalizes Exercise 1.1
in Chapter 1.

Exercise 1.3. We assume that all the sets mentioned are in A.
(i) monotonicity. If A ⊂ B then µ(A) ≤ µ(B).
(ii) subadditivity. If A ⊂ ∪_i A_i then µ(A) ≤ Σ_i µ(A_i).
(iii) continuity from below. If A_i ↑ A (i.e., A_1 ⊂ A_2 ⊂ . . . and ∪_i A_i = A) then
µ(A_i) ↑ µ(A).
(iv) continuity from above. If A_i ↓ A (i.e., A_1 ⊃ A_2 ⊃ . . . and ∩_i A_i = A) and
µ(A_1) < ∞ then µ(A_i) ↓ µ(A) as i ↑ ∞.

µ is said to be σ-finite if there is a sequence of sets A_n ∈ A so that
µ(A_n) < ∞ and ∪_n A_n = Ω. Letting A'_1 = A_1 and, for n ≥ 2,

A'_n = ∪_{m=1}^n A_m   or   A'_n = A_n ∩ (∩_{m=1}^{n−1} A_m^c) ∈ A

we can without loss of generality assume that A'_n ↑ Ω or that the A'_n are disjoint.
(1.1) Carathéodory's Extension Theorem. Let µ be a σ-finite measure on
an algebra A. Then µ has a unique extension to σ(A) = the smallest σ-algebra
containing A.

Exercise 1.4. Let Z = the integers and A = the collection of subsets so that
A or A^c is finite. Let µ(A) = 0 in the first case and µ(A) = 1 in the second.
Show that µ has no extension to σ(A).
The next section is devoted to the proof of (1.1). To check the hypotheses
of (1.1) for Lebesgue measure, we will prove a theorem, (1.3), that will be
useful for other examples. To state that result, we will need several definitions.
A collection S of sets is said to be a semialgebra if (i) it is closed under
intersection, i.e., S, T ∈ S implies S ∩ T ∈ S, and (ii) if S ∈ S then S^c is a
finite disjoint union of sets in S. An important example of a semialgebra is

R_o^d = the collection of sets of the form (a_1, b_1] × · · · × (a_d, b_d] ⊂ R^d,   where −∞ ≤ a_i < b_i ≤ ∞

Exercise 1.5. Show that σ(R_o^d) = R^d, the Borel subsets of R^d.

(1.2) Lemma. If S is a semialgebra, then S̄ = {finite disjoint unions of sets in
S} is an algebra, called the algebra generated by S.
Proof  Suppose A = +_i S_i and B = +_j T_j, where + denotes disjoint union and
we assume the index sets are finite. Then A ∩ B = +_{i,j} S_i ∩ T_j ∈ S̄. As for
complements, if A = +_i S_i then A^c = ∩_i S_i^c. The definition of S implies S_i^c ∈ S̄.
We have shown that S̄ is closed under intersection, so it follows by induction
that A^c ∈ S̄.
Let λ denote Lebesgue measure. The definition gives the values of λ on
a semialgebra S (the half-open intervals). It is easy to see how to extend the
definition to the algebra S̄ defined in (1.2): we let λ(+_i (a_i, b_i]) = Σ_i (b_i − a_i).
To assert that λ has an extension to σ(S) = R, we have to check that λ is a
measure on S̄, i.e., if A ∈ S̄ is a countable disjoint union of sets A_i ∈ S̄, then
λ(A) = Σ_i λ(A_i). The next result simplifies that task somewhat.
(1.3) Theorem. Let S be a semialgebra and let µ defined on S have µ(∅) = 0.
Suppose (i) if S ∈ S is a finite disjoint union of sets S_i ∈ S, then µ(S) =
Σ_i µ(S_i), and (ii) if S_i, S ∈ S with S = +_{i≥1} S_i, then µ(S) ≤ Σ_i µ(S_i). Then
µ has a unique extension µ̄ that is a measure on S̄, the algebra generated by S.
If the extension is σ-finite, then by (1.1) there is a unique extension ν that is a
measure on σ(S).

Remark. In (ii) above, and in what follows, i ≥ 1 indicates a countable union,
while a plain subscript i or j indicates a finite union.
Proof  We define µ̄ on S̄ by µ̄(A) = Σ_i µ(S_i) whenever A = +_i S_i. To check
that µ̄ is well defined, suppose that A = +_j T_j as well, and observe S_i = +_j (S_i ∩ T_j)
and T_j = +_i (S_i ∩ T_j), so (i) implies

Σ_i µ(S_i) = Σ_{i,j} µ(S_i ∩ T_j) = Σ_j µ(T_j)

Our next result takes the first step toward proving that µ̄ is a measure on S̄.
It includes an extra statement, (b), that will be useful in checking (ii).
(1.4) Lemma. Suppose only that (i) holds.
(a) If A, B_i ∈ S̄ with A = +_{i=1}^n B_i, then µ̄(A) = Σ_i µ̄(B_i).
(b) If A, B_i ∈ S̄ with A ⊂ ∪_{i=1}^n B_i, then µ̄(A) ≤ Σ_i µ̄(B_i).
Proof  Observe that it follows from the definition that if A = +_i B_i is a finite
disjoint union of sets in S̄ and B_i = +_j S_{i,j}, then

µ̄(A) = Σ_{i,j} µ(S_{i,j}) = Σ_i µ̄(B_i)

To prove (b), we begin with the case n = 1, B_1 = B. B = A + (B ∩ A^c) and
B ∩ A^c ∈ S̄, so

µ̄(A) ≤ µ̄(A) + µ̄(B ∩ A^c) = µ̄(B)

To handle n > 1, let F_k = B_1^c ∩ . . . ∩ B_{k−1}^c ∩ B_k and note

∪_i B_i = F_1 + · · · + F_n
A = A ∩ (∪_i B_i) = (A ∩ F_1) + · · · + (A ∩ F_n)

so using (a), (b) with n = 1, and (a) again,

µ̄(A) = Σ_{k=1}^n µ̄(A ∩ F_k) ≤ Σ_{k=1}^n µ̄(F_k) = µ̄(∪_i B_i)
To extend the additivity property to A ∈ S̄ that are countable disjoint
unions A = +_{i≥1} B_i, where B_i ∈ S̄, we observe that each B_i = +_j S_{i,j} with
S_{i,j} ∈ S and Σ_{i≥1} µ̄(B_i) = Σ_{i≥1,j} µ(S_{i,j}), so replacing the B_i's by S_{i,j}'s we
can without loss of generality suppose that the B_i ∈ S. Now A ∈ S̄ implies
A = +_j T_j (a finite disjoint union) and T_j = +_{i≥1} T_j ∩ B_i, so (ii) implies

µ(T_j) ≤ Σ_{i≥1} µ(T_j ∩ B_i)

Summing over j and observing that nonnegative numbers can be summed in
any order,

µ̄(A) = Σ_j µ(T_j) ≤ Σ_{i≥1} Σ_j µ(T_j ∩ B_i) = Σ_{i≥1} µ(B_i)

the last equality following from (i). To prove the opposite inequality, let A_n =
B_1 + · · · + B_n, and C_n = A ∩ A_n^c. C_n ∈ S̄, since S̄ is an algebra, so finite
additivity of µ̄ implies

µ̄(A) = µ̄(B_1) + · · · + µ̄(B_n) + µ̄(C_n) ≥ µ̄(B_1) + · · · + µ̄(B_n)

and letting n → ∞, µ̄(A) ≥ Σ_{i≥1} µ̄(B_i).
With (1.3) established, we are ready to prove the existence of Lebesgue
measure and a number of other measures.

(1.5) Theorem. Suppose F is (i) nondecreasing and (ii) right continuous,
i.e., F(y) ↓ F(x) when y ↓ x. There is a unique measure µ on (R, R) with

µ((a, b]) = F(b) − F(a)   for all a, b

Remark. A function F that has properties (i) and (ii) is called a Stieltjes
measure function. To see the reasons for the two conditions, observe that (a)
if µ is a measure then F(b) − F(a) ≥ 0, and (b) part (iv) of Exercise 1.3 implies
that if µ((a, y]) < ∞ and y ↓ x > a,

F(y) − F(a) = µ((a, y]) ↓ µ((a, x]) = F(x) − F(a)

Conversely, if µ is a measure on R with µ((a, b]) < ∞ when −∞ < a < b < ∞,
then

F(x) = c + µ((0, x])   for x ≥ 0
F(x) = c − µ((x, 0])   for x < 0

is a function with F(b) − F(a) = µ((a, b]), and any such function has this
form with c = F(0).
Proof  Let S be the semialgebra of half-open intervals (a, b] with −∞ ≤ a <
b ≤ ∞. To define µ on S, we begin by observing that

F(∞) = lim_{x↑∞} F(x)   and   F(−∞) = lim_{x↓−∞} F(x)

exist, and µ((a, b]) = F(b) − F(a) makes sense for all −∞ ≤ a < b ≤ ∞ since
F(∞) > −∞ and F(−∞) < ∞.

If (a, b] = +_{i=1}^n (a_i, b_i], then after relabeling the intervals we must have
a_1 = a, b_n = b, and a_i = b_{i−1} for 2 ≤ i ≤ n, so condition (i) in (1.3) holds.
To check (ii), suppose first that −∞ < a < b < ∞, and (a, b] ⊂ ∪_{i≥1} (a_i, b_i],
where (without loss of generality) −∞ < a_i < b_i < ∞. Pick δ > 0 so that

F(a + δ) < F(a) + ε

and pick η_i so that

F(b_i + η_i) < F(b_i) + ε 2^{−i}

The open intervals (a_i, b_i + η_i) cover [a + δ, b], so there is a finite subcover
(α_j, β_j), 1 ≤ j ≤ J. Since (a + δ, b] ⊂ ∪_{j=1}^J (α_j, β_j], (b) in (1.4) implies

F(b) − F(a + δ) ≤ Σ_{j=1}^J (F(β_j) − F(α_j)) ≤ Σ_{i=1}^∞ (F(b_i + η_i) − F(a_i))

So, by the choice of δ and η_i,

F(b) − F(a) ≤ 2ε + Σ_{i=1}^∞ (F(b_i) − F(a_i))

and since ε is arbitrary, we have proved the result in the case −∞ < a < b < ∞.
To remove the last restriction, observe that if (a, b] ⊂ ∪_i (a_i, b_i] and (A, B] ⊂
(a, b] has −∞ < A < B < ∞, then we have

F(B) − F(A) ≤ Σ_{i=1}^∞ (F(b_i) − F(a_i))

Since the last result holds for any finite (A, B] ⊂ (a, b], the desired result follows.

Our next goal is to prove a version of (1.5) for R^d. The first step is to
introduce the assumptions on the defining function F.

(i) F is nondecreasing, i.e., if x ≤ y (meaning x_i ≤ y_i for all i), then F(x) ≤ F(y).
(ii) F is right continuous, i.e., lim_{y↓x} F(y) = F(x) (here y ↓ x means each
coordinate y_i ↓ x_i).

To formulate the third and final condition, let

A = (a_1, b_1] × · · · × (a_d, b_d]
V = {a_1, b_1} × · · · × {a_d, b_d}

where −∞ < a_i < b_i < ∞. To emphasize that ∞'s are not allowed, we will call
A a finite rectangle. Then V = the vertices of the rectangle A. If v ∈ V, let

sgn(v) = (−1)^{# of a's in v}
∆_A F = Σ_{v∈V} sgn(v) F(v)

We will let µ(A) = ∆_A F, so we must assume

(iii) ∆_A F ≥ 0 for all rectangles A.

To see the reason for this definition, consider the special case d = 2 and then
divide one large rectangle into four small ones. For more on this assumption,
see Section 2.9.
Example 1.3. Suppose F(x) = Π_{i=1}^d F_i(x_i), where the F_i satisfy (i) and (ii) of
(1.5). In this case,

∆_A F = Π_{i=1}^d (F_i(b_i) − F_i(a_i))

When F_i(x) = x for all i, the resulting measure is Lebesgue measure on R^d.
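The inclusion-exclusion sum ∆_A F is mechanical enough to code directly. A minimal sketch (the function name and the test rectangles are our own choices), which for the product F of Example 1.3 recovers the volume of the rectangle:

```python
from itertools import product

def delta_A(F, a, b):
    """Sum of sgn(v) * F(v) over the vertices v of (a1,b1] x ... x (ad,bd],
    where sgn(v) = (-1)^(number of a-coordinates appearing in v)."""
    d = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=d):     # 0 -> use a_i, 1 -> use b_i
        v = tuple(b[i] if choice[i] else a[i] for i in range(d))
        sgn = (-1) ** (d - sum(choice))          # count of a-coordinates
        total += sgn * F(v)
    return total

# Product case with F_i(x) = x: Lebesgue measure (area) of the rectangle.
F = lambda v: v[0] * v[1]
area = delta_A(F, (0.5, 1.0), (2.5, 4.0))        # (2.5-0.5)*(4.0-1.0) = 6.0
```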
(1.6) Theorem. Suppose F : R^d → [0, 1] satisfies (i)–(iii) in Section 2.9. Then
there is a unique probability measure µ on (R^d, R^d) so that µ(A) = ∆_A F for
all finite rectangles.

Proof  Let S be the semialgebra of rectangles A = (a, b], where −∞ ≤ a_i <
b_i ≤ ∞. We let µ(A) = ∆_A F for all finite rectangles and then use monotonicity
to extend the definition to all rectangles.

To check (i), call A = +_k B_k a regular subdivision of A if there are
sequences a_i = α_{i,0} < α_{i,1} < . . . < α_{i,n_i} = b_i so that each rectangle B_k has the
form

(α_{1,j_1−1}, α_{1,j_1}] × · · · × (α_{d,j_d−1}, α_{d,j_d}]   where 1 ≤ j_i ≤ n_i

It is easy to see that for regular subdivisions, µ(A) = Σ_k µ(B_k). (First consider
the case in which all the endpoints are finite and then take limits to get the
general case.) To extend this result to a general finite subdivision A = +_j A_j,
subdivide further to get a regular one.
The proof of (ii) is almost identical to that in (1.5). To make things easier
to write and to bring out the analogies with (1.5), we let

(x, y) = (x_1, y_1) × · · · × (x_d, y_d)
(x, y] = (x_1, y_1] × · · · × (x_d, y_d]
[x, y] = [x_1, y_1] × · · · × [x_d, y_d]

for x, y ∈ R^d. Suppose first that −∞ < a < b < ∞, where the inequalities
mean that each component is finite, and suppose (a, b] ⊂ ∪_{i≥1} (a_i, b_i], where
(without loss of generality) −∞ < a_i < b_i < ∞. Let 1̄ = (1, . . . , 1), pick δ > 0
so that

µ((a, b]) < µ((a + δ1̄, b]) + ε

and pick η_i so that

µ((a_i, b_i + η_i 1̄]) < µ((a_i, b_i]) + ε 2^{−i}

The open rectangles (a_i, b_i + η_i 1̄) cover [a + δ1̄, b], so there is a finite subcover
(α_j, β_j), 1 ≤ j ≤ J. Since (a + δ1̄, b] ⊂ ∪_{j=1}^J (α_j, β_j], (b) in (1.4) implies

µ((a + δ1̄, b]) ≤ Σ_{j=1}^J µ((α_j, β_j]) ≤ Σ_{i=1}^∞ µ((a_i, b_i + η_i 1̄])

So, by the choice of δ and η_i,

µ((a, b]) ≤ 2ε + Σ_{i=1}^∞ µ((a_i, b_i])

and since ε is arbitrary, we have proved the result in the case −∞ < a < b < ∞.
The proof can now be completed exactly as before.

A.2. Carathéodory's Extension Theorem
This section is devoted to the proof of (1.1). The proof is slick but rather mysterious. The reader should not worry too much about the details but concentrate
on the structure of the proof and the definitions introduced.

Uniqueness. We will prove that the extension is unique before tackling the
more difficult problem of proving its existence. The key to our uniqueness proof
is Dynkin's π−λ theorem, a result that we will use many times in the book. As
usual, we need a few definitions before we can state the result. P is said to be
a π-system if it is closed under intersection, i.e., if A, B ∈ P then A ∩ B ∈ P.
For example, the collection of rectangles (a_1, b_1] × · · · × (a_d, b_d] is a π-system.
L is said to be a λ-system if it satisfies: (i) Ω ∈ L. (ii) If A, B ∈ L and A ⊂ B,
then B − A ∈ L. (iii) If A_n ∈ L and A_n ↑ A, then A ∈ L. The reader will see
in a moment that the next result is just what we need to prove uniqueness of
the extension.

(2.1) π−λ Theorem. If P is a π-system and L is a λ-system that contains P,
then σ(P) ⊂ L.
Proof  We will show that

(a) if ℓ(P) is the smallest λ-system containing P, then ℓ(P) is a σ-field.

The desired result follows from (a). To see this, note that since σ(P) is the
smallest σ-field containing P and ℓ(P) is the smallest λ-system containing P, we have

σ(P) ⊂ ℓ(P) ⊂ L

To prove (a), we begin by noting that a λ-system that is closed under intersection
is a σ-field, since

if A ∈ L then A^c = Ω − A ∈ L
A ∪ B = (A^c ∩ B^c)^c
∪_{i=1}^n A_i ↑ ∪_{i=1}^∞ A_i as n ↑ ∞

Thus, it is enough to show

(b) ℓ(P) is closed under intersection.

To prove (b), we let G_A = {B : A ∩ B ∈ ℓ(P)} and prove

(c) if A ∈ ℓ(P), then G_A is a λ-system.

To check this, we note: (i) Ω ∈ G_A since A ∈ ℓ(P).
(ii) If B, C ∈ G_A and B ⊃ C, then A ∩ (B − C) = (A ∩ B) − (A ∩ C) ∈ ℓ(P),
since A ∩ B, A ∩ C ∈ ℓ(P) and ℓ(P) is a λ-system.
(iii) If B_n ∈ G_A and B_n ↑ B, then A ∩ B_n ↑ A ∩ B ∈ ℓ(P), since A ∩ B_n ∈ ℓ(P)
and ℓ(P) is a λ-system.

To get from (c) to (b), we note that since P is a π-system,

if A ∈ P then G_A ⊃ P, and so (c) implies G_A ⊃ ℓ(P)

i.e., if A ∈ P and B ∈ ℓ(P) then A ∩ B ∈ ℓ(P). Interchanging A and B in the
last sentence: if A ∈ ℓ(P) and B ∈ P then A ∩ B ∈ ℓ(P), but this implies

if A ∈ ℓ(P) then G_A ⊃ P, and so (c) implies G_A ⊃ ℓ(P)

This conclusion implies that if A, B ∈ ℓ(P) then A ∩ B ∈ ℓ(P), which proves
(b) and completes the proof.
To prove that the extension in (1.1) is unique, we will show:

(2.2) Theorem. Let P be a π-system. If ν_1 and ν_2 are measures (on σ-fields
F_1 and F_2) that agree on P and there is a sequence A_n ∈ P with A_n ↑ Ω and
ν_i(A_n) < ∞, then ν_1 and ν_2 agree on σ(P).

Proof  Let A ∈ P have ν_1(A) = ν_2(A) < ∞. Let

L = {B ∈ σ(P) : ν_1(A ∩ B) = ν_2(A ∩ B)}

We will now show that L is a λ-system. Since A ∈ P, ν_1(A) = ν_2(A), and so
Ω ∈ L. If B, C ∈ L with C ⊂ B, then

ν_1(A ∩ (B − C)) = ν_1(A ∩ B) − ν_1(A ∩ C)
= ν_2(A ∩ B) − ν_2(A ∩ C) = ν_2(A ∩ (B − C))

Here we use the fact that ν_i(A) < ∞ to justify the subtraction. Finally, if
B_n ∈ L and B_n ↑ B, then part (iii) of Exercise 1.3 implies

ν_1(A ∩ B) = lim_{n→∞} ν_1(A ∩ B_n) = lim_{n→∞} ν_2(A ∩ B_n) = ν_2(A ∩ B)

Since P is closed under intersection by assumption, the π−λ theorem implies
L ⊃ σ(P), i.e., if A ∈ P with ν_1(A) = ν_2(A) < ∞ and B ∈ σ(P), then
ν_1(A ∩ B) = ν_2(A ∩ B). Letting A_n ∈ P with A_n ↑ Ω, ν_1(A_n) = ν_2(A_n) < ∞,
and using the last result and part (iii) of Exercise 1.3, we have the desired
conclusion.
Exercise 2.1. Give an example of two probability measures µ ≠ ν on F = all subsets of {1, 2, 3, 4} that agree on a collection of sets C with σ(C) = F, i.e., the smallest σ-algebra containing C is F.
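Exercise 2.1 can be explored by brute force on a computer. The collection C and the two measures below are one hypothetical candidate of our own choosing (not the text's): µ and ν agree on C, C fails to be a π-system, and they disagree on σ(C) = all subsets — illustrating why (2.2) needs the π-system hypothesis.

```python
omega = frozenset({1, 2, 3, 4})

def sigma(collection):
    """Smallest σ-algebra on omega containing `collection`: close under
    complement and union until a fixed point (finite space, so this works)."""
    sets = {frozenset(), omega} | {frozenset(s) for s in collection}
    while True:
        new = set(sets)
        new |= {omega - s for s in sets}
        new |= {a | b for a in sets for b in sets}
        if new == sets:
            return sets
        sets = new

# Two probability measures given by point masses (our invented example).
mu = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
nu = {1: 0.40, 2: 0.10, 3: 0.10, 4: 0.40}

def measure(weights, s):
    return sum(weights[x] for x in s)

C = [{1, 2}, {1, 3}]     # not a π-system: {1,2} ∩ {1,3} = {1} is missing
F = sigma(C)

# µ and ν agree on C ...
assert all(abs(measure(mu, s) - measure(nu, s)) < 1e-12 for s in C)
# ... and σ(C) is all 16 subsets of Ω ...
assert len(F) == 16
# ... yet µ ≠ ν: they disagree on {1} ∈ σ(C).
assert abs(measure(mu, {1}) - measure(nu, {1})) > 0.1
```

The uniqueness theorem (2.2) is not contradicted: C is not closed under intersection, so the π-λ argument never gets started.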
Existence. Our next step is to show that a measure (not necessarily σ-finite) defined on an algebra A has an extension to the σ-algebra generated by A. If E ⊂ Ω, we let µ*(E) = inf Σ_i µ(A_i), where the infimum is taken over all sequences A_i from A so that E ⊂ ∪_i A_i. Intuitively, if ν is a measure that agrees with µ on A, then it follows from part (ii) of Exercise 1.3 that

ν(E) ≤ ν(∪_i A_i) ≤ Σ_i ν(A_i) = Σ_i µ(A_i)

so µ*(E) is an upper bound on the measure of E. Intuitively, the measurable sets are the ones for which the upper bound is tight. Formally, we say that E is measurable if

(∗) µ*(F) = µ*(F ∩ E) + µ*(F ∩ E^c) for all sets F ⊂ Ω

The last definition is not very intuitive, but we will see in the proofs below that it works very well. It is immediate from the definition that µ* has the following properties:

(i) monotonicity. If E ⊂ F then µ*(E) ≤ µ*(F).

(ii) subadditivity. If F ⊂ ∪_i F_i, a countable union, then µ*(F) ≤ Σ_i µ*(F_i).

Any set function with µ*(∅) = 0 that satisfies (i) and (ii) is called an outer measure. Using (ii) with F_1 = F ∩ E and F_2 = F ∩ E^c (and F_i = ∅ otherwise), we see that to prove a set is measurable, it is enough to show

(∗′) µ*(F) ≥ µ*(F ∩ E) + µ*(F ∩ E^c)

We begin by showing that our new definition extends the old one.
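These definitions can be made concrete on a finite Ω, where the infimum over covers is a finite minimum. The algebra, measure, and test sets below are all our own invented toy data: µ* is computed by brute force over covers, and the Carathéodory condition (∗) is checked exhaustively. A singleton like {0} fails (∗) because the cover-based upper bound is not tight.

```python
from itertools import combinations

Omega = frozenset({0, 1, 2, 3})
# A small algebra A and an additive µ on it (illustrative choice).
A = [frozenset(), frozenset({0, 1}), frozenset({2, 3}), Omega]
mu = {frozenset(): 0.0, frozenset({0, 1}): 1.0,
      frozenset({2, 3}): 1.0, Omega: 2.0}

def mu_star(E):
    """Outer measure: inf of Σ µ(A_i) over covers of E by sets from A."""
    best = float("inf")
    for r in range(1, len(A) + 1):
        for cover in combinations(A, r):
            if E <= frozenset().union(*cover):
                best = min(best, sum(mu[a] for a in cover))
    return best

def measurable(E):
    """Carathéodory's condition (∗), checked against every F ⊂ Ω."""
    subsets = [frozenset(s) for r in range(5) for s in combinations(Omega, r)]
    return all(abs(mu_star(F) - (mu_star(F & E) + mu_star(F - E))) < 1e-12
               for F in subsets)

assert mu_star(frozenset({0})) == 1.0     # cheapest cover of {0} is {0,1}
assert measurable(frozenset({0, 1}))      # sets of A are measurable
assert not measurable(frozenset({0}))     # fails (∗): take F = {0,1}
```

Taking F = {0, 1} shows the failure directly: µ*(F) = 1 while µ*(F ∩ {0}) + µ*(F − {0}) = 1 + 1 = 2.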
(2.3) Lemma. If A ∈ A then µ*(A) = µ(A), and A is measurable.

Proof. Part (ii) of Exercise 1.3 implies that if A ⊂ ∪_i A_i then µ(A) ≤ Σ_i µ(A_i), so µ(A) ≤ µ*(A). Of course, we can always take A_1 = A and the other A_i = ∅, so µ*(A) ≤ µ(A).

To prove that any A ∈ A is measurable, we begin by noting that the inequality (∗′) is trivial when µ*(F) = ∞, so we can without loss of generality assume µ*(F) < ∞. To prove that (∗′) holds when E = A, we observe that since µ*(F) < ∞, for any ε > 0 there is a sequence B_i ∈ A so that ∪_i B_i ⊃ F and

Σ_i µ(B_i) ≤ µ*(F) + ε

Since µ is additive on A, and µ = µ* on A, we have

µ(B_i) = µ*(B_i ∩ A) + µ*(B_i ∩ A^c)

Summing over i and using the subadditivity of µ* gives

µ*(F) + ε ≥ Σ_i µ*(B_i ∩ A) + Σ_i µ*(B_i ∩ A^c) ≥ µ*(F ∩ A) + µ*(F ∩ A^c)

which proves the desired result since ε is arbitrary.

(2.4) Lemma. The class A* of measurable sets is a σ-field, and the restriction of µ* to A* is a measure.
Remark. This result is true for any outer measure.
Proof. It is clear from the definition that: (a) If E is measurable then E^c is. Our first nontrivial task is to prove:

(b) If E_1 and E_2 are measurable then E_1 ∪ E_2 and E_1 ∩ E_2 are.

Proof of (b). To prove the first conclusion, let G be any subset of Ω. Using subadditivity, the measurability of E_2 (let F = G ∩ E_1^c in (∗)), and the measurability of E_1, we get

µ*(G ∩ (E_1 ∪ E_2)) + µ*(G ∩ (E_1^c ∩ E_2^c))
≤ µ*(G ∩ E_1) + µ*(G ∩ E_1^c ∩ E_2) + µ*(G ∩ E_1^c ∩ E_2^c)
= µ*(G ∩ E_1) + µ*(G ∩ E_1^c) = µ*(G)

To prove that E_1 ∩ E_2 is measurable, we observe E_1 ∩ E_2 = (E_1^c ∪ E_2^c)^c and use (a).

(c) Let G ⊂ Ω and let E_1, . . . , E_n be disjoint measurable sets. Then

µ*(G ∩ ∪_{i=1}^n E_i) = Σ_{i=1}^n µ*(G ∩ E_i)

Proof of (c). Let F_m = ∪_{i≤m} E_i. E_n is measurable, F_n ⊃ E_n, and F_{n−1} ∩ E_n = ∅, so

µ*(G ∩ F_n) = µ*(G ∩ F_n ∩ E_n) + µ*(G ∩ F_n ∩ E_n^c) = µ*(G ∩ E_n) + µ*(G ∩ F_{n−1})

The desired result follows from this by induction.

(d) If the sets E_i are measurable then E = ∪_{i=1}^∞ E_i is measurable.

Proof of (d). Let E_i′ = E_i ∩ (∩_{j<i} E_j^c). (a) and (b) imply E_i′ is measurable, so we can suppose without loss of generality that the E_i are pairwise disjoint. Let F_n = E_1 ∪ . . . ∪ E_n. F_n is measurable by (b), so using monotonicity and (c) we have

µ*(G) = µ*(G ∩ F_n) + µ*(G ∩ F_n^c) ≥ µ*(G ∩ F_n) + µ*(G ∩ E^c)
= Σ_{i=1}^n µ*(G ∩ E_i) + µ*(G ∩ E^c)

Letting n → ∞ and using subadditivity,

µ*(G) ≥ Σ_{i=1}^∞ µ*(G ∩ E_i) + µ*(G ∩ E^c) ≥ µ*(G ∩ E) + µ*(G ∩ E^c)

which is (∗′).

The last step in the proof of (2.4) is

(e) If E = ∪_i E_i where E_1, E_2, . . . are disjoint and measurable, then

µ*(E) = Σ_{i=1}^∞ µ*(E_i)

Proof of (e). Let F_n = E_1 ∪ . . . ∪ E_n. By monotonicity and (c),

µ*(E) ≥ µ*(F_n) = Σ_{i=1}^n µ*(E_i)

Letting n → ∞ now and using subadditivity gives the desired conclusion.

A.3. Completion, Etc.
The proof of (1.1) given in the last section defines an extension to A* ⊃ σ(A). Our next goal is to describe the relationship between these two σ-algebras. Let A_σ denote the collection of countable unions of sets in A, and let B_δ denote the collection of countable intersections of sets in B. Taking B = A_σ, we see that A_σδ denotes the collection of countable intersections of sets in A_σ.
(3.1) Lemma. Let E be any set with µ*(E) < ∞. (i) For any ε > 0, there is an A ∈ A_σ with A ⊃ E and µ*(A) ≤ µ*(E) + ε. (ii) There is a B ∈ A_σδ with B ⊃ E and µ*(B) = µ*(E).

Proof. By the definition of µ*, there is a sequence A_i so that A ≡ ∪_i A_i ⊃ E and Σ_i µ(A_i) ≤ µ*(E) + ε. The definition of µ* implies µ*(A) ≤ Σ_i µ(A_i), establishing (i). For (ii), let A_n ∈ A_σ with A_n ⊃ E and µ*(A_n) ≤ µ*(E) + 1/n, and let B = ∩_n A_n. Clearly, B ∈ A_σδ, B ⊃ E, and hence by monotonicity, µ*(B) ≥ µ*(E). To prove the other inequality, notice that B ⊂ A_n, and hence µ*(B) ≤ µ*(A_n) ≤ µ*(E) + 1/n for any n.
Exercise 3.1. Let A be an algebra, µ a measure on σ(A), and B ∈ σ(A) with µ(B) < ∞. For any ε > 0, there is an A ∈ A with µ(A∆B) < ε, where A∆B = (A − B) ∪ (B − A).
(3.2) Theorem. Suppose µ is σ-finite on A. Then B ∈ A* if and only if there is an A ∈ A_σδ and a set N with µ*(N) = 0 so that B = A − N (= A ∩ N^c).

Proof. It follows from (2.3) and (2.4) that if A ∈ A_σδ then A ∈ A*. (∗′) in Section A.2 and monotonicity imply that sets with µ*(N) = 0 are measurable, so using (2.4) again it follows that A ∩ N^c ∈ A*. To prove the other direction, let Ω_i be a disjoint collection of sets with µ(Ω_i) < ∞ and Ω = ∪_i Ω_i. Let B_i = B ∩ Ω_i and use (3.1) to find A_i^n ∈ A_σ so that A_i^n ⊃ B_i and µ(A_i^n) ≤ µ*(B_i) + 1/(n 2^i). Let A^n = ∪_i A_i^n. Then B ⊂ A^n and

A^n − B ⊂ ∪_{i=1}^∞ (A_i^n − B_i)

so, by subadditivity,

µ*(A^n − B) ≤ Σ_{i=1}^∞ µ*(A_i^n − B_i) ≤ 1/n

Since A^n ∈ A_σ, the set A = ∩_n A^n ∈ A_σδ. Clearly, A ⊃ B. Since N ≡ A − B ⊂ A^n − B for all n, monotonicity implies µ*(N) = 0, and the proof of (3.2) is complete.
A measure space (Ω, F, µ) is said to be complete if F contains all subsets of sets of measure 0. In the proof of (3.2), we showed that (Ω, A*, µ*) is complete. Our next result shows that (Ω, A*, µ*) is the completion of (Ω, σ(A), µ).

(3.3) Theorem. If (Ω, F, µ) is a measure space, then there is a complete measure space (Ω, F̄, µ̄), called the completion of (Ω, F, µ), so that: (i) E ∈ F̄ if and only if E = A ∪ B, where A ∈ F and B ⊂ N ∈ F with µ(N) = 0; (ii) µ̄ agrees with µ on F.

Proof. The first step is to check that F̄ is a σ-algebra. If E_i = A_i ∪ B_i where A_i ∈ F and B_i ⊂ N_i with µ(N_i) = 0, then ∪_i A_i ∈ F and subadditivity implies µ(∪_i N_i) ≤ Σ_i µ(N_i) = 0, so ∪_i E_i ∈ F̄. As for complements, if E = A ∪ B and B ⊂ N, then B^c ⊃ N^c, so

E^c = A^c ∩ B^c = (A^c ∩ N^c) ∪ (A^c ∩ B^c ∩ N)

A^c ∩ N^c is in F and A^c ∩ B^c ∩ N ⊂ N, so E^c ∈ F̄.

We define µ̄ in the obvious way: if E = A ∪ B where A ∈ F and B ⊂ N with µ(N) = 0, then we let µ̄(E) = µ(A). The first thing to show is that µ̄ is well defined, i.e., if E = A_i ∪ B_i, i = 1, 2, are two decompositions, then µ(A_1) = µ(A_2). Let A_0 = A_1 ∩ A_2 and B_0 = B_1 ∪ B_2. E = A_0 ∪ B_0 is a third decomposition with A_0 ∈ F and B_0 ⊂ N_1 ∪ N_2, and it has the pleasant property that for i = 1 or 2

µ(A_0) ≤ µ(A_i) ≤ µ(A_0) + µ(N_1 ∪ N_2) = µ(A_0)

The last detail is to check that µ̄ is a measure, but that is easy. If the E_i = A_i ∪ B_i are disjoint, then ∪_i E_i can be decomposed as (∪_i A_i) ∪ (∪_i B_i), and the A_i ⊂ E_i are disjoint, so

µ̄(∪_i E_i) = µ(∪_i A_i) = Σ_i µ(A_i) = Σ_i µ̄(E_i)

(1.6) allows us to construct Lebesgue measure λ on (R^d, R^d). Using (3.3), we can extend λ to a measure on (R^d, R̄^d), where R̄^d is the completion of R^d. Having done this, it is natural (if somewhat optimistic) to ask: are there any sets that are not in R̄^d? The answer is "Yes," and we will now give an example of a nonmeasurable B in R.

A nonmeasurable subset of [0, 1)

The key to our construction is the observation that λ is translation invariant: i.e., if A ∈ R̄ and x + A = {x + y : y ∈ A}, then x + A ∈ R̄ and λ(A) = λ(x + A).
We say that x, y ∈ [0, 1) are equivalent, and write x ∼ y, if x − y is a rational number. By the axiom of choice, there is a set B that contains exactly one element from each equivalence class. B is our nonmeasurable set, that is,

(3.4) Theorem. B ∉ R̄.

Proof. The key is the following:

(3.5) Lemma. If E ⊂ [0, 1) is in R̄, x ∈ (0, 1), and x + E = {(x + y) mod 1 : y ∈ E}, then λ(E) = λ(x + E).

Proof. Let A = E ∩ [0, 1 − x) and B = E ∩ [1 − x, 1). Let A′ = x + A = {x + y : y ∈ A} and B′ = x − 1 + B. A, B ∈ R̄, so by translation invariance A′, B′ ∈ R̄ and λ(A) = λ(A′), λ(B) = λ(B′). Since A′ ⊂ [x, 1) and B′ ⊂ [0, x) are disjoint,

λ(E) = λ(A) + λ(B) = λ(A′) + λ(B′) = λ(x + E)

From (3.5), it follows easily that B is not measurable; if it were, then q + B, q ∈ Q ∩ [0, 1), would be a countable disjoint collection of measurable subsets of [0, 1), all with the same measure α, having

∪_{q∈Q∩[0,1)} (q + B) = [0, 1)

If α > 0 then λ([0, 1)) = ∞, and if α = 0 then λ([0, 1)) = 0. Neither conclusion is compatible with the fact that λ([0, 1)) = 1, so B ∉ R̄.
Exercise 3.2. Let B be the nonmeasurable set constructed in (3.4). (i) Let B_q = q + B and show that if D_q ⊂ B_q is measurable, then λ(D_q) = 0. (ii) Use (i) to conclude that if A ⊂ R has λ(A) > 0, there is a nonmeasurable S ⊂ A.

Letting B′ = B × [0, 1]^{d−1}, where B is our nonmeasurable subset of [0, 1), we get a nonmeasurable set in dimension d > 1. In d = 3, there is a much more interesting example, but we need the reader to do some preliminary work. In Euclidean geometry, two subsets of R^d are said to be congruent if one set can be mapped onto the other by translations and rotations.

Claim. Two congruent measurable sets must have the same Lebesgue measure.

Exercise 3.3. Prove the claim in d = 2 by showing: (i) if B is a rotation of a rectangle A, then λ*(B) = λ(A); (ii) if C is congruent to D, then λ*(C) = λ*(D).
Banach-Tarski Theorem

Banach and Tarski (1924) used the axiom of choice to show that it is possible to partition the sphere {x : |x| ≤ 1} in R^3 into a finite number of sets A_1, . . . , A_n and find congruent sets B_1, . . . , B_n whose union is two disjoint spheres of radius 1! Since congruent sets have the same Lebesgue measure, at least one of the sets A_i must be nonmeasurable. The construction relies on the fact that the group generated by rotations in R^3 is not Abelian. Lindenbaum (1926) showed that this cannot be done with any bounded set in R^2. For a popular account of the Banach-Tarski theorem, see French (1988).

Solovay's Theorem

The axiom of choice played an important role in the last two constructions of nonmeasurable sets. Solovay (1970) proved that its use is unavoidable. In his own words, "We show that the existence of a non-Lebesgue measurable set cannot be proved in Zermelo-Fraenkel set theory if the use of the axiom of choice is disallowed." This should convince the reader that all subsets of R^d that arise "in practice" are in R̄^d.

A.4. Integration
Let µ be a σ-finite measure on (Ω, F). In this section we will define ∫f dµ for a class of measurable functions. This is a four-step procedure:

Step 1. ϕ is said to be a simple function if ϕ(ω) = Σ_{i=1}^n a_i 1_{A_i} and the A_i are disjoint sets with µ(A_i) < ∞. If ϕ is a simple function, we let

∫ϕ dµ = Σ_{i=1}^n a_i µ(A_i)

The representation of ϕ is not unique since we have not supposed that the a_i are distinct. However, it is easy to see that the last definition does not contradict itself.

We will prove the next three conclusions four times, but before we can state them for the first time, we need a definition. ϕ ≥ ψ µ-almost everywhere (or ϕ ≥ ψ µ-a.e.) means µ({ω : ϕ(ω) < ψ(ω)}) = 0. When there is no doubt about what measure we are referring to, we drop the µ.
(4.1) Lemma. Let ϕ and ψ be simple functions.
(i) If ϕ ≥ 0 a.e. then ∫ϕ dµ ≥ 0.
(ii) For any a ∈ R, ∫aϕ dµ = a ∫ϕ dµ.
(iii) ∫(ϕ + ψ) dµ = ∫ϕ dµ + ∫ψ dµ.

Proof. (i) and (ii) are immediate consequences of the definition. To prove (iii), suppose

ϕ = Σ_{i=1}^m a_i 1_{A_i} and ψ = Σ_{j=1}^n b_j 1_{B_j}

To make the supports of the two functions the same, we let A_0 = ∪_j B_j − ∪_i A_i, let B_0 = ∪_i A_i − ∪_j B_j, and let a_0 = b_0 = 0. Now

ϕ + ψ = Σ_{i=0}^m Σ_{j=0}^n (a_i + b_j) 1_{A_i ∩ B_j}

and the A_i ∩ B_j are pairwise disjoint, so

∫(ϕ + ψ) dµ = Σ_{i=0}^m Σ_{j=0}^n (a_i + b_j) µ(A_i ∩ B_j)
= Σ_{i=0}^m Σ_{j=0}^n a_i µ(A_i ∩ B_j) + Σ_{j=0}^n Σ_{i=0}^m b_j µ(A_i ∩ B_j)
= Σ_{i=0}^m a_i µ(A_i) + Σ_{j=0}^n b_j µ(B_j) = ∫ϕ dµ + ∫ψ dµ

In the next-to-last step, we used A_i = +_j (A_i ∩ B_j) and B_j = +_i (A_i ∩ B_j), where + denotes a disjoint union.
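The Step 1 integral, and the common-refinement argument in the proof of (iii), are easy to mirror in code. The representation below (value, level-set pairs over a discrete µ) is our own sketch, not the text's construction:

```python
from collections import defaultdict

def integrate_simple(pairs, mu):
    """∫ϕ dµ for a simple ϕ = Σ a_i 1_{A_i} with disjoint A_i, given as a
    list of (a_i, set A_i) pairs; µ maps points to masses."""
    return sum(a * sum(mu[x] for x in A) for a, A in pairs)

def add_simple(phi, psi):
    """ϕ + ψ re-expressed over the common refinement of level sets,
    in the spirit of the proof of (4.1)(iii)."""
    values = defaultdict(float)           # point -> ϕ(x) + ψ(x)
    for a, A in phi:
        for x in A:
            values[x] += a
    for b, B in psi:
        for x in B:
            values[x] += b
    levels = defaultdict(set)             # group points by common value
    for x, v in values.items():
        levels[v].add(x)
    return [(v, S) for v, S in levels.items()]

mu = {1: 0.5, 2: 1.0, 3: 2.0, 4: 0.25}        # a toy measure
phi = [(3.0, {1, 2}), (-1.0, {3})]
psi = [(2.0, {2, 3}), (5.0, {4})]

lhs = integrate_simple(add_simple(phi, psi), mu)
rhs = integrate_simple(phi, mu) + integrate_simple(psi, mu)
assert abs(lhs - rhs) < 1e-12                  # (iii): additivity holds
```

Note that `add_simple` also shows why the definition "does not contradict itself": regrouping points by value never changes the weighted sum.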
We will prove (i)–(iii) three more times as we generalize our integral. See (4.3), (4.5), and (4.7). As a consequence of (i)–(iii), we get three more useful properties. To keep from repeating their proofs, which do not change, we will prove:

(4.2) Lemma. If (i) and (iii) hold, then we have:
(iv) If ϕ ≤ ψ a.e. then ∫ϕ dµ ≤ ∫ψ dµ.
(v) If ϕ = ψ a.e. then ∫ϕ dµ = ∫ψ dµ.
If, in addition, (ii) holds when a = −1, we have
(vi) |∫ϕ dµ| ≤ ∫|ϕ| dµ.

Proof. By (iii), ∫ψ dµ = ∫ϕ dµ + ∫(ψ − ϕ) dµ, and the second integral is ≥ 0 by (i), so (iv) holds. ϕ = ψ a.e. implies ϕ ≤ ψ a.e. and ψ ≤ ϕ a.e., so (v) follows from two applications of (iv). To prove (vi), notice that ϕ ≤ |ϕ|, so (iv) implies ∫ϕ dµ ≤ ∫|ϕ| dµ. Since −ϕ ≤ |ϕ|, (iv) and (ii) imply −∫ϕ dµ ≤ ∫|ϕ| dµ. Since |y| = max(y, −y), the result follows.

Step 2. Let E be a set with µ(E) < ∞ and let f be a bounded function that vanishes on E^c. To define the integral of f, we observe that if ϕ, ψ are simple functions with ϕ ≤ f ≤ ψ, then we want to have

∫ϕ dµ ≤ ∫f dµ ≤ ∫ψ dµ

so we let

(∗) ∫f dµ = sup_{ϕ≤f} ∫ϕ dµ = inf_{ψ≥f} ∫ψ dµ

Here and for the rest of Step 2, we assume that ϕ and ψ vanish on E^c. To justify the definition in (∗), we have to prove that the sup and inf are equal. It follows from (iv) in (4.2) that

sup_{ϕ≤f} ∫ϕ dµ ≤ inf_{ψ≥f} ∫ψ dµ

To prove the other inequality, suppose |f| ≤ M and, for −n ≤ k ≤ n, let

E_k = {x ∈ E : kM/n ≥ f(x) > (k − 1)M/n}

ψ_n(x) = Σ_{k=−n}^n (kM/n) 1_{E_k} and ϕ_n(x) = Σ_{k=−n}^n ((k − 1)M/n) 1_{E_k}

By definition, ψ_n(x) − ϕ_n(x) = (M/n) 1_E, so

∫ψ_n(x) − ϕ_n(x) dµ = (M/n) µ(E)

Since ϕ_n(x) ≤ f(x) ≤ ψ_n(x), it follows from (iii) in (4.1) that

sup_{ϕ≤f} ∫ϕ dµ ≥ ∫ϕ_n dµ = −(M/n) µ(E) + ∫ψ_n dµ ≥ −(M/n) µ(E) + inf_{ψ≥f} ∫ψ dµ

The last inequality holds for all n, so the proof of (∗) is complete.
(4.3) Lemma. Let E be a set with µ(E) < ∞. If f and g are bounded functions that vanish on E^c, then:
(i) If f ≥ 0 a.e. then ∫f dµ ≥ 0.
(ii) For any a ∈ R, ∫af dµ = a ∫f dµ.
(iii) ∫(f + g) dµ = ∫f dµ + ∫g dµ.
(iv) If g ≤ f a.e. then ∫g dµ ≤ ∫f dµ.
(v) If g = f a.e. then ∫g dµ = ∫f dµ.
(vi) |∫f dµ| ≤ ∫|f| dµ.

Proof. Since we can take ϕ ≡ 0, (i) is clear from the definition. To prove (ii), we observe that if a > 0, then aϕ ≤ af if and only if ϕ ≤ f, so

∫af dµ = sup_{ϕ≤f} ∫aϕ dµ = sup_{ϕ≤f} a ∫ϕ dµ = a sup_{ϕ≤f} ∫ϕ dµ = a ∫f dµ

For a < 0, we observe that aϕ ≤ af if and only if ϕ ≥ f, so

∫af dµ = sup_{ϕ≥f} ∫aϕ dµ = sup_{ϕ≥f} a ∫ϕ dµ = a inf_{ϕ≥f} ∫ϕ dµ = a ∫f dµ

To prove (iii), we observe that if ψ_1 ≥ f and ψ_2 ≥ g, then ψ_1 + ψ_2 ≥ f + g, so

inf_{ψ≥f+g} ∫ψ dµ ≤ inf_{ψ_1≥f, ψ_2≥g} ∫(ψ_1 + ψ_2) dµ

Using linearity for simple functions, it follows that

∫(f + g) dµ = inf_{ψ≥f+g} ∫ψ dµ ≤ inf_{ψ_1≥f, ψ_2≥g} ∫ψ_1 dµ + ∫ψ_2 dµ = ∫f dµ + ∫g dµ

To prove the other inequality, observe that the last conclusion applied to −f and −g, together with (ii), implies

−∫(f + g) dµ ≤ −∫f dµ − ∫g dµ

(iv)–(vi) follow from (i)–(iii) by (4.2).

Notation. We define the integral of f over the set E:

∫_E f dµ ≡ ∫f · 1_E dµ
Step 3. If f ≥ 0 then we let

∫f dµ = sup{∫h dµ : 0 ≤ h ≤ f, h is bounded and µ({x : h(x) > 0}) < ∞}

The last definition is nice since it is clear that this is well defined. The next result will help us compute the value of the integral.

(4.4) Lemma. Let E_n ↑ Ω have µ(E_n) < ∞ and let a ∧ b = min(a, b). Then

∫_{E_n} f ∧ n dµ ↑ ∫f dµ as n ↑ ∞

Proof. It is clear from (iv) in (4.3) that the left-hand side increases as n does. Since h = (f ∧ n) 1_{E_n} is a possibility in the sup, each term is smaller than the integral on the right. To prove that the limit is ∫f dµ, observe that if 0 ≤ h ≤ f, h ≤ M, and µ({x : h(x) > 0}) < ∞, then for n ≥ M, using h ≤ M, (iv), and (iii),

∫_{E_n} f ∧ n dµ ≥ ∫_{E_n} h dµ = ∫h dµ − ∫_{E_n^c} h dµ

Now 0 ≤ ∫_{E_n^c} h dµ ≤ M µ(E_n^c ∩ {x : h(x) > 0}) → 0 as n → ∞, so

lim inf_{n→∞} ∫_{E_n} f ∧ n dµ ≥ ∫h dµ

which proves the desired result, since h is an arbitrary member of the class that defines the integral of f.
(4.5) Lemma. Suppose f, g ≥ 0.
(i) ∫f dµ ≥ 0.
(ii) If a > 0 then ∫af dµ = a ∫f dµ.
(iii) ∫(f + g) dµ = ∫f dµ + ∫g dµ.
(iv) If 0 ≤ g ≤ f a.e. then ∫g dµ ≤ ∫f dµ.
(v) If 0 ≤ g = f a.e. then ∫g dµ = ∫f dµ.

Proof. (i) is trivial from the definition. (ii) is clear, since when a > 0, ah ≤ af if and only if h ≤ f, and we have ∫ah dµ = a ∫h dµ for h in the defining class. For (iii), we observe that if f ≥ h and g ≥ k, then f + g ≥ h + k, so taking the sup over h and k in the defining classes for f and g gives

∫(f + g) dµ ≥ ∫f dµ + ∫g dµ

To prove the other direction, we observe (a + b) ∧ n ≤ (a ∧ n) + (b ∧ n), so (iv) from (4.2) and (iii) from (4.3) imply

∫_{E_n} (f + g) ∧ n dµ ≤ ∫_{E_n} f ∧ n dµ + ∫_{E_n} g ∧ n dµ

Letting n → ∞ and using (4.4) gives (iii). As before, (iv) and (v) follow from (i), (iii), and (4.2).

Exercise 4.1. Show that if f ≥ 0 and ∫f dµ = 0, then f = 0 a.e.

Exercise 4.2. Let f ≥ 0 and E_{n,m} = {x : m/2^n ≤ f(x) < (m + 1)/2^n}. Show that as n ↑ ∞,

Σ_{m=1}^∞ (m/2^n) µ(E_{n,m}) ↑ ∫f dµ
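Exercise 4.2 can be illustrated numerically on a discrete measure space (the weights and f below are arbitrary toy data): the level sums increase to ∫f dµ as the range grid is refined.

```python
# µ = 0.01 · counting measure on 100 points; f ≥ 0 is an arbitrary choice.
weights = {x: 0.01 for x in range(100)}
f = {x: (x % 7) + 0.3 for x in range(100)}

exact = sum(f[x] * w for x, w in weights.items())

def level_sum(n):
    """Σ_m (m/2^n) µ(E_{n,m}), computed pointwise since µ is discrete."""
    total = 0.0
    for x, w in weights.items():
        m = int(f[x] * 2 ** n)            # f(x) ∈ [m/2^n, (m+1)/2^n)
        total += (m / 2 ** n) * w
    return total

sums = [level_sum(n) for n in range(8)]
# the sums increase with n (dyadic refinement) ...
assert all(s1 <= s2 + 1e-12 for s1, s2 in zip(sums, sums[1:]))
# ... and converge to ∫f dµ from below, within one grid step 2^{-n}
assert 0 <= exact - sums[-1] < 2 ** -7 + 1e-12
```

This is the "subdivide the range" picture that reappears in Exercise 4.6 below.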
Step 4. We say f is integrable if ∫|f| dµ < ∞. Let

f^+(x) = f(x) ∨ 0 and f^−(x) = (−f(x)) ∨ 0

where a ∨ b = max(a, b). Clearly,

f(x) = f^+(x) − f^−(x) and |f(x)| = f^+(x) + f^−(x)

We define the integral of f by

∫f dµ = ∫f^+ dµ − ∫f^− dµ

The right-hand side is well defined since f^+, f^− ≤ |f| and we have (iv) in (4.5). For the final time, we will prove our six properties. To do this, it is useful to know:

(4.6) Lemma. If f = f_1 − f_2 where f_1, f_2 ≥ 0 and ∫f_i dµ < ∞, then

∫f dµ = ∫f_1 dµ − ∫f_2 dµ

Proof. f_1 + f^− = f_2 + f^+, and all four functions are ≥ 0, so by (iii) of (4.5),

∫f_1 dµ + ∫f^− dµ = ∫(f_1 + f^−) dµ = ∫(f_2 + f^+) dµ = ∫f_2 dµ + ∫f^+ dµ

Rearranging gives the desired conclusion.
(4.7) Theorem. Suppose f and g are integrable.
(i) If f ≥ 0 a.e. then ∫f dµ ≥ 0.
(ii) For all a ∈ R, ∫af dµ = a ∫f dµ.
(iii) ∫(f + g) dµ = ∫f dµ + ∫g dµ.
(iv) If g ≤ f a.e. then ∫g dµ ≤ ∫f dµ.
(v) If g = f a.e. then ∫g dµ = ∫f dµ.
(vi) |∫f dµ| ≤ ∫|f| dµ.

Proof. (i) is trivial. (ii) is clear, since if a > 0 then (af)^+ = a(f^+), and so on. To prove (iii), observe that f + g = (f^+ + g^+) − (f^− + g^−), so using (4.6) and (4.5),

∫(f + g) dµ = ∫(f^+ + g^+) dµ − ∫(f^− + g^−) dµ
= ∫f^+ dµ + ∫g^+ dµ − ∫f^− dµ − ∫g^− dµ

As usual, (iv)–(vi) follow from (i)–(iii) and (4.2).

Notation for special cases:

(a) When (Ω, F, µ) = (R^d, R^d, λ), we write ∫f(x) dx for ∫f dλ.

(b) When (Ω, F, µ) = (R, R, λ) and E = [a, b], we write ∫_a^b f(x) dx for ∫_E f dλ.

(c) When (Ω, F, µ) = (R, R, µ) with µ((a, b]) = G(b) − G(a) for a < b, we write ∫f(x) dG(x) for ∫f dµ.

(d) When Ω is a countable set, F = all subsets of Ω, and µ is counting measure, we write Σ_{i∈Ω} f(i) for ∫f dµ.

We mention example (d) primarily to indicate that results for sums follow from those for integrals. For the rest of this section, we will consider the case (Ω, F, µ) = (R, R, λ).
Littlewood’s principles
Speaking of the theory of functions of a real variable, Littlewood (1944) said:
“The extent of knowledge required is nothing like so great as is sometimes
supposed. There are three principles, roughly expressible in the following terms:
1. Every measurable set is roughly a ﬁnite union of intervals. A.4 Integration
2. Every measurable function is almost continuous.
3. Every convergent sequence of measurable functions is almost uniformly convergent.
Most of the results of the theory are fairly intuitive applications of these ideas
and the student armed with them should be equal to most occasions when real
variable theory is called for.”
Exercise 3.1 above and Exercise 2.8 in Chapter 1 give versions of the ﬁrst
and third principles. The next two exercises develop a version of the second.
Exercise 4.3. Let g be an integrable function on R and ε > 0. (i) Use the definition of the integral to conclude there is a simple function ϕ = Σ_k b_k 1_{A_k} with ∫|g − ϕ| dx < ε. (ii) Use Exercise 3.1 to approximate the A_k by finite unions of intervals to get a step function

q = Σ_{j=1}^k c_j 1_{(a_{j−1}, a_j)}

with a_0 < a_1 < . . . < a_k, so that ∫|ϕ − q| < ε. (iii) Round the corners of q to get a continuous function r so that ∫|q − r| dx < ε.

Exercise 4.4. Prove the Riemann-Lebesgue lemma: if g is integrable, then

lim_{n→∞} ∫g(x) cos(nx) dx = 0

Hint: If g is a step function, this is easy. Now use Exercise 4.3.
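For one concrete step function — g = 1_{[0,1]}, our own choice — the Riemann-Lebesgue limit can be checked directly, since ∫_0^1 cos(nx) dx = sin(n)/n:

```python
import math

def integral_cos(n, steps=200000):
    """Midpoint-rule approximation of ∫_0^1 cos(nx) dx."""
    h = 1.0 / steps
    return sum(math.cos(n * (i + 0.5) * h) * h for i in range(steps))

for n in (10, 100, 300):
    approx = integral_cos(n)
    assert abs(approx - math.sin(n) / n) < 1e-6   # quadrature matches sin(n)/n
    assert abs(approx) <= 1.0 / n + 1e-6          # and the value decays like 1/n
```

The 1/n decay here is special to step functions; for a general integrable g, the lemma only asserts that the integral tends to 0, with no rate.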
* Riemann Integration

Our treatment of the Lebesgue integral would not be complete if we did not prove the classic theorem of Lebesgue that identifies the functions for which the Riemann integral exists. Let −∞ < a < b < ∞. A subdivision σ of [a, b] is a finite sequence a = x_0 < x_1 < . . . < x_n = b. Given a subdivision σ, we define the

upper Riemann sum U(σ) = Σ_{i=1}^n (x_i − x_{i−1}) sup{f(y) : y ∈ [x_{i−1}, x_i]}

lower Riemann sum L(σ) = Σ_{i=1}^n (x_i − x_{i−1}) inf{f(y) : y ∈ [x_{i−1}, x_i]}

We say that f is Riemann integrable on [a, b] in the liberal sense if

∞ > inf_σ U(σ) = sup_σ L(σ) > −∞

The function q(x) that is 1 if x is irrational and 0 if x is rational is the classic example of a function that is not Riemann integrable on [0, 1] in the liberal sense but is Lebesgue integrable on [0, 1]. (q 1_{[0,1]} is a simple function!) The next result gives a necessary condition for Riemann integrability.
(4.8) Theorem. If f is Riemann integrable on [a, b] in the liberal sense, then f is bounded and continuous a.e. on [a, b].

Proof. If f is unbounded above, then U(σ) = ∞ for all subdivisions. Likewise, if f is unbounded below, L(σ) = −∞ for all subdivisions. Thus, f must be bounded. To prove that it must be continuous a.e., we begin by letting

u_n(x) = sup{f(y) : |x − y| < 2^{−n} and y ∈ [a, b]}
v_n(x) = inf{f(y) : |x − y| < 2^{−n} and y ∈ [a, b]}

Exercise 2.6 in Chapter 1 implies u_n and v_n are measurable. Let

f^0 = lim_{n→∞} u_n and f_0 = lim_{n→∞} v_n

f^0(x) ≥ f_0(x), with equality if and only if f is continuous at x. Given a subdivision σ,

sup{f(y) : y ∈ [x_{i−1}, x_i]} ≥ f^0(x) for x ∈ (x_{i−1}, x_i)

so U(σ) ≥ ∫_{[a,b]} f^0 dx, the (Lebesgue) integral existing since f^0 is bounded and measurable. Similar reasoning shows that any lower Riemann sum has ∫_{[a,b]} f_0 dx ≥ L(σ), so if f is Riemann integrable in the liberal sense, ∫_{[a,b]} f^0 − f_0 dx = 0, and it follows from Exercise 4.1 that f^0 = f_0 a.e.
To state a converse to (4.8), we need two definitions. The mesh of a subdivision is sup_i (x_i − x_{i−1}). f is said to be Riemann integrable on [a, b] in the strict sense if for any sequence of subdivisions σ_n with mesh → 0, U(σ_n) − L(σ_n) → 0.

(4.9) Theorem. If f is bounded and continuous a.e. on [a, b], then f is Riemann integrable on [a, b] in the strict sense.

Proof. We need a little more theory before we can give a simple proof of this. See Exercise 5.4.
Exercise 4.5. Give examples to show that for a function f defined on R, neither statement implies the other: (a) f is continuous a.e.; (b) there is a continuous function g so that f = g a.e.

Exercise 4.6. Let (Ω, F, µ) be a finite measure space and let f be a function with |f| < M. Given a sequence of subdivisions −M = x_0^n < x_1^n < . . . < x_n^n = M, define the

upper Lebesgue sum Ū(σ_n) = Σ_{m=1}^n x_m^n µ({ω : f(ω) ∈ [x_{m−1}^n, x_m^n)})

lower Lebesgue sum L̄(σ_n) = Σ_{m=1}^n x_{m−1}^n µ({ω : f(ω) ∈ [x_{m−1}^n, x_m^n)})

Show that if mesh(σ_n) → 0, then Ū(σ_n), L̄(σ_n) → ∫f dµ. In short, in Riemann integration we subdivide the domain, and in Lebesgue integration we subdivide the range.

A.5. Properties of the Integral
In this section, we will develop properties of the integral deﬁned in the last
section. Our ﬁrst result generalizes (vi) from (4.7).
(5.1) Jensen's inequality. Suppose ϕ is convex, that is,

λϕ(x) + (1 − λ)ϕ(y) ≥ ϕ(λx + (1 − λ)y)

for all λ ∈ (0, 1) and x, y ∈ R. If µ is a probability measure, i.e., µ(R) = 1, and f and ϕ(f) are integrable, then

ϕ(∫f dµ) ≤ ∫ϕ(f) dµ

Proof. Let c = ∫f dµ and let ℓ(x) = ax + b be a linear function that has ℓ(c) = ϕ(c) and ϕ(x) ≥ ℓ(x). To see that such a function exists, recall that convexity implies

lim_{h↓0} (ϕ(c) − ϕ(c − h))/h ≤ lim_{h↓0} (ϕ(c + h) − ϕ(c))/h

(The limits exist since the difference quotients are monotone in h.) If we let a be any number between the two limits and let ℓ(x) = a(x − c) + ϕ(c), then ℓ has the desired properties. With the existence of ℓ established, the rest is easy. (iv) in (4.7) implies

∫ϕ(f) dµ ≥ ∫(af + b) dµ = a ∫f dµ + b = ℓ(∫f dµ) = ϕ(∫f dµ)

since c = ∫f dµ and ℓ(c) = ϕ(c).

Let ||f||_p = (∫|f|^p dµ)^{1/p} for 1 ≤ p < ∞, and notice ||cf||_p = |c| ||f||_p for any real number c.

(5.2) Hölder's inequality. If p, q ∈ (1, ∞) with 1/p + 1/q = 1, then

∫|fg| dµ ≤ ||f||_p ||g||_q

Proof. If ||f||_p or ||g||_q = 0, then |fg| = 0 a.e., so it suffices to prove the result when ||f||_p and ||g||_q > 0, or, by dividing both sides by ||f||_p ||g||_q, when ||f||_p = ||g||_q = 1. Fix y ≥ 0 and, for x ≥ 0, let

ϕ(x) = x^p/p + y^q/q − xy

ϕ′(x) = x^{p−1} − y and ϕ″(x) = (p − 1)x^{p−2}

so ϕ has a minimum at x_o = y^{1/(p−1)}. Since x_o^p = y^{p/(p−1)} = y^q and q = p/(p − 1),

ϕ(x_o) = y^q (1/p + 1/q) − y^{1/(p−1)} y = 0

Since x_o is the minimum, it follows that xy ≤ x^p/p + y^q/q. Letting x = |f|, y = |g|, and integrating,

∫|fg| dµ ≤ 1/p + 1/q = 1 = ||f||_p ||g||_q

Remark. The special case p = q = 2 is called the Cauchy-Schwarz inequality. One can give a direct proof of the result in this case by observing that for any θ,

0 ≤ ∫(f + θg)^2 dµ = ∫f^2 dµ + 2θ ∫fg dµ + θ^2 ∫g^2 dµ

so the quadratic aθ^2 + bθ + c on the right-hand side has at most one real root. Recalling the formula for the roots of a quadratic,

θ = (−b ± √(b^2 − 4ac))/2a

we see b^2 − 4ac ≤ 0, which is the desired result.
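Jensen's, Hölder's, and the Cauchy-Schwarz inequalities can all be spot-checked on a small discrete probability space; the weights and values below are arbitrary toy data:

```python
mu = [0.2, 0.3, 0.1, 0.4]          # a probability measure: masses sum to 1
f  = [1.0, -2.0, 0.5, 3.0]
g  = [2.0, 1.0, -1.0, 0.5]

def integral(h):
    return sum(w * v for w, v in zip(mu, h))

# Jensen with the convex ϕ(x) = x²:  (∫f dµ)² ≤ ∫f² dµ
assert integral(f) ** 2 <= integral([x * x for x in f]) + 1e-9

norm = lambda h, r: integral([abs(x) ** r for x in h]) ** (1 / r)
lhs = integral([abs(a * b) for a, b in zip(f, g)])

# Hölder with the conjugate pair p = 3, q = 3/2
assert lhs <= norm(f, 3.0) * norm(g, 1.5) + 1e-9
# the p = q = 2 case (Cauchy-Schwarz)
assert lhs <= norm(f, 2.0) * norm(g, 2.0) + 1e-9
```

Of course a finite check proves nothing; it is only a way to see the inequalities in action and to catch sign or exponent slips when using them.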
Exercise 5.1. Let ||f||_∞ = inf{M : µ({x : |f(x)| > M}) = 0}. Prove that

∫|fg| dµ ≤ ||f||_1 ||g||_∞

Exercise 5.2. Show that if µ is a probability measure, then

||f||_∞ = lim_{p→∞} ||f||_p

Exercise 5.3. Minkowski's inequality. (i) Suppose p ∈ (1, ∞). The inequality |f + g|^p ≤ 2^p (|f|^p + |g|^p) shows that if ||f||_p and ||g||_p are < ∞, then ||f + g||_p < ∞. Apply Hölder's inequality to |f| |f + g|^{p−1} and |g| |f + g|^{p−1} to show ||f + g||_p ≤ ||f||_p + ||g||_p. (ii) Show that the last result remains true when p = 1 or p = ∞.
Our next goal is to give conditions that guarantee

lim_{n→∞} ∫f_n dµ = ∫ lim_{n→∞} f_n dµ

First, we need a definition. We say that f_n → f in measure if for any ε > 0, µ({x : |f_n(x) − f(x)| > ε}) → 0 as n → ∞. On a space of finite measure, this is a weaker assumption than f_n → f a.e., but the next result is easier to prove in the greater generality.

(5.3) Bounded convergence theorem. Let E be a set with µ(E) < ∞. Suppose f_n vanishes on E^c, |f_n(x)| ≤ M, and f_n → f in measure. Then

∫f dµ = lim_{n→∞} ∫f_n dµ

Example 5.1. The functions f_n(x) = 1_{[n,n+1)}(x), on R equipped with the Borel sets R and Lebesgue measure λ, show that the conclusion of (5.3) does not hold when µ(E) = ∞.

Proof. Let ε > 0, G_n = {x : |f_n(x) − f(x)| < ε}, and B_n = E − G_n. Using (iii) and (iv) from (4.7),

|∫f dµ − ∫f_n dµ| = |∫(f − f_n) dµ| ≤ ∫|f − f_n| dµ
= ∫_{G_n} |f − f_n| dµ + ∫_{B_n} |f − f_n| dµ ≤ ε µ(E) + 2M µ(B_n)

f_n → f in measure implies µ(B_n) → 0. ε > 0 is arbitrary and µ(E) < ∞, so the proof is complete.

Exercise 5.4. Use (5.3) to prove (4.9). Hint: Given a subdivision σ, let

f^σ(x) = sup{f(y) : y ∈ [x_{i−1}, x_i]} for x ∈ (x_{i−1}, x_i)

so that ∫_{[a,b]} f^σ(x) dx = U(σ).

(5.4) Fatou's lemma. If f_n ≥ 0, then
lim inf_{n→∞} ∫f_n dµ ≥ ∫ lim inf_{n→∞} f_n dµ

Example 5.2. Example 5.1 shows that we may have strict inequality in (5.4). The functions f_n(x) = n 1_{(0,1/n]}(x), on (0, 1) equipped with the Borel sets and Lebesgue measure, show that this can happen on a space of finite measure.
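The strict inequality in Example 5.2 is easy to see numerically; the grid approximation of (0, 1) with Lebesgue measure below is our own device:

```python
# f_n = n·1_(0,1/n] on (0,1): every ∫f_n dλ = 1, yet liminf f_n ≡ 0.
N = 10**5
grid = [(i + 0.5) / N for i in range(N)]   # midpoints, each carrying mass 1/N

def f(n, x):
    return n if 0 < x <= 1.0 / n else 0.0

for n in (10, 100, 1000):
    integral_n = sum(f(n, x) for x in grid) / N
    assert abs(integral_n - 1.0) < 0.05    # ∫f_n dλ = 1 for every n

# pointwise, liminf f_n = 0 on (0,1): each fixed x > 1/n eventually
liminf_integral = sum(0.0 for x in grid) / N
assert liminf_integral == 0.0              # so ∫ liminf f_n dλ = 0 < 1
```

The mass of f_n escapes to a shrinking set near 0, which is exactly what the liminf on the right-hand side of (5.4) cannot see.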
Proof. Let g_n(x) = inf_{m≥n} f_m(x). f_n(x) ≥ g_n(x), and as n ↑ ∞,

g_n(x) ↑ g(x) = lim inf_{n→∞} f_n(x)

Since ∫f_n dµ ≥ ∫g_n dµ, it suffices to show that

lim inf_{n→∞} ∫g_n dµ ≥ ∫g dµ

Let E_m ↑ Ω be sets of finite measure. Since g_n ≥ 0 and, for fixed m,

(g_n ∧ m) · 1_{E_m} → (g ∧ m) · 1_{E_m} a.e.

the bounded convergence theorem (5.3) implies

∫g_n dµ ≥ ∫_{E_m} g_n ∧ m dµ → ∫_{E_m} g ∧ m dµ

so lim inf_{n→∞} ∫g_n dµ ≥ ∫_{E_m} g ∧ m dµ. Taking the sup over m and using (4.4) gives the desired result.
(5.5) Monotone convergence theorem. If f_n ≥ 0 and f_n ↑ f, then

∫f_n dµ ↑ ∫f dµ

Proof. Fatou's lemma, (5.4), implies lim inf ∫f_n dµ ≥ ∫f dµ. On the other hand, f_n ≤ f implies lim sup ∫f_n dµ ≤ ∫f dµ.

Exercise 5.5. If g_n ↑ g and ∫g_1^− dµ < ∞, then ∫g_n dµ ↑ ∫g dµ.

Exercise 5.6. If g_m ≥ 0, then ∫ Σ_{m=0}^∞ g_m dµ = Σ_{m=0}^∞ ∫g_m dµ.

Exercise 5.7. Let f ≥ 0. (i) Show that ∫f ∧ n dµ ↑ ∫f dµ as n → ∞. (ii) Use (i) to conclude that if g is integrable and ε > 0, then we can pick δ > 0 so that µ(A) < δ implies ∫_A |g| dµ < ε.
(5.6) Dominated convergence theorem. If f_n → f a.e., |f_n| ≤ g for all n, and g is integrable, then ∫f_n dµ → ∫f dµ.

Proof. f_n + g ≥ 0, so Fatou's lemma implies

lim inf_{n→∞} ∫(f_n + g) dµ ≥ ∫(f + g) dµ

Subtracting ∫g dµ from both sides gives

lim inf_{n→∞} ∫f_n dµ ≥ ∫f dµ

Applying the last result to −f_n, we get

lim sup_{n→∞} ∫f_n dµ ≤ ∫f dµ

and the proof is complete.
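A sketch of (5.6) in action, with a sequence and dominating function of our own choosing — f_n(x) = x/(1 + nx²) on [0, 1], dominated by g ≡ 1:

```python
# f_n → 0 pointwise, |f_n| ≤ g ≡ 1 (integrable on [0,1]), so ∫f_n → 0.
N = 100000
grid = [(i + 0.5) / N for i in range(N)]

def integral(h):
    return sum(h(x) for x in grid) / N

prev = float("inf")
for n in (1, 10, 100, 1000, 10000):
    fn = lambda x, n=n: x / (1 + n * x * x)
    assert all(abs(fn(x)) <= 1.0 for x in grid)   # dominated by g ≡ 1
    val = integral(fn)
    assert val < prev        # for this sequence the integrals even decrease
    prev = val
assert prev < 1e-3           # ∫f_n dλ → ∫0 dλ = 0, as (5.6) predicts
```

Contrast this with Example 5.2: the sequence n·1_(0,1/n] admits no integrable dominating g, which is why its integrals fail to converge to 0.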
Exercise 5.8. If f is integrable and E_m are disjoint sets with union E, then

Σ_{m=0}^∞ ∫_{E_m} f dµ = ∫_E f dµ

So if f ≥ 0, then ν(E) = ∫_E f dµ defines a measure.

Exercise 5.9. Show that if f is integrable on [a, b], then g(x) = ∫_{[a,x]} f(y) dy is continuous on (a, b).

Exercise 5.10. Show that if f has ||f||_p = (∫|f|^p dµ)^{1/p} < ∞, then there are simple functions ϕ_n so that ||ϕ_n − f||_p → 0.
Exercise 5.11. Show that if Σ_n ∫|f_n| dµ < ∞, then ∫ Σ_n f_n dµ = Σ_n ∫f_n dµ.

A.6. Product Measures, Fubini's Theorem

Let (X, A, µ_1) and (Y, B, µ_2) be two σ-finite measure spaces. Let

Ω = X × Y = {(x, y) : x ∈ X, y ∈ Y}
S = {A × B : A ∈ A, B ∈ B}

Sets in S are called rectangles. It is easy to see that S is a semialgebra:

(A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D)
(A × B)^c = (A^c × B) ∪ (A × B^c) ∪ (A^c × B^c)

Let F = A × B be the σ-algebra generated by S.

(6.1) Theorem. There is a unique measure µ on F with

µ(A × B) = µ_1(A) µ_2(B)

Notation. µ is often denoted by µ_1 × µ_2.

Proof. By (1.3) it is enough to show that if A × B = +_i (A_i × B_i), then

µ(A × B) = Σ_i µ(A_i × B_i)

For each x ∈ A, let I(x) = {i : x ∈ A_i}. B = +_{i∈I(x)} B_i, so

1_A(x) µ_2(B) = Σ_i 1_{A_i}(x) µ_2(B_i)

Integrating with respect to µ_1 and using Exercise 5.6 gives

µ_1(A) µ_2(B) = Σ_i µ_1(A_i) µ_2(B_i)

which proves (6.1).
Exercise 6.1. Let A_o ⊂ A and B_o ⊂ B be semialgebras with σ(A_o) = A and σ(B_o) = B. Given a measure µ_1 on A and a measure µ_2 on B, there is a unique measure µ on A × B that has µ(A × B) = µ_1(A) µ_2(B) for A ∈ A_o and B ∈ B_o. The point of this exercise is that we can define Lebesgue measure on R^2 by the requirement that λ((a, b] × (c, d]) = (b − a)(d − c).

Using (6.1) and induction, it follows that if (Ω_i, F_i, µ_i), i = 1, . . . , n, are σ-finite measure spaces and Ω = Ω_1 × · · · × Ω_n, there is a unique measure µ on the σ-algebra F generated by sets of the form A_1 × · · · × A_n, A_i ∈ F_i, that has

µ(A_1 × · · · × A_n) = Π_{m=1}^n µ_m(A_m)

When (Ω_i, F_i, µ_i) = (R, R, λ) for all i, the result is Lebesgue measure on the Borel subsets of n-dimensional Euclidean space R^n.
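In the discrete case, (6.1) and the Fubini theorem proved next reduce to statements about finite sums that can be verified directly; the weights and f below are arbitrary toy data:

```python
# Product measure and iterated integrals on a finite X × Y.
X, Y = range(4), range(5)
mu1 = {x: 0.5 + x for x in X}           # arbitrary σ-finite weights on X
mu2 = {y: 1.0 / (y + 1) for y in Y}     # and on Y
f = {(x, y): (x - 1.5) * (y + 2) for x in X for y in Y}

product = sum(f[x, y] * mu1[x] * mu2[y] for x in X for y in Y)
iter_xy = sum(mu1[x] * sum(f[x, y] * mu2[y] for y in Y) for x in X)
iter_yx = sum(mu2[y] * sum(f[x, y] * mu1[x] for x in X) for y in Y)

assert abs(product - iter_xy) < 1e-9    # ∫∫ f dµ2 dµ1 = ∫ f d(µ1×µ2)
assert abs(product - iter_yx) < 1e-9    # = ∫∫ f dµ1 dµ2

# The product measure of a rectangle factors, as in (6.1):
A, B = {0, 2}, {1, 3}
rect = sum(mu1[x] * mu2[y] for x in A for y in B)
mA = sum(mu1[x] for x in A)
mB = sum(mu2[y] for y in B)
assert abs(rect - mA * mB) < 1e-9
```

For finite sums the interchange is just reordering terms; the substance of (6.2) is that the same conclusion survives the passage to limits, under f ≥ 0 or ∫|f| dµ < ∞.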
Returning to the case in which (Ω, F , µ) is the product of two measure
spaces, (X, A, µ) and (Y, B , ν ), our next goal is to prove:
(6.2) Fubini’s theorem. If f ≥ 0 or
(∗) f  dµ < ∞ then f (x, y ) µ2 (dy ) µ1 (dx) =
X Y f dµ =
X ×Y f (x, y ) µ1 (dx) µ2 (dy )
Y X Proof We will prove only the ﬁrst equality, since the second one is similar.
Two technical things that need to be proved before we can assert that the ﬁrst
integral makes sense are:
When x is fixed, y → f(x, y) is B-measurable.
x → ∫_Y f(x, y) µ2(dy) is A-measurable.

We begin with the case f = 1E. Let Ex = {y : (x, y) ∈ E} be the cross-section at x.
(6.3) Lemma. If E ∈ F then Ex ∈ B .
Proof  (E^c)x = (Ex)^c and (∪i Ei)x = ∪i (Ei)x, so if E is the collection of sets E for which Ex ∈ B, then E is a σ-algebra. Since E contains the rectangles, the result follows.
(6.4) Lemma. If E ∈ F then g(x) ≡ µ2(Ex) is A-measurable and

    ∫_X g dµ1 = µ(E)
Notice that it is not obvious that the collection of sets for which the conclusion
is true is a σ algebra since µ(E1 ∪ E2 ) = µ(E1 ) + µ(E2 ) − µ(E1 ∩ E2 ). Dynkin’s
π − λ theorem (2.1) was tailor-made for situations like this.
Proof of (6.4) If conclusions hold for En and En ↑ E , then (2.5) in Chapter
1 and the monotone convergence theorem imply that they hold for E . Since µ1
and µ2 are σ-finite, it is enough then to prove the result for E ⊂ A × B with
µ1 (A) < ∞ and µ2 (B ) < ∞, or taking Ω = A × B we can suppose without loss
of generality that µ(Ω) < ∞. Let L be the collection of sets E for which the
conclusions hold. We will now check that L is a λ-system. Property (i) of a λ-system is trivial. (iii) follows from the first sentence in the proof. To check
(ii) we observe that
µ2 ((A − B )x ) = µ2 (Ax − Bx ) = µ2 (Ax ) − µ2 (Bx )
and integrating over x gives the second conclusion. Since L contains the rectangles, a π-system that generates F, the desired result follows from the π − λ
theorem.
We are now ready to prove (6.2) by verifying it in four increasingly more
general special cases.
Case 1. If E ∈ F and f = 1E then (∗) follows from (6.4).
Case 2. Since each integral is linear in f , it follows that (∗) holds for simple
functions.
Case 3. Now if f ≥ 0 and we let fn(x) = ([2^n f(x)]/2^n) ∧ n, where [x] = the largest integer ≤ x, then the fn are simple and fn ↑ f, so it follows from the monotone convergence theorem that (∗) holds for all f ≥ 0.
Case 4. The general case now follows by writing f(x) = f(x)^+ − f(x)^− and applying Case 3 to f^+, f^−, and |f|.
To illustrate why the various hypotheses of (6.2) are needed, we will now
give some examples where the conclusion fails.
Example 6.1. Let X = Y = {1, 2, . . .} with A = B = all subsets and µ1 = µ2 = counting measure. For m ≥ 1, let f(m, m) = 1 and f(m + 1, m) = −1, and let f(m, n) = 0 otherwise. We claim that

    ∑m ∑n f(m, n) = 1   but   ∑n ∑m f(m, n) = 0
A picture is worth several dozen words:

         .   .   .   .
         0   0   0   1  −1  . . .
    n    0   0   1  −1   0  . . .
    ↑    0   1  −1   0   0  . . .
         1  −1   0   0   0  . . .
              m →

In words, if we sum the columns first, the first one gives us a 1 and the others 0, while if we sum the rows each one gives us a 0.
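Because each row and each column of this array has at most two nonzero entries, the inner sums can be computed exactly and the claim checked mechanically. A small Python sketch (the truncation of the outer sums at 100 terms is a harmless convenience, since all later terms vanish):

```python
# Discrete version of Example 6.1: f(m, m) = 1, f(m+1, m) = -1, else 0,
# on X = Y = {1, 2, ...} with counting measure.
def f(m, n):
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

# Inner sums are exact: each column/row has at most two nonzero entries.
def col_sum(m):  # sum over n of f(m, n); only n = m-1 and n = m can be nonzero
    return sum(f(m, n) for n in (m - 1, m) if n >= 1)

def row_sum(n):  # sum over m of f(m, n); only m = n and m = n+1 can be nonzero
    return f(n, n) + f(n + 1, n)

# Outer sums stabilize: only the first column contributes, and every row gives 0.
sum_cols_first = sum(col_sum(m) for m in range(1, 101))
sum_rows_first = sum(row_sum(n) for n in range(1, 101))
assert sum_cols_first == 1
assert sum_rows_first == 0
```

The point of the computation is exactly the one in the text: summing a finite square in either order gives the same answer, so the discrepancy only appears when the inner sums are taken over the full infinite index set.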
Example 6.2. Let X = (0, 1), Y = (1, ∞), both equipped with the Borel sets and Lebesgue measure. Let f(x, y) = e^{−xy} − 2e^{−2xy}. Then

    ∫_0^1 ∫_1^∞ f(x, y) dy dx = ∫_0^1 x^{−1}(e^{−x} − e^{−2x}) dx > 0

    ∫_1^∞ ∫_0^1 f(x, y) dx dy = ∫_1^∞ y^{−1}(e^{−2y} − e^{−y}) dy < 0

The next example indicates why µ1 and µ2 must be σ-finite.
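The inner integrals in Example 6.2 are elementary calculus; the opposite signs of the two iterated integrals can then be confirmed numerically. A rough sketch, using a midpoint rule and truncating the unbounded y-integral at 40 (both are arbitrary choices made here for illustration):

```python
import math

# Example 6.2: f(x, y) = e^{-xy} - 2 e^{-2xy} on (0,1) x (1, oo).
# The inner integrals are done in closed form (as in the text); the outer
# integrals are approximated with a crude midpoint rule.
def inner_over_y(x):   # integral over y in (1, oo) of f(x, y) dy
    return (math.exp(-x) - math.exp(-2 * x)) / x

def inner_over_x(y):   # integral over x in (0, 1) of f(x, y) dx
    return (math.exp(-2 * y) - math.exp(-y)) / y

n = 10_000
dy_dx_order = sum(inner_over_y((k + 0.5) / n) for k in range(n)) / n
# truncate the y-integral at 40; the neglected tail is exponentially small
dx_dy_order = sum(inner_over_x(1 + 39 * (k + 0.5) / n) * 39 / n for k in range(n))

assert dy_dx_order > 0 and dx_dy_order < 0
```

Integrating |f| over the strip is infinite, which is why Fubini's hypothesis (∗) fails and the two orders can disagree.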
Example 6.3. Let X = (0, 1) with A = the Borel sets and µ1 = Lebesgue measure. Let Y = (0, 1) with B = all subsets and µ2 = counting measure. Let f(x, y) = 1 if x = y and 0 otherwise. Then

    ∫_Y f(x, y) µ2(dy) = 1 for all x, so ∫_X ∫_Y f(x, y) µ2(dy) µ1(dx) = 1
    ∫_X f(x, y) µ1(dx) = 0 for all y, so ∫_Y ∫_X f(x, y) µ1(dx) µ2(dy) = 0

Our last example shows that measurability is important, or maybe that
some of the axioms of set theory are not as innocent as they seem.
Example 6.4. By the axiom of choice and the continuum hypothesis one can
deﬁne an order relation < on (0,1) so that {x : x < y } is countable for each
y . Let X = Y = (0, 1), let A = B = the Borel sets and µ1 = µ2 = Lebesgue
measure. Let f(x, y) = 1 if x < y, = 0 otherwise. Since {x : x < y} and {y : x < y}^c are countable,

    ∫_X f(x, y) µ1(dx) = 0 for all y    ∫_Y f(x, y) µ2(dy) = 1 for all x
We turn now to applications of (6.2).
Exercise 6.2. If ∫_X ∫_Y |f(x, y)| µ2(dy) µ1(dx) < ∞ then

    ∫_X ∫_Y f(x, y) µ2(dy) µ1(dx) = ∫_{X×Y} f d(µ1 × µ2) = ∫_Y ∫_X f(x, y) µ1(dx) µ2(dy)

Example 6.5. Let X = {1, 2, . . .}, A = all subsets of X, and µ1 = counting measure. If ∑n ∫ |fn| dµ < ∞ then ∫ ∑n fn dµ = ∑n ∫ fn dµ.
Exercise 6.3. Let g ≥ 0 be a measurable function on (X, A, µ). Use (6.2) to
conclude that
    ∫_X g dµ = (µ × λ)({(x, y) : 0 ≤ y < g(x)}) = ∫_0^∞ µ({x : g(x) > y}) dy
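For a discrete measure the two sides of this "layer cake" identity are finite sums, so the formula can be verified directly. A toy check with counting measure on a four-point set (the values of g are made up):

```python
# Exercise 6.3 checked on a toy example: counting measure on a finite set,
# g taking integer values, so both sides can be computed exactly.
vals = {"a": 3, "b": 1, "c": 0, "d": 2}   # a hypothetical g on X = {a, b, c, d}

lhs = sum(vals.values())                  # integral of g against counting measure
# The integrand y -> mu({x : g(x) > y}) is a step function, constant on each
# interval [k, k+1), so the integral over [0, oo) reduces to a finite sum.
rhs = sum(sum(1 for v in vals.values() if v > y)
          for y in range(max(vals.values())))
assert lhs == rhs == 6
```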
Exercise 6.4. Let F, G be Stieltjes measure functions and let µ, ν be the
corresponding measures on (R, R). Show that
(i) ∫_{(a,b]} {F(y) − F(a)} dG(y) = (µ × ν)({(x, y) : a < x ≤ y ≤ b})

(ii) ∫_{(a,b]} F(y) dG(y) + ∫_{(a,b]} G(y) dF(y) = F(b)G(b) − F(a)G(a) + ∑_{x∈(a,b]} µ({x})ν({x})

To see the second term is needed, let F(x) = G(x) = 1_{[0,∞)}(x) and a < 0 < b.
(iii) If F = G is continuous then ∫_{(a,b]} 2F(y) dF(y) = F(b)^2 − F(a)^2.

Exercise 6.5. Let µ be a finite measure on R and F(x) = µ((−∞, x]). Show that

    ∫ (F(x + c) − F(x)) dx = c µ(R)
Exercise 6.6. Show that e^{−xy} sin x is integrable in the strip 0 < x < a, 0 < y. Perform the double integral in the two orders to get:

    ∫_0^a (sin x)/x dx = π/2 − (cos a) ∫_0^∞ e^{−ay}/(1 + y^2) dy − (sin a) ∫_0^∞ y e^{−ay}/(1 + y^2) dy

and replace 1 + y^2 by 1 to conclude |∫_0^a (sin x)/x dx − π/2| ≤ 2/a for a ≥ 1.

A.7. Kolmogorov's Extension Theorem
To construct some of the basic objects of study in probability theory, we will
need an existence theorem for measures on inﬁnite product spaces. Let N =
{1, 2, . . .} and
RN = {(ω1 , ω2 , . . .) : ωi ∈ R}
We equip RN with the product σ-algebra RN, which is generated by the finite
dimensional rectangles = sets of the form {ω : ωi ∈ (ai , bi ] for i = 1, . . . , n},
where −∞ ≤ ai < bi ≤ ∞.
(7.1) Kolmogorov’s extension theorem. Suppose we are given probability
measures µn on (Rn , Rn ) that are consistent, that is,
µn+1 ((a1 , b1 ] × . . . × (an , bn ] × R) = µn ((a1 , b1 ] × . . . × (an , bn ])
Then there is a unique probability measure P on (RN , RN ) with
(∗) P (ω : ωi ∈ (ai , bi ], 1 ≤ i ≤ n) = µn ((a1 , b1 ] × . . . × (an , bn ])
An important example of a consistent sequence of measures is

Example 7.1. Let F1, F2, . . . be distribution functions and let µn be the measure on Rn with

    µn((a1, b1] × . . . × (an, bn]) = ∏_{m=1}^{n} (Fm(bm) − Fm(am))

In this case, if we let Xn(ω) = ωn, then the Xn are independent and Xn has
distribution Fn .
Proof of (7.1) Let S be the sets of the form {ω : ωi ∈ (ai , bi ], 1 ≤ i ≤ n},
and use (∗) to deﬁne P on S . S is a semialgebra, so by (1.3) it is enough to
show that if A ∈ S is a disjoint union of Ai ∈ S, then P(A) ≤ ∑i P(Ai). If the
union is ﬁnite, then all the Ai are determined by the values of a ﬁnite number
of coordinates and the conclusion follows from results in Section A.6.
Suppose now that the union is inﬁnite. Let A = { ﬁnite disjoint unions of
sets in S} be the algebra generated by S . Since A is an algebra (by (1.2))
    Bn ≡ A − ∪_{i=1}^{n} Ai

is a finite disjoint union of rectangles, and by the result for finite unions,

    P(A) = ∑_{i=1}^{n} P(Ai) + P(Bn)
It suﬃces then to show
(7.2) Lemma. If Bn ∈ A and Bn ↓ ∅ then P (Bn ) ↓ 0.
Proof Suppose P (Bn ) ↓ δ > 0. By repeating sets in the sequence, we can
suppose
    Bn = ∪_{k=1}^{Kn} {ω : ωi ∈ (a_i^k, b_i^k], 1 ≤ i ≤ n}   where −∞ ≤ a_i^k < b_i^k ≤ ∞
The strategy of the proof is to approximate the Bn from within by compact rectangles with almost the same probability and then use a diagonal argument to show that ∩n Bn ≠ ∅. There is a set Cn ⊂ Bn of the form
    Cn = ∪_{k=1}^{Kn} {ω : ωi ∈ [ā_i^k, b̄_i^k], 1 ≤ i ≤ n}   with −∞ < ā_i^k < b̄_i^k < ∞

that has P(Bn − Cn) ≤ δ/2^{n+1}. Let Dn = ∩_{m=1}^{n} Cm. Then

    P(Bn − Dn) ≤ ∑_{m=1}^{n} P(Bm − Cm) ≤ δ/2

so P(Dn) ↓ a limit ≥ δ/2. Now there are sets Cn^∗, Dn^∗ ⊂ Rn so that

    Cn = {ω : (ω1, . . . , ωn) ∈ Cn^∗} and Dn = {ω : (ω1, . . . , ωn) ∈ Dn^∗}

Note that

    Cn = Cn^∗ × R × R × . . .   and   Dn = Dn^∗ × R × R × . . .

so Cn and Cn^∗ (and Dn and Dn^∗) are closely related, but Cn ⊂ Ω and Cn^∗ ⊂ Rn. Cn^∗ is a finite union of closed rectangles, so

    Dn^∗ = Cn^∗ ∩ ∩_{m=1}^{n−1} (Cm^∗ × R^{n−m})

is a compact set. For each m, let ωm ∈ Dm. Dm ⊂ D1, so ωm,1 (i.e., the first coordinate of ωm) is in D1^∗. Since D1^∗ is compact, we can pick a subsequence
m(1, j ) ≥ j so that as j → ∞,
    ωm(1,j),1 → a limit θ1

For m ≥ 2, Dm ⊂ D2, and hence (ωm,1, ωm,2) ∈ D2^∗. Since D2^∗ is compact, we can pick a subsequence of the previous subsequence (i.e., m(2, j) = m(1, ij) with ij ≥ j) so that as j → ∞,

    ωm(2,j),2 → a limit θ2
Continuing in this way, we deﬁne m(k, j ) a subsequence of m(k − 1, j ) so that
as j → ∞,
ωm(k,j ),k → a limit θk
Let ωi = ωm(i,i). ωi is a subsequence of all the subsequences, so ωi,k → θk for all k. Now ωi,1 ∈ D1^∗ for all i ≥ 1 and D1^∗ is closed, so θ1 ∈ D1^∗. Turning to the second set, (ωi,1, ωi,2) ∈ D2^∗ for i ≥ 2 and D2^∗ is closed, so (θ1, θ2) ∈ D2^∗. Repeating the last argument, we conclude that (θ1, . . . , θk) ∈ Dk^∗ for all k, so ω = (θ1, θ2, . . .) ∈ Dk (no star here since we are now talking about subsets of Ω) for all k, and

    ∅ ≠ ∩k Dk ⊂ ∩k Bk

a contradiction that proves the desired result.

A.8. Radon-Nikodym Theorem
In this section, we prove the Radon-Nikodym theorem. To develop that result,
we begin with a topic that at ﬁrst may appear to be unrelated. Let (Ω, F ) be
a measurable space. α is said to be a signed measure on (Ω, F ) if (i) α takes
values in (−∞, ∞], (ii) α(∅) = 0, and (iii) if E = +i Ei is a disjoint union then
α(E) = ∑i α(Ei), in the following sense:

If α(E) < ∞, the sum converges absolutely and equals α(E).
If α(E) = ∞, then ∑i α(Ei)^− < ∞ and ∑i α(Ei)^+ = ∞.

Clearly, a signed measure cannot be allowed to take both the values ∞ and
−∞, since α(A) + α(B ) might not make sense. In most formulations, a signed
measure is allowed to take values in either (−∞, ∞] or [−∞, ∞). We will
ignore the second possibility to simplify statements later. As usual, we turn to
examples to help explain the deﬁnition.
Example 8.1. Let µ be a measure, f be a function with ∫ f^− dµ < ∞, and let α(A) = ∫_A f dµ. Exercise 5.8 implies that α is a signed measure.
Example 8.2. Let µ1 and µ2 be measures with µ2 (Ω) < ∞, and let α(A) =
µ1 (A) − µ2 (A).
The Jordan decomposition, (8.4) below, will show that Example 8.2 is the
general case. To derive that result, we begin with two deﬁnitions. A set A is
positive if every measurable B ⊂ A has α(B ) ≥ 0. A set A is negative if
every measurable B ⊂ A has α(B ) ≤ 0.
Exercise 8.1. In Example 8.1, A is positive if and only if µ(A ∩ {x : f (x) <
0}) = 0.
(8.1) Lemma. (i) Every measurable subset of a positive set is positive. (ii) If
the sets An are positive then A = ∪n An is also positive.
Proof (i) is trivial. To prove (ii), observe that
    Bn = An ∩ (∩_{m=1}^{n−1} Am^c) ⊂ An

are positive, disjoint, and ∪n Bn = ∪n An. Let E ⊂ A be measurable, and let En = E ∩ Bn. α(En) ≥ 0 since Bn is positive, so α(E) = ∑n α(En) ≥ 0.
The conclusions in (8.1) remain valid if positive is replaced by negative.
The next result is the key to the proof of (8.3).
(8.2) Lemma. Let E be a measurable set with α(E ) < 0. Then there is a
negative set F ⊂ E with α(F ) < 0.
Proof If E is negative, this is true. If not, let n1 be the smallest positive
integer so that there is an E1 ⊂ E with α(E1 ) ≥ 1/n1 . Let k ≥ 2. If Fk =
E −(E1 ∪. . .∪Ek−1 ) is negative, we are done. If not, we continue the construction
letting nk be the smallest positive integer so that there is an Ek ⊂ Fk with
α(Ek ) ≥ 1/nk . If the construction does not stop for any k < ∞, let
F = ∩k Fk = E − (∪k Ek )
Since 0 > α(E ) > −∞ and α(Ek ) ≥ 0, it follows from the deﬁnition of signed
measure that
    α(E) = α(F) + ∑_{k=1}^{∞} α(Ek)

α(F) ≤ α(E) < 0, and the sum is finite. From the last observation and the
construction, it follows that F can have no subset G with α(G) > 0, for then
α(G) ≥ 1/N for some N and we would have a contradiction.
(8.3) Hahn decomposition. Let α be a signed measure. Then there is a
positive set A and a negative set B so that Ω = A ∪ B and A ∩ B = ∅.
Proof Let c = inf {α(B ) : B is negative} ≤ 0. Let Bi be negative sets with
α(Bi ) ↓ c. Let B = ∪i Bi . By (8.1), B is negative, so by the deﬁnition of c,
α(B ) ≥ c. To prove α(B ) ≤ c, we observe that α(B ) = α(Bi ) + α(B − Bi ) ≤
α(Bi ), since B is negative, and let i → ∞. The last two inequalities show
that α(B ) = c, and it follows from our deﬁnition of a signed measure that
c > −∞. Let A = B^c. To show A is positive, observe that if A contains a set E with α(E) < 0, then by (8.2), it contains a negative set F with α(F) < 0, but
then B ∪ F would be a negative set that has α(B ∪ F ) = α(B ) + α(F ) < c, a
contradiction.
The Hahn decomposition is not unique. In Example 8.1, A can be any set
with
{x : f (x) > 0} ⊂ A ⊂ {x : f (x) ≥ 0} a.e.
where B ⊂ C a.e. means µ(B ∩ C c ) = 0. The last example is typical of the
general situation. Suppose Ω = A1 ∪B1 = A2 ∪B2 are two Hahn decompositions.
A2 ∩ B1 is positive and negative, so it is a null set: All its subsets have measure
0. Similarly, A1 ∩ B2 is a null set.
Two measures µ1 and µ2 are said to be mutually singular if there is a
set A with µ1 (A) = 0 and µ2 (Ac ) = 0. In this case, we also say µ1 is singular
with respect to µ2 and write µ1 ⊥ µ2 .
Exercise 8.2. Show that the uniform distribution on the Cantor set (Example
1.8 in Chapter 1) is singular with respect to Lebesgue measure.
(8.4) Jordan decomposition. Let α be a signed measure. There are mutually
singular measures α+ and α− so that α = α+ − α− . Moreover, there is only
one such pair.
Proof Let Ω = A ∪ B be a Hahn decomposition. Let
    α+(E) = α(E ∩ A) and α−(E) = −α(E ∩ B)

Since A is positive and B is negative, α+ and α− are measures. α+(Ac) = 0
and α− (A) = 0, so they are mutually singular. To prove uniqueness, suppose
α = ν1 − ν2 and D is a set with ν1 (D) = 0 and ν2 (Dc ) = 0. If we set C = Dc ,
then Ω = C ∪ D is a Hahn decomposition, and it follows from the choice of D
that
ν1 (E ) = α(C ∩ E ) and ν2 (E ) = −α(D ∩ E )
Our uniqueness result for the Hahn decomposition shows that A ∩ D = A ∩ C c
and B ∩ C = Ac ∩ C are null sets, so α(E ∩ C ) = α(E ∩ (A ∪ C )) = α(E ∩ A)
and ν1 = α+ .
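On a finite Ω a signed measure is just an assignment of (possibly negative) weights to points, and the Hahn and Jordan decompositions can be read off from the signs of the weights. A minimal sketch (the weights are made up for illustration):

```python
# A signed measure alpha on a finite Omega, given by point weights.
weights = {1: 2.0, 2: -0.5, 3: 0.0, 4: -1.5, 5: 3.0}

def alpha(E):
    return sum(weights[x] for x in E)

# Hahn decomposition: A = points with nonnegative weight, B = the rest.
A = {x for x, w in weights.items() if w >= 0}
B = set(weights) - A

# Jordan decomposition: alpha+(E) = alpha(E ∩ A), alpha-(E) = -alpha(E ∩ B).
def alpha_plus(E):
    return alpha(E & A)

def alpha_minus(E):
    return -alpha(E & B)

E = {1, 2, 4}
assert alpha(E) == alpha_plus(E) - alpha_minus(E)   # 0.0 == 2.0 - 2.0
assert alpha_plus(B) == 0 and alpha_minus(A) == 0   # mutual singularity
```

The non-uniqueness discussed above shows up here too: moving the zero-weight point 3 from A to B gives another valid Hahn decomposition.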
Exercise 8.3. Show that α+ (E ) = sup{α(F ) : F ⊂ E }.
Remark. Let α be a ﬁnite signed measure (i.e., one that does not take the
value ∞ or −∞) on (R, R). Let α = α+ − α− be its Jordan decomposition. Let
A(x) = α((−∞, x]), F(x) = α+((−∞, x]), and G(x) = α−((−∞, x]). A(x) = F(x) − G(x), so the distribution function of a finite signed measure can be written as a difference of two bounded increasing functions. It follows from Example 8.2 that the converse is also true. Let |α| = α+ + α−. |α| is called the total variation of α, since in this example |α|((a, b]) is the total variation of A over (a, b] as defined in analysis textbooks. See, for example, Royden (1988), p. 103. We exclude the left endpoint of the interval since a jump there makes no contribution to the total variation on [a, b], but it does appear in |α|.
Our third and ﬁnal decomposition is:
(8.5) Lebesgue decomposition. Let µ, ν be σ-finite measures. ν can be
written as νr + νs , where νs is singular with respect to µ and
    νr(E) = ∫_E g dµ

Proof  By decomposing Ω = +i Ωi, we can suppose without loss of generality that µ and ν are finite measures. Let G be the set of g ≥ 0 so that ∫_E g dµ ≤ ν(E) for all E.
(a) If g, h ∈ G then g ∨ h ∈ G .
Proof of (a)  Let A = {g > h}, B = {g ≤ h}. Then

    ∫_E g ∨ h dµ = ∫_{E∩A} g dµ + ∫_{E∩B} h dµ ≤ ν(E ∩ A) + ν(E ∩ B) = ν(E)

Let κ = sup{∫ g dµ : g ∈ G} ≤ ν(Ω) < ∞. Pick gn so that ∫ gn dµ > κ − 1/n and let hn = g1 ∨ . . . ∨ gn. By (a), hn ∈ G. As n ↑ ∞, hn ↑ h. The definition of κ, the monotone convergence theorem, and the choice of gn imply that

    κ ≥ ∫ h dµ = lim_{n→∞} ∫ hn dµ ≥ lim_{n→∞} ∫ gn dµ = κ

Let νr(E) = ∫_E h dµ and νs(E) = ν(E) − νr(E). The last detail is to show:

(b) νs is singular with respect to µ.
Proof of (b)  Let ε > 0 and let Ω = Aε ∪ Bε be a Hahn decomposition for νs − εµ. Using the definition of νr and then the fact that Aε is positive for νs − εµ (so εµ(Aε ∩ E) ≤ νs(Aε ∩ E)),

    ∫_E (h + ε1_{Aε}) dµ = νr(E) + εµ(Aε ∩ E) ≤ ν(E)

This holds for all E, so k = h + ε1_{Aε} ∈ G. It follows that µ(Aε) = 0, for if not, then ∫ k dµ > κ, a contradiction. Letting A = ∪n A_{1/n}, we have µ(A) = 0. To see that νs(Ac) = 0, observe that if νs(Ac) > 0, then (νs − εµ)(Ac) > 0 for small ε, a contradiction since Ac ⊂ Bε, a negative set.
Exercise 8.4. Prove that the Lebesgue decomposition is unique. Note that
you can suppose without loss of generality that µ and ν are ﬁnite.
We are finally ready for the main business of the section. We say a measure ν is absolutely continuous with respect to µ (and write ν ≪ µ) if µ(A) = 0 implies that ν(A) = 0.

Exercise 8.5. If µ1 ≪ µ2 and µ2 ⊥ ν then µ1 ⊥ ν.
(8.6) Radon-Nikodym theorem. If µ, ν are σ-finite measures and ν is absolutely continuous with respect to µ, then there is a g ≥ 0 so that ν(E) = ∫_E g dµ. If h is another such function then g = h µ-a.e.

Proof  Let ν = νr + νs be any Lebesgue decomposition. Let A be chosen so that νs(Ac) = 0 and µ(A) = 0. Since ν ≪ µ, 0 = ν(A) ≥ νs(A) and νs ≡ 0. To prove uniqueness, observe that if ∫_E g dµ = ∫_E h dµ for all E, then letting E ⊂ {g > h, g ≤ n} be any subset of finite measure, we conclude µ(g > h, g ≤ n) = 0 for all n, so µ(g > h) = 0 and, similarly, µ(g < h) = 0.
Example 8.3. (8.6) may fail if µ is not σ-finite. Let (Ω, F) = (R, R), µ =
counting measure and ν = Lebesgue measure.
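When µ and ν are given by weights on a countable set, the Lebesgue decomposition and the Radon-Nikodym derivative are pointwise formulas: g(x) = ν({x})/µ({x}) where µ({x}) > 0, and νs lives on {x : µ({x}) = 0}. A sketch with made-up weights:

```python
# Discrete mu and nu on a finite set; mu gives no mass to "d",
# so nu's mass there is the singular part.
mu = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 0.0}
nu = {"a": 3.0, "b": 0.0, "c": 1.0, "d": 5.0}

# Radon-Nikodym derivative of the absolutely continuous part:
g = {x: (nu[x] / mu[x] if mu[x] > 0 else 0.0) for x in mu}

# Lebesgue decomposition nu = nu_r + nu_s:
def nu_r(E):
    return sum(g[x] * mu[x] for x in E)          # integral of g over E against mu

def nu_s(E):
    return sum(nu[x] for x in E if mu[x] == 0)   # nu restricted to {mu = 0}

E = {"a", "c", "d"}
assert nu_r(E) + nu_s(E) == sum(nu[x] for x in E)   # 4.0 + 5.0 == 9.0
assert nu_s({x for x in mu if mu[x] > 0}) == 0.0    # nu_s is singular w.r.t. mu
```

If every point with ν-mass also carries µ-mass, νs vanishes and g is exactly the density whose existence (8.6) asserts.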
The function g whose existence is proved in (8.6) is often denoted dν/dµ.
This notation suggests the following properties, whose proofs are left to the
reader.
Exercise 8.6. If ν1, ν2 ≪ µ then ν1 + ν2 ≪ µ and

    d(ν1 + ν2)/dµ = dν1/dµ + dν2/dµ

Exercise 8.7. If ν ≪ µ and f ≥ 0 then

    ∫ f dν = ∫ f (dν/dµ) dµ

Exercise 8.8. If π ≪ ν ≪ µ then dπ/dµ = (dπ/dν) · (dν/dµ).

Exercise 8.9. If ν ≪ µ and µ ≪ ν then dµ/dν = (dν/dµ)^{−1}.

A.9. Differentiating Under the Integral
At several places in the text, we need to differentiate inside a sum
or an integral. This section is devoted to results that can be used to justify
those computations.
(9.1) Theorem. Let (S, S , µ) be a measure space. Let f be a complex valued
function deﬁned on R × S . Let δ > 0, and suppose that for x ∈ (y − δ, y + δ )
we have
(i) u(x) = ∫_S f(x, s) µ(ds) with ∫_S |f(x, s)| µ(ds) < ∞,
(ii) for fixed s, ∂f/∂x(x, s) exists and is a continuous function of x,
(iii) v(x) = ∫_S ∂f/∂x(x, s) µ(ds) is continuous at x = y, and
(iv) ∫_S ∫_{−δ}^{δ} |∂f/∂x(y + θ, s)| dθ µ(ds) < ∞

then u′(y) = v(y).
Proof  Letting |h| ≤ δ and using (i), (ii), (iv), and Fubini's theorem in the form given in Exercise 6.2, we have

    u(y + h) − u(y) = ∫_S f(y + h, s) − f(y, s) µ(ds)
                    = ∫_S ∫_0^h ∂f/∂x(y + θ, s) dθ µ(ds)
                    = ∫_0^h ∫_S ∂f/∂x(y + θ, s) µ(ds) dθ

The last equation implies

    (u(y + h) − u(y))/h = (1/h) ∫_0^h v(y + θ) dθ

Since v is continuous at y by (iii), letting h → 0 gives the desired result.
Example 9.1. For a result in Section 2.3, we need to know that we can
diﬀerentiate under the integral sign in
    u(x) = ∫ cos(xs) e^{−s^2/2} ds

For convenience, we have dropped a factor (2π)^{−1/2} and changed variables to match (9.1). Clearly, (i) and (ii) hold. The dominated convergence theorem implies that

    x → ∫ −s sin(sx) e^{−s^2/2} ds

is continuous, so (iii) holds. For (iv), we note that

    ∫ |∂f/∂x(x, s)| ds = ∫ |s| e^{−s^2/2} ds < ∞

and the value does not depend on x, so (iv) holds.
To prepare for the next example, we introduce:
(9.2) Lemma. For (iii) and (iv) to hold, it is sufficient that we have (ii) and

(iii′)    ∫_S sup_{θ∈[−δ,δ]} |∂f/∂x(y + θ, s)| µ(ds) < ∞

Proof  Since

    ∫_{−δ}^{δ} |∂f/∂x(y + θ, s)| dθ ≤ 2δ sup_{θ∈[−δ,δ]} |∂f/∂x(y + θ, s)|

it is clear that (iv) holds. To check (iii), we note that

    |v(x) − v(y)| ≤ ∫_S |∂f/∂x(x, s) − ∂f/∂x(y, s)| µ(ds)

(ii) implies that the integrand → 0 as x → y. The desired result follows from (iii′) and the dominated convergence theorem.
To indicate the usefulness of the new result, we prove:
(9.3) Theorem. If ϕ(θ) = E e^{θZ} < ∞ for θ ∈ [−ε, ε] then ϕ′(0) = EZ.

Proof  Here θ plays the role of x, and we take µ to be the distribution of Z. Let δ = ε/2. f(x, s) = e^{xs} ≥ 0, so (i) holds by assumption. ∂f/∂x = s e^{xs} is clearly a continuous function, so (ii) holds. To check (iii′), we note that there is a constant C so that if x ∈ (−δ, δ), then |s| e^{xs} ≤ C(e^{−εs} + e^{εs}).
Taking S = Z with S = all subsets of S and µ = counting measure in (9.1) and using (9.2) gives the following:
(9.4) Corollary. Let δ > 0. Suppose that for x ∈ (y − δ, y + δ) we have

(i) u(x) = ∑_{n=1}^{∞} fn(x) with ∑_{n=1}^{∞} |fn(x)| < ∞,
(ii) for each n, fn′(x) exists and is a continuous function of x, and
(iii) ∑_{n=1}^{∞} sup_{θ∈(−δ,δ)} |fn′(y + θ)| < ∞

then u′(x) = v(x).
Example 9.2. We want to show that if p ∈ (0, 1) then

    (d/dp) ∑_{n=1}^{∞} (1 − p)^n = −∑_{n=1}^{∞} n(1 − p)^{n−1}

Let fn(x) = (1 − x)^n, y = p, and pick δ so that [y − δ, y + δ] ⊂ (0, 1). Clearly (i) ∑_{n=1}^{∞} |(1 − x)^n| < ∞ and (ii) fn′(x) = −n(1 − x)^{n−1} is continuous for x in [y − δ, y + δ]. To check (iii), we note that if we let 2η = y − δ then there is a constant C so that if x ∈ [y − δ, y + δ] and n ≥ 1, then

    n(1 − x)^{n−1} = [n(1 − x)^{n−1}/(1 − η)^{n−1}] · (1 − η)^{n−1} ≤ C(1 − η)^{n−1}
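The conclusion of Example 9.2 can be confirmed numerically at a sample value of p, using the closed forms ∑ (1 − p)^n = (1 − p)/p and ∑ n(1 − p)^{n−1} = 1/p^2 (the truncation level and difference step below are arbitrary choices):

```python
# Example 9.2 checked at p = 0.3: term-by-term differentiation of the
# geometric series sum_{n>=1} (1-p)^n = (1-p)/p gives
# sum_{n>=1} n (1-p)^{n-1} = 1/p^2.
p = 0.3
N = 200   # truncation; (1-p)^N is ~ 1e-31, so the tail is negligible
term_by_term = sum(n * (1 - p) ** (n - 1) for n in range(1, N + 1))
assert abs(term_by_term - 1 / p ** 2) < 1e-9

# Compare with a symmetric difference quotient of the closed-form sum:
h = 1e-6
closed = lambda q: (1 - q) / q
finite_diff = (closed(p + h) - closed(p - h)) / (2 * h)
assert abs(finite_diff + term_by_term) < 1e-3   # the derivative is -1/p^2
```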
References

D. Aldous and P. Diaconis (1986) Shuffling cards and stopping times. Amer.
Math. Monthly. 93, 333–348
P.H. Algoet and T.M. Cover (1988) A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann. Probab. 16, 899–909
E.S. Andersen and B. Jessen (1948) On the introduction of measures in infinite product spaces. Danske Vid. Selsk. Mat.-Fys. Medd. 25, No. 4
D.V. Anosov (1963) Ergodic properties of geodesic ﬂows on closed Riemannian
manifolds of negative curvature. Soviet Math. Doklady 4, 1153–1156
D.V. Anosov (1967) Geodesic ﬂows on compact Riemannian manifolds of negative curvature. Proceedings of the Steklov. Inst. of Math, No. 90
K. Athreya and P. Ney (1972) Branching Processes. Springer-Verlag, New York
K. Athreya and P. Ney (1978) A new approach to the limit theory of recurrent
Markov chains. Trans. AMS. 245, 493–501
K. Athreya, D. McDonald, and P. Ney (1978) Coupling and the renewal theorem. Amer. Math. Monthly. 85, 809–814
L. Bachelier (1900) Théorie de la spéculation. Ann. Sci. École Norm. Sup. 17, 21–86
R. Ballerini and S. Resnick (1985) Records from improving populations. J. Appl.
Probab. 22, 487–502
R. Ballerini and S. Resnick (1987) Records in the presence of a linear trend.
Adv. Appl. Probab. 19, 801–828
S. Banach and A. Tarski (1924) Sur la décomposition des ensembles de points en parties respectivement congruentes. Fund. Math. 6, 244–277
L.E. Baum and P. Billingsley (1966) Asymptotic distributions for the coupon
collector’s problem. Ann. Math. Statist. 36, 1835–1839
F. Benford (1938) The law of anomalous numbers. Proc. Amer. Phil. Soc. 78,
552–572
J.D. Biggins (1977) Chernoﬀ’s theorem in branching random walk. J. Appl.
Probab. 14, 630–636
J.D. Biggins (1978) The asymptotic shape of branching random walk. Adv. in
Appl. Probab. 10, 62–84
J.D. Biggins (1979) Growth rates in branching random walk. Z. Warsch. verw.
Gebiete. 48, 17–34
P. Billingsley (1961) The Lindeberg-Lévy theorem for martingales. Proc. AMS 12, 788–792
P. Billingsley (1968) Convergence of probability measures. John Wiley and Sons,
New York
P. Billingsley (1979) Probability and measure. John Wiley and Sons, New York
G.D. Birkhoﬀ (1931) Proof of the ergodic theorem. Proc. Nat. Acad. Sci. 17,
656–660
D. Blackwell and D. Freedman (1964) The tail σ ﬁeld of a Markov chain and a
theorem of Orey. Ann. Math. Statist. 35, 1291–1295
R.M. Blumenthal and R.K. Getoor (1968) Markov processes and their potential
theory. Academic Press, New York
E. Borel (1909) Les probabilités dénombrables et leurs applications arithmétiques. Rend. Circ. Mat. Palermo 27, 247–271
L. Breiman (1957) The individual ergodic theorem of ergodic theory. Ann. Math.
Statist. 28 (1957), 809–811, Correction 31, 809–810
L. Breiman (1968) Probability. AddisonWesley, Reading, MA
K.L. Chung (1974) A Course in Probability Theory, second edition. Academic
Press, New York.
K.L. Chung, P. Erdős, and T. Sirao (1959) On the Lipschitz's condition for Brownian motion. J. Math. Soc. Japan 11, 263–274
K.L. Chung and W.H.J. Fuchs (1951) On the distribution of values of sums of
independent random variables. Memoirs of the AMS, No. 6
V. Chvátal and D. Sankoff (1975) Longest common subsequences of two random sequences. J. Appl. Probab. 12, 306–315
J. Cohen, H. Kesten, and C. Newman (1985) Random matrices and their applications. AMS Contemporary Math. 50, Providence, RI
J.T. Cox and R. Durrett (1981) Limit theorems for percolation processes with necessary and sufficient conditions. Ann. Probab. 9, 583–603
B. Davis (1983) On Brownian slow points. Z. Warsch. verw. Gebiete. 64, 359–
367
Y. Derriennic (1983) Un théorème ergodique presque sous-additif. Ann. Probab. 11, 669–677
P. Diaconis and D. Freedman (1980) Finite exchangeable sequences. Ann. Prob.
8, 745–764
J. Dieudonné (1948) Sur le théorème de Lebesgue-Nikodym, II. Ann. Univ. Grenoble 23, 25–53
M. Donsker (1951) An invariance principle for certain probability limit theorems. Memoirs of the AMS, No. 6
M. Donsker (1952) Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 23, 277–281
J.L. Doob (1949) A heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 20, 393–403
J.L. Doob (1953) Stochastic Processes. John Wiley and Sons, New York
L.E. Dubins (1968) On a theorem of Skorokhod. Ann. Math. Statist. 39, 2094–2097
L.E. Dubins and D.A. Freedman (1965) A sharper form of the Borel-Cantelli lemma and the strong law. Ann. Math. Statist. 36, 800–807
L.E. Dubins and D.A. Freedman (1979) Exchangeable processes need not be
distributed mixtures of independent and identically distributed random
variables. Z. Warsch. verw. Gebiete. 48, 115–132
R.M. Dudley (1989) Real Analysis and Probability. Wadsworth Pub. Co., Paciﬁc
Grove, CA
R. Durrett and S. Resnick (1978) Functional limit theorems for dependent random variables. Ann. Probab. 6, 829–846
A. Dvoretsky (1972) Asymptotic normality for sums of dependent random variables. Proc. 6th Berkeley Symp., Vol. II, 513–535
A. Dvoretsky, P. Erdős, and S. Kakutani (1961) Nonincrease everywhere of the Brownian motion process. Proc. 4th Berkeley Symp., Vol. II, 103–116
E.B. Dynkin (1965) Markov processes. Springer-Verlag, New York
P. Erdős (1942) On the law of the iterated logarithm. Ann. Math. 43, 419–436
P. Erdős and M. Kac (1946) On certain limit theorems of the theory of probability. Bull. AMS 52, 292–302
P. Erdős and M. Kac (1947) On the number of positive sums of independent random variables. Bull. AMS 53, 1011–1020
N. Etemadi (1981) An elementary proof of the strong law of large numbers. Z.
Warsch. verw. Gebiete. 55, 119–122
A.M. Faden (1985) The existence of regular conditional probabilities: necessary
and suﬃcient conditions. Ann. Probab. 13, 288–298
M. Fekete (1923) Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten. Math. Z. 17, 228–249
W. Feller (1943) The general form of the socalled law of the iterated logarithm.
Trans. AMS. 54, 373–402
W. Feller (1946) A limit theorem for random variables with inﬁnite moments.
Amer. J. Math. 68, 257–262
W. Feller (1961) A simple proof of renewal theorems. Comm. Pure Appl. Math.
14, 285–293
W. Feller (1968) An introduction to probability theory and its applications, Vol.
I, third edition. John Wiley and Sons, New York
W. Feller (1971) An introduction to probability theory and its applications, Vol.
II, second edition. John Wiley and Sons, New York
D. Freedman (1965) Bernard Friedman's urn. Ann. Math. Statist. 36, 956–970
D. Freedman (1971a) Brownian motion and diffusion. Originally published by Holden Day, San Francisco, CA. Second edition by Springer-Verlag, New York
D. Freedman (1971b) Markov chains. Originally published by Holden Day, San Francisco, CA. Second edition by Springer-Verlag, New York
D. Freedman (1980) A mixture of independent and identically distributed random variables need not admit a regular conditional probability given the
exchangeable σ-field. Z. Warsch. verw. Gebiete 51, 239–248
R.M. French (1988) The Banach-Tarski theorem. Math. Intelligencer 10, No. 4, 21–28
B. Friedman (1949) A simple urn model. Comm. Pure Appl. Math. 2, 59–70
H. Furstenburg (1970) Random walks in discrete subgroups of Lie Groups. In
Advances in Probability edited by P.E. Ney.
H. Furstenburg and H. Kesten (1960) Products of random matrices. Ann. Math.
Statist. 31, 451–469
A. Garsia (1965) A simple proof of E. Hopf’s maximal ergodic theorem. J.
Math. Mech. 14, 381–382
M.L. Glasser and I.J. Zucker (1977) Extended Watson integrals for the cubic
lattice. Proc. Nat. Acad. Sci. 74, 1800–1801
B.V. Gnedenko (1943) Sur la distribution limite du terme maximum d'une série aléatoire. Ann. Math. 44, 423–453
B.V. Gnedenko and A.V. Kolmogorov (1954) Limit distributions for sums of
independent random variables. AddisonWesley, Reading, MA
M.I. Gordin (1969) The central limit theorem for stationary processes. Soviet
Math. Doklady. 10, 1174–1176
P. Hall (1982) Rates of convergence in the central limit theorem. Pitman Pub.
Co., Boston, MA
P. Hall and C.C. Heyde (1976) On a unified approach to the law of the iterated logarithm for martingales. Bull. Austral. Math. Soc. 14, 435–447
P. Hall and C.C. Heyde (1980) Martingale limit theory and its application.
Academic Press, New York
P.R. Halmos (1950) Measure theory. Van Nostrand, New York
J.M. Hammersley (1970) A few seedlings of research. Proc. 6th Berkeley Symp.,
Vol. I, 345–394
G.H. Hardy and J.E. Littlewood (1914) Some problems of Diophantine approximation. Acta Math. 37, 155–239
G.H. Hardy and E.M. Wright (1959) An introduction to the theory of numbers,
fourth edition. Oxford University Press, London
T.E. Harris (1956) The existence of stationary measures for certain Markov
processes. Proc. 3rd Berkeley Symp., Vol. II, 113–124
P. Hartman and A. Wintner (1941) On the law of the iterated logarithm. Amer.
J. Math. 63, 169–176
F. Hausdorff (1913) Grundzüge der Mengenlehre. Veit, Leipzig
E. Hewitt and L.J. Savage (1956) Symmetric measures on Cartesian products.
Trans. AMS 80, 470–501
E. Hewitt and K. Stromberg (1965) Real and abstract analysis. Springer-Verlag, New York
C.C. Heyde (1963) On a property of the lognormal distribution. J. Royal. Stat.
Soc. B. 29, 392–393
C.C. Heyde (1967) On the inﬂuence of moments on the rate of convergence to
the normal distribution. Z. Warsch. verw. Gebiete. 8, 12–18
C.C. Heyde and D.J. Scott (1973) Invariance principles for the law of the iterated logarithm for martingales and for processes with stationary increments. Ann. Probab. 1, 428–436
J.L. Hodges, Jr. and L. Le Cam (1960) The Poisson approximation to the
binomial distribution. Ann. Math. Statist. 31, 737–740
G. Hunt (1956) Some theorems concerning Brownian motion. Trans. AMS 81,
294–319
I.A. Ibragimov (1962) Some limit theorems for stationary processes. Theory
Probab. Appl. 7, 349–382
I.A. Ibragimov (1963) A central limit theorem for a class of dependent random
variables. Theory Probab. Appl. 8, 83–89
I.A. Ibragimov and Y. V. Linnik (1971) Independent and stationary sequences
of random variables. WoltersNoordhoﬀ, Groningen
H. Ishitani (1977) A central limit theorem for the subadditive process and its
application to products of random matrices. RIMS, Kyoto. 12, 565–575
K. Itô and H.P. McKean (1965) Diffusion processes and their sample paths. Springer-Verlag, New York
M. Kac (1947a) Brownian motion and the theory of random walk. Amer. Math.
Monthly. 54, 369–391
M. Kac (1947b) On the notion of recurrence in discrete stochastic processes.
Bull. AMS. 53, 1002–1010
M. Kac (1959) Statistical independence in probability, analysis, and number
theory. Carus Monographs, Math. Assoc. of America
Y. Katznelson and B. Weiss (1982) A simple proof of some ergodic theorems.
Israel J. Math. 42, 291–296
E. Keeler and J. Spencer (1975) Optimal doubling in backgammon. Operations
Research. 23, 1063–1071
H. Kesten (1986) Aspects of first passage percolation. In École d'été de probabilités de Saint-Flour XIV. Lecture Notes in Math 1180, Springer-Verlag, New York
H. Kesten (1987) Percolation theory and ﬁrst passage percolation. Ann. Probab.
15, 1231–1271
A. Khintchine (1923) Über dyadische Brüche. Math. Z. 18, 109–116
A. Khintchine (1924) Über einen Satz der Wahrscheinlichkeitsrechnung. Fund. Math. 6, 9–20
J. Keilson and D.M.G. Wishart (1964) A central limit theorem for processes
deﬁned on a Markov chain. Proc. Camb. Phil. Soc. 60, 547–567
J.F.C. Kingman (1968) The ergodic theory of subadditive processes. J. Roy.
Stat. Soc. B 30, 499–510
J.F.C. Kingman (1973) Subadditive ergodic theory. Ann. Probab. 1, 883–909
J.F.C. Kingman (1975) The ﬁrst birth problem for age dependent branching
processes. Ann. Probab. 3, 790–801
A.N. Kolmogorov (1929) Über das Gesetz des iterierten Logarithmus. Math. Ann. 101, 126–135
A.N. Kolmogorov and Y.A. Rozanov (1964) On strong mixing conditions for
stationary Gaussian processes. Theory Probab. Appl. 5, 204–208
K. Kondo and T. Hara (1987) Critical exponent of susceptibility for a general
class of ferromagnets in d > 4 dimensions. J. Math. Phys. 28, 1206–1208
U. Krengel (1985) Ergodic theorems. de Gruyter, New York
A. Lasota and M.C. Mackey (1985) Probabilistic properties of deterministic systems. Cambridge Univ. Press, London
S. Leventhal (1988) A proof of Liggett’s version of the subadditive ergodic
theorem. Proc. AMS. 102, 169–173
P. Lévy (1937) Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris
T.M. Liggett (1985) An improved subadditive ergodic theorem. Ann. Probab.
13, 1279–1285
A. Lindenbaum (1926) Contributions à l'étude de l'espace métrique. Fund. Math. 8, 209–222
T. Lindvall (1977) A probabilistic proof of Blackwell’s renewal theorem. Ann.
Probab. 5, 482–485
J.E. Littlewood (1944) Lectures on the theory of functions. Oxford U. Press,
London
B.F. Logan and L.A. Shepp (1977) A variational problem for random Young
tableaux. Adv. in Math. 26, 206–222
H.P. McKean (1969) Stochastic integrals. Academic Press, New York
B. McMillan (1953) The basic theorems of information theory. Ann. Math.
Statist. 24, 196–219
M. Motoo (1959) Proof of the law of the iterated logarithm through diﬀusion
equation. Ann. Inst. Stat. Math. 10, 21–28
J. Neveu (1965) Mathematical foundations of the calculus of probabilities.
HoldenDay, San Francisco, CA
J. Neveu (1975) Discrete parameter martingales. North Holland, Amsterdam
S. Newcomb (1881) Note on the frequency of use of the diﬀerent digits in natural
numbers. Amer. J. Math. 4, 39–40
G. O’Brien (1974) Limit theorems for sums of chain dependent processes. J.
Appl. Probab. 11, 582–587
D. Ornstein (1969) Random walks. Trans. AMS. 138, 1–60
V.I. Oseledec (1968) A multiplicative ergodic theorem. Lyapunov characteristic numbers for dynamical systems. Trans. Moscow Math. Soc. 19, 197–231
R.E.A.C. Paley, N. Wiener and A. Zygmund (1933) Notes on random functions.
Math. Z. 37, 647–668
E.J.G. Pitman (1956) On derivatives of characteristic functions at the origin.
Ann. Math. Statist. 27, 1156–1160
S.C. Port and C.J. Stone (1969) Potential theory of random walks on abelian
groups. Acta Math. 122, 19–114
M.S. Raghunathan (1979) A proof of Oseledec's multiplicative ergodic theorem. Israel J. Math. 32, 356–362
R. Raimi (1976) The ﬁrst digit problem. Amer. Math. Monthly. 83, 521–538
S. Resnick (1987) Extreme values, regular variation, and point processes.
Springer-Verlag, New York
D. Revuz (1984) Markov chains, second edition. North Holland, Amsterdam
D.H. Root (1969) The existence of certain stopping times on Brownian motion.
Ann. Math. Statist. 40, 715–718
M. Rosenblatt (1956) A central limit theorem and a strong mixing condition.
Proc. Nat. Acad. Sci. 42, 43–47
H. Royden (1988) Real analysis, third edition. Macmillan, New York
D. Ruelle (1979) Ergodic theory of diﬀerentiable dynamical systems. IHES Pub.
Math. 50, 275–306
C. RyllNardzewski (1951) On the ergodic theorems, II. Studia Math. 12, 74–79
L.J. Savage (1972) The foundations of statistics, second edition. Dover, New
York
D.J. Scott (1973) Central limit theorems for martingales and for processes with
stationary independent increments using a Skorokhod representation approach. Adv. Appl. Probab. 5, 119–137
C. Shannon (1948) A mathematical theory of communication. Bell Systems
Tech. J. 27, 379–423
L.A. Shepp (1964) Recurrent random walks may take arbitrarily large steps.
Bull. AMS. 70, 540–542
S. Sheu (1986) Representing a distribution by stopping Brownian motion:
Root’s construction. Bull. Austral. Math. Soc. 34, 427–431
N.V. Smirnov (1949) Limit distributions for the terms of a variational series.
AMS Transl. Series. 1, No. 67
R. Smythe and J.C. Wierman (1978) First passage percolation on the square
lattice. Lecture Notes in Math 671, Springer-Verlag, New York
R.M. Solovay (1970) A model of set theory in which every set of reals is Lebesgue
measurable. Ann. Math. 92, 1–56
F. Spitzer (1964) Principles of random walk. Van Nostrand, Princeton, NJ
C. Stein (1987) Approximate computation of expectations. IMS Lecture Notes
Vol. 7
H. Steinhaus (1922) Les probabilités dénombrables et leur rapport à la théorie de la mesure. Fund. Math. 4, 286–310
C.J. Stone (1969) On the potential operator for one dimensional recurrent random walks. Trans. AMS. 136, 427–445
J. Stoyanov (1987) Counterexamples in probability. John Wiley and Sons, New
York
V. Strassen (1964) An invariance principle for the law of the iterated logarithm.
Z. Warsch. verw. Gebiete 3, 211–226
V. Strassen (1965) A converse to the law of the iterated logarithm. Z. Warsch.
verw. Gebiete. 4, 265–268
V. Strassen (1967) Almost sure behavior of the sums of independent random
variables and martingales. Proc. 5th Berkeley Symp., Vol. II, 315–343
H. Thorisson (1987) A complete coupling proof of Blackwell’s renewal theorem.
Stoch. Proc. Appl. 26, 87–97
P. van Beek (1972) An application of Fourier methods to the problem of sharpening the Berry-Esseen inequality. Z. Warsch. verw. Gebiete. 23, 187–196
A.M. Vershik and S.V. Kerov (1977) Asymptotic behavior of the Plancherel
measure of the symmetric group and the limit form of random Young
tableau. Dokl. Akad. Nauk SSSR 233, 1024–1027
H. Wegner (1973) On consistency of probability measures. Z. Warsch. verw.
Gebiete 27, 335–338
L. Weiss (1955) The stochastic convergence of a function of sample successive
diﬀerences. Ann. Math. Statist. 26, 532–536
N. Wiener (1923) Diﬀerential space. J. Math. Phys. 2, 131–174
K. Yosida and S. Kakutani (1939) Birkhoﬀ’s ergodic theorem and the maximal
ergodic theorem. Proc. Imp. Acad. Tokyo 15, 165–168

Notation

N  natural numbers 1, 2, . . .
Z  integers
Q  rational numbers
R  real numbers
C  complex numbers

Real Numbers
[x]  integer part of x, the largest integer n ≤ x
x ∧ y  minimum of x and y
x ∨ y  maximum of x and y
x+  positive part, x ∨ 0
x−  negative part, (−x) ∨ 0
sgn(x)  the sign of x: 1 if x > 0, −1 if x < 0, 0 if x = 0
xn → x  lim_{n→∞} xn = x
an ↑  a1 ≤ a2 ≤ . . .
an ↓ a  a1 ≥ a2 ≥ . . . and an → a
∼  asymptotically; an ∼ bn means an/bn → 1 as n → ∞
O  f(t) is O(t^2) as t → 0 means lim sup_{t→0} f(t)/t^2 < ∞
o  f(t) is o(t) as t → 0 means f(t)/t → 0 as t → 0

Complex Numbers, z = a + bi
z̄  complex conjugate, = a − bi
|z|  modulus, = (a^2 + b^2)^{1/2}
Re z  real part of z, = a
Im z  imaginary part of z, = b

Vectors
|x|  the length of x, (x_1^2 + · · · + x_d^2)^{1/2}
|x|_1  the L^1 norm, |x_1| + · · · + |x_d|
x · y  dot product, x_1 y_1 + · · · + x_d y_d

Sets in Euclidean Space
Ā  closure of A
A°  interior of A
∂A  boundary of A, = Ā − A°
B(x, r)  ball of radius r with center at x, {y : |x − y| < r}
∂B(x, r)  boundary of B(x, r), {y : |x − y| = r}
x + A  {x + y : y ∈ A}
rA  {rx : x ∈ A}

Set Theory
∅  empty set
A ∪ B  the union of A and B
A ∩ B  the intersection of A and B
A + B  disjoint union, i.e., A + B = A ∪ B and A ∩ B = ∅
A^c  the complement of A
A − B  difference, A ∩ (B^c)
A ∆ B  symmetric difference, (A − B) ∪ (B − A)
lim sup An  ∩_{m≥1} (∪_{n≥m} An), points in infinitely many An
lim inf An  ∪_{m≥1} (∩_{n≥m} An), points in all but finitely many An
An ↓ A  A1 ⊃ A2 ⊃ . . . and ∩n An = A
An ↑ A  A1 ⊂ A2 ⊂ . . . and ∪n An = A

Probability
R^d  Borel subsets of R^d
|A|  Lebesgue measure of A
1_A  indicator function, = 1 on A, = 0 on A^c
X ∈ F  X is measurable with respect to F, see Section 1.2
σ(. . .)  σ-field generated by . . .
σ(C)  the smallest σ-field containing all the sets in C
σ(X)  the smallest σ-field G so that X ∈ G
Fn ↑ F∞  F1 ⊂ F2 ⊂ . . ., σ(∪Fn) = F∞
Fn ↓ F∞  F1 ⊃ F2 ⊃ . . ., ∩Fn = F∞
EX  expected value of X, see Section 1.3
E(X; A)  integral of X over A, = E(X 1_A)
||X||_p  (E|X|^p)^{1/p}
var(X)  the variance of X, = E(X − EX)^2
L^2(F)  {X : X ∈ F, EX^2 < ∞}
E(X|F)  conditional expectation of X given F, see Section 4.1
P(A|F)  E(X|F) when X = 1_A
χ  random variable with a standard normal distribution
N(x)  P(χ ≤ x), normal distribution function
X =d Y  X and Y have the same distribution
⇒  converges weakly, see Section 2.2

Abbreviations
a.e.  almost everywhere
a.s.  almost surely
ch.f.  characteristic function
CLT  central limit theorem
d.f.  distribution function
f.d.d.'s  finite dimensional distributions
g.c.d.  greatest common divisor
i.i.d.  independent and identically distributed
i.o.  infinitely often
LIL  law of the iterated logarithm
l.s.c.  lower semicontinuous
MCT  monotone class theorem
r.c.d.  regular conditional distribution
r.v.  random variable
u.s.c.  upper semicontinuous
w.r.t.  with respect to

Table of the Normal Distribution
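The entries in the normal table of this section can be cross-checked numerically: Φ is expressible through the error function as Φ(x) = (1 + erf(x/√2))/2. A minimal sketch in Python (the function name `phi` is ours, chosen for illustration; `math.erf` is in the standard library):

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal distribution function N(x) = P(chi <= x),
    via the identity Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return (1.0 + erf(x / sqrt(2.0))) / 2.0

# Reproduce the two illustrative entries quoted with the table.
print(round(phi(0.36), 4))  # 0.6406
print(round(phi(1.34), 4))  # 0.9099
```

Rounding to four decimals reproduces each table entry, so the table's row/column lookup (row = first decimal of x, column = second decimal) can be verified against `phi` directly.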
Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy

To illustrate the use of the table: Φ(0.36) = 0.6406, Φ(1.34) = 0.9099

x     0      1      2      3      4      5      6      7      8      9
0.0  0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1  0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2  0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3  0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4  0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5  0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6  0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7  0.7580 0.7611 0.7642 0.7673 0.7703 0.7734 0.7764 0.7793 0.7823 0.7852
0.8  0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9  0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0  0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1  0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2  0.8849 0.8869 0.8888 0.8907 0.8925 0.8943 0.8962 0.8980 0.8997 0.9015
1.3  0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4  0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5  0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6  0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7  0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8  0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9  0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0  0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1  0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2  0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3  0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4  0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5  0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6  0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7  0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8  0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9  0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0  0.9986 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

Index

absolutely continuous 7, 218, 477
absorbing state 284
adapted sequence 228
age-dependent branching process 367
algebra 437
almost everywhere 452
almost sure convergence 10
alternating renewal process 215
Anosov map 351
arcsine laws
Brownian motion 393, 403
random walk 197, 198
arithmetic distribution 203
asymptotic density of subset of Z 437
asymptotic equipartition property
i.i.d. sequences 59
stationary sequences 357
Backgammon 396
backwards martingale 262
ballot theorem 198, 264
Banach-Tarski theorem 452
Bayes’ formula 220
Benford’s law 342
Bernoulli-Laplace model 313
Bernoulli shift 333, 336, 342, 350
Bernstein polynomials 36
birth and death chains
280, 292, 297, 304
birthday problem 81
Blackwell’s renewal theorem 204
Blumenthal’s 01 law 381
Bonferroni inequalities 21 BorelCantelli lemmas
25, 26, 29, 237, 253
Borel sets 2
Borel’s paradox 222
bounded convergence theorem 15, 463
branching process 251, 278, 292
brother-sister mating 281
Brownian bridge 428
Brownian motion 372
continuity of paths 374
hitting times 389, 398
Hölder continuity 376
Markov property 378
martingales 395
modulus of continuity 393
nondiﬀerentiability 377
quadratic variation 377
reﬂection principle 391
scaling relation 372
strong Markov property 387
tail σ ﬁeld 383
temporal inversion 382
zeros 389
Cantor set 7
Carathéodory's extension theorem 438
card trick 312
Carleman’s condition 108
Cauchy distribution 43
Cauchy-Schwarz inequality 13, 97, 462
central limit theorem
i.i.d. sequences 111, 402
martingales 409, 411, 414
prime divisors 122
random indices 115
Rd 168
renewal theory 115
stationary sequences 416
central order statistic 82
chain-dependent r.v.'s 414
change of variables formula 16
Chapman-Kolmogorov equation 284
characteristic function 90
convergence theorem 97, 167
in Rd 166
inversion 93, 167
Chebyshev’s inequality 14, 223
Chung-Fuchs theorem 188
closed set in Markov chain 290
completion 450
conditional expectation 217
conditional variance formula 250
continued fractions 337
continuous mapping theorem 85
convergence
almost surely 10
in measure 463
in probability 34
of types 157
weak 80, 164
converging together lemma 89
convolution 28
countably generated σ ﬁeld 7
coupon collector’s problem 38, 142
Cramer’s estimates of ruin
CramerWold device 168
cyclic decomposition 315 distribution function 4
dominated convergence theorem
15, 48, 262, 465
Donsker’s theorem 403
Doob’s decomposition 234
Doob’s inequality 247
double exponential distribution 83
doubly stochastic 297
dual transition probability 298
Dubin’s inequality 236 decomposition theorem 291
de Finetti’s theorem 266
delayed renewal process 204
De Moivre-Laplace theorem 79
density function 6, 163
directly Riemann integrable 213
Hahn decomposition 474
Hamburger moment problem 110
Harris chain 322
head runs 53
Helly’s selection theorem 86
Hewitt-Savage 0-1 law 172, 265
Ehrenfest chain 280, 297
empirical distribution 58, 425
entropy 59, 307
Erdős-Kac central limit theorem 122
ergodic sequence 335
ergodic theorem 337
exchangeable σ ﬁeld 172
extended real line 11
extreme value distribution 83
Fatou’s lemma 15, 48, 84, 464
ﬁltration 228
ﬁrst entrance decomposition 287
ﬁrst passage percolation 368
Friedman’s urn 254
Fubini’s theorem 467
GaltonWatson process 243
Gaussian process 373
germ σ ﬁeld 381
GI/G/1 queue 323, 329
GlivenkoCantelli theorem 58
Gumbel distribution 83
histogram correction 112
hitting time 174
H¨lder’s inequality 13, 462
o
Holtsmark distribution 158
i.i.d. 36
inclusion-exclusion formula 20
independence, deﬁned 22
index of a stable law 155
indicator function 4
inﬁnitely divisible distribution 159
invariant set 334
inversion formula for ch.f. 93, 167
investment problem 60
irreducible set in Markov chain 290
Jensen’s inequality 13, 223, 461
Jordan decomposition 475
Kac’s recurrence theorem 345
Kakutani dichotomy 241
Kochen-Stone lemma 55
Kolmogorov’s
continuity criterion 375
cycle condition 303
extension theorem 31, 471
maximal inequality 61
test 433
zeroone law 61
Kronecker’s lemma 63
Ky Fan metric 89
ladder variables 177
large deviations 69–76
lattice distribution 129
law of the iterated logarithm 129
lazy janitor 58
Lebesgue decomposition 476
Lebesgue measure 2
Lévy-Khintchine theorem 161
Lévy measure 161
Lévy metric 89
Lévy's 0-1 law 263
Lindeberg-Feller theorem 114
for martingales 411
Littlewood’s principles 458
local central limit theorem 130, 132
lognormal distribution 106
longest common subsequence 359
Lyapunov’s theorem 119
marginal distribution 163
Markov chain 274, 335, 349
additive functionals 320, 418
convergence theorem 310, 315
Markov property 282
of Brownian motion 378
Markov’s inequality 14
martingale 229, 395
convergence theorem 233
Lp convergence theorem 250
Lp maximal inequality 247
orthogonality of increments 250
matching 140
maxima, limit theorem for 83
maximal ergodic lemma 338
m-dependent 418
mean 12, 18
measurable map 8
measurable space 1
measure 1, 437
measure preserving 333
method of moments 105
M/G/1 queue 279, 295, 305, 319
Minkowski’s inequality 463
mixing 420
M/M/∞ queue 296, 306
moment 18
moment problem 105
monotone class theorem 277
monotone convergence theorem
15, 223, 464
Monte Carlo integration 45
moving average process 419
multivariate normal distribution 169
mutually singular measures 475
negative set 473
nice measurable space 32
nonarithmetic distribution 203
nonmeasurable set 450
normal number 343
null recurrent 304
null set 475
occupancy problem 39, 141, 146
optional random variable 174
optional stopping theorem 269
outer measure 446
pairwise independent 33, 113
Parseval relation 190
pedestrian delay 210
period of a state 309
permutable event 171
π − λ theorem 24, 444
Poisson approximation 36
Poisson convergence 135
Poisson process 143, 147, 150
Pollaczek-Khintchine formula 331
Polya’s criterion 102
Polya’s distribution 96
Polya’s urn scheme 238
positive recurrent 304
positive set 473
predictable sequence 231
probability measure 2
probability space 1
product space 3, 464
products of random matrices 364
Radon-Nikodym derivative 218, 239
Radon-Nikodym theorem 218, 477
random index c.l.t. 114
random matrices 364
random permutations 38, 116
increasing sequences in 366
longest common subsequences 359
random variable 3, 8, 11
random vector 8
random walk 173, 275, 297
see also simple random walk
on graphs 298
on hypercube 314
on trees 319
on Zd 179, 183, 317, 343
range of 346, 358
symmetric 173
ratio limit theorems 320
record values 51, 116
from improving populations 359
recurrent random walk 183
recurrent state 289
reﬂection principle 195, 285, 391
regular conditional probability 227
renewal equation 208
renewal measure 203
renewal process
alternating 215
delayed 204
stationary 205
terminating 210
renewal theorem 205, 212
central limit theorem 114
strong law 57
residual waiting time 215
Riemann integration 459
RiemannLebesgue lemma 459
St. Petersburg paradox 43
Scheffé's theorem 81
self-normalized 114
semialgebra 438
semicontinuous functions 11 Index
sequence space 276
Shannon-McMillan-Breiman theorem 354
Shannon’s theorem 58
shift on sequence space 175
shuﬄing cards 314
σ algebra = σ ﬁeld 1
σ ﬁeld generated by
collection of sets 3
random variables 9
σ ﬁnite measure 438
signed measure 473
simple function 11, 452
simple random walk 169, 174, 179, 183
asymmetric 180, 271, 293, 297
singular measure 7
Skorokhod’s representation 399
slowly varying 152
Solovay’s theorem 452
span of a lattice distribution 129
stable laws 155, 156
stationary distribution 296, 303
stationary measure 296
stationary renewal process 205
stationary sequence 332
step function 459
Stieltjes measure function 441
Stieltjes moment problem 110
Stirling’s formula 78
stopping time 174, 232, 384
Strassen’s invariance principle 435
strong law of large numbers 55, 64
backwards martingale proof 264
converse 49
in renewal theory 57
subadditive ergodic theorem 358
submartingale 229
superharmonic 296
supermartingale 229
switching principle 235
tail σ ﬁeld 60
for Brownian motion 383
for Markov chain 316
terminating renewal process 210
three series theorem 63
converse 117
tight sequence of distributions 87, 166
total variation norm 81, 137
transition probability 275
dual 298
transient random walk 183, 186
transient state 289
uncorrelated random variables 28, 34
unfair “fair” game 45
uniform integrability 256
upcrossing inequality 232
vague convergence 87
variance 18
Wald’s equation 178, 180
weak convergence 80, 164
weak law of large numbers 41
for positive random variables 46
for triangular arrays 40
L2 version 35
Weyl’s equidistribution theorem 341
Wiener’s maximal 340
zero-one laws
Blumenthal's 381
Hewitt-Savage 172, 265
Kolmogorov's 61
Lévy's 263