Probability: Theory and Examples
Rick Durrett
Fourth Edition Draft, June 24, 2009
Copyright 2009, All rights reserved.

Contents
1 Laws of Large Numbers
  1.1 Basic Definitions
  1.2 Random Variables
  1.3 Expected Value
    1.3.1 Inequalities
    1.3.2 Integration to the Limit
    1.3.3 Computing Expected Values
  1.4 Independence
    1.4.1 Sufficient Conditions for Independence
    1.4.2 Independence, Distribution, and Expectation
    1.4.3 Sums of Independent Random Variables
    1.4.4 Constructing Independent Random Variables
  1.5 Weak Laws of Large Numbers
    1.5.1 L^2 Weak Laws
    1.5.2 Triangular Arrays
    1.5.3 Truncation
  1.6 Borel-Cantelli Lemmas
  1.7 Strong Law of Large Numbers
  1.8 Convergence of Random Series*
  1.9 Large Deviations*

2 Central Limit Theorems
  2.1 The De Moivre-Laplace Theorem
  2.2 Weak Convergence
    2.2.1 Examples
    2.2.2 Theory
  2.3 Characteristic Functions
    2.3.1 Definition, Inversion Formula
    2.3.2 Weak Convergence
    2.3.3 Moments and Derivatives
    2.3.4 Polya's Criterion*
    2.3.5 The Moment Problem*
  2.4 Central Limit Theorems
    2.4.1 i.i.d. Sequences
    2.4.2 Triangular Arrays
    2.4.3 Prime Divisors (Erdős-Kac)*
    2.4.4 Rates of Convergence (Berry-Esseen)*
  2.5 Local Limit Theorems*
  2.6 Poisson Convergence
    2.6.1 The Basic Limit Theorem
    2.6.2 Two Examples with Dependence
    2.6.3 Poisson Processes
  2.7 Stable Laws*
  2.8 Infinitely Divisible Distributions*
  2.9 Limit Theorems in R^d

3 Random Walks
  3.1 Stopping Times
  3.2 Recurrence
  3.3 Visits to 0, Arcsine Laws*
  3.4 Renewal Theory*

4 Martingales
  4.1 Conditional Expectation
    4.1.1 Examples
    4.1.2 Properties
    4.1.3 Regular Conditional Probabilities*
  4.2 Martingales, Almost Sure Convergence
  4.3 Examples
    4.3.1 Bounded Increments
    4.3.2 Polya's Urn Scheme
    4.3.3 Radon-Nikodym Derivatives
    4.3.4 Branching Processes
  4.4 Doob's Inequality, Convergence in L^p
    4.4.1 Square Integrable Martingales*
  4.5 Uniform Integrability, Convergence in L^1
  4.6 Backwards Martingales
  4.7 Optional Stopping Theorems

5 Markov Chains
  5.1 Definitions
  5.2 Examples
  5.3 Extensions of the Markov Property
  5.4 Recurrence and Transience
  5.5 Stationary Measures
  5.6 Asymptotic Behavior
  5.7 Periodicity, Tail σ-field*
  5.8 General State Space*
    5.8.1 Recurrence and Transience
    5.8.2 Stationary Measures
    5.8.3 Convergence Theorem
    5.8.4 GI/G/1 queue

6 Ergodic Theorems
  6.1 Definitions and Examples
  6.2 Birkhoff's Ergodic Theorem
  6.3 Recurrence
  6.4 A Subadditive Ergodic Theorem*
  6.5 Applications*

7 Brownian Motion
  7.1 Definition and Construction
  7.2 Markov Property, Blumenthal's 0-1 Law
  7.3 Stopping Times, Strong Markov Property
  7.4 Path Properties
    7.4.1 Zeros of Brownian Motion
    7.4.2 Hitting times
    7.4.3 Lévy's Modulus of Continuity
  7.5 Martingales
    7.5.1 Multidimensional Brownian motion
  7.6 Donsker's Theorem
  7.7 Empirical Distributions, Brownian Bridge
  7.8 Laws of the Iterated Logarithm*

A Measure Theory
  A.1 Lebesgue-Stieltjes Measures
  A.2 Carathéodory's Extension Theorem
  A.3 Completion, Etc.
  A.4 Integration
  A.5 Properties of the Integral
  A.6 Product Measures, Fubini's Theorem
  A.7 Kolmogorov's Extension Theorem
  A.8 Radon-Nikodym Theorem
  A.9 Differentiating Under the Integral

Chapter 1. Laws of Large Numbers

In the first three sections, we will recall some definitions and results from measure
theory. Our purpose is not only to review that material but also to introduce the
terminology of probability theory, which differs slightly from that of measure theory.

In Section 1.4, we introduce the crucial concept of independence and explore its
properties. In Section 1.5, we prove the weak law of large numbers and give several
applications. In Section 1.6, we prove some Borel-Cantelli lemmas to prepare for the
proof of the strong law of large numbers in Section 1.7. In Section 1.8, we investigate
the convergence of random series, which leads to estimates on the rate of convergence
in the law of large numbers. Finally, in Section 1.9, we show that in nice situations
convergence in the weak law occurs exponentially rapidly.

1.1 Basic Definitions

Here and throughout the book, terms being defined are set in boldface. We begin
with the most basic quantity. A probability space is a triple $(\Omega, \mathcal{F}, P)$ where $\Omega$
is a set of "outcomes," $\mathcal{F}$ is a set of "events," and $P : \mathcal{F} \to [0, 1]$ is a function that
assigns probabilities to events. We assume that $\mathcal{F}$ is a σ-field (or σ-algebra), i.e., a
(nonempty) collection of subsets of $\Omega$ that satisfy
(i) if $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$, and
(ii) if $A_i \in \mathcal{F}$ is a countable sequence of sets then $\cup_i A_i \in \mathcal{F}$.
Here and in what follows, countable means finite or countably infinite. Since
$\cap_i A_i = (\cup_i A_i^c)^c$, it follows that a σ-field is closed under countable intersections. We omit the
last property from the definition to make it easier to check.
Without $P$, $(\Omega, \mathcal{F})$ is called a measurable space, i.e., it is a space on which we
can put a measure. A measure is a nonnegative countably additive set function; that
is, a function $\mu : \mathcal{F} \to \mathbf{R}$ with
(i) $\mu(A) \ge \mu(\emptyset) = 0$ for all $A \in \mathcal{F}$, and
(ii) if $A_i \in \mathcal{F}$ is a countable sequence of disjoint sets, then
$$\mu(\cup_i A_i) = \sum_i \mu(A_i)$$
If $\mu(\Omega) = 1$, we call $\mu$ a probability measure. In this book, probability measures
are usually denoted by $P$.

The next exercise gives some consequences of the definition
that we will need later. In all cases, we assume that the sets we mention are in $\mathcal{F}$.
For (i) one needs to know that $B - A = B \cap A^c$. For (iv), it is useful to note that (ii)
of the definition with $A_1 = A$ and $A_2 = A^c$ implies $P(A^c) = 1 - P(A)$.
Exercise 1.1.1. Let $P$ be a probability measure on $(\Omega, \mathcal{F})$.
(i) monotonicity. If $A \subset B$ then $P(B) - P(A) = P(B - A) \ge 0$.
(ii) subadditivity. If $A \subset \cup_{m=1}^\infty A_m$ then $P(A) \le \sum_{m=1}^\infty P(A_m)$.
(iii) continuity from below. If $A_i \uparrow A$ (i.e., $A_1 \subset A_2 \subset \ldots$ and $\cup_i A_i = A$) then
$P(A_i) \uparrow P(A)$.
(iv) continuity from above. If $A_i \downarrow A$ (i.e., $A_1 \supset A_2 \supset \ldots$ and $\cap_i A_i = A$) then
$P(A_i) \downarrow P(A)$.
Some examples of probability measures should help to clarify the concept. We
leave it to the reader to check that they are examples, i.e., $\mathcal{F}$ is a σ-field and $P$ is a
probability measure.
Example 1.1.1. Discrete probability spaces. Let $\Omega$ = a countable set, i.e., finite
or countably infinite. Let $\mathcal{F}$ = the set of all subsets of $\Omega$. Let
$$P(A) = \sum_{\omega \in A} p(\omega) \quad \text{where } p(\omega) \ge 0 \text{ and } \sum_{\omega \in \Omega} p(\omega) = 1$$
A little thought reveals that this is the most general probability measure on this space.
In many cases when $\Omega$ is a finite set, we have $p(\omega) = 1/|\Omega|$ where $|\Omega|$ = the number
of points in $\Omega$. Concrete examples in this category are:
a. flipping a fair coin: $\Omega$ = {Heads, Tails}
b. rolling a die: $\Omega$ = {1, 2, 3, 4, 5, 6}
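A discrete probability space is easy to experiment with on a computer. The following minimal Python sketch is an illustration only: the helper name `make_measure` and the dictionary representation of $p$ are assumptions made here, not notation from the text. It builds $P(A) = \sum_{\omega \in A} p(\omega)$ for the fair die and checks the complement rule with exact fractions.

```python
from fractions import Fraction

def make_measure(p):
    """Build P(A) = sum of p(w) for w in A from nonnegative weights summing to 1."""
    assert all(weight >= 0 for weight in p.values())
    assert sum(p.values()) == 1
    def P(A):
        return sum((p[w] for w in A), Fraction(0))
    return P

# Rolling a fair die: p(w) = 1/|Omega| for each of the six outcomes.
die = {w: Fraction(1, 6) for w in range(1, 7)}
P = make_measure(die)

even = {2, 4, 6}
odd = set(die) - even
print(P(even))                 # probability of an even roll
print(P(odd) == 1 - P(even))   # complement rule P(A^c) = 1 - P(A)
```

Exact rational arithmetic is used so that identities such as $P(A^c) = 1 - P(A)$ can be tested with equality rather than up to rounding error.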
Example 1.1.2. Real line and unit interval. Let $\mathbf{R}$ = the real line, $\mathcal{R}$ = the
Borel sets = the smallest σ-field containing the open sets, $\lambda$ = Lebesgue measure
= the only measure on $\mathcal{R}$ with $\lambda((a, b]) = b - a$ for all $a < b$. The construction of
Lebesgue measure is carried out in Section 1 of the Appendix. Note that $\lambda(\mathbf{R}) = \infty$. To get
a probability space, let $\Omega = (0, 1)$, $\mathcal{F} = \{A \cap (0, 1) : A \in \mathcal{R}\}$ and $P(B) = \lambda(B)$ for
$B \in \mathcal{F}$. $P$ is Lebesgue measure restricted to the Borel subsets of (0,1).
Exercise 1.1.2. (i) If $\mathcal{F}_i$, $i \in I$ are σ-fields then $\cap_{i \in I} \mathcal{F}_i$ is. Here $I \ne \emptyset$ is an arbitrary
index set (i.e., possibly uncountable). (ii) Use the result in (i) to show if we are given
a set $\Omega$ and a collection $\mathcal{A}$ of subsets of $\Omega$, then there is a smallest σ-field containing
$\mathcal{A}$. We will call this the σ-field generated by $\mathcal{A}$ and denote it by $\sigma(\mathcal{A})$.
Example 1.1.3. Product spaces. If $(\Omega_i, \mathcal{F}_i, P_i)$, $i = 1, \ldots, n$ are probability spaces,
we can let $\Omega = \Omega_1 \times \cdots \times \Omega_n = \{(\omega_1, \ldots, \omega_n) : \omega_i \in \Omega_i\}$. $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_n$ = the
σ-field generated by $\{A_1 \times \cdots \times A_n : A_i \in \mathcal{F}_i\}$. Let $P = P_1 \times \cdots \times P_n$ = the measure
on $\mathcal{F}$ that has
$$P(A_1 \times \cdots \times A_n) = P_1(A_1) \cdot P_2(A_2) \cdots P_n(A_n)$$
For more details, see Section 6 of the Appendix. Concrete examples of product spaces
are:
a. Roll two dice. $\Omega = \{1, 2, 3, 4, 5, 6\} \times \{1, 2, 3, 4, 5, 6\}$, $\mathcal{F}$ = all subsets of $\Omega$,
$P(A) = |A|/36$.
b. Unit cube. If $\Omega_i = (0, 1)$, $\mathcal{F}_i$ = the Borel sets, and $P_i$ = Lebesgue measure, then
the product space defined above is the unit cube $\Omega = (0, 1)^n$, $\mathcal{F}$ = the Borel subsets
of $\Omega$, and $P$ is n-dimensional Lebesgue measure restricted to $\mathcal{F}$.
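For the two-dice example the product formula can be checked by brute-force enumeration. The sketch below is a hedged illustration: the names `P1`, `P`, `A1`, and `A2` are hypothetical choices made here, and the check covers one rectangle only, not the general statement.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
omega = set(product(faces, faces))        # the 36 outcomes of rolling two dice

def P1(A):
    """Uniform measure on a single die."""
    return Fraction(len(set(A) & set(faces)), 6)

def P(A):
    """Uniform measure on pairs: P(A) = |A|/36."""
    return Fraction(len(set(A) & omega), 36)

A1, A2 = {1, 2}, {2, 4, 6}
rectangle = set(product(A1, A2))          # the event A1 x A2
print(P(rectangle), P1(A1) * P1(A2))      # the two sides of the product formula
```

The same enumeration works for any pair of subsets of the faces, since every subset of a finite product space is a finite union of points.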
Exercise 1.1.3. Let $\mathbf{R}^n = \{(x_1, \ldots, x_n) : x_i \in \mathbf{R}\}$. $\mathcal{R}^n$ = the Borel subsets of $\mathbf{R}^n$ is
defined to be the σ-field generated by the open subsets of $\mathbf{R}^n$. Prove this is the same
as $\mathcal{R} \times \cdots \times \mathcal{R}$ = the σ-field generated by sets of the form $A_1 \times \cdots \times A_n$. Hint: Show
that both σ-fields coincide with the one generated by $(a_1, b_1) \times \cdots \times (a_n, b_n)$.
Probability spaces become a little more interesting when we define random variables
on them. A real valued function $X$ defined on $\Omega$ is said to be a random
variable if for every Borel set $B \subset \mathbf{R}$ we have
$$X^{-1}(B) = \{\omega : X(\omega) \in B\} \in \mathcal{F}$$
When we need to emphasize the σ-field, we will say that $X$ is $\mathcal{F}$-measurable or write
$X \in \mathcal{F}$. If $\Omega$ is a discrete probability space (see Example 1.1.1), then any function
$X : \Omega \to \mathbf{R}$ is a random variable. A second trivial, but useful, type of example of a
random variable is the indicator function of a set $A \in \mathcal{F}$:
$$1_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A \end{cases}$$
The notation is supposed to remind you that this function is 1 on $A$. Analysts call this
object the characteristic function of $A$. In probability, that term is used for something
quite different. (See Section 2.3.)
If $X$ is a random variable, then $X$ induces a probability measure on $\mathbf{R}$ called its
distribution by setting $\mu(A) = P(X \in A)$ for Borel sets $A$. Using the notation
introduced above, the right-hand side can be written as $P(X^{-1}(A))$. In words, we
pull $A \in \mathcal{R}$ back to $X^{-1}(A) \in \mathcal{F}$ and then take $P$ of that set.
To check that $\mu$ is a probability measure we observe that if the $A_i$ are disjoint then
using the definition of $\mu$; the fact that $X$ lands in the union if and only if it lands in
one of the $A_i$; the fact that if the sets $A_i \in \mathcal{R}$ are disjoint then the events $\{X \in A_i\}$
are disjoint; and the definition of $\mu$ again; we have:
$$\mu(\cup_i A_i) = P(X \in \cup_i A_i) = P(\cup_i \{X \in A_i\}) = \sum_i P(X \in A_i) = \sum_i \mu(A_i)$$
The distribution of a random variable $X$ is usually described by giving its distribution
function, $F(x) = P(X \le x)$.
Theorem 1.1.1. Any distribution function $F$ has the following properties:
(i) $F$ is nondecreasing.
(ii) $\lim_{x \to \infty} F(x) = 1$, $\lim_{x \to -\infty} F(x) = 0$.
(iii) $F$ is right continuous, i.e., $\lim_{y \downarrow x} F(y) = F(x)$.
(iv) If $F(x-) = \lim_{y \uparrow x} F(y)$ then $F(x-) = P(X < x)$.
(v) $P(X = x) = F(x) - F(x-)$.
Proof. To prove (i), note that if $x \le y$ then $\{X \le x\} \subset \{X \le y\}$, and then use (i) in
Exercise 1.1.1 to conclude that $P(X \le x) \le P(X \le y)$.
To prove (ii), we observe that if $x \uparrow \infty$, then $\{X \le x\} \uparrow \Omega$, while if $x \downarrow -\infty$ then
$\{X \le x\} \downarrow \emptyset$ and then use (iii) and (iv) of Exercise 1.1.1.
To prove (iii), we observe that if $y \downarrow x$, then $\{X \le y\} \downarrow \{X \le x\}$.
To prove (iv), we observe that if $y \uparrow x$, then $\{X \le y\} \uparrow \{X < x\}$.
For (v), note $P(X = x) = P(X \le x) - P(X < x)$ and use (iii) and (iv).
The next result shows that we have found more than enough properties to characterize
distribution functions.
Theorem 1.1.2. If $F$ satisfies (i), (ii), and (iii) in Theorem 1.1.1, then it is the
distribution function of some random variable.
Proof. Let $\Omega = (0, 1)$, $\mathcal{F}$ = the Borel sets, and $P$ = Lebesgue measure. If $\omega \in (0, 1)$,
let
$$X(\omega) = \sup\{y : F(y) < \omega\}$$
Once we show that
$$\{\omega : X(\omega) \le x\} = \{\omega : \omega \le F(x)\} \qquad (\star)$$
the desired result follows immediately since $P(\omega : \omega \le F(x)) = F(x)$. (Recall $P$ is
Lebesgue measure.) To check $(\star)$, we observe that if $\omega \le F(x)$ then $X(\omega) \le x$, since
$x \notin \{y : F(y) < \omega\}$. On the other hand if $\omega > F(x)$, then since $F$ is right continuous,
there is an $\epsilon > 0$ so that $F(x + \epsilon) < \omega$ and $X(\omega) \ge x + \epsilon > x$.

Figure 1.1: Picture of the inverse defined in the proof of Theorem 1.1.2.

Even though $F$ may not be 1-1 and onto we will call $X$ the inverse of $F$ and denote
it by $F^{-1}$. The scheme in the proof of Theorem 1.1.2 is useful in generating random
variables on a computer. Standard algorithms generate random variables $U$ with a
uniform distribution, then one applies the inverse of the distribution function defined
in Theorem 1.1.2 to get a random variable $F^{-1}(U)$ with distribution function $F$.
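The simulation scheme just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the text's algorithm: the function names `generalized_inverse` and `exp_cdf`, the bisection search, its bracketing interval, and the tolerance are all choices made here. The sketch approximates $X(\omega) = \sup\{y : F(y) < \omega\}$ numerically and applies it to uniform draws, using the exponential distribution function $F(x) = 1 - e^{-x}$ as the test case.

```python
import math
import random

def generalized_inverse(F, omega, lo=-1e6, hi=1e6, tol=1e-9):
    """Approximate sup{y : F(y) < omega} by bisection, for nondecreasing F.

    Assumes F(lo) < omega <= F(hi); the default bracket is an arbitrary choice.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < omega:
            lo = mid      # mid belongs to {y : F(y) < omega}
        else:
            hi = mid      # mid is an upper bound for that set
    return lo

def exp_cdf(x):
    """Distribution function of the exponential distribution with rate 1."""
    return 1 - math.exp(-x) if x > 0 else 0.0

rng = random.Random(0)    # fixed seed so the run is reproducible
samples = [generalized_inverse(exp_cdf, rng.random()) for _ in range(20000)]

# The fraction of samples <= 1 should be close to F(1) = 1 - 1/e.
empirical = sum(s <= 1.0 for s in samples) / len(samples)
print(empirical, exp_cdf(1.0))
```

When a closed form for the inverse is available (here $F^{-1}(u) = -\log(1 - u)$), one would normally use it directly; bisection is shown only because it works for any nondecreasing, right-continuous $F$.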
An immediate consequence of Theorem 1.1.2 is
Theorem 1.1.3. If $F$ satisfies (i), (ii), and (iii) in Theorem 1.1.1, there is a unique probability
measure $\mu$ on $(\mathbf{R}, \mathcal{R})$ that has $\mu((a, b]) = F(b) - F(a)$ for all $a < b$.
Proof. Theorem 1.1.2 gives the existence of a random variable $X$ with distribution
function $F$. The measure it induces on $(\mathbf{R}, \mathcal{R})$ is the desired $\mu$. There is only one
measure associated with a given $F$ because the sets $(a, b]$ are closed under intersection
and generate the σ-field. (See (2.2) in the Appendix.)

If $X$ and $Y$ induce the same distribution $\mu$ on $(\mathbf{R}, \mathcal{R})$ we say $X$ and $Y$ are equal
in distribution. In view of Theorem 1.1.3, this holds if and only if $X$ and $Y$ have
the same distribution function, i.e., $P(X \le x) = P(Y \le x)$ for all $x$. When $X$ and $Y$
have the same distribution, we like to write
$$X \overset{d}{=} Y$$
but this is too tall to use in text, so for typographical reasons we will also use $X =_d Y$.
When the distribution function $F(x) = P(X \le x)$ has the form
$$F(x) = \int_{-\infty}^{x} f(y)\,dy \qquad (*)$$
we say that $X$ has density function $f$. In remembering formulas, it is often useful
to think of $f(x)$ as being $P(X = x)$ although
$$P(X = x) = \lim_{\epsilon \to 0} \int_{x-\epsilon}^{x+\epsilon} f(y)\,dy = 0$$
We can start with $f$ and use $(*)$ to define $F$. In order to end up with a distribution
function it is necessary and sufficient that $f(x) \ge 0$ and $\int f(x)\,dx = 1$. Three examples
that will be important in what follows are:
Example 1.1.4. Uniform distribution on (0,1). $f(x) = 1$ for $x \in (0, 1)$, 0
otherwise. Distribution function:
$$F(x) = \begin{cases} 0 & x \le 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$$
Example 1.1.5. Exponential distribution. $f(x) = e^{-x}$ for $x \ge 0$, 0 otherwise.
Distribution function:
$$F(x) = \begin{cases} 0 & x \le 0 \\ 1 - e^{-x} & x \ge 0 \end{cases}$$
Example 1.1.6. Standard normal distribution.
$$f(x) = (2\pi)^{-1/2} \exp(-x^2/2)$$
In this case, there is no closed form expression for $F(x)$, but we have the following
bounds that are useful for large $x$:
Theorem 1.1.4. For $x > 0$,
$$(x^{-1} - x^{-3}) \exp(-x^2/2) \le \int_x^\infty \exp(-y^2/2)\,dy \le x^{-1} \exp(-x^2/2)$$
Proof. Changing variables $y = x + z$ and using $\exp(-z^2/2) \le 1$ gives
$$\int_x^\infty \exp(-y^2/2)\,dy \le \exp(-x^2/2) \int_0^\infty \exp(-xz)\,dz = x^{-1} \exp(-x^2/2)$$
For the other direction, we observe
$$\int_x^\infty (1 - 3y^{-4}) \exp(-y^2/2)\,dy = (x^{-1} - x^{-3}) \exp(-x^2/2)$$
Since $1 - 3y^{-4} \le 1$, the left-hand side is at most $\int_x^\infty \exp(-y^2/2)\,dy$.

A distribution function on $\mathbf{R}$ is said to be absolutely continuous if it has a density
and singular if the corresponding measure is singular w.r.t. Lebesgue measure.
See Section 8 of the Appendix for more on these notions. An example of a singular
distribution is:
Example 1.1.7. Uniform distribution on the Cantor set. The Cantor set $C$
is defined by removing (1/3, 2/3) from [0,1] and then removing the middle third of
each interval that remains. We define an associated distribution function by setting
$F(x) = 0$ for $x \le 0$, $F(x) = 1$ for $x \ge 1$, $F(x) = 1/2$ for $x \in [1/3, 2/3]$, $F(x) = 1/4$ for
$x \in [1/9, 2/9]$, $F(x) = 3/4$ for $x \in [7/9, 8/9]$, ... The function $F$ that results is called
Lebesgue's singular function because there is no $f$ for which $(*)$ holds. From the
definition, it is immediate that the corresponding measure has $\mu(C^c) = 0$.
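The values listed in Example 1.1.7 can be reproduced by reading off the ternary digits of $x$. The sketch below is one standard way to evaluate Lebesgue's singular function, offered as an illustration rather than as the construction used in the text; the name `cantor_function` and the truncation parameter `depth` are assumptions made here.

```python
from fractions import Fraction

def cantor_function(x, depth=60):
    """Evaluate Lebesgue's singular function at x from the ternary digits of x."""
    x = Fraction(x)
    if x <= 0:
        return Fraction(0)
    if x >= 1:
        return Fraction(1)
    value, scale = Fraction(0), Fraction(1, 2)
    for _ in range(depth):
        x *= 3
        digit = int(x)        # next ternary digit: 0, 1, or 2
        x -= digit
        if digit == 1:        # x is in the closure of a removed middle third,
            return value + scale   # where F is constant
        if digit == 2:
            value += scale
        scale /= 2
    return value              # truncated after `depth` digits

for x in (Fraction(1, 3), Fraction(1, 9), Fraction(7, 9)):
    print(x, cantor_function(x))
```

Exact fractions are used so that points such as 1/3 and 7/9 are represented without rounding; for points of the Cantor set with infinitely many nonzero ternary digits the result is only accurate to about $2^{-\text{depth}}$.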
A probability measure $P$ (or its associated distribution function) is said to be
discrete if there is a countable set $S$ with $P(S^c) = 0$. The simplest example of a
discrete distribution is
Example 1.1.8. Pointmass at 0. $F(x) = 1$ for $x \ge 0$, $F(x) = 0$ for $x < 0$.
The next example shows that the distribution function associated with a discrete
probability measure can be quite wild.
Example 1.1.9. Dense discontinuities. Let $q_1, q_2, \ldots$ be an enumeration of the
rationals and let
$$F(x) = \sum_{i=1}^{\infty} 2^{-i} 1_{[q_i, \infty)}(x)$$
where $1_{[\theta, \infty)}(x) = 1$ if $x \in [\theta, \infty)$, $= 0$ otherwise.
Exercises
1.1.4. Let $\Omega = \mathbf{R}$, $\mathcal{F}$ = all subsets so that $A$ or $A^c$ is countable, $P(A) = 0$ in the first
case and $= 1$ in the second. Show that $(\Omega, \mathcal{F}, P)$ is a probability space.
1.1.5. A σ-field $\mathcal{F}$ is said to be countably generated if there is a countable collection
$\mathcal{C} \subset \mathcal{F}$ so that $\sigma(\mathcal{C}) = \mathcal{F}$. Show that $\mathcal{R}^d$ is countably generated.
1.1.6. Suppose $X$ and $Y$ are random variables on $(\Omega, \mathcal{F}, P)$ and let $A \in \mathcal{F}$. Show
that if we let $Z(\omega) = X(\omega)$ for $\omega \in A$ and $Z(\omega) = Y(\omega)$ for $\omega \in A^c$, then $Z$ is a
random variable.
1.1.7. Let $\chi$ have the standard normal distribution. Use Theorem 1.1.4 to get upper and lower
bounds on $P(\chi \ge 4)$.
1.1.8. Show that a distribution function has at most countably many discontinuities.
1.1.9. Show that if $F(x) = P(X \le x)$ is continuous then $Y = F(X)$ has a uniform
distribution on (0,1), that is, if $y \in [0, 1]$, $P(Y \le y) = y$.
1.1.10. Suppose $X$ has continuous density $f$, $P(\alpha \le X \le \beta) = 1$ and $g$ is a function
that is strictly increasing and differentiable on $(\alpha, \beta)$. Then $g(X)$ has density
$f(g^{-1}(y))/g'(g^{-1}(y))$ for $y \in (g(\alpha), g(\beta))$ and 0 otherwise. When $g(x) = ax + b$ with
$a > 0$, $g^{-1}(y) = (y - b)/a$ so the answer is $(1/a) f((y - b)/a)$.
1.1.11. Suppose $X$ has a normal distribution. Use the previous exercise to compute
the density of $\exp(X)$. (The answer is called the lognormal distribution.)
1.1.12. (i) Suppose $X$ has density function $f$. Compute the distribution function
of $X^2$ and then differentiate to find its density function. (ii) Work out the answer
when $X$ has a standard normal distribution to find the density of the chi-square
distribution.

1.2 Random Variables

In this section, we will develop some results that will help us later to prove that
quantities we define are random variables, i.e., they are measurable. Since most of
what we have to say is true for random elements of an arbitrary measurable space
$(S, \mathcal{S})$ and the proofs are the same (sometimes easier), we will develop our results in
that generality. First we need a definition. A function $X : \Omega \to S$ is said to be a
measurable map from $(\Omega, \mathcal{F})$ to $(S, \mathcal{S})$ if
$$X^{-1}(B) \equiv \{\omega : X(\omega) \in B\} \in \mathcal{F} \quad \text{for all } B \in \mathcal{S}$$
If $(S, \mathcal{S}) = (\mathbf{R}^d, \mathcal{R}^d)$ and $d > 1$ then $X$ is called a random vector. Of course, if
$d = 1$, $X$ is called a random variable, or r.v. for short.
The next result is useful for proving that maps are measurable.
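Theorem 1.2.1. If $\{\omega : X(\omega) \in A\} \in \mathcal{F}$ for all $A \in \mathcal{A}$ and $\mathcal{A}$ generates $\mathcal{S}$ (i.e., $\mathcal{S}$
is the smallest σ-field that contains $\mathcal{A}$), then $X$ is measurable.
Proof. Writing $\{X \in B\}$ as shorthand for $\{\omega : X(\omega) \in B\}$, we have
$$\{X \in \cup_i B_i\} = \cup_i \{X \in B_i\}$$
$$\{X \in B^c\} = \{X \in B\}^c$$
So the class of sets $\mathcal{B} = \{B : \{X \in B\} \in \mathcal{F}\}$ is a σ-field. Since $\mathcal{B} \supset \mathcal{A}$ and $\mathcal{A}$ generates
$\mathcal{S}$, $\mathcal{B} \supset \mathcal{S}$.
It follows from the two equations displayed in the previous proof that if $\mathcal{S}$ is a
σ-field, then $\{\{X \in B\} : B \in \mathcal{S}\}$ is a σ-field. It is the smallest σ-field on $\Omega$ that
makes $X$ a measurable map. It is called the σ-field generated by $X$ and denoted
$\sigma(X)$.
Exercise 1.2.1. Show that if $\mathcal{A}$ generates $\mathcal{S}$, then $X^{-1}(\mathcal{A}) \equiv \{\{X \in A\} : A \in \mathcal{A}\}$
generates $\sigma(X) = \{\{X \in B\} : B \in \mathcal{S}\}$.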
Example 1.2.1. If $(S, \mathcal{S}) = (\mathbf{R}, \mathcal{R})$ then possible choices of $\mathcal{A}$ in Theorem 1.2.1 are
$\{(-\infty, x] : x \in \mathbf{R}\}$ or $\{(-\infty, x) : x \in \mathbf{Q}\}$ where $\mathbf{Q}$ = the rationals.
Example 1.2.2. If $(S, \mathcal{S}) = (\mathbf{R}^d, \mathcal{R}^d)$, a useful choice of $\mathcal{A}$ is
$$\{(a_1, b_1) \times \cdots \times (a_d, b_d) : -\infty < a_i < b_i < \infty\}$$
or occasionally the larger collection of open sets.
Theorem 1.2.2. If $X : (\Omega, \mathcal{F}) \to (S, \mathcal{S})$ and $f : (S, \mathcal{S}) \to (T, \mathcal{T})$ are measurable
maps, then $f(X)$ is a measurable map from $(\Omega, \mathcal{F})$ to $(T, \mathcal{T})$.
Proof. Let $B \in \mathcal{T}$. $\{\omega : f(X(\omega)) \in B\} = \{\omega : X(\omega) \in f^{-1}(B)\} \in \mathcal{F}$, since by
assumption $f^{-1}(B) \in \mathcal{S}$.
From Theorem 1.2.2, it follows immediately that if $X$ is a random variable then so
is $cX$ for all $c \in \mathbf{R}$, $X^2$, $\sin(X)$, etc. The next result shows why we wanted to prove
Theorem 1.2.2 for measurable maps.
Theorem 1.2.3. If $X_1, \ldots, X_n$ are random variables and $f : (\mathbf{R}^n, \mathcal{R}^n) \to (\mathbf{R}, \mathcal{R})$ is
measurable, then $f(X_1, \ldots, X_n)$ is a random variable.
Proof. In view of Theorem 1.2.2, it suffices to show that $(X_1, \ldots, X_n)$ is a random
vector. To do this, we observe that if $A_1, \ldots, A_n$ are Borel sets then
$$\{(X_1, \ldots, X_n) \in A_1 \times \cdots \times A_n\} = \cap_i \{X_i \in A_i\} \in \mathcal{F}$$
Since sets of the form $A_1 \times \cdots \times A_n$ generate $\mathcal{R}^n$, the desired result follows from
Theorem 1.2.1.
Theorem 1.2.4. If $X_1, \ldots, X_n$ are random variables then $X_1 + \ldots + X_n$ is a random
variable.
Proof. In view of Theorem 1.2.3 it suffices to show that $f(x_1, \ldots, x_n) = x_1 + \ldots + x_n$
is measurable. To do this, we use Example 1.2.1 and note that $\{x : x_1 + \ldots + x_n < a\}$
is an open set and hence is in $\mathcal{R}^n$.
To get a feeling for the bare-hands approach to proving measurability, try
Exercise 1.2.2. Prove Theorem 1.2.4 when $n = 2$ by checking $\{X_1 + X_2 < x\} \in \mathcal{F}$.
Theorem 1.2.5. If $X_1, X_2, \ldots$ are random variables then so are
$$\inf_n X_n \qquad \sup_n X_n \qquad \limsup_n X_n \qquad \liminf_n X_n$$
Proof. Since the infimum of a sequence is $< a$ if and only if some term is $< a$ (if all
terms are $\ge a$ then so is the infimum), we have
$$\{\inf_n X_n < a\} = \cup_n \{X_n < a\} \in \mathcal{F}$$
A similar argument shows $\{\sup_n X_n > a\} = \cup_n \{X_n > a\} \in \mathcal{F}$. For the last two, we
observe
$$\liminf_{n \to \infty} X_n = \sup_n \left( \inf_{m \ge n} X_m \right) \qquad \limsup_{n \to \infty} X_n = \inf_n \left( \sup_{m \ge n} X_m \right)$$
To complete the proof in the first case, note that $Y_n = \inf_{m \ge n} X_m$ is a random variable
for each $n$ so $\sup_n Y_n$ is as well.
From Theorem 1.2.5, we see that
$$\Omega_o \equiv \{\omega : \lim_{n \to \infty} X_n \text{ exists}\} = \{\omega : \limsup_{n \to \infty} X_n - \liminf_{n \to \infty} X_n = 0\}$$
is a measurable set. (Here $\equiv$ indicates that the first equality is a definition.) If
$P(\Omega_o) = 1$, we say that $X_n$ converges almost surely, or a.s. for short. This type
of convergence is called almost everywhere in measure theory. To have a limit defined
on the whole space, it is convenient to let
$$X_\infty = \limsup_{n \to \infty} X_n$$
but this random variable may take the value $+\infty$ or $-\infty$. To accommodate this and
some other headaches, we will generalize the definition of random variable.
A function whose domain is a set $D \in \mathcal{F}$ and whose range is $\mathbf{R}^* \equiv [-\infty, \infty]$ is said
to be a random variable if for all $B \in \mathcal{R}^*$ we have $X^{-1}(B) = \{\omega : X(\omega) \in B\} \in \mathcal{F}$.
Here $\mathcal{R}^*$ = the Borel subsets of $\mathbf{R}^*$ with $\mathbf{R}^*$ given the usual topology, i.e., the one
generated by intervals of the form $[-\infty, a)$, $(a, b)$ and $(b, \infty]$ where $a, b \in \mathbf{R}$. The
reader should note that the extended real line $(\mathbf{R}^*, \mathcal{R}^*)$ is a measurable space, so
all the results above generalize immediately.
Exercises
1.2.3. Show that if $f$ is continuous and $X_n \to X$ almost surely then $f(X_n) \to f(X)$
almost surely.
1.2.4. (i) Show that a continuous function from $\mathbf{R}^d \to \mathbf{R}$ is a measurable map from
$(\mathbf{R}^d, \mathcal{R}^d)$ to $(\mathbf{R}, \mathcal{R})$. (ii) Show that $\mathcal{R}^d$ is the smallest σ-field that makes all the
continuous functions measurable.
1.2.5. A function $f$ is said to be lower semicontinuous or l.s.c. if
$$\liminf_{y \to x} f(y) \ge f(x)$$
and upper semicontinuous (u.s.c.) if $-f$ is l.s.c. Show that $f$ is l.s.c. if and only if
$\{x : f(x) \le a\}$ is closed for each $a \in \mathbf{R}$ and conclude that semicontinuous functions
are measurable.
1.2.6. Let $f : \mathbf{R}^d \to \mathbf{R}$ be an arbitrary function and let $f^\delta(x) = \sup\{f(y) : |y - x| < \delta\}$
and $f_\delta(x) = \inf\{f(y) : |y - x| < \delta\}$ where $|z| = (z_1^2 + \ldots + z_d^2)^{1/2}$. Show that $f^\delta$ is
l.s.c. and $f_\delta$ is u.s.c. Let $f^0 = \lim_{\delta \downarrow 0} f^\delta$, $f_0 = \lim_{\delta \downarrow 0} f_\delta$, and conclude that the set of
points at which $f$ is discontinuous $= \{f^0 \ne f_0\}$ is measurable.
1.2.7. A function $\varphi : \Omega \to \mathbf{R}$ is said to be simple if
$$\varphi(\omega) = \sum_{m=1}^{n} c_m 1_{A_m}(\omega)$$
where the $c_m$ are real numbers and $A_m \in \mathcal{F}$. Show that the class of $\mathcal{F}$-measurable
functions is the smallest class containing the simple functions and closed under
pointwise limits.
1.2.8. Use the previous exercise to conclude that $Y$ is measurable with respect to
$\sigma(X)$ if and only if $Y = f(X)$ where $f : \mathbf{R} \to \mathbf{R}$ is measurable.
1.2.9. To get a constructive proof of the last result, note that $\{\omega : m2^{-n} \le Y <
(m + 1)2^{-n}\} = \{X \in B_{m,n}\}$ for some $B_{m,n} \in \mathcal{R}$ and set $f_n(x) = m2^{-n}$ for $x \in B_{m,n}$
and show that as $n \to \infty$, $f_n(x) \to f(x)$ and $Y = f(X)$.

1.3 Expected Value

If $X \ge 0$ is a random variable on $(\Omega, \mathcal{F}, P)$ then we define its expected value to be
$EX = \int X\,dP$, which always makes sense, but may be $\infty$. (The integral is defined
in Section 4 of the Appendix.) To reduce the general case to the nonnegative case,
let $x^+ = \max\{x, 0\}$ be the positive part and let $x^- = \max\{-x, 0\}$ be the negative
part of $x$. We declare that $EX$ exists and set $EX = EX^+ - EX^-$ whenever the
subtraction makes sense, i.e., $EX^+ < \infty$ or $EX^- < \infty$.
$EX$ is often called the mean of $X$ and denoted by $\mu$. $EX$ is defined by integrating
$X$, so it has all the properties that integrals do. From (4.5) and (4.7) in the Appendix
and the trivial observation that $E(b) = b$ for any real number $b$, we get the following:
Theorem 1.3.1. Suppose $X, Y \ge 0$ or $E|X|, E|Y| < \infty$.
(a) $E(X + Y) = EX + EY$.
(b) $E(aX + b) = aE(X) + b$ for any real numbers $a, b$.
(c) If $X \ge Y$ then $EX \ge EY$.
In this section, we will recall some properties of expected value and prove some
new ones. To organize things, we will divide the developments into three subsections.

1.3.1 Inequalities

Our first two results are (5.1) and (5.2) from the Appendix.
Theorem 1.3.2. Jensen's inequality. Suppose $\varphi$ is convex, that is,
$$\lambda \varphi(x) + (1 - \lambda) \varphi(y) \ge \varphi(\lambda x + (1 - \lambda) y)$$
for all $\lambda \in (0, 1)$ and $x, y \in \mathbf{R}$. Then
$$E(\varphi(X)) \ge \varphi(EX)$$
provided both expectations exist, i.e., $E|X|$ and $E|\varphi(X)| < \infty$.
Two useful special cases are
$$|EX| \le E|X| \qquad (EX)^2 \le E(X^2)$$
To recall the direction in which the inequality goes note that if $P(X = x) = \lambda$ and
$P(X = y) = 1 - \lambda$ then
$$E\varphi(X) = \lambda \varphi(x) + (1 - \lambda) \varphi(y) \ge \varphi(\lambda x + (1 - \lambda) y) = \varphi(EX)$$
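The two-point mnemonic is easy to check numerically. In the sketch below the helper name `two_point_jensen` and the particular values of $x$, $y$, and $\lambda$ are illustrative assumptions; the three convex functions tried are $|t|$, $t^2$, and $e^t$.

```python
import math

def two_point_jensen(phi, x, y, lam):
    """Return (E phi(X), phi(EX)) when P(X = x) = lam and P(X = y) = 1 - lam."""
    e_phi = lam * phi(x) + (1 - lam) * phi(y)
    phi_e = phi(lam * x + (1 - lam) * y)
    return e_phi, phi_e

convex_functions = {"abs": abs, "square": lambda t: t * t, "exp": math.exp}
results = {name: two_point_jensen(phi, -1.0, 3.0, 0.25)
           for name, phi in convex_functions.items()}

for name, (e_phi, phi_e) in results.items():
    print(name, e_phi, phi_e, e_phi >= phi_e)
```

A finite check of this kind does not prove the inequality; it only confirms the direction in the special case the text uses to remember it.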
Exercise 1.3.1. Suppose $\varphi$ is strictly convex, i.e., $>$ holds for $\lambda \in (0, 1)$. Show that,
under the assumptions of Theorem 1.3.2, $\varphi(EX) = E\varphi(X)$ implies $X = EX$ a.s.
Exercise 1.3.2. Suppose $\varphi : \mathbf{R}^n \to \mathbf{R}$ is convex. Imitate the proof of (5.1) in the
Appendix to show
$$E\varphi(X_1, \ldots, X_n) \ge \varphi(EX_1, \ldots, EX_n)$$
provided $E|\varphi(X_1, \ldots, X_n)| < \infty$ and $E|X_i| < \infty$ for all $i$.
Theorem 1.3.3. Hölder's inequality. If $p, q \in [1, \infty]$ with $1/p + 1/q = 1$ then
$$E|XY| \le \|X\|_p \|Y\|_q$$
Here $\|X\|_r = (E|X|^r)^{1/r}$ for $r \in [1, \infty)$; $\|X\|_\infty = \inf\{M : P(|X| > M) = 0\}$.
The special case $p = q = 2$ is called the Cauchy-Schwarz inequality:
$$E|XY| \le \left( EX^2 \, EY^2 \right)^{1/2}$$
To state our next result, we need some notation. If we only integrate over $A \subset \Omega$,
we write
$$E(X; A) = \int_A X\,dP$$
Theorem 1.3.4. Chebyshev's inequality. Suppose $\varphi : \mathbf{R} \to \mathbf{R}$ has $\varphi \ge 0$, let
$A \in \mathcal{R}$ and let $i_A = \inf\{\varphi(y) : y \in A\}$.
$$i_A P(X \in A) \le E(\varphi(X); X \in A) \le E\varphi(X)$$
Proof. The definition of $i_A$ and the fact that $\varphi \ge 0$ imply that
$$i_A 1_{(X \in A)} \le \varphi(X) 1_{(X \in A)} \le \varphi(X)$$
So taking expected values and using part (c) of Theorem 1.3.1 gives the desired
result.
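The chain of inequalities in Theorem 1.3.4 can be verified exactly on a small example. The sketch below is illustrative only: the choice of $X$ uniform on $\{-3, \ldots, 3\}$, the function $\varphi(x) = x^2$, the set $A = \{x : |x| \ge 2\}$, and the helper name `expect` are all assumptions made here.

```python
from fractions import Fraction

support = range(-3, 4)                  # X uniform on {-3, ..., 3} (illustrative)
weight = Fraction(1, len(support))

def expect(g):
    """E g(X) for the uniform X above, computed exactly."""
    return sum(weight * g(x) for x in support)

a = 2
phi = lambda x: x * x
in_A = lambda x: abs(x) >= a            # A = {x : |x| >= a}
i_A = a * a                             # infimum of phi over A

lower = i_A * expect(lambda x: 1 if in_A(x) else 0)     # i_A P(X in A)
middle = expect(lambda x: phi(x) if in_A(x) else 0)     # E(phi(X); X in A)
upper = expect(phi)                                     # E phi(X)
print(lower, middle, upper)
```

The three printed quantities are the left, middle, and right terms of the theorem, in that order.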
Remark. Some authors call Theorem 1.3.4 Markov's inequality and use the name Chebyshev's inequality for the special case in which $\varphi(x) = x^2$ and $A = \{x : |x| \ge a\}$:
$$(*) \qquad a^2 P(|X| \ge a) \le EX^2$$
Exercise 1.3.3. Chebyshev's inequality is and is not sharp. (i) Show that Theorem 1.3.4 is sharp by showing that if $0 < b \le a$ are fixed there is an $X$ with $EX^2 = b^2$ for which equality holds in $(*)$. (ii) Show that Theorem 1.3.4 is not sharp by showing that if $X$ has $0 < EX^2 < \infty$ then
$$\lim_{a \to \infty} a^2 P(|X| \ge a)/EX^2 = 0$$
Exercise 1.3.4. One-sided Chebyshev bound. (i) Let $a > b > 0$, $0 < p < 1$, and let $X$ have $P(X = a) = p$ and $P(X = -b) = 1 - p$. Apply Theorem 1.3.4 to $\varphi(x) = (x + b)^2$ and conclude that if $Y$ is any random variable with $EY = EX$ and $\operatorname{var}(Y) = \operatorname{var}(X)$, then $P(Y \ge a) \le p$ and equality holds when $Y = X$.
(ii) Suppose $EY = 0$, $\operatorname{var}(Y) = \sigma^2$, and $a > 0$. Show that $P(Y \ge a) \le \sigma^2/(a^2 + \sigma^2)$, and there is a $Y$ for which equality holds.
Exercise 1.3.5. Two nonexistent lower bounds.
Show that: (i) if $\epsilon > 0$, $\inf\{P(|X| > \epsilon) : EX = 0, \operatorname{var}(X) = 1\} = 0$.
(ii) if $y \ge 1$, $\sigma^2 \in (0, \infty)$, $\inf\{P(|X| > y) : EX = 1, \operatorname{var}(X) = \sigma^2\} = 0$.
Exercise 1.3.6. A useful lower bound. Let $Y \ge 0$ with $EY^2 < \infty$. Apply the Cauchy-Schwarz inequality to $Y 1_{(Y > 0)}$ and conclude
$$P(Y > 0) \ge (EY)^2 / EY^2$$

1.3.2 Integration to the Limit

There are three classic real analysis results, (5.4)–(5.6) in the Appendix, about what happens when we interchange limits and integrals.
Theorem 1.3.5. Fatou's lemma. If $X_n \ge 0$ then
$$\liminf_{n \to \infty} EX_n \ge E(\liminf_{n \to \infty} X_n)$$
To recall the direction of the inequality, think of the special case $X_n = n 1_{(0,1/n)}$ (on the unit interval equipped with the Borel sets and Lebesgue measure). Here $X_n \to 0$ a.s. but $EX_n = 1$ for all $n$.
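The special case just described is easy to tabulate. The Python sketch below is illustrative only: the function names `X` and `EX` and the sample point `omega` are choices made here. It uses exact rational arithmetic to show that $EX_n = 1$ for every $n$ while, at a fixed sample point, $X_n(\omega)$ is eventually 0.

```python
from fractions import Fraction

def X(n, omega):
    """X_n(omega) = n if 0 < omega < 1/n, and 0 otherwise."""
    return n if 0 < omega < Fraction(1, n) else 0

def EX(n):
    """E X_n = n times the Lebesgue measure of (0, 1/n), computed exactly."""
    return n * Fraction(1, n)

omega = Fraction(1, 7)                         # a fixed sample point in (0, 1)
path = [X(n, omega) for n in range(1, 21)]     # X_1(omega), ..., X_20(omega)
print(path)
print([EX(n) for n in range(1, 6)])
```

At $\omega = 1/7$ the path equals $n$ for $n \le 6$ and vanishes from $n = 7$ on, so the liminf of the $X_n$ is 0 while the expectations stay at 1, which is the strict form of Fatou's inequality.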
Theorem 1.3.6. Monotone convergence theorem. If $0 \le X_n \uparrow X$ then $EX_n \uparrow EX$.
This follows immediately from Theorem 1.3.5 since $X_n \uparrow X$ and (c) of Theorem 1.3.1 imply
$$\limsup_{n \to \infty} EX_n \le EX$$
Theorem 1.3.7. Dominated convergence theorem. If $X_n \to X$ a.s., $|X_n| \le Y$ for all $n$, and $EY < \infty$, then $EX_n \to EX$.
The special case of Theorem 1.3.7 in which Y is constant is called the bounded
convergence theorem.
In the developments below, we will need another result on integration to the limit. Perhaps the most important special case of this result occurs when $g(x) = |x|^p$ with $p > 1$ and $h(x) = x$.
Theorem 1.3.8. Suppose $X_n \to X$ a.s. Let $g, h$ be continuous functions with
(i) $g \ge 0$ and $g(x) \to \infty$ as $|x| \to \infty$,
(ii) $|h(x)|/g(x) \to 0$ as $|x| \to \infty$,
and (iii) $Eg(X_n) \le K < \infty$ for all $n$.
Then $Eh(X_n) \to Eh(X)$.
Proof. By subtracting a constant from $h$, we can suppose without loss of generality that $h(0) = 0$. Pick $M$ large so that $P(|X| = M) = 0$ and $g(x) > 0$ when $|x| \ge M$. Given a random variable $Y$, let $\bar{Y} = Y 1_{(|Y| \le M)}$. Since $P(|X| = M) = 0$, $\bar{X}_n \to \bar{X}$ a.s. Since $h(\bar{X}_n)$ is bounded and $h$ is continuous, it follows from the bounded convergence theorem that
$$\text{(a)} \qquad Eh(\bar{X}_n) \to Eh(\bar{X})$$
To control the effect of the truncation, we use the following:
$$\text{(b)} \qquad |Eh(\bar{Y}) - Eh(Y)| \le E|h(\bar{Y}) - h(Y)| \le E(|h(Y)|; |Y| > M) \le \epsilon_M Eg(Y)$$
where $\epsilon_M = \sup\{|h(x)|/g(x) : |x| \ge M\}$. To check the second inequality, note that when $|Y| \le M$, $\bar{Y} = Y$, and we have supposed $h(0) = 0$. The third inequality follows from the definition of $\epsilon_M$.
Taking $Y = X_n$ in (b) and using (iii), it follows that
$$\text{(c)} \qquad |Eh(\bar{X}_n) - Eh(X_n)| \le K\epsilon_M$$
To estimate $|Eh(\bar{X}) - Eh(X)|$, we observe that $g \ge 0$ and $g$ is continuous, so Fatou's lemma implies
$$Eg(X) \le \liminf_{n \to \infty} Eg(X_n) \le K$$
Taking $Y = X$ in (b) gives
$$\text{(d)} \qquad |Eh(\bar{X}) - Eh(X)| \le K\epsilon_M$$
The triangle inequality implies
$$|Eh(X_n) - Eh(X)| \le |Eh(X_n) - Eh(\bar{X}_n)| + |Eh(\bar{X}_n) - Eh(\bar{X})| + |Eh(\bar{X}) - Eh(X)|$$
Taking limits and using (a), (c), (d), we have
$$\limsup_{n \to \infty} |Eh(X_n) - Eh(X)| \le 2K\epsilon_M$$
which proves the desired result since $K < \infty$ and $\epsilon_M \to 0$ as $M \to \infty$.
A simple example shows that Theorem 1.3.8 can sometimes be applied when Theorem 1.3.7 cannot.
Exercise 1.3.7. Let $\Omega = (0, 1)$ equipped with the Borel sets and Lebesgue measure. Let $\alpha \in (1, 2)$ and $X_n = n^\alpha 1_{(1/(n+1),1/n)} \to 0$ a.s. Show that Theorem 1.3.8 can be applied with $h(x) = x$ and $g(x) = |x|^{2/\alpha}$, but the $X_n$ are not dominated by an integrable function.

1.3.3 Computing Expected Values

Integrating over $(\Omega, \mathcal{F}, P)$ is nice in theory, but to do computations we have to shift to a space on which we can do calculus. In most cases, we will apply the next result with $S = \mathbb{R}^d$.
Theorem 1.3.9. Change of variables formula. Let $X$ be a random element of $(S, \mathcal{S})$ with distribution $\mu$, i.e., $\mu(A) = P(X \in A)$. If $f$ is a measurable function from $(S, \mathcal{S})$ to $(\mathbb{R}, \mathcal{R})$ so that $f \ge 0$ or $E|f(X)| < \infty$, then
$$Ef(X) = \int_S f(y) \, \mu(dy)$$
Remark. To explain the name, write $h$ for $X$ and $P \circ h^{-1}$ for $\mu$ to get
$$\int_\Omega f(h(\omega)) \, dP = \int_S f(y) \, d(P \circ h^{-1})$$
Proof. We will prove this result by verifying it in four increasingly more general special cases that parallel the way that the integral is defined (see section 4 of the Appendix). The reader should note the method employed, since it will be used several times below.
Case 1: Indicator functions. If $B \in \mathcal{S}$ and $f = 1_B$ then recalling the relevant definitions shows
$$E 1_B(X) = P(X \in B) = \mu(B) = \int_S 1_B(y) \, \mu(dy)$$
Case 2: Simple functions. Let $f(x) = \sum_{m=1}^n c_m 1_{B_m}$ where $c_m \in \mathbb{R}$, $B_m \in \mathcal{S}$. The linearity of expected value, the result of Case 1, and the linearity of integration imply
$$Ef(X) = \sum_{m=1}^n c_m E 1_{B_m}(X) = \sum_{m=1}^n c_m \int_S 1_{B_m}(y) \, \mu(dy) = \int_S f(y) \, \mu(dy)$$
Case 3: Nonnegative functions. Now if $f \ge 0$ and we let
$$f_n(x) = ([2^n f(x)]/2^n) \wedge n$$
where $[x]$ = the largest integer $\le x$ and $a \wedge b = \min\{a, b\}$, then the $f_n$ are simple and $f_n \uparrow f$, so using the result for simple functions and the monotone convergence theorem:
$$Ef(X) = \lim_n Ef_n(X) = \lim_n \int_S f_n(y) \, \mu(dy) = \int_S f(y) \, \mu(dy)$$
Case 4: Integrable functions. The general case now follows by writing $f(x) = f(x)^+ - f(x)^-$. The condition $E|f(X)| < \infty$ guarantees that $Ef(X)^+$ and $Ef(X)^-$ are finite. So using the result for nonnegative functions and linearity of expected value and integration:
$$Ef(X) = Ef(X)^+ - Ef(X)^- = \int_S f(y)^+ \, \mu(dy) - \int_S f(y)^- \, \mu(dy) = \int_S f(y) \, \mu(dy)$$
which completes the proof.
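The approximating sequence used in Case 3 of the proof can be watched in action. The Python sketch below is an illustration, not part of the proof: the name `dyadic_approx` and the sample value `f_value` are assumptions made here. It computes $([2^n f]/2^n) \wedge n$ at a single point and shows the values increasing to $f$.

```python
import math

def dyadic_approx(value, n):
    """f_n = ([2^n f] / 2^n) ∧ n for a nonnegative number f = value."""
    return min(math.floor(2**n * value) / 2**n, n)

f_value = math.pi   # an arbitrary nonnegative value f(x), chosen for illustration
approximations = [dyadic_approx(f_value, n) for n in range(1, 12)]
print(approximations)
```

For small $n$ the cap $\wedge\, n$ is active (the first value is 1); once $n$ exceeds $f(x)$ the error is less than $2^{-n}$, which is why $f_n \uparrow f$ pointwise.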
For practice with the proof technique of Theorem 1.3.9, do
Exercise 1.3.8. Suppose that the probability measure $\mu$ has $\mu(A) = \int_A f(x) \, dx$ for all $A \in \mathcal{R}$. Then for any $g$ with $g \ge 0$ or $\int |g(x)| \, \mu(dx) < \infty$ we have
$$\int g(x) \, \mu(dx) = \int g(x) f(x) \, dx$$
A consequence of Theorem 1.3.9 is that we can compute expected values of functions of random variables by performing integrals on the real line. Before we can do some examples, we need to introduce the terminology for what we are about to compute. If $k$ is a positive integer then $EX^k$ is called the $k$th moment of $X$. The first moment $EX$ is usually called the mean and denoted by $\mu$. If $EX^2 < \infty$ then the variance of $X$ is defined to be $\operatorname{var}(X) = E(X - \mu)^2$. To compute the variance the following formula is useful:
$$\operatorname{var}(X) = E(X - \mu)^2 = EX^2 - 2\mu EX + \mu^2 = EX^2 - \mu^2 \tag{1.3.1}$$
From this it is immediate that
$$\operatorname{var}(X) \le EX^2 \tag{1.3.2}$$
Here $EX^2$ is the expected value of $X^2$. When we want the square of $EX$, we will write $(EX)^2$. Since $E(aX + b) = aEX + b$ by (b) of Theorem 1.3.1, it follows easily from the definition that
$$\operatorname{var}(aX + b) = E(aX + b - E(aX + b))^2 = a^2 E(X - EX)^2 = a^2 \operatorname{var}(X) \tag{1.3.3}$$
We turn now to concrete examples and leave the calculus in the first two examples to the reader. (Integrate by parts.)
Example 1.3.1. If $X$ has an exponential distribution then
$$EX^k = \int_0^\infty x^k e^{-x} \, dx = k!$$
So the mean of $X$ is 1 and variance is $EX^2 - (EX)^2 = 2 - 1^2 = 1$. If we let $Y = X/\lambda$, then by Exercise 1.1.10, $Y$ has density $\lambda e^{-\lambda y}$ for $y \ge 0$, the exponential density with parameter $\lambda$. From (b) of Theorem 1.3.1 and (1.3.3), it follows that $Y$ has mean $1/\lambda$ and variance $1/\lambda^2$.
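The moment formula $EX^k = k!$ can be checked numerically without integrating by parts. The following Python sketch is an assumption-laden illustration: the function name `exponential_moment`, the truncation point `upper`, and the step count are choices made here, and the midpoint rule only approximates the integral.

```python
import math

def exponential_moment(k, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x^k e^{-x} over [0, upper]."""
    h = upper / steps
    return h * sum(((i + 0.5) * h) ** k * math.exp(-(i + 0.5) * h)
                   for i in range(steps))

moments = {k: exponential_moment(k) for k in range(1, 5)}
variance = moments[2] - moments[1] ** 2   # formula (1.3.1)
print(moments, variance)
```

The approximations agree with $1!, 2!, 3!, 4!$ to several decimal places, and the variance computed from (1.3.1) is close to 1.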
Example 1.3.2. If $X$ has a standard normal distribution,
$$EX = \int x (2\pi)^{-1/2} \exp(-x^2/2) \, dx = 0 \quad \text{(by symmetry)}$$
$$\operatorname{var}(X) = EX^2 = \int x^2 (2\pi)^{-1/2} \exp(-x^2/2) \, dx = 1$$
If we let $\sigma > 0$, $\mu \in \mathbb{R}$, and $Y = \sigma X + \mu$, then (b) of Theorem 1.3.1 and (1.3.3) imply $EY = \mu$ and $\operatorname{var}(Y) = \sigma^2$. By Exercise 1.1.10, $Y$ has density
$$(2\pi\sigma^2)^{-1/2} \exp(-(y - \mu)^2/2\sigma^2)$$
the normal distribution with mean $\mu$ and variance $\sigma^2$.
We will next consider some discrete distributions. The ﬁrst is ridiculously simple,
but we will need the result several times below, so we record it here.
Example 1.3.3. We say that X has a Bernoulli distribution with parameter p if
P (X = 1) = p and P (X = 0) = 1 − p. Clearly,
EX = p · 1 + (1 − p) · 0 = p
Since $X^2 = X$, we have $EX^2 = EX = p$ and
$$\operatorname{var}(X) = EX^2 - (EX)^2 = p - p^2 = p(1 - p)$$
Example 1.3.4. We say that $X$ has a Poisson distribution with parameter $\lambda$ if
$$P(X = k) = e^{-\lambda} \lambda^k / k! \quad \text{for } k = 0, 1, 2, \ldots$$
To evaluate the moments of the Poisson random variable, we use a little inspiration to observe that for $k \ge 1$
$$E(X(X - 1) \cdots (X - k + 1)) = \sum_{j=k}^\infty j(j - 1) \cdots (j - k + 1) e^{-\lambda} \frac{\lambda^j}{j!} = \lambda^k \sum_{j=k}^\infty e^{-\lambda} \frac{\lambda^{j-k}}{(j - k)!} = \lambda^k$$
where the equalities follow from the facts that (i) $j(j - 1) \cdots (j - k + 1) = 0$ when $j < k$, (ii) cancelling part of the factorial, (iii) the fact that Poisson distribution has total mass 1. Using the last formula, it follows that $EX = \lambda$ while
$$\operatorname{var}(X) = EX^2 - (EX)^2 = E(X(X - 1)) + EX - \lambda^2 = \lambda$$
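The factorial-moment identity for the Poisson distribution is easy to confirm by summing the series directly. The Python sketch below is illustrative: the names `poisson_pmf` and `factorial_moment`, the rate `lam = 2.5`, and the truncation at 100 terms are assumptions made here.

```python
import math

def poisson_pmf(j, lam):
    """P(X = j) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**j / math.factorial(j)

def factorial_moment(k, lam, terms=100):
    """Truncated series for E[X(X-1)...(X-k+1)], X ~ Poisson(lam)."""
    total = 0.0
    for j in range(k, terms):
        falling = math.prod(range(j - k + 1, j + 1))   # j(j-1)...(j-k+1)
        total += falling * poisson_pmf(j, lam)
    return total

lam = 2.5
fm = {k: factorial_moment(k, lam) for k in range(1, 5)}
variance = fm[2] + fm[1] - fm[1] ** 2   # E X(X-1) + EX - (EX)^2
print(fm, variance)
```

Each factorial moment comes out as $\lambda^k$ up to rounding, and the variance expression reduces to $\lambda$.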
Example 1.3.5. $N$ is said to have a geometric distribution with success probability $p \in (0, 1)$ if
$$P(N = k) = p(1 - p)^{k-1} \quad \text{for } k = 1, 2, \ldots$$
$N$ is the number of independent trials needed to observe an event with probability $p$. Differentiating the identity
$$\sum_{k=0}^\infty (1 - p)^k = 1/p$$
and referring to Example 9.2 in the Appendix for the justification gives
$$-\sum_{k=1}^\infty k(1 - p)^{k-1} = -1/p^2$$
$$\sum_{k=2}^\infty k(k - 1)(1 - p)^{k-2} = 2/p^3$$
From this it follows that
$$EN = \sum_{k=1}^\infty k p (1 - p)^{k-1} = 1/p$$
$$EN(N - 1) = \sum_{k=1}^\infty k(k - 1) p (1 - p)^{k-1} = 2(1 - p)/p^2$$
$$\operatorname{var}(N) = EN^2 - (EN)^2 = EN(N - 1) + EN - (EN)^2 = \frac{2(1 - p)}{p^2} + \frac{p}{p^2} - \frac{1}{p^2} = \frac{1 - p}{p^2}$$

Exercises
1.3.9. Inclusion-exclusion formula. Let $A_1, A_2, \ldots, A_n$ be events and $A = \cup_{i=1}^n A_i$. Prove that $1_A = 1 - \prod_{i=1}^n (1 - 1_{A_i})$. Expand out the right hand side, then take expected value to conclude
$$P(\cup_{i=1}^n A_i) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \ldots + (-1)^{n-1} P(\cap_{i=1}^n A_i)$$
1.3.10. Bonferroni inequalities. Let $A_1, A_2, \ldots, A_n$ be events and $A = \cup_{i=1}^n A_i$. Show that $1_A \le \sum_{i=1}^n 1_{A_i}$, etc. and then take expected values to conclude
$$P(\cup_{i=1}^n A_i) \le \sum_{i=1}^n P(A_i)$$
$$P(\cup_{i=1}^n A_i) \ge \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j)$$
$$P(\cup_{i=1}^n A_i) \le \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k)$$
In general, if we stop the inclusion exclusion formula after an even (odd) number of sums, we get a lower (upper) bound.
1.3.11. If $E|X|^k < \infty$ then for $0 < j < k$, $E|X|^j < \infty$, and furthermore
$$E|X|^j \le (E|X|^k)^{j/k}$$
1.3.12. Apply Jensen's inequality with $\varphi(x) = e^x$ and $P(X = \log y_m) = p(m)$ to conclude that if $\sum_{m=1}^n p(m) = 1$ and $p(m), y_m > 0$ then
$$\sum_{m=1}^n p(m) y_m \ge \prod_{m=1}^n y_m^{p(m)}$$
When $p(m) = 1/n$, this says the arithmetic mean exceeds the geometric mean.
1.3.13. If $EX_1^- < \infty$ and $X_n \uparrow X$ then $EX_n \uparrow EX$.
1.3.14. Let $X \ge 0$ but do NOT assume $E(1/X) < \infty$. Show
$$\lim_{y \to \infty} yE(1/X; X > y) = 0, \qquad \lim_{y \downarrow 0} yE(1/X; X > y) = 0.$$
1.3.15. If $X_n \ge 0$ then $E(\sum_{n=0}^\infty X_n) = \sum_{n=0}^\infty EX_n$.
1.3.16. If $X$ is integrable and $A_n$ are disjoint sets with union $A$ then
$$\sum_{n=0}^\infty E(X; A_n) = E(X; A)$$
i.e., the sum converges absolutely and has the value on the right.

1.4 Independence

We begin with what is hopefully a familiar definition and then work our way up to a definition that is appropriate for our current setting.
Two events A and B are independent if P (A ∩ B ) = P (A)P (B ).
Two random variables X and Y are independent if for all C, D ∈ R,
P (X ∈ C, Y ∈ D) = P (X ∈ C )P (Y ∈ D)
i.e., the events A = {X ∈ C } and B = {Y ∈ D} are independent.
Two σ ﬁelds F and G are independent if for all A ∈ F and B ∈ G the events A and
B are independent.
As the next exercise shows, the second deﬁnition is a special case of the third.
Exercise 1.4.1. (i) Show that if X and Y are independent then σ (X ) and σ (Y ) are.
(ii) Conversely, if F and G are independent, X ∈ F , and Y ∈ G , then X and Y are
independent.
The ﬁrst deﬁnition is, in turn, a special case of the second.
Exercise 1.4.2. (i) Show that if A and B are independent then so are Ac and B , A
and B c , and Ac and B c . (ii) Conclude that events A and B are independent if and
only if their indicator random variables 1A and 1B are independent.
In view of the fact that the ﬁrst deﬁnition is a special case of the second, which
is a special case of the third, we take things in the opposite order when we say what
it means for several things to be independent. We begin by reducing to the case of
ﬁnitely many objects. An inﬁnite collection of objects (σ ﬁelds, random variables, or
sets) is said to be independent if every ﬁnite subcollection is.
σ-fields $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_n$ are independent if whenever $A_i \in \mathcal{F}_i$ for $i = 1, \ldots, n$, we have
$$P(\cap_{i=1}^n A_i) = \prod_{i=1}^n P(A_i)$$
Random variables $X_1, \ldots, X_n$ are independent if whenever $B_i \in \mathcal{R}$ for $i = 1, \ldots, n$ we have
$$P(\cap_{i=1}^n \{X_i \in B_i\}) = \prod_{i=1}^n P(X_i \in B_i)$$
Sets $A_1, \ldots, A_n$ are independent if whenever $I \subset \{1, \ldots, n\}$ we have
$$P(\cap_{i \in I} A_i) = \prod_{i \in I} P(A_i)$$
At first glance, it might seem that the last definition does not match the other two. However, if you think about it for a minute, you will see that if the indicator variables $1_{A_i}$, $1 \le i \le n$ are independent and we take $B_i = \{1\}$ for $i \in I$, and $B_i = \mathbb{R}$ for $i \notin I$ then the condition in the definition results. Conversely,
Exercise 1.4.3. Let $A_1, A_2, \ldots, A_n$ be independent. Show (i) $A_1^c, A_2, \ldots, A_n$ are independent; (ii) $1_{A_1}, \ldots, 1_{A_n}$ are independent.
One of the first things to understand about the definition of independent events is that it is not enough to assume $P(A_i \cap A_j) = P(A_i)P(A_j)$ for all $i \ne j$. A sequence of events $A_1, \ldots, A_n$ with the last property is called pairwise independent. It is clear that independent events are pairwise independent. The next example shows that the converse is not true.
Example 1.4.1. Let $X_1, X_2, X_3$ be independent random variables with
$$P(X_i = 0) = P(X_i = 1) = 1/2$$
Let $A_1 = \{X_2 = X_3\}$, $A_2 = \{X_3 = X_1\}$ and $A_3 = \{X_1 = X_2\}$. These events are pairwise independent since if $i \ne j$ then
$$P(A_i \cap A_j) = P(X_1 = X_2 = X_3) = 1/4 = P(A_i)P(A_j)$$
but they are not independent since
$$P(A_1 \cap A_2 \cap A_3) = 1/4 \ne 1/8 = P(A_1)P(A_2)P(A_3)$$
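Example 1.4.1 is small enough to verify by enumerating the eight equally likely outcomes. The Python sketch below does exactly that; the helper `prob` and the encoding of events as predicates on outcome triples are choices made here for illustration.

```python
from fractions import Fraction
from itertools import product

# The eight equally likely values of (X1, X2, X3).
outcomes = list(product([0, 1], repeat=3))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A1 = lambda w: w[1] == w[2]   # {X2 = X3}
A2 = lambda w: w[2] == w[0]   # {X3 = X1}
A3 = lambda w: w[0] == w[1]   # {X1 = X2}
events = [A1, A2, A3]

pairwise = all(
    prob(lambda w: events[i](w) and events[j](w)) == prob(events[i]) * prob(events[j])
    for i in range(3) for j in range(i + 1, 3)
)
triple = prob(lambda w: A1(w) and A2(w) and A3(w))
product_of_three = prob(A1) * prob(A2) * prob(A3)
print(pairwise, triple, product_of_three)
```

The enumeration reports that every pair factorizes, while the triple intersection has probability 1/4 rather than 1/8.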
In order to show that random variables $X$ and $Y$ are independent, we have to check that $P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$ for all Borel sets $A$ and $B$. Since there are a lot of Borel sets, our next topic is

1.4.1 Sufficient Conditions for Independence

Our main result is Theorem 1.4.3. To state that result, we need a definition that generalizes all our earlier definitions.
Collections of sets $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_n \subset \mathcal{F}$ are said to be independent if whenever $A_i \in \mathcal{A}_i$ and $I \subset \{1, \ldots, n\}$ we have $P(\cap_{i \in I} A_i) = \prod_{i \in I} P(A_i)$.
If each collection is a single set i.e., $\mathcal{A}_i = \{A_i\}$, this definition reduces to the one for sets.
Lemma 1.4.1. Without loss of generality we can suppose each $\mathcal{A}_i$ contains $\Omega$. In this case the condition is equivalent to
$$P(\cap_{i=1}^n A_i) = \prod_{i=1}^n P(A_i) \quad \text{whenever } A_i \in \mathcal{A}_i$$
since we can set $A_i = \Omega$ for $i \notin I$.
Proof. If $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_n$ are independent and $\bar{\mathcal{A}}_i = \mathcal{A}_i \cup \{\Omega\}$ then $\bar{\mathcal{A}}_1, \bar{\mathcal{A}}_2, \ldots, \bar{\mathcal{A}}_n$ are independent, since if $A_i \in \bar{\mathcal{A}}_i$ and $I = \{j : A_j \ne \Omega\}$, then $\cap_i A_i = \cap_{i \in I} A_i$.
The proof of Theorem 1.4.3 is based on Dynkin's $\pi$-$\lambda$ theorem ((2.1) in the Appendix). To state this result, we need two definitions. We say that $\mathcal{A}$ is a $\pi$-system if it is closed under intersection, i.e., if $A, B \in \mathcal{A}$ then $A \cap B \in \mathcal{A}$. We say that $\mathcal{L}$ is a $\lambda$-system if: (i) $\Omega \in \mathcal{L}$. (ii) If $A, B \in \mathcal{L}$ and $A \subset B$ then $B - A \in \mathcal{L}$. (iii) If $A_n \in \mathcal{L}$ and $A_n \uparrow A$ then $A \in \mathcal{L}$.
Theorem 1.4.2. $\pi$-$\lambda$ Theorem. If $\mathcal{P}$ is a $\pi$-system and $\mathcal{L}$ is a $\lambda$-system that contains $\mathcal{P}$ then $\sigma(\mathcal{P}) \subset \mathcal{L}$.
Theorem 1.4.3. Suppose $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_n$ are independent and each $\mathcal{A}_i$ is a $\pi$-system. Then $\sigma(\mathcal{A}_1), \sigma(\mathcal{A}_2), \ldots, \sigma(\mathcal{A}_n)$ are independent.
Proof. Let $A_2, \ldots, A_n$ be sets with $A_i \in \mathcal{A}_i$, let $F = A_2 \cap \cdots \cap A_n$ and let $\mathcal{L} = \{A : P(A \cap F) = P(A)P(F)\}$. Since $P(\Omega \cap F) = P(\Omega)P(F)$, (i) $\Omega \in \mathcal{L}$. To check (ii), we note that if $A, B \in \mathcal{L}$ with $A \subset B$ then $(B - A) \cap F = (B \cap F) - (A \cap F)$. So using (i) in Exercise 1.1.1, the fact $A, B \in \mathcal{L}$ and then (i) in Exercise 1.1.1 again:
$$P((B - A) \cap F) = P(B \cap F) - P(A \cap F) = P(B)P(F) - P(A)P(F) = \{P(B) - P(A)\}P(F) = P(B - A)P(F)$$
and we have $B - A \in \mathcal{L}$. To check (iii) let $B_k \in \mathcal{L}$ with $B_k \uparrow B$ and note that $(B_k \cap F) \uparrow (B \cap F)$ so using (iii) in Exercise 1.1.1, then the fact $B_k \in \mathcal{L}$ and then (iii) in Exercise 1.1.1 again:
$$P(B \cap F) = \lim_k P(B_k \cap F) = \lim_k P(B_k)P(F) = P(B)P(F)$$
Since $\mathcal{A}_1 \subset \mathcal{L}$ by independence, applying the $\pi$-$\lambda$ theorem now gives $\mathcal{L} \supset \sigma(\mathcal{A}_1)$. It follows that if $A_1 \in \sigma(\mathcal{A}_1)$ and $A_i \in \mathcal{A}_i$ for $2 \le i \le n$ then
$$P(\cap_{i=1}^n A_i) = P(A_1)P(\cap_{i=2}^n A_i) = \prod_{i=1}^n P(A_i)$$
Using Lemma 1.4.1 now, we have:
$(*)$ If $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_n$ are independent then $\sigma(\mathcal{A}_1), \mathcal{A}_2, \ldots, \mathcal{A}_n$ are independent.
Applying $(*)$ to $\mathcal{A}_2, \ldots, \mathcal{A}_n, \sigma(\mathcal{A}_1)$ (which are independent since the definition is unchanged by permuting the order) shows that $\sigma(\mathcal{A}_2), \mathcal{A}_3, \ldots, \mathcal{A}_n, \sigma(\mathcal{A}_1)$ are independent, and after $n$ iterations we have the desired result.
Remark. The reader should note that it is not easy to show that if A, B ∈ L then
A ∩ B ∈ L, or A ∪ B ∈ L, but it is easy to check that if A, B ∈ L with A ⊂ B then
B − A ∈ L.
Having worked to establish Theorem 1.4.3, we get several corollaries.
Theorem 1.4.4. In order for $X_1, \ldots, X_n$ to be independent, it is sufficient that for all $x_1, \ldots, x_n \in (-\infty, \infty]$
$$P(X_1 \le x_1, \ldots, X_n \le x_n) = \prod_{i=1}^n P(X_i \le x_i)$$
Proof. Let $\mathcal{A}_i$ = the sets of the form $\{X_i \le x_i\}$. Since $\{X_i \le x\} \cap \{X_i \le y\} = \{X_i \le x \wedge y\}$, $\mathcal{A}_i$ is a $\pi$-system. Since we have allowed $x_i = \infty$, $\Omega \in \mathcal{A}_i$. Exercise 1.2.1 implies $\sigma(\mathcal{A}_i) = \sigma(X_i)$, so the result follows from Theorem 1.4.3.
The last result expresses independence of random variables in terms of their distribution functions. The next two exercises treat density functions and discrete random variables.
Exercise 1.4.4. Suppose $(X_1, \ldots, X_n)$ has density $f(x_1, x_2, \ldots, x_n)$, that is
$$P((X_1, X_2, \ldots, X_n) \in A) = \int_A f(x) \, dx \quad \text{for } A \in \mathcal{R}^n$$
If $f(x)$ can be written as $g_1(x_1) \cdots g_n(x_n)$ where the $g_m \ge 0$ are measurable, then $X_1, X_2, \ldots, X_n$ are independent. Note that the $g_m$ are not assumed to be probability densities.
Exercise 1.4.5. Suppose $X_1, \ldots, X_n$ are random variables that take values in countable sets $S_1, \ldots, S_n$. Then in order for $X_1, \ldots, X_n$ to be independent, it is sufficient that whenever $x_i \in S_i$
$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^n P(X_i = x_i)$$
Our next goal is to prove that functions of disjoint collections of independent random variables are independent. See Theorem 1.4.6 for the precise statement. First we will prove an analogous result for σ-fields.
Theorem 1.4.5. Suppose $\mathcal{F}_{i,j}$, $1 \le i \le n$, $1 \le j \le m(i)$ are independent and let $\mathcal{G}_i = \sigma(\cup_j \mathcal{F}_{i,j})$. Then $\mathcal{G}_1, \ldots, \mathcal{G}_n$ are independent.
Proof. Let $\mathcal{A}_i$ be the collection of sets of the form $\cap_j A_{i,j}$ where $A_{i,j} \in \mathcal{F}_{i,j}$. $\mathcal{A}_i$ is a $\pi$-system that contains $\Omega$ and contains $\cup_j \mathcal{F}_{i,j}$ so Theorem 1.4.3 implies $\sigma(\mathcal{A}_i) = \mathcal{G}_i$ are independent.
Theorem 1.4.6. If for $1 \le i \le n$, $1 \le j \le m(i)$, $X_{i,j}$ are independent and $f_i : \mathbb{R}^{m(i)} \to \mathbb{R}$ are measurable then $f_i(X_{i,1}, \ldots, X_{i,m(i)})$ are independent.
Proof. Let $\mathcal{F}_{i,j} = \sigma(X_{i,j})$ and $\mathcal{G}_i = \sigma(\cup_j \mathcal{F}_{i,j})$. Since $f_i(X_{i,1}, \ldots, X_{i,m(i)}) \in \mathcal{G}_i$, the desired result follows from Theorem 1.4.5 and Exercise 1.4.1.
A concrete special case of Theorem 1.4.6 that we will use in a minute is: if $X_1, \ldots, X_n$ are independent then $X = X_1$ and $Y = X_2 \cdots X_n$ are independent. Later, when we study sums $S_m = X_1 + \cdots + X_m$ of independent random variables $X_1, \ldots, X_n$, we will use Theorem 1.4.6 to conclude that if $m < n$ then $S_n - S_m$ is independent of the indicator function of the event $\{\max_{1 \le k \le m} S_k > x\}$.

1.4.2 Independence, Distribution, and Expectation

Our next goal is to obtain formulas for the distribution and expectation of independent random variables.
Theorem 1.4.7. Suppose $X_1, \ldots, X_n$ are independent random variables and $X_i$ has distribution $\mu_i$, then $(X_1, \ldots, X_n)$ has distribution $\mu_1 \times \cdots \times \mu_n$.
Proof. Using the definitions of (i) $A_1 \times \cdots \times A_n$, (ii) independence, (iii) $\mu_i$, and (iv) $\mu_1 \times \cdots \times \mu_n$
$$P((X_1, \ldots, X_n) \in A_1 \times \cdots \times A_n) = P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n P(X_i \in A_i) = \prod_{i=1}^n \mu_i(A_i) = \mu_1 \times \cdots \times \mu_n(A_1 \times \cdots \times A_n)$$
The last formula shows that the distribution of $(X_1, \ldots, X_n)$ and the measure $\mu_1 \times \cdots \times \mu_n$ agree on sets of the form $A_1 \times \cdots \times A_n$, a $\pi$-system that generates $\mathcal{R}^n$. So (2.2) in the Appendix implies they must agree.
Theorem 1.4.8. Suppose $X$ and $Y$ are independent and have distributions $\mu$ and $\nu$. If $h : \mathbb{R}^2 \to \mathbb{R}$ is a measurable function with $h \ge 0$ or $E|h(X, Y)| < \infty$ then
$$Eh(X, Y) = \iint h(x, y) \, \mu(dx) \, \nu(dy)$$
In particular, if $h(x, y) = f(x)g(y)$ where $f, g : \mathbb{R} \to \mathbb{R}$ are measurable functions with $f, g \ge 0$ or $E|f(X)|$ and $E|g(Y)| < \infty$ then
$$Ef(X)g(Y) = Ef(X) \cdot Eg(Y)$$
Proof. Using Theorem 1.3.9 and then Fubini's theorem ((6.2) in the Appendix) we have
$$Eh(X, Y) = \int_{\mathbb{R}^2} h \, d(\mu \times \nu) = \iint h(x, y) \, \mu(dx) \, \nu(dy)$$
To prove the second result, we start with the result when $f, g \ge 0$. In this case, using the first result, the fact that $g(y)$ does not depend on $x$ and then Theorem 1.3.9 twice we get
$$Ef(X)g(Y) = \iint f(x)g(y) \, \mu(dx) \, \nu(dy) = \int g(y) \int f(x) \, \mu(dx) \, \nu(dy) = \int Ef(X) \, g(y) \, \nu(dy) = Ef(X)Eg(Y)$$
Applying the result for nonnegative $f$ and $g$ to $|f|$ and $|g|$ shows $E|f(X)g(Y)| = E|f(X)| \, E|g(Y)| < \infty$, and we can repeat the last argument to prove the desired result.
From Theorem 1.4.8, it is only a small step to
Theorem 1.4.9. If $X_1, \ldots, X_n$ are independent and have (a) $X_i \ge 0$ for all $i$, or (b) $E|X_i| < \infty$ for all $i$ then
$$E\left(\prod_{i=1}^n X_i\right) = \prod_{i=1}^n EX_i$$
i.e., the expectation on the left exists and has the value given on the right.
Proof. $X = X_1$ and $Y = X_2 \cdots X_n$ are independent by Theorem 1.4.6 so taking $f(x) = |x|$ and $g(y) = |y|$ we have $E|X_1 \cdots X_n| = E|X_1| \, E|X_2 \cdots X_n|$, and it follows by induction that if $1 \le m \le n$
$$E|X_m \cdots X_n| = \prod_{i=m}^n E|X_i|$$
If the $X_i \ge 0$, then $|X_i| = X_i$ and the desired result follows from the special case $m = 1$. To prove the result in general note that the special case $m = 2$ implies $E|Y| = E|X_2 \cdots X_n| < \infty$, so using Theorem 1.4.8 with $f(x) = x$ and $g(y) = y$ shows $E(X_1 \cdots X_n) = EX_1 \cdot E(X_2 \cdots X_n)$, and the desired result follows by induction.
Example 1.4.2. It can happen that $E(XY) = EX \cdot EY$ without the variables being independent. Suppose the joint distribution of $X$ and $Y$ is given by the following table
$$\begin{array}{cc|ccc} & & & Y & \\ & & 1 & 0 & -1 \\ \hline & 1 & 0 & a & 0 \\ X & 0 & b & c & b \\ & -1 & 0 & a & 0 \end{array}$$
where $a, b > 0$, $c \ge 0$, and $2a + 2b + c = 1$. Things are arranged so that $XY \equiv 0$. Symmetry implies $EX = 0$ and $EY = 0$, so $E(XY) = 0 = EXEY$. The random variables are not independent since
$$P(X = 1, Y = 1) = 0 < ab = P(X = 1)P(Y = 1)$$
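The claims in Example 1.4.2 can be checked with exact arithmetic for one admissible choice of the parameters. In the Python sketch below, the values $a = b = c = 1/5$ are an assumption made here purely for illustration (any $a, b > 0$, $c \ge 0$ with $2a + 2b + c = 1$ would do), and the dictionary encodes the joint distribution with rows indexed by $X$ and columns by $Y$.

```python
from fractions import Fraction

# Illustrative parameters satisfying 2a + 2b + c = 1.
a, b, c = Fraction(1, 5), Fraction(1, 5), Fraction(1, 5)

# Joint distribution: (x, y) -> P(X = x, Y = y).
joint = {
    (1, 1): 0, (1, 0): a, (1, -1): 0,
    (0, 1): b, (0, 0): c, (0, -1): b,
    (-1, 1): 0, (-1, 0): a, (-1, -1): 0,
}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
PX1 = sum(p for (x, y), p in joint.items() if x == 1)   # P(X = 1)
PY1 = sum(p for (x, y), p in joint.items() if y == 1)   # P(Y = 1)
print(EX, EY, EXY, joint[(1, 1)], PX1 * PY1)
```

The output shows $E(XY) = EX \cdot EY = 0$ even though $P(X = 1, Y = 1) = 0$ differs from $P(X = 1)P(Y = 1) = 1/25$.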
Two random variables $X$ and $Y$ with $EX^2, EY^2 < \infty$ that have $EXY = EXEY$ are said to be uncorrelated. The finite second moments are needed so that we know $E|XY| < \infty$ by the Cauchy-Schwarz inequality.
Exercise 1.4.6. Let $\Omega = (0, 1)$, $\mathcal{F}$ = Borel sets, $P$ = Lebesgue measure. $X_n(\omega) = \sin(2\pi n\omega)$, $n = 1, 2, \ldots$ are uncorrelated but not independent.

1.4.3 Sums of Independent Random Variables

Theorem 1.4.10. If $X$ and $Y$ are independent, $F(x) = P(X \le x)$, and $G(y) = P(Y \le y)$, then
$$P(X + Y \le z) = \int F(z - y) \, dG(y)$$
The integral on the right-hand side is called the convolution of $F$ and $G$ and is denoted $F * G(z)$. The meaning of $dG(y)$ will be explained in the proof.
Proof. Let $h(x, y) = 1_{(x+y \le z)}$. Let $\mu$ and $\nu$ be the probability measures with distribution functions $F$ and $G$. Since for fixed $y$
$$\int h(x, y) \, \mu(dx) = \int 1_{(-\infty, z-y]}(x) \, \mu(dx) = F(z - y)$$
using Theorem 1.4.8 gives
$$P(X + Y \le z) = \iint 1_{(x+y \le z)} \, \mu(dx) \, \nu(dy) = \int F(z - y) \, \nu(dy) = \int F(z - y) \, dG(y)$$
The last equality is just a change of notation: We regard $dG(y)$ as a shorthand for "integrate with respect to the measure $\nu$ with distribution function $G$."
Exercise 1.4.7. (i) Show that if $X$ and $Y$ are independent with distributions $\mu$ and $\nu$ then
$$P(X + Y = 0) = \sum_y \mu(\{-y\})\nu(\{y\})$$
(ii) Conclude that if $X$ has continuous distribution $P(X = Y) = 0$.
To treat concrete examples, we need a special case of Theorem 1.4.10.
Theorem 1.4.11. Suppose that $X$ with density $f$ and $Y$ with distribution function $G$ are independent. Then $X + Y$ has density
$$h(x) = \int f(x - y) \, dG(y)$$
When $Y$ has density $g$, the last formula can be written as
$$h(x) = \int f(x - y) \, g(y) \, dy$$
Proof. From Theorem 1.4.10, the definition of density function, and Fubini's theorem ((6.2) in the Appendix), which is justified since everything is nonnegative, we get
$$P(X + Y \le z) = \int F(z - y) \, dG(y) = \int \int_{-\infty}^z f(x - y) \, dx \, dG(y) = \int_{-\infty}^z \int f(x - y) \, dG(y) \, dx$$
The last equation says that $X + Y$ has density $h(x) = \int f(x - y) \, dG(y)$. The second formula follows from the first when we recall the meaning of $dG(y)$ given in the proof of Theorem 1.4.10 and use Exercise 1.3.8.
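The density formula $h(x) = \int f(x - y) g(y) \, dy$ can be evaluated numerically in a case where the answer is known by hand. The Python sketch below is an illustration under assumptions chosen here: both variables are exponential with rate 1, the helper names `exp_density` and `sum_density` are hypothetical, and the integral is approximated by the midpoint rule. For this choice the integrand equals $e^{-x}$ on $0 \le y \le x$, so the exact density is $x e^{-x}$.

```python
import math

def exp_density(x):
    """Density of an exponential random variable with rate 1."""
    return math.exp(-x) if x >= 0 else 0.0

def sum_density(f, g, x, steps=10_000):
    """Midpoint approximation of h(x) = integral of f(x - y) g(y) dy over 0 <= y <= x.

    Restricting to [0, x] is valid when f and g vanish on the negative half-line.
    """
    if x <= 0:
        return 0.0
    h = x / steps
    return h * sum(f(x - (i + 0.5) * h) * g((i + 0.5) * h) for i in range(steps))

points = [0.5, 1.0, 3.0]
numeric = [sum_density(exp_density, exp_density, x) for x in points]
exact = [x * math.exp(-x) for x in points]
print(numeric, exact)
```

The numerical and exact values agree to many decimal places, since the integrand is constant in $y$ for this pair of densities.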
Theorem 1.4.11 plus some ugly calculus allows us to treat two standard examples. These facts should be familiar from undergraduate probability. We give one calculation and leave the other to the reader.
Example 1.4.3. The gamma density with parameters $\alpha$ and $\lambda$ is given by
$$f(x) = \begin{cases} \lambda^\alpha x^{\alpha-1} e^{-\lambda x}/\Gamma(\alpha) & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$$
where $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x} \, dx$. To prove this, we will show
Theorem 1.4.12. If $X$ = gamma$(\alpha, \lambda)$ and $Y$ = gamma$(\beta, \lambda)$ are independent then $X + Y$ is gamma$(\alpha + \beta, \lambda)$.
Proof. Writing $f_{X+Y}(x)$ for the density function of $X + Y$ and using Theorem 1.4.11
$$f_{X+Y}(x) = \int_0^x \frac{\lambda^\alpha (x - y)^{\alpha-1}}{\Gamma(\alpha)} e^{-\lambda(x-y)} \frac{\lambda^\beta y^{\beta-1}}{\Gamma(\beta)} e^{-\lambda y} \, dy = \frac{\lambda^{\alpha+\beta} e^{-\lambda x}}{\Gamma(\alpha)\Gamma(\beta)} \int_0^x (x - y)^{\alpha-1} y^{\beta-1} \, dy$$
so it suffices to show the integral is $x^{\alpha+\beta-1} \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha + \beta)$. To do this, we begin by changing variables $y = xu$, $dy = x \, du$ to get
$$x^{\alpha+\beta-1} \int_0^1 (1 - u)^{\alpha-1} u^{\beta-1} \, du = \int_0^x (x - y)^{\alpha-1} y^{\beta-1} \, dy$$
There are two ways to complete the proof at this point. The soft solution is to note that we have shown that the density $f_{X+Y}(x) = c_{\alpha,\beta} e^{-\lambda x} \lambda^{\alpha+\beta} x^{\alpha+\beta-1}$ where
$$c_{\alpha,\beta} = \frac{1}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 (1 - u)^{\alpha-1} u^{\beta-1} \, du$$
There is only one norming constant $c_{\alpha,\beta}$ that makes this a probability distribution, so we must have $c_{\alpha,\beta} = 1/\Gamma(\alpha + \beta)$.
The less elegant approach is to check the last equality by calculus. Multiplying each side by $e^{-x}$, integrating from 0 to $\infty$, and then using Fubini's theorem on the right we have
$$\Gamma(\alpha + \beta) \int_0^1 (1 - u)^{\alpha-1} u^{\beta-1} \, du = \int_0^\infty \int_0^x y^{\beta-1} e^{-y} (x - y)^{\alpha-1} e^{-(x-y)} \, dy \, dx = \int_0^\infty y^{\beta-1} e^{-y} \int_y^\infty (x - y)^{\alpha-1} e^{-(x-y)} \, dx \, dy = \Gamma(\alpha)\Gamma(\beta)$$
which gives the desired result.
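The identity at the heart of the proof, $\int_0^1 (1 - u)^{\alpha-1} u^{\beta-1} \, du = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha + \beta)$, can be spot-checked numerically. The Python sketch below is a rough illustration: the function name `beta_integral`, the parameter values, and the midpoint rule are choices made here, and the approximation is only intended for $\alpha, \beta \ge 1$, where the integrand is bounded.

```python
import math

def beta_integral(alpha, beta, steps=200_000):
    """Midpoint approximation of the integral of (1-u)^(alpha-1) u^(beta-1) over [0, 1]."""
    h = 1.0 / steps
    return h * sum((1 - (i + 0.5) * h) ** (alpha - 1) * ((i + 0.5) * h) ** (beta - 1)
                   for i in range(steps))

alpha, beta = 2.5, 3.0   # illustrative parameters, both at least 1
lhs = beta_integral(alpha, beta)
rhs = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
print(lhs, rhs)
```

The two printed numbers agree to high accuracy, which is consistent with $c_{\alpha,\beta} = 1/\Gamma(\alpha + \beta)$.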
Exercise 1.4.8. Use the fact that a gamma$(1, \lambda)$ is an exponential with parameter $\lambda$, and induction to show that the sum of $n$ independent exponential$(\lambda)$ r.v.'s, $X_1 + \cdots + X_n$, has a gamma$(n, \lambda)$ distribution.
Exercise 1.4.9. In Example 1.3.2, we introduced the normal density with mean $\mu$ and variance $a$, $(2\pi a)^{-1/2} \exp(-(x - \mu)^2/2a)$. Show that if $X$ = normal$(\mu, a)$ and $Y$ = normal$(\nu, b)$ are independent then $X + Y$ = normal$(\mu + \nu, a + b)$. To simplify this tedious calculation notice that it is enough to prove the result for $\mu = \nu = 0$. In Exercise 3.4 of Chapter 2 you will give a simpler proof of this result.

1.4.4 Constructing Independent Random Variables

The last question that we have to address before we can study independent random variables is: Do they exist? (If they don't exist, then there is no point in studying them!) If we are given a finite number of distribution functions $F_i$, $1 \le i \le n$, it is easy to construct independent random variables $X_1, \ldots, X_n$ with $P(X_i \le x) = F_i(x)$. Let $\Omega = \mathbb{R}^n$, $\mathcal{F} = \mathcal{R}^n$, $X_i(\omega_1, \ldots, \omega_n) = \omega_i$ (the $i$th coordinate of $\omega \in \mathbb{R}^n$), and let $P$ be the measure on $\mathcal{R}^n$ that has
$$P((a_1, b_1] \times \cdots \times (a_n, b_n]) = (F_1(b_1) - F_1(a_1)) \cdots (F_n(b_n) - F_n(a_n))$$
If $\mu_i$ is the measure with distribution function $F_i$ then $P = \mu_1 \times \cdots \times \mu_n$.
To construct an infinite sequence $X_1, X_2, \ldots$ of independent random variables with given distribution functions, we want to perform the last construction on the infinite product space
$$\mathbb{R}^{\mathbb{N}} = \{(\omega_1, \omega_2, \ldots) : \omega_i \in \mathbb{R}\} = \{\text{functions } \omega : \mathbb{N} \to \mathbb{R}\}$$
where $\mathbb{N} = \{1, 2, \ldots\}$ stands for the natural numbers. We define $X_i(\omega) = \omega_i$ and we equip $\mathbb{R}^{\mathbb{N}}$ with the product σ-field $\mathcal{R}^{\mathbb{N}}$, which is generated by the finite dimensional sets = sets of the form $\{\omega : \omega_i \in B_i, 1 \le i \le n\}$ where $B_i \in \mathcal{R}$. It is clear how we want to define $P$ for finite dimensional sets. To assert the existence of a unique extension to $\mathcal{R}^{\mathbb{N}}$ we use (7.1) from the Appendix:
Theorem 1.4.13. Kolmogorov's extension theorem. Suppose we are given probability measures $\mu_n$ on $(\mathbb{R}^n, \mathcal{R}^n)$ that are consistent, that is,
$$\mu_{n+1}((a_1, b_1] \times \cdots \times (a_n, b_n] \times \mathbb{R}) = \mu_n((a_1, b_1] \times \cdots \times (a_n, b_n])$$
Then there is a unique probability measure $P$ on $(\mathbb{R}^{\mathbb{N}}, \mathcal{R}^{\mathbb{N}})$ with
$$P(\omega : \omega_i \in (a_i, b_i], 1 \le i \le n) = \mu_n((a_1, b_1] \times \cdots \times (a_n, b_n])$$
In what follows we will need to construct sequences of random variables that take values in other measurable spaces $(S, \mathcal{S})$. Unfortunately, Theorem 1.4.13 is not valid for arbitrary measurable spaces. The first example (on an infinite product of different spaces $\Omega_1 \times \Omega_2 \times \ldots$) was constructed by Andersen and Jessen (1948). (See Halmos (1950) p. 214 or Neveu (1965) p. 84.) For an example in which all the spaces $\Omega_i$ are the same see Wegner (1973). Fortunately, there is a class of spaces that is adequate for all of our results and for which the generalization of Kolmogorov's theorem is trivial.
$(S, \mathcal{S})$ is said to be nice if there is a 1-1 map $\varphi$ from $S$ into $\mathbb{R}$ so that $\varphi$ and $\varphi^{-1}$ are both measurable.
Such spaces are often called standard Borel spaces, but we already have too many things named after Borel. The next result shows that most spaces arising in applications are nice.
Theorem 1.4.14. If $S$ is a Borel subset of a complete separable metric space $M$, and $\mathcal{S}$ is the collection of Borel subsets of $S$, then $(S, \mathcal{S})$ is nice.
Proof. We begin with the special case $S = [0, 1)^{\mathbb{N}}$ with metric
$$\rho(x, y) = \sum_{n=1}^\infty |x^n - y^n|/2^n$$
Here superscripts on $x$ and $y$ index components, not powers. If $x = (x^1, x^2, x^3, \ldots)$, expand each component in binary $x^j = .x^j_1 x^j_2 x^j_3 \ldots$ (taking the expansion with an infinite number of 0's). Let
$$\varphi_o(x) = .x^1_1 \, x^2_1 x^1_2 \, x^3_1 x^2_2 x^1_3 \, x^4_1 x^3_2 x^2_3 x^1_4 \ldots$$
To treat the general case, we observe that by letting
$$d(x, y) = \rho(x, y)/(1 + \rho(x, y))$$
(for more details, see Exercise 1.4.10) we can suppose that the metric has $d(x, y) < 1$ for all $x, y$. Let $q_1, q_2, \ldots$ be a countable dense set in $S$. Let
$$\psi(x) = (d(x, q_1), d(x, q_2), \ldots).$$
$\psi : S \to [0, 1)^{\mathbb{N}}$ is continuous and 1-1. $\varphi_o \circ \psi$ gives the desired mapping.
Exercise 1.4.10. Let $\rho(x, y)$ be a metric. (i) Suppose $h$ is differentiable with $h(0) = 0$, $h'(x) > 0$ for $x > 0$ and $h'(x)$ decreasing on $[0, \infty)$. Then $h(\rho(x, y))$ is a metric. (ii) $h(x) = x/(x + 1)$ satisfies the hypotheses in (i).
Caveat emptor. The proof above is somewhat light when it comes to details. For
a more comprehensive discussion, see Section 13.1 of Dudley (1989). An interesting
consequence of the analysis there is that for Borel subsets of a complete separable
metric space the continuum hypothesis is true: i.e., all sets are either ﬁnite, countably
inﬁnite, or have the cardinality of the real numbers.
Exercises
1.4.11. Prove directly from the deﬁnition that if X and Y are independent and f
and $g$ are measurable functions then $f(X)$ and $g(Y)$ are independent.
1.4.12. Let $K \ge 3$ be a prime and let $X$ and $Y$ be independent random variables that are uniformly distributed on $\{0, 1, \ldots, K - 1\}$. For $0 \le n < K$, let $Z_n = X + nY \bmod K$. Show that $Z_0, Z_1, \ldots, Z_{K-1}$ are pairwise independent, i.e., each pair is independent, but if we know the values of two of the variables then we know the values of all the variables.
1.4.13. Find four random variables taking values in {−1, 1} so that any three are
independent but all four are not. Hint: Consider products of independent random
variables.
1.4.14. Let Ω = {1, 2, 3, 4}, F = all subsets of Ω, and P ({i}) = 1/4. Give an example
of two collections of sets A1 and A2 that are independent but whose generated σ ﬁelds
are not.
1.4.15. Show that if $X$ and $Y$ are independent, integer-valued random variables, then
$$P(X + Y = n) = \sum_m P(X = m)P(Y = n - m)$$
1.4.16. In Example 1.3.4, we introduced the Poisson distribution with parameter $\lambda$, which is given by $P(Z = k) = e^{-\lambda} \lambda^k / k!$ for $k = 0, 1, 2, \ldots$ Use the previous exercise to show that if $X$ = Poisson$(\lambda)$ and $Y$ = Poisson$(\mu)$ are independent then $X + Y$ = Poisson$(\lambda + \mu)$.
1.4.17. $X$ is said to have a Binomial$(n, p)$ distribution if
$$P(X = m) = \binom{n}{m} p^m (1 - p)^{n-m}$$
(i) Show that if $X$ = Binomial$(n, p)$ and $Y$ = Binomial$(m, p)$ are independent then $X + Y$ = Binomial$(n + m, p)$. (ii) Look at Example 1.3.3 and use induction to conclude that the sum of $n$ independent Bernoulli$(p)$ random variables is Binomial$(n, p)$.
1.4.18. It should not be surprising that the distribution of X + Y can be F ∗ G
without the random variables being independent. Suppose X, Y ∈ {0, 1, 2} and take
each value with probability 1/3. (a) Find the distribution of X + Y assuming X and
Y are independent. (b) Find all the joint distributions (X, Y ) so that the distribution
of X + Y is the same as the answer to (a).
1.4.19. Let X, Y ≥ 0 be independent with distribution functions F and G. Find the
distribution function of XY.
1.4.20. If we want an infinite sequence of coin tossings, we do not have to use Kolmogorov's theorem. Let $\Omega$ be the unit interval $(0,1)$ equipped with the Borel sets $\mathcal{F}$ and Lebesgue measure $P$. Let $Y_n(\omega) = 1$ if $[2^n \omega]$ is odd and 0 if $[2^n \omega]$ is even. Show that $Y_1, Y_2, \ldots$ are independent with $P(Y_k = 0) = P(Y_k = 1) = 1/2$.

1.5 Weak Laws of Large Numbers

In this section, we will prove several "weak laws of large numbers." The first order of business is to define the mode of convergence that appears in the conclusions of the theorems. We say that $Y_n$ converges to $Y$ in probability if for all $\epsilon > 0$, $P(|Y_n - Y| > \epsilon) \to 0$ as $n \to \infty$.

1.5.1 $L^2$ Weak Laws

Our first set of weak laws comes from computing variances and using Chebyshev's inequality. Extending a definition given in Example 1.4.2 for two random variables, a family of random variables $X_i$, $i \in I$ with $EX_i^2 < \infty$ is said to be uncorrelated if we have
$$E(X_i X_j) = EX_i\, EX_j \quad \text{whenever } i \ne j$$
The key to our weak law for uncorrelated random variables, Theorem 1.5.3, is:
Theorem 1.5.1. Let $X_1, \ldots, X_n$ have $E(X_i^2) < \infty$ and be uncorrelated. Then
$$\operatorname{var}(X_1 + \cdots + X_n) = \operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n)$$
where $\operatorname{var}(Y)$ = the variance of $Y$.
Proof. Let $\mu_i = EX_i$ and $S_n = \sum_{i=1}^n X_i$. Since $ES_n = \sum_{i=1}^n \mu_i$, using the definition of the variance, writing the square of the sum as the product of two copies of the sum, and then expanding, we have
$$\begin{aligned}
\operatorname{var}(S_n) = E(S_n - ES_n)^2 &= E\left( \sum_{i=1}^n (X_i - \mu_i) \right)^2 \\
&= E\left( \sum_{i=1}^n \sum_{j=1}^n (X_i - \mu_i)(X_j - \mu_j) \right) \\
&= \sum_{i=1}^n E(X_i - \mu_i)^2 + 2 \sum_{i=1}^n \sum_{j=1}^{i-1} E((X_i - \mu_i)(X_j - \mu_j))
\end{aligned}$$
where in the last equality we have separated out the diagonal terms $i = j$ and used the fact that the sum over $1 \le i < j \le n$ is the same as the sum over $1 \le j < i \le n$.
The first sum is $\operatorname{var}(X_1) + \ldots + \operatorname{var}(X_n)$ so we want to show that the second sum is zero. To do this, we observe
$$\begin{aligned}
E((X_i - \mu_i)(X_j - \mu_j)) &= EX_i X_j - \mu_i EX_j - \mu_j EX_i + \mu_i \mu_j \\
&= EX_i X_j - \mu_i \mu_j = 0
\end{aligned}$$
since $X_i$ and $X_j$ are uncorrelated.
In words, Theorem 1.5.1 says that for uncorrelated random variables the variance of the sum is the sum of the variances. The second ingredient in our proof of Theorem 1.5.3 is the following consequence of (1.3.3):
$$\operatorname{var}(cY) = c^2 \operatorname{var}(Y)$$
The third and final ingredient is
Lemma 1.5.2. If $p > 0$ and $E|Z_n|^p \to 0$ then $Z_n \to 0$ in probability.
Proof. Chebyshev's inequality, Theorem 1.3.4, with $\varphi(x) = x^p$ and $X = |Z_n|$ implies that if $\epsilon > 0$ then $P(|Z_n| \ge \epsilon) \le \epsilon^{-p} E|Z_n|^p \to 0$.
We can now easily prove
Theorem 1.5.3. $L^2$ weak law. Let $X_1, X_2, \ldots$ be uncorrelated random variables with $EX_i = \mu$ and $\operatorname{var}(X_i) \le C < \infty$. If $S_n = X_1 + \ldots + X_n$ then as $n \to \infty$, $S_n/n \to \mu$ in $L^2$ and in probability.
Proof. To prove $L^2$ convergence, observe that $E(S_n/n) = \mu$, so
$$E(S_n/n - \mu)^2 = \operatorname{var}(S_n/n) = \frac{1}{n^2} (\operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n)) \le \frac{Cn}{n^2} \to 0$$
To conclude there is also convergence in probability, we apply Lemma 1.5.2 to $Z_n = S_n/n - \mu$.
The most important special case of Theorem 1.5.3 occurs when $X_1, X_2, \ldots$ are independent random variables that all have the same distribution. In the jargon, they are independent and identically distributed or i.i.d. for short. Theorem 1.5.3 tells us in this case that if $EX_i^2 < \infty$ then $S_n/n$ converges to $\mu = EX_i$ in probability as $n \to \infty$. In Theorem 1.5.9 below, we will see that $E|X_i| < \infty$ is sufficient for the last conclusion, but for the moment we will concern ourselves with consequences of the weaker result.
Our ﬁrst application is to a situation that on the surface has nothing to do with
randomness.
Example 1.5.1. Polynomial approximation. Let $f$ be a continuous function on $[0,1]$, and let
$$f_n(x) = \sum_{m=0}^n \binom{n}{m} x^m (1 - x)^{n-m} f(m/n) \quad \text{where} \quad \binom{n}{m} = \frac{n!}{m!(n - m)!}$$
be the Bernstein polynomial of degree $n$ associated with $f$. Then as $n \to \infty$
$$\sup_{x \in [0,1]} |f_n(x) - f(x)| \to 0$$
Proof. First observe that if $S_n$ is the sum of $n$ independent random variables with $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$ then $EX_i = p$, $\operatorname{var}(X_i) = p(1 - p)$ and
$$P(S_n = m) = \binom{n}{m} p^m (1 - p)^{n-m}$$
so $Ef(S_n/n) = f_n(p)$. Theorem 1.5.3 tells us that as $n \to \infty$, $S_n/n \to p$ in probability. The last two observations motivate the definition of $f_n(p)$, but to prove the desired conclusion we have to use the proof of Theorem 1.5.3 rather than the result itself. Combining the proof of Theorem 1.5.3 with our formula for the variance of $X_i$ and the fact that $p(1 - p) \le 1/4$ when $p \in [0, 1]$, we have
$$P(|S_n/n - p| > \delta) \le \frac{\operatorname{var}(S_n/n)}{\delta^2} = \frac{p(1 - p)}{n\delta^2} \le \frac{1}{4n\delta^2}$$
To conclude now that $Ef(S_n/n) \to f(p)$, let $M = \sup_{x \in [0,1]} |f(x)|$, let $\epsilon > 0$, and pick $\delta > 0$ so that if $|x - y| < \delta$ then $|f(x) - f(y)| < \epsilon$. (This is possible since a continuous function is uniformly continuous on each closed bounded interval.) Now, using Jensen's inequality, Theorem 1.3.2, gives
$$|Ef(S_n/n) - f(p)| \le E|f(S_n/n) - f(p)| \le \epsilon + 2M P(|S_n/n - p| > \delta)$$
Letting $n \to \infty$, we have $\limsup_{n\to\infty} |Ef(S_n/n) - f(p)| \le \epsilon$, but $\epsilon$ is arbitrary and the bound $1/(4n\delta^2)$ does not depend on $p$, so this gives the desired result.
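The convergence in Example 1.5.1 can be observed numerically. The following is a minimal sketch, not part of the original text: the test function $f(x) = |x - 1/2|$, the grid size, and the helper names `bernstein` and `sup_error` are all assumptions made for illustration, and the supremum is only approximated on a finite grid.

```python
import math

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial f_n(x) = E f(S_n / n)."""
    return sum(
        math.comb(n, m) * x**m * (1 - x) ** (n - m) * f(m / n)
        for m in range(n + 1)
    )

def sup_error(f, n, grid=201):
    """Approximate sup over [0, 1] of |f_n - f| on a uniform grid (assumed fine enough)."""
    xs = [i / (grid - 1) for i in range(grid)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)

# Illustrative choice: continuous on [0, 1] but not differentiable at 1/2.
f = lambda x: abs(x - 0.5)

errors = [sup_error(f, n) for n in (10, 40, 160)]
for n, err in zip((10, 40, 160), errors):
    print(n, round(err, 4))
```

For this particular $f$ the error shrinks roughly like $n^{-1/2}$, which is consistent with the variance bound $1/(4n\delta^2)$ used in the proof, although the proof itself gives no rate.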
Our next result is for comic relief.
Example 1.5.2. A high-dimensional cube is almost the boundary of a ball. Let $X_1, X_2, \ldots$ be independent and uniformly distributed on $(-1, 1)$. Let $Y_i = X_i^2$, which are independent since they are functions of independent random variables. $EY_i = 1/3$ and $\operatorname{var}(Y_i) \le EY_i^2 \le 1$, so Theorem 1.5.3 implies
$$(X_1^2 + \ldots + X_n^2)/n \to 1/3 \quad \text{in probability as } n \to \infty$$
Let $A_{n,\epsilon} = \{x \in \mathbf{R}^n : (1 - \epsilon)\sqrt{n/3} < |x| < (1 + \epsilon)\sqrt{n/3}\}$ where $|x| = (x_1^2 + \cdots + x_n^2)^{1/2}$. If we let $|S|$ denote the Lebesgue measure of $S$ then the last conclusion implies that for any $\epsilon > 0$, $|A_{n,\epsilon} \cap (-1, 1)^n|/2^n \to 1$, or, in words, most of the volume of the cube $(-1, 1)^n$ comes from $A_{n,\epsilon}$, which is almost the boundary of the ball of radius $\sqrt{n/3}$.

1.5.2 Triangular Arrays

Many classical limit theorems in probability concern arrays $X_{n,k}$, $1 \le k \le n$ of random variables and investigate the limiting behavior of their row sums $S_n = X_{n,1} + \cdots + X_{n,n}$. In most cases, we assume that the random variables on each row are independent, but for the next trivial (but useful) result we do not need that assumption. Indeed, here $S_n$ can be any sequence of random variables.
Theorem 1.5.4. Let $\mu_n = ES_n$, $\sigma_n^2 = \operatorname{var}(S_n)$. If $\sigma_n^2/b_n^2 \to 0$ then
$$\frac{S_n - \mu_n}{b_n} \to 0 \quad \text{in probability}$$
Proof. Our assumptions imply $E((S_n - \mu_n)/b_n)^2 = b_n^{-2} \operatorname{var}(S_n) \to 0$, so the desired conclusion follows from Lemma 1.5.2.
We will now give three applications of Theorem 1.5.4. For these three examples, the following calculation is useful:
$$\sum_{m=1}^n \frac{1}{m} \ge \int_1^n \frac{dx}{x} \ge \sum_{m=2}^n \frac{1}{m}$$
$$\log n \le \sum_{m=1}^n \frac{1}{m} \le 1 + \log n \tag{1.5.1}$$
Example 1.5.3. Coupon collector's problem. Let $X_1, X_2, \ldots$ be i.i.d. uniform on $\{1, 2, \ldots, n\}$. To motivate the name, think of collecting baseball cards (or coupons). Suppose that the $i$th item we collect is chosen at random from the set of possibilities and is independent of the previous choices. Let $\tau_k^n = \inf\{m : |\{X_1, \ldots, X_m\}| = k\}$ be the first time we have $k$ different items. In this problem, we are interested in the asymptotic behavior of $T_n = \tau_n^n$, the time to collect a complete set. It is easy to see that $\tau_1^n = 1$. To make later formulas work out nicely, we will set $\tau_0^n = 0$. For $1 \le k \le n$, $X_{n,k} \equiv \tau_k^n - \tau_{k-1}^n$ represents the time to get a choice different from our first $k - 1$, so $X_{n,k}$ has a geometric distribution with parameter $1 - (k - 1)/n$ and is independent of the earlier waiting times $X_{n,j}$, $1 \le j < k$. Example 3.5 tells us that if $X$ has a geometric distribution with parameter $p$ then $EX = 1/p$ and $\operatorname{var}(X) \le 1/p^2$. Using the linearity of expected value, bounds on $\sum_{m=1}^n 1/m$ in (1.5.1), and Theorem 1.5.1 we see that
$$ET_n = \sum_{k=1}^n \left(1 - \frac{k-1}{n}\right)^{-1} = n \sum_{m=1}^n m^{-1} \sim n \log n$$
$$\operatorname{var}(T_n) \le \sum_{k=1}^n \left(1 - \frac{k-1}{n}\right)^{-2} = n^2 \sum_{m=1}^n m^{-2} \le n^2 \sum_{m=1}^\infty m^{-2}$$
Taking $b_n = n \log n$ and using Theorem 1.5.4, it follows that
$$\frac{T_n - n \sum_{m=1}^n m^{-1}}{n \log n} \to 0 \quad \text{in probability}$$
and hence $T_n/(n \log n) \to 1$ in probability.
For a concrete example, take n = 365, i.e., we are interested in the number of
people we need to meet until we have seen someone with every birthday. In this case
the limit theorem says it will take about 365 log 365 = 2153.46 tries to get a complete
set. Note that the number of trials is 5.89 times the number of birthdays.
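To see how the limit compares with the exact mean at $n = 365$, one can evaluate $ET_n = n \sum_{m=1}^n 1/m$ directly and compare it with a simulation. The sketch below is an illustration only: the function names, the seed, and the number of repetitions are arbitrary assumptions, not anything taken from the text.

```python
import math
import random

def expected_collection_time(n):
    """Exact E T_n = n * sum_{m=1}^n 1/m, from the display in Example 1.5.3."""
    return n * sum(1.0 / m for m in range(1, n + 1))

def simulate_collection_time(n, rng):
    """Draw uniformly from n item types until every type has been seen."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 365
exact = expected_collection_time(n)
average = sum(simulate_collection_time(n, rng) for _ in range(200)) / 200
ratio = exact / (n * math.log(n))
print(round(exact, 2), round(average, 2), round(ratio, 3))
```

The exact mean lies between $n \log n$ and $n(1 + \log n)$, as (1.5.1) requires, so at $n = 365$ the ratio $ET_n/(n \log n)$ is still noticeably above 1; the convergence in the limit theorem is slow because the correction is of order $1/\log n$.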
Example 1.5.4. Random permutations. Let $\Omega_n$ consist of the $n!$ permutations (i.e., one-to-one mappings from $\{1, \ldots, n\}$ onto $\{1, \ldots, n\}$) and make this into a probability space by assuming all the permutations are equally likely. This application of the weak law concerns the cycle structure of a random permutation $\pi$, so we begin by describing the decomposition of a permutation into cycles. Consider the sequence $1, \pi(1), \pi(\pi(1)), \ldots$ Eventually, $\pi^k(1) = 1$. When it does, we say the first cycle is completed and has length $k$. To start the second cycle, we pick the smallest integer $i$ not in the first cycle and look at $i, \pi(i), \pi(\pi(i)), \ldots$ until we come back to $i$. We repeat the construction until all the elements are accounted for. For example, if the permutation is
$$\begin{array}{c|ccccccccc}
i & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
\pi(i) & 3 & 9 & 6 & 8 & 2 & 1 & 5 & 4 & 7
\end{array}$$
then the cycle decomposition is (136) (2975) (48).
Let $X_{n,k} = 1$ if a right parenthesis occurs after the $k$th number in the decomposition, $X_{n,k} = 0$ otherwise and let $S_n = X_{n,1} + \ldots + X_{n,n}$ = the number of cycles. (In the example, $X_{9,3} = X_{9,7} = X_{9,9} = 1$, and the other $X_{9,m} = 0$.) I claim that
Lemma 1.5.5. $X_{n,1}, \ldots, X_{n,n}$ are independent and $P(X_{n,j} = 1) = \frac{1}{n-j+1}$.
Intuitively, this is true since, independent of what has happened so far, there are $n - j + 1$ values that have not appeared in the range, and only one of them will complete the cycle.
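The cycle decomposition and the indicators $X_{n,k}$ described above can be computed mechanically. The following sketch reproduces the nine-element example; the function names and the dictionary representation of $\pi$ are choices made here for illustration.

```python
def cycle_decomposition(pi):
    """Cycles of a permutation given as a dict {i: pi(i)} on {1, ..., n}.

    Each cycle starts from the smallest element not yet used, as in the text.
    """
    seen = set()
    cycles = []
    for start in sorted(pi):
        if start in seen:
            continue
        cycle = [start]
        seen.add(start)
        nxt = pi[start]
        while nxt != start:
            cycle.append(nxt)
            seen.add(nxt)
            nxt = pi[nxt]
        cycles.append(cycle)
    return cycles

def cycle_end_indicators(cycles):
    """X_{n,k} = 1 if a right parenthesis follows the k-th number written."""
    indicators = []
    for cycle in cycles:
        indicators.extend([0] * (len(cycle) - 1) + [1])
    return indicators

# The permutation from the example above.
pi = {1: 3, 2: 9, 3: 6, 4: 8, 5: 2, 6: 1, 7: 5, 8: 4, 9: 7}
cycles = cycle_decomposition(pi)
x = cycle_end_indicators(cycles)
print(cycles)  # [[1, 3, 6], [2, 9, 7, 5], [4, 8]]
print(x)       # [0, 0, 1, 0, 0, 0, 1, 0, 1]
```

The sum of the indicators is the number of cycles $S_n$, here 3.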
Proof. To prove this, it is useful to generate the permutation in a special way. Let $i_1 = 1$. Pick $j_1$ at random from $\{1, \ldots, n\}$ and let $\pi(i_1) = j_1$. If $j_1 \ne 1$, let $i_2 = j_1$. If $j_1 = 1$, let $i_2 = 2$. In either case, pick $j_2$ at random from $\{1, \ldots, n\} - \{j_1\}$. In general, if $i_1, j_1, \ldots, i_{k-1}, j_{k-1}$ have been selected and we have set $\pi(i_\ell) = j_\ell$ for $1 \le \ell < k$, then (a) if $j_{k-1} \in \{i_1, \ldots, i_{k-1}\}$ so a cycle has just been completed, we let $i_k = \inf(\{1, \ldots, n\} - \{i_1, \ldots, i_{k-1}\})$ and (b) if $j_{k-1} \notin \{i_1, \ldots, i_{k-1}\}$ we let $i_k = j_{k-1}$. In either case we pick $j_k$ at random from $\{1, \ldots, n\} - \{j_1, \ldots, j_{k-1}\}$ and let $\pi(i_k) = j_k$.
The construction above is tedious to write out, or to read, but now I can claim with a clear conscience that $X_{n,1}, \ldots, X_{n,n}$ are independent and $P(X_{n,k} = 1) = 1/(n - k + 1)$ since when we pick $j_k$ there are $n - k + 1$ values in $\{1, \ldots, n\} - \{j_1, \ldots, j_{k-1}\}$ and only one of them will complete the cycle.
To check the conditions of Theorem 1.5.4, now note
$$ES_n = 1/n + 1/(n - 1) + \cdots + 1/2 + 1$$
$$\operatorname{var}(S_n) = \sum_{k=1}^n \operatorname{var}(X_{n,k}) \le \sum_{k=1}^n E(X_{n,k}^2) = \sum_{k=1}^n E(X_{n,k}) = ES_n$$
where the results on the second line follow from Theorem 1.5.1, the fact that $\operatorname{var}(Y) \le EY^2$, and $X_{n,k}^2 = X_{n,k}$. Now $ES_n \sim \log n$, so if $b_n = (\log n)^{.5+\epsilon}$ with $\epsilon > 0$, the conditions of Theorem 1.5.4 are satisfied and it follows that
$$\frac{S_n - \sum_{m=1}^n m^{-1}}{(\log n)^{.5+\epsilon}} \to 0 \quad \text{in probability} \tag{$*$}$$
Taking $\epsilon = 0.5$ we have that $S_n/\log n \to 1$ in probability, but $(*)$ says more. We will see in Example 4.6 of Chapter 2 that $(*)$ is false if $\epsilon = 0$.
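The formula $ES_n = \sum_{m=1}^n 1/m$ is easy to compare with a simulation of uniformly random permutations. This is a rough sketch under assumed parameters: $n = 2000$, 300 repetitions, and the seed are arbitrary, and the permutations are stored 0-based as Python lists rather than in the notation of the text.

```python
import random

def count_cycles(perm):
    """Number of cycles of a permutation stored as a list, perm[i] = image of i (0-based)."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            i = start
            while not seen[i]:
                seen[i] = True
                i = perm[i]
    return cycles

def expected_cycles(n):
    """E S_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / m for m in range(1, n + 1))

rng = random.Random(1)  # arbitrary fixed seed
n, trials = 2000, 300
total = 0
for _ in range(trials):
    perm = list(range(n))
    rng.shuffle(perm)  # uniform over the n! permutations
    total += count_cycles(perm)
average = total / trials
print(round(average, 3), round(expected_cycles(n), 3))
```

Since $\operatorname{var}(S_n) \le ES_n \approx 8.18$ for $n = 2000$, the average over 300 permutations should sit within a few tenths of the harmonic sum.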
Example 1.5.5. An occupancy problem. Suppose we put $r$ balls at random in $n$ boxes, i.e., all $n^r$ assignments of balls to boxes have equal probability. Let $A_i$ be the event that the $i$th box is empty and $N_n$ = the number of empty boxes. It is easy to see that
$$P(A_i) = (1 - 1/n)^r \quad \text{and} \quad EN_n = n(1 - 1/n)^r$$
A little calculus (take logarithms) shows that if $r/n \to c$, $EN_n/n \to e^{-c}$. (For a proof, see (1.3) in Chapter 2.) To compute the variance of $N_n$, we observe that
$$EN_n^2 = E\left( \sum_{m=1}^n 1_{A_m} \right)^2 = \sum_{1 \le k,m \le n} P(A_k \cap A_m)$$
$$\begin{aligned}
\operatorname{var}(N_n) = EN_n^2 - (EN_n)^2 &= \sum_{1 \le k,m \le n} P(A_k \cap A_m) - P(A_k)P(A_m) \\
&= n(n - 1)\{(1 - 2/n)^r - (1 - 1/n)^{2r}\} + n\{(1 - 1/n)^r - (1 - 1/n)^{2r}\}
\end{aligned}$$
The first term comes from $k \ne m$ and the second from $k = m$. Since $(1 - 2/n)^r \to e^{-2c}$ and $(1 - 1/n)^r \to e^{-c}$, it follows easily from the last formula that $\operatorname{var}(N_n/n) = \operatorname{var}(N_n)/n^2 \to 0$. Taking $b_n = n$ in Theorem 1.5.4 now we have
$$N_n/n \to e^{-c} \quad \text{in probability}$$

1.5.3 Truncation

To truncate a random variable $X$ at level $M$ means to consider
$$\bar{X} = X 1_{(|X| \le M)} = \begin{cases} X & \text{if } |X| \le M \\ 0 & \text{if } |X| > M \end{cases}$$
To extend the weak law to random variables without a finite second moment, we will truncate and then use Chebyshev's inequality. We begin with a very general but also very useful result. Its proof is easy because we have assumed what we need for the proof. Later we will have to work a little to verify the assumptions in special cases, but the general result serves to identify the essential ingredients in the proof.
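Theorem 1.5.6. Weak law for triangular arrays. For each $n$ let $X_{n,k}$, $1 \le k \le n$, be independent. Let $b_n > 0$ with $b_n \to \infty$, and let $\bar{X}_{n,k} = X_{n,k} 1_{(|X_{n,k}| \le b_n)}$. Suppose that as $n \to \infty$
(i) $\sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0$, and
(ii) $b_n^{-2} \sum_{k=1}^n E\bar{X}_{n,k}^2 \to 0$.
If we let $S_n = X_{n,1} + \ldots + X_{n,n}$ and put $a_n = \sum_{k=1}^n E\bar{X}_{n,k}$ then
$$(S_n - a_n)/b_n \to 0 \quad \text{in probability}$$
Proof. Let $\bar{S}_n = \bar{X}_{n,1} + \cdots + \bar{X}_{n,n}$. Clearly,
$$P\left( \left| \frac{S_n - a_n}{b_n} \right| > \epsilon \right) \le P(S_n \ne \bar{S}_n) + P\left( \left| \frac{\bar{S}_n - a_n}{b_n} \right| > \epsilon \right)$$
To estimate the first term, we note that
$$P(S_n \ne \bar{S}_n) \le P\left( \cup_{k=1}^n \{\bar{X}_{n,k} \ne X_{n,k}\} \right) \le \sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0$$
by (i). For the second term, we note that Chebyshev's inequality, $a_n = E\bar{S}_n$, Theorem 1.5.1, and $\operatorname{var}(X) \le EX^2$ imply
$$\begin{aligned}
P\left( \left| \frac{\bar{S}_n - a_n}{b_n} \right| > \epsilon \right) &\le \epsilon^{-2} E\left| \frac{\bar{S}_n - a_n}{b_n} \right|^2 = \epsilon^{-2} b_n^{-2} \operatorname{var}(\bar{S}_n) \\
&= (b_n \epsilon)^{-2} \sum_{k=1}^n \operatorname{var}(\bar{X}_{n,k}) \le (b_n \epsilon)^{-2} \sum_{k=1}^n E(\bar{X}_{n,k})^2 \to 0
\end{aligned}$$
by (ii), and the proof is complete.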
From Theorem 1.5.6, we get the following result for a single sequence.
Theorem 1.5.7. Weak law of large numbers. Let $X_1, X_2, \ldots$ be i.i.d. with
$$xP(|X_i| > x) \to 0 \quad \text{as } x \to \infty$$
Let $S_n = X_1 + \cdots + X_n$ and let $\mu_n = E(X_1 1_{(|X_1| \le n)})$. Then $S_n/n - \mu_n \to 0$ in probability.
Remark. The assumption in the theorem is necessary for the existence of constants $a_n$ so that $S_n/n - a_n \to 0$. See Feller, Vol. II (1971) p. 234–236 for a proof.
Proof. We will apply Theorem 1.5.6 with $X_{n,k} = X_k$ and $b_n = n$. To check (i), we note
$$\sum_{k=1}^n P(|X_{n,k}| > n) = nP(|X_i| > n) \to 0$$
by assumption. To check (ii), we need to show $n^{-2} \cdot nE\bar{X}_{n,1}^2 \to 0$. To do this, we need the following result, which will be useful several times below.
Lemma 1.5.8. If $Y \ge 0$ and $p > 0$ then $E(Y^p) = \int_0^\infty p y^{p-1} P(Y > y)\, dy$.
Proof. Using the definition of expected value, Fubini's theorem (for nonnegative random variables), and then calculating the resulting integrals gives
$$\begin{aligned}
\int_0^\infty p y^{p-1} P(Y > y)\, dy &= \int_0^\infty \int_\Omega p y^{p-1} 1_{(Y > y)}\, dP\, dy \\
&= \int_\Omega \int_0^\infty p y^{p-1} 1_{(Y > y)}\, dy\, dP \\
&= \int_\Omega \int_0^Y p y^{p-1}\, dy\, dP = \int_\Omega Y^p\, dP = EY^p
\end{aligned}$$
which is the desired result.
Returning to the proof of Theorem 1.5.7, we observe that Lemma 1.5.8 and the fact that $\bar{X}_{n,1} = X_1 1_{(|X_1| \le n)}$ imply
$$E(\bar{X}_{n,1}^2) = \int_0^\infty 2yP(|\bar{X}_{n,1}| > y)\, dy \le \int_0^n 2yP(|X_1| > y)\, dy$$
since $P(|\bar{X}_{n,1}| > y) = 0$ for $y \ge n$ and $= P(|X_1| > y) - P(|X_1| > n)$ for $y \le n$. We claim that $yP(|X_1| > y) \to 0$ implies
$$E(\bar{X}_{n,1}^2)/n \le \frac{1}{n} \int_0^n 2yP(|X_1| > y)\, dy \to 0$$
as $n \to \infty$. Intuitively, this holds since the right-hand side is the average of $g(y) = 2yP(|X_1| > y)$ over $[0, n]$ and $g(y) \to 0$ as $y \to \infty$. To spell out the details, note that $0 \le g(y) \le 2y$ and $g(y) \to 0$ as $y \to \infty$, so we must have $M = \sup g(y) < \infty$. If we let $\epsilon_K = \sup\{g(y) : y > K\}$ then by considering the integrals over $[0, K]$ and $[K, n]$ separately
$$\int_0^n 2yP(|X_1| > y)\, dy \le KM + (n - K)\epsilon_K$$
Dividing by $n$ and letting $n \to \infty$, we have
$$\limsup_{n\to\infty} \frac{1}{n} \int_0^n 2yP(|X_1| > y)\, dy \le \epsilon_K$$
Since $K$ is arbitrary and $\epsilon_K \to 0$ as $K \to \infty$, the desired result follows.
Finally, we have the weak law in its most familiar form.
Theorem 1.5.9. Let $X_1, X_2, \ldots$ be i.i.d. with $E|X_i| < \infty$. Let $S_n = X_1 + \cdots + X_n$ and let $\mu = EX_1$. Then $S_n/n \to \mu$ in probability.
Remark. Applying Lemma 1.5.8 with $p = 1 - \epsilon$ and $\epsilon > 0$, we see that $xP(|X_1| > x) \to 0$ implies $E|X_1|^{1-\epsilon} < \infty$, so the assumption in Theorem 1.5.7 is not much weaker than finite mean.
Proof. Two applications of the dominated convergence theorem imply
$$xP(|X_1| > x) \le E(|X_1| 1_{(|X_1| > x)}) \to 0 \quad \text{as } x \to \infty$$
$$\mu_n = E(X_1 1_{(|X_1| \le n)}) \to E(X_1) = \mu \quad \text{as } n \to \infty$$
Using Theorem 1.5.7, we see that if $\epsilon > 0$ then $P(|S_n/n - \mu_n| > \epsilon/2) \to 0$. Since $\mu_n \to \mu$, it follows that $P(|S_n/n - \mu| > \epsilon) \to 0$.
Example 1.5.6. For an example where the weak law does not hold, suppose $X_1, X_2, \ldots$ are independent and have a Cauchy distribution:
$$P(X_i \le x) = \int_{-\infty}^x \frac{dt}{\pi(1 + t^2)}$$
As $x \to \infty$,
$$P(|X_1| > x) = 2 \int_x^\infty \frac{dt}{\pi(1 + t^2)} \sim \frac{2}{\pi} \int_x^\infty t^{-2}\, dt = \frac{2}{\pi} x^{-1}$$
From the necessity of the condition above, we can conclude that there is no sequence of constants $\mu_n$ so that $S_n/n - \mu_n \to 0$. We will see later that $S_n/n$ always has the same distribution as $X_1$. (See Exercise 2.3.8.)
As the next example shows, we can have a weak law in some situations in which $E|X| = \infty$.
Example 1.5.7. The "St. Petersburg paradox." Let $X_1, X_2, \ldots$ be independent random variables with
$$P(X_i = 2^j) = 2^{-j} \quad \text{for } j \ge 1$$
In words, you win $2^j$ dollars if it takes $j$ tosses to get a heads. The paradox here is that $EX_1 = \infty$, but you clearly wouldn't pay an infinite amount to play this game. An application of Theorem 1.5.6 will tell us how much we should pay to play the game $n$ times.
In this example, $X_{n,k} = X_k$. To apply Theorem 1.5.6, we have to pick $b_n$. To do this, we are guided by the principle that in checking (ii) we want to take $b_n$ as small as we can and have (i) hold. With this in mind, we observe that if $m$ is an integer
$$P(X_1 \ge 2^m) = \sum_{j=m}^\infty 2^{-j} = 2^{-m+1}$$
Let $m(n) = \log_2 n + K(n)$ where $K(n) \to \infty$ and is chosen so that $m(n)$ is an integer (and hence the displayed formula is valid). Letting $b_n = 2^{m(n)}$, we have
$$nP(X_1 \ge b_n) = n2^{-m(n)+1} = 2^{-K(n)+1} \to 0$$
proving (i). To check (ii), we observe that if $\bar{X}_{n,k} = X_k 1_{(|X_k| \le b_n)}$ then
$$E\bar{X}_{n,k}^2 = \sum_{j=1}^{m(n)} 2^{2j} \cdot 2^{-j} \le 2^{m(n)} \sum_{k=0}^\infty 2^{-k} = 2b_n$$
So the expression in (ii) is smaller than $2n/b_n$, which $\to 0$ since
$$b_n = 2^{m(n)} = n2^{K(n)} \quad \text{and} \quad K(n) \to \infty$$
The last two steps are to evaluate $a_n$ and to apply Theorem 1.5.6.
$$E\bar{X}_{n,k} = \sum_{j=1}^{m(n)} 2^j 2^{-j} = m(n)$$
so $a_n = nm(n)$. We have $m(n) = \log n + K(n)$ (here and until the end of the example all logs are base 2), so if we pick $K(n)/\log n \to 0$ then $a_n/n \log n \to 1$ as $n \to \infty$. Using Theorem 1.5.6 now, we have
$$\frac{S_n - a_n}{n2^{K(n)}} \to 0 \quad \text{in probability}$$
If we suppose that $K(n) \le \log \log n$ for large $n$ then the last conclusion holds with the denominator replaced by $n \log n$, and it follows that $S_n/(n \log n) \to 1$ in probability.
Returning to our original question, we see that a fair price for playing $n$ times is $\log_2 n$ dollars per play. When $n = 1024$, this is 10 dollars per play. Nicolas Bernoulli wrote in 1713, "There ought not to exist any even halfway sensible person who would not sell the right of playing the game for 40 ducats (per play)." If the wager were 1 ducat, one would need $2^{40} \approx 10^{12}$ plays to start to break even.
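The quantities in Example 1.5.7 are simple enough to compute exactly. The sketch below evaluates the truncated moments and the two conditions of Theorem 1.5.6 for $n = 1024$; the choice $K(n) = 4$, hence $m(n) = 14$, is a hypothetical value picked only to make the numbers concrete, and the function names are introduced here.

```python
import math

def truncated_mean(m):
    """E[X 1(X <= 2^m)] = sum_{j=1}^m 2^j * 2^(-j), which equals m."""
    return sum(2**j * 2.0 ** (-j) for j in range(1, m + 1))

def truncated_second_moment(m):
    """E[X^2 1(X <= 2^m)] = sum_{j=1}^m 2^(2j) * 2^(-j) = 2^(m+1) - 2."""
    return sum(2 ** (2 * j) * 2.0 ** (-j) for j in range(1, m + 1))

n = 1024
price_per_play = math.log2(n)  # fair price per play from the text

m = 14                         # hypothetical m(n) = log2(n) + K(n) with K(n) = 4
b_n = 2**m
a_n = n * truncated_mean(m)    # centering constant a_n = n * m(n)
condition_i = n * 2.0 ** (-m + 1)                       # n P(X_1 >= b_n)
condition_ii = n * truncated_second_moment(m) / b_n**2  # b_n^{-2} * n * E Xbar^2

print(price_per_play, a_n, condition_i, condition_ii)
```

With these values both conditions equal about $1/8$, and they shrink further as $K(n)$ is allowed to grow, which is the trade-off the example describes between taking $b_n$ small for (ii) and large for (i).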
Exercises
1.5.1. Let $X_1, X_2, \ldots$ be uncorrelated random variables with $EX_i = \mu_i$ and $\operatorname{var}(X_i)/i \to 0$ as $i \to \infty$. Let $S_n = X_1 + \ldots + X_n$ and $\nu_n = ES_n/n$. Show that as $n \to \infty$, $S_n/n - \nu_n \to 0$ in $L^2$ and in probability.
1.5.2. The $L^2$ weak law generalizes immediately to certain dependent sequences. Suppose $EX_n = 0$ and $EX_n X_m \le r(n - m)$ for $m \le n$ (no absolute value on the left-hand side!) with $r(k) \to 0$ as $k \to \infty$. Show that $(X_1 + \ldots + X_n)/n \to 0$ in probability.
1.5.3. Monte Carlo integration. (i) Let $f$ be a measurable function on $[0, 1]$ with $\int_0^1 |f(x)|\, dx < \infty$. Let $U_1, U_2, \ldots$ be independent and uniformly distributed on $[0, 1]$, and let
$$I_n = n^{-1}(f(U_1) + \ldots + f(U_n))$$
Show that $I_n \to I \equiv \int_0^1 f\, dx$ in probability. (ii) Suppose $\int_0^1 |f(x)|^2\, dx < \infty$. Use Chebyshev's inequality to estimate $P(|I_n - I| > a/n^{1/2})$.
1.5.4. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i = (-1)^k k) = C/k^2 \log k$ for $k \ge 2$ where $C$ is chosen to make the sum of the probabilities = 1. Show that $E|X_i| = \infty$, but there is a finite constant $\mu$ so that $S_n/n \to \mu$ in probability.
1.5.5. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i > x) = e/x \log x$ for $x \ge e$. Show that $E|X_i| = \infty$, but there is a sequence of constants $\mu_n \to \infty$ so that $S_n/n - \mu_n \to 0$ in probability.
1.5.6. (i) Show that if $X \ge 0$ is integer valued $EX = \sum_{n \ge 1} P(X \ge n)$. (ii) Find a similar expression for $EX^2$.
1.5.7. Generalize Lemma 1.5.8 to conclude that if $H(x) = \int_{(-\infty,x]} h(y)\, dy$ with $h(y) \ge 0$, then
$$E\, H(X) = \int_{-\infty}^\infty h(y) P(X \ge y)\, dy$$
An important special case is $H(x) = \exp(\theta x)$ with $\theta > 0$.
1.5.8. An unfair "fair game." Let $p_k = 1/2^k k(k + 1)$, $k = 1, 2, \ldots$ and $p_0 = 1 - \sum_{k \ge 1} p_k$.
$$\sum_{k=1}^\infty 2^k p_k = \left(1 - \frac{1}{2}\right) + \left(\frac{1}{2} - \frac{1}{3}\right) + \ldots = 1$$
so if we let $X_1, X_2, \ldots$ be i.i.d. with $P(X_n = -1) = p_0$ and
$$P(X_n = 2^k - 1) = p_k \quad \text{for } k \ge 1$$
then $EX_n = 0$. Let $S_n = X_1 + \ldots + X_n$. Use Theorem 1.5.6 with $b_n = 2^{m(n)}$ where $m(n) = \min\{m : 2^{-m} m^{-3/2} \le n^{-1}\}$ to conclude that
$$S_n/(n/\log_2 n) \to -1 \quad \text{in probability}$$
1.5.9. Weak law for positive variables. Suppose $X_1, X_2, \ldots$ are i.i.d., $P(0 \le X_i < \infty) = 1$ and $P(X_i > x) > 0$ for all $x$. Let $\mu(s) = \int_0^s x\, dF(x)$ and $\nu(s) = \mu(s)/s(1 - F(s))$. It is known that there exist constants $a_n$ so that $S_n/a_n \to 1$ in probability, if and only if $\nu(s) \to \infty$ as $s \to \infty$. Pick $b_n \ge 1$ so that $n\mu(b_n) = b_n$ (this works for large $n$), and use Theorem 1.5.6 to prove that the condition is sufficient.

1.6 Borel-Cantelli Lemmas

If $A_n$ is a sequence of subsets of $\Omega$, we let
$$\limsup A_n = \lim_{m\to\infty} \cup_{n=m}^\infty A_n = \{\omega \text{ that are in infinitely many } A_n\}$$
(the limit exists since the sequence is decreasing in $m$) and let
$$\liminf A_n = \lim_{m\to\infty} \cap_{n=m}^\infty A_n = \{\omega \text{ that are in all but finitely many } A_n\}$$
(the limit exists since the sequence is increasing in $m$). The names lim sup and lim inf can be explained by noting that
$$\limsup_{n\to\infty} 1_{A_n} = 1_{(\limsup A_n)} \qquad \liminf_{n\to\infty} 1_{A_n} = 1_{(\liminf A_n)}$$
It is common to write $\limsup A_n = \{\omega : \omega \in A_n \text{ i.o.}\}$, where i.o. stands for infinitely often. An example which illustrates the use of this notation is: "$X_n \to 0$ a.s. if and only if for all $\epsilon > 0$, $P(|X_n| > \epsilon \text{ i.o.}) = 0$." The reader will see many other examples below. The next result should be familiar from measure theory even though its name may not be.
Theorem 1.6.1. Borel-Cantelli lemma. If $\sum_{n=1}^\infty P(A_n) < \infty$ then $P(A_n \text{ i.o.}) = 0$.
Proof. Let $N = \sum_k 1_{A_k}$ be the number of events that occur. Fubini's theorem implies $EN = \sum_k P(A_k) < \infty$, so we must have $N < \infty$ a.s.
The next result is a typical application of the Borel-Cantelli lemma.
Theorem 1.6.2. $X_n \to X$ in probability if and only if for every subsequence $X_{n(m)}$ there is a further subsequence $X_{n(m_k)}$ that converges almost surely to $X$.
Proof. Let $\epsilon_k$ be a sequence of positive numbers that $\downarrow 0$. For each $k$, there is an $n(m_k) > n(m_{k-1})$ so that $P(|X_{n(m_k)} - X| > \epsilon_k) \le 2^{-k}$. Since
$$\sum_{k=1}^\infty P(|X_{n(m_k)} - X| > \epsilon_k) < \infty$$
the Borel-Cantelli lemma implies $P(|X_{n(m_k)} - X| > \epsilon_k \text{ i.o.}) = 0$, i.e., $X_{n(m_k)} \to X$ a.s. To prove the second conclusion, we note that if for every subsequence $X_{n(m)}$ there is a further subsequence $X_{n(m_k)}$ that converges almost surely to $X$ then we can apply the next lemma to the sequence of numbers $y_n = P(|X_n - X| > \delta)$ for any $\delta > 0$ to get the desired result.
Theorem 1.6.3. Let $y_n$ be a sequence of elements of a topological space. If every subsequence $y_{n(m)}$ has a further subsequence $y_{n(m_k)}$ that converges to $y$ then $y_n \to y$.
Proof. If $y_n \not\to y$ then there is an open set $G$ containing $y$ and a subsequence $y_{n(m)}$ with $y_{n(m)} \notin G$ for all $m$, but clearly no subsequence of $y_{n(m)}$ converges to $y$.
Remark. Since there is a sequence of random variables that converges in probability but not a.s. (for an example, see Exercises 1.6.13 or 1.6.14), it follows from Theorem 1.6.3 that a.s. convergence does not come from a metric, or even from a topology. Exercise 1.6.4 will give a metric for convergence in probability, and Exercise 1.6.5 will show that the space of random variables is a complete space under this metric.
Theorem 1.6.2 allows us to upgrade convergence in probability to convergence almost surely. An example of the usefulness of this is
Theorem 1.6.4. If $f$ is continuous and $X_n \to X$ in probability then $f(X_n) \to f(X)$ in probability. If, in addition, $f$ is bounded then $Ef(X_n) \to Ef(X)$.
Proof. If $X_{n(m)}$ is a subsequence then Theorem 1.6.2 implies there is a further subsequence $X_{n(m_k)} \to X$ almost surely. Since $f$ is continuous, Exercise 1.2.3 implies $f(X_{n(m_k)}) \to f(X)$ almost surely and Theorem 1.6.2 implies $f(X_n) \to f(X)$ in probability. If $f$ is bounded then the bounded convergence theorem implies $Ef(X_{n(m_k)}) \to Ef(X)$, and applying Theorem 1.6.3 to $y_n = Ef(X_n)$ gives the desired result.
Exercise 1.6.1. Prove the first result in Theorem 1.6.4 directly from the definition.
Exercise 1.6.2. Fatou's lemma. Suppose $X_n \ge 0$ and $X_n \to X$ in probability. Show that $\liminf_{n\to\infty} EX_n \ge EX$.
Exercise 1.6.3. Dominated convergence. Suppose $X_n \to X$ in probability and (a) $|X_n| \le Y$ with $EY < \infty$ or (b) there is a continuous function $g$ with $g(x) > 0$ for large $x$ with $|x|/g(x) \to 0$ as $|x| \to \infty$ so that $Eg(X_n) \le C < \infty$ for all $n$. Show that $EX_n \to EX$.
Exercise 1.6.4. Show (a) that $d(X, Y) = E(|X - Y|/(1 + |X - Y|))$ defines a metric on the set of random variables, i.e., (i) $d(X, Y) = 0$ if and only if $X = Y$ a.s., (ii) $d(X, Y) = d(Y, X)$, (iii) $d(X, Z) \le d(X, Y) + d(Y, Z)$ and (b) that $d(X_n, X) \to 0$ as $n \to \infty$ if and only if $X_n \to X$ in probability.
Exercise 1.6.5. Show that random variables are a complete space under the metric defined in the previous exercise, i.e., if $d(X_m, X_n) \to 0$ whenever $m, n \to \infty$ then there is a r.v. $X_\infty$ so that $X_n \to X_\infty$ in probability.
As our second application of the Borel-Cantelli lemma, we get our first strong law of large numbers:
Theorem 1.6.5. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = \mu$ and $EX_i^4 < \infty$. If $S_n = X_1 + \cdots + X_n$ then $S_n/n \to \mu$ a.s.
Proof. By letting $X_i' = X_i - \mu$, we can suppose without loss of generality that $\mu = 0$. Now
$$ES_n^4 = E\left( \sum_{i=1}^n X_i \right)^4 = E \sum_{1 \le i,j,k,\ell \le n} X_i X_j X_k X_\ell$$
Terms in the sum of the form $E(X_i^3 X_j)$, $E(X_i^2 X_j X_k)$, and $E(X_i X_j X_k X_\ell)$ are 0 (if $i, j, k, \ell$ are distinct) since the expectation of the product is the product of the expectations, and in each case one of the terms has expectation 0. The only terms that do not vanish are those of the form $EX_i^4$ and $EX_i^2 X_j^2 = (EX_i^2)^2$. There are $n$ and $3n(n - 1)$ of these terms, respectively. (In the second case we can pick the two indices in $n(n - 1)/2$ ways, and with the indices fixed, the term can arise in a total of 6 ways.) The last observation implies
$$ES_n^4 = nEX_1^4 + 3(n^2 - n)(EX_1^2)^2 \le Cn^2$$
where $C < \infty$. Chebyshev's inequality gives us
$$P(|S_n| > n\epsilon) \le E(S_n^4)/(n\epsilon)^4 \le C/(n^2 \epsilon^4)$$
Summing on $n$ and using the Borel-Cantelli lemma gives $P(|S_n| > n\epsilon \text{ i.o.}) = 0$. Since $\epsilon$ is arbitrary, the proof is complete.
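The counts $n$ and $3n(n - 1)$ used in the proof can be confirmed by enumerating index tuples $(i, j, k, \ell)$ for small $n$. The sketch below is a brute-force check written for illustration; the function name is introduced here and the range of $n$ tested is arbitrary.

```python
from itertools import product

def count_nonvanishing_terms(n):
    """Count index tuples (i, j, k, l) in {0..n-1}^4 by pattern.

    Returns (tuples with all four indices equal,
             tuples made of two distinct values, each appearing twice).
    """
    all_equal = 0
    two_pairs = 0
    for idx in product(range(n), repeat=4):
        counts = sorted(idx.count(v) for v in set(idx))
        if counts == [4]:
            all_equal += 1
        elif counts == [2, 2]:
            two_pairs += 1
    return all_equal, two_pairs

for n in range(2, 7):
    print(n, count_nonvanishing_terms(n), (n, 3 * n * (n - 1)))
```

Every other index pattern contains some index exactly once, which is why its expectation vanishes when $\mu = 0$.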
The converse of the Borel-Cantelli lemma is trivially false.
Example 1.6.1. Let $\Omega = (0, 1)$, $\mathcal{F}$ = Borel sets, $P$ = Lebesgue measure. If $A_n = (0, a_n)$ where $a_n \to 0$ as $n \to \infty$ then $\limsup A_n = \emptyset$, but if $a_n \ge 1/n$, we have $\sum a_n = \infty$.
The example just given suggests that for general sets we cannot say much more than the next result.
Exercise 1.6.6. Prove that $P(\limsup A_n) \ge \limsup P(A_n)$ and
$$P(\liminf A_n) \le \liminf P(A_n)$$
For independent events, however, the necessary condition for $P(\limsup A_n) > 0$ is sufficient for $P(\limsup A_n) = 1$.
Theorem 1.6.6. The second Borel-Cantelli lemma. If the events $A_n$ are independent then $\sum P(A_n) = \infty$ implies $P(A_n \text{ i.o.}) = 1$.
Proof. Let $M < N < \infty$. Independence and $1 - x \le e^{-x}$ imply
$$\begin{aligned}
P\left( \cap_{n=M}^N A_n^c \right) = \prod_{n=M}^N (1 - P(A_n)) &\le \prod_{n=M}^N \exp(-P(A_n)) \\
&= \exp\left( -\sum_{n=M}^N P(A_n) \right) \to 0 \quad \text{as } N \to \infty
\end{aligned}$$
So $P(\cup_{n=M}^\infty A_n) = 1$ for all $M$, and since $\cup_{n=M}^\infty A_n \downarrow \limsup A_n$ it follows that $P(\limsup A_n) = 1$.
A typical application of the second Borel-Cantelli lemma is:
Theorem 1.6.7. If $X_1, X_2, \ldots$ are i.i.d. with $E|X_i| = \infty$, then $P(|X_n| \ge n \text{ i.o.}) = 1$. So if $S_n = X_1 + \cdots + X_n$ then $P(\lim S_n/n \text{ exists} \in (-\infty, \infty)) = 0$.
Proof. From Lemma 1.5.8, we get
$$E|X_1| = \int_0^\infty P(|X_1| > x)\, dx \le \sum_{n=0}^\infty P(|X_1| > n)$$
Since $E|X_1| = \infty$ and $X_1, X_2, \ldots$ are i.i.d., it follows from the second Borel-Cantelli lemma that $P(|X_n| \ge n \text{ i.o.}) = 1$. To prove the second claim, observe that
$$\frac{S_n}{n} - \frac{S_{n+1}}{n+1} = \frac{S_n}{n(n+1)} - \frac{X_{n+1}}{n+1}$$
and on $C \equiv \{\omega : \lim_{n\to\infty} S_n/n \text{ exists} \in (-\infty, \infty)\}$, $S_n/(n(n + 1)) \to 0$. So, on $C \cap \{\omega : |X_n| \ge n \text{ i.o.}\}$, we have
$$\left| \frac{S_n}{n} - \frac{S_{n+1}}{n+1} \right| > 2/3 \quad \text{i.o.}$$
contradicting the fact that $\omega \in C$. From the last observation, we conclude that
$$\{\omega : |X_n| \ge n \text{ i.o.}\} \cap C = \emptyset$$
and since $P(|X_n| \ge n \text{ i.o.}) = 1$, it follows that $P(C) = 0$.
Theorem 1.6.7 shows that $E|X_i| < \infty$ is necessary for the strong law of large numbers. The reader will have to wait until Theorem 1.7.1 to see that condition is also sufficient. The next result extends the second Borel-Cantelli lemma and sharpens its conclusion.
Theorem 1.6.8. If $A_1, A_2, \ldots$ are pairwise independent and $\sum_{n=1}^\infty P(A_n) = \infty$ then as $n \to \infty$
$$\sum_{m=1}^n 1_{A_m} \Big/ \sum_{m=1}^n P(A_m) \to 1 \quad \text{a.s.}$$
Proof. Let $X_m = 1_{A_m}$ and let $S_n = X_1 + \cdots + X_n$. Since the $A_m$ are pairwise independent, the $X_m$ are uncorrelated and hence Theorem 1.5.1 implies
$$\operatorname{var}(S_n) = \operatorname{var}(X_1) + \cdots + \operatorname{var}(X_n)$$
$\operatorname{var}(X_m) \le E(X_m^2) = E(X_m)$, since $X_m \in \{0, 1\}$, so $\operatorname{var}(S_n) \le E(S_n)$. Chebyshev's inequality implies
$$P(|S_n - ES_n| > \delta ES_n) \le \operatorname{var}(S_n)/(\delta ES_n)^2 \le 1/(\delta^2 ES_n) \to 0 \tag{$*$}$$
as $n \to \infty$. (Since we have assumed $ES_n \to \infty$.)
The last computation shows that $S_n/ES_n \to 1$ in probability. To get almost sure convergence, we have to take subsequences. Let $n_k = \inf\{n : ES_n \ge k^2\}$. Let $T_k = S_{n_k}$ and note that the definition and $EX_m \le 1$ imply $k^2 \le ET_k \le k^2 + 1$. Replacing $n$ by $n_k$ in $(*)$ and using $ET_k \ge k^2$ shows
$$P(|T_k - ET_k| > \delta ET_k) \le 1/(\delta^2 k^2)$$
So $\sum_{k=1}^\infty P(|T_k - ET_k| > \delta ET_k) < \infty$, and the Borel-Cantelli lemma implies $P(|T_k - ET_k| > \delta ET_k \text{ i.o.}) = 0$. Since $\delta$ is arbitrary, it follows that $T_k/ET_k \to 1$ a.s. To show $S_n/ES_n \to 1$ a.s., pick an $\omega$ so that $T_k(\omega)/ET_k \to 1$ and observe that if $n_k \le n < n_{k+1}$ then
$$\frac{T_k(\omega)}{ET_{k+1}} \le \frac{S_n(\omega)}{ES_n} \le \frac{T_{k+1}(\omega)}{ET_k}$$
To show that the terms at the left and right ends $\to 1$, we rewrite the last inequalities as
$$\frac{ET_k}{ET_{k+1}} \cdot \frac{T_k(\omega)}{ET_k} \le \frac{S_n(\omega)}{ES_n} \le \frac{T_{k+1}(\omega)}{ET_{k+1}} \cdot \frac{ET_{k+1}}{ET_k}$$
From this, we see it is enough to show $ET_{k+1}/ET_k \to 1$, but this follows from
$$k^2 \le ET_k \le ET_{k+1} \le (k + 1)^2 + 1$$
and the fact that $\{(k + 1)^2 + 1\}/k^2 = 1 + 2/k + 2/k^2 \to 1$.
The moral of the proof of Theorem 1.6.8 is that if you want to show that $X_n/c_n \to 1$ a.s. for sequences $c_n, X_n \ge 0$ that are increasing, it is enough to prove the result for a subsequence $n(k)$ that has $c_{n(k+1)}/c_{n(k)} \to 1$. For practice with this technique, try the following.
Exercise 1.6.7. Let $0 \le X_1 \le X_2 \ldots$ be random variables with $EX_n \sim an^\alpha$ with $a, \alpha > 0$, and $\operatorname{var}(X_n) \le Bn^\beta$ with $\beta < 2\alpha$. Show that $X_n/n^\alpha \to a$ a.s.
Exercise 1.6.8. Let $X_n$ be independent Poisson r.v.'s with $EX_n = \lambda_n$, and let $S_n = X_1 + \cdots + X_n$. Show that if $\sum \lambda_n = \infty$ then $S_n/ES_n \to 1$ a.s.
Example 1.6.2. Record values. Let $X_1, X_2, \ldots$ be a sequence of random variables and think of $X_k$ as the distance for an individual's $k$th high jump or shot-put toss so that $A_k = \{X_k > \sup_{j<k} X_j\}$ is the event that a record occurs at time $k$. Ignoring the fact that an athlete's performance may get better with more experience or that injuries may occur, we will suppose that $X_1, X_2, \ldots$ are i.i.d. with a distribution $F(x)$ that is continuous. Even though it may seem that the occurrence of a record at time $k$ will make it less likely that one will occur at time $k + 1$, we
Claim. The $A_k$ are independent with $P(A_k) = 1/k$.
To prove this, we start by observing that since F is continuous $P(X_j = X_k) = 0$ for
any $j \ne k$ (see Exercise 4.7), so we can let $Y_1^n > Y_2^n > \cdots > Y_n^n$ be the random
variables $X_1, \ldots, X_n$ put into decreasing order and define a random permutation of
$\{1, \ldots, n\}$ by $\pi_n(i) = j$ if $X_i = Y_j^n$, i.e., if the $i$th random variable has rank $j$. Since
the distribution of (X1 , . . . , Xn ) is not aﬀected by changing the order of the random
variables, it is easy to see:
(a) The permutation πn is uniformly distributed over the set of n! possibilities.
Proof of (a) This is “obvious” by symmetry, but if one wants to hear more, we can
argue as follows. Let πn be the permutation induced by (X1 , . . . , Xn ), and let σn be
a randomly chosen permutation of {1, . . . , n} independent of the X sequence. Then
we can say two things about the permutation induced by (Xσ(1) , . . . , Xσ(n) ): (i) it is
πn ◦ σn , and (ii) it has the same distribution as πn . The desired result follows now by
noting that if π is any permutation, π ◦ σn is uniform over the n! possibilities.
Once you believe (a), the rest is easy:
(b) P (An ) = P (πn (n) = 1) = 1/n.
(c) If $m < n$ and $i_{m+1}, \ldots, i_n$ are distinct elements of $\{1, \ldots, n\}$ then
\[ P(A_m \mid \pi_n(j) = i_j \text{ for } m + 1 \le j \le n) = 1/m \]
Intuitively, this is true since if we condition on the ranks of Xm+1 , . . . , Xn then this
determines the set of ranks available for X1 , . . . , Xm , but all possible orderings of the
ranks are equally likely and hence there is probability 1/m that the smallest rank will
end up at m.
Proof of (c) If we let σm be a randomly chosen permutation of {1, . . . , m} then (i)
πn ◦ σm has the same distribution as πn , and (ii) since the application of σm randomly
rearranges πn (1), . . . , πn (m) the desired result follows.
If we let $m_1 < m_2 < \ldots < m_k$ then it follows from (c) that
\[ P(A_{m_1} \mid A_{m_2} \cap \ldots \cap A_{m_k}) = P(A_{m_1}) \]
and the claim follows by induction.
Using Theorem 1.6.8 and the by now familiar fact that $\sum_{m=1}^n 1/m \sim \log n$, we
have
Theorem 1.6.9. If $R_n = \sum_{m=1}^n 1_{A_m}$ is the number of records at time $n$ then as
$n \to \infty$,
\[ R_n/\log n \to 1 \quad \text{a.s.} \]
The reader should note that the last result is independent of the distribution F (as
long as it is continuous).
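The claim $P(A_k) = 1/k$ and the resulting growth of the record count can be checked by simulation. The following is a minimal sketch, not part of the text: it assumes Uniform(0,1) observations (any continuous F would do), and the function names `count_records`, `mean_records`, and `harmonic` are hypothetical. It compares the average number of records with $\sum_{m=1}^n 1/m$.

```python
import math
import random


def count_records(n, rng):
    """Count the record times among n i.i.d. Uniform(0,1) draws."""
    best = -math.inf
    records = 0
    for _ in range(n):
        x = rng.random()
        if x > best:  # event A_k: X_k exceeds every earlier value
            best = x
            records += 1
    return records


def mean_records(n, trials, seed=0):
    """Average R_n over independent repetitions (an estimate of E R_n)."""
    rng = random.Random(seed)
    return sum(count_records(n, rng) for _ in range(trials)) / trials


def harmonic(n):
    """E R_n = sum_{m=1}^n 1/m, which is asymptotic to log n."""
    return sum(1.0 / m for m in range(1, n + 1))


if __name__ == "__main__":
    n = 10_000
    print("average R_n over 200 runs:", mean_records(n, 200))
    print("harmonic number          :", harmonic(n))
    print("log n                    :", math.log(n))
```

Because $R_n$ grows only logarithmically, the ratio $R_n/\log n$ approaches 1 slowly, so a single run at moderate $n$ can still be visibly off.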
Remark. Let X1 , X2 , . . . be i.i.d. with a distribution that is continuous. Let Yi be
the number of j ≤ i with Xj > Xi. It follows from (a) that the Yi are independent
random variables with P(Yi = j) = 1/i for 0 ≤ j ≤ i − 1.
Comic relief. Let X0 , X1 , . . . be i.i.d. and imagine they are the oﬀers you get for
a car you are going to sell. Let N = inf {n ≥ 1 : Xn > X0 }. Symmetry implies
P (N > n) ≥ 1/(n + 1). (When the distribution is continuous this probability is
exactly 1/(n + 1), but our distribution now is general and ties go to the ﬁrst person
who calls.) Using Exercise 5.6 now:
\[ EN = \sum_{n=0}^\infty P(N > n) \ge \sum_{n=0}^\infty \frac{1}{n+1} = \infty \]
so the expected time you have to wait until you get an offer better than the first one is
∞. To avoid lawsuits, let me hasten to add that I am not suggesting that you should
take the ﬁrst oﬀer you get!
Example 1.6.3. Head runs. Let $X_n$, $n \in \mathbb{Z}$, be i.i.d. with $P(X_n = 1) = P(X_n = -1) = 1/2$.
Let $\ell_n = \max\{m : X_{n-m+1} = \ldots = X_n = 1\}$ be the length of the run of
+1's at time $n$, and let $L_n = \max_{1 \le m \le n} \ell_m$ be the longest run at time $n$. We use a
two-sided sequence so that for all $n$, $P(\ell_n = k) = (1/2)^{k+1}$ for $k \ge 0$. Since $\ell_1 < \infty$,
the result we are going to prove
\[ L_n/\log_2 n \to 1 \quad \text{a.s.} \qquad (6.10) \]
is also true for a one-sided sequence. To prove (6.10), we begin by observing
\[ P(\ell_n \ge (1+\epsilon)\log_2 n) \le n^{-(1+\epsilon)} \]
for any $\epsilon > 0$, so it follows from the Borel-Cantelli lemma that $\ell_n \le (1+\epsilon)\log_2 n$ for
$n \ge N_\epsilon$. Since $\epsilon$ is arbitrary, it follows that
\[ \limsup_{n\to\infty} L_n/\log_2 n \le 1 \quad \text{a.s.} \]
To get a result in the other direction, we break the first $n$ trials into disjoint blocks of
length $[(1-\epsilon)\log_2 n] + 1$, on which the variables are all 1 with probability
$2^{-[(1-\epsilon)\log_2 n]-1} \ge n^{-(1-\epsilon)}/2$, to conclude that if $n$ is large enough so that
$[n/\{[(1-\epsilon)\log_2 n] + 1\}] \ge n/\log_2 n$
\[ P(L_n \le (1-\epsilon)\log_2 n) \le (1 - n^{-(1-\epsilon)}/2)^{n/\log_2 n} \le \exp(-n^{\epsilon}/2\log_2 n) \]
which is summable, so the Borel-Cantelli lemma implies
\[ \liminf_{n\to\infty} L_n/\log_2 n \ge 1 \quad \text{a.s.} \]
Exercise 1.6.9. Show that $\limsup_{n\to\infty} \ell_n/\log_2 n = 1$, $\liminf_{n\to\infty} \ell_n = 0$ a.s.
Exercises
1.6.10. If Xn is any sequence of random variables, there are constants cn → ∞ so
that Xn /cn → 0 a.s.
1.6.11. (i) If $P(A_n) \to 0$ and $\sum_{n=1}^\infty P(A_n^c \cap A_{n+1}) < \infty$ then $P(A_n \text{ i.o.}) = 0$. (ii)
Find an example of a sequence $A_n$ to which the result in (i) can be applied but the
Borel-Cantelli lemma cannot.
1.6.12. Let An be a sequence of independent events with P (An ) < 1 for all n. Show
that P (∪An ) = 1 implies P (An i.o.) = 1.
1.6.13. Let $X_1, X_2, \ldots$ be independent. Show that $\sup_n X_n < \infty$ a.s. if and only if
$\sum_n P(X_n > A) < \infty$ for some $A$.
1.6.14. Let $X_1, X_2, \ldots$ be independent with $P(X_n = 1) = p_n$ and $P(X_n = 0) = 1 - p_n$.
Show that (i) $X_n \to 0$ in probability if and only if $p_n \to 0$, and (ii) $X_n \to 0$
a.s. if and only if $\sum_n p_n < \infty$.
1.6.15. Let Y1 , Y2 , . . . be i.i.d. Find necessary and suﬃcient conditions for
(i) Yn /n → 0 almost surely, (ii) (maxm≤n Ym )/n → 0 almost surely,
(iii) (maxm≤n Ym )/n → 0 in probability, and (iv) Yn /n → 0 in probability.
1.6.16. The last two exercises give examples with Xn → X in probability without
Xn → X a.s. There is one situation in which the two notions are equivalent. Let
X1 , X2 , . . . be a sequence of r.v.’s on (Ω, F , P ) where Ω is a countable set and F
consists of all subsets of Ω. Show that Xn → X in probability implies Xn → X a.s.
1.6.17. Show that if Xn is the outcome of the nth play of the St. Petersburg game
(Example 1.5.7) then lim supn→∞ Xn /(n log2 n) = ∞ a.s. and hence the same result
holds for Sn . This shows that the convergence Sn /(n log2 n) → 1 in probability proved
in Section 5 does not occur a.s.
1.6.18. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i > x) = e^{-x}$, let $M_n = \max_{1 \le m \le n} X_m$.
Show that (i) $\limsup_{n\to\infty} X_n/\log n = 1$ a.s. and (ii) $M_n/\log n \to 1$ a.s.
1.6.19. Let $X_1, X_2, \ldots$ be i.i.d. with distribution $F$, let $\lambda_n \uparrow \infty$, and let
$A_n = \{\max_{1 \le m \le n} X_m > \lambda_n\}$. Show that $P(A_n \text{ i.o.}) = 0$ or 1 according as
$\sum_{n \ge 1} (1 - F(\lambda_n)) < \infty$ or $= \infty$.
1.6.20. Kochen-Stone lemma. Suppose $\sum P(A_k) = \infty$. Use Exercises 3.8 and 6.6
to show that if
\[ \limsup_{n\to\infty} \left( \sum_{k=1}^n P(A_k) \right)^2 \Big/ \left( \sum_{1 \le j,k \le n} P(A_j \cap A_k) \right) = \alpha > 0 \]
then $P(A_n \text{ i.o.}) \ge \alpha$. The case $\alpha = 1$ contains (6.6).
1.7 Strong Law of Large Numbers
We are now ready to give Etemadi's (1981) proof of
Theorem 1.7.1. Strong law of large numbers. Let $X_1, X_2, \ldots$ be pairwise independent identically distributed random variables with $E|X_i| < \infty$. Let $EX_i = \mu$ and
$S_n = X_1 + \ldots + X_n$. Then $S_n/n \to \mu$ a.s. as $n \to \infty$.
Proof. As in the proof of the weak law of large numbers, we begin by truncating.
Lemma 1.7.2. Let $Y_k = X_k 1_{(|X_k| \le k)}$ and $T_n = Y_1 + \cdots + Y_n$. It is sufficient to prove
that $T_n/n \to \mu$ a.s.
Proof. $\sum_{k=1}^\infty P(|X_k| > k) \le \int_0^\infty P(|X_1| > t)\,dt = E|X_1| < \infty$ so $P(X_k \ne Y_k \text{ i.o.}) = 0$.
This shows that $|S_n(\omega) - T_n(\omega)| \le R(\omega) < \infty$ a.s. for all $n$, from which the desired
result follows.
The second step is not so intuitive, but it is an important part of this proof and
the one given in Section 1.8.
Lemma 1.7.3. $\sum_{k=1}^\infty \operatorname{var}(Y_k)/k^2 \le 4E|X_1| < \infty$.
Proof. To bound the sum, we observe
\[ \operatorname{var}(Y_k) \le E(Y_k^2) = \int_0^\infty 2y P(|Y_k| > y)\,dy \le \int_0^k 2y P(|X_1| > y)\,dy \]
so using Fubini's theorem (since everything is ≥ 0 and the sum is just an integral with
respect to counting measure on {1, 2, . . .})
\[ \sum_{k=1}^\infty E(Y_k^2)/k^2 \le \sum_{k=1}^\infty k^{-2} \int_0^\infty 1_{(y<k)}\, 2y P(|X_1| > y)\,dy = \int_0^\infty \left\{ \sum_{k=1}^\infty k^{-2} 1_{(y<k)} \right\} 2y P(|X_1| > y)\,dy \]
Since $E|X_1| = \int_0^\infty P(|X_1| > y)\,dy$, we can complete the proof by showing
Lemma 1.7.4. If $y \ge 0$ then $2y \sum_{k>y} k^{-2} \le 4$.
Proof. We begin with the observation that if $m \ge 2$ then
\[ \sum_{k \ge m} k^{-2} \le \int_{m-1}^\infty x^{-2}\,dx = (m-1)^{-1} \]
When $y \ge 1$, the sum starts with $k = [y] + 1 \ge 2$, so
\[ 2y \sum_{k>y} k^{-2} \le 2y/[y] \le 4 \]
since $y/[y] \le 2$ for $y \ge 1$ (the worst case being $y$ close to 2). To cover $0 \le y < 1$, we
note that in this case
\[ 2y \sum_{k>y} k^{-2} \le 2\left( 1 + \sum_{k=2}^\infty k^{-2} \right) \le 4 \]
This establishes Lemma 1.7.4, which completes the proof of Lemma 1.7.3.
The first two steps, Lemmas 1.7.2 and 1.7.3 above, are standard. Etemadi's inspiration
was that since $X_n^+$, $n \ge 1$, and $X_n^-$, $n \ge 1$, satisfy the assumptions of the
theorem and $X_n = X_n^+ - X_n^-$, we can without loss of generality suppose $X_n \ge 0$. As
in the proof of Theorem 1.6.8, we will prove the result ﬁrst for a subsequence and
then use monotonicity to control the values in between. This time, however, we let
$\alpha > 1$ and $k(n) = [\alpha^n]$. Chebyshev's inequality implies that if $\epsilon > 0$
\[ \sum_{n=1}^\infty P(|T_{k(n)} - ET_{k(n)}| > \epsilon k(n)) \le \epsilon^{-2} \sum_{n=1}^\infty \operatorname{var}(T_{k(n)})/k(n)^2 \]
\[ = \epsilon^{-2} \sum_{n=1}^\infty k(n)^{-2} \sum_{m=1}^{k(n)} \operatorname{var}(Y_m) = \epsilon^{-2} \sum_{m=1}^\infty \operatorname{var}(Y_m) \sum_{n : k(n) \ge m} k(n)^{-2} \]
where we have used Fubini's theorem to interchange the two summations of nonnegative terms. Now $k(n) = [\alpha^n]$ and $[\alpha^n] \ge \alpha^n/2$ for $n \ge 1$, so summing the geometric
series and noting that the first term is $\le m^{-2}$:
\[ \sum_{n : \alpha^n \ge m} [\alpha^n]^{-2} \le 4 \sum_{n : \alpha^n \ge m} \alpha^{-2n} \le 4(1 - \alpha^{-2})^{-1} m^{-2} \]
Combining our computations shows
\[ \sum_{n=1}^\infty P(|T_{k(n)} - ET_{k(n)}| > \epsilon k(n)) \le 4(1 - \alpha^{-2})^{-1} \epsilon^{-2} \sum_{m=1}^\infty E(Y_m^2) m^{-2} < \infty \]
by Lemma 1.7.3. Since $\epsilon$ is arbitrary, $(T_{k(n)} - ET_{k(n)})/k(n) \to 0$ a.s. The dominated
convergence theorem implies $EY_k \to EX_1$ as $k \to \infty$, so $ET_{k(n)}/k(n) \to EX_1$ and we
have shown $T_{k(n)}/k(n) \to EX_1$ a.s. To handle the intermediate values, we observe
that if $k(n) \le m < k(n+1)$
\[ \frac{T_{k(n)}}{k(n+1)} \le \frac{T_m}{m} \le \frac{T_{k(n+1)}}{k(n)} \]
(here we use $Y_i \ge 0$), so recalling $k(n) = [\alpha^n]$, we have $k(n+1)/k(n) \to \alpha$ and
\[ \frac{1}{\alpha} EX_1 \le \liminf_{m\to\infty} T_m/m \le \limsup_{m\to\infty} T_m/m \le \alpha EX_1 \]
Since $\alpha > 1$ is arbitrary, the proof is complete.
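Lemma 1.7.4, used in the proof above, is easy to sanity-check numerically. The sketch below is an illustration only: it uses the classical identity $\sum_{k \ge 1} k^{-2} = \pi^2/6$ to evaluate the infinite tail sum exactly (up to rounding), and the function name `tail_bound_lhs` and the grid are arbitrary choices made here.

```python
import math


def tail_bound_lhs(y):
    """Return 2*y*sum_{k>y} k**-2, using sum_{k>=1} k**-2 = pi**2/6."""
    m = math.floor(y)  # the terms k = 1, ..., floor(y) are not in the sum
    head = sum(1.0 / k**2 for k in range(1, m + 1))
    return 2.0 * y * (math.pi**2 / 6.0 - head)


if __name__ == "__main__":
    grid = [i / 1000.0 for i in range(0, 50_001)]  # y in [0, 50]
    worst = max(grid, key=tail_bound_lhs)
    print("largest value on the grid:", tail_bound_lhs(worst), "at y =", worst)
```

On such a grid the left side stays well below 4, which suggests the constant in the lemma is chosen for convenience rather than sharpness.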
The next result shows that the strong law holds whenever EXi exists.
Theorem 1.7.5. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i^+ = \infty$ and $EX_i^- < \infty$. If
$S_n = X_1 + \cdots + X_n$ then $S_n/n \to \infty$ a.s.
Proof. Let $M > 0$ and $X_i^M = X_i \wedge M$. The $X_i^M$ are i.i.d. with $E|X_i^M| < \infty$, so if
$S_n^M = X_1^M + \cdots + X_n^M$ then Theorem 1.7.1 implies $S_n^M/n \to EX_i^M$. Since $X_i \ge X_i^M$,
it follows that
\[ \liminf_{n\to\infty} S_n/n \ge \lim_{n\to\infty} S_n^M/n = EX_i^M \]
The monotone convergence theorem implies $E(X_i^M)^+ \uparrow EX_i^+ = \infty$ as $M \uparrow \infty$, so
$EX_i^M = E(X_i^M)^+ - E(X_i^M)^- \uparrow \infty$, and we have $\liminf_{n\to\infty} S_n/n \ge \infty$, which
implies the desired result.
The rest of this section is devoted to applications of the strong law of large numbers.
Example 1.7.1. Renewal theory. Let X1 , X2 , . . . be i.i.d. with 0 < Xi < ∞. Let
Tn = X1 + . . . + Xn and think of Tn as the time of nth occurrence of some event. For
a concrete situation, consider a diligent janitor who replaces a light bulb the instant
it burns out. Suppose the ﬁrst bulb is put in at time 0 and let Xi be the lifetime of
the ith light bulb. In this interpretation, Tn is the time the nth light bulb burns out
and Nt = sup{n : Tn ≤ t} is the number of light bulbs that have burnt out by time t.
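Theorem 1.7.6. If $EX_1 = \mu \le \infty$ then as $t \to \infty$,
\[ N_t/t \to 1/\mu \quad \text{a.s.} \quad (1/\infty = 0). \]
Proof. By Theorems 1.7.1 and 1.7.5, $T_n/n \to \mu$ a.s. From the definition of $N_t$, it
follows that $T(N_t) \le t < T(N_t + 1)$, so dividing through by $N_t$ gives
\[ \frac{T(N_t)}{N_t} \le \frac{t}{N_t} \le \frac{T(N_t + 1)}{N_t + 1} \cdot \frac{N_t + 1}{N_t} \]
To take the limit, we note that since $T_n < \infty$ for all $n$, we have $N_t \uparrow \infty$ as $t \to \infty$.
The strong law of large numbers implies that for $\omega \in \Omega_0$ with $P(\Omega_0) = 1$, we have
$T_n(\omega)/n \to \mu$, $N_t(\omega) \uparrow \infty$, and hence
\[ \frac{T_{N_t(\omega)}(\omega)}{N_t(\omega)} \to \mu \qquad \frac{N_t(\omega) + 1}{N_t(\omega)} \to 1 \]
From this it follows that for $\omega \in \Omega_0$, $t/N_t(\omega) \to \mu$, so $N_t/t \to 1/\mu$ a.s.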
The last argument shows that if Xn → X∞ a.s. and N (n) → ∞ a.s. then
XN (n) → X∞ a.s. We have written this out with care because the analogous result for convergence in probability is false.
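The renewal limit $N_t/t \to 1/\mu$ can be observed in a short simulation of the diligent janitor. This is a sketch under an assumption not made in the text: the lifetimes are taken to be exponential with mean $\mu$, purely for convenience, and the function name `renewals_by_time` is hypothetical.

```python
import random


def renewals_by_time(t, mean_lifetime, rng):
    """Return N_t = sup{n : T_n <= t} for i.i.d. exponential lifetimes."""
    elapsed = 0.0
    count = 0
    while True:
        # Draw the next lifetime X_i; expovariate takes the rate 1/mean.
        elapsed += rng.expovariate(1.0 / mean_lifetime)
        if elapsed > t:
            return count
        count += 1


if __name__ == "__main__":
    rng = random.Random(42)
    mu = 2.0
    for t in (10.0, 1_000.0, 100_000.0):
        n_t = renewals_by_time(t, mu, rng)
        print(f"t = {t:>9}: N_t/t = {n_t / t:.4f}   (1/mu = {1 / mu})")
```

Any other positive lifetime distribution with mean $\mu$ should give the same limit, since the theorem uses only the strong law.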
Exercise 1.7.1. Give an example with Xn ∈ {0, 1}, Xn → 0 in probability, N (n) ↑ ∞
a.s., and XN (n) → 1 a.s.
Exercise 1.7.2. Lazy janitor. Suppose the ith light bulb burns for an amount of
time Xi and then remains burned out for time Yi before being replaced. Suppose the
Xi , Yi are positive and independent with the X ’s having distribution F and the Y ’s
having distribution G, both of which have ﬁnite mean. Let Rt be the amount of time
in [0, t] that we have a working light bulb. Show that Rt /t → EXi /(EXi + EYi )
almost surely.
Example 1.7.2. Empirical distribution functions. Let X1 , X2 , . . . be i.i.d. with
distribution $F$ and let
\[ F_n(x) = n^{-1} \sum_{m=1}^n 1_{(X_m \le x)} \]
$F_n(x)$ = the observed frequency of values that are $\le x$, hence the name given above.
The next result shows that $F_n$ converges uniformly to $F$ as $n \to \infty$.
Theorem 1.7.7. The Glivenko-Cantelli theorem. As $n \to \infty$,
\[ \sup_x |F_n(x) - F(x)| \to 0 \quad \text{a.s.} \]
Proof. Fix $x$ and let $Y_n = 1_{(X_n \le x)}$. Since the $Y_n$ are i.i.d. with $EY_n = P(X_n \le x) = F(x)$,
the strong law of large numbers implies that $F_n(x) = n^{-1} \sum_{m=1}^n Y_m \to F(x)$
a.s. In general, if $F_n$ is a sequence of nondecreasing functions that converges pointwise
to a bounded and continuous limit $F$ then $\sup_x |F_n(x) - F(x)| \to 0$. However, the
distribution function $F(x)$ may have jumps, so we have to work a little harder.
Again, fix $x$ and let $Z_n = 1_{(X_n < x)}$. Since the $Z_n$ are i.i.d. with $EZ_n = P(X_n < x) = F(x-) = \lim_{y \uparrow x} F(y)$,
the strong law of large numbers implies that
$F_n(x-) = n^{-1} \sum_{m=1}^n Z_m \to F(x-)$ a.s. For $1 \le j \le k - 1$ let $x_{j,k} = \inf\{y : F(y) \ge j/k\}$. The
pointwise convergence of $F_n(x)$ and $F_n(x-)$ implies that we can pick $N_k(\omega)$ so that if
$n \ge N_k(\omega)$ then
\[ |F_n(x_{j,k}) - F(x_{j,k})| < k^{-1} \quad \text{and} \quad |F_n(x_{j,k}-) - F(x_{j,k}-)| < k^{-1} \]
for $1 \le j \le k - 1$. If we let $x_{0,k} = -\infty$ and $x_{k,k} = \infty$, then the last two inequalities
hold for $j = 0$ or $k$. If $x \in (x_{j-1,k}, x_{j,k})$ with $1 \le j \le k$ and $n \ge N_k(\omega)$, then using
the monotonicity of $F_n$ and $F$, and $F(x_{j,k}-) - F(x_{j-1,k}) \le k^{-1}$, we have
\[ F_n(x) \le F_n(x_{j,k}-) \le F(x_{j,k}-) + k^{-1} \le F(x_{j-1,k}) + 2k^{-1} \le F(x) + 2k^{-1} \]
\[ F_n(x) \ge F_n(x_{j-1,k}) \ge F(x_{j-1,k}) - k^{-1} \ge F(x_{j,k}-) - 2k^{-1} \ge F(x) - 2k^{-1} \]
so $\sup_x |F_n(x) - F(x)| \le 2k^{-1}$, and we have proved the result.
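The uniform convergence in the Glivenko-Cantelli theorem can be watched directly. The following sketch assumes, for illustration only, that $F$ is the Uniform(0,1) distribution function, so $F(x) = x$ on $[0,1]$; the function name `sup_distance_uniform` is hypothetical. Since $F_n$ is a step function, the supremum is attained at, or just before, one of the sample points.

```python
import random


def sup_distance_uniform(sample):
    """Return sup_x |F_n(x) - F(x)| when F is the Uniform(0,1) d.f."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        # F_n jumps from (i-1)/n to i/n at x; check both sides of the jump.
        worst = max(worst, i / n - x, x - (i - 1) / n)
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1_000, 10_000, 100_000):
        sample = [rng.random() for _ in range(n)]
        print(f"n = {n:>6}: sup|F_n - F| = {sup_distance_uniform(sample):.4f}")
```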
Example 1.7.3. Shannon’s theorem. Let X1 , X2 , . . . ∈ {1, . . . , r} be independent
with P (Xi = k ) = p(k ) > 0 for 1 ≤ k ≤ r. Here we are thinking of 1, . . . , r as
the letters of an alphabet, and X1 , X2 , . . . are the successive letters produced by an
information source. In this i.i.d. case, it is the proverbial monkey at a typewriter. Let
πn (ω ) = p(X1 (ω )) · · · p(Xn (ω )) be the probability of the realization we observed in
the first $n$ trials. Since $\log \pi_n(\omega)$ is a sum of independent random variables, it follows
from the strong law of large numbers that
\[ -n^{-1} \log \pi_n(\omega) \to H \equiv -\sum_{k=1}^r p(k) \log p(k) \quad \text{a.s.} \]
The constant $H$ is called the entropy of the source and is a measure of how random
it is. The last result is the asymptotic equipartition property: If $\epsilon > 0$ then as
$n \to \infty$
\[ P\{\exp(-n(H + \epsilon)) \le \pi_n(\omega) \le \exp(-n(H - \epsilon))\} \to 1 \]
We will give a more general version of this result in (5.1) of Chapter 6.
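The convergence $-n^{-1} \log \pi_n \to H$ is easy to observe numerically. The sketch below is illustrative: the three-letter distribution is an arbitrary choice, letters are indexed from 0 rather than 1, and the names `entropy` and `empirical_rate` are hypothetical.

```python
import math
import random


def entropy(p):
    """H = -sum_k p(k) log p(k), in nats."""
    return -sum(q * math.log(q) for q in p)


def empirical_rate(p, n, rng):
    """Return -(1/n) log pi_n for n i.i.d. letters drawn from p."""
    letters = rng.choices(range(len(p)), weights=p, k=n)
    return -sum(math.log(p[k]) for k in letters) / n


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    rng = random.Random(1)
    print("entropy H:", entropy(p))
    for n in (10, 1_000, 100_000):
        print(f"n = {n:>6}: -(1/n) log pi_n = {empirical_rate(p, n, rng):.4f}")
```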
Exercises
1.7.3. Let $X_0 = (1, 0)$ and define $X_n \in \mathbb{R}^2$ inductively by declaring that $X_{n+1}$ is
chosen at random from the ball of radius $|X_n|$ centered at the origin, i.e., $X_{n+1}/|X_n|$
is uniformly distributed on the ball of radius 1 and independent of $X_1, \ldots, X_n$. Prove
that $n^{-1} \log |X_n| \to c$ a.s. and compute $c$.
1.7.4. Investment problem. We assume that at the beginning of each year you can
buy bonds for \$1 that are worth \$$a$ at the end of the year or stocks that are worth
a random amount $V \ge 0$. If you always invest a fixed proportion $p$ of your wealth
in bonds, then your wealth at the end of year $n + 1$ is $W_{n+1} = (ap + (1 - p)V_n)W_n$.
Suppose $V_1, V_2, \ldots$ are i.i.d. with $EV_n^2 < \infty$ and $E(V_n^{-2}) < \infty$. (i) Show that
$n^{-1} \log W_n \to c(p)$ a.s. (ii) Show that $c(p)$ is concave. [Use (9.1) in the Appendix
to justify differentiating under the expected value.] (iii) By investigating $c'(0)$ and
$c'(1)$, give conditions on $V$ that guarantee that the optimal choice of $p$ is in (0,1). (iv)
Suppose $P(V = 1) = P(V = 4) = 1/2$. Find the optimal $p$ as a function of $a$.
1.8 Convergence of Random Series*
In this section, we will pursue a second approach to the strong law of large numbers
based on the convergence of random series. This approach has the advantage that it
leads to estimates on the rate of convergence under moment assumptions, Theorems
1.8.7 and 1.8.8, and to a negative result for the inﬁnite mean case, Theorem 1.8.9,
which is stronger than the one in Theorem 1.6.7. The ﬁrst two results in this section
are of considerable interest in their own right, although we will see more general
versions in (1.1) of Chapter 3 and (4.2) of Chapter 4.
To state the first result, we need some notation. Let $\mathcal{F}_n = \sigma(X_n, X_{n+1}, \ldots)$ = the
future after time $n$ = the smallest σ-field with respect to which all the $X_m$, $m \ge n$ are
measurable. Let $\mathcal{T} = \cap_n \mathcal{F}_n$ = the remote future, or tail σ-field. Intuitively, $A \in \mathcal{T}$
if and only if changing a finite number of values does not affect the occurrence of the
event. As usual, we turn to examples to help explain the definition.
Example 1.8.1. If Bn ∈ R then {Xn ∈ Bn i.o.} ∈ T . If we let Xn = 1An and
Bn = {1}, this example becomes {An i.o.}.
Example 1.8.2. Let Sn = X1 + . . . + Xn . It is easy to check that
{limn→∞ Sn exists } ∈ T ,
{lim supn→∞ Sn > 0} ∈ T ,
{lim supn→∞ Sn /cn > x} ∈ T if cn → ∞.
The next result shows that all examples are trivial.
Theorem 1.8.1. Kolmogorov's 0-1 law. If X1 , X2 , . . . are independent and A ∈ T
then P (A) = 0 or 1.
Proof. We will show that A is independent of itself, that is, P (A ∩ A) = P (A)P (A),
so P (A) = P (A)2 , and hence P (A) = 0 or 1. We will sneak up on this conclusion in
two steps:
(a) A ∈ σ (X1 , . . . , Xk ) and B ∈ σ (Xk+1 , Xk+2 , . . .) are independent.
Proof of (a). If B ∈ σ (Xk+1 , . . . , Xk+j ) for some j , this follows from Theorem 1.4.5.
Since σ (X1 , . . . , Xk ) and ∪j σ (Xk+1 , . . . , Xk+j ) are π-systems that contain Ω, (a) follows from Theorem 1.4.3.
(b) A ∈ σ (X1 , X2 , . . .) and B ∈ T are independent.
Proof of (b). Since T ⊂ σ (Xk+1 , Xk+2 , . . .), if A ∈ σ (X1 , . . . , Xk ) for some k , this
follows from (a). ∪k σ (X1 , . . . , Xk ) and T are π-systems that contain Ω, so (b) follows
from Theorem 1.4.3.
Since T ⊂ σ (X1 , X2 , . . .), (b) implies an A ∈ T is independent of itself and Theorem
1.8.1 follows.
If A1 , A2 , . . . are independent then Theorem 1.8.1 implies P (An i.o.) = 0 or 1.
Applying Theorem 1.8.1 to Example 1.8.2 gives P (limn→∞ Sn exists) = 0 or 1. The
next result will help us prove the probability is 1 in certain situations.
Theorem 1.8.2. Kolmogorov's maximal inequality. Suppose $X_1, \ldots, X_n$ are
independent with $EX_i = 0$ and $\operatorname{var}(X_i) < \infty$. If $S_n = X_1 + \cdots + X_n$ then
\[ P\left( \max_{1 \le k \le n} |S_k| \ge x \right) \le x^{-2} \operatorname{var}(S_n) \]
Remark. Under the same hypotheses, Chebyshev's inequality (Theorem 1.3.4) gives
only
\[ P(|S_n| \ge x) \le x^{-2} \operatorname{var}(S_n) \]
Proof. Let $A_k = \{|S_k| \ge x \text{ but } |S_j| < x \text{ for } j < k\}$, i.e., we break things down according
to the time that $|S_k|$ first exceeds $x$. Since the $A_k$ are disjoint and $(S_n - S_k)^2 \ge 0$,
\[ ES_n^2 \ge \sum_{k=1}^n \int_{A_k} S_n^2\,dP = \sum_{k=1}^n \int_{A_k} S_k^2 + 2S_k(S_n - S_k) + (S_n - S_k)^2\,dP \]
\[ \ge \sum_{k=1}^n \int_{A_k} S_k^2\,dP + \sum_{k=1}^n \int 2S_k 1_{A_k} \cdot (S_n - S_k)\,dP \]
$S_k 1_{A_k} \in \sigma(X_1, \ldots, X_k)$ and $S_n - S_k \in \sigma(X_{k+1}, \ldots, X_n)$ are independent by Theorem
1.4.6, so using Theorem 1.4.9 and $E(S_n - S_k) = 0$ shows
\[ \int 2S_k 1_{A_k} \cdot (S_n - S_k)\,dP = E(2S_k 1_{A_k}) \cdot E(S_n - S_k) = 0 \]
Using now the fact that $|S_k| \ge x$ on $A_k$ and the $A_k$ are disjoint,
\[ ES_n^2 \ge \sum_{k=1}^n \int_{A_k} S_k^2\,dP \ge \sum_{k=1}^n x^2 P(A_k) = x^2 P\left( \max_{1 \le k \le n} |S_k| \ge x \right) \]
Exercise 1.8.1. Suppose $X_1, X_2, \ldots$ are i.i.d. with $EX_i = 0$, $\operatorname{var}(X_i) = C < \infty$. Use
Theorem 1.8.2 with $n = m^\alpha$ where $\alpha(2p - 1) > 1$ to conclude that if $S_n = X_1 + \cdots + X_n$
and $p > 1/2$ then $S_n/n^p \to 0$ almost surely.
We turn now to our results on convergence of series. To state them, we need a
definition. We say that $\sum_{n=1}^\infty a_n$ converges if $\lim_{N\to\infty} \sum_{n=1}^N a_n$ exists.
Theorem 1.8.3. Suppose $X_1, X_2, \ldots$ are independent and have $EX_n = 0$. If
\[ \sum_{n=1}^\infty \operatorname{var}(X_n) < \infty \]
then with probability one $\sum_{n=1}^\infty X_n(\omega)$ converges.
Proof. Let $S_N = \sum_{n=1}^N X_n$. From Theorem 1.8.2, we get
\[ P\left( \max_{M \le m \le N} |S_m - S_M| > \epsilon \right) \le \epsilon^{-2} \operatorname{var}(S_N - S_M) = \epsilon^{-2} \sum_{n=M+1}^N \operatorname{var}(X_n) \]
Letting $N \to \infty$ in the last result, we get
\[ P\left( \sup_{m \ge M} |S_m - S_M| > \epsilon \right) \le \epsilon^{-2} \sum_{n=M+1}^\infty \operatorname{var}(X_n) \to 0 \quad \text{as } M \to \infty \]
If we let $w_M = \sup_{m,n \ge M} |S_m - S_n|$ then $w_M \downarrow$ as $M \uparrow$ and
\[ P(w_M > 2\epsilon) \le P\left( \sup_{m \ge M} |S_m - S_M| > \epsilon \right) \to 0 \]
as $M \to \infty$ so $w_M \downarrow 0$ almost surely. But $w_M(\omega) \downarrow 0$ implies $S_n(\omega)$ is a Cauchy
sequence and hence $\lim_{n\to\infty} S_n(\omega)$ exists, so the proof is complete.
Example 1.8.3. Let $X_1, X_2, \ldots$ be independent with
\[ P(X_n = n^{-\alpha}) = P(X_n = -n^{-\alpha}) = 1/2 \]
$EX_n = 0$ and $\operatorname{var}(X_n) = n^{-2\alpha}$ so if $\alpha > 1/2$ it follows from Theorem 1.8.3 that
$\sum X_n$ converges. Theorem 1.8.4 below shows that $\alpha > 1/2$ is also necessary for this
conclusion. Notice that there is absolute convergence, i.e., $\sum |X_n| < \infty$, if and only
if $\alpha > 1$.
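The Cauchy-type control used in the proof of Theorem 1.8.3 can be seen in Example 1.8.3 by simulation. The sketch below is an illustration with choices made here, not in the text: $\alpha = 3/4$, the window $(M, N] = (10^4, 10^5]$, and the hypothetical function name `tail_oscillation`. It computes $\max_{M < m \le N} |S_m - S_M|$ along one sample path.

```python
import random


def tail_oscillation(alpha, start, stop, rng):
    """max over start < m <= stop of |S_m - S_start| for X_n = +-n**-alpha."""
    partial = 0.0  # running value of S_m - S_start
    worst = 0.0
    for n in range(start + 1, stop + 1):
        partial += rng.choice((-1.0, 1.0)) * n ** (-alpha)
        worst = max(worst, abs(partial))
    return worst


if __name__ == "__main__":
    rng = random.Random(5)
    for alpha in (0.75, 1.5):
        osc = tail_oscillation(alpha, 10_000, 100_000, rng)
        print(f"alpha = {alpha}: max |S_m - S_M| on (10^4, 10^5] = {osc:.4f}")
```

By Kolmogorov's maximal inequality the probability that this oscillation exceeds $\epsilon$ is at most $\epsilon^{-2} \sum_{n > M} n^{-2\alpha}$, which is small when $\alpha > 1/2$ and $M$ is large.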
Theorem 1.8.3 is suﬃcient for all of our applications, but our treatment would not
be complete if we did not mention the last word on convergence of random series.
Theorem 1.8.4. Kolmogorov's three-series theorem. Let $X_1, X_2, \ldots$ be independent.
Let $A > 0$ and let $Y_i = X_i 1_{(|X_i| \le A)}$. In order that $\sum_{n=1}^\infty X_n$ converges a.s.,
it is necessary and sufficient that
\[ \text{(i)} \sum_{n=1}^\infty P(|X_n| > A) < \infty, \quad \text{(ii)} \sum_{n=1}^\infty EY_n \text{ converges, and} \quad \text{(iii)} \sum_{n=1}^\infty \operatorname{var}(Y_n) < \infty \]
Proof. We will prove the necessity in Example 4.7 of Chapter 2 as an application of
the central limit theorem. To prove the sufficiency, let $\mu_n = EY_n$. (iii) and Theorem
1.8.3 imply that $\sum_{n=1}^\infty (Y_n - \mu_n)$ converges a.s. Using (ii) now gives that $\sum_{n=1}^\infty Y_n$
converges a.s. (i) and the Borel-Cantelli lemma imply $P(X_n \ne Y_n \text{ i.o.}) = 0$, so
$\sum_{n=1}^\infty X_n$ converges a.s.
The link between convergence of series and the strong law of large numbers is
provided by
Theorem 1.8.5. Kronecker's lemma. If $a_n \uparrow \infty$ and $\sum_{n=1}^\infty x_n/a_n$ converges then
\[ a_n^{-1} \sum_{m=1}^n x_m \to 0 \]
Proof. Let $a_0 = 0$, $b_0 = 0$, and for $m \ge 1$, let $b_m = \sum_{k=1}^m x_k/a_k$. Then $x_m = a_m(b_m - b_{m-1})$ and so
\[ a_n^{-1} \sum_{m=1}^n x_m = a_n^{-1} \left\{ \sum_{m=1}^n a_m b_m - \sum_{m=1}^n a_m b_{m-1} \right\} \]
\[ = a_n^{-1} \left\{ a_n b_n + \sum_{m=2}^n a_{m-1} b_{m-1} - \sum_{m=1}^n a_m b_{m-1} \right\} \]
\[ = b_n - \sum_{m=1}^n \frac{(a_m - a_{m-1})}{a_n} b_{m-1} \]
(Recall $a_0 = 0$.) By hypothesis, $b_n \to b_\infty$ as $n \to \infty$. Since $a_m - a_{m-1} \ge 0$, the last
sum is an average of $b_0, \ldots, b_n$. Intuitively, if $\epsilon > 0$ and $M < \infty$ are fixed and $n$ is
large, the average assigns mass $\ge 1 - \epsilon$ to the $b_m$ with $m \ge M$, so
\[ \sum_{m=1}^n \frac{(a_m - a_{m-1})}{a_n} b_{m-1} \to b_\infty \]
To argue formally, let $B = \sup |b_n|$, pick $M$ so that $|b_m - b_\infty| < \epsilon/2$ for $m \ge M$, then
pick $N$ so that $a_M/a_n < \epsilon/4B$ for $n \ge N$. Now if $n \ge N$, we have
\[ \left| \sum_{m=1}^n \frac{(a_m - a_{m-1})}{a_n} b_{m-1} - b_\infty \right| \le \sum_{m=1}^n \frac{(a_m - a_{m-1})}{a_n} |b_{m-1} - b_\infty| \le \frac{a_M}{a_n} \cdot 2B + \frac{a_n - a_M}{a_n} \cdot \frac{\epsilon}{2} < \epsilon \]
proving the desired result since $\epsilon$ is arbitrary.
Theorem 1.8.6. The strong law of large numbers. Let $X_1, X_2, \ldots$ be i.i.d. random variables with $E|X_i| < \infty$. Let $EX_i = \mu$ and $S_n = X_1 + \ldots + X_n$. Then
$S_n/n \to \mu$ a.s. as $n \to \infty$.
Proof. Let $Y_k = X_k 1_{(|X_k| \le k)}$ and $T_n = Y_1 + \cdots + Y_n$. By (a) in the proof of Theorem
1.7.1 it suffices to show that $T_n/n \to \mu$. Let $Z_k = Y_k - EY_k$, so $EZ_k = 0$. Now
$\operatorname{var}(Z_k) = \operatorname{var}(Y_k) \le EY_k^2$ and (b) in the proof of Theorem 1.7.1 imply
\[ \sum_{k=1}^\infty \operatorname{var}(Z_k)/k^2 \le \sum_{k=1}^\infty EY_k^2/k^2 < \infty \]
Applying Theorem 1.8.3 now, we conclude that $\sum_{k=1}^\infty Z_k/k$ converges a.s. so Theorem
1.8.5 implies
\[ n^{-1} \sum_{k=1}^n (Y_k - EY_k) \to 0 \quad \text{and hence} \quad \frac{T_n}{n} - n^{-1} \sum_{k=1}^n EY_k \to 0 \quad \text{a.s.} \]
The dominated convergence theorem implies $EY_k \to \mu$ as $k \to \infty$. From this, it
follows easily that $n^{-1} \sum_{k=1}^n EY_k \to \mu$ and hence $T_n/n \to \mu$.
Rates of convergence
As mentioned earlier, one of the advantages of the random series proof is that it
provides estimates on the rate of convergence of Sn /n → µ. By subtracting µ from
each random variable, we can and will suppose without loss of generality that µ = 0.
Theorem 1.8.7. Let $X_1, X_2, \ldots$ be i.i.d. random variables with $EX_i = 0$ and $EX_i^2 = \sigma^2 < \infty$.
Let $S_n = X_1 + \ldots + X_n$. If $\epsilon > 0$ then
\[ S_n/n^{1/2}(\log n)^{1/2+\epsilon} \to 0 \quad \text{a.s.} \]
Remark. Kolmogorov's test, (9.6) in Chapter 7, will show that
\[ \limsup_{n\to\infty} S_n/n^{1/2}(\log\log n)^{1/2} = \sigma\sqrt{2} \quad \text{a.s.} \]
so the last result is not far from the best possible.
Proof. Let $a_n = n^{1/2}(\log n)^{1/2+\epsilon}$ for $n \ge 2$ and $a_1 > 0$.
\[ \sum_{n=1}^\infty \operatorname{var}(X_n/a_n) = \sigma^2 \left( \frac{1}{a_1^2} + \sum_{n=2}^\infty \frac{1}{n(\log n)^{1+2\epsilon}} \right) < \infty \]
so applying Theorem 1.8.3 we get $\sum_{n=1}^\infty X_n/a_n$ converges a.s. and the indicated result
follows from Theorem 1.8.5.
The next result due to Marcinkiewicz and Zygmund treats the situation in which
$EX_i^2 = \infty$ but $E|X_i|^p < \infty$ for some $1 < p < 2$.
Theorem 1.8.8. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_1 = 0$ and $E|X_1|^p < \infty$ where
$1 < p < 2$. If $S_n = X_1 + \ldots + X_n$ then $S_n/n^{1/p} \to 0$ a.s.
Proof. Let $Y_k = X_k 1_{(|X_k| \le k^{1/p})}$ and $T_n = Y_1 + \cdots + Y_n$.
\[ \sum_{k=1}^\infty P(Y_k \ne X_k) = \sum_{k=1}^\infty P(|X_k|^p > k) \le E|X_k|^p < \infty \]
so the Borel-Cantelli lemma implies $P(Y_k \ne X_k \text{ i.o.}) = 0$, and it suffices to show
$T_n/n^{1/p} \to 0$. Using $\operatorname{var}(Y_m) \le E(Y_m^2)$, Lemma 1.5.8 with $p = 2$, $P(|Y_m| > y) \le P(|X_1| > y)$,
and Fubini's theorem (everything is ≥ 0) we have
\[ \sum_{m=1}^\infty \operatorname{var}(Y_m/m^{1/p}) \le \sum_{m=1}^\infty EY_m^2/m^{2/p} \]
\[ \le \sum_{m=1}^\infty \sum_{n=1}^m \int_{(n-1)^{1/p}}^{n^{1/p}} \frac{2y}{m^{2/p}} P(|X_1| > y)\,dy \]
\[ = \sum_{n=1}^\infty \int_{(n-1)^{1/p}}^{n^{1/p}} \sum_{m=n}^\infty \frac{2y}{m^{2/p}} P(|X_1| > y)\,dy \]
To bound the integral, we note that for $n \ge 2$ comparing the sum with the integral
of $x^{-2/p}$
\[ \sum_{m=n}^\infty m^{-2/p} \le \frac{p}{2-p} (n-1)^{(p-2)/p} \le Cy^{p-2} \]
when $y \in [(n-1)^{1/p}, n^{1/p}]$. Since $E|X_i|^p = \int_0^\infty p x^{p-1} P(|X_i| > x)\,dx < \infty$, it follows
that
\[ \sum_{m=1}^\infty \operatorname{var}(Y_m/m^{1/p}) < \infty \]
If we let $\mu_m = EY_m$ and apply Theorem 1.8.3 and Theorem 1.8.5 it follows that
\[ n^{-1/p} \sum_{m=1}^n (Y_m - \mu_m) \to 0 \quad \text{a.s.} \]
To estimate $\mu_m$, we note that since $EX_m = 0$, $\mu_m = -E(X_i; |X_i| > m^{1/p})$, so
\[ |\mu_m| \le E(|X_i|; |X_i| > m^{1/p}) = m^{1/p} E(|X_i|/m^{1/p}; |X_i| > m^{1/p}) \]
\[ \le m^{1/p} E((|X_i|/m^{1/p})^p; |X_i| > m^{1/p}) \]
\[ = m^{-1+1/p} E(|X_i|^p; |X_i| > m^{1/p}) \]
Now $\sum_{m=1}^n m^{-1+1/p} \le Cn^{1/p}$ and $E(|X_i|^p; |X_i| > m^{1/p}) \to 0$ as $m \to \infty$, so
$n^{-1/p} \sum_{m=1}^n \mu_m \to 0$ and the desired result follows.
Exercise 1.8.2. The converse of the last result is much easier. Let $p > 0$. If
$S_n/n^{1/p} \to 0$ a.s. then $E|X_1|^p < \infty$.
Infinite Mean
The St. Petersburg game, discussed in Example 1.5.7 and Exercise 1.6.17, is a
situation in which $EX_i = \infty$, $S_n/n\log_2 n \to 1$ in probability but
\[ \limsup_{n\to\infty} S_n/(n \log_2 n) = \infty \quad \text{a.s.} \]
The next result, due to Feller (1946), shows that when $E|X_1| = \infty$, $S_n/a_n$ cannot
converge almost surely to a nonzero limit. In Theorem 1.6.7 we considered the special
case $a_n = n$.
Theorem 1.8.9. Let $X_1, X_2, \ldots$ be i.i.d. with $E|X_1| = \infty$ and let $S_n = X_1 + \cdots + X_n$.
Let $a_n$ be a sequence of positive numbers with $a_n/n$ increasing. Then
$\limsup_{n\to\infty} |S_n|/a_n = 0$ or $\infty$ according as $\sum_n P(|X_1| \ge a_n) < \infty$ or $= \infty$.
Proof. Since $a_n/n \uparrow$, $a_{kn} \ge ka_n$ for any integer $k$. Using this and $a_n \uparrow$,
\[ \sum_{n=1}^\infty P(|X_1| \ge ka_n) \ge \sum_{n=1}^\infty P(|X_1| \ge a_{kn}) \ge \frac{1}{k} \sum_{m=k}^\infty P(|X_1| \ge a_m) \]
The last observation shows that if the sum is infinite, $\limsup_{n\to\infty} |X_n|/a_n = \infty$. Since
$\max\{|S_{n-1}|, |S_n|\} \ge |X_n|/2$, it follows that $\limsup_{n\to\infty} |S_n|/a_n = \infty$.
To prove the other half, we begin with the identity
\[ \sum_{m=1}^\infty m P(a_{m-1} \le |X_i| < a_m) = \sum_{n=1}^\infty P(|X_i| \ge a_{n-1}) \qquad (∗) \]
To see this, write $m = \sum_{n=1}^m 1$ and then use Fubini's theorem. We now let
$Y_n = X_n 1_{(|X_n| < a_n)}$, and $T_n = Y_1 + \ldots + Y_n$. When the sum is finite, $P(Y_n \ne X_n \text{ i.o.}) = 0$,
and it suffices to investigate the behavior of the $T_n$. To do this, we let $a_0 = 0$ and
compute
\[ \sum_{n=1}^\infty \operatorname{var}(Y_n/a_n) \le \sum_{n=1}^\infty EY_n^2/a_n^2 = \sum_{n=1}^\infty a_n^{-2} \sum_{m=1}^n \int_{[a_{m-1},a_m)} y^2\,dF(y) = \sum_{m=1}^\infty \int_{[a_{m-1},a_m)} y^2\,dF(y) \sum_{n=m}^\infty a_n^{-2} \]
Since $a_n \ge na_m/m$, we have $\sum_{n=m}^\infty a_n^{-2} \le (m^2/a_m^2) \sum_{n=m}^\infty n^{-2} \le Cma_m^{-2}$, so
\[ \sum_{n=1}^\infty \operatorname{var}(Y_n/a_n) \le C \sum_{m=1}^\infty m \int_{[a_{m-1},a_m)} dF(y) \]
Using (∗) now, we conclude $\sum_{n=1}^\infty \operatorname{var}(Y_n/a_n) < \infty$.
The last step is to show $ET_n/a_n \to 0$. To begin, we note that if $E|X_i| = \infty$,
$\sum_{n=1}^\infty P(|X_i| > a_n) < \infty$, and $a_n/n \uparrow$ we must have $a_n/n \uparrow \infty$. To estimate $ET_n/a_n$
now, we observe that
\[ \left| a_n^{-1} \sum_{m=1}^n EY_m \right| \le a_n^{-1} \sum_{m=1}^n E(|X_m|; |X_m| < a_m) \le \frac{na_N}{a_n} + \frac{n}{a_n} E(|X_i|; a_N \le |X_i| < a_n) \]
where the last inequality holds for any fixed $N$. Since $a_n/n \to \infty$, the first term
converges to 0. Since $m/a_m \downarrow$, the second is
\[ \le \sum_{m=N+1}^n \frac{m}{a_m} E(|X_i|; a_{m-1} \le |X_i| < a_m) \le \sum_{m=N+1}^\infty m P(a_{m-1} \le |X_i| < a_m) \]
(∗) shows that the sum is finite, so it is small if $N$ is large and the desired result
follows.
Exercises
1.8.3. Let $X_1, X_2, \ldots$ be i.i.d. standard normals. Show that for any $t$
\[ \sum_{n=1}^\infty X_n \cdot \frac{\sin(n\pi t)}{n} \quad \text{converges a.s.} \]
We will see this series again at the end of Section 7.1.
1.8.4. Let $X_1, X_2, \ldots$ be independent with $EX_n = 0$, $\operatorname{var}(X_n) = \sigma_n^2$. (i) Show that if
$\sum_n \sigma_n^2/n^2 < \infty$ then $\sum_n X_n/n$ converges a.s. and hence $n^{-1} \sum_{m=1}^n X_m \to 0$ a.s. (ii)
Suppose $\sum \sigma_n^2/n^2 = \infty$ and without loss of generality that $\sigma_n^2 \le n^2$ for all $n$. Show
that there are independent random variables $X_n$ with $EX_n = 0$ and $\operatorname{var}(X_n) \le \sigma_n^2$
so that $X_n/n$ and hence $n^{-1} \sum_{m \le n} X_m$ does not converge to 0 a.s.
1.8.5. Let $X_n \ge 0$ be independent for $n \ge 1$. The following are equivalent:
(i) $\sum_{n=1}^\infty X_n < \infty$ a.s. (ii) $\sum_{n=1}^\infty [P(X_n > 1) + E(X_n 1_{(X_n \le 1)})] < \infty$
(iii) $\sum_{n=1}^\infty E(X_n/(1 + X_n)) < \infty$.
1.8.6. Let $\psi(x) = x^2$ when $|x| \le 1$ and $= |x|$ when $|x| \ge 1$. Show that if $X_1, X_2, \ldots$
are independent with $EX_n = 0$ and $\sum_{n=1}^\infty E\psi(X_n) < \infty$ then $\sum_{n=1}^\infty X_n$ converges
a.s.
1.8.7. Let $X_n$ be independent. Suppose $\sum_{n=1}^\infty E|X_n|^{p(n)} < \infty$ where $0 < p(n) \le 2$
for all $n$ and $EX_n = 0$ when $p(n) > 1$. Show that $\sum_{n=1}^\infty X_n$ converges a.s.
1.8.8. Let $X_1, X_2, \ldots$ be i.i.d. and not ≡ 0. Then the radius of convergence of the
power series $\sum_{n \ge 1} X_n(\omega) z^n$ (i.e., $r(\omega) = \sup\{c : \sum |X_n(\omega)| c^n < \infty\}$) is 1 a.s. or 0
a.s., according as $E\log^+ |X_1| < \infty$ or $= \infty$ where $\log^+ x = \max(\log x, 0)$.
1.8.9. Let $X_1, X_2, \ldots$ be independent and let $S_{m,n} = X_{m+1} + \ldots + X_n$. Then
\[ (\star) \qquad P\left( \max_{m < j \le n} |S_{m,j}| > 2a \right) \min_{m < k \le n} P(|S_{k,n}| \le a) \le P(|S_{m,n}| > a) \]
1.8.10. Use $(\star)$ to prove a theorem of P. Lévy: Let $X_1, X_2, \ldots$ be independent and
let $S_n = X_1 + \ldots + X_n$. If $\lim_{n\to\infty} S_n$ exists in probability then it also exists a.s.
1.8.11. Let $X_1, X_2, \ldots$ be i.i.d. and $S_n = X_1 + \ldots + X_n$. Use $(\star)$ to conclude that if
$S_n/n \to 0$ in probability then $(\max_{1 \le m \le n} S_m)/n \to 0$ in probability.
1.8.12. Let $X_1, X_2, \ldots$ be i.i.d. and $S_n = X_1 + \ldots + X_n$. Suppose $a_n \uparrow \infty$ and
$a(2^n)/a(2^{n-1})$ is bounded. (i) Use $(\star)$ to show that if $S_n/a(n) \to 0$ in probability and
$S_{2^n}/a(2^n) \to 0$ a.s. then $S_n/a(n) \to 0$ a.s. (ii) Suppose in addition that $EX_1 = 0$ and
$EX_1^2 < \infty$. Use the previous exercise and Chebyshev's inequality to conclude that
$S_n/n^{1/2}(\log_2 n)^{1/2+\epsilon} \to 0$ a.s.
1.9 Large Deviations*
Let $X_1, X_2, \ldots$ be i.i.d. and let $S_n = X_1 + \cdots + X_n$. In this section, we will investigate
the rate at which $P(S_n > na) \to 0$ for $a > \mu = EX_i$. We will ultimately conclude
that if the moment-generating function $\varphi(\theta) = E\exp(\theta X_i) < \infty$ for some $\theta > 0$,
$P(S_n \ge na) \to 0$ exponentially rapidly and we will identify
\[ \gamma(a) = \lim_{n\to\infty} \frac{1}{n} \log P(S_n \ge na) \]
Our first step is to prove that the limit exists. This is based on an observation
that will be useful several times below. Let $\pi_n = P(S_n \ge na)$.
\[ \pi_{m+n} \ge P(S_m \ge ma, S_{n+m} - S_m \ge na) = \pi_m \pi_n \]
since $S_m$ and $S_{n+m} - S_m$ are independent. Letting $\gamma_n = \log \pi_n$ transforms multiplication into addition.
Lemma 1.9.1. If $\gamma_{m+n} \ge \gamma_m + \gamma_n$ then as $n \to \infty$, $\gamma_n/n \to \sup_m \gamma_m/m$.
Proof. Clearly, $\limsup \gamma_n/n \le \sup \gamma_m/m$. To complete the proof, it suffices to prove
that for any $m$, $\liminf \gamma_n/n \ge \gamma_m/m$. Writing $n = km + \ell$ with $0 \le \ell < m$ and making
repeated use of the hypothesis gives $\gamma_n \ge k\gamma_m + \gamma_\ell$. Dividing by $n = km + \ell$ gives
\[ \frac{\gamma(n)}{n} \ge \left( \frac{km}{km + \ell} \right) \frac{\gamma(m)}{m} + \frac{\gamma(\ell)}{n} \]
Letting $n \to \infty$ and recalling $n = km + \ell$ with $0 \le \ell < m$ gives the desired result.
Lemma 1.9.1 implies that $\lim_{n\to\infty} \frac{1}{n} \log P(S_n \ge na) = \gamma(a)$ exists $\le 0$. It follows
from the formula for the limit that
\[ P(S_n \ge na) \le e^{n\gamma(a)} \qquad (1.9.1) \]
The last two observations give us some useful information about $\gamma(a)$.
Exercise 1.9.1. The following are equivalent: (a) γ (a) = −∞, (b) P (X1 ≥ a) = 0,
and (c) P (Sn ≥ na) = 0 for all n.
Exercise 1.9.2. Use the deﬁnition to conclude that if λ ∈ [0, 1] is rational then
γ (λa + (1 − λ)b) ≥ λγ (a) + (1 − λ)γ (b). Use monotonicity to conclude that the last
relationship holds for all λ ∈ [0, 1] so γ is concave and hence Lipschitz continuous on
compact subsets of $\{a : \gamma(a) > -\infty\}$.
The conclusions above are valid for any distribution. For the rest of this section, we will suppose:

(H1) $\varphi(\theta) = E\exp(\theta X_i) < \infty$ for some $\theta > 0$

Let $\theta_+ = \sup\{\theta : \varphi(\theta) < \infty\}$, $\theta_- = \inf\{\theta : \varphi(\theta) < \infty\}$ and note that $\varphi(\theta) < \infty$ for $\theta \in (\theta_-, \theta_+)$. (H1) implies that $EX_i^+ < \infty$ so $\mu = EX^+ - EX^- \in [-\infty, \infty)$. If $\theta > 0$, Chebyshev's inequality implies

$$e^{\theta na} P(S_n \ge na) \le E\exp(\theta S_n) = \varphi(\theta)^n$$

or, letting $\kappa(\theta) = \log\varphi(\theta)$,

$$P(S_n \ge na) \le \exp(-n\{a\theta - \kappa(\theta)\}) \tag{1.9.2}$$

Our first goal is to show:

Lemma 1.9.2. If $a > \mu$ and $\theta > 0$ is small then $a\theta - \kappa(\theta) > 0$.
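Proof. $\kappa(0) = \log\varphi(0) = 0$, so it suffices to show that (i) $\kappa$ is continuous at 0, (ii) differentiable on $(0, \theta_+)$, and (iii) $\kappa'(\theta) \to \mu$ as $\theta \to 0$. For then

$$a\theta - \kappa(\theta) = \int_0^\theta a - \kappa'(x)\,dx > 0$$

for small $\theta$.

Let $F(x) = P(X_i \le x)$. To prove (i) we note that if $0 < \theta < \theta_0 < \theta_+$

$$e^{\theta x} \le 1 + e^{\theta_0 x} \tag{$*$}$$

so by the dominated convergence theorem as $\theta \to 0$

$$\int e^{\theta x}\,dF \to \int 1\,dF = 1$$

To prove (ii) we note that if $|h| < h_0$ then

$$|e^{hx} - 1| = \left|\int_0^{hx} e^y\,dy\right| \le |hx|\,e^{h_0|x|}$$

so an application of the dominated convergence theorem shows that

$$\varphi'(\theta) = \lim_{h\to 0}\frac{\varphi(\theta+h) - \varphi(\theta)}{h} = \lim_{h\to 0}\int \frac{e^{hx}-1}{h}\,e^{\theta x}\,dF(x) = \int x e^{\theta x}\,dF(x) \quad\text{for } \theta \in (0, \theta_+)$$

From the last equation, it follows that $\kappa(\theta) = \log\varphi(\theta)$ has $\kappa'(\theta) = \varphi'(\theta)/\varphi(\theta)$. Using $(*)$ and the dominated convergence theorem gives (iii) and the proof is complete.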
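Having found an upper bound on $P(S_n \ge na)$, it is natural to optimize it by finding the maximum of $\theta a - \kappa(\theta)$:

$$\frac{d}{d\theta}\{\theta a - \log\varphi(\theta)\} = a - \varphi'(\theta)/\varphi(\theta)$$

so (assuming things are nice) the maximum occurs when $a = \varphi'(\theta)/\varphi(\theta)$. To turn the parenthetical clause into a mathematical hypothesis we begin by defining

$$F_\theta(x) = \frac{1}{\varphi(\theta)}\int_{-\infty}^x e^{\theta y}\,dF(y)$$

whenever $\varphi(\theta) < \infty$. It follows from the proof of Lemma 1.9.2 that if $\theta \in (\theta_-, \theta_+)$, $F_\theta$ is a distribution function with mean

$$\int x\,dF_\theta(x) = \frac{1}{\varphi(\theta)}\int_{-\infty}^\infty x e^{\theta x}\,dF(x) = \frac{\varphi'(\theta)}{\varphi(\theta)}$$

Repeating the proof in Lemma 1.9.2, it is easy to see that if $\theta \in (\theta_-, \theta_+)$ then

$$\varphi''(\theta) = \int_{-\infty}^\infty x^2 e^{\theta x}\,dF(x)$$

So we have

$$\frac{d}{d\theta}\,\frac{\varphi'(\theta)}{\varphi(\theta)} = \frac{\varphi''(\theta)}{\varphi(\theta)} - \left(\frac{\varphi'(\theta)}{\varphi(\theta)}\right)^2 = \int x^2\,dF_\theta(x) - \left(\int x\,dF_\theta(x)\right)^2 \ge 0$$

since the last expression is the variance of $F_\theta$. If we assume

(H2) the distribution $F$ is not a point mass at $\mu$

then $\varphi'(\theta)/\varphi(\theta)$ is strictly increasing and $a\theta - \log\varphi(\theta)$ is concave. Since we have $\varphi'(0)/\varphi(0) = \mu$, this shows that for each $a > \mu$ there is at most one $\theta_a \ge 0$ that solves $a = \varphi'(\theta_a)/\varphi(\theta_a)$, and this value of $\theta$ maximizes $a\theta - \log\varphi(\theta)$. Before discussing the existence of $\theta_a$ we will consider some examples.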
Example 1.9.1. Normal distribution.

$$\int e^{\theta x}(2\pi)^{-1/2}\exp(-x^2/2)\,dx = \exp(\theta^2/2)\int (2\pi)^{-1/2}\exp(-(x-\theta)^2/2)\,dx$$

The integrand in the last integral is the density of a normal distribution with mean $\theta$ and variance 1, so $\varphi(\theta) = \exp(\theta^2/2)$, $\theta \in (-\infty, \infty)$. In this case, $\varphi'(\theta)/\varphi(\theta) = \theta$ and

$$F_\theta(x) = e^{-\theta^2/2}\int_{-\infty}^x e^{\theta y}(2\pi)^{-1/2}e^{-y^2/2}\,dy$$

is a normal distribution with mean $\theta$ and variance 1.
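Example 1.9.2. Exponential distribution with parameter $\lambda$. If $\theta < \lambda$

$$\varphi(\theta) = \int_0^\infty e^{\theta x}\lambda e^{-\lambda x}\,dx = \lambda/(\lambda - \theta)$$

so $\varphi'(\theta)/\varphi(\theta) = 1/(\lambda - \theta)$ and

$$F_\theta(x) = \frac{\lambda - \theta}{\lambda}\int_0^x e^{\theta y}\lambda e^{-\lambda y}\,dy = \int_0^x (\lambda - \theta)e^{-(\lambda-\theta)y}\,dy$$

is an exponential distribution with parameter $\lambda - \theta$ and hence mean $1/(\lambda - \theta)$.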
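Example 1.9.3. Coin flips. $P(X_i = 1) = P(X_i = -1) = 1/2$

$$\varphi(\theta) = (e^\theta + e^{-\theta})/2$$
$$\varphi'(\theta)/\varphi(\theta) = (e^\theta - e^{-\theta})/(e^\theta + e^{-\theta})$$

$F_\theta(\{x\})/F(\{x\}) = e^{\theta x}/\varphi(\theta)$ so

$$F_\theta(\{1\}) = e^\theta/(e^\theta + e^{-\theta}) \quad\text{and}\quad F_\theta(\{-1\}) = e^{-\theta}/(e^\theta + e^{-\theta})$$

Example 1.9.4. Perverted exponential. Let $g(x) = Cx^{-3}e^{-x}$ for $x \ge 1$, $g(x) = 0$ otherwise, and choose $C$ so that $g$ is a probability density. In this case,

$$\varphi(\theta) = \int e^{\theta x} g(x)\,dx < \infty$$

if and only if $\theta \le 1$, and when $\theta \le 1$, we have

$$\frac{\varphi'(\theta)}{\varphi(\theta)} \le \frac{\varphi'(1)}{\varphi(1)} = \int_1^\infty Cx^{-2}\,dx \Big/ \int_1^\infty Cx^{-3}\,dx = 2$$

Recall $\theta_+ = \sup\{\theta : \varphi(\theta) < \infty\}$. In Examples 1.9.1 and 1.9.2, we have $\varphi'(\theta)/\varphi(\theta) \uparrow \infty$ as $\theta \uparrow \theta_+$ so we can solve $a = \varphi'(\theta)/\varphi(\theta)$ for any $a > \mu$. In Example 1.9.3, $\varphi'(\theta)/\varphi(\theta) \uparrow 1$ as $\theta \to \infty$, but we cannot hope for much more since $F$ and hence $F_\theta$ is supported on $\{-1, 1\}$.

Exercise 1.9.3. Let $x_o = \sup\{x : F(x) < 1\}$. Show that if $x_o < \infty$ then $\varphi(\theta) < \infty$ for all $\theta > 0$ and $\varphi'(\theta)/\varphi(\theta) \to x_o$ as $\theta \uparrow \infty$.

Example 1.9.4 presents a problem since we cannot solve $a = \varphi'(\theta)/\varphi(\theta)$ when $a > 2$. Theorem 1.9.5 will cover this problem case, but first we will treat the cases in which we can solve the equation.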
Theorem 1.9.3. Suppose in addition to (H1) and (H2) that there is a $\theta_a \in (0, \theta_+)$ so that $a = \varphi'(\theta_a)/\varphi(\theta_a)$. Then, as $n \to \infty$,

$$n^{-1}\log P(S_n \ge na) \to -a\theta_a + \log\varphi(\theta_a)$$

Proof. The fact that the limsup of the left-hand side $\le$ the right-hand side follows from (1.9.2). To prove the other inequality, pick $\lambda \in (\theta_a, \theta_+)$, let $X_1^\lambda, X_2^\lambda, \ldots$ be i.i.d. with distribution $F_\lambda$ and let $S_n^\lambda = X_1^\lambda + \cdots + X_n^\lambda$. Writing $dF/dF_\lambda$ for the Radon-Nikodym derivative of the associated measures, it is immediate from the definition that $dF/dF_\lambda = e^{-\lambda x}\varphi(\lambda)$. If we let $F_\lambda^n$ and $F^n$ denote the distributions of $S_n^\lambda$ and $S_n$, then

Lemma 1.9.4. $\dfrac{dF^n}{dF_\lambda^n} = e^{-\lambda x}\varphi(\lambda)^n$.

Proof. We will prove this by induction. The result holds when $n = 1$. For $n > 1$, we note that

$$F^n(z) = F^{n-1} * F(z) = \int_{-\infty}^\infty dF^{n-1}(x)\int_{-\infty}^{z-x} dF(y)$$
$$= \int dF_\lambda^{n-1}(x)\int dF_\lambda(y)\,1_{(x+y\le z)}\,e^{-\lambda(x+y)}\varphi(\lambda)^n$$
$$= E\left(1_{(S_{n-1}^\lambda + X_n^\lambda \le z)}\,e^{-\lambda(S_{n-1}^\lambda + X_n^\lambda)}\varphi(\lambda)^n\right)$$
$$= \int_{-\infty}^z dF_\lambda^n(u)\,e^{-\lambda u}\varphi(\lambda)^n$$

where in the last two equalities we have used Theorem 1.3.9 for $(S_{n-1}^\lambda, X_n^\lambda)$ and $S_n^\lambda$.

If $\nu > a$, then the lemma and monotonicity imply

$$P(S_n \ge na) \ge \int_{na}^{n\nu} e^{-\lambda x}\varphi(\lambda)^n\,dF_\lambda^n(x) \ge \varphi(\lambda)^n e^{-\lambda n\nu}\left(F_\lambda^n(n\nu) - F_\lambda^n(na)\right) \tag{$*$}$$

$F_\lambda$ has mean $\varphi'(\lambda)/\varphi(\lambda)$, so if we have $a < \varphi'(\lambda)/\varphi(\lambda) < \nu$, then the weak law of large numbers implies

$$F_\lambda^n(n\nu) - F_\lambda^n(na) \to 1 \quad\text{as } n \to \infty$$

From the last conclusion and $(*)$ it follows that

$$\liminf_{n\to\infty} n^{-1}\log P(S_n \ge na) \ge -\lambda\nu + \log\varphi(\lambda)$$

Since $\lambda > \theta_a$ and $\nu > a$ are arbitrary, the proof is complete.

To get a feel for what the answers look like, we consider our examples. To prepare
for the computations, we recall some important information:
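$$\kappa(\theta) = \log\varphi(\theta), \qquad \kappa'(\theta) = \varphi'(\theta)/\varphi(\theta), \qquad \theta_a \text{ solves } \kappa'(\theta_a) = a$$
$$\gamma(a) = \lim_{n\to\infty}(1/n)\log P(S_n \ge na) = -a\theta_a + \kappa(\theta_a)$$

Normal distribution (Example 1.9.1)

$$\kappa(\theta) = \theta^2/2, \qquad \kappa'(\theta) = \theta, \qquad \theta_a = a$$
$$\gamma(a) = -a\theta_a + \kappa(\theta_a) = -a^2/2$$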
Exercise 1.9.4. Check the last result by observing that Sn has a normal distribution
with mean 0 and variance n, and then using Theorem 1.1.4.
Exponential distribution (Example 1.9.2) with λ = 1
$$\kappa(\theta) = -\log(1-\theta), \qquad \kappa'(\theta) = 1/(1-\theta), \qquad \theta_a = 1 - 1/a$$
$$\gamma(a) = -a\theta_a + \kappa(\theta_a) = -a + 1 + \log a$$
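The recipe used in these two examples (solve $\kappa'(\theta_a) = a$, then evaluate $-a\theta_a + \kappa(\theta_a)$) can also be carried out numerically. The following is only a sketch: the helper names `solve_theta` and `rate` are illustrative and not from the text, bisection is one possible root finder among many, and the search interval for the normal case is capped at an arbitrary value.

```python
import math

def solve_theta(kappa_prime, a, lo, hi, iters=200):
    """Bisection for the root of kappa'(theta) = a on (lo, hi).

    Relies on kappa' being strictly increasing, which the text derives from (H2).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kappa_prime(mid) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate(kappa, kappa_prime, a, lo, hi):
    """gamma(a) = -a * theta_a + kappa(theta_a)."""
    theta_a = solve_theta(kappa_prime, a, lo, hi)
    return -a * theta_a + kappa(theta_a)

# Standard normal: kappa(theta) = theta^2 / 2 (search interval capped at 50, an arbitrary choice).
normal_rate = rate(lambda t: t * t / 2, lambda t: t, 1.5, 0.0, 50.0)

# Exponential with lambda = 1: kappa(theta) = -log(1 - theta), theta_+ = 1.
exp_rate = rate(lambda t: -math.log(1 - t), lambda t: 1 / (1 - t), 2.0, 0.0, 1.0 - 1e-12)

print(normal_rate, -(1.5 ** 2) / 2)               # numeric vs. closed form -a^2/2
print(exp_rate, -2.0 + 1 + math.log(2.0))         # numeric vs. closed form -a + 1 + log a
```

Because $\theta_a$ maximizes $a\theta - \kappa(\theta)$, a small error in the root affects the computed rate only to second order.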
With these two examples as models, the reader should be able to do
Exercise 1.9.5. Let X1 , X2 , . . . be i.i.d. Poisson with mean 1, and let Sn = X1 +
· · · + Xn . Find limn→∞ (1/n) log P (Sn ≥ na) for a > 1. The answer and another
proof can be found in Exercise 1.4 of Chapter 2.
Coin flips (Example 1.9.3). Here we take a different approach. To find the $\theta$ that makes the mean of $F_\theta = a$, we set $F_\theta(\{1\}) = e^\theta/(e^\theta + e^{-\theta}) = (1+a)/2$. Letting $x = e^\theta$ gives

$$2x = (1+a)(x + x^{-1})$$
$$(a-1)x^2 + (1+a) = 0$$

So $x = \sqrt{(1+a)/(1-a)}$ and $\theta_a = \log x = \{\log(1+a) - \log(1-a)\}/2$.

$$\varphi(\theta_a) = \frac{e^{\theta_a} + e^{-\theta_a}}{2} = \frac{e^{\theta_a}}{1+a} = \frac{1}{\sqrt{(1+a)(1-a)}}$$
$$\gamma(a) = -a\theta_a + \kappa(\theta_a) = -\{(1+a)\log(1+a) + (1-a)\log(1-a)\}/2$$
In Exercise 2.1.3, this result will be proved by a direct computation. Since the formula
for γ (a) is rather ugly, the following simpler bound is useful.
Exercise 1.9.6. Show that for coin flips $\varphi(\theta) \le \exp(\varphi(\theta) - 1) \le \exp(\beta\theta^2)$ for $|\theta| \le 1$ where $\beta = \sum_{n=1}^\infty 1/(2n)! = \cosh(1) - 1 \approx 0.543$, and use (1.9.2) to conclude that $P(S_n \ge an) \le \exp(-na^2/4\beta)$ for all $a \in [0, 1]$. It is customary to simplify this further by using $\beta \le \sum_{n=1}^\infty 2^{-n} = 1$.
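For coin flips the three quantities discussed above can be compared directly: the exact value of $(1/n)\log P(S_n \ge na)$, the rate $\gamma(a)$, and the cruder exponent $-a^2/4\beta$ from Exercise 1.9.6. The sketch below is illustrative only; the function names are invented here, and the exact tail is computed from the binomial distribution of the number of $+1$ flips.

```python
import math

def log_tail_rate(n, a):
    """(1/n) log P(S_n >= n*a) for fair +-1 coin flips, computed exactly.

    S_n = 2*B - n with B ~ Binomial(n, 1/2), so S_n >= n*a iff B >= n*(1+a)/2.
    """
    j_min = math.ceil(n * (1 + a) / 2 - 1e-9)  # small slack against float round-off
    count = sum(math.comb(n, j) for j in range(j_min, n + 1))
    return (math.log(count) - n * math.log(2)) / n

def gamma(a):
    """Rate function for coin flips, as derived in the text."""
    return -((1 + a) * math.log(1 + a) + (1 - a) * math.log(1 - a)) / 2

beta = math.cosh(1) - 1  # equals the sum over n >= 1 of 1/(2n)!

a = 0.5
for n in (100, 400, 2000):
    # columns: n, exact rate, limiting rate gamma(a), simpler exponent -a^2/(4 beta)
    print(n, log_tail_rate(n, a), gamma(a), -a * a / (4 * beta))
```

By (1.9.1) the exact rate never exceeds $\gamma(a)$, and it approaches $\gamma(a)$ from below as $n$ grows; the exponent $-a^2/4\beta$ is larger than $\gamma(a)$, which reflects the loss from the simplified bound.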
Turning now to the problematic values for which we cannot solve $a = \varphi'(\theta_a)/\varphi(\theta_a)$, we begin by observing that if $x_o = \sup\{x : F(x) < 1\}$ and $F$ is not a point mass at $x_o$ then $\varphi'(\theta)/\varphi(\theta) \uparrow x_o$ as $\theta \uparrow \infty$ but $\varphi'(\theta)/\varphi(\theta) < x_o$ for all $\theta < \infty$. However, the result for $a = x_o$ is trivial:

$$\frac{1}{n}\log P(S_n \ge nx_o) = \log P(X_i = x_o) \quad\text{for all } n$$

Exercise 1.9.7. Show that as $a \uparrow x_o$, $\gamma(a) \downarrow \log P(X_i = x_o)$.
When $x_o = \infty$, $\varphi'(\theta)/\varphi(\theta) \uparrow \infty$ as $\theta \uparrow \infty$, so the only case that remains is covered by

Theorem 1.9.5. Suppose $x_o = \infty$, $\theta_+ < \infty$, and $\varphi'(\theta)/\varphi(\theta)$ increases to a finite limit $a_0$ as $\theta \uparrow \theta_+$. If $a_0 \le a < \infty$

$$n^{-1}\log P(S_n \ge na) \to -a\theta_+ + \log\varphi(\theta_+)$$

i.e., $\gamma(a)$ is linear for $a \ge a_0$.

Proof. Since $(\log\varphi(\theta))' = \varphi'(\theta)/\varphi(\theta)$, integrating from 0 to $\theta_+$ shows that $\log(\varphi(\theta_+)) < \infty$. Letting $\theta = \theta_+$ in (1.9.2) shows that the limsup of the left-hand side $\le$ the right-hand side. To get the other direction we will use the transformed distribution $F_\lambda$, for $\lambda = \theta_+$. Letting $\theta \uparrow \theta_+$ and using the dominated convergence theorem for $x \le 0$ and the monotone convergence theorem for $x \ge 0$, we see that $F_\lambda$ has mean $a_0$. From $(*)$ in the proof of Theorem 1.9.3, we see that if $a_0 \le a < \nu = a + 3\epsilon$

$$P(S_n \ge na) \ge \varphi(\lambda)^n e^{-n\lambda\nu}\left(F_\lambda^n(n\nu) - F_\lambda^n(na)\right)$$

and hence

$$\frac{1}{n}\log P(S_n \ge na) \ge \log\varphi(\lambda) - \lambda\nu + \frac{1}{n}\log P(S_n^\lambda \in (na, n\nu])$$

Letting $X_1^\lambda, X_2^\lambda, \ldots$ be i.i.d. with distribution $F_\lambda$ and $S_n^\lambda = X_1^\lambda + \cdots + X_n^\lambda$, we have

$$P(S_n^\lambda \in (na, n\nu]) \ge P\{S_{n-1}^\lambda \in ((a_0 - \epsilon)n, (a_0 + \epsilon)n]\} \cdot P\{X_n^\lambda \in ((a - a_0 + \epsilon)n, (a - a_0 + 2\epsilon)n]\}$$
$$\ge \frac{1}{2}\,P\{X_n^\lambda \in ((a - a_0 + \epsilon)n, (a - a_0 + \epsilon)(n+1)]\}$$

for large $n$ by the weak law of large numbers. To get a lower bound on the right-hand side of the last equation, we observe that

$$\limsup_{n\to\infty}\frac{1}{n}\log P(X_1^\lambda \in ((a - a_0 + \epsilon)n, (a - a_0 + \epsilon)(n+1)]) = 0$$

for if the lim sup was $< 0$, we would have $E\exp(\eta X_1^\lambda) < \infty$ for some $\eta > 0$ and hence $E\exp((\lambda + \eta)X_1) < \infty$, contradicting the definition of $\lambda = \theta_+$. To finish the argument now, we recall that Lemma 1.9.1 implies that

$$\lim_{n\to\infty}\frac{1}{n}\log P(S_n \ge na) = \gamma(a)$$

exists, so our lower bound on the lim sup is good enough.
By adapting the proof of the last result, you can show that (H1) is necessary for
exponential convergence:
Exercise 1.9.8. Suppose $EX_i = 0$ and $E\exp(\theta X_i) = \infty$ for all $\theta > 0$. Then

$$\frac{1}{n}\log P(S_n \ge na) \to 0 \quad\text{for all } a > 0$$

Exercise 1.9.9. Suppose $EX_i = 0$. Show that if $\epsilon > 0$ then

$$\liminf_{n\to\infty} P(S_n \ge na)/nP(X_1 \ge n(a+\epsilon)) \ge 1$$

Hint: Let $F_n = \{X_i \ge n(a+\epsilon) \text{ for exactly one } i \le n\}$.

Chapter 2 Central Limit Theorems
The ﬁrst four sections of this chapter develop the central limit theorem. The last
ﬁve treat various extensions and complements. We begin this chapter by considering
special cases of these results that can be treated by elementary computations.

2.1 The De Moivre-Laplace Theorem

Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_1 = 1) = P(X_1 = -1) = 1/2$ and let $S_n = X_1 + \cdots + X_n$. In words, we are betting \$1 on the flipping of a fair coin and $S_n$ is our winnings at time $n$. If $n$ and $k$ are integers

$$P(S_{2n} = 2k) = \binom{2n}{n+k}2^{-2n}$$

since $S_{2n} = 2k$ if and only if there are $n+k$ flips that are $+1$ and $n-k$ flips that are $-1$ in the first $2n$. The first factor gives the number of such outcomes and the second the probability of each one. Stirling's formula (see Feller, Vol. I. (1968), p. 52) tells us

$$n! \sim n^n e^{-n}\sqrt{2\pi n} \quad\text{as } n \to \infty \tag{2.1.1}$$

where $a_n \sim b_n$ means $a_n/b_n \to 1$ as $n \to \infty$, so

$$\binom{2n}{n+k} = \frac{(2n)!}{(n+k)!\,(n-k)!} \sim \frac{(2n)^{2n}}{(n+k)^{n+k}(n-k)^{n-k}} \cdot \frac{(2\pi(2n))^{1/2}}{(2\pi(n+k))^{1/2}(2\pi(n-k))^{1/2}}$$

and we have

$$\binom{2n}{n+k}2^{-2n} \sim \left(1 + \frac{k}{n}\right)^{-n-k} \cdot \left(1 - \frac{k}{n}\right)^{-n+k} \cdot (\pi n)^{-1/2} \cdot \left(1 + \frac{k}{n}\right)^{-1/2} \cdot \left(1 - \frac{k}{n}\right)^{-1/2} \tag{2.1.2}$$

The first two terms on the right are

$$= \left(1 - \frac{k^2}{n^2}\right)^{-n} \cdot \left(1 + \frac{k}{n}\right)^{-k} \cdot \left(1 - \frac{k}{n}\right)^{k}$$

A little calculus shows that:

Lemma 2.1.1. If $c_j \to 0$, $a_j \to \infty$ and $a_j c_j \to \lambda$ then $(1 + c_j)^{a_j} \to e^\lambda$.
Proof. As x → 0, log(1 + x)/x → 1, so aj log(1 + cj ) → λ and the desired result
follows.
Exercise 2.1.1. Generalize the last proof to conclude that if $\max_{1\le j\le n}|c_{j,n}| \to 0$, $\sum_{j=1}^n c_{j,n} \to \lambda$, and $\sup_n \sum_{j=1}^n |c_{j,n}| < \infty$ then $\prod_{j=1}^n (1 + c_{j,n}) \to e^\lambda$.
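Using Lemma 2.1.1 now, we see that if $2k = x\sqrt{2n}$, i.e., $k = x\sqrt{n/2}$, then

$$\left(1 - \frac{k^2}{n^2}\right)^{-n} = \left(1 - x^2/2n\right)^{-n} \to e^{x^2/2}$$
$$\left(1 + \frac{k}{n}\right)^{-k} = \left(1 + x/\sqrt{2n}\right)^{-x\sqrt{n/2}} \to e^{-x^2/2}$$
$$\left(1 - \frac{k}{n}\right)^{k} = \left(1 - x/\sqrt{2n}\right)^{x\sqrt{n/2}} \to e^{-x^2/2}$$

For this choice of $k$, $k/n \to 0$, so

$$\left(1 + \frac{k}{n}\right)^{-1/2} \cdot \left(1 - \frac{k}{n}\right)^{-1/2} \to 1$$

and putting things together gives:

Theorem 2.1.2. If $2k/\sqrt{2n} \to x$ then $P(S_{2n} = 2k) \sim (\pi n)^{-1/2}e^{-x^2/2}$.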
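Theorem 2.1.2 is easy to check numerically, since the exact probability is a single binomial coefficient. The sketch below (function names chosen here for illustration) prints the ratio of the exact value to the approximation for a few values of $n$, with $k$ chosen so that $x = 2k/\sqrt{2n}$ stays near 1.

```python
import math

def exact_prob(n, k):
    """P(S_{2n} = 2k) = C(2n, n+k) 2^{-2n} for the simple random walk."""
    return math.comb(2 * n, n + k) / 2 ** (2 * n)

def local_approx(n, k):
    """(pi n)^{-1/2} exp(-x^2/2) with x = 2k / sqrt(2n)."""
    x = 2 * k / math.sqrt(2 * n)
    return math.exp(-x * x / 2) / math.sqrt(math.pi * n)

for n in (50, 500, 5000):
    k = round(math.sqrt(n / 2))  # keeps x = 2k/sqrt(2n) close to 1
    print(n, k, exact_prob(n, k) / local_approx(n, k))
```

The printed ratios should approach 1 as $n$ grows, which is what $\sim$ asserts.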
Our next step is to compute

$$P(a\sqrt{2n} \le S_{2n} \le b\sqrt{2n}) = \sum_{m \in [a\sqrt{2n},\, b\sqrt{2n}] \cap 2\mathbf{Z}} P(S_{2n} = m)$$

Changing variables $m = x\sqrt{2n}$, we have that the above is

$$\approx \sum_{x \in [a,b] \cap (2\mathbf{Z}/\sqrt{2n})} (2\pi)^{-1/2}e^{-x^2/2} \cdot (2/n)^{1/2}$$

where $2\mathbf{Z}/\sqrt{2n} = \{2z/\sqrt{2n} : z \in \mathbf{Z}\}$. We have multiplied and divided by $\sqrt{2}$ since the space between points in the sum is $(2/n)^{1/2}$, so if $n$ is large the sum above is

$$\approx \int_a^b (2\pi)^{-1/2}e^{-x^2/2}\,dx$$

The integrand is the density of the (standard) normal distribution, so changing notation we can write the last quantity as $P(a \le \chi \le b)$ where $\chi$ is a random variable with that distribution.

It is not hard to fill in the details to get:

Theorem 2.1.3. The De Moivre-Laplace Theorem. If $a < b$ then as $m \to \infty$

$$P(a \le S_m/\sqrt{m} \le b) \to \int_a^b (2\pi)^{-1/2}e^{-x^2/2}\,dx$$

(To remove the restriction to even integers observe $S_{2n+1} = S_{2n} \pm 1$.) The last result is a special case of the central limit theorem given in Section 2.4, so further details are left to the reader.
Another special case that can be treated with Stirling’s formula is
Exercise 2.1.2. Let $X_1, X_2, \ldots$ be independent and have a Poisson distribution with mean 1. Then $S_n = X_1 + \cdots + X_n$ has a Poisson distribution with mean $n$, i.e., $P(S_n = k) = e^{-n}n^k/k!$. Use Stirling's formula to show that if $(k - n)/\sqrt{n} \to x$ then

$$\sqrt{2\pi n}\,P(S_n = k) \to \exp(-x^2/2)$$

As in the case of coin flips it follows that

$$P(a \le (S_n - n)/\sqrt{n} \le b) \to \int_a^b (2\pi)^{-1/2}e^{-x^2/2}\,dx$$

but proving the last conclusion is not part of the exercise.
Stirling’s formula can also be used to compute some large deviations probabilities
considered in Section 1.9. In the next two exercises, X1 , X2 , . . . are i.i.d. and Sn =
X1 + · · · + Xn . In each case you should begin by considering P (Sn = k ) when k/n → a
and then relate P (Sn = j + 1) to P (Sn = j ) to show P (Sn ≥ k ) ≤ CP (Sn = k ).
Exercise 2.1.3. Suppose $P(X_i = 1) = P(X_i = -1) = 1/2$. Show that if $a \in (0, 1)$

$$\frac{1}{2n}\log P(S_{2n} \ge 2na) \to -\gamma(a)$$

where $\gamma(a) = \frac{1}{2}\{(1+a)\log(1+a) + (1-a)\log(1-a)\}$.

Exercise 2.1.4. Suppose $P(X_i = k) = e^{-1}/k!$ for $k = 0, 1, \ldots$ Show that if $a > 1$

$$\frac{1}{n}\log P(S_n \ge na) \to a - 1 - a\log a$$

2.2 Weak Convergence

In this section, we will define the type of convergence that appears in the central limit
theorem and explore some of its properties. A sequence of distribution functions is
said to converge weakly to a limit F (written Fn ⇒ F ) if Fn (y ) → F (y ) for all y that
are continuity points of F . A sequence of random variables Xn is said to converge
weakly or converge in distribution to a limit X∞ (written Xn ⇒ X∞ ) if their
distribution functions Fn (x) = P (Xn ≤ x) converge weakly. To see that convergence
at continuity points is enough to identify the limit, observe that F is right continuous
and by Exercise 1.1.8 in Chapter 1, the discontinuities of F are at most a countable
set.

2.2.1 Examples

Two examples of weak convergence that we have seen earlier are:

Example 2.2.1. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i = 1) = P(X_i = -1) = 1/2$ and let $S_n = X_1 + \cdots + X_n$. Then (1.5) implies

$$F_n(y) = P(S_n/\sqrt{n} \le y) \to \int_{-\infty}^y (2\pi)^{-1/2}e^{-x^2/2}\,dx$$

Example 2.2.2. Let $X_1, X_2, \ldots$ be i.i.d. with distribution $F$. The Glivenko-Cantelli theorem ((7.4) in Chapter 1) implies that for almost every $\omega$,

$$F_n(y) = n^{-1}\sum_{m=1}^n 1_{(X_m(\omega) \le y)} \to F(y) \quad\text{for all } y$$

In the last two examples convergence occurred for all $y$, even though in the second
case the distribution function could have discontinuities. The next example shows
why we restrict our attention to continuity points.
Example 2.2.3. Let X have distribution F . Then X + 1/n has distribution
Fn (x) = P (X + 1/n ≤ x) = F (x − 1/n)
As n → ∞, Fn (x) → F (x−) = limy↑x F (y ) so convergence only occurs at continuity
points.
Example 2.2.4. Waiting for rare events. Let Xp be the number of trials needed
to get a success in a sequence of independent trials with success probability p. Then
$P(X_p \ge n) = (1-p)^{n-1}$ for $n = 1, 2, 3, \ldots$ and it follows from Lemma 2.1.1 that as $p \to 0$,

$$P(pX_p > x) \to e^{-x} \quad\text{for all } x \ge 0$$
In words, pXp converges weakly to an exponential distribution.
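A quick numerical look at this limit, offered only as an illustration (the function name is made up here): since $P(X_p > m) = (1-p)^m$ for integers $m \ge 0$, the tail $P(pX_p > x)$ equals $(1-p)^{\lfloor x/p \rfloor}$, which can be compared with $e^{-x}$ for decreasing $p$.

```python
import math

def geometric_tail(p, x):
    """P(p * X_p > x) where P(X_p >= n) = (1-p)^(n-1), n = 1, 2, ...

    P(X_p > m) = (1-p)^m for integers m >= 0, so the tail is (1-p)^floor(x/p).
    """
    return (1 - p) ** math.floor(x / p)

x = 1.5
for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, geometric_tail(p, x), math.exp(-x))
```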
Example 2.2.5. Birthday problem. Let $X_1, X_2, \ldots$ be independent and uniformly distributed on $\{1, \ldots, N\}$, and let $T_N = \min\{n : X_n = X_m \text{ for some } m < n\}$.

$$P(T_N > n) = \prod_{m=2}^n \left(1 - \frac{m-1}{N}\right)$$

When $N = 365$ this is the probability that no two people in a group of size $n$ have the same birthday (assuming all birthdays are equally likely). Using Exercise 1.1, it is easy to see that

$$P(T_N/N^{1/2} > x) \to \exp(-x^2/2) \quad\text{for all } x \ge 0$$

Taking $N = 365$ and noting $22/\sqrt{365} = 1.1515$ and $(1.1515)^2/2 = 0.6630$, this says that

$$P(T_{365} > 22) \approx e^{-0.6630} \approx 0.515$$

This answer is 2% smaller than the true probability 0.524.
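The two numbers quoted above can be reproduced in a few lines. This is a sketch with illustrative function names: the first evaluates the product formula for $P(T_N > n)$ exactly (up to floating point), the second evaluates the limiting expression $\exp(-x^2/2)$ at $x = n/\sqrt{N}$.

```python
import math

def prob_no_match(n, N=365):
    """P(T_N > n) = prod_{m=2}^{n} (1 - (m-1)/N)."""
    p = 1.0
    for m in range(2, n + 1):
        p *= 1 - (m - 1) / N
    return p

def limit_approx(n, N=365):
    """exp(-x^2/2) evaluated at x = n / sqrt(N)."""
    x = n / math.sqrt(N)
    return math.exp(-x * x / 2)

exact = prob_no_match(22)
approx = limit_approx(22)
print(round(exact, 3), round(approx, 3))  # compare with 0.524 and 0.515 in the text
```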
Before giving our sixth example, we need a simple result called Scheffé's Theorem. Suppose we have probability densities $f_n$, $1 \le n \le \infty$, and $f_n \to f_\infty$ pointwise as $n \to \infty$. Then for all Borel sets $B$

$$\left|\int_B f_n(x)\,dx - \int_B f_\infty(x)\,dx\right| \le \int |f_n(x) - f_\infty(x)|\,dx = 2\int (f_\infty(x) - f_n(x))^+\,dx \to 0$$

by the dominated convergence theorem, the equality following from the fact that the $f_n \ge 0$ and have integral $= 1$. Writing $\mu_n$ for the corresponding measures, we have shown that the total variation norm

$$\|\mu_n - \mu_\infty\| \equiv \sup_B |\mu_n(B) - \mu_\infty(B)| \to 0$$

a conclusion stronger than weak convergence. (Take $B = (-\infty, x]$.) The example $\mu_n$ = a point mass at $1/n$ (with $1/\infty = 0$) shows that we may have $\mu_n \Rightarrow \mu_\infty$ with $\|\mu_n - \mu_\infty\| = 1$ for all $n$.
Exercise 2.2.1. Give an example of random variables Xn with densities fn so that
Xn ⇒ a uniform distribution on (0,1) but fn (x) does not converge to 1 for any
x ∈ [0, 1].
Example 2.2.6. Central order statistic. Put $(2n+1)$ points at random in (0,1), i.e., with locations that are independent and uniformly distributed. Let $V_{n+1}$ be the $(n+1)$th largest point. It is easy to see that

Lemma 2.2.1. $V_{n+1}$ has density function

$$f_{V_{n+1}}(x) = (2n+1)\binom{2n}{n}x^n(1-x)^n$$

Proof. There are $2n+1$ ways to pick the observation that falls at $x$, then we have to pick $n$ indices for observations $< x$, which can be done in $\binom{2n}{n}$ ways. Once we have decided on the indices that will land $< x$ and $> x$, the probability the corresponding random variables will do what we want is $x^n(1-x)^n$, and the probability density that the remaining one will land at $x$ is 1. If you don't like the previous sentence compute the probability $X_1 < x - \epsilon, \ldots, X_n < x - \epsilon$, $x - \epsilon < X_{n+1} < x + \epsilon$, $X_{n+2} > x + \epsilon, \ldots, X_{2n+1} > x + \epsilon$ then let $\epsilon \to 0$.

To compute the density function of $Y_n = 2(V_{n+1} - 1/2)\sqrt{2n}$, we use Exercise 1.10 in Chapter 1, or simply change variables $x = 1/2 + y/2\sqrt{2n}$, $dx = dy/2\sqrt{2n}$ to get

$$f_{Y_n}(y) = (2n+1)\binom{2n}{n}\left(\frac{1}{2} + \frac{y}{2\sqrt{2n}}\right)^n\left(\frac{1}{2} - \frac{y}{2\sqrt{2n}}\right)^n\frac{1}{2\sqrt{2n}}$$
$$= \binom{2n}{n}2^{-2n} \cdot (1 - y^2/2n)^n \cdot \frac{2n+1}{2n} \cdot \sqrt{\frac{n}{2}}$$

The first factor is $P(S_{2n} = 0)$ for a simple random walk so Theorem 2.1.2 and Lemma 2.1.1 imply that

$$f_{Y_n}(y) \to (2\pi)^{-1/2}\exp(-y^2/2) \quad\text{as } n \to \infty$$

Here and in what follows we write $P(Y_n = y)$ for the density function of $Y_n$. Using Scheffé's theorem now, we conclude that $Y_n$ converges weakly to a standard normal distribution.
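The pointwise convergence of $f_{Y_n}$ to the standard normal density can be observed numerically. The sketch below evaluates the first displayed formula for $f_{Y_n}(y)$ on the log scale (using `math.lgamma` for the binomial coefficient, an implementation choice made here to avoid overflow) and prints it next to the normal density; the function names are illustrative.

```python
import math

def density_Y(n, y):
    """Density of Y_n = 2 (V_{n+1} - 1/2) sqrt(2n), via the change of variables in the text."""
    s = 2 * math.sqrt(2 * n)
    x = 0.5 + y / s
    if not 0 < x < 1:
        return 0.0
    # log of C(2n, n), computed through log-gamma to stay within float range
    log_binom = math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1)
    log_f = (math.log(2 * n + 1) + log_binom
             + n * math.log(x) + n * math.log(1 - x) - math.log(s))
    return math.exp(log_f)

def normal_density(y):
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

for n in (5, 50, 500):
    print(n, density_Y(n, 1.0), normal_density(1.0))
```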
Exercise 2.2.2. Convergence of maxima. Let X1 , X2 , . . . be independent with
distribution F , and let Mn = maxm≤n Xm . Then P (Mn ≤ x) = F (x)n . Prove the
following limit laws for Mn :
(i) If $F(x) = 1 - x^{-\alpha}$ for $x \ge 1$ where $\alpha > 0$ then for $y > 0$

$$P(M_n/n^{1/\alpha} \le y) \to \exp(-y^{-\alpha})$$

(ii) If $F(x) = 1 - |x|^\beta$ for $-1 \le x \le 0$ where $\beta > 0$ then for $y < 0$

$$P(n^{1/\beta}M_n \le y) \to \exp(-|y|^\beta)$$

(iii) If $F(x) = 1 - e^{-x}$ for $x \ge 0$ then for all $y \in (-\infty, \infty)$

$$P(M_n - \log n \le y) \to \exp(-e^{-y})$$
The limits that appear above are called the extreme value distributions. The last
one is called the double exponential or Gumbel distribution. Necessary and
suﬃcient conditions for (Mn − bn )/an to converge to these limits were obtained by
Gnedenko (1943). For a recent treatment, see Resnick (1987).
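As a numerical illustration of case (iii) only (not a proof, and with function names invented here), the identity $P(M_n \le x) = F(x)^n$ gives the exact distribution of $M_n - \log n$ for exponential variables, which can be compared with the Gumbel limit.

```python
import math

def max_cdf_exponential(n, y):
    """P(M_n - log n <= y) = F(y + log n)^n for F(x) = 1 - e^{-x}, x >= 0."""
    x = y + math.log(n)
    return (1 - math.exp(-x)) ** n if x >= 0 else 0.0

def gumbel_cdf(y):
    """The double exponential limit exp(-e^{-y})."""
    return math.exp(-math.exp(-y))

for n in (10, 100, 10000):
    print(n, max_cdf_exponential(n, 0.5), gumbel_cdf(0.5))
```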
Exercise 2.2.3. Let $X_1, X_2, \ldots$ be i.i.d. and have the standard normal distribution.

(i) From Theorem 1.1.4, we know

$$P(X_i > x) \sim \frac{1}{\sqrt{2\pi}\,x}e^{-x^2/2} \quad\text{as } x \to \infty$$

Use this to conclude that for any real number $\theta$

$$P(X_i > x + (\theta/x))/P(X_i > x) \to e^{-\theta}$$

(ii) Show that if we define $b_n$ by $P(X_i > b_n) = 1/n$

$$P(b_n(M_n - b_n) \le x) \to \exp(-e^{-x})$$

(iii) Show that $b_n \sim (2\log n)^{1/2}$ and conclude $M_n/(2\log n)^{1/2} \to 1$ in probability.

2.2.2 Theory

The next result is useful for proving things about weak convergence.
Theorem 2.2.2. If $F_n \Rightarrow F_\infty$ then there are random variables $Y_n$, $1 \le n \le \infty$, with distribution $F_n$ so that $Y_n \to Y_\infty$ a.s.

Proof. Let $\Omega = (0, 1)$, $\mathcal{F}$ = Borel sets, $P$ = Lebesgue measure, and let $Y_n(x) = \sup\{y : F_n(y) < x\}$. By (1.1) in Chapter 1, $Y_n$ has distribution $F_n$. We will now show that $Y_n(x) \to Y_\infty(x)$ for all but a countable number of $x$. To do this, it is convenient to write $Y_n(x)$ as $F_n^{-1}(x)$ and drop the subscript when $n = \infty$. We begin by identifying the exceptional set. Let $a_x = \sup\{y : F(y) < x\}$, $b_x = \inf\{y : F(y) > x\}$, and $\Omega_0 = \{x : (a_x, b_x) = \emptyset\}$ where $(a_x, b_x)$ is the open interval with the indicated endpoints. $\Omega - \Omega_0$ is countable since the $(a_x, b_x)$ are disjoint and each nonempty interval contains a different rational number. If $x \in \Omega_0$ then $F(y) < x$ for $y < F^{-1}(x)$ and $F(z) > x$ for $z > F^{-1}(x)$. To prove that $F_n^{-1}(x) \to F^{-1}(x)$ for $x \in \Omega_0$, there are two things to show:

(a) $\liminf_{n\to\infty} F_n^{-1}(x) \ge F^{-1}(x)$

Proof of (a). Let $y < F^{-1}(x)$ be such that $F$ is continuous at $y$. Since $x \in \Omega_0$, $F(y) < x$ and if $n$ is sufficiently large $F_n(y) < x$, i.e., $F_n^{-1}(x) \ge y$. Since this holds for all $y$ satisfying the indicated restrictions, the result follows.

(b) $\limsup_{n\to\infty} F_n^{-1}(x) \le F^{-1}(x)$

Proof of (b). Let $y > F^{-1}(x)$ be such that $F$ is continuous at $y$. Since $x \in \Omega_0$, $F(y) > x$ and if $n$ is sufficiently large $F_n(y) > x$, i.e., $F_n^{-1}(x) \le y$. Since this holds for all $y$ satisfying the indicated restrictions, the result follows and we have completed the proof.
Theorem 2.2.2 allows us to immediately generalize some of our earlier results.
Exercise 2.2.4. Fatou's lemma. Let $g \ge 0$ be continuous. If $X_n \Rightarrow X_\infty$ then

$$\liminf_{n\to\infty} Eg(X_n) \ge Eg(X_\infty)$$

Exercise 2.2.5. Integration to the limit. Suppose $g, h$ are continuous with $g(x) > 0$, and $|h(x)|/g(x) \to 0$ as $|x| \to \infty$. If $F_n \Rightarrow F$ and $\int g(x)\,dF_n(x) \le C < \infty$ then

$$\int h(x)\,dF_n(x) \to \int h(x)\,dF(x)$$

The next result illustrates the usefulness of Theorem 2.2.2 and gives an equivalent
deﬁnition of weak convergence that makes sense in any topological space.
Theorem 2.2.3. Xn ⇒ X∞ if and only if for every bounded continuous function g
we have Eg (Xn ) → Eg (X∞ ).
Proof. Let $Y_n$ have the same distribution as $X_n$ and converge a.s. Since $g$ is continuous $g(Y_n) \to g(Y_\infty)$ a.s. and the bounded convergence theorem implies

$$Eg(X_n) = Eg(Y_n) \to Eg(Y_\infty) = Eg(X_\infty)$$

To prove the converse let

$$g_{x,\epsilon}(y) = \begin{cases} 1 & y \le x \\ 0 & y \ge x + \epsilon \\ \text{linear} & x \le y \le x + \epsilon \end{cases}$$

Since $g_{x,\epsilon}(y) = 1$ for $y \le x$, $g_{x,\epsilon}$ is continuous, and $g_{x,\epsilon}(y) = 0$ for $y > x + \epsilon$,

$$\limsup_{n\to\infty} P(X_n \le x) \le \limsup_{n\to\infty} Eg_{x,\epsilon}(X_n) = Eg_{x,\epsilon}(X_\infty) \le P(X_\infty \le x + \epsilon)$$

Letting $\epsilon \to 0$ gives $\limsup_{n\to\infty} P(X_n \le x) \le P(X_\infty \le x)$. The last conclusion is valid for any $x$. To get the other direction, we observe

$$\liminf_{n\to\infty} P(X_n \le x) \ge \liminf_{n\to\infty} Eg_{x-\epsilon,\epsilon}(X_n) = Eg_{x-\epsilon,\epsilon}(X_\infty) \ge P(X_\infty \le x - \epsilon)$$

Letting $\epsilon \to 0$ gives $\liminf_{n\to\infty} P(X_n \le x) \ge P(X_\infty < x) = P(X_\infty \le x)$ if $x$ is a continuity point. The results for the lim sup and the lim inf combine to give the desired result.
The next result is a trivial but useful generalization of Theorem 2.2.3.
Theorem 2.2.4. Continuous mapping theorem. Let $g$ be a measurable function and $D_g = \{x : g \text{ is discontinuous at } x\}$. If $X_n \Rightarrow X_\infty$ and $P(X_\infty \in D_g) = 0$ then $g(X_n) \Rightarrow g(X_\infty)$. If in addition $g$ is bounded then $Eg(X_n) \to Eg(X_\infty)$.

Remark. $D_g$ is always a Borel set. See Exercise 1.2.6.

Proof. Let $Y_n =_d X_n$ with $Y_n \to Y_\infty$ a.s. If $f$ is continuous then $D_{f\circ g} \subset D_g$ so $P(Y_\infty \in D_{f\circ g}) = 0$ and it follows that $f(g(Y_n)) \to f(g(Y_\infty))$ a.s. If, in addition, $f$ is bounded then the bounded convergence theorem implies $Ef(g(Y_n)) \to Ef(g(Y_\infty))$. Since this holds for all bounded continuous functions, it follows from Theorem 2.2.3 that $g(X_n) \Rightarrow g(X_\infty)$.

The second conclusion is easier. Since $P(Y_\infty \in D_g) = 0$, $g(Y_n) \to g(Y_\infty)$ a.s., and the desired result follows from the bounded convergence theorem.
The next result provides a number of useful alternative deﬁnitions of weak convergence.
Theorem 2.2.5. The following statements are equivalent: (i) Xn ⇒ X∞
(ii) For all open sets G, lim inf n→∞ P (Xn ∈ G) ≥ P (X∞ ∈ G).
(iii) For all closed sets K , lim supn→∞ P (Xn ∈ K ) ≤ P (X∞ ∈ K ).
(iv) For all sets A with P (X∞ ∈ ∂A) = 0, limn→∞ P (Xn ∈ A) = P (X∞ ∈ A).
Remark. To help remember the directions of the inequalities in (ii) and (iii), consider
the special case in which P (Xn = xn ) = 1. In this case, if xn ∈ G and xn → x∞ ∈ ∂G,
then P (Xn ∈ G) = 1 for all n but P (X∞ ∈ G) = 0. Letting K = Gc gives an example
for (iii).
Proof. We will prove four things and leave it to the reader to check that we have
proved the result given above.
(i) implies (ii): Let $Y_n$ have the same distribution as $X_n$ and $Y_n \to Y_\infty$ a.s. Since $G$ is open

$$\liminf_{n\to\infty} 1_G(Y_n) \ge 1_G(Y_\infty)$$

so Fatou's Lemma implies

$$\liminf_{n\to\infty} P(Y_n \in G) \ge P(Y_\infty \in G)$$

(ii) is equivalent to (iii): This follows easily from: $A$ is open if and only if $A^c$ is closed and $P(A) + P(A^c) = 1$.

(ii) and (iii) imply (iv): Let $K = \bar{A}$ and $G = A^o$ be the closure and interior of $A$ respectively. The boundary of $A$, $\partial A = \bar{A} - A^o$ and $P(X_\infty \in \partial A) = 0$ so

$$P(X_\infty \in K) = P(X_\infty \in A) = P(X_\infty \in G)$$

Using (ii) and (iii) now

$$\limsup_{n\to\infty} P(X_n \in A) \le \limsup_{n\to\infty} P(X_n \in K) \le P(X_\infty \in K) = P(X_\infty \in A)$$
$$\liminf_{n\to\infty} P(X_n \in A) \ge \liminf_{n\to\infty} P(X_n \in G) \ge P(X_\infty \in G) = P(X_\infty \in A)$$

(iv) implies (i): Let $x$ be such that $P(X_\infty = x) = 0$, i.e., $x$ is a continuity point of $F$, and let $A = (-\infty, x]$.
The next result is useful in studying limits of sequences of distributions.
Theorem 2.2.6. Helly’s selection theorem. For every sequence Fn of distribution
functions, there is a subsequence Fn(k) and a right continuous nondecreasing function
F so that limk→∞ Fn(k) (y ) = F (y ) at all continuity points y of F .
Remark. The limit may not be a distribution function. For example if $a + b + c = 1$ and $F_n(x) = a\,1_{(x \ge n)} + b\,1_{(x \ge -n)} + c\,G(x)$ where $G$ is a distribution function, then $F_n(x) \to F(x) = b + cG(x)$,

$$\lim_{x\downarrow -\infty} F(x) = b \quad\text{and}\quad \lim_{x\uparrow\infty} F(x) = b + c = 1 - a$$

In words, an amount of mass $a$ escapes to $+\infty$, and mass $b$ escapes to $-\infty$. The type of convergence that occurs in Theorem 2.2.6 is sometimes called vague convergence, and will be denoted here by $\Rightarrow_v$.

Proof. The first step is a diagonal argument. Let $q_1, q_2, \ldots$ be an enumeration of the rationals. Since for each $k$, $F_m(q_k) \in [0, 1]$ for all $m$, there is a sequence $m_k(i) \to \infty$ that is a subsequence of $m_{k-1}(j)$ (let $m_0(j) \equiv j$) so that

$$F_{m_k(i)}(q_k) \text{ converges to } G(q_k) \text{ as } i \to \infty$$

Let $F_{n(k)} = F_{m_k(k)}$. By construction $F_{n(k)}(q) \to G(q)$ for all rational $q$. The function $G$ may not be right continuous but $F(x) = \inf\{G(q) : q \in \mathbf{Q}, q > x\}$ is since

$$\lim_{x_n \downarrow x} F(x_n) = \inf\{G(q) : q \in \mathbf{Q}, q > x_n \text{ for some } n\} = \inf\{G(q) : q \in \mathbf{Q}, q > x\} = F(x)$$

To complete the proof, let $x$ be a continuity point of $F$. Pick rationals $r_1, r_2, s$ with $r_1 < r_2 < x < s$ so that

$$F(x) - \epsilon < F(r_1) \le F(r_2) \le F(x) \le F(s) < F(x) + \epsilon$$

Since $F_{n(k)}(r_2) \to G(r_2) \ge F(r_1)$, and $F_{n(k)}(s) \to G(s) \le F(s)$ it follows that if $k$ is large

$$F(x) - \epsilon < F_{n(k)}(r_2) \le F_{n(k)}(x) \le F_{n(k)}(s) < F(x) + \epsilon$$

which is the desired conclusion.

The last result raises a question: When can we conclude that no mass is lost in
the limit in Theorem 2.2.6?
Theorem 2.2.7. Every subsequential limit is the distribution function of a probability measure if and only if the sequence $F_n$ is tight, i.e., for all $\epsilon > 0$ there is an $M_\epsilon$ so that

$$\limsup_{n\to\infty}\ 1 - F_n(M_\epsilon) + F_n(-M_\epsilon) \le \epsilon$$

Proof. Suppose the sequence is tight and $F_{n(k)} \Rightarrow_v F$. Let $r < -M_\epsilon$ and $s > M_\epsilon$ be continuity points of $F$. Since $F_{n(k)}(r) \to F(r)$ and $F_{n(k)}(s) \to F(s)$, we have

$$1 - F(s) + F(r) = \lim_{k\to\infty}\ 1 - F_{n(k)}(s) + F_{n(k)}(r) \le \limsup_{n\to\infty}\ 1 - F_n(M_\epsilon) + F_n(-M_\epsilon) \le \epsilon$$

The last result implies $\limsup_{x\to\infty} 1 - F(x) + F(-x) \le \epsilon$. Since $\epsilon$ is arbitrary it follows that $F$ is the distribution function of a probability measure.

To prove the converse now suppose $F_n$ is not tight. In this case, there is an $\epsilon > 0$ and a subsequence $n(k) \to \infty$ so that

$$1 - F_{n(k)}(k) + F_{n(k)}(-k) \ge \epsilon$$

for all $k$. By passing to a further subsequence $F_{n(k_j)}$ we can suppose that $F_{n(k_j)} \Rightarrow_v F$. Let $r < 0 < s$ be continuity points of $F$.

$$1 - F(s) + F(r) = \lim_{j\to\infty}\ 1 - F_{n(k_j)}(s) + F_{n(k_j)}(r) \ge \liminf_{j\to\infty}\ 1 - F_{n(k_j)}(k_j) + F_{n(k_j)}(-k_j) \ge \epsilon$$

Letting $s \to \infty$ and $r \to -\infty$, we see that $F$ is not the distribution function of a probability measure.

The following sufficient condition for tightness is often useful.

Theorem 2.2.8. If there is a $\varphi \ge 0$ so that $\varphi(x) \to \infty$ as $|x| \to \infty$ and

$$C = \sup_n \int \varphi(x)\,dF_n(x) < \infty$$

then $F_n$ is tight.

Proof. $1 - F_n(M) + F_n(-M) \le C/\inf_{|x| \ge M}\varphi(x)$.
Exercise 2.2.6. The Lévy Metric. Show that

$$\rho(F, G) = \inf\{\epsilon : F(x - \epsilon) - \epsilon \le G(x) \le F(x + \epsilon) + \epsilon \text{ for all } x\}$$

defines a metric on the space of distributions and $\rho(F_n, F) \to 0$ if and only if $F_n \Rightarrow F$.

Exercise 2.2.7. The Ky Fan metric on random variables is defined by

$$\alpha(X, Y) = \inf\{\epsilon \ge 0 : P(|X - Y| > \epsilon) \le \epsilon\}$$

Show that if $\alpha(X, Y) = \alpha$ then the corresponding distributions have Lévy distance $\rho(F, G) \le \alpha$.

Exercise 2.2.8. Let $\alpha(X, Y)$ be the metric in the previous exercise and let $\beta(X, Y) = E(|X - Y|/(1 + |X - Y|))$ be the metric of Exercise 6.4 in Chapter 1. If $\alpha(X, Y) = a$ then

$$a^2/(1 + a) \le \beta(X, Y) \le a + (1 - a)a/(1 + a)$$
The fact that convergence in distribution comes from a metric immediately implies
Theorem 2.2.9. If each subsequence of Xn has a further subsequence that converges
to X then Xn ⇒ X .
We will prove this again at the end of the proof of Theorem 2.3.6.
Exercises
2.2.9. If $F_n \Rightarrow F$ and $F$ is continuous then $\sup_x |F_n(x) - F(x)| \to 0$.

2.2.10. If $F$ is any distribution function there is a sequence of distribution functions of the form $\sum_{m=1}^n a_{n,m}1_{(x_{n,m} \le x)}$ with $F_n \Rightarrow F$. Hint: use Theorem 1.7.7.
2.2.11. Let Xn , 1 ≤ n ≤ ∞, be integer valued. Show that Xn ⇒ X∞ if and only if
P (Xn = m) → P (X∞ = m) for all m.
2.2.12. Show that if Xn → X in probability then Xn ⇒ X and that, conversely, if
Xn ⇒ c, where c is a constant then Xn → c in probability.
2.2.13. Converging together lemma. If Xn ⇒ X and Yn ⇒ c, where c is a
constant then Xn + Yn ⇒ X + c. A useful consequence of this result is that if
Xn ⇒ X and Zn − Xn ⇒ 0 then Zn ⇒ X.
2.2.14. Suppose Xn ⇒ X , Yn ≥ 0, and Yn ⇒ c, where c > 0 is a constant then
Xn Yn ⇒ cX. This result is true without the assumptions Yn ≥ 0 and c > 0. We have
imposed these only to make the proof less tedious.
2.2.15. Show that if $X_n = (X_n^1, \ldots, X_n^n)$ is uniformly distributed over the surface of the sphere of radius $\sqrt{n}$ in $\mathbf{R}^n$ then $X_n^1 \Rightarrow$ a standard normal. Hint: Let $Y_1, Y_2, \ldots$ be i.i.d. standard normals and let $X_n^i = Y_i\,(n/\sum_{m=1}^n Y_m^2)^{1/2}$.

2.2.16. Suppose $Y_n \ge 0$, $EY_n^\alpha \to 1$ and $EY_n^\beta \to 1$ for some $0 < \alpha < \beta$. Show that $Y_n \to 1$ in probability.

2.2.17. For each $K < \infty$ and $y < 1$ there is a $c_{y,K} > 0$ so that $EX^2 = 1$ and $EX^4 \le K$ implies $P(|X| > y) \ge c_{y,K}$.

2.3 Characteristic Functions

This long section is divided into five parts. The first three are required reading, the last
two are optional. In part a we show that the characteristic function ϕ(t) = E exp(itX )
determines F (x) = P (X ≤ x), and we give recipes for computing F from ϕ. In part
b we relate weak convergence of distributions to the behavior of the corresponding
characteristic functions. In part c we relate the behavior of ϕ(t) at 0 to the moments
of X . In part d we prove Polya’s criterion and use it to construct some famous and
some strange examples of characteristic functions. Finally in part e we consider the
moment problem, i.e., when is a distribution characterized by its moments. 2.3.1 Deﬁnition, Inversion Formula If X is a random variable we deﬁne its characteristic function (ch.f.) by
ϕ(t) = EeitX = E cos tX + iE sin tX
The last formula requires taking the expected value of a complex valued random
variable but as the second equality may suggest no new theory is required. If Z is
complex valued we deﬁne EZ = E ( Re Z ) + iE ( Im Z ) where Re (a + bi) = a is the
real part and Im (a + bi) = b is the imaginary part. Some other deﬁnitions we will
need are: the modulus of the complex number $z = a + bi$ is $|a + bi| = (a^2 + b^2)^{1/2}$,
and the complex conjugate of $z = a + bi$ is $\bar{z} = a - bi$.
Theorem 2.3.1. All characteristic functions have the following properties:
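(a) $\varphi(0) = 1$,
(b) $\varphi(-t) = \overline{\varphi(t)}$,
(c) $|\varphi(t)| = |Ee^{itX}| \le E|e^{itX}| = 1$,
(d) $|\varphi(t+h) - \varphi(t)| \le E|e^{ihX} - 1|$, so $\varphi(t)$ is uniformly continuous on $(-\infty, \infty)$,
(e) $Ee^{it(aX+b)} = e^{itb}\varphi(at)$.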
Proof. (a) is obvious. For (b) we note that
$$\varphi(-t) = E(\cos(-tX) + i\sin(-tX)) = E(\cos(tX) - i\sin(tX))$$
(c) follows from Exercise 1.3.2 in Chapter 1 since $\phi(x, y) = (x^2 + y^2)^{1/2}$ is convex. For (d) we note that
$$|\varphi(t+h) - \varphi(t)| = |E(e^{i(t+h)X} - e^{itX})| \le E|e^{i(t+h)X} - e^{itX}| = E|e^{ihX} - 1|$$
The right-hand side does not depend on t and tends to 0 as $h \to 0$ by the bounded convergence theorem, so uniform continuity follows. For (e) we
note $Ee^{it(aX+b)} = e^{itb}Ee^{i(ta)X} = e^{itb}\varphi(at)$.
The main reason for introducing characteristic functions is the following:
Theorem 2.3.2. If $X_1$ and $X_2$ are independent and have ch.f.'s $\varphi_1$ and $\varphi_2$ then
$X_1 + X_2$ has ch.f. $\varphi_1(t)\varphi_2(t)$.
Proof.
$$Ee^{it(X_1+X_2)} = E(e^{itX_1}e^{itX_2}) = Ee^{itX_1}\,Ee^{itX_2}$$
since $e^{itX_1}$ and $e^{itX_2}$ are independent.

The next order of business is to give some examples.
Example 2.3.1. Coin flips. If $P(X = 1) = P(X = -1) = 1/2$ then
$$Ee^{itX} = (e^{it} + e^{-it})/2 = \cos t$$
Example 2.3.2. Poisson distribution. If $P(X = k) = e^{-\lambda}\lambda^k/k!$ for $k = 0, 1, 2, \ldots$
then
$$Ee^{itX} = \sum_{k=0}^{\infty} e^{-\lambda}\frac{\lambda^k e^{itk}}{k!} = \exp(\lambda(e^{it} - 1))$$
Example 2.3.3. Normal distribution
Density: $(2\pi)^{-1/2}\exp(-x^2/2)$
Ch.f.: $\exp(-t^2/2)$
Combining this result with (e) of Theorem 2.3.1, we see that a normal distribution
with mean $\mu$ and variance $\sigma^2$ has ch.f. $\exp(i\mu t - \sigma^2 t^2/2)$. Similar scalings can be
applied to other examples so we will often just give the ch.f. for one member of the
family.
Physics Proof.
$$\int e^{itx}(2\pi)^{-1/2}e^{-x^2/2}\,dx = e^{-t^2/2}\int (2\pi)^{-1/2}e^{-(x-it)^2/2}\,dx$$
The integral is 1 since the integrand is the normal density with mean $it$ and variance 1.
Math Proof. Now that we have cheated and figured out the answer we can verify it
by a formal calculation that gives very little insight into why it is true. Let
$$\varphi(t) = \int e^{itx}(2\pi)^{-1/2}e^{-x^2/2}\,dx = \int \cos tx\,(2\pi)^{-1/2}e^{-x^2/2}\,dx$$
since $i\sin tx$ is an odd function. Differentiating with respect to t (referring to Theorem
A.9.1 for the justification) and then integrating by parts gives
$$\varphi'(t) = \int -x\sin tx\,(2\pi)^{-1/2}e^{-x^2/2}\,dx = -\int t\cos tx\,(2\pi)^{-1/2}e^{-x^2/2}\,dx = -t\varphi(t)$$
This implies $\frac{d}{dt}\{\varphi(t)\exp(t^2/2)\} = 0$ so $\varphi(t)\exp(t^2/2) = \varphi(0) = 1$.

In the next three examples, the density is 0 outside the indicated range.
Example 2.3.4. Uniform distribution on (a, b)
Density: $1/(b - a)$, $x \in (a, b)$
Ch.f.: $(e^{itb} - e^{ita})/it(b - a)$
In the special case $a = -c$, $b = c$ the ch.f. is $(e^{itc} - e^{-itc})/2cit = (\sin ct)/ct$.
Proof. Once you recall that $\int_a^b e^{\lambda x}\,dx = (e^{\lambda b} - e^{\lambda a})/\lambda$ holds for complex $\lambda$, this is
immediate.
Example 2.3.5. Triangular distribution
Density: $1 - |x|$, $x \in (-1, 1)$
Ch.f.: $2(1 - \cos t)/t^2$
Proof. To see this, notice that if X and Y are independent and uniform on $(-1/2, 1/2)$
then $X + Y$ has a triangular distribution. Using Example 2.3.4 now and Theorem
2.3.2 it follows that the desired ch.f. is
$$\{(e^{it/2} - e^{-it/2})/it\}^2 = \{2\sin(t/2)/t\}^2$$
Using the trig identity $\cos 2\theta = 1 - 2\sin^2\theta$ with $\theta = t/2$ converts the answer into the
form given above.
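The product rule of Theorem 2.3.2 and the closed form in Example 2.3.5 can be checked numerically. The following is a minimal sketch, not part of the text: the helper names (`chf_numeric`, `triangular_density`, and so on) are hypothetical, and the integral is approximated by a plain midpoint rule rather than any library routine.

```python
import cmath
import math

def chf_numeric(density, lo, hi, t, steps=20000):
    """Midpoint-rule approximation of the integral of e^{itx} * density(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0j
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += cmath.exp(1j * t * x) * density(x)
    return total * h

def triangular_density(x):
    # Density 1 - |x| on (-1, 1), zero elsewhere (Example 2.3.5).
    return max(0.0, 1.0 - abs(x))

def uniform_half_chf(t):
    # Ch.f. of the uniform distribution on (-1/2, 1/2), from Example 2.3.4 with c = 1/2.
    return math.sin(t / 2) / (t / 2)

def triangular_chf(t):
    # Closed form stated in Example 2.3.5.
    return 2 * (1 - math.cos(t)) / t ** 2

for t in (0.5, 1.0, 3.0, 7.0):
    numeric = chf_numeric(triangular_density, -1.0, 1.0, t)
    # The three columns should agree up to quadrature error.
    print(t, numeric.real, uniform_half_chf(t) ** 2, triangular_chf(t))
```

The imaginary part of the numerical integral is close to zero because the density is symmetric, which matches Exercise 2.3.3.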
Example 2.3.6. Exponential distribution
Density: $e^{-x}$, $x \in (0, \infty)$
Ch.f.: $1/(1 - it)$
Proof. Integrating gives
$$\int_0^\infty e^{itx}e^{-x}\,dx = \left.\frac{e^{(it-1)x}}{it - 1}\right|_0^\infty = \frac{1}{1 - it}$$
since $\exp((it - 1)x) \to 0$ as $x \to \infty$.
For the next result we need the following fact which follows from the fact that
$\int f\,d(\mu + \nu) = \int f\,d\mu + \int f\,d\nu$.
Lemma 2.3.3. If $F_1, \ldots, F_n$ have ch.f. $\varphi_1, \ldots, \varphi_n$ and $\lambda_i \ge 0$ have $\lambda_1 + \ldots + \lambda_n = 1$
then $\sum_{i=1}^n \lambda_i F_i$ has ch.f. $\sum_{i=1}^n \lambda_i \varphi_i$.
Example 2.3.7. Bilateral exponential
Density: $\frac{1}{2}e^{-|x|}$, $x \in (-\infty, \infty)$
Ch.f.: $1/(1 + t^2)$
Proof. This follows from Lemma 2.3.3 with $F_1$ the distribution of an exponential random variable X, $F_2$ the distribution of $-X$, and $\lambda_1 = \lambda_2 = 1/2$. Then using (b) of
Theorem 2.3.1 we see the desired ch.f. is
$$\frac{1}{2(1 - it)} + \frac{1}{2(1 + it)} = \frac{(1 + it) + (1 - it)}{2(1 + t^2)} = \frac{1}{1 + t^2}$$
Exercise 2.3.1. Show that if $\varphi$ is a ch.f. then $\mathrm{Re}\,\varphi$ and $|\varphi|^2$ are also.
The ﬁrst issue to be settled is that the characteristic function uniquely determines
the distribution. This and more is provided by
Theorem 2.3.4. The inversion formula. Let $\varphi(t) = \int e^{itx}\,\mu(dx)$ where $\mu$ is a
probability measure. If $a < b$ then
$$\lim_{T\to\infty} (2\pi)^{-1}\int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt = \mu(a, b) + \frac{1}{2}\mu(\{a, b\})$$
Remark. The existence of the limit is part of the conclusion. If $\mu = \delta_0$, a point mass
at 0, $\varphi(t) \equiv 1$. In this case, if $a = -1$ and $b = 1$, the integrand is $(2\sin t)/t$ and the
integral does not converge absolutely.
Proof. Let
$$I_T = \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt = \int_{-T}^{T}\int \frac{e^{-ita} - e^{-itb}}{it}\,e^{itx}\,\mu(dx)\,dt$$
The integrand may look bad near $t = 0$ but if we observe that
$$\frac{e^{-ita} - e^{-itb}}{it} = \int_a^b e^{-ity}\,dy$$
we see that the modulus of the integrand is bounded by $b - a$. Since $\mu$ is a probability
measure and $[-T, T]$ is a finite interval it follows from Fubini's theorem, $\cos(-x) = \cos x$, and $\sin(-x) = -\sin x$ that
$$I_T = \int\int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\,e^{itx}\,dt\,\mu(dx) = \int\left[\int_{-T}^{T} \frac{\sin(t(x - a))}{t}\,dt - \int_{-T}^{T} \frac{\sin(t(x - b))}{t}\,dt\right]\mu(dx)$$
Introducing $R(\theta, T) = \int_{-T}^{T} (\sin\theta t)/t\,dt$, we can write the last result as
$$(*)\qquad I_T = \int \{R(x - a, T) - R(x - b, T)\}\,\mu(dx)$$
If we let $S(T) = \int_0^T (\sin x)/x\,dx$ then for $\theta > 0$ changing variables $t = x/\theta$ shows that
$$R(\theta, T) = 2\int_0^{T\theta} \frac{\sin x}{x}\,dx = 2S(T\theta)$$
while for $\theta < 0$, $R(\theta, T) = -R(|\theta|, T)$. Introducing the function $\mathrm{sgn}\,x$, which is 1 if
$x > 0$, $-1$ if $x < 0$, and 0 if $x = 0$, we can write the last two formulas together as
$$R(\theta, T) = 2(\mathrm{sgn}\,\theta)\,S(T|\theta|)$$
As $T \to \infty$, $S(T) \to \pi/2$ (see Exercise A.6.6), so we have $R(\theta, T) \to \pi\,\mathrm{sgn}\,\theta$ and
$$R(x - a, T) - R(x - b, T) \to \begin{cases} 2\pi & a < x < b \\ \pi & x = a \text{ or } x = b \\ 0 & x < a \text{ or } x > b \end{cases}$$
$|R(\theta, T)| \le 2\sup_y S(y) < \infty$, so using the bounded convergence theorem with $(*)$
implies
$$(2\pi)^{-1}I_T \to \mu(a, b) + \frac{1}{2}\mu(\{a, b\})$$
proving the desired result.
Exercise 2.3.2. (i) Imitate the proof of Theorem 2.3.4 to show that
$$\mu(\{a\}) = \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} e^{-ita}\varphi(t)\,dt$$
(ii) If $P(X \in h\mathbf{Z}) = 1$ where $h > 0$ then its ch.f. has $\varphi(2\pi/h + t) = \varphi(t)$ so
$$P(X = x) = \frac{h}{2\pi}\int_{-\pi/h}^{\pi/h} e^{-itx}\varphi(t)\,dt \quad\text{for } x \in h\mathbf{Z}$$
(iii) If $X = Y + b$ then $E\exp(itX) = e^{itb}E\exp(itY)$. So if $P(X \in b + h\mathbf{Z}) = 1$, the
inversion formula in (ii) is valid for $x \in b + h\mathbf{Z}$.
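To see the inversion formula of Theorem 2.3.4 at work, one can evaluate the truncated integral numerically for the standard normal ch.f. $\exp(-t^2/2)$ and compare with the normal distribution function. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the cutoff `T = 10` and the step count are arbitrary choices, and the quadrature is a simple midpoint rule.

```python
import cmath
import math

def normal_chf(t):
    # Ch.f. of the standard normal (Example 2.3.3).
    return math.exp(-t * t / 2)

def inversion_integral(chf, a, b, T, steps=20000):
    """Midpoint approximation of (2*pi)^{-1} * int_{-T}^{T} (e^{-ita} - e^{-itb})/(it) * chf(t) dt."""
    h = 2 * T / steps
    total = 0j
    for k in range(steps):
        t = -T + (k + 0.5) * h  # with an even step count no midpoint equals 0
        kernel = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += kernel * chf(t)
    return (total * h / (2 * math.pi)).real

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

a, b = -1.0, 2.0
approx = inversion_integral(normal_chf, a, b, T=10.0)
exact = normal_cdf(b) - normal_cdf(a)  # mu(a, b); the normal has no point masses
print(approx, exact)
```

Because the normal distribution has no atoms, the term $\frac{1}{2}\mu(\{a, b\})$ vanishes and the two printed numbers should agree closely.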
Two trivial consequences of the inversion formula are:
Exercise 2.3.3. If ϕ is real then X and −X have the same distribution.
Exercise 2.3.4. If $X_i$, $i = 1, 2$ are independent and have normal distributions with
mean 0 and variance $\sigma_i^2$, then $X_1 + X_2$ has a normal distribution with mean 0 and
variance $\sigma_1^2 + \sigma_2^2$.
The inversion formula is simpler when φ is integrable, but as the next result shows
this only happens when the underlying measure is nice.
Theorem 2.3.5. If $\int |\varphi(t)|\,dt < \infty$ then $\mu$ has bounded continuous density
$$f(y) = \frac{1}{2\pi}\int e^{-ity}\varphi(t)\,dt$$
Proof. As we observed in the proof of Theorem 2.3.4
$$\left|\frac{e^{-ita} - e^{-itb}}{it}\right| = \left|\int_a^b e^{-ity}\,dy\right| \le b - a$$
so the integral in Theorem 2.3.4 converges absolutely in this case and
$$\mu(a, b) + \frac{1}{2}\mu(\{a, b\}) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt \le \frac{(b - a)}{2\pi}\int_{-\infty}^{\infty} |\varphi(t)|\,dt$$
The last result implies $\mu$ has no point masses and
$$\mu(x, x + h) = \frac{1}{2\pi}\int \frac{e^{-itx} - e^{-it(x+h)}}{it}\,\varphi(t)\,dt = \frac{1}{2\pi}\int\left(\int_x^{x+h} e^{-ity}\,dy\right)\varphi(t)\,dt = \int_x^{x+h}\left(\frac{1}{2\pi}\int e^{-ity}\varphi(t)\,dt\right)dy$$
by Fubini's theorem, so the distribution $\mu$ has density function
$$f(y) = \frac{1}{2\pi}\int e^{-ity}\varphi(t)\,dt$$
The dominated convergence theorem implies f is continuous and the proof is complete.
Exercise 2.3.5. Give an example of a measure $\mu$ with a density but for which
$\int |\varphi(t)|\,dt = \infty$. Hint: Two of the examples above have this property.
Exercise 2.3.6. Show that if $X_1, \ldots, X_n$ are independent and uniformly distributed
on $(-1, 1)$, then for $n \ge 2$, $X_1 + \cdots + X_n$ has density
$$f(x) = \frac{1}{\pi}\int_0^\infty (\sin t/t)^n \cos tx\,dt$$
Although it is not obvious from the formula, f is a polynomial in each interval $(k, k + 1)$, $k \in \mathbf{Z}$ and vanishes on $[-n, n]^c$.
Theorem 2.3.5 and the next result show that the behavior of ϕ at inﬁnity is related
to the smoothness of the underlying measure.
Exercise 2.3.7. Suppose X and Y are independent and have ch.f. $\varphi$ and distribution
$\mu$. Apply Exercise 2.3.2 to $X - Y$ and use Exercise 1.4.7 in Chapter 1 to get
$$\lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} |\varphi(t)|^2\,dt = P(X - Y = 0) = \sum_x \mu(\{x\})^2$$
Remark. The last result implies that if $\varphi(t) \to 0$ as $t \to \infty$, $\mu$ has no point masses.
Exercise 2.3.13 gives an example to show that the converse is false. The Riemann-Lebesgue Lemma (Exercise A.4.4) shows that if $\mu$ has a density, $\varphi(t) \to 0$ as $t \to \infty$.
Applying the inversion formula Theorem 2.3.5 to the ch.f. in Examples 2.3.5 and
2.3.7 gives us two more examples of ch.f. The ﬁrst one does not have an oﬃcial name
so we gave it one to honor its role in the proof of Polya’s criterion, see Theorem 2.3.10.
Example 2.3.8. Polya's distribution
Density: $(1 - \cos x)/\pi x^2$
Ch.f.: $(1 - |t|)^+$
Proof. Theorem 2.3.5 implies
$$\frac{1}{2\pi}\int \frac{2(1 - \cos s)}{s^2}\,e^{-isy}\,ds = (1 - |y|)^+$$
Now let $s = x$, $y = -t$.
Example 2.3.9. The Cauchy distribution
Density: $1/\pi(1 + x^2)$
Ch.f.: $\exp(-|t|)$
Proof. Theorem 2.3.5 implies
$$\frac{1}{2\pi}\int \frac{1}{1 + s^2}\,e^{-isy}\,ds = \frac{1}{2}e^{-|y|}$$
Now let $s = x$, $y = -t$ and multiply each side by 2.
Exercise 2.3.8. Use the last result to conclude that if $X_1, X_2, \ldots$ are independent
and have the Cauchy distribution, then $(X_1 + \cdots + X_n)/n$ has the same distribution
as $X_1$.

2.3.2 Weak Convergence

Our next step toward the central limit theorem is to relate convergence of characteristic functions to weak convergence.
Theorem 2.3.6. Continuity theorem. Let µn , 1 ≤ n ≤ ∞ be probability measures
with ch.f. ϕn . (i) If µn ⇒ µ∞ then ϕn (t) → ϕ∞ (t) for all t. (ii) If ϕn (t) converges
pointwise to a limit ϕ(t) that is continuous at 0, then the associated sequence of
distributions µn is tight and converges weakly to the measure µ with characteristic
function ϕ.
Remark. To see why continuity of the limit at 0 is needed in (ii), let $\mu_n$ have a normal
distribution with mean 0 and variance n. In this case $\varphi_n(t) = \exp(-nt^2/2) \to 0$ for
$t \ne 0$, and $\varphi_n(0) = 1$ for all n, but the measures do not converge weakly since
$\mu_n((-\infty, x]) \to 1/2$ for all x.
Proof. (i) is easy. $e^{itx}$ is bounded and continuous so if $\mu_n \Rightarrow \mu_\infty$ then Theorem 2.2.3
implies $\varphi_n(t) \to \varphi_\infty(t)$. To prove (ii), our first goal is to prove tightness. We begin
with some calculations that may look mysterious but will prove to be very useful.
$$\int_{-u}^{u} (1 - e^{itx})\,dt = 2u - \int_{-u}^{u} (\cos tx + i\sin tx)\,dt = 2u - \frac{2\sin ux}{x}$$
Dividing both sides by u, integrating $\mu_n(dx)$, and using Fubini's theorem on the
left-hand side gives
$$u^{-1}\int_{-u}^{u} (1 - \varphi_n(t))\,dt = 2\int\left(1 - \frac{\sin ux}{ux}\right)\mu_n(dx)$$
To bound the right-hand side, we note that
$$|\sin x| = \left|\int_0^x \cos(y)\,dy\right| \le |x| \quad\text{for all } x$$
so we have $1 - (\sin ux/ux) \ge 0$. Discarding the integral over $(-2/u, 2/u)$ and using
$|\sin ux| \le 1$ on the rest, the right-hand side is
$$\ge 2\int_{|x| \ge 2/u}\left(1 - \frac{1}{|ux|}\right)\mu_n(dx) \ge \mu_n(\{x : |x| > 2/u\})$$
Since $\varphi(t) \to 1$ as $t \to 0$,
$$u^{-1}\int_{-u}^{u} (1 - \varphi(t))\,dt \to 0 \quad\text{as } u \to 0$$
Pick u so that the integral is $< \epsilon$. Since $\varphi_n(t) \to \varphi(t)$ for each t, it follows from the
bounded convergence theorem that for $n \ge N$
$$2\epsilon \ge u^{-1}\int_{-u}^{u} (1 - \varphi_n(t))\,dt \ge \mu_n\{x : |x| > 2/u\}$$
Since $\epsilon$ is arbitrary, the sequence $\mu_n$ is tight.
To complete the proof now we observe that if $\mu_{n(k)} \Rightarrow \mu$, then it follows from
the first sentence of the proof that $\mu$ has ch.f. $\varphi$. The last observation and tightness
imply that every subsequence has a further subsequence that converges to $\mu$. I claim
that this implies the whole sequence converges to $\mu$. To see this, observe that we
have shown that if f is bounded and continuous then every subsequence of $\int f\,d\mu_n$
has a further subsequence that converges to $\int f\,d\mu$, so Theorem 1.6.3 implies that the
whole sequence converges to that limit. This shows $\int f\,d\mu_n \to \int f\,d\mu$ for all bounded
continuous functions f so the desired result follows from Theorem 2.2.3.
Exercise 2.3.9. Suppose that $X_n \Rightarrow X$ and $X_n$ has a normal distribution with mean
0 and variance $\sigma_n^2$. Prove that $\sigma_n^2 \to \sigma^2 \in [0, \infty)$.
Exercise 2.3.10. Show that if Xn and Yn are independent for 1 ≤ n ≤ ∞, Xn ⇒ X∞ ,
and Yn ⇒ Y∞ , then Xn + Yn ⇒ X∞ + Y∞ .
Exercise 2.3.11. Let $X_1, X_2, \ldots$ be independent and let $S_n = X_1 + \cdots + X_n$. Let
$\varphi_j$ be the ch.f. of $X_j$ and suppose that $S_n \to S_\infty$ a.s. Then $S_\infty$ has ch.f. $\prod_{j=1}^\infty \varphi_j(t)$.
Exercise 2.3.12. Using the identity $\sin t = 2\sin(t/2)\cos(t/2)$ repeatedly leads to
$(\sin t)/t = \prod_{m=1}^\infty \cos(t/2^m)$. Prove the last identity by interpreting each side as a
characteristic function.
Exercise 2.3.13. Let $X_1, X_2, \ldots$ be independent taking values 0 and 1 with probability 1/2 each. $X = 2\sum_{j \ge 1} X_j/3^j$ has the Cantor distribution. Compute the ch.f. $\varphi$
of X and notice that $\varphi$ has the same value at $t = 3^k\pi$ for $k = 0, 1, 2, \ldots$

2.3.3 Moments and Derivatives

In the proof of Theorem 2.3.6, we derived the inequality
$$\mu\{x : |x| > 2/u\} \le u^{-1}\int_{-u}^{u} (1 - \varphi(t))\,dt \qquad (2.3.1)$$
which shows that the smoothness of the characteristic function at 0 is related to the
decay of the measure at $\infty$. The next result continues this theme. We leave the proof
to the reader. (Use (9.1) in the Appendix.)
Exercise 2.3.14. If $\int |x|^n\,\mu(dx) < \infty$ then its characteristic function $\varphi$ has a continuous derivative of order n given by $\varphi^{(n)}(t) = \int (ix)^n e^{itx}\,\mu(dx)$.
Exercise 2.3.15. Use the last exercise and the series expansion for $e^{-t^2/2}$ to show
that the standard normal distribution has
$$EX^{2n} = (2n)!/2^n n! = (2n - 1)(2n - 3)\cdots 3\cdot 1 \equiv (2n - 1)!!$$
The result in Exercise 2.3.14 shows that if $E|X|^n < \infty$, then its characteristic
function is n times differentiable at 0, and $\varphi^{(n)}(0) = E(iX)^n$. Expanding $\varphi$ in a Taylor
series about 0 leads to
$$\varphi(t) = \sum_{m=0}^{n} \frac{E(itX)^m}{m!} + o(t^n)$$
where $o(t^n)$ indicates a quantity $g(t)$ that has $g(t)/t^n \to 0$ as $t \to 0$. For our purposes
below, it will be important to have a good estimate on the error term, so we will now
Lemma 2.3.7.
$$\left|e^{ix} - \sum_{m=0}^{n} \frac{(ix)^m}{m!}\right| \le \min\left(\frac{|x|^{n+1}}{(n + 1)!}, \frac{2|x|^n}{n!}\right) \qquad (2.3.2)$$
The first term on the right is the usual order of magnitude we expect in the correction
term. The second is better for large $|x|$ and will help us prove the central limit theorem
without assuming finite third moments.
Proof. Integrating by parts gives
$$\int_0^x (x - s)^n e^{is}\,ds = \frac{x^{n+1}}{n + 1} + \frac{i}{n + 1}\int_0^x (x - s)^{n+1} e^{is}\,ds$$
When $n = 0$, this says
$$\int_0^x e^{is}\,ds = x + i\int_0^x (x - s)e^{is}\,ds$$
The left-hand side is $(e^{ix} - 1)/i$, so rearranging gives
$$e^{ix} = 1 + ix + i^2\int_0^x (x - s)e^{is}\,ds$$
Using the result for $n = 1$ now gives
$$e^{ix} = 1 + ix + \frac{i^2x^2}{2} + \frac{i^3}{2}\int_0^x (x - s)^2 e^{is}\,ds$$
and iterating we arrive at
$$(a)\qquad e^{ix} - \sum_{m=0}^{n} \frac{(ix)^m}{m!} = \frac{i^{n+1}}{n!}\int_0^x (x - s)^n e^{is}\,ds$$
To prove the result now it only remains to estimate the "error term" on the right-hand
side. Since $|e^{is}| \le 1$ for all s,
$$(b)\qquad \left|\frac{i^{n+1}}{n!}\int_0^x (x - s)^n e^{is}\,ds\right| \le |x|^{n+1}/(n + 1)!$$
The last estimate is good when x is small. The next is designed for large x. Integrating
by parts
$$\frac{i}{n}\int_0^x (x - s)^n e^{is}\,ds = -\frac{x^n}{n} + \int_0^x (x - s)^{n-1} e^{is}\,ds$$
Noticing $x^n/n = \int_0^x (x - s)^{n-1}\,ds$ now gives
$$\frac{i^{n+1}}{n!}\int_0^x (x - s)^n e^{is}\,ds = \frac{i^n}{(n - 1)!}\int_0^x (x - s)^{n-1}(e^{is} - 1)\,ds$$
and since $|e^{is} - 1| \le 2$, it follows that
$$(c)\qquad \left|\frac{i^{n+1}}{n!}\int_0^x (x - s)^n e^{is}\,ds\right| \le \left|\frac{2}{(n - 1)!}\int_0^x (x - s)^{n-1}\,ds\right| \le 2|x|^n/n!$$
Combining (a), (b), and (c) we have the desired result.
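The bound (2.3.2) is easy to probe numerically. The sketch below is not from the text: it evaluates both sides on a grid of real x for small n and records the largest amount by which the left side exceeds the right; the function names and the grid are illustrative choices.

```python
import cmath
import math

def taylor_remainder(x, n):
    """|e^{ix} - sum_{m=0}^{n} (ix)^m / m!|, the left side of (2.3.2)."""
    partial = sum((1j * x) ** m / math.factorial(m) for m in range(n + 1))
    return abs(cmath.exp(1j * x) - partial)

def lemma_bound(x, n):
    """min(|x|^{n+1}/(n+1)!, 2|x|^n/n!), the right side of (2.3.2)."""
    return min(abs(x) ** (n + 1) / math.factorial(n + 1),
               2 * abs(x) ** n / math.factorial(n))

worst = 0.0
for n in range(0, 6):
    for k in range(-400, 401):
        x = k / 20.0  # grid on [-20, 20]
        worst = max(worst, taylor_remainder(x, n) - lemma_bound(x, n))

# A value at the level of floating-point rounding means no violation was found.
print(worst)
```

For small $|x|$ the first term of the minimum is the active one, and for large $|x|$ the second term takes over, which is the point made after the statement of the lemma.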
Taking expected values, using Jensen's inequality, and applying (2.3.2) to $x = tX$
gives
$$\left|Ee^{itX} - \sum_{m=0}^{n} E\frac{(itX)^m}{m!}\right| \le E\left|e^{itX} - \sum_{m=0}^{n} \frac{(itX)^m}{m!}\right| \le E\min\left(|tX|^{n+1}, 2|tX|^n\right) \qquad (2.3.3)$$
where in the second step we have dropped the denominators to make the bound
simpler.
In the next section, the following special case will be useful.
Theorem 2.3.8. If $E|X|^2 < \infty$ then
$$\varphi(t) = 1 + itEX - t^2E(X^2)/2 + o(t^2)$$
Proof. The error term is $\le t^2E(|t|\cdot|X|^3 \wedge 2|X|^2)$. The variable in parentheses is
smaller than $2|X|^2$ and converges to 0 as $t \to 0$, so the desired conclusion follows from
the dominated convergence theorem.
Remark. The point of the estimate in (2.3.3) which involves the minimum of two
terms rather than just the first one which would result from a naive application of
Taylor series, is that we get the conclusion in Theorem 2.3.8 under the assumption
$E|X|^2 < \infty$, i.e., we do not have to assume $E|X|^3 < \infty$.
Exercise 2.3.16. (i) Suppose that the family of measures $\{\mu_i, i \in I\}$ is tight, i.e.,
$\sup_i \mu_i([-M, M]^c) \to 0$ as $M \to \infty$. Use (d) in Theorem 2.3.1 and (2.3.3) with $n = 0$
to show that their ch.f.'s $\varphi_i$ are equicontinuous, i.e., if $\epsilon > 0$ we can pick $\delta > 0$ so
that if $|h| < \delta$ then $|\varphi_i(t + h) - \varphi_i(t)| < \epsilon$. (ii) Suppose $\mu_n \Rightarrow \mu_\infty$. Use Theorem 2.3.6
and equicontinuity to conclude that the ch.f.'s $\varphi_n \to \varphi_\infty$ uniformly on compact sets.
[Argue directly. You don't need to go to AA.] (iii) Give an example to show that the
convergence need not be uniform on the whole real line.
Exercise 2.3.17. Let $X_1, X_2, \ldots$ be i.i.d. with characteristic function $\varphi$. (i) If $\varphi'(0) = ia$
and $S_n = X_1 + \cdots + X_n$ then $S_n/n \to a$ in probability. (ii) If $S_n/n \to a$ in probability
then $\varphi(t/n)^n \to e^{iat}$ as $n \to \infty$ through the integers. (iii) Use (ii) and the uniform
continuity established in (d) of Theorem 2.3.1 to show that $(\varphi(h) - 1)/h \to ia$ as
$h \to 0$ through the positive reals. Thus the weak law holds if and only if $\varphi'(0)$ exists.
This result is due to E.J.G. Pitman (1956), with a little help from John Walsh who
pointed out that we should prove (iii).
The last exercise in combination with Exercise 1.5.4 from Chapter 1 shows that
$\varphi'(0)$ may exist when $E|X| = \infty$.
Exercise 2.3.18. $2\int_0^\infty (1 - \mathrm{Re}\,\varphi(t))/(\pi t^2)\,dt = \int |y|\,dF(y)$. Hint: Change variables
$x = |y|t$ in the density function of Example 2.3.8, which integrates to 1.
The next result shows that the existence of second derivatives implies the existence
of second moments.
Theorem 2.3.9. If $\limsup_{h \downarrow 0} \{\varphi(h) - 2\varphi(0) + \varphi(-h)\}/h^2 > -\infty$, then $E|X|^2 < \infty$.
Proof. $(e^{ihx} - 2 + e^{-ihx})/h^2 = -2(1 - \cos hx)/h^2 \le 0$ and $2(1 - \cos hx)/h^2 \to x^2$ as
$h \to 0$ so Fatou's lemma and Fubini's theorem imply
$$\int x^2\,dF(x) \le 2\liminf_{h \to 0}\int \frac{1 - \cos hx}{h^2}\,dF(x) = -\limsup_{h \to 0}\frac{\varphi(h) - 2\varphi(0) + \varphi(-h)}{h^2} < \infty$$
which proves the desired result.
Exercise 2.3.19. Show that if $\lim_{t \downarrow 0} (\varphi(t) - 1)/t^2 = c > -\infty$ then $EX = 0$ and
$E|X|^2 = -2c < \infty$. In particular, if $\varphi(t) = 1 + o(t^2)$ then $\varphi(t) \equiv 1$.
Exercise 2.3.20. If $Y_n$ are r.v.'s with ch.f.'s $\varphi_n$ then $Y_n \Rightarrow 0$ if and only if there is
a $\delta > 0$ so that $\varphi_n(t) \to 1$ for $|t| \le \delta$.
Exercise 2.3.21. Let $X_1, X_2, \ldots$ be independent. If $S_n = \sum_{m \le n} X_m$ converges in
distribution then it converges in probability (and hence a.s. by Exercise 1.8.10). Hint:
The last exercise implies that if $m, n \to \infty$ then $S_m - S_n \to 0$ in probability. Now
use Exercise 1.8.11.

2.3.4 Polya's Criterion*

The next result is useful for constructing examples of ch.f.'s.
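Theorem 2.3.10. Polya's criterion. Let $\varphi(t)$ be real nonnegative and have $\varphi(0) = 1$,
$\varphi(t) = \varphi(-t)$, and $\varphi$ is decreasing and convex on $(0, \infty)$ with
$$\lim_{t \downarrow 0} \varphi(t) = 1, \qquad \lim_{t \uparrow \infty} \varphi(t) = 0$$
Then there is a probability measure $\nu$ on $(0, \infty)$, so that
$$(*)\qquad \varphi(t) = \int_0^\infty \left(1 - \left|\frac{t}{s}\right|\right)^+ \nu(ds)$$
and hence $\varphi$ is a characteristic function.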
Remark. Before we get lost in the details of the proof, the reader should note that
$(*)$ displays $\varphi$ as a convex combination of ch.f.'s of the form given in Example 2.3.8,
so an extension of Lemma 2.3.3 (to be proved below) implies that this is a ch.f.
The assumption that $\lim_{t \to 0} \varphi(t) = 1$ is necessary because the function $\varphi(t) = 1_{\{0\}}(t)$
which is 1 at 0 and 0 otherwise satisfies all the other hypotheses. We could
allow $\lim_{t \to \infty} \varphi(t) = c > 0$ by having a point mass of size c at 0, but we leave this
extension to the reader.
Proof. Let $\varphi'$ be the right derivative of $\varphi$, i.e.,
$$\varphi'(t) = \lim_{h \downarrow 0}\frac{\varphi(t + h) - \varphi(t)}{h}$$
Since $\varphi$ is convex this exists and is right continuous and increasing. So we can let $\mu$
be the measure on $(0, \infty)$ with $\mu(a, b] = \varphi'(b) - \varphi'(a)$ for all $0 \le a < b < \infty$, and let
$\nu$ be the measure on $(0, \infty)$ with $d\nu/d\mu = s$.
Now $\varphi'(t) \to 0$ as $t \to \infty$ (for if $\varphi'(t) \downarrow -\epsilon$ we would have $\varphi(t) \le 1 - \epsilon t$ for all t),
so Exercise 8.7 in the Appendix implies
$$-\varphi'(s) = \int_s^\infty r^{-1}\,\nu(dr)$$
Integrating again and using Fubini's theorem we have for $t \ge 0$
$$\varphi(t) = \int_t^\infty\int_s^\infty r^{-1}\,\nu(dr)\,ds = \int_t^\infty r^{-1}\int_t^r ds\,\nu(dr) = \int_t^\infty \left(1 - \frac{t}{r}\right)\nu(dr) = \int_0^\infty \left(1 - \frac{t}{r}\right)^+ \nu(dr)$$
Using $\varphi(-t) = \varphi(t)$ to extend the formula to $t \le 0$ we have $(*)$. Setting $t = 0$ in $(*)$
shows $\nu$ has total mass 1.
If $\varphi$ is piecewise linear, $\nu$ has a finite number of atoms and the result follows from
Example 2.3.8 and Lemma 2.3.3. To prove the general result, let $\nu_n$ be a sequence
of measures on $(0, \infty)$ with a finite number of atoms that converges weakly to $\nu$ (see
Exercise 2.2.10) and let
$$\varphi_n(t) = \int_0^\infty \left(1 - \left|\frac{t}{s}\right|\right)^+ \nu_n(ds)$$
Since $s \to (1 - |t/s|)^+$ is bounded and continuous, $\varphi_n(t) \to \varphi(t)$ and the desired result
follows from part (ii) of Theorem 2.3.6.
A classic application of Polya’s criterion is:
Exercise 2.3.22. Show that $\exp(-|t|^\alpha)$ is a characteristic function for $0 < \alpha \le 1$.
(The case $\alpha = 1$ corresponds to the Cauchy distribution.) The next argument, which
we learned from Frank Spitzer, proves that this is true for $0 < \alpha \le 2$. The case $\alpha = 2$
corresponds to a normal distribution, so that case can be safely ignored in the proof.
Example 2.3.10. $\exp(-|t|^\alpha)$ is a characteristic function for $0 < \alpha < 2$.
Proof. A little calculus shows that for any $\beta$ and $|x| < 1$
$$(1 - x)^\beta = \sum_{n=0}^\infty \binom{\beta}{n}(-x)^n \quad\text{where}\quad \binom{\beta}{n} = \frac{\beta(\beta - 1)\cdots(\beta - n + 1)}{1\cdot 2\cdots n}$$
Let $\psi(t) = 1 - (1 - \cos t)^{\alpha/2} = \sum_{n=1}^\infty c_n(\cos t)^n$ where
$$c_n = \binom{\alpha/2}{n}(-1)^{n+1}$$
$c_n \ge 0$ (here we use $\alpha < 2$), and $\sum_{n=1}^\infty c_n = 1$ (take $t = 0$ in the definition of $\psi$). $\cos t$
is a characteristic function (see Example 2.3.1) so an easy extension of Lemma 2.3.3
shows that $\psi$ is a ch.f. We have $1 - \cos t \sim t^2/2$ as $t \to 0$, so
$$1 - \cos(t\cdot 2^{1/2}\cdot n^{-1/\alpha}) \sim n^{-2/\alpha}t^2$$
Using Lemma 2.1.1 and (ii) of Theorem 2.3.6 now, it follows that
$$\exp(-|t|^\alpha) = \lim_{n \to \infty}\{\psi(t\cdot 2^{1/2}\cdot n^{-1/\alpha})\}^n$$
is a ch.f.
Exercise 2.3.19 shows that $\exp(-|t|^\alpha)$ is not a ch.f. when $\alpha > 2$. A reason for interest
in these characteristic functions is explained by the following generalization of Exercise
2.3.8.
Exercise 2.3.23. If $X_1, X_2, \ldots$ are independent and have characteristic function
$\exp(-|t|^\alpha)$ then $(X_1 + \cdots + X_n)/n^{1/\alpha}$ has the same distribution as $X_1$.
We will return to this topic in Section 2.7. Polya's criterion can also be used to
construct some "pathological examples."
Exercise 2.3.24. Let $\varphi_1$ and $\varphi_2$ be ch.f's. Show that $A = \{t : \varphi_1(t) = \varphi_2(t)\}$ is
closed, contains 0, and is symmetric about 0. Show that if A is a set with these
properties and $\varphi_1(t) = e^{-|t|}$ there is a $\varphi_2$ so that $\{t : \varphi_1(t) = \varphi_2(t)\} = A$.
Example 2.3.11. For some purposes, it is nice to have an explicit example of two
ch.f.'s that agree on $[-1, 1]$. From Example 2.3.8, we know that $\varphi(t) = (1 - |t|)^+$ is the ch.f. of
the density $(1 - \cos x)/\pi x^2$. Define $\psi(t)$ to be equal to $\varphi$ on $[-1, 1]$ and periodic with
period 2, i.e., $\psi(t) = \psi(t + 2)$. The Fourier series for $\psi$ is
$$\psi(u) = \frac{1}{2} + \sum_{n=-\infty}^{\infty} \frac{2}{\pi^2(2n - 1)^2}\exp(i(2n - 1)\pi u)$$
The right-hand side is the ch.f. of a discrete distribution with
$$P(X = 0) = 1/2 \quad\text{and}\quad P(X = (2n - 1)\pi) = 2\pi^{-2}(2n - 1)^{-2}, \quad n \in \mathbf{Z}.$$
Exercise 2.3.25. Find independent r.v.'s X, Y, and Z so that Y and Z do not have
the same distribution but $X + Y$ and $X + Z$ do.
the same distribution but X + Y and X + Z do.
Exercise 2.3.26. Show that if X and Y are independent and X + Y and X have the
same distribution then Y = 0 a.s.
For more curiosities, see Feller, Vol. II (1971), Section XV.2a.

2.3.5 The Moment Problem*

Suppose $\int x^k\,dF_n(x)$ has a limit $\mu_k$ for each k. Then the sequence of distributions is
tight by Theorem 2.2.8 and every subsequential limit has the moments $\mu_k$ by Exercise
2.2.5, so we can conclude the sequence converges weakly if there is only one distribution with these moments. It is easy to see that this is true if F is concentrated
on a finite interval $[-M, M]$ since every continuous function can be approximated
uniformly on $[-M, M]$ by polynomials. The result is false in general.
Counterexample 1. Heyde (1963) Consider the lognormal density
$$f_0(x) = (2\pi)^{-1/2}x^{-1}\exp(-(\log x)^2/2), \quad x \ge 0$$
and for $-1 \le a \le 1$ let
$$f_a(x) = f_0(x)\{1 + a\sin(2\pi\log x)\}$$
To see that $f_a$ is a density and has the same moments as $f_0$, it suffices to show that
$$\int_0^\infty x^r f_0(x)\sin(2\pi\log x)\,dx = 0 \quad\text{for } r = 0, 1, 2, \ldots$$
Changing variables $x = \exp(s + r)$, $s = \log x - r$, $ds = dx/x$ the integral becomes
$$(2\pi)^{-1/2}\int_{-\infty}^\infty \exp(rs + r^2)\exp(-(s + r)^2/2)\sin(2\pi(s + r))\,ds = (2\pi)^{-1/2}\exp(r^2/2)\int_{-\infty}^\infty \exp(-s^2/2)\sin(2\pi s)\,ds = 0$$
The two equalities holding because r is an integer and the integrand is odd. From the
proof, it should be clear that we could let
$$g(x) = f_0(x)\left\{1 + \sum_{k=1}^\infty a_k\sin(k\pi\log x)\right\} \quad\text{if}\quad \sum_{k=1}^\infty |a_k| \le 1$$
to get a large family of densities having the same moments as the lognormal.
The moments of the lognormal are easy to compute. Recall that if $\chi$ has the
standard normal distribution, then Exercise 1.1.11 implies $\exp(\chi)$ has the lognormal
distribution.
$$EX^n = E\exp(n\chi) = \int e^{nx}(2\pi)^{-1/2}e^{-x^2/2}\,dx = e^{n^2/2}\int (2\pi)^{-1/2}e^{-(x - n)^2/2}\,dx = \exp(n^2/2)$$
since the last integrand is the density of the normal with mean n and variance 1.
Somewhat remarkably there is a family of discrete random variables with these moments. Let $a > 0$ and
$$P(Y_a = ae^k) = a^{-k}\exp(-k^2/2)/c_a \quad\text{for } k \in \mathbf{Z}$$
where $c_a$ is chosen to make the total mass 1.
$$\exp(-n^2/2)\,EY_a^n = \exp(-n^2/2)\sum_k (ae^k)^n a^{-k}\exp(-k^2/2)/c_a = \sum_k a^{-(k - n)}\exp(-(k - n)^2/2)/c_a = 1$$
by the definition of $c_a$.
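The claim that the discrete variables $Y_a$ share the lognormal moments $\exp(n^2/2)$ can be checked directly by summing the series. The sketch below is illustrative only: the function name `discrete_moment` and the truncation level `kmax` are assumptions made here, chosen so that the omitted terms are negligible for the small n and the values of a tried.

```python
import math

def discrete_moment(a, n, kmax=60):
    """n-th moment of Y_a, where P(Y_a = a e^k) is proportional to a^{-k} exp(-k^2/2), k in Z.

    The sum over k is truncated to |k| <= kmax (an assumption of this sketch).
    """
    ks = range(-kmax, kmax + 1)
    weights = [a ** (-k) * math.exp(-k * k / 2) for k in ks]
    c_a = sum(weights)  # normalizing constant c_a
    return sum((a * math.exp(k)) ** n * w for k, w in zip(ks, weights)) / c_a

for a in (0.5, 1.0, 3.0):
    for n in range(0, 5):
        # The last two columns should agree: E Y_a^n versus exp(n^2 / 2).
        print(a, n, discrete_moment(a, n), math.exp(n * n / 2))
```

The agreement for several values of a illustrates that the moments $\exp(n^2/2)$ do not determine the distribution.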
The lognormal density decays like $\exp(-(\log x)^2/2)$ as $|x| \to \infty$. The next counterexample has more rapid decay. Since the exponential distribution, $e^{-x}$ for $x \ge 0$,
is determined by its moments (see Exercise 2.3.28 below) we cannot hope to do much
better than this.
Counterexample 2. Let $\lambda \in (0, 1)$ and for $-1 \le a \le 1$ let
$$f_{a,\lambda}(x) = c_\lambda\exp(-|x|^\lambda)\{1 + a\sin(\beta|x|^\lambda\,\mathrm{sgn}(x))\}$$
where $\beta = \tan(\lambda\pi/2)$ and $1/c_\lambda = \int \exp(-|x|^\lambda)\,dx$. To prove that these are density
functions and that for a fixed value of $\lambda$ they have the same moments, it suffices to
show
$$\int x^n\exp(-|x|^\lambda)\sin(\beta|x|^\lambda\,\mathrm{sgn}(x))\,dx = 0 \quad\text{for } n = 0, 1, 2, \ldots$$
This is clear for even n since the integrand is odd. To prove the result for odd n, it
suffices to integrate over $[0, \infty)$. Using the identity
$$\int_0^\infty t^{p-1}e^{-qt}\,dt = \Gamma(p)/q^p \quad\text{when } \mathrm{Re}\,q > 0$$
with $p = (n + 1)/\lambda$, $q = 1 + \beta i$, and changing variables $t = x^\lambda$, we get
$$\Gamma((n + 1)/\lambda)/(1 + \beta i)^{(n+1)/\lambda} = \int_0^\infty x^{\lambda\{(n+1)/\lambda - 1\}}\exp(-(1 + \beta i)x^\lambda)\,\lambda x^{\lambda - 1}\,dx$$
$$= \lambda\int_0^\infty x^n\exp(-x^\lambda)\cos(\beta x^\lambda)\,dx - i\lambda\int_0^\infty x^n\exp(-x^\lambda)\sin(\beta x^\lambda)\,dx$$
Since $\beta = \tan(\lambda\pi/2)$
$$(1 + \beta i)^{(n+1)/\lambda} = (\cos\lambda\pi/2)^{-(n+1)/\lambda}(\exp(i\lambda\pi/2))^{(n+1)/\lambda}$$
The right-hand side is real since $\lambda < 1$ and $(n + 1)$ is even, so
$$\int_0^\infty x^n\exp(-x^\lambda)\sin(\beta x^\lambda)\,dx = 0$$
A useful sufficient condition for a distribution to be determined by its moments is
Theorem 2.3.11. If $\limsup_{k \to \infty} \mu_{2k}^{1/2k}/2k = r < \infty$ then there is at most one d.f. F
with $\mu_k = \int x^k\,dF(x)$ for all positive integers k.
Remark. This is slightly stronger than Carleman's condition
$$\sum_{k=1}^\infty 1/\mu_{2k}^{1/2k} = \infty$$
which is also sufficient for the conclusion of Theorem 2.3.11.
Proof. Let F be any d.f. with the moments $\mu_k$ and let $\nu_k = \int |x|^k\,dF(x)$. The Cauchy-Schwarz
inequality implies $\nu_{2k+1}^2 \le \mu_{2k}\mu_{2k+2}$ so
$$\limsup_{k \to \infty}\,(\nu_k^{1/k})/k = r < \infty$$
Taking $x = tX$ in Lemma 2.3.7 and multiplying by $e^{i\theta X}$, we have
$$\left|e^{i\theta X}\left(e^{itX} - \sum_{m=0}^{n-1}\frac{(itX)^m}{m!}\right)\right| \le \frac{|tX|^n}{n!}$$
Taking expected values and using Exercise 2.3.14 gives
$$\left|\varphi(\theta + t) - \varphi(\theta) - t\varphi'(\theta) - \ldots - \frac{t^{n-1}}{(n - 1)!}\varphi^{(n-1)}(\theta)\right| \le \frac{|t|^n}{n!}\nu_n$$
Using the last result, the fact that $\nu_k \le (r + \epsilon)^k k^k$ for large k, and the trivial bound
$e^k \ge k^k/k!$ (expand the left-hand side in its power series), we see that for any $\theta$
$$(*)\qquad \varphi(\theta + t) = \varphi(\theta) + \sum_{m=1}^\infty \frac{t^m}{m!}\varphi^{(m)}(\theta) \quad\text{for } |t| < 1/er$$
Let G be another distribution with the given moments and $\psi$ its ch.f. Since $\varphi(0) = \psi(0) = 1$,
it follows from $(*)$ and induction that $\varphi(t) = \psi(t)$ for $|t| \le k/3r$ for all k,
so the two ch.f.'s coincide and the distributions are equal.
Combining Theorem 2.3.11 with the discussion that began our consideration of
the moment problem, we have
Theorem 2.3.12. Suppose $\int x^k\,dF_n(x)$ has a limit $\mu_k$ for each k and
$$\limsup_{k \to \infty}\,\mu_{2k}^{1/2k}/2k < \infty$$
then $F_n$ converges weakly to the unique distribution with these moments.
Exercise 2.3.27. Let $G(x) = P(|X| < x)$, $\lambda = \sup\{x : G(x) < 1\}$, and $\nu_k = E|X|^k$.
Show that $\nu_k^{1/k} \to \lambda$, so the assumption of Theorem 2.3.12 holds if $\lambda < \infty$.
Exercise 2.3.28. Suppose $|X|$ has density $Cx^\alpha\exp(-x^\lambda)$ on $(0, \infty)$. Changing variables $y = x^\lambda$, $dy = \lambda x^{\lambda - 1}\,dx$
$$E|X|^n = \int_0^\infty C\lambda^{-1}y^{(n+\alpha)/\lambda}\exp(-y)\,y^{1/\lambda - 1}\,dy = C\lambda^{-1}\Gamma((n + \alpha + 1)/\lambda)$$
Use the identity $\Gamma(x + 1) = x\Gamma(x)$ for $x \ge 0$ to conclude that the assumption of
Theorem 2.3.12 is satisfied for $\lambda \ge 1$ but not for $\lambda < 1$. This shows the normal
($\lambda = 2$) and gamma ($\lambda = 1$) distributions are determined by their moments.
Our results so far have been for the so-called Hamburger moment problem.
If we assume a priori that the distribution is concentrated on $[0, \infty)$, we have the
Stieltjes moment problem. There is a 1-1 correspondence between $X \ge 0$ and
symmetric distributions on $\mathbf{R}$ given by $X \to \xi X^2$ where $\xi \in \{-1, 1\}$ is independent
of X and takes its two values with equal probability. From this we see that
$$\limsup_{k \to \infty}\,\nu_k^{1/2k}/2k < \infty$$
is sufficient for there to be a unique distribution on $[0, \infty)$ with the given moments.
The next example shows that for nonnegative random variables, the last result is close
to the best possible.
Counterexample 3. Let $\lambda \in (0, 1/2)$, $\beta = \tan(\lambda\pi)$, $-1 \le a \le 1$ and
$$f_a(x) = c_\lambda\exp(-x^\lambda)(1 + a\sin(\beta x^\lambda)) \quad\text{for } x \ge 0$$
where $1/c_\lambda = \int_0^\infty \exp(-x^\lambda)\,dx$.
By imitating the calculations in Counterexample 2, it is easy to see that the $f_a$ are
probability densities that have the same moments. This example seems to be due to
Stoyanov (1987) p. 92–93. The special case $\lambda = 1/4$ is widely known.

2.4 Central Limit Theorems

We are now ready for the main business of the chapter. We will first prove the central
limit theorem for i.i.d. sequences.

2.4.1 i.i.d. Sequences

Theorem 2.4.1. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = \mu$, $\mathrm{var}(X_i) = \sigma^2 \in (0, \infty)$. If
$S_n = X_1 + \cdots + X_n$ then
$$(S_n - n\mu)/\sigma n^{1/2} \Rightarrow \chi$$
where $\chi$ has the standard normal distribution.
This notation is nonstandard but convenient. To see the logic note that the square
of a normal has a chi-squared distribution.
Proof. By considering $X_i' = X_i - \mu$, it suffices to prove the result when $\mu = 0$. From
Theorem 2.3.8
$$\varphi(t) = E\exp(itX_1) = 1 - \frac{\sigma^2t^2}{2} + o(t^2)$$
so
$$E\exp(itS_n/\sigma n^{1/2}) = \left(1 - \frac{t^2}{2n} + o(n^{-1})\right)^n$$
From Lemma 2.1.1 it should be clear that the last quantity $\to \exp(-t^2/2)$ as $n \to \infty$,
which with Theorem 2.3.6 completes the proof. However, Lemma 2.1.1 is a fact about
real numbers, so we need to extend it to the complex case to complete the proof.
Theorem 2.4.2. If $c_n \to c \in \mathbf{C}$ then $(1 + c_n/n)^n \to e^c$.
Proof. The proof is based on two simple facts:
Lemma 2.4.3. Let $z_1, \ldots, z_n$ and $w_1, \ldots, w_n$ be complex numbers of modulus $\le \theta$.
Then
$$\left|\prod_{m=1}^n z_m - \prod_{m=1}^n w_m\right| \le \theta^{n-1}\sum_{m=1}^n |z_m - w_m|$$
Proof. The result is true for $n = 1$. To prove it for $n > 1$ observe that
$$\left|\prod_{m=1}^n z_m - \prod_{m=1}^n w_m\right| \le \left|z_1\prod_{m=2}^n z_m - z_1\prod_{m=2}^n w_m\right| + \left|z_1\prod_{m=2}^n w_m - w_1\prod_{m=2}^n w_m\right| \le \theta\left|\prod_{m=2}^n z_m - \prod_{m=2}^n w_m\right| + \theta^{n-1}|z_1 - w_1|$$
and use induction.
Lemma 2.4.4. If b is a complex number with $|b| \le 1$ then $|e^b - (1 + b)| \le |b|^2$.
Proof. $e^b - (1 + b) = b^2/2! + b^3/3! + b^4/4! + \ldots$ so if $|b| \le 1$ then
$$|e^b - (1 + b)| \le \frac{|b|^2}{2}(1 + 1/2 + 1/2^2 + \ldots) = |b|^2$$
Proof of Theorem 2.4.2. Let $z_m = (1 + c_n/n)$, $w_m = \exp(c_n/n)$, and $\gamma > |c|$. For
large n, $|c_n| < \gamma$. Since $1 + \gamma/n \le \exp(\gamma/n)$, it follows from Lemmas 2.4.3 and 2.4.4
that
$$|(1 + c_n/n)^n - e^{c_n}| \le \left(e^{\gamma/n}\right)^{n-1} n\left|\frac{c_n}{n}\right|^2 \le e^\gamma\frac{\gamma^2}{n} \to 0$$
as $n \to \infty$.
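The rate in the proof of Theorem 2.4.2 can be observed numerically. The sketch below is an illustration, not the author's computation: it takes the constant sequence $c_n = c$ (so $\gamma = |c|$ is admissible once $|c|/n \le 1$), and the function names `gap` and `proof_bound` are hypothetical.

```python
import cmath
import math

def gap(c, n):
    """|(1 + c/n)^n - e^c| for a complex number c."""
    return abs((1 + c / n) ** n - cmath.exp(c))

def proof_bound(c, n):
    """e^gamma * gamma^2 / n with gamma = |c|, the bound from the proof of Theorem 2.4.2.

    With the constant sequence c_n = c this is valid once |c| / n <= 1,
    which is what Lemma 2.4.4 requires.
    """
    gamma = abs(c)
    return math.exp(gamma) * gamma ** 2 / n

c = -0.5 + 2j  # arbitrary complex test value (an assumption of this sketch)
for n in (10, 100, 1000, 10000):
    # The gap should shrink roughly like 1/n and stay below the bound.
    print(n, gap(c, n), proof_bound(c, n))
```

The bound is far from sharp, but it decays at the same 1/n rate as the actual gap, which is all the proof needs.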
To get a feel for what the central limit theorem says, we will look at some concrete
cases.
Example 2.4.1. Roulette. A roulette wheel has slots numbered 1–36 (18 red and
18 black) and two slots numbered 0 and 00 that are painted green. Players can bet
$1 that the ball will land in a red (or black) slot and win $1 if it does. If we let Xi be
the winnings on the ith play then X1 , X2 , . . . are i.i.d. with P (Xi = 1) = 18/38 and
P (Xi = −1) = 20/38.
$EX_i = -1/19$ and $\operatorname{var}(X_i) = EX_i^2 - (EX_i)^2 = 1 - (1/19)^2 = 0.9972$. We are interested in
$$P(S_n \ge 0) = P\left( \frac{S_n - n\mu}{\sigma\sqrt{n}} \ge \frac{-n\mu}{\sigma\sqrt{n}} \right)$$
Taking $n = 361 = 19^2$ and replacing $\sigma$ by 1 to keep computations simple,
$$\frac{-n\mu}{\sigma\sqrt{n}} = \frac{361 \cdot (1/19)}{\sqrt{361}} = 1$$
So the central limit theorem and our table of the normal distribution in the back of
the book tell us that
$$P(S_n \ge 0) \approx P(\chi \ge 1) = 1 - 0.8413 = 0.1587$$
In words, after 361 spins of the roulette wheel the casino will have won $19 of your
money on the average, but there is a probability of about 0.16 that you will be ahead.
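As a rough check on the roulette computation, the exact binomial probability can be compared with the normal approximation. This is a sketch under our own naming (`prob_ahead_or_even`, `std_normal_cdf` are hypothetical helpers); it uses exact integer arithmetic for the binomial sum and the error function for the normal table value.

```python
from math import comb, erf, sqrt

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_ahead_or_even(n: int) -> float:
    """Exact P(S_n >= 0) when each $1 bet wins w.p. 18/38 and loses w.p. 20/38."""
    # S_n = 2W - n where W is the number of wins, so S_n >= 0 iff W >= ceil(n/2).
    first_k = (n + 1) // 2
    numerator = sum(comb(n, k) * 18**k * 20**(n - k) for k in range(first_k, n + 1))
    return numerator / 38**n

n = 361
exact = prob_ahead_or_even(n)
approx = 1.0 - std_normal_cdf(1.0)   # the value from the text, with sigma replaced by 1
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```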
Example 2.4.2. Coin flips. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i = 0) = P(X_i = 1) = 1/2$.
If $X_i = 1$ indicates that a heads occurred on the $i$th toss then $S_n = X_1 + \cdots + X_n$
is the total number of heads at time $n$.
$EX_i = 1/2$ and $\operatorname{var}(X_i) = EX_i^2 - (EX_i)^2 = 1/2 - 1/4 = 1/4$.
So the central limit theorem tells us $(S_n - n/2)/\sqrt{n/4} \Rightarrow \chi$. Our table of the normal
distribution tells us that
$$P(\chi > 2) = 1 - 0.9773 = 0.0227$$
so $P(|\chi| \le 2) = 1 - 2(0.0227) = 0.9546$, or plugging into the central limit theorem
$$.95 \approx P\left((S_n - n/2)/\sqrt{n/4} \in [-2, 2]\right) = P\left(S_n - n/2 \in [-\sqrt{n}, \sqrt{n}]\right)$$
Taking $n = 10{,}000$ this says that 95% of the time the number of heads will be between
4900 and 5100.

Example 2.4.3. Normal approximation to the binomial. Let $X_1, X_2, \ldots$ and
$S_n$ be as in the previous example. To estimate $P(S_{16} = 8)$ using the central limit
theorem, we regard 8 as the interval $[7.5, 8.5]$. Since $\mu = 1/2$, and $\sigma\sqrt{n} = 2$ for $n = 16$
$$P(|S_{16} - 8| \le 0.5) = P\left( \frac{|S_n - n\mu|}{\sigma\sqrt{n}} \le 0.25 \right) \approx P(|\chi| \le 0.25) = 2(0.5987 - 0.5) = 0.1974$$
Even though $n$ is small, this agrees well with the exact probability
$$\binom{16}{8} 2^{-16} = \frac{13 \cdot 11 \cdot 10 \cdot 9}{65{,}536} = 0.1964.$$
The computations above motivate the histogram correction, which is important
in using the normal approximation for small n. For example, if we are going to
approximate P (S16 ≤ 11), then we regard this probability as P (S16 ≤ 11.5). One
obvious reason for doing this is to get the same answer if we regard P (S16 ≤ 11) =
1 − P (S16 ≥ 12).
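The numbers in Example 2.4.3 and the effect of the histogram correction can be reproduced with a few lines. The sketch below uses our own helper name `std_normal_cdf`; the uncorrected values are computed only to show how the two ways of writing the same event disagree without the correction.

```python
from math import comb, erf, sqrt

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p = 16, 0.5
mu, sd = n * p, sqrt(n * p * (1 - p))          # mean 8, standard deviation 2

# P(S_16 = 8): exact value and the approximation over the interval [7.5, 8.5]
exact = comb(16, 8) / 2**16
approx = std_normal_cdf((8.5 - mu) / sd) - std_normal_cdf((7.5 - mu) / sd)

# P(S_16 <= 11) written two ways, first without the correction ...
naive_le_11 = std_normal_cdf((11 - mu) / sd)
naive_via_ge_12 = 1.0 - (1.0 - std_normal_cdf((12 - mu) / sd))
# ... and then with it: both descriptions use the cut point 11.5
corrected = std_normal_cdf((11.5 - mu) / sd)

print(f"P(S16 = 8): exact {exact:.4f}, approx {approx:.4f}")
print(f"P(S16 <= 11): naive {naive_le_11:.4f} vs {naive_via_ge_12:.4f}, corrected {corrected:.4f}")
```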
Exercise 2.4.1. Suppose you roll a die 180 times. Use the normal approximation
(with the histogram correction) to estimate the probability you will get fewer than 25
sixes.
Example 2.4.4. Normal approximation to the Poisson. Let $Z_\lambda$ have a Poisson
distribution with mean $\lambda$. If $X_1, X_2, \ldots$ are independent and have Poisson distributions
with mean 1, then $S_n = X_1 + \cdots + X_n$ has a Poisson distribution with mean $n$.
Since $\operatorname{var}(X_i) = 1$, the central limit theorem implies:
$$(S_n - n)/n^{1/2} \Rightarrow \chi \quad \text{as } n \to \infty$$
To deal with values of $\lambda$ that are not integers, let $N_1, N_2, N_3$ be independent
Poisson with means $[\lambda]$, $\lambda - [\lambda]$, and $[\lambda] + 1 - \lambda$. If we let $S_{[\lambda]} = N_1$, $Z_\lambda = N_1 + N_2$
and $S_{[\lambda]+1} = N_1 + N_2 + N_3$ then $S_{[\lambda]} \le Z_\lambda \le S_{[\lambda]+1}$ and using the limit theorem for
the $S_n$ it follows that
$$(Z_\lambda - \lambda)/\lambda^{1/2} \Rightarrow \chi \quad \text{as } \lambda \to \infty$$
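To get a feel for the quality of the Poisson approximation at a noninteger mean, one can compare the exact distribution function with the normal one. The value $\lambda = 400.5$ and the helper names below are assumptions made for illustration.

```python
from math import erf, exp, floor, lgamma, log, sqrt

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def poisson_cdf(k: int, lam: float) -> float:
    """P(Z <= k) for Z ~ Poisson(lam); each pmf term is computed on the log scale."""
    return sum(exp(j * log(lam) - lam - lgamma(j + 1)) for j in range(k + 1))

def compare(lam: float, x: float) -> tuple[float, float]:
    """Return (P((Z - lam)/sqrt(lam) <= x), P(chi <= x))."""
    k = floor(lam + x * sqrt(lam))   # Z is integer valued, so round the cut point down
    return poisson_cdf(k, lam), std_normal_cdf(x)

exact, normal = compare(400.5, 1.0)
print(f"Poisson: {exact:.4f}, normal: {normal:.4f}")
```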
Example 2.4.5. Pairwise independence is good enough for the strong law of large
numbers (see Theorem 1.7.1). It is not good enough for the central limit theorem.
Let $\xi_1, \xi_2, \ldots$ be i.i.d. with $P(\xi_i = 1) = P(\xi_i = -1) = 1/2$. We will arrange things so
that for $n \ge 1$
$$S_{2^n} = \xi_1 (1 + \xi_2) \cdots (1 + \xi_{n+1}) = \begin{cases} \pm 2^n & \text{with prob } 2^{-n-1} \\ 0 & \text{with prob } 1 - 2^{-n} \end{cases}$$
To do this we let $X_1 = \xi_1$, $X_2 = \xi_1 \xi_2$, and for $m = 2^{n-1} + j$, $0 < j \le 2^{n-1}$, $n \ge 2$
let $X_m = X_j \xi_{n+1}$. Each $X_m$ is a product of a different set of $\xi_j$'s so they are pairwise
independent.
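The construction can be verified by brute force for a small case. The sketch below enumerates all sign patterns for $n = 3$ (our choice), builds $X_1, \ldots, X_8$ by the recursion, and tabulates the law of $S_8$ together with the cross moments $E X_a X_b$. For mean-zero $\pm 1$ variables, $E X_a X_b = 0$ is equivalent to independence of the pair. The function name `build_x` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def build_x(xi, n):
    """Return [X_1, ..., X_{2^n}] built from the signs xi = (xi_1, ..., xi_{n+1})."""
    x = [None, xi[0], xi[0] * xi[1]]        # 1-indexed: X_1 = xi_1, X_2 = xi_1 * xi_2
    for level in range(2, n + 1):           # level plays the role of n in the text
        half = 2 ** (level - 1)
        for j in range(1, half + 1):        # defines X_m for m = 2^(level-1) + j
            x.append(x[j] * xi[level])      # xi[level] is xi_{level+1} (0-based tuple)
    return x[1:]

n = 3
outcomes = list(product((1, -1), repeat=n + 1))   # all sign patterns, equally likely
weight = Fraction(1, len(outcomes))

dist = Counter()     # law of S_{2^n}
cross = Counter()    # E[X_a X_b] for a < b
for xi in outcomes:
    x = build_x(xi, n)
    dist[sum(x)] += weight
    for a in range(len(x)):
        for b in range(a + 1, len(x)):
            cross[(a, b)] += weight * x[a] * x[b]

print(dict(dist))
print(all(v == 0 for v in cross.values()))
```

The law of $S_8$ puts mass only on $0$ and $\pm 8$, so no normalization of $S_{2^n}$ can converge to a normal limit.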
Exercises
2.4.2. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$, $0 < \operatorname{var}(X_i) < \infty$, and let $S_n =
X_1 + \cdots + X_n$. (a) Use the central limit theorem and Kolmogorov's zero-one law to
conclude that $\limsup S_n/\sqrt{n} = \infty$ a.s. (b) Use an argument by contradiction to show
that $S_n/\sqrt{n}$ does not converge in probability. Hint: Consider $n = m!$.
2.4.3. Let $X_1, X_2, \ldots$ be i.i.d. and let $S_n = X_1 + \cdots + X_n$. Assume that $S_n/\sqrt{n} \Rightarrow$
a limit and conclude that $EX_i^2 < \infty$. Sketch: Suppose $EX_i^2 = \infty$. Let $X_1', X_2', \ldots$
be an independent copy of the original sequence. Let $Y_i = X_i - X_i'$, $U_i = Y_i 1_{(|Y_i| \le A)}$,
$V_i = Y_i 1_{(|Y_i| > A)}$, and observe that for any $K$
$$P\left( \sum_{m=1}^n Y_m \ge K\sqrt{n} \right) \ge P\left( \sum_{m=1}^n U_m \ge K\sqrt{n},\ \sum_{m=1}^n V_m \ge 0 \right) \ge \frac{1}{2} P\left( \sum_{m=1}^n U_m \ge K\sqrt{n} \right) \ge \frac{1}{5}$$
for large $n$ if $A$ is large enough. Since $K$ is arbitrary, this is a contradiction.
2.4.4. Let $X_1, X_2, \ldots$ be i.i.d. with $X_i \ge 0$, $EX_i = 1$, and $\operatorname{var}(X_i) = \sigma^2 \in (0, \infty)$.
Show that $2(\sqrt{S_n} - \sqrt{n}) \Rightarrow \sigma\chi$.
2.4.5. Self-normalized sums. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$ and $EX_i^2 =
\sigma^2 \in (0, \infty)$. Then
$$\sum_{m=1}^n X_m \Big/ \left( \sum_{m=1}^n X_m^2 \right)^{1/2} \Rightarrow \chi$$
2.4.6. Random index central limit theorem. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$
and $EX_i^2 = \sigma^2 \in (0, \infty)$, and let $S_n = X_1 + \cdots + X_n$. Let $N_n$ be a sequence
of nonnegative integer-valued random variables and $a_n$ a sequence of integers with
$a_n \to \infty$ and $N_n/a_n \to 1$ in probability. Show that
$$S_{N_n}/\sigma\sqrt{a_n} \Rightarrow \chi$$
Hint: Use Kolmogorov's inequality (Theorem 1.8.2) to conclude that if $Y_n = S_{N_n}/\sigma\sqrt{a_n}$
and $Z_n = S_{a_n}/\sigma\sqrt{a_n}$, then $Y_n - Z_n \to 0$ in probability.
2.4.7. A central limit theorem in renewal theory. Let $Y_1, Y_2, \ldots$ be i.i.d. positive
random variables with $EY_i = \mu$ and $\operatorname{var}(Y_i) = \sigma^2 \in (0, \infty)$. Let $S_n = Y_1 + \cdots + Y_n$
and $N_t = \sup\{m : S_m \le t\}$. Apply the previous exercise to $X_i = Y_i - \mu$ to prove that
as $t \to \infty$
$$(\mu N_t - t)/(\sigma^2 t/\mu)^{1/2} \Rightarrow \chi$$
2.4.8. A second proof of the renewal CLT. Let $Y_1, Y_2, \ldots$, $S_n$, and $N_t$ be as in
the last exercise. Let $u = [t/\mu]$, $D_t = S_u - t$. Use Kolmogorov's inequality to show
$$P\left( |S_{u+m} - (S_u + m\mu)| > t^{2/5} \text{ for some } m \in [-t^{3/5}, t^{3/5}] \right) \to 0 \quad \text{as } t \to \infty$$
Conclude $|N_t - (t - D_t)/\mu|/t^{1/2} \to 0$ in probability and then obtain the result in the
previous exercise.
Our next step is to generalize the central limit theorem to:

2.4.2 Triangular Arrays

Theorem 2.4.5. The Lindeberg–Feller theorem. For each $n$, let $X_{n,m}$, $1 \le m \le n$,
be independent random variables with $EX_{n,m} = 0$. Suppose
(i) $\sum_{m=1}^n EX_{n,m}^2 \to \sigma^2 > 0$
(ii) For all $\epsilon > 0$, $\lim_{n\to\infty} \sum_{m=1}^n E(|X_{n,m}|^2; |X_{n,m}| > \epsilon) = 0$.
Then $S_n = X_{n,1} + \cdots + X_{n,n} \Rightarrow \sigma\chi$ as $n \to \infty$.
Remarks. In words, the theorem says that a sum of a large number of small independent
effects has approximately a normal distribution. To see that Theorem
2.4.5 contains our first central limit theorem, let $Y_1, Y_2, \ldots$ be i.i.d. with $EY_i = 0$ and
$EY_i^2 = \sigma^2 \in (0, \infty)$, and let $X_{n,m} = Y_m/n^{1/2}$. Then $\sum_{m=1}^n EX_{n,m}^2 = \sigma^2$ and if $\epsilon > 0$
$$\sum_{m=1}^n E(|X_{n,m}|^2; |X_{n,m}| > \epsilon) = nE(|Y_1/n^{1/2}|^2; |Y_1/n^{1/2}| > \epsilon) = E(|Y_1|^2; |Y_1| > \epsilon n^{1/2}) \to 0$$
by the dominated convergence theorem since $EY_1^2 < \infty$.
Proof. Let $\varphi_{n,m}(t) = E\exp(itX_{n,m})$, $\sigma_{n,m}^2 = EX_{n,m}^2$. By Theorem 2.3.6, it suffices
to show that
$$\prod_{m=1}^n \varphi_{n,m}(t) \to \exp(-t^2\sigma^2/2)$$
Let $z_{n,m} = \varphi_{n,m}(t)$ and $w_{n,m} = (1 - t^2\sigma_{n,m}^2/2)$. By (2.3.3)
$$|z_{n,m} - w_{n,m}| \le E(|tX_{n,m}|^3 \wedge 2|tX_{n,m}|^2)$$
$$\le E(|tX_{n,m}|^3; |X_{n,m}| \le \epsilon) + E(2|tX_{n,m}|^2; |X_{n,m}| > \epsilon)$$
$$\le \epsilon |t|^3 E(|X_{n,m}|^2; |X_{n,m}| \le \epsilon) + 2t^2 E(|X_{n,m}|^2; |X_{n,m}| > \epsilon)$$
Summing $m = 1$ to $n$, letting $n \to \infty$, and using (i) and (ii) gives
$$\limsup_{n\to\infty} \sum_{m=1}^n |z_{n,m} - w_{n,m}| \le \epsilon |t|^3 \sigma^2$$
Since $\epsilon > 0$ is arbitrary, it follows that the sequence converges to 0. Our next step is
to use Lemma 2.4.3 with $\theta = 1$ to get
$$\left| \prod_{m=1}^n \varphi_{n,m}(t) - \prod_{m=1}^n (1 - t^2\sigma_{n,m}^2/2) \right| \to 0$$
To check the hypotheses of Lemma 2.4.3, note that since $\varphi_{n,m}$ is a ch.f. $|\varphi_{n,m}(t)| \le 1$
for all $n, m$. For the terms in the second product we note that
$$\sigma_{n,m}^2 \le \epsilon^2 + E(|X_{n,m}|^2; |X_{n,m}| > \epsilon)$$
and $\epsilon$ is arbitrary so (ii) implies $\sup_m \sigma_{n,m}^2 \to 0$ and thus if $n$ is large
$1 \ge 1 - t^2\sigma_{n,m}^2/2 > -1$ for all $m$.
To complete the proof now, we apply Exercise 2.1.1 with $c_{m,n} = -t^2\sigma_{n,m}^2/2$. We
have just shown $\sup_m \sigma_{n,m}^2 \to 0$. (i) implies
$$\sum_{m=1}^n c_{m,n} \to -\sigma^2 t^2/2$$
so $\prod_{m=1}^n (1 - t^2\sigma_{n,m}^2/2) \to \exp(-t^2\sigma^2/2)$ and the proof is complete.

Example 2.4.6. Cycles in a random permutation and record values. Continuing
the analysis of Examples 1.5.4 and 1.6.2, let $Y_1, Y_2, \ldots$ be independent with
$P(Y_m = 1) = 1/m$, and $P(Y_m = 0) = 1 - 1/m$. $EY_m = 1/m$ and $\operatorname{var}(Y_m) =
1/m - 1/m^2$. So if $S_n = Y_1 + \cdots + Y_n$ then $ES_n \sim \log n$ and $\operatorname{var}(S_n) \sim \log n$. Let
$$X_{n,m} = (Y_m - 1/m)/(\log n)^{1/2}$$
$EX_{n,m} = 0$, $\sum_{m=1}^n EX_{n,m}^2 \to 1$, and for any $\epsilon > 0$
$$\sum_{m=1}^n E(|X_{n,m}|^2; |X_{n,m}| > \epsilon) \to 0$$
since the sum is 0 as soon as $(\log n)^{-1/2} < \epsilon$. Applying Theorem 2.4.5 now gives
$$(\log n)^{-1/2} \left( S_n - \sum_{m=1}^n \frac{1}{m} \right) \Rightarrow \chi$$
Observing that
$$\sum_{m=1}^{n-1} \frac{1}{m} \ge \int_1^n x^{-1}\,dx = \log n \ge \sum_{m=2}^n \frac{1}{m}$$
shows $\left| \log n - \sum_{m=1}^n 1/m \right| \le 1$ and the conclusion can be written as
$$(S_n - \log n)/(\log n)^{1/2} \Rightarrow \chi$$

Example 2.4.7. The converse of the three series theorem. Recall the set up of
Theorem 1.8.4. Let $X_1, X_2, \ldots$ be independent, let $A > 0$, and let $Y_m = X_m 1_{(|X_m| \le A)}$.
In order that $\sum_{n=1}^\infty X_n$ converges (i.e., $\lim_{N\to\infty} \sum_{n=1}^N X_n$ exists) it is necessary that:
$$\text{(i) } \sum_{n=1}^\infty P(|X_n| > A) < \infty, \quad \text{(ii) } \sum_{n=1}^\infty EY_n \text{ converges, and (iii) } \sum_{n=1}^\infty \operatorname{var}(Y_n) < \infty$$
Proof. The necessity of the first condition is clear. For if that sum is infinite, $P(|X_n| >
A \text{ i.o.}) > 0$ and $\lim_{n\to\infty} \sum_{m=1}^n X_m$ cannot exist. Suppose next that the sum in (i) is
finite but the sum in (iii) is infinite. Let
$$c_n = \sum_{m=1}^n \operatorname{var}(Y_m) \quad \text{and} \quad X_{n,m} = (Y_m - EY_m)/c_n^{1/2}$$
$EX_{n,m} = 0$, $\sum_{m=1}^n EX_{n,m}^2 = 1$, and for any $\epsilon > 0$
$$\sum_{m=1}^n E(|X_{n,m}|^2; |X_{n,m}| > \epsilon) \to 0$$
since the sum is 0 as soon as $2A/c_n^{1/2} < \epsilon$. Applying Theorem 2.4.5 now gives that if
$S_n = X_{n,1} + \cdots + X_{n,n}$ then $S_n \Rightarrow \chi$. Now
(i) if $\lim_{n\to\infty} \sum_{m=1}^n X_m$ exists, $\lim_{n\to\infty} \sum_{m=1}^n Y_m$ exists.
(ii) if we let $T_n = (\sum_{m \le n} Y_m)/c_n^{1/2}$ then $T_n \Rightarrow 0$.
The last two results and Exercise 2.2.13 imply $(S_n - T_n) \Rightarrow \chi$. Since
$$S_n - T_n = -\sum_{m \le n} EY_m \Big/ c_n^{1/2}$$
is not random, this is absurd.
Finally, assume the series in (i) and (iii) are finite. Theorem 1.8.3 implies that
$\lim_{n\to\infty} \sum_{m=1}^n (Y_m - EY_m)$ exists, so if $\lim_{n\to\infty} \sum_{m=1}^n X_m$ and hence $\lim_{n\to\infty} \sum_{m=1}^n Y_m$
does, taking differences shows that (ii) holds.
Example 2.4.8. Infinite variance. Suppose $X_1, X_2, \ldots$ are i.i.d. and have $P(X_1 >
x) = P(X_1 < -x)$ and $P(|X_1| > x) = x^{-2}$ for $x \ge 1$.
$$E|X_1|^2 = \int_0^\infty 2xP(|X_1| > x)\,dx = \infty$$
but it turns out that when $S_n = X_1 + \cdots + X_n$ is suitably normalized it converges to
a normal distribution. Let
$$Y_{n,m} = X_m 1_{(|X_m| \le n^{1/2} \log\log n)}$$
The truncation level $c_n = n^{1/2} \log\log n$ is chosen large enough to make
$$\sum_{m=1}^n P(Y_{n,m} \ne X_m) \le nP(|X_1| > c_n) \to 0$$
However, we want the variance of $Y_{n,m}$ to be as small as possible, so we keep the
truncation close to the lowest possible level.
Our next step is to show $EY_{n,m}^2 \sim \log n$. For this we need upper and lower bounds.
Since $P(|Y_{n,m}| > x) \le P(|X_1| > x)$ and is 0 for $x > c_n$, we have
$$EY_{n,m}^2 \le \int_0^{c_n} 2yP(|X_1| > y)\,dy = 1 + \int_1^{c_n} 2/y\,dy$$
$$= 1 + 2\log c_n = 1 + \log n + 2\log\log\log n \sim \log n$$
In the other direction, we observe $P(|Y_{n,m}| > x) = P(|X_1| > x) - P(|X_1| > c_n)$ and
the right-hand side is $\ge (1 - (\log\log n)^{-2})P(|X_1| > x)$ when $x \le \sqrt{n}$ so
$$EY_{n,m}^2 \ge (1 - (\log\log n)^{-2}) \int_1^{\sqrt{n}} 2/y\,dy \sim \log n$$
If $S_n' = Y_{n,1} + \cdots + Y_{n,n}$ then $\operatorname{var}(S_n') \sim n\log n$, so we apply Theorem 2.4.5
to $X_{n,m} = Y_{n,m}/(n\log n)^{1/2}$. Things have been arranged so that (i) is satisfied.
Since $|Y_{n,m}| \le n^{1/2}\log\log n$, the sum in (ii) is 0 for large $n$, and it follows that
$S_n'/(n\log n)^{1/2} \Rightarrow \chi$. Since the choice of $c_n$ guarantees $P(S_n \ne S_n') \to 0$, the same
result holds for $S_n$.
Remark. In Section 2.6, we will see that if we replace $P(|X_1| > x) = x^{-2}$ in Example
2.4.8 by $P(|X_1| > x) = x^{-\alpha}$ where $0 < \alpha < 2$, then $S_n/n^{1/\alpha} \Rightarrow$ to a limit which is
not $\chi$. The last word on convergence to the normal distribution is the next result due
to Lévy.
Theorem 2.4.6. Let $X_1, X_2, \ldots$ be i.i.d. and $S_n = X_1 + \cdots + X_n$. In order that there
exist constants $a_n$ and $b_n > 0$ so that $(S_n - a_n)/b_n \Rightarrow \chi$, it is necessary and sufficient
that
$$y^2 P(|X_1| > y)/E(|X_1|^2; |X_1| \le y) \to 0.$$
A proof can be found in Gnedenko and Kolmogorov (1954), a reference that contains
the last word on many results about sums of independent random variables.
Exercises
In the next five problems $X_1, X_2, \ldots$ are independent and $S_n = X_1 + \cdots + X_n$.
2.4.9. Suppose $P(X_m = m) = P(X_m = -m) = m^{-2}/2$, and for $m \ge 2$
$$P(X_m = 1) = P(X_m = -1) = (1 - m^{-2})/2$$
Show that $\operatorname{var}(S_n)/n \to 2$ but $S_n/\sqrt{n} \Rightarrow \chi$. The trouble here is that $X_{n,m} = X_m/\sqrt{n}$
does not satisfy (ii) of Theorem 2.4.5.
2.4.10. Show that if $|X_i| \le M$ and $\sum_n \operatorname{var}(X_n) = \infty$ then $(S_n - ES_n)/\sqrt{\operatorname{var}(S_n)} \Rightarrow \chi$.
2.4.11. Suppose $EX_i = 0$, $EX_i^2 = 1$ and $E|X_i|^{2+\delta} \le C$ for some $0 < \delta, C < \infty$.
Show that $S_n/\sqrt{n} \Rightarrow \chi$.
2.4.12. Prove Lyapunov's Theorem. Let $\alpha_n = \{\operatorname{var}(S_n)\}^{1/2}$. If there is a $\delta > 0$
so that
$$\lim_{n\to\infty} \alpha_n^{-(2+\delta)} \sum_{m=1}^n E(|X_m - EX_m|^{2+\delta}) = 0$$
then $(S_n - ES_n)/\alpha_n \Rightarrow \chi$. Note that the previous exercise is a special case of this
result.
2.4.13. Suppose $P(X_j = j) = P(X_j = -j) = 1/2j^\beta$ and $P(X_j = 0) = 1 - j^{-\beta}$ where
$\beta > 0$. Show that (i) If $\beta > 1$ then $S_n \to S_\infty$ a.s. (ii) if $\beta < 1$ then $S_n/n^{(3-\beta)/2} \Rightarrow c\chi$.
(iii) if $\beta = 1$ then $S_n/n \Rightarrow \aleph$ where
$$E\exp(it\aleph) = \exp\left( -\int_0^1 x^{-1}(1 - \cos xt)\,dx \right)$$

2.4.3 Prime Divisors (Erdős–Kac)*

Our aim here is to prove that an integer picked at random from $\{1, 2, \ldots, n\}$ has about
$$\log\log n + \chi(\log\log n)^{1/2}$$
prime divisors. Since $\exp(e^4) = 5.15 \times 10^{23}$, this result does not apply to most numbers
we encounter in “everyday life.” The first step in deriving this result is to give a
Second proof of Theorem 2.4.5. The first step is to let
$$h_n(\epsilon) = \sum_{m=1}^n E(X_{n,m}^2; |X_{n,m}| > \epsilon)$$
and observe
Lemma 2.4.7. $h_n(\epsilon) \to 0$ for each fixed $\epsilon > 0$ so we can pick $\epsilon_n \to 0$ so that
$h_n(\epsilon_n) \to 0$.
Proof. Let $N_m$ be chosen so that $h_n(1/m) \le 1/m$ for $n \ge N_m$ and $m \to N_m$ is
increasing. Let $\epsilon_n = 1/m$ for $N_m \le n < N_{m+1}$, and $= 1$ for $n < N_1$. When
$N_m \le n < N_{m+1}$, $\epsilon_n = 1/m$, so $|h_n(\epsilon_n)| = |h_n(1/m)| \le 1/m$ and the desired result
follows.
Let $X_{n,m}' = X_{n,m} 1_{(|X_{n,m}| > \epsilon_n)}$, $Y_{n,m} = X_{n,m} 1_{(|X_{n,m}| \le \epsilon_n)}$, and $Z_{n,m} = Y_{n,m} -
EY_{n,m}$. Clearly $|Z_{n,m}| \le 2\epsilon_n$. Using $X_{n,m} = X_{n,m}' + Y_{n,m}$, $Z_{n,m} = Y_{n,m} - EY_{n,m}$,
$EY_{n,m} = -EX_{n,m}'$, the variance of the sum is the sum of the variances, and $\operatorname{var}(W) \le
EW^2$, we have
$$E\left( \sum_{m=1}^n X_{n,m} - \sum_{m=1}^n Z_{n,m} \right)^2 = E\left( \sum_{m=1}^n (X_{n,m}' - EX_{n,m}') \right)^2$$
$$= \sum_{m=1}^n E(X_{n,m}' - EX_{n,m}')^2 \le \sum_{m=1}^n E(X_{n,m}')^2 \to 0$$
as $n \to \infty$, by the choice of $\epsilon_n$.
Let $S_n = \sum_{m=1}^n X_{n,m}$ and $T_n = \sum_{m=1}^n Z_{n,m}$. The last computation shows $S_n -
T_n \to 0$ in $L^2$ and hence in probability by Lemma 1.5.2. Thus, by Exercise 2.2.13,
it suffices to show $T_n \Rightarrow \sigma\chi$. (i) implies $ES_n^2 \to \sigma^2$. We have just shown that
$E(S_n - T_n)^2 \to 0$, so the triangle inequality for the $L^2$ norm implies $ET_n^2 \to \sigma^2$. To
compute higher moments, we observe
$$T_n^r = \sum_{k=1}^r \sum_{r_i} \frac{r!}{r_1! \cdots r_k!} \frac{1}{k!} \sum_{i_j} Z_{n,i_1}^{r_1} \cdots Z_{n,i_k}^{r_k}$$
where $\sum_{r_i}$ extends over all $k$-tuples of positive integers with $r_1 + \cdots + r_k = r$ and
$\sum_{i_j}$ extends over all $k$-tuples of distinct integers with $1 \le i_j \le n$. If we let
$$A_n(r_1, \ldots, r_k) = \sum_{i_j} EZ_{n,i_1}^{r_1} \cdots EZ_{n,i_k}^{r_k}$$
then
$$ET_n^r = \sum_{k=1}^r \sum_{r_i} \frac{r!}{r_1! \cdots r_k!} \frac{1}{k!} A_n(r_1, \ldots, r_k)$$
To evaluate the limit of $ET_n^r$ we observe:
(a) If some $r_j = 1$, then $A_n(r_1, \ldots, r_k) = 0$ since $EZ_{n,i_j} = 0$.
(b) If all $r_j = 2$ then
$$\sum_{i_j} EZ_{n,i_1}^2 \cdots EZ_{n,i_k}^2 \le \left( \sum_{m=1}^n EZ_{n,m}^2 \right)^k \to \sigma^{2k}$$
To argue the other inequality, we note that for any $1 \le a < b \le k$ we can estimate
the sum over all the $i_1, \ldots, i_k$ with $i_a = i_b$ by replacing $EZ_{n,i_a}^2$ by $(2\epsilon_n)^2$ to get (the
factor $\binom{k}{2}$ giving the number of ways to pick $1 \le a < b \le k$)
$$\left( \sum_{m=1}^n EZ_{n,m}^2 \right)^k - \sum_{i_j} EZ_{n,i_1}^2 \cdots EZ_{n,i_k}^2 \le \binom{k}{2} (2\epsilon_n)^2 \left( \sum_{m=1}^n EZ_{n,m}^2 \right)^{k-1} \to 0$$
(c) If all the $r_i \ge 2$ but some $r_j > 2$ then using
$$E|Z_{n,i_j}|^{r_j} \le (2\epsilon_n)^{r_j - 2} EZ_{n,i_j}^2$$
we have
$$|A_n(r_1, \ldots, r_k)| \le \sum_{i_j} E|Z_{n,i_1}|^{r_1} \cdots E|Z_{n,i_k}|^{r_k} \le (2\epsilon_n)^{r-2k} A_n(2, \ldots, 2) \to 0$$
When $r$ is odd, some $r_j$ must be $= 1$ or $\ge 3$ so $ET_n^r \to 0$ by (a) and (c). If $r = 2k$ is
even, (a)–(c) imply
$$ET_n^r \to \frac{\sigma^{2k}(2k)!}{2^k k!} = E(\sigma\chi)^r$$
and the result follows from Theorem 2.3.12.
Turning to the result for prime divisors, let Pn denote the uniform distribution on
$\{1, \ldots, n\}$. If $P_\infty(A) \equiv \lim P_n(A)$ exists the limit is called the density of $A \subset \mathbb{Z}$. Let
$A_p$ be the set of integers divisible by $p$. Clearly, if $p$ is a prime $P_\infty(A_p) = 1/p$ and
$q \ne p$ is another prime
$$P_\infty(A_p \cap A_q) = 1/pq = P_\infty(A_p)P_\infty(A_q)$$
Even though $P_\infty$ is not a probability measure (since $P_\infty(\{i\}) = 0$ for all $i$), we can
interpret this as saying that the events of being divisible by p and q are independent.
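The approximate independence under $P_n$ is easy to see numerically. The following sketch counts multiples directly; the helper name `density_upto` and the values $n = 10^6$, $p = 3$, $q = 7$ are our own illustrative choices, and the identity "divisible by $p$ and $q$ iff divisible by $pq$" is used, which requires $p$ and $q$ to be distinct primes.

```python
def density_upto(n: int, d: int) -> float:
    """P_n(A_d): the fraction of {1, ..., n} divisible by d."""
    return (n // d) / n

n, p, q = 10**6, 3, 7
joint = density_upto(n, p * q)                       # P_n(A_p and A_q)
product_of_marginals = density_upto(n, p) * density_upto(n, q)

print(joint, product_of_marginals, 1 / (p * q))
```

The two quantities differ by at most a few multiples of $1/n$, which is the sense in which the events become independent in the limit.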
Let $\delta_p(n) = 1$ if $n$ is divisible by $p$, and $= 0$ otherwise, and
$$g(n) = \sum_{p \le n} \delta_p(n) \text{ be the number of prime divisors of } n$$
this and future sums on $p$ being over the primes. Intuitively, the $\delta_p(n)$ behave like
$X_p$ that are independent with
$$P(X_p = 1) = 1/p \quad \text{and} \quad P(X_p = 0) = 1 - 1/p$$
The mean and variance of $\sum_{p \le n} X_p$ are
$$\sum_{p \le n} 1/p \quad \text{and} \quad \sum_{p \le n} 1/p(1 - 1/p)$$
respectively. It is known that
$$(*) \qquad \sum_{p \le n} 1/p = \log\log n + O(1)$$
(see Hardy and Wright (1959), Chapter XXII), while anyone can see $\sum_p 1/p^2 < \infty$,
so applying Theorem 2.4.5 to $X_p$ and making a small leap of faith gives us:
Theorem 2.4.8. Erdős–Kac central limit theorem. As $n \to \infty$
$$P_n\left( m \le n : g(m) - \log\log n \le x(\log\log n)^{1/2} \right) \to P(\chi \le x)$$
Proof. We begin by showing that we can ignore the primes “near” $n$. Let
$$\alpha_n = n^{1/\log\log n}$$
$$\log \alpha_n = \log n/\log\log n$$
$$\log\log \alpha_n = \log\log n - \log\log\log n$$
The sequence $\alpha_n$ has two nice properties:
(a) $\left( \sum_{\alpha_n < p \le n} 1/p \right) / (\log\log n)^{1/2} \to 0$
Proof of (a). By $(*)$
$$\sum_{\alpha_n < p \le n} 1/p = \sum_{p \le n} 1/p - \sum_{p \le \alpha_n} 1/p = \log\log n - \log\log \alpha_n + O(1) = \log\log\log n + O(1)$$
(b) If $\epsilon > 0$ then $\alpha_n \le n^\epsilon$ for large $n$ and hence $\alpha_n^r/n \to 0$ for all $r < \infty$.
Proof of (b). $1/\log\log n \to 0$ as $n \to \infty$.
Let $g_n(m) = \sum_{p \le \alpha_n} \delta_p(m)$ and let $E_n$ denote expected value w.r.t. $P_n$.
$$E_n\left( \sum_{\alpha_n < p \le n} \delta_p \right) = \sum_{\alpha_n < p \le n} P_n(m : \delta_p(m) = 1) \le \sum_{\alpha_n < p \le n} 1/p$$
so by (a) it is enough to prove the result for $g_n$. Let
$$S_n = \sum_{p \le \alpha_n} X_p$$
where the $X_p$ are the independent random variables introduced above. Let $b_n = ES_n$
and $a_n^2 = \operatorname{var}(S_n)$. (a) tells us that $b_n$ and $a_n^2$ are both
$$\log\log n + o((\log\log n)^{1/2})$$
so it suffices to show
$$P_n(m : g_n(m) - b_n \le xa_n) \to P(\chi \le x)$$
An application of Theorem 2.4.5 shows $(S_n - b_n)/a_n \Rightarrow \chi$, and since $|X_p| \le 1$ it follows from
the second proof of Theorem 2.4.5 that
$$E\left( (S_n - b_n)/a_n \right)^r \to E\chi^r \quad \text{for all } r$$
Using notation from that proof (and replacing $i_j$ by $p_j$)
$$ES_n^r = \sum_{k=1}^r \sum_{r_i} \frac{r!}{r_1! \cdots r_k!} \frac{1}{k!} \sum_{p_j} E(X_{p_1}^{r_1} \cdots X_{p_k}^{r_k})$$
Since $X_p \in \{0, 1\}$, the summand is
$$E(X_{p_1} \cdots X_{p_k}) = 1/(p_1 \cdots p_k)$$
A little thought reveals that
$$E_n(\delta_{p_1} \cdots \delta_{p_k}) \le \frac{1}{n}\,[n/(p_1 \cdots p_k)]$$
The two moments differ by $\le 1/n$, so
$$|E(S_n^r) - E_n(g_n^r)| \le \sum_{k=1}^r \sum_{r_i} \frac{r!}{r_1! \cdots r_k!} \frac{1}{k!} \sum_{p_j} \frac{1}{n} \le \frac{1}{n} \left( \sum_{p \le \alpha_n} 1 \right)^r \le \frac{\alpha_n^r}{n} \to 0$$
by (b). Now
$$E(S_n - b_n)^r = \sum_{m=0}^r \binom{r}{m} ES_n^m (-b_n)^{r-m}$$
$$E(g_n - b_n)^r = \sum_{m=0}^r \binom{r}{m} Eg_n^m (-b_n)^{r-m}$$
so subtracting and using our bound on $|E(S_n^r) - E_n(g_n^r)|$ with $r = m$
$$|E(S_n - b_n)^r - E(g_n - b_n)^r| \le \sum_{m=0}^r \binom{r}{m} \frac{1}{n}\, \alpha_n^m b_n^{r-m} = (\alpha_n + b_n)^r/n \to 0$$
since $b_n \le \alpha_n$. This is more than enough to conclude that
$$E\left( (g_n - b_n)/a_n \right)^r \to E\chi^r$$
and the desired result follows from Theorem 2.3.12.

2.4.4 Rates of Convergence (Berry–Esseen)*

Theorem 2.4.9. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$, $EX_i^2 = \sigma^2$, and $E|X_i|^3 =
\rho < \infty$. If $F_n(x)$ is the distribution of $(X_1 + \cdots + X_n)/\sigma\sqrt{n}$ and $N(x)$ is the standard
normal distribution, then
$$|F_n(x) - N(x)| \le 3\rho/\sigma^3\sqrt{n}$$
Remarks. The reader should note that the inequality holds for all $n$ and $x$, but since
$\rho \ge \sigma^3$ it only has nontrivial content for $n \ge 10$. It is easy to see that the rate cannot
be faster than $n^{-1/2}$. When $P(X_i = 1) = P(X_i = -1) = 1/2$, symmetry and (1.4)
imply
$$F_{2n}(0) = \frac{1}{2}\{1 + P(S_{2n} = 0)\} = \frac{1}{2}(1 + (\pi n)^{-1/2}) + o(n^{-1/2})$$
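For the symmetric $\pm 1$ case in the remark, where $\sigma = \rho = 1$, the left-hand side of Theorem 2.4.9 can be computed exactly and compared with the bound $3/\sqrt{n}$. The sketch below is illustrative only: `kolmogorov_distance` is a hypothetical helper, and it uses the fact that for a step function $F_n$ and a continuous increasing $N$ the supremum is attained at an atom, either at the value or at the left limit.

```python
from math import comb, erf, sqrt

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def kolmogorov_distance(n: int) -> float:
    """sup_x |F_n(x) - N(x)| for S_n/sqrt(n), S_n a sum of n fair +1/-1 steps."""
    total = 2 ** n
    below = 0                            # number of paths strictly below the current atom
    worst = 0.0
    for k in range(n + 1):               # k up-steps put an atom at (2k - n)/sqrt(n)
        x = (2 * k - n) / sqrt(n)
        target = std_normal_cdf(x)
        upto = below + comb(n, k)        # number of paths at or below the atom
        worst = max(worst, abs(below / total - target), abs(upto / total - target))
        below = upto
    return worst

for n in (10, 100, 1000):
    print(n, kolmogorov_distance(n), 3 / sqrt(n))
```

For even $n$ the distance is at least $P(S_n = 0)/2$, which is the $n^{-1/2}$ lower bound described in the remark.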
The constant 3 is not the best known (van Beek (1972) gets 0.8), but as Feller brags,
“our streamlined method yields a remarkably good bound even though it avoids the
usual messy numerical calculations.” The hypothesis $E|X|^3 < \infty$ is needed to get the rate
$n^{-1/2}$. Heyde (1967) has shown that for $0 < \delta < 1$
$$\sum_{n=1}^\infty n^{-1+\delta/2} \sup_x |F_n(x) - N(x)| < \infty$$
if and only if $E|X|^{2+\delta} < \infty$. For this and more on rates of convergence, see Hall
(1982).

Proof. Since neither side of the inequality is affected by scaling, we can suppose
without loss of generality that $\sigma^2 = 1$. The first phase of the argument is to derive an
inequality, Lemma 2.4.11, that relates the difference between the two distributions to
the distance between their ch.f.'s. Polya's density (see Example 2.3.8 and use (e) of
Theorem 2.3.1)
$$h_L(x) = \frac{1 - \cos Lx}{\pi L x^2}$$
has ch.f. $\omega_L(\theta) = (1 - |\theta/L|)^+$ for $|\theta| \le L$. We will use $H_L$ for its distribution function.
We will convolve the distributions under consideration with $H_L$ to get ch.f. that have
compact support. The first step is to show that convolution with $H_L$ does not reduce
the difference between the distributions too much.
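Lemma 2.4.10. Let $F$ and $G$ be distribution functions with $G'(x) \le \lambda < \infty$. Let
$\Delta(x) = F(x) - G(x)$, $\eta = \sup |\Delta(x)|$, $\Delta_L = \Delta * H_L$, and $\eta_L = \sup |\Delta_L(x)|$. Then
$$\eta_L \ge \frac{\eta}{2} - \frac{12\lambda}{\pi L} \quad \text{or} \quad \eta \le 2\eta_L + \frac{24\lambda}{\pi L}$$
Proof. $\Delta$ goes to 0 at $\pm\infty$, $G$ is continuous, and $F$ is a d.f., so there is an $x_0$ with
$\Delta(x_0) = \eta$ or $\Delta(x_0-) = -\eta$. By looking at the d.f.'s of $(-1)$ times the r.v.'s in
the second case, we can suppose without loss of generality that $\Delta(x_0) = \eta$. Since
$G'(x) \le \lambda$ and $F$ is nondecreasing, $\Delta(x_0 + s) \ge \eta - \lambda s$. Letting $\delta = \eta/2\lambda$, and
$t = x_0 + \delta$, we have
$$\Delta(t - x) \ge \begin{cases} (\eta/2) + \lambda x & \text{for } |x| \le \delta \\ -\eta & \text{otherwise} \end{cases}$$
To estimate the convolution $\Delta_L$, we observe
$$2\int_\delta^\infty h_L(x)\,dx \le 2\int_\delta^\infty 2/(\pi L x^2)\,dx = 4/(\pi L \delta)$$
Looking at $(-\delta, \delta)$ and its complement separately and noticing symmetry implies
$\int_{-\delta}^{\delta} x h_L(x)\,dx = 0$, we have
$$\eta_L \ge \Delta_L(t) \ge \frac{\eta}{2}\left( 1 - \frac{4}{\pi L \delta} \right) - \eta\,\frac{4}{\pi L \delta} = \frac{\eta}{2} - \frac{6\eta}{\pi L \delta} = \frac{\eta}{2} - \frac{12\lambda}{\pi L}$$
which proves the lemma.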
Lemma 2.4.11. Let $K_1$ and $K_2$ be d.f. with mean 0 whose ch.f. $\kappa_i$ are integrable
$$K_1(x) - K_2(x) = (2\pi)^{-1} \int -e^{-itx}\, \frac{\kappa_1(t) - \kappa_2(t)}{it}\,dt$$
Proof. Since the $\kappa_i$ are integrable, the inversion formula, Theorem 2.3.4, implies that
the density $k_i(x)$ has
$$k_i(y) = (2\pi)^{-1} \int e^{-ity} \kappa_i(t)\,dt$$
Subtracting the last expression with $i = 2$ from the one with $i = 1$ then integrating
from $a$ to $x$ and letting $\Delta K = K_1 - K_2$ gives
$$\Delta K(x) - \Delta K(a) = (2\pi)^{-1} \int_a^x \int e^{-ity} \{\kappa_1(t) - \kappa_2(t)\}\,dt\,dy$$
$$= (2\pi)^{-1} \int \{e^{-ita} - e^{-itx}\}\, \frac{\kappa_1(t) - \kappa_2(t)}{it}\,dt$$
the application of Fubini's theorem being justified since the $\kappa_i$ are integrable in $t$ and
we are considering a bounded interval in $y$.
The factor $1/it$ could cause problems near zero, but we have supposed that the
$K_i$ have mean 0, so $\{1 - \kappa_i(t)\}/t \to 0$ by Exercise 2.3.14, and hence $(\kappa_1(t) - \kappa_2(t))/it$
is bounded and continuous. The factor $1/it$ improves the integrability for large $t$ so
$(\kappa_1(t) - \kappa_2(t))/it$ is integrable. Letting $a \to -\infty$ and using the Riemann–Lebesgue
lemma (Exercise A.4.4) proves the result.
Let $\phi_F$ and $\phi_G$ be the ch.f.'s of $F$ and $G$. Applying Lemma 2.4.11 to $F_L = F * H_L$
and $G_L = G * H_L$, gives
$$|F_L(x) - G_L(x)| \le \frac{1}{2\pi} \int |\phi_F(t)\omega_L(t) - \phi_G(t)\omega_L(t)|\, \frac{dt}{|t|} \le \frac{1}{2\pi} \int_{-L}^{L} |\phi_F(t) - \phi_G(t)|\, \frac{dt}{|t|}$$
since $|\omega_L(t)| \le 1$. Using Lemma 2.4.10 now, we have
$$|F(x) - G(x)| \le \frac{1}{\pi} \int_{-L}^{L} |\phi_F(\theta) - \phi_G(\theta)|\, \frac{d\theta}{|\theta|} + \frac{24\lambda}{\pi L}$$
where $\lambda = \sup_x G'(x)$. Plugging in $F = F_n$ and $G = N$ gives
$$|F_n(x) - N(x)| \le \frac{1}{\pi} \int_{-L}^{L} |\varphi^n(\theta/\sqrt{n}) - \psi(\theta)|\, \frac{d\theta}{|\theta|} + \frac{24\lambda}{\pi L} \qquad (2.4.1)$$
and it remains to estimate the right-hand side. This phase of the argument is fairly
routine, but there is a fair amount of algebra. To save the reader from trying to
improve the inequalities along the way in hopes of getting a better bound, we would
like to observe that we have used the fact that $C = 3$ to get rid of the cases $n \le 9$,
and we use $n \ge 10$ in (e).
To estimate the second term in (2.4.1), we observe that
$$\text{(a)} \qquad \sup_x G'(x) = G'(0) = (2\pi)^{-1/2} = 0.39894 < 2/5$$
For the first, we observe that if $|\alpha|, |\beta| \le \gamma$
$$\text{(b)} \qquad |\alpha^n - \beta^n| \le \sum_{m=0}^{n-1} |\alpha^{n-m}\beta^m - \alpha^{n-m-1}\beta^{m+1}| \le n|\alpha - \beta|\gamma^{n-1}$$
Using (2.3.3) now gives (recall we are supposing $\sigma^2 = 1$)
$$\text{(c)} \qquad |\varphi(t) - 1 + t^2/2| \le \rho|t|^3/6$$
so if $t^2 \le 2$
$$\text{(d)} \qquad |\varphi(t)| \le 1 - t^2/2 + \rho|t|^3/6$$
Let $L = 4\sqrt{n}/3\rho$. If $|\theta| \le L$ then by (d) and the fact $\rho|\theta|/\sqrt{n} \le 4/3$
$$|\varphi(\theta/\sqrt{n})| \le 1 - \theta^2/2n + \rho|\theta|^3/6n^{3/2} \le 1 - 5\theta^2/18n \le \exp(-5\theta^2/18n)$$
since $1 - x \le e^{-x}$. We will now apply (b) with
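$$\alpha = \varphi(\theta/\sqrt{n}) \qquad \beta = \exp(-\theta^2/2n) \qquad \gamma = \exp(-5\theta^2/18n)$$
Since we are supposing $n \ge 10$
$$\text{(e)} \qquad \gamma^{n-1} \le \exp(-\theta^2/4)$$
For the other part of (b), we write
$$n|\alpha - \beta| \le n|\varphi(\theta/\sqrt{n}) - 1 + \theta^2/2n| + n|1 - \theta^2/2n - \exp(-\theta^2/2n)|$$
To bound the first term on the right-hand side, observe (c) implies
$$n|\varphi(\theta/\sqrt{n}) - 1 + \theta^2/2n| \le \rho|\theta|^3/6n^{1/2}$$
For the second term, note that if $0 < x < 1$ then we have an alternating series with
decreasing terms so
$$|e^{-x} - (1 - x)| = \left| \frac{x^2}{2!} - \frac{x^3}{3!} + \ldots \right| \le \frac{x^2}{2}$$
Taking $x = \theta^2/2n$ it follows that for $|\theta| \le L \le \sqrt{2n}$
$$n|1 - \theta^2/2n - \exp(-\theta^2/2n)| \le \theta^4/8n$$
Combining this with our estimate on the first term gives
$$\text{(f)} \qquad n|\alpha - \beta| \le \rho|\theta|^3/6n^{1/2} + \theta^4/8n$$
Using (f) and (e) in (b), gives
$$\frac{1}{|\theta|}\, |\varphi^n(\theta/\sqrt{n}) - \exp(-\theta^2/2)| \le \exp(-\theta^2/4) \left\{ \frac{\rho\theta^2}{6n^{1/2}} + \frac{|\theta|^3}{8n} \right\} \le \frac{1}{L} \exp(-\theta^2/4) \left\{ \frac{2\theta^2}{9} + \frac{|\theta|^3}{18} \right\}$$
since $\rho/\sqrt{n} = 4/3L$, and $1/n = 1/\sqrt{n} \cdot 1/\sqrt{n} \le 4/3L \cdot 1/3$ since $\rho \ge 1$ and $n \ge 10$.
Using the last result and (a) in (2.4.1) gives
$$\pi L |F_n(x) - N(x)| \le \int \exp(-\theta^2/4) \left\{ \frac{2\theta^2}{9} + \frac{|\theta|^3}{18} \right\} d\theta + 9.6$$
Recalling $L = 4\sqrt{n}/3\rho$, we see that the last result is of the form $|F_n(x) - N(x)| \le
C\rho/\sqrt{n}$. To evaluate the constant, we observe
$$\int (2\pi a)^{-1/2} x^2 \exp(-x^2/2a)\,dx = a$$
and writing $x^3 = 2x^2 \cdot x/2$ and integrating by parts
$$2\int_0^\infty x^3 \exp(-x^2/4)\,dx = 2\int_0^\infty 4x \exp(-x^2/4)\,dx = -16e^{-x^2/4} \Big|_0^\infty = 16$$
This gives us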
$$|F_n(x) - N(x)| \le \frac{1}{\pi} \cdot \frac{3}{4} \left\{ \frac{2}{9} \cdot 2 \cdot \sqrt{4\pi} + \frac{16}{18} + 9.6 \right\} \frac{\rho}{\sqrt{n}} < 3\,\frac{\rho}{\sqrt{n}}$$
For the last step, you have to get out your calculator or trust Feller.

2.5 Local Limit Theorems*

In Section 2.1 we saw that if $X_1, X_2, \ldots$ are i.i.d. with $P(X_1 = 1) = P(X_1 = -1) =
1/2$ and $k_n$ is a sequence of integers with $2k_n/(2n)^{1/2} \to x$ then
$$P(S_{2n} = 2k_n) \sim (\pi n)^{-1/2} \exp(-x^2/2)$$
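The asymptotic formula recalled above can be compared with the exact binomial probability. In this sketch the values $n = 5000$ and $k = 50$ (so that $x = 2k/(2n)^{1/2} = 1$) are our own choice, as are the helper names.

```python
from math import comb, exp, pi, sqrt

def exact_prob(n: int, k: int) -> float:
    """P(S_{2n} = 2k) for the simple random walk: n + k up-steps among 2n."""
    return comb(2 * n, n + k) / 4 ** n

def local_approx(n: int, k: int) -> float:
    """(pi n)^(-1/2) exp(-x^2/2) with x = 2k / (2n)^(1/2)."""
    x = 2 * k / sqrt(2 * n)
    return exp(-x * x / 2) / sqrt(pi * n)

n, k = 5000, 50
ratio = exact_prob(n, k) / local_approx(n, k)
print(f"exact / approximation = {ratio:.5f}")
```

A ratio close to 1 is what the symbol $\sim$ asserts; the approximation degrades if $k$ grows faster than $\sqrt{n}$.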
In this section, we will prove two theorems that generalize the last result. We begin
with two deﬁnitions. A random variable X has a lattice distribution if there are
constants b and h > 0 so that P (X ∈ b + hZ) = 1, where b + hZ = {b + hz : z ∈ Z}.
The largest h for which the last statement holds is called the span of the distribution.
Example 2.5.1. If P (X = 1) = P (X = −1) = 1/2 then X has a lattice distribution
with span 2. When h is 2, one possible choice is b = −1.
The next result relates the last definition to the characteristic function. To check
(ii) in its statement, note that in the last example $E(e^{itX}) = \cos t$ has $|\cos(t)| = 1$
when $t = n\pi$.
Theorem 2.5.1. Let $\varphi(t) = Ee^{itX}$. There are only three possibilities.
(i) $|\varphi(t)| < 1$ for all $t \ne 0$.
(ii) There is a $\lambda > 0$ so that $|\varphi(\lambda)| = 1$ and $|\varphi(t)| < 1$ for $0 < t < \lambda$. In this case, $X$
has a lattice distribution with span $2\pi/\lambda$.
(iii) $|\varphi(t)| = 1$ for all $t$. In this case, $X = b$ a.s. for some $b$.
Proof. We begin with (ii). It suffices to show that $|\varphi(t)| = 1$ if and only if $P(X \in
b + (2\pi/t)\mathbb{Z}) = 1$ for some $b$. First, if $P(X \in b + (2\pi/t)\mathbb{Z}) = 1$ then
$$\varphi(t) = Ee^{itX} = e^{itb} \sum_{n \in \mathbb{Z}} e^{i2\pi n} P(X = b + (2\pi/t)n) = e^{itb}$$
Conversely, if $|\varphi(t)| = 1$, then there is equality in the inequality $|Ee^{itX}| \le E|e^{itX}|$,
so by Exercise 1.3.1 the distribution of $e^{itX}$ must be concentrated at some point $e^{itb}$,
and $P(X \in b + (2\pi/t)\mathbb{Z}) = 1$.
To prove trichotomy now, we suppose that (i) and (ii) do not hold, i.e., there is a
sequence $t_n \downarrow 0$ so that $|\varphi(t_n)| = 1$. The first paragraph shows that there is a $b_n$ so
that $P(X \in b_n + (2\pi/t_n)\mathbb{Z}) = 1$. Without loss of generality, we can pick $b_n \in (-\pi/t_n,
\pi/t_n]$. As $n \to \infty$, $P(X \notin (-\pi/t_n, \pi/t_n]) \to 0$ so it follows that $P(X = b_n) \to 1$.
This is only possible if $b_n = b$ for $n \ge N$, and $P(X = b) = 1$.
We call the three cases in Theorem 2.5.1: (i) nonlattice, (ii) lattice, and (iii)
degenerate. The reader should notice that this means that lattice random variables
are by deﬁnition nondegenerate. Before we turn to the main business of this section,
we would like to introduce one more special case. If X is a lattice distribution and we
can take b = 0, i.e., P (X ∈ hZ) = 1, then X is said to be arithmetic. In this case,
if λ = 2π/h then ϕ(λ) = 1 and ϕ is periodic: ϕ(t + λ) = ϕ(t).
Our first local limit theorem is for the lattice case. Let $X_1, X_2, \ldots$ be i.i.d. with
$EX_i = 0$, $EX_i^2 = \sigma^2 \in (0, \infty)$, and having a common lattice distribution with span
$h$. If $S_n = X_1 + \cdots + X_n$ and $P(X_i \in b + h\mathbb{Z}) = 1$ then $P(S_n \in nb + h\mathbb{Z}) = 1$. We put
$$p_n(x) = P(S_n/\sqrt{n} = x) \quad \text{for } x \in \mathcal{L}_n = \{(nb + hz)/\sqrt{n} : z \in \mathbb{Z}\}$$
and
$$n(x) = (2\pi\sigma^2)^{-1/2} \exp(-x^2/2\sigma^2) \quad \text{for } x \in (-\infty, \infty)$$
Theorem 2.5.2. Under the hypotheses above, as $n \to \infty$
$$\sup_{x \in \mathcal{L}_n} \left| \frac{n^{1/2}}{h}\, p_n(x) - n(x) \right| \to 0$$
Remark. To explain the statement, note that if we followed the approach in Example
2.4.3 then we would conclude that for $x \in \mathcal{L}_n$
$$p_n(x) \approx \int_{x - h/2\sqrt{n}}^{x + h/2\sqrt{n}} n(y)\,dy \approx \frac{h}{\sqrt{n}}\, n(x)$$
Proof. Let $Y$ be a random variable with $P(Y \in a + \theta\mathbb{Z}) = 1$ and $\psi(t) = E\exp(itY)$.
It follows from part (iii) of Exercise 2.3.2 that
$$P(Y = x) = \frac{1}{2\pi/\theta} \int_{-\pi/\theta}^{\pi/\theta} e^{-itx} \psi(t)\,dt$$
Using this formula with $\theta = h/\sqrt{n}$, $\psi(t) = E\exp(itS_n/\sqrt{n}) = \varphi^n(t/\sqrt{n})$, and then
multiplying each side by $1/\theta$ gives
$$\frac{n^{1/2}}{h}\, p_n(x) = \frac{1}{2\pi} \int_{-\pi\sqrt{n}/h}^{\pi\sqrt{n}/h} e^{-itx} \varphi^n(t/\sqrt{n})\,dt$$
Using the inversion formula, Theorem 2.3.5, for $n(x)$, which has ch.f. $\exp(-\sigma^2 t^2/2)$,
gives
$$n(x) = \frac{1}{2\pi} \int e^{-itx} \exp(-\sigma^2 t^2/2)\,dt$$
Subtracting the last two equations gives (recall $\pi > 1$, $|e^{-itx}| \le 1$)
$$\left| \frac{n^{1/2}}{h}\, p_n(x) - n(x) \right| \le \int_{-\pi\sqrt{n}/h}^{\pi\sqrt{n}/h} |\varphi^n(t/\sqrt{n}) - \exp(-\sigma^2 t^2/2)|\,dt + \int_{\pi\sqrt{n}/h}^{\infty} \exp(-\sigma^2 t^2/2)\,dt$$
The right-hand side is independent of $x$, so to prove Theorem 2.5.2 it suffices to show
that it approaches 0. The second integral clearly $\to 0$. To estimate the first integral,
we observe that $\varphi^n(t/\sqrt{n}) \to \exp(-\sigma^2 t^2/2)$, so the integrand goes to 0 and it is now
just a question of “applying the dominated convergence theorem.”
To do this, we will divide the integral into three pieces. The bounded convergence
theorem implies that for any $A < \infty$ the integral over $(-A, A)$ approaches 0. To
estimate the integral over $(-A, A)^c$, we observe that since $EX_i = 0$ and $EX_i^2 = \sigma^2$,
formula (2.3.3) and the triangle inequality imply that
$$|\varphi(u)| \le |1 - \sigma^2 u^2/2| + \frac{u^2}{2}\, E(\min(|u| \cdot |X|^3, 6|X|^2))$$
The last expected value $\to 0$ as $u \to 0$. This means we can pick $\delta > 0$ so that if
$|u| < \delta$, it is $\le \sigma^2/2$ and hence
$$|\varphi(u)| \le 1 - \sigma^2 u^2/2 + \sigma^2 u^2/4 = 1 - \sigma^2 u^2/4 \le \exp(-\sigma^2 u^2/4)$$
since $1 - x \le e^{-x}$. Applying the last result to $u = t/\sqrt{n}$ we see that for $|t| \le \delta\sqrt{n}$
$$(*) \qquad |\varphi(t/\sqrt{n})^n| \le \exp(-\sigma^2 t^2/4)$$
So the integral over $(-\delta\sqrt{n}, \delta\sqrt{n}) - (-A, A)$ is smaller than
$$2\int_A^{\delta\sqrt{n}} \exp(-\sigma^2 t^2/4)\,dt$$
which is small if $A$ is large.
To estimate the rest of the integral we observe that since $X$ has span $h$, Theorem
2.5.1 implies $|\varphi(u)| \ne 1$ for $u \in [\delta, \pi/h]$. $\varphi$ is continuous so there is an $\eta < 1$ so that
$|\varphi(u)| \le \eta < 1$ for $|u| \in [\delta, \pi/h]$. Letting $u = t/\sqrt{n}$ again, we see that the integral
over $[-\pi\sqrt{n}/h, \pi\sqrt{n}/h] - (-\delta\sqrt{n}, \delta\sqrt{n})$ is smaller than
$$2\int_{\delta\sqrt{n}}^{\pi\sqrt{n}/h} \eta^n + \exp(-\sigma^2 t^2/2)\,dt$$
which $\to 0$ as $n \to \infty$. This completes the proof.
We turn now to the nonlattice case. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$,
$EX_i^2 = \sigma^2 \in (0, \infty)$, and having a common characteristic function $\varphi(t)$ that has
$|\varphi(t)| < 1$ for all $t \ne 0$. Let $S_n = X_1 + \cdots + X_n$ and $n(x) = (2\pi\sigma^2)^{-1/2} \exp(-x^2/2\sigma^2)$.
Theorem 2.5.3. Under the hypotheses above, if $x_n/\sqrt{n} \to x$ and $a < b$
$$\sqrt{n}\, P(S_n \in (x_n + a, x_n + b)) \to (b - a)n(x)$$
Remark. The proof of this result has to be a little devious because the assumption
above does not give us much control over the behavior of $\varphi$. For a bad example, let
$q_1, q_2, \ldots$ be an enumeration of the positive rationals which has $q_n \le n$. Suppose
$$P(X = q_n) = P(X = -q_n) = 1/2^{n+1}$$
In this case $EX = 0$, $EX^2 < \infty$, and the distribution is nonlattice. However, the
characteristic function has $\limsup_{t\to\infty} |\varphi(t)| = 1$.
Proof. To tame bad ch.f.'s we use a trick. Let $\delta > 0$
$$h_0(y) = \frac{1}{\pi} \cdot \frac{1 - \cos \delta y}{\delta y^2}$$
be the density of the Polya's distribution and let $h_\theta(x) = e^{i\theta x} h_0(x)$. If we introduce
the Fourier transform
$$\hat{g}(u) = \int e^{iuy} g(y)\,dy$$
then it follows from Example 2.3.8 that
$$\hat{h}_0(u) = \begin{cases} 1 - |u/\delta| & \text{if } |u| \le \delta \\ 0 & \text{otherwise} \end{cases}$$
and it is easy to see that $\hat{h}_\theta(u) = \hat{h}_0(u + \theta)$. We will show that for any $\theta$
$$\text{(a)} \qquad \sqrt{n}\, Eh_\theta(S_n - x_n) \to n(x) \int h_\theta(y)\,dy$$
Before proving (a), we will show it implies Theorem 2.5.3. Let
$$\mu_n(A) = \sqrt{n}\, P(S_n - x_n \in A), \quad \text{and} \quad \mu(A) = n(x)|A|$$
where $|A|$ = the Lebesgue measure of $A$. Let
$$\alpha_n = \sqrt{n}\, Eh_0(S_n - x_n) \quad \text{and} \quad \alpha = n(x) \int h_0(y)\,dy = n(x)$$
Finally, define probability measures by
$$\nu_n(B) = \frac{1}{\alpha_n} \int_B h_0(y)\,\mu_n(dy), \quad \text{and} \quad \nu(B) = \frac{1}{\alpha} \int_B h_0(y)\,\mu(dy)$$
Taking $\theta = 0$ in (a) we see $\alpha_n \to \alpha$ and so (a) implies
$$\text{(b)} \qquad \int e^{i\theta y}\,\nu_n(dy) \to \int e^{i\theta y}\,\nu(dy)$$
Since this holds for all $\theta$, it follows from Theorem 2.3.6 that $\nu_n \Rightarrow \nu$. Now if $|a|, |b| <
2\pi/\delta$ then the function
$$k(y) = \frac{1}{h_0(y)} \cdot 1_{(a,b)}(y)$$
is bounded and continuous a.s. with respect to $\nu$ so it follows from Theorem 2.2.4
that
$$\int k(y)\,\nu_n(dy) \to \int k(y)\,\nu(dy)$$
Since $\alpha_n \to \alpha$, this implies
$$\sqrt{n}\, P(S_n \in (x_n + a, x_n + b)) \to (b - a)n(x)$$
which is the conclusion of Theorem 2.5.3.
Turning now to the proof of (a), the inversion formula, Theorem 2.3.5, implies
\[ h_0(x) = \frac{1}{2\pi} \int e^{-iux}\, \hat h_0(u)\, du \]
Recalling the definition of $h_\theta$, using the last result, and changing variables $u = v + \theta$
we have
\[ h_\theta(x) = e^{i\theta x} h_0(x) = \frac{1}{2\pi} \int e^{-i(u-\theta)x}\, \hat h_0(u)\, du = \frac{1}{2\pi} \int e^{-ivx}\, \hat h_\theta(v)\, dv \]
since $\hat h_\theta(v) = \hat h_0(v + \theta)$. Letting $F_n$ be the distribution of $S_n - x_n$ and integrating
gives
\[ E h_\theta(S_n - x_n) = \frac{1}{2\pi} \int\!\!\int e^{-iux}\, \hat h_\theta(u)\, du\, dF_n(x) = \frac{1}{2\pi} \int\!\!\int e^{-iux}\, dF_n(x)\, \hat h_\theta(u)\, du \]
by Fubini's theorem. (Recall $\hat h_\theta(u)$ has compact support and $F_n$ is a distribution
function.) Using (e) of Theorem 2.3.1, we see that the last expression
\[ = \frac{1}{2\pi} \int \varphi(-u)^n e^{iux_n}\, \hat h_\theta(u)\, du \]
To take the limit as $n \to \infty$ of this integral, let $[-M, M]$ be an interval with $\hat h_\theta(u) = 0$
for $u \notin [-M, M]$. By (∗) above, we can pick $\delta$ so that for $|u| < \delta$
\[ \text{(c)} \qquad |\varphi(u)| \le \exp(-\sigma^2 u^2/4) \]
Let $I = [-\delta, \delta]$ and $J = [-M, M] - I$. Since $|\varphi(u)| < 1$ for $u \neq 0$ and $\varphi$ is continuous,
there is a constant $\eta < 1$ so that $|\varphi(u)| \le \eta < 1$ for $u \in J$. Since $|\hat h_\theta(u)| \le 1$, this
implies that
\[ \left| \frac{\sqrt{n}}{2\pi} \int_J \varphi(-u)^n e^{iux_n}\, \hat h_\theta(u)\, du \right| \le \frac{\sqrt{n}}{2\pi} \cdot 2M \eta^n \to 0 \]
as $n \to \infty$. For the integral over $I$, change variables $u = t/\sqrt{n}$ to get
\[ \frac{1}{2\pi} \int_{-\delta\sqrt{n}}^{\delta\sqrt{n}} \varphi(-t/\sqrt{n})^n e^{itx_n/\sqrt{n}}\, \hat h_\theta(t/\sqrt{n})\, dt \]
The central limit theorem implies $\varphi(-t/\sqrt{n})^n \to \exp(-\sigma^2 t^2/2)$. Using (c) now and
the dominated convergence theorem gives (recall $x_n/\sqrt{n} \to x$)
\[ \frac{\sqrt{n}}{2\pi} \int_I \varphi(-u)^n e^{iux_n}\, \hat h_\theta(u)\, du \to \frac{1}{2\pi} \int \exp(-\sigma^2 t^2/2)\, e^{itx}\, \hat h_\theta(0)\, dt = n(x)\, \hat h_\theta(0) = n(x) \int h_\theta(y)\, dy \]
by the inversion formula, Theorem 2.3.5, and the definition of $\hat h_\theta(0)$. This proves (a)
and completes the proof of Theorem 2.5.3.

2.6 Poisson Convergence

2.6.1 The Basic Limit Theorem

Our first result is sometimes facetiously called the “weak law of small numbers” or
the “law of rare events.” These names derive from the fact that the Poisson appears
as the limit of a sum of indicators of events that have small probabilities.
Theorem 2.6.1. For each $n$ let $X_{n,m}$, $1 \le m \le n$ be independent random variables
with $P(X_{n,m} = 1) = p_{n,m}$, $P(X_{n,m} = 0) = 1 - p_{n,m}$. Suppose
(i) $\sum_{m=1}^n p_{n,m} \to \lambda \in (0, \infty)$, and (ii) $\max_{1 \le m \le n} p_{n,m} \to 0$.
If $S_n = X_{n,1} + \cdots + X_{n,n}$ then $S_n \Rightarrow Z$ where $Z$ is Poisson($\lambda$).
Here Poisson($\lambda$) is shorthand for Poisson distribution with mean $\lambda$, that is,
\[ P(Z = k) = e^{-\lambda} \lambda^k / k! \]
Note that in the spirit of the Lindeberg-Feller theorem, no single term contributes
very much to the sum. In contrast to that theorem, the contributions, when positive,
are not small.
First proof. Let $\varphi_{n,m}(t) = E(\exp(itX_{n,m})) = (1 - p_{n,m}) + p_{n,m} e^{it}$ and let $S_n = X_{n,1} + \cdots + X_{n,n}$. Then
\[ E \exp(itS_n) = \prod_{m=1}^n (1 + p_{n,m}(e^{it} - 1)) \]
Let $0 \le p \le 1$. $|\exp(p(e^{it} - 1))| = \exp(p\, \mathrm{Re}(e^{it} - 1)) \le 1$ and $|1 + p(e^{it} - 1)| \le 1$
since it is on the line segment connecting 1 to $e^{it}$. Using Lemma 2.4.3 with $\theta = 1$ and
then Lemma 2.4.4, which is valid when $\max_m p_{n,m} \le 1/2$ since $|e^{it} - 1| \le 2$,
\[ \left| \exp\left( \sum_{m=1}^n p_{n,m}(e^{it} - 1) \right) - \prod_{m=1}^n \{1 + p_{n,m}(e^{it} - 1)\} \right| \le \sum_{m=1}^n \left| \exp(p_{n,m}(e^{it} - 1)) - \{1 + p_{n,m}(e^{it} - 1)\} \right| \le \sum_{m=1}^n p_{n,m}^2 |e^{it} - 1|^2 \]
Using $|e^{it} - 1| \le 2$ again, it follows that the last expression
\[ \le 4 \left( \max_{1 \le m \le n} p_{n,m} \right) \sum_{m=1}^n p_{n,m} \to 0 \]
by assumptions (i) and (ii). The last conclusion and $\sum_{m=1}^n p_{n,m} \to \lambda$ imply
\[ E \exp(itS_n) \to \exp(\lambda(e^{it} - 1)) \]
To complete the proof now, we consult Example 2.3.2 for the ch.f. of the Poisson
distribution and apply Theorem 2.3.6.
We will now consider some concrete situations in which Theorem 2.6.1 can be
applied. In each case we are considering a situation in which $p_{n,m} = c/n$, so we
approximate the distribution of the sum by a Poisson with mean $c$.
Example 2.6.1. In a calculus class with 400 students, the number of students who
have their birthday on the day of the final exam has approximately a Poisson distribution with mean 400/365 = 1.096. This means that the probability no one was
born on that date is about $e^{-1.096} = .334$. Similar reasoning shows that the number
of babies born on a given day or the number of people who arrive at a bank between
1:15 and 1:30 should have a Poisson distribution.
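The arithmetic in Example 2.6.1 can be checked directly. The sketch below is only an illustration: the function name `prob_no_birthday` is hypothetical, and it assumes 365 equally likely birthdays, as the example does. It compares the exact binomial probability that no student has the given birthday with the Poisson value $e^{-400/365}$.

```python
import math

def prob_no_birthday(students: int, days: int = 365):
    """Return (exact, Poisson approximation) for P(no student born on a given day)."""
    exact = (1 - 1 / days) ** students   # students independent Bernoulli(1/days) trials
    mean = students / days               # Poisson mean = sum of the success probabilities
    return exact, math.exp(-mean)

exact, approx = prob_no_birthday(400)
print(f"exact = {exact:.4f}, Poisson approximation = {approx:.4f}")
```

The two values agree to about three decimal places, which is the sense in which the Poisson approximation is used here.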
Example 2.6.2. Suppose we roll two dice 36 times. The probability of “double ones”
(one on each die) is 1/36 so the number of times this occurs should have approximately
a Poisson distribution with mean 1. Comparing the Poisson approximation with exact
probabilities shows that the agreement is good even though the number of trials is
small.

k          0       1       2       3
Poisson    .3678   .3678   .1839   .0613
exact      .3627   .3730   .1865   .0604

After we give the second proof of Theorem 2.6.1, we will discuss rates of convergence.
Those results will show that for large $n$ the largest discrepancy occurs for $k = 1$ and
is about $1/(2en)$ ($= 0.0051$ in this case).
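The table in Example 2.6.2 can be regenerated from the two probability mass functions. This is a minimal sketch; the helper names `binomial_pmf` and `poisson_pmf` are introduced here for illustration and do not come from the text.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(Z = k) for Z Poisson with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 36, 1 / 36   # 36 rolls, probability 1/36 of double ones on each roll
rows = [(k, poisson_pmf(k, n * p), binomial_pmf(k, n, p)) for k in range(4)]
for k, po, bi in rows:
    print(f"k={k}  Poisson={po:.4f}  exact={bi:.4f}  difference={bi - po:+.4f}")
```

The printed differences for $k = 0$ and $k = 1$ are both close to $1/(2en)$ in absolute value, with opposite signs.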
Example 2.6.3. Let $\xi_{n,1}, \ldots, \xi_{n,n}$ be independent and uniformly distributed over
$[-n, n]$. Let $X_{n,m} = 1$ if $\xi_{n,m} \in (a, b)$, $= 0$ otherwise. $S_n$ is the number of points
that land in $(a, b)$. $p_{n,m} = (b - a)/2n$ so $\sum_m p_{n,m} = (b - a)/2$. This shows (i)
and (ii) in Theorem 2.6.1 hold, and we conclude that $S_n \Rightarrow Z$, a Poisson r.v. with
mean $(b - a)/2$. A two-dimensional version of the last theorem might explain why
the statistics of flying bomb hits in the South of London during World War II fit a
Poisson distribution. As Feller, Vol. I (1968), p.160–161 reports, the area was divided
into 576 areas of 1/4 square kilometers each. The total number of hits was 537 for an
average of 0.9323 per cell. The table below compares $N_k$, the number of cells with $k$
hits, with the predictions of the Poisson approximation.

k          0        1        2       3       4      ≥5
N_k        229      211      93      35      7      1
Poisson    226.74   211.39   98.54   30.62   7.14   1.57

For other observations fitting a Poisson distribution, see Feller, Vol. I (1968), Section
VI.7.
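The Poisson row of the flying-bomb table follows from the two counts quoted above (576 cells, 537 hits). The sketch below recomputes it; the variable names are illustrative, and the observed counts are copied from the table.

```python
import math

cells, hits = 576, 537
observed = {0: 229, 1: 211, 2: 93, 3: 35, 4: 7}   # N_k for k = 0..4; one cell had >= 5 hits
lam = hits / cells                                  # average number of hits per cell

# Expected number of cells with exactly k hits under a Poisson(lam) model.
expected = {k: cells * math.exp(-lam) * lam**k / math.factorial(k) for k in range(5)}
expected_tail = cells - sum(expected.values())      # expected number of cells with >= 5 hits

for k in range(5):
    print(f"k={k}: observed {observed[k]:3d}, Poisson {expected[k]:7.2f}")
print(f"k>=5: observed   1, Poisson {expected_tail:7.2f}")
```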
Our second proof of Theorem 2.6.1 requires a little more work but provides information about the rate of convergence. We begin by defining the total variation
distance between two measures on a countable set $S$.
\[ \|\mu - \nu\| \equiv \frac{1}{2} \sum_z |\mu(z) - \nu(z)| = \sup_{A \subset S} |\mu(A) - \nu(A)| \]
The first equality is a definition. To prove the second, note that for any $A$
\[ \sum_z |\mu(z) - \nu(z)| \ge |\mu(A) - \nu(A)| + |\mu(A^c) - \nu(A^c)| = 2|\mu(A) - \nu(A)| \]
and there is equality when $A = \{z : \mu(z) \ge \nu(z)\}$.
Exercise 2.6.1. Show that (i) $d(\mu, \nu) = \|\mu - \nu\|$ defines a metric on probability
measures on $\mathbf{Z}$ and (ii) $\|\mu_n - \mu\| \to 0$ if and only if $\mu_n(x) \to \mu(x)$ for each $x \in \mathbf{Z}$,
which by Exercise 2.2.11 is equivalent to $\mu_n \Rightarrow \mu$.
Exercise 2.6.2. Show that $\|\mu - \nu\| \le 2\delta$ if and only if there are random variables $X$
and $Y$ with distributions $\mu$ and $\nu$ so that $P(X \neq Y) \le \delta$.
The next three lemmas are the keys to our second proof.
Lemma 2.6.2. If $\mu_1 \times \mu_2$ denotes the product measure on $\mathbf{Z} \times \mathbf{Z}$ that has $(\mu_1 \times \mu_2)(x, y) = \mu_1(x)\mu_2(y)$ then
\[ \|\mu_1 \times \mu_2 - \nu_1 \times \nu_2\| \le \|\mu_1 - \nu_1\| + \|\mu_2 - \nu_2\| \]
Proof.
\[ 2\|\mu_1 \times \mu_2 - \nu_1 \times \nu_2\| = \sum_{x,y} |\mu_1(x)\mu_2(y) - \nu_1(x)\nu_2(y)| \]
\[ \le \sum_{x,y} |\mu_1(x)\mu_2(y) - \nu_1(x)\mu_2(y)| + \sum_{x,y} |\nu_1(x)\mu_2(y) - \nu_1(x)\nu_2(y)| \]
\[ = \sum_y \mu_2(y) \sum_x |\mu_1(x) - \nu_1(x)| + \sum_x \nu_1(x) \sum_y |\mu_2(y) - \nu_2(y)| \]
\[ = 2\|\mu_1 - \nu_1\| + 2\|\mu_2 - \nu_2\| \]
which gives the desired result.
Lemma 2.6.3. If $\mu_1 * \mu_2$ denotes the convolution of $\mu_1$ and $\mu_2$, that is,
\[ \mu_1 * \mu_2(x) = \sum_y \mu_1(x - y)\mu_2(y) \]
then $\|\mu_1 * \mu_2 - \nu_1 * \nu_2\| \le \|\mu_1 \times \mu_2 - \nu_1 \times \nu_2\|$.
Proof.
\[ 2\|\mu_1 * \mu_2 - \nu_1 * \nu_2\| = \sum_x \left| \sum_y \mu_1(x - y)\mu_2(y) - \sum_y \nu_1(x - y)\nu_2(y) \right| \]
\[ \le \sum_x \sum_y |\mu_1(x - y)\mu_2(y) - \nu_1(x - y)\nu_2(y)| = 2\|\mu_1 \times \mu_2 - \nu_1 \times \nu_2\| \]
which gives the desired result.
Lemma 2.6.4. Let $\mu$ be the measure with $\mu(1) = p$ and $\mu(0) = 1 - p$. Let $\nu$ be a
Poisson distribution with mean $p$. Then $\|\mu - \nu\| \le p^2$.
Proof.
\[ 2\|\mu - \nu\| = |\mu(0) - \nu(0)| + |\mu(1) - \nu(1)| + \sum_{n \ge 2} \nu(n) = |1 - p - e^{-p}| + |p - pe^{-p}| + 1 - e^{-p}(1 + p) \]
Since $1 - x \le e^{-x} \le 1$ for $x \ge 0$, the above
\[ = e^{-p} - 1 + p + p(1 - e^{-p}) + 1 - e^{-p} - pe^{-p} = 2p(1 - e^{-p}) \le 2p^2 \]
which gives the desired result.
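Lemma 2.6.4 is easy to check numerically. The sketch below assumes measures on the nonnegative integers are stored as dictionaries mapping a point to its mass, and truncates the Poisson distribution at a hypothetical `cutoff`; both choices are conveniences of this illustration. It confirms that the total variation distance between Bernoulli($p$) and Poisson($p$) equals $p(1 - e^{-p})$, which is at most $p^2$.

```python
import math

def total_variation(mu: dict, nu: dict) -> float:
    """(1/2) * sum_z |mu(z) - nu(z)| for measures given as {point: mass} dictionaries."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(z, 0.0) - nu.get(z, 0.0)) for z in support)

def bernoulli(p: float) -> dict:
    return {0: 1 - p, 1: p}

def poisson(lam: float, cutoff: int = 60) -> dict:
    # Truncated at `cutoff`; the neglected tail mass is negligible for lam <= 1.
    return {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(cutoff)}

results = {}
for p in (0.01, 0.1, 0.5):
    results[p] = total_variation(bernoulli(p), poisson(p))
    print(f"p={p}: TV={results[p]:.6f}  p(1-e^-p)={p * (1 - math.exp(-p)):.6f}  p^2={p * p:.6f}")
```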
Second proof of Theorem 2.6.1. Let $\mu_{n,m}$ be the distribution of $X_{n,m}$. Let $\mu_n$ be
the distribution of $S_n$. Let $\nu_{n,m}$, $\nu_n$, and $\nu$ be Poisson distributions with means $p_{n,m}$, $\lambda_n = \sum_{m \le n} p_{n,m}$, and $\lambda$ respectively. Since $\mu_n = \mu_{n,1} * \cdots * \mu_{n,n}$ and $\nu_n = \nu_{n,1} * \cdots * \nu_{n,n}$, Lemmas 2.6.3, 2.6.2, and 2.6.4 imply
\[ \|\mu_n - \nu_n\| \le \sum_{m=1}^n \|\mu_{n,m} - \nu_{n,m}\| \le \sum_{m=1}^n p_{n,m}^2 \qquad (2.6.1) \]
Using the definition of total variation distance now gives
\[ \sup_A |\mu_n(A) - \nu_n(A)| \le \sum_{m=1}^n p_{n,m}^2 \]
Assumptions (i) and (ii) imply that the right-hand side $\to 0$. Since $\nu_n \Rightarrow \nu$ as $n \to \infty$,
the result follows.
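The bound in the second proof can be tested on a concrete triangular array. The following sketch computes the exact distribution of $S_n$ by repeated convolution and compares its total variation distance to Poisson($\lambda_n$) with $\sum_m p_{n,m}^2$. The function names and the choice $n = 50$ are assumptions made for this illustration only.

```python
import math

def bernoulli_sum_pmf(ps):
    """Exact pmf of a sum of independent Bernoulli(p) variables, by repeated convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # this trial fails
            nxt[k + 1] += mass * p        # this trial succeeds
        pmf = nxt
    return pmf

def tv_to_poisson(ps):
    """Total variation distance between the law of S_n and Poisson(sum(ps))."""
    lam = sum(ps)
    mu = bernoulli_sum_pmf(ps)
    nu = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(len(mu))]
    tail = max(0.0, 1.0 - sum(nu))        # Poisson mass above n, where mu is zero
    return 0.5 * (sum(abs(a - b) for a, b in zip(mu, nu)) + tail)

n = 50
equal = [1 / n] * n                                # the case p_{n,m} = 1/n
unequal = [2 * m / (n * (n + 1)) for m in range(1, n + 1)]   # unequal p's summing to 1
for ps in (equal, unequal):
    print(f"TV = {tv_to_poisson(ps):.5f}, sum of p^2 = {sum(p * p for p in ps):.5f}")
```

In both cases the computed distance sits below the bound, consistent with (2.6.1).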
Remark. The proof above is due to Hodges and Le Cam (1960). By different
methods, C. Stein (1987) (see (43) on p. 89) has proved
\[ \sup_A |\mu_n(A) - \nu_n(A)| \le (\lambda \vee 1)^{-1} \sum_{m=1}^n p_{n,m}^2 \]
Rates of convergence. When $p_{n,m} = 1/n$, (2.6.1) becomes
\[ \sup_A |\mu_n(A) - \nu_n(A)| \le 1/n \]
To assess the quality of this bound, we will compare the Poisson and binomial probabilities for $k$ successes.
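\[ \begin{array}{lll}
k & \text{Poisson} & \text{Binomial} \\
0 & e^{-1} & \left(1 - \frac{1}{n}\right)^n \\
1 & e^{-1} & n \cdot n^{-1} \left(1 - \frac{1}{n}\right)^{n-1} = \left(1 - \frac{1}{n}\right)^{n-1} \\
2 & e^{-1}/2! & \binom{n}{2} n^{-2} \left(1 - \frac{1}{n}\right)^{n-2} = \left(1 - \frac{1}{n}\right)^{n-1} / 2! \\
3 & e^{-1}/3! & \binom{n}{3} n^{-3} \left(1 - \frac{1}{n}\right)^{n-3} = \left(1 - \frac{2}{n}\right) \left(1 - \frac{1}{n}\right)^{n-2} / 3!
\end{array} \]
Since $(1 - x) \le e^{-x}$, we have $\mu_n(0) - \nu_n(0) \le 0$. Expanding
\[ \log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \ldots \]
gives
\[ (n - 1) \log\left(1 - \frac{1}{n}\right) = -\frac{n-1}{n} - \frac{n-1}{2n^2} - \ldots = -1 + \frac{1}{2n} + O(n^{-2}) \]
So
\[ n\left( \left(1 - \frac{1}{n}\right)^{n-1} - e^{-1} \right) = n e^{-1} \left( \exp\{1/2n + O(n^{-2})\} - 1 \right) \to e^{-1}/2 \]
and it follows that
\[ n(\mu_n(1) - \nu_n(1)) \to e^{-1}/2 \]
\[ n(\mu_n(2) - \nu_n(2)) \to e^{-1}/4 \]
For $k \ge 3$, using $(1 - 2/n) \le (1 - 1/n)^2$ and $(1 - x) \le e^{-x}$ shows $\mu_n(k) - \nu_n(k) \le 0$,
so
\[ \sup_{A \subset \mathbf{Z}} |\mu_n(A) - \nu_n(A)| \approx 3/(4en) \]
There is a large literature on Poisson approximations for dependent events. Here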
we consider two examples.

2.6.2 Two Examples with Dependence

Example 2.6.4. Matching. Let $\pi$ be a random permutation of $\{1, 2, \ldots, n\}$, let
$X_{n,m} = 1$ if $m$ is a fixed point (0 otherwise), and let $S_n = X_{n,1} + \cdots + X_{n,n}$ be the
number of fixed points. We want to compute $P(S_n = 0)$. (For a more exciting story
consider men checking hats or wives swapping husbands.) Let $A_{n,m} = \{X_{n,m} = 1\}$, and write $A_m$ for $A_{n,m}$.
The inclusion-exclusion formula implies
\[ P(\cup_{m=1}^n A_m) = \sum_m P(A_m) - \sum_{\ell < m} P(A_\ell \cap A_m) + \sum_{k < \ell < m} P(A_k \cap A_\ell \cap A_m) - \ldots \]
\[ = n \cdot \frac{1}{n} - \binom{n}{2} \frac{(n-2)!}{n!} + \binom{n}{3} \frac{(n-3)!}{n!} - \ldots \]
since the number of permutations with $k$ specified fixed points is $(n - k)!$. Canceling
some factorials gives
\[ P(S_n > 0) = \sum_{m=1}^n \frac{(-1)^{m-1}}{m!} \quad \text{so} \quad P(S_n = 0) = \sum_{m=0}^n \frac{(-1)^m}{m!} \]
Recognizing the second sum as the first $n + 1$ terms in the expansion of $e^{-1}$ gives
\[ |P(S_n = 0) - e^{-1}| = \left| \sum_{m=n+1}^\infty \frac{(-1)^m}{m!} \right| \le \frac{1}{(n+1)!} \sum_{k=0}^\infty (n+2)^{-k} = \frac{1}{(n+1)!} \cdot \left(1 - \frac{1}{n+2}\right)^{-1} \]
a much better rate of convergence than $1/n$. To compute the other probabilities, we
observe that by considering the locations of the fixed points
\[ P(S_n = k) = \binom{n}{k} \frac{1}{n(n-1) \cdots (n-k+1)} P(S_{n-k} = 0) = \frac{1}{k!} P(S_{n-k} = 0) \to e^{-1}/k! \]
Example 2.6.5. Occupancy problem. Suppose that $r$ balls are placed at random
into n boxes. It follows from the Poisson approximation to the Binomial that if
n → ∞ and r/n → c, then the number of balls in a given box will approach a Poisson
distribution with mean c. The last observation should explain why the fraction of
empty boxes approached e−c in Example 1.5.5 of Chapter 1. Here we will show:
Theorem 2.6.5. If $ne^{-r/n} \to \lambda \in [0, \infty)$ the number of empty boxes approaches a
Poisson distribution with mean $\lambda$.
Proof. To see where the answer comes from, notice that in the Poisson approximation
the probability that a given box is empty is $e^{-r/n} \approx \lambda/n$, so if the occupancy of
the various boxes were independent, the result would follow from Theorem 2.6.1. To
prove the result, we begin by observing
\[ P(\text{boxes } i_1, i_2, \ldots, i_k \text{ are empty}) = \left(1 - \frac{k}{n}\right)^r \]
If we let $p_m(r, n)$ = the probability exactly $m$ boxes are empty when $r$ balls are put
in $n$ boxes, then $P(\text{no empty box}) = 1 - P(\text{at least one empty box})$. So by
inclusion-exclusion
\[ \text{(a)} \qquad p_0(r, n) = \sum_{k=0}^n (-1)^k \binom{n}{k} \left(1 - \frac{k}{n}\right)^r \]
By considering the locations of the empty boxes
\[ \text{(b)} \qquad p_m(r, n) = \binom{n}{m} \left(1 - \frac{m}{n}\right)^r p_0(r, n - m) \]
To evaluate the limit of $p_m(r, n)$ we begin by showing that if $ne^{-r/n} \to \lambda$ then
\[ \text{(c)} \qquad \binom{n}{m} \left(1 - \frac{m}{n}\right)^r \to \lambda^m/m! \]
One half of this is easy. Since $(1 - x) \le e^{-x}$ and $ne^{-r/n} \to \lambda$
\[ \text{(d)} \qquad \binom{n}{m} \left(1 - \frac{m}{n}\right)^r \le \frac{n^m}{m!} e^{-mr/n} \to \lambda^m/m! \]
For the other direction, observe $\binom{n}{m} \ge (n - m)^m/m!$ so
\[ \binom{n}{m} \left(1 - \frac{m}{n}\right)^r \ge \left(1 - \frac{m}{n}\right)^{m+r} n^m/m! \]
Now $(1 - m/n)^m \to 1$ as $n \to \infty$ and $1/m!$ is a constant. To deal with the rest, we
note that if $0 \le t \le 1/2$ then
\[ \log(1 - t) = -t - t^2/2 - t^3/3 - \ldots \ge -t - \frac{t^2}{2}\left(1 + 2^{-1} + 2^{-2} + \cdots\right) = -t - t^2 \]
so we have
\[ \log\left( n^m \left(1 - \frac{m}{n}\right)^r \right) \ge m \log n - rm/n - r(m/n)^2 \]
Our assumption $ne^{-r/n} \to \lambda$ means
\[ r = n \log n - n \log \lambda + o(n) \]
so $r(m/n)^2 \to 0$. Multiplying the last display by $m/n$ and rearranging gives $m \log n - rm/n \to m \log \lambda$. Combining the last two results shows
\[ \liminf_{n \to \infty} n^m \left(1 - \frac{m}{n}\right)^r \ge \lambda^m \]
and (c) follows. From (a), (c), and the dominated convergence theorem (using (d) to
get the domination) we get
\[ \text{(e)} \qquad \text{if } ne^{-r/n} \to \lambda \text{ then } p_0(r, n) \to \sum_{k=0}^\infty (-1)^k \frac{\lambda^k}{k!} = e^{-\lambda} \]
For fixed $m$, $(n - m)e^{-r/(n-m)} \to \lambda$, so it follows from (e) that $p_0(r, n - m) \to e^{-\lambda}$.
Combining this with (b) and (c) completes the proof.
Example 2.6.6. Coupon collector's problem. Let $X_1, X_2, \ldots$ be i.i.d. uniform
on $\{1, 2, \ldots, n\}$ and $T_n = \inf\{m : \{X_1, \ldots, X_m\} = \{1, 2, \ldots, n\}\}$. Since $T_n \le m$ if and
only if $m$ balls fill up all $n$ boxes, it follows from Theorem 2.6.5 that
\[ P(T_n - n \log n \le nx) \to \exp(-e^{-x}) \]
Proof. If $r = n \log n + nx$ then $ne^{-r/n} \to e^{-x}$.
Note that $T_n$ is the sum of $n$ independent random variables (see Example 1.5.3 in
Chapter 1), but $T_n$ does not converge to the normal distribution. The problem is that
the last few terms in the sum are of order $n$ so the hypotheses of the Lindeberg-Feller
theorem are not satisfied.
For a concrete instance of the previous result consider: What is the probability
that in a village of 2190 (= 6 · 365) people all birthdays are represented? Do you think
the answer is much diﬀerent for 1825 (= 5 · 365) people?
Solution. Here $n = 365$, so $365 \log 365 = 2153$ and
\[ P(T_{365} \le 2190) = P((T_{365} - 2153)/365 \le 37/365) \approx \exp(-e^{-0.1014}) = \exp(-0.9036) = 0.4051 \]
\[ P(T_{365} \le 1825) = P((T_{365} - 2153)/365 \le -328/365) \approx \exp(-e^{0.8986}) = \exp(-2.4562) = 0.0858 \]
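The two approximations in the solution come from plugging into the limit $\exp(-e^{-x})$. The sketch below does the same computation; the function name `coupon_limit_cdf` is hypothetical. It uses the unrounded value of $n \log n$ (about 2153.46 rather than 2153), so its output differs from the figures above in the third or fourth decimal place.

```python
import math

def coupon_limit_cdf(n: int, m: int) -> float:
    """Approximate P(T_n <= m) using P(T_n - n log n <= n x) -> exp(-e^{-x})."""
    x = (m - n * math.log(n)) / n
    return math.exp(-math.exp(-x))

approximations = {people: coupon_limit_cdf(365, people) for people in (2190, 1825)}
for people, value in approximations.items():
    print(f"village of {people}: P(all birthdays represented) is about {value:.4f}")
```

Going from five people per day of the year to six raises the approximate probability by roughly a factor of five.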
As we observed in Example 1.5.3 of Chapter 1, if we let
\[ \tau_k^n = \inf\{m : |\{X_1, \ldots, X_m\}| = k\} \]
then $\tau_1^n = 1$ and for $2 \le k \le n$, $\tau_k^n - \tau_{k-1}^n$ are independent and have a geometric
distribution with parameter $1 - (k - 1)/n$.
Exercise 2.6.3. Suppose $k/n^{1/2} \to \lambda \in [0, \infty)$ and show that $\tau_k^n - k \Rightarrow$ Poisson($\lambda^2/2$).
Hint: This is easy if you use Theorem 2.6.6 below.
Exercise 2.6.4. Let $\mu_{n,k} = E\tau_k^n$ and $\sigma_{n,k}^2 = \mathrm{var}(\tau_k^n)$. Suppose $k/n \to a \in (0, 1)$,
and use the Lindeberg-Feller theorem to show $(\tau_k^n - \mu_{n,k})/\sqrt{n} \Rightarrow \sigma\chi$.
The last result is true when $k/n^{1/2} \to \infty$ and $n - k \to \infty$, see Baum and Billingsley
(1966). Results for $k = n - j$ can be obtained from Theorem 2.6.5, so we have
examined all the possibilities.

2.6.3 Poisson Processes

Theorem 2.6.1 generalizes trivially to give the following result.
Theorem 2.6.6. Let $X_{n,m}$, $1 \le m \le n$ be independent nonnegative integer valued
random variables with $P(X_{n,m} = 1) = p_{n,m}$, $P(X_{n,m} \ge 2) = \epsilon_{n,m}$. Suppose
(i) $\sum_{m=1}^n p_{n,m} \to \lambda \in (0, \infty)$, (ii) $\max_{1 \le m \le n} p_{n,m} \to 0$,
and (iii) $\sum_{m=1}^n \epsilon_{n,m} \to 0$.
If $S_n = X_{n,1} + \cdots + X_{n,n}$ then $S_n \Rightarrow Z$ where $Z$ is Poisson($\lambda$).
Proof. Let $X'_{n,m} = 1$ if $X_{n,m} = 1$, and 0 otherwise. Let $S'_n = X'_{n,1} + \cdots + X'_{n,n}$.
(i)-(ii) and Theorem 2.6.1 imply $S'_n \Rightarrow Z$, (iii) tells us $P(S_n \neq S'_n) \to 0$ and the result
follows from the converging together lemma, Exercise 2.2.13.
The next result, which uses Theorem 2.6.6, explains why the Poisson distribution
comes up so frequently in applications. Let N (s, t) be the number of arrivals at a
bank or an ice cream parlor in the time interval (s, t]. Suppose
(i) the numbers of arrivals in disjoint intervals are independent,
(ii) the distribution of N (s, t) only depends on t − s,
(iii) P (N (0, h) = 1) = λh + o(h),
and (iv) P (N (0, h) ≥ 2) = o(h).
Here, the two o(h) stand for functions g1 (h) and g2 (h) with gi (h)/h → 0 as h → 0.
Theorem 2.6.7. If (i)–(iv) hold then N (0, t) has a Poisson distribution with mean
λt.
Proof. Let Xn,m = N ((m − 1)t/n, mt/n) for 1 ≤ m ≤ n and apply Theorem 2.6.6.
A family of random variables Nt , t ≥ 0 satisfying:
(i) if 0 = t0 < t1 < . . . < tn , N (tk ) − N (tk−1 ), 1 ≤ k ≤ n are independent,
(ii) N (t) − N (s) is Poisson(λ(t − s)),
is called a Poisson process with rate λ. To understand how Nt behaves, it is
useful to have another method to construct it. Let ξ1 , ξ2 , . . . be independent random
variables with P (ξi > t) = e−λt for t ≥ 0. Let Tn = ξ1 + · · · + ξn and Nt = sup{n :
Tn ≤ t} where T0 = 0. In the language of renewal theory (see Theorem 1.7.6), Tn is
the time of the nth arrival and Nt is the number of arrivals by time t. To check that
Nt is a Poisson process, we begin by recalling (see Exercise 1.4.8):
\[ f_{T_n}(s) = \frac{\lambda^n s^{n-1}}{(n-1)!} e^{-\lambda s} \quad \text{for } s \ge 0 \]
i.e., the distribution of $T_n$ has a density given by the right-hand side. Now
\[ P(N_t = 0) = P(T_1 > t) = e^{-\lambda t} \]
and for $n \ge 1$
\[ P(N_t = n) = P(T_n \le t < T_{n+1}) = \int_0^t f_{T_n}(s) P(\xi_{n+1} > t - s)\, ds = \int_0^t \frac{\lambda^n s^{n-1}}{(n-1)!} e^{-\lambda s} e^{-\lambda(t-s)}\, ds = e^{-\lambda t} \frac{(\lambda t)^n}{n!} \]
The last two formulas show that $N_t$ has a Poisson distribution with mean $\lambda t$. To
check that the number of arrivals in disjoint intervals is independent, we observe
\[ P(T_{n+1} \ge u \mid N_t = n) = P(T_{n+1} \ge u, T_n \le t)/P(N_t = n) \]
To compute the numerator, we observe
\[ P(T_{n+1} \ge u, T_n \le t) = \int_0^t f_{T_n}(s) P(\xi_{n+1} \ge u - s)\, ds = \int_0^t \frac{\lambda^n s^{n-1}}{(n-1)!} e^{-\lambda s} e^{-\lambda(u-s)}\, ds = e^{-\lambda u} \frac{(\lambda t)^n}{n!} \]
The denominator is $P(N_t = n) = e^{-\lambda t} (\lambda t)^n/n!$, so
\[ P(T_{n+1} \ge u \mid N_t = n) = e^{-\lambda u}/e^{-\lambda t} = e^{-\lambda(u-t)} \]
or rewriting things $P(T_{n+1} - t \ge s \mid N_t = n) = e^{-\lambda s}$. Let $T'_1 = T_{N(t)+1} - t$, and $T'_k = T_{N(t)+k} - T_{N(t)+k-1}$ for $k \ge 2$. The last computation shows that $T'_1$ is independent
of $N_t$. If we observe that
\[ P(T_n \le t, T_{n+1} \ge u, T_{n+k} - T_{n+k-1} \ge v_k, k = 2, \ldots, K) = P(T_n \le t, T_{n+1} \ge u) \prod_{k=2}^K P(\xi_{n+k} \ge v_k) \]
then it follows that
(a) $T'_1, T'_2, \ldots$ are i.i.d. and independent of $N_t$.
The last observation shows that the arrivals after time $t$ are independent of $N_t$ and
have the same distribution as the original sequence. From this it follows easily that:
(b) If $0 = t_0 < t_1 < \ldots < t_n$ then $N(t_i) - N(t_{i-1})$, $i = 1, \ldots, n$ are independent.
To see this, observe that the vector $(N(t_2) - N(t_1), \ldots, N(t_n) - N(t_{n-1}))$ is $\sigma(T'_k, k \ge 1)$ measurable and hence is independent of $N(t_1)$. Then use induction to conclude
\[ P(N(t_i) - N(t_{i-1}) = k_i, i = 1, \ldots, n) = \prod_{i=1}^n \exp(-\lambda(t_i - t_{i-1})) \frac{(\lambda(t_i - t_{i-1}))^{k_i}}{k_i!} \]
Remark. The key to the proof of (a) is the lack of memory property of the exponential
distribution:
\[ (*) \qquad P(T > t + s \mid T > t) = P(T > s) \]
which implies that the location of the first arrival after $t$ is independent of what
occurred before time $t$ and has an exponential distribution.
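The construction of $N_t$ from exponential interarrival times used in this subsection is straightforward to simulate. The sketch below is an illustration under assumed parameters (rate 2, time 3, 20000 runs, a fixed seed), none of which come from the text. It checks that the sample mean and variance of $N_t$ are both close to $\lambda t$, as a Poisson distribution requires.

```python
import random
import statistics

def count_arrivals(rate: float, t: float, rng: random.Random) -> int:
    """N_t = sup{n : T_n <= t}, where T_n is a sum of i.i.d. exponential(rate) gaps."""
    n, arrival = 0, rng.expovariate(rate)   # arrival holds T_{n+1}
    while arrival <= t:
        n += 1
        arrival += rng.expovariate(rate)
    return n

rng = random.Random(0)                      # fixed seed so the run is reproducible
rate, t, runs = 2.0, 3.0, 20000
counts = [count_arrivals(rate, t, rng) for _ in range(runs)]
sample_mean = statistics.mean(counts)
sample_var = statistics.variance(counts)
print(f"lambda * t = {rate * t}, sample mean = {sample_mean:.3f}, sample variance = {sample_var:.3f}")
```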
Exercise 2.6.5. Show that if P (T > 0) = 1 and (∗) holds then there is a λ > 0 so
that P (T > t) = e−λt for t ≥ 0. Hint: First show that this holds for t = m2−n .
Exercise 2.6.6. Show that (iii) and (iv) in Theorem 2.6.7 can be replaced by
(v) If Ns− = limr↑s Nr then P (Ns − Ns− ≥ 2 for some s) = 0.
That is, if (i), (ii), and (v) hold then there is a λ ≥ 0 so that N (0, t) has a Poisson
distribution with mean λt. Prove this by showing: (a) If u(s) = P (Ns = 0) then (i)
and (ii) imply u(r)u(s) = u(r + s). It follows that u(s) = e−λs for some λ ≥ 0, so
(iii) holds. (b) if v (s) = P (Ns ≥ 2) and An = {Nk/n − N(k−1)/n ≥ 2 for some k ≤ n}
then (v) implies P (An ) → 0 as n → ∞ and (iv) holds.
Exercise 2.6.7. Let $T_n$ be the time of the $n$th arrival in a rate $\lambda$ Poisson process. Let
$U_1, U_2, \ldots, U_n$ be independent uniform on (0,1) and let $V_k^n$ be the $k$th smallest number
in $\{U_1, \ldots, U_n\}$. Show that the vectors $(V_1^n, \ldots, V_n^n)$ and $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$
have the same distribution.
Spacings. The last result can be used to study the spacings between the order
statistics of i.i.d. uniforms. We use the notation of Exercise 2.6.7 in the next four exercises,
taking $\lambda = 1$ and letting $V_0^n = 0$, and $V_{n+1}^n = 1$.
Exercise 2.6.8. Smirnov (1949) $nV_k^n \Rightarrow T_k$.
Exercise 2.6.9. Weiss (1955) $n^{-1} \sum_{m=1}^n 1_{(n(V_m^n - V_{m-1}^n) > x)} \to e^{-x}$ in probability.
Exercise 2.6.10. $(n/\log n) \max_{1 \le m \le n+1} (V_m^n - V_{m-1}^n) \to 1$ in probability.
Exercise 2.6.11. $P(n^2 \min_{1 \le m \le n} (V_m^n - V_{m-1}^n) > x) \to e^{-x}$.
For the rest of the section, we concentrate on the Poisson process itself.
Exercise 2.6.12. Thinning. Let N have a Poisson distribution with mean λ and let
X1 , X2 , . . . be an independent i.i.d. sequence with P (Xi = j ) = pj for j = 0, 1, . . . , k .
Let $N_j = |\{m \le N : X_m = j\}|$. Show that $N_0, N_1, \ldots, N_k$ are independent and $N_j$
has a Poisson distribution with mean λpj .
In the important special case Xi ∈ {0, 1}, the result says that if we thin a Poisson
process by ﬂipping a coin with probability p of heads to see if we keep the arrival,
then the result is a Poisson process with rate λp.
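For a single Poisson count, the coin-flipping case of the thinning result can be verified exactly rather than by simulation. The sketch below checks numerically that the joint mass function of (kept, discarded) factors into two Poisson mass functions with means $\lambda p$ and $\lambda(1 - p)$. The helper names and the values $\lambda = 3$, $p = 1/4$ are assumptions of this illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam**k / math.factorial(k)

def thinned_joint(a: int, b: int, lam: float, p: float) -> float:
    """P(a arrivals kept, b discarded) when N is Poisson(lam) and each is kept w.p. p."""
    return poisson_pmf(a + b, lam) * math.comb(a + b, a) * p**a * (1 - p) ** b

lam, p = 3.0, 0.25
max_gap = max(
    abs(thinned_joint(a, b, lam, p) - poisson_pmf(a, lam * p) * poisson_pmf(b, lam * (1 - p)))
    for a in range(15)
    for b in range(15)
)
print(f"largest deviation from the product form: {max_gap:.2e}")
```

The deviation is at the level of floating-point rounding, which reflects the identity $e^{-\lambda} \lambda^{a+b} = e^{-\lambda p}(\lambda)^a \cdot e^{-\lambda(1-p)} (\lambda)^b$ behind the exercise.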
Exercise 2.6.13. Poissonization and the occupancy problem. If we put a
Poisson number of balls with mean r in n boxes and let Ni be the number of balls in
box i, then the last exercise implies N1 , . . . , Nn are independent and have a Poisson
distribution with mean r/n. Use this observation to prove Theorem 2.6.5.
Hint: If r = n log n − (log λ)n + o(n) and si = n log n − (log µi )n with µ2 < λ < µ1 then
the normal approximation to the Poisson tells us P (Poisson(s1 ) < r < Poisson(s2 )) →
1 as n → ∞.
Example 2.6.7. Compound Poisson process. At the arrival times T1 , T2 , . . . of a
Poisson process with rate λ, groups of customers of size ξ1 , ξ2 , . . . arrive at an ice cream
parlor. Suppose the $\xi_i$ are i.i.d. with $P(\xi_i = k) = p_k$ and independent of the $T_j$'s. This is a compound
Poisson process. The result of Exercise 2.6.12 shows that $N_t^k$ = the number of
groups of size $k$ to arrive in $[0, t]$ are independent Poissons with mean $p_k \lambda t$.
Example 2.6.8. A Poisson process on a measure space $(S, \mathcal{S}, \mu)$ is a random
map $m : \mathcal{S} \to \{0, 1, \ldots\}$ that for each $\omega$ is a measure on $\mathcal{S}$ and has the following
property: if $A_1, \ldots, A_n$ are disjoint sets with $\mu(A_i) < \infty$ then $m(A_1), \ldots, m(A_n)$ are
independent and have Poisson distributions with means $\mu(A_i)$. $\mu$ is called the mean
measure of the process. Exercise 2.6.12 implies that if $\mu(S) < \infty$ we can construct
$m$ by the following recipe: let $X_1, X_2, \ldots$ be i.i.d. elements of $S$ with distribution
$\nu(\cdot) = \mu(\cdot)/\mu(S)$, let $N$ be an independent Poisson random variable with mean $\mu(S)$,
and let $m(A) = |\{j \le N : X_j \in A\}|$. To extend the construction to infinite measure
spaces, e.g., $S = \mathbf{R}^d$, $\mathcal{S}$ = Borel sets, $\mu$ = Lebesgue measure, divide the space up into
disjoint sets of finite measure and put independent Poisson processes on each set.

2.7 Stable Laws*

Let $X_1, X_2, \ldots$ be i.i.d. and $S_n = X_1 + \cdots + X_n$. Theorem 2.4.1 showed that if
$EX_i = \mu$ and $\mathrm{var}(X_i) = \sigma^2 \in (0, \infty)$ then
\[ (S_n - n\mu)/\sigma n^{1/2} \Rightarrow \chi \]
In this section, we will investigate the case $EX_1^2 = \infty$ and give necessary and sufficient
conditions for the existence of constants $a_n$ and $b_n$ so that
\[ (S_n - b_n)/a_n \Rightarrow Y \quad \text{where } Y \text{ is nondegenerate} \]
We begin with an example. Suppose the distribution of $X_i$ has
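\[ P(X_1 > x) = P(X_1 < -x) = x^{-\alpha}/2 \quad \text{for } x \ge 1 \qquad (2.7.1) \]
where $0 < \alpha < 2$. If $\varphi(t) = E \exp(itX_1)$ then
\[ 1 - \varphi(t) = \int_1^\infty (1 - e^{itx}) \frac{\alpha}{2|x|^{\alpha+1}}\, dx + \int_{-\infty}^{-1} (1 - e^{itx}) \frac{\alpha}{2|x|^{\alpha+1}}\, dx = \alpha \int_1^\infty \frac{1 - \cos(tx)}{x^{\alpha+1}}\, dx \]
Changing variables $tx = u$, $dx = du/t$ the last integral becomes
\[ = \alpha \int_t^\infty \frac{1 - \cos u}{(u/t)^{\alpha+1}} \frac{du}{t} = t^\alpha \alpha \int_t^\infty \frac{1 - \cos u}{u^{\alpha+1}}\, du \]
As $u \to 0$, $1 - \cos u \sim u^2/2$. So $(1 - \cos u)/u^{\alpha+1} \sim u^{-\alpha+1}/2$ which is integrable, since
$\alpha < 2$ implies $-\alpha + 1 > -1$. If we let
\[ C = \alpha \int_0^\infty \frac{1 - \cos u}{u^{\alpha+1}}\, du < \infty \]
and observe (2.7.1) implies $\varphi(t) = \varphi(-t)$, then the results above show
\[ 1 - \varphi(t) \sim C|t|^\alpha \quad \text{as } t \to 0 \qquad (2.7.2) \]
Let $X_1, X_2, \ldots$ be i.i.d. with the distribution given in (2.7.1) and let $S_n = X_1 + \cdots + X_n$.
\[ E \exp(itS_n/n^{1/\alpha}) = \varphi(t/n^{1/\alpha})^n = (1 - \{1 - \varphi(t/n^{1/\alpha})\})^n \]
As $n \to \infty$, $n(1 - \varphi(t/n^{1/\alpha})) \to C|t|^\alpha$, so it follows from Theorem 2.4.2 that
\[ E \exp(itS_n/n^{1/\alpha}) \to \exp(-C|t|^\alpha) \]
From part (ii) of Theorem 2.3.6, it follows that the expression on the right is the
characteristic function of some $Y$ and
\[ S_n/n^{1/\alpha} \Rightarrow Y \qquad (2.7.3) \]
To prepare for our general result, we will now give another proof of (2.7.3). If
$0 < a < b$ and $an^{1/\alpha} > 1$ then
\[ P(an^{1/\alpha} < X_1 < bn^{1/\alpha}) = \frac{1}{2}(a^{-\alpha} - b^{-\alpha}) n^{-1} \]
so it follows from Theorem 2.6.1 that
\[ N_n(a, b) \equiv |\{m \le n : X_m/n^{1/\alpha} \in (a, b)\}| \Rightarrow N(a, b) \]
where $N(a, b)$ has a Poisson distribution with mean $(a^{-\alpha} - b^{-\alpha})/2$. An easy extension
of the last result shows that if $A \subset \mathbf{R} - (-\delta, \delta)$ and $\delta n^{1/\alpha} > 1$ then
\[ P(X_1/n^{1/\alpha} \in A) = n^{-1} \int_A \frac{\alpha}{2|x|^{\alpha+1}}\, dx \]
so $N_n(A) \equiv |\{m \le n : X_m/n^{1/\alpha} \in A\}| \Rightarrow N(A)$, where $N(A)$ has a Poisson distribution with mean
\[ \mu(A) = \int_A \frac{\alpha}{2|x|^{\alpha+1}}\, dx < \infty \]
The limiting family of random variables $N(A)$ is called a Poisson process on
$(-\infty, \infty)$ with mean measure $\mu$. (See Example 2.6.8 for more on this process.)
Notice that for any $\epsilon > 0$, $\mu(\epsilon, \infty) = \epsilon^{-\alpha}/2 < \infty$, so $N(\epsilon, \infty) < \infty$.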
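The last paragraph describes the limiting behavior of the random set
\[ \mathcal{X}_n = \{X_m/n^{1/\alpha} : 1 \le m \le n\} \]
To describe the limit of $S_n/n^{1/\alpha}$, we will “sum up the points.” Let $\epsilon > 0$ and
\[ I_n(\epsilon) = \{m \le n : |X_m| > \epsilon n^{1/\alpha}\} \]
\[ \hat S_n(\epsilon) = \sum_{m \in I_n(\epsilon)} X_m \qquad \bar S_n(\epsilon) = S_n - \hat S_n(\epsilon) \]
$I_n(\epsilon)$ = the indices of the “big terms,” i.e., those $> \epsilon n^{1/\alpha}$ in magnitude. $\hat S_n(\epsilon)$ is the
sum of the big terms, and $\bar S_n(\epsilon)$ is the rest of the sum. The first thing we will do is
show that the contribution of $\bar S_n(\epsilon)$ is small if $\epsilon$ is. Let
\[ \bar X_m(\epsilon) = X_m 1_{(|X_m| \le \epsilon n^{1/\alpha})} \]
Symmetry implies $E \bar X_m(\epsilon) = 0$, so $E(\bar S_n(\epsilon)^2) = nE \bar X_1(\epsilon)^2$.
\[ E \bar X_1(\epsilon)^2 = \int_0^\infty 2y P(|\bar X_1(\epsilon)| > y)\, dy \le \int_0^1 2y\, dy + \int_1^{\epsilon n^{1/\alpha}} 2y\, y^{-\alpha}\, dy \]
\[ = 1 + \frac{2}{2 - \alpha} \epsilon^{2-\alpha} n^{2/\alpha - 1} - \frac{2}{2 - \alpha} \le \frac{2\epsilon^{2-\alpha}}{2 - \alpha} n^{2/\alpha - 1} \]
where we have used $\alpha < 2$ in computing the integral and $\alpha > 0$ in the final inequality.
From this it follows that
\[ E(\bar S_n(\epsilon)/n^{1/\alpha})^2 \le \frac{2\epsilon^{2-\alpha}}{2 - \alpha} \qquad (2.7.4) \]
To compute the limit of $\hat S_n(\epsilon)/n^{1/\alpha}$, we observe that $|I_n(\epsilon)|$ has a binomial distribution with success probability $p = \epsilon^{-\alpha}/n$. Given $|I_n(\epsilon)| = m$, $\hat S_n(\epsilon)/n^{1/\alpha}$ is the sum
of $m$ independent random variables with a distribution $F_n^\epsilon$ that is symmetric about 0
and has
\[ 1 - F_n^\epsilon(x) = P(X_1/n^{1/\alpha} > x \mid |X_1|/n^{1/\alpha} > \epsilon) = x^{-\alpha}/2\epsilon^{-\alpha} \quad \text{for } x \ge \epsilon \]
The last distribution is the same as that of $\epsilon X_1$, so if $\varphi(t) = E \exp(itX_1)$, the distribution $F_n^\epsilon$ has characteristic function $\varphi(\epsilon t)$. Combining the observations in this
paragraph gives
\[ E \exp(it\hat S_n(\epsilon)/n^{1/\alpha}) = \sum_{m=0}^n \binom{n}{m} (\epsilon^{-\alpha}/n)^m (1 - \epsilon^{-\alpha}/n)^{n-m} \varphi(\epsilon t)^m \]
Writing
\[ \binom{n}{m} \frac{1}{n^m} = \frac{1}{m!} \cdot \frac{n(n-1) \cdots (n-m+1)}{n^m} \le \frac{1}{m!} \]
noting $(1 - \epsilon^{-\alpha}/n)^n \le \exp(-\epsilon^{-\alpha})$ and using the dominated convergence theorem
\[ E \exp(it\hat S_n(\epsilon)/n^{1/\alpha}) \to \sum_{m=0}^\infty \exp(-\epsilon^{-\alpha}) (\epsilon^{-\alpha})^m \varphi(\epsilon t)^m/m! = \exp(-\epsilon^{-\alpha}\{1 - \varphi(\epsilon t)\}) \qquad (2.7.5) \]
To get (2.7.3) now, we use the following generalization of Lemma 2.4.7.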
Lemma 2.7.1. If $h_n(\epsilon) \to g(\epsilon)$ for each $\epsilon > 0$ and $g(\epsilon) \to g(0)$ as $\epsilon \to 0$ then we can
pick $\epsilon_n \to 0$ so that $h_n(\epsilon_n) \to g(0)$.
Proof. Let $N_m$ be chosen so that $|h_n(1/m) - g(1/m)| \le 1/m$ for $n \ge N_m$ and $m \to N_m$
is increasing. Let $\epsilon_n = 1/m$ for $N_m \le n < N_{m+1}$ and $= 1$ for $n < N_1$. When
$N_m \le n < N_{m+1}$, $\epsilon_n = 1/m$ so it follows from the triangle inequality and the
definition of $\epsilon_n$ that
\[ |h_n(\epsilon_n) - g(0)| \le |h_n(1/m) - g(1/m)| + |g(1/m) - g(0)| \le 1/m + |g(1/m) - g(0)| \]
When $n \to \infty$, we have $m \to \infty$ and the result follows.
Let $h_n(\epsilon) = E \exp(it\hat S_n(\epsilon)/n^{1/\alpha})$ and $g(\epsilon) = \exp(-\epsilon^{-\alpha}\{1 - \varphi(\epsilon t)\})$. (2.7.2) implies
$1 - \varphi(t) \sim C|t|^\alpha$ as $t \to 0$ so
\[ g(\epsilon) \to \exp(-C|t|^\alpha) \quad \text{as } \epsilon \to 0 \]
and Lemma 2.7.1 implies we can pick $\epsilon_n \to 0$ with $h_n(\epsilon_n) \to \exp(-C|t|^\alpha)$. Introducing
$Y$ with $E \exp(itY) = \exp(-C|t|^\alpha)$, it follows that $\hat S_n(\epsilon_n)/n^{1/\alpha} \Rightarrow Y$. If $\epsilon_n \to 0$ then
(2.7.4) implies
\[ \bar S_n(\epsilon_n)/n^{1/\alpha} \Rightarrow 0 \]
and (2.7.3) follows from the converging together lemma, Exercise 2.2.13.
Once we give one final definition, we will state and prove the general result alluded
to above. $L$ is said to be slowly varying, if
\[ \lim_{x \to \infty} L(tx)/L(x) = 1 \quad \text{for all } t > 0 \]
Exercise 2.7.1. Show that $L(t) = \log t$ is slowly varying but $t^\epsilon$ is not if $\epsilon \neq 0$.
Theorem 2.7.2. Suppose $X_1, X_2, \ldots$ are i.i.d. with a distribution that satisfies
(i) $\lim_{x \to \infty} P(X_1 > x)/P(|X_1| > x) = \theta \in [0, 1]$
(ii) $P(|X_1| > x) = x^{-\alpha} L(x)$
where $\alpha < 2$ and $L$ is slowly varying. Let $S_n = X_1 + \cdots + X_n$
\[ a_n = \inf\{x : P(|X_1| > x) \le n^{-1}\} \quad \text{and} \quad b_n = nE(X_1 1_{(|X_1| \le a_n)}) \]
As $n \to \infty$, $(S_n - b_n)/a_n \Rightarrow Y$ where $Y$ has a nondegenerate distribution.
Remark. This is not much of a generalization of the example, but the conditions are
necessary for the existence of constants an and bn so that (Sn − bn )/an ⇒ Y , where
Y is nondegenerate. Proofs of necessity can be found in Chapter 9 of Breiman (1968)
or in Gnedenko and Kolmogorov (1954). (2.7.11) gives the ch.f. of Y . The reader
has seen the main ideas in the second proof of (2.7.3) and so can skip to that point
without much loss.
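Proof. It is not hard to see that (ii) implies
\[ nP(|X_1| > a_n) \to 1 \qquad (2.7.6) \]
To prove this, note that $nP(|X_1| > a_n) \le 1$ and let $\epsilon > 0$. Taking $x = a_n/(1 + \epsilon)$ and
$t = 1 + 2\epsilon$, (ii) implies
\[ (1 + 2\epsilon)^{-\alpha} = \lim_{n \to \infty} \frac{P(|X_1| > (1 + 2\epsilon)a_n/(1 + \epsilon))}{P(|X_1| > a_n/(1 + \epsilon))} \le \liminf_{n \to \infty} \frac{P(|X_1| > a_n)}{1/n} \]
proving (2.7.6) since $\epsilon$ is arbitrary. Combining (2.7.6) with (i) and (ii) gives
\[ nP(X_1 > xa_n) \to \theta x^{-\alpha} \quad \text{for } x > 0 \qquad (2.7.7) \]
so $|\{m \le n : X_m > xa_n\}| \Rightarrow$ Poisson($\theta x^{-\alpha}$). The last result leads, as before, to
the conclusion that $\mathcal{X}_n = \{X_m/a_n : 1 \le m \le n\}$ converges to a Poisson process on
$(-\infty, \infty)$ with mean measure
\[ \mu(A) = \int_{A \cap (0, \infty)} \theta\alpha |x|^{-(\alpha+1)}\, dx + \int_{A \cap (-\infty, 0)} (1 - \theta)\alpha |x|^{-(\alpha+1)}\, dx \]
To sum up the points, let $I_n(\epsilon) = \{m \le n : |X_m| > \epsilon a_n\}$
\[ \hat\mu(\epsilon) = EX_m 1_{(\epsilon a_n < |X_m| \le a_n)} \qquad \hat S_n(\epsilon) = \sum_{m \in I_n(\epsilon)} X_m \]
\[ \bar\mu(\epsilon) = EX_m 1_{(|X_m| \le \epsilon a_n)} \]
\[ \bar S_n(\epsilon) = (S_n - b_n) - (\hat S_n(\epsilon) - n\hat\mu(\epsilon)) = \sum_{m=1}^n \{X_m 1_{(|X_m| \le \epsilon a_n)} - \bar\mu(\epsilon)\} \]
If we let $\bar X_m(\epsilon) = X_m 1_{(|X_m| \le \epsilon a_n)}$ then
\[ E(\bar S_n(\epsilon)/a_n)^2 = n\, \mathrm{var}(\bar X_1(\epsilon)/a_n) \le nE(\bar X_1(\epsilon)/a_n)^2 \]
\[ E(\bar X_1(\epsilon)/a_n)^2 \le \int_0^\epsilon 2y P(|X_1| > ya_n)\, dy = P(|X_1| > a_n) \int_0^\epsilon 2y \frac{P(|X_1| > ya_n)}{P(|X_1| > a_n)}\, dy \]
We would like to use (2.7.7) and (ii) to conclude
\[ nE(\bar X_1(\epsilon)/a_n)^2 \to \int_0^\epsilon 2y\, y^{-\alpha}\, dy = \frac{2}{2 - \alpha} \epsilon^{2-\alpha} \]
and hence
\[ \limsup_{n \to \infty} E(\bar S_n(\epsilon)/a_n)^2 \le \frac{2\epsilon^{2-\alpha}}{2 - \alpha} \qquad (2.7.8) \]
To justify interchanging the limit and the integral and complete the proof of (2.7.8),
we show the following (take $\delta < 2 - \alpha$):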
Lemma 2.7.3. For any $\delta > 0$ there is $C$ so that for all $t \ge t_0$ and $y \le 1$
\[ P(|X_1| > yt)/P(|X_1| > t) \le Cy^{-\alpha-\delta} \]
Proof. (ii) implies that as $t \to \infty$
\[ P(|X_1| > t/2)/P(|X_1| > t) \to 2^\alpha \]
so for $t \ge t_0$ we have
\[ P(|X_1| > t/2)/P(|X_1| > t) \le 2^{\alpha+\delta} \]
Iterating and stopping the first time $t/2^m < t_0$ we have for all $n \ge 1$
\[ P(|X_1| > t/2^n)/P(|X_1| > t) \le C 2^{(\alpha+\delta)n} \]
where $C = 1/P(|X_1| > t_0)$. Applying the last result to the first $n$ with $1/2^n < y$ and
noticing $y \le 1/2^{n-1}$, we have
\[ P(|X_1| > yt)/P(|X_1| > t) \le C 2^{\alpha+\delta} y^{-\alpha-\delta} \]
which proves the lemma.
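To compute the limit of $\hat S_n(\epsilon)$, we observe that $|I_n(\epsilon)| \Rightarrow$ Poisson($\epsilon^{-\alpha}$). Given
$|I_n(\epsilon)| = m$, $\hat S_n(\epsilon)/a_n$ is the sum of $m$ independent random variables with distribution
$F_n^\epsilon$ that has
\[ 1 - F_n^\epsilon(x) = P(X_1/a_n > x \mid |X_1|/a_n > \epsilon) \to \theta x^{-\alpha}/\epsilon^{-\alpha} \]
\[ F_n^\epsilon(-x) = P(X_1/a_n < -x \mid |X_1|/a_n > \epsilon) \to (1 - \theta)|x|^{-\alpha}/\epsilon^{-\alpha} \]
for $x \ge \epsilon$. If we let $\psi_n^\epsilon(t)$ denote the characteristic function of $F_n^\epsilon$, then Theorem 2.3.6
implies
\[ \psi_n^\epsilon(t) \to \psi^\epsilon(t) = \int_\epsilon^\infty e^{itx} \theta \epsilon^\alpha \alpha x^{-(\alpha+1)}\, dx + \int_{-\infty}^{-\epsilon} e^{itx} (1 - \theta) \epsilon^\alpha \alpha |x|^{-(\alpha+1)}\, dx \]
as $n \to \infty$. So repeating the proof of (2.7.5) gives
\[ E \exp(it\hat S_n(\epsilon)/a_n) \to \exp(-\epsilon^{-\alpha}\{1 - \psi^\epsilon(t)\}) \]
\[ = \exp\left( \int_\epsilon^\infty (e^{itx} - 1)\theta\alpha x^{-(\alpha+1)}\, dx + \int_{-\infty}^{-\epsilon} (e^{itx} - 1)(1 - \theta)\alpha |x|^{-(\alpha+1)}\, dx \right) \]
where we have used $\epsilon^{-\alpha} = \int_\epsilon^\infty \alpha x^{-(\alpha+1)}\, dx$. To bring in
\[ \hat\mu(\epsilon) = EX_m 1_{(\epsilon a_n < |X_m| \le a_n)} \]
we observe that (2.7.7) implies $nP(xa_n < X_m \le ya_n) \to \theta(x^{-\alpha} - y^{-\alpha})$. So
\[ n\hat\mu(\epsilon)/a_n \to \int_\epsilon^1 x\theta\alpha x^{-(\alpha+1)}\, dx + \int_{-1}^{-\epsilon} x(1 - \theta)\alpha |x|^{-(\alpha+1)}\, dx \]
From this it follows that $E \exp(it\{\hat S_n(\epsilon) - n\hat\mu(\epsilon)\}/a_n) \to$
\[ \exp\left( \int_1^\infty (e^{itx} - 1)\theta\alpha x^{-(\alpha+1)}\, dx + \int_\epsilon^1 (e^{itx} - 1 - itx)\theta\alpha x^{-(\alpha+1)}\, dx \right. \]
\[ \left. + \int_{-1}^{-\epsilon} (e^{itx} - 1 - itx)(1 - \theta)\alpha |x|^{-(\alpha+1)}\, dx + \int_{-\infty}^{-1} (e^{itx} - 1)(1 - \theta)\alpha |x|^{-(\alpha+1)}\, dx \right) \qquad (2.7.9) \]
The last expression is messy, but $e^{itx} - 1 - itx \sim -t^2 x^2/2$ as $t \to 0$, so we need to
subtract the $itx$ to make
\[ \int_0^1 (e^{itx} - 1 - itx) x^{-(\alpha+1)}\, dx \quad \text{converge when } \alpha \ge 1 \]
To reduce the number of integrals from four to two, we can write the limit as $\epsilon \to 0$
of the left-hand side of (2.7.9) as
\[ \exp\left( itc + \int_0^\infty \left( e^{itx} - 1 - \frac{itx}{1 + x^2} \right) \theta\alpha x^{-(\alpha+1)}\, dx + \int_{-\infty}^0 \left( e^{itx} - 1 - \frac{itx}{1 + x^2} \right) (1 - \theta)\alpha |x|^{-(\alpha+1)}\, dx \right) \qquad (2.7.10) \]
where $c$ is a constant. Combining (2.7.6) and (2.7.9) using Lemma 2.7.1, it follows
easily that $(S_n - b_n)/a_n \Rightarrow Y$ where $Ee^{itY}$ is given in (2.7.10).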
Exercise 2.7.2. Show that when α < 1, centering is unnecessary, i.e., we can let
bn = 0.
By doing some calculus (see Breiman (1968), p. 204-206) one can rewrite (2.7.10) as
\[ \exp(itc - b|t|^\alpha \{1 + i\kappa\,\text{sgn}(t)\,w_\alpha(t)\}) \qquad (2.7.11) \]
where $-1 \le \kappa \le 1$ ($\kappa = 2\theta - 1$) and
\[ w_\alpha(t) = \begin{cases} \tan(\pi\alpha/2) & \text{if } \alpha \ne 1 \\ (2/\pi)\log|t| & \text{if } \alpha = 1 \end{cases} \]
The reader should note that while we have assumed $0 < \alpha < 2$ throughout the developments above, if we set $\alpha = 2$ then the term with $\kappa$ vanishes and (2.7.11) reduces to the characteristic function of the normal distribution with mean $c$ and variance $2b$.
The distributions whose characteristic functions are given in (2.7.11) are called stable laws. $\alpha$ is commonly called the index. When $\alpha = 1$ and $\kappa = 0$, we have the Cauchy distribution. Apart from the Cauchy and the normal, there is only one other case in which the density is known: When $\alpha = 1/2$, $\kappa = 1$, $c = 0$, and $b = 1$, the density is
\[ (2\pi y^3)^{-1/2} \exp(-1/(2y)) \quad \text{for } y \ge 0 \qquad (2.7.12) \]
One can calculate the ch.f. and verify our claim. However, later (see Section 7.4) we will be able to check the claim without effort, so we leave the somewhat tedious calculation to the reader.
We are now finally ready to treat some examples.
Example 2.7.1. Let $X_1, X_2, \ldots$ be i.i.d. with a density $f$ that is symmetric about 0, and continuous and positive at 0. We claim that
\[ \frac{1}{n}\left( \frac{1}{X_1} + \cdots + \frac{1}{X_n} \right) \Rightarrow \text{a Cauchy distribution } (\alpha = 1, \kappa = 0) \]
To verify this, note that
\[ P(1/X_i > x) = P(0 < X_i < x^{-1}) = \int_0^{x^{-1}} f(y)\,dy \sim f(0)/x \]
as $x \to \infty$. A similar calculation shows $P(1/X_i < -x) \sim f(0)/x$ so (i) in Theorem 2.7.2 holds with $\theta = 1/2$, and (ii) holds with $\alpha = 1$. The scaling constant $a_n \sim 2f(0)n$, while the centering constant vanishes since we have supposed the distribution of $X$ is symmetric about 0.
Remark. Readers who want a challenge should try to drop the symmetry assumption,
assuming for simplicity that f is diﬀerentiable at 0.
Example 2.7.2. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i = 1) = P(X_i = -1) = 1/2$, let $S_n = X_1 + \cdots + X_n$, and let $\tau = \inf\{n \ge 1 : S_n = 1\}$. In Chapter 3 (see the discussion after (3.3.1)) we will show
\[ P(\tau > 2n) \sim \pi^{-1/2} n^{-1/2} \quad \text{as } n \to \infty \]
Let $\tau_1, \tau_2, \ldots$ be independent with the same distribution as $\tau$, and let $T_n = \tau_1 + \cdots + \tau_n$. Results in Section 3.1 imply that $T_n$ has the same distribution as the nth time $S_m$ hits 0. We claim that $T_n/n^2$ converges to the stable law with $\alpha = 1/2$, $\kappa = 1$ and note that this is the key to the derivation of (2.7.12). To prove the claim, note that (i) in Theorem 2.7.2 holds with $\theta = 1$ and (ii) holds with $\alpha = 1/2$. The scaling constant $a_n \sim Cn^2$. Since $\alpha < 1$, Exercise 2.7.2 implies the centering constant is unnecessary.
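The tail asymptotic quoted in Example 2.7.2 can be checked by exact enumeration. The sketch below is only an illustration (the function names are ours, not from the text): it counts the $\pm 1$ paths of length $2n$ whose partial sums never reach 1, which gives $P(\tau > 2n)$ exactly, and compares the result with $(\pi n)^{-1/2}$.

```python
import math

def count_paths_below_one(steps):
    """Count +/-1 paths of the given length whose partial sums never reach 1."""
    counts = {0: 1}  # position -> number of paths that have not hit 1 yet
    for _ in range(steps):
        nxt = {}
        for pos, ways in counts.items():
            for step in (-1, 1):
                new = pos + step
                if new <= 0:  # paths that reach 1 are dropped
                    nxt[new] = nxt.get(new, 0) + ways
        counts = nxt
    return sum(counts.values())

def survival_probability(n):
    """Exact P(tau > 2n) for tau = inf{m >= 1 : S_m = 1}."""
    return count_paths_below_one(2 * n) / 4 ** n

for n in (10, 100, 400):
    exact = survival_probability(n)
    approx = 1 / math.sqrt(math.pi * n)
    # The last column is the ratio exact/approx, which should approach 1.
    print(n, exact, approx, exact / approx)
```

In the cases checked the path count equals the central binomial coefficient $\binom{2n}{n}$, so the comparison amounts to Stirling's formula.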
Example 2.7.3. Assume $n$ objects $X_{n,1}, \ldots, X_{n,n}$ are placed independently and at random in $[-n, n]$. Let
\[ F_n = \sum_{m=1}^n \text{sgn}(X_{n,m})/|X_{n,m}|^p \]
be the net force exerted on 0. We will now show that if $p > 1/2$, then
\[ \lim_{n\to\infty} E\exp(itF_n) = \exp(-c|t|^{1/p}) \]
To do this, it is convenient to let $X_{n,m} = nY_m$ where the $Y_i$ are i.i.d. on $[-1, 1]$. Then
\[ F_n = n^{-p} \sum_{m=1}^n \text{sgn}(Y_m)/|Y_m|^p \]
Letting $Z_m = \text{sgn}(Y_m)/|Y_m|^p$, $Z_m$ is symmetric about 0 with $P(|Z_m| > x) = P(|Y_m| < x^{-1/p})$ so (i) in Theorem 2.7.2 holds with $\theta = 1/2$ and (ii) holds with $\alpha = 1/p$. The scaling constant $a_n \sim Cn^p$ and the centering constant is 0 by symmetry.
Exercise 2.7.3. Show that (i) If $p < 1/2$ then $F_n/n^{1/2-p} \Rightarrow c\chi$.
(ii) If $p = 1/2$ then $F_n/(\log n)^{1/2} \Rightarrow c\chi$.
Example 2.7.4. In the examples above, we have had $b_n = 0$. To get a feel for the centering constants consider $X_1, X_2, \ldots$ i.i.d. with
\[ P(X_i > x) = \theta x^{-\alpha} \qquad P(X_i < -x) = (1-\theta) x^{-\alpha} \]
where $0 < \alpha < 2$. In this case $a_n = n^{1/\alpha}$ and
\[ b_n = n \int_1^{n^{1/\alpha}} (2\theta - 1)\alpha x^{-\alpha}\,dx \sim \begin{cases} cn & \alpha > 1 \\ cn\log n & \alpha = 1 \\ cn^{1/\alpha} & \alpha < 1 \end{cases} \]
When $\alpha < 1$ the centering is the same size as the scaling and can be ignored. When $\alpha > 1$, $b_n \sim n\mu$ where $\mu = EX_i$.
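The three growth regimes for $b_n$ in Example 2.7.4 can be seen directly from the closed form of the integral. The following is a minimal sketch (the function names and the parameter values are illustrative assumptions): it evaluates $b_n$ exactly and divides by the leading term, with the constant $c$ worked out from the integral in each case.

```python
import math

def centering(n, alpha, theta):
    """b_n = n * int_1^{n^(1/alpha)} (2*theta - 1) * alpha * x^(-alpha) dx, in closed form."""
    upper = n ** (1.0 / alpha)
    if alpha == 1.0:
        integral = math.log(upper)
    else:
        integral = (upper ** (1.0 - alpha) - 1.0) / (1.0 - alpha)
    return n * (2.0 * theta - 1.0) * alpha * integral

def leading_term(n, alpha, theta):
    """Leading-order behavior: c*n, c*n*log(n), or c*n^(1/alpha)."""
    k = (2.0 * theta - 1.0) * alpha
    if alpha > 1.0:
        return k * n / (alpha - 1.0)      # equals n * E[X_i]
    if alpha == 1.0:
        return k * n * math.log(n)
    return k * n ** (1.0 / alpha) / (1.0 - alpha)

n, theta = 10 ** 6, 0.75
for alpha in (1.5, 1.0, 0.5):
    ratio = centering(n, alpha, theta) / leading_term(n, alpha, theta)
    print(alpha, ratio)  # each ratio should be close to 1
```

For $\alpha < 1$ the leading term is a constant multiple of $a_n = n^{1/\alpha}$, which is why the centering can be absorbed into the limit.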
Our next result explains the name stable laws. A random variable $Y$ is said to have a stable law if for every integer $k > 0$ there are constants $a_k$ and $b_k$ so that if $Y_1, \ldots, Y_k$ are i.i.d. and have the same distribution as $Y$, then $(Y_1 + \cdots + Y_k - b_k)/a_k =_d Y$. The last definition makes half of the next result obvious.
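For the characteristic functions in (2.7.11) with $\alpha \ne 1$, the stability property can be verified numerically. The sketch below assumes the normalizing constants $a_k = k^{1/\alpha}$ and $b_k = c(k - k^{1/\alpha})$, which come from a direct computation with (2.7.11) rather than from the text; the function and parameter names are ours.

```python
import cmath
import math

def stable_chf(t, alpha, kappa, scale=1.0, shift=0.0):
    """Characteristic function (2.7.11) with b = scale, c = shift, for alpha != 1."""
    if t == 0:
        return 1 + 0j
    sgn = 1.0 if t > 0 else -1.0
    w = math.tan(math.pi * alpha / 2)
    return cmath.exp(1j * t * shift - scale * abs(t) ** alpha * (1 + 1j * kappa * sgn * w))

def normalized_sum_chf(t, k, alpha, kappa, scale=1.0, shift=0.0):
    """Ch.f. of (Y_1 + ... + Y_k - b_k)/a_k for i.i.d. Y_i with ch.f. stable_chf."""
    a_k = k ** (1.0 / alpha)       # assumed scaling constant
    b_k = shift * (k - a_k)        # assumed centering constant
    return cmath.exp(-1j * t * b_k / a_k) * stable_chf(t / a_k, alpha, kappa, scale, shift) ** k

for alpha in (0.5, 1.5):
    for k in (2, 5):
        for t in (-2.0, 0.7, 3.0):
            lhs = normalized_sum_chf(t, k, alpha, 0.3, scale=2.0, shift=1.5)
            rhs = stable_chf(t, alpha, 0.3, scale=2.0, shift=1.5)
            print(alpha, k, t, abs(lhs - rhs))  # differences at rounding-error level
```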
Theorem 2.7.4. $Y$ is the limit of $(X_1 + \cdots + X_k - b_k)/a_k$ for some i.i.d. sequence $X_i$ if and only if $Y$ has a stable law.
Proof. If $Y$ has a stable law we can take $X_1, X_2, \ldots$ i.i.d. with distribution $Y$. To go the other way, let
\[ Z_n = (X_1 + \cdots + X_n - b_n)/a_n \]
and $S_n^j = X_{(j-1)n+1} + \cdots + X_{jn}$. A little arithmetic shows
\[ Z_{nk} = (S_n^1 + \cdots + S_n^k - b_{nk})/a_{nk} \]
\[ a_{nk} Z_{nk} = (S_n^1 - b_n) + \cdots + (S_n^k - b_n) + (kb_n - b_{nk}) \]
\[ a_{nk} Z_{nk}/a_n = (S_n^1 - b_n)/a_n + \cdots + (S_n^k - b_n)/a_n + (kb_n - b_{nk})/a_n \]
The first $k$ terms on the right-hand side $\Rightarrow Y_1 + \cdots + Y_k$ as $n \to \infty$ where $Y_1, \ldots, Y_k$ are independent and have the same distribution as $Y$, and $Z_{nk} \Rightarrow Y$. Taking $W_n = Z_{nk}$ and
\[ W_n' = \frac{a_{kn}}{a_n} Z_{nk} - \frac{kb_n - b_{nk}}{a_n} \]
in Theorem 2.7.5 below gives the desired result.
Theorem 2.7.5. Convergence of types theorem. If $W_n \Rightarrow W$ and there are constants $\alpha_n > 0$, $\beta_n$ so that $W_n' = \alpha_n W_n + \beta_n \Rightarrow W'$ where $W$ and $W'$ are nondegenerate, then there are constants $\alpha$ and $\beta$ so that $\alpha_n \to \alpha$ and $\beta_n \to \beta$.
Proof. Let $\varphi_n(t) = E\exp(itW_n)$ and
\[ \psi_n(t) = E\exp(it(\alpha_n W_n + \beta_n)) = \exp(it\beta_n)\varphi_n(\alpha_n t) \]
If $\varphi$ and $\psi$ are the characteristic functions of $W$ and $W'$, then
\[ \text{(a)} \quad \varphi_n(t) \to \varphi(t) \qquad \psi_n(t) = \exp(it\beta_n)\varphi_n(\alpha_n t) \to \psi(t) \]
Take a subsequence $\alpha_{n(m)}$ that converges to a limit $\alpha \in [0, \infty]$. Our first step is to observe $\alpha = 0$ is impossible. If this happens, then using the uniform convergence proved in Exercise 2.3.16
\[ \text{(b)} \quad |\psi_n(t)| = |\varphi_n(\alpha_n t)| \to 1 \]
so $|\psi(t)| \equiv 1$, and the limit is degenerate by Theorem 2.5.1. Letting $t = u/\alpha_n$ and interchanging the roles of $\varphi$ and $\psi$ shows $\alpha = \infty$ is impossible. If $\alpha$ is a subsequential limit, then arguing as in (b) gives $|\psi(t)| = |\varphi(\alpha t)|$. If there are two subsequential limits $\alpha' < \alpha$, using the last equation for both limits implies $|\varphi(u)| = |\varphi(u\alpha'/\alpha)|$. Iterating gives $|\varphi(u)| = |\varphi(u(\alpha'/\alpha)^k)| \to 1$ as $k \to \infty$, contradicting our assumption that $W$ is nondegenerate, so $\alpha_n \to \alpha \in (0, \infty)$.
To conclude that $\beta_n \to \beta$ now, we observe that (ii) of Exercise 2.3.16 implies $\varphi_n \to \varphi$ uniformly on compact sets so $\varphi_n(\alpha_n t) \to \varphi(\alpha t)$. If $\delta$ is small enough so that $|\varphi(\alpha t)| > 0$ for $|t| \le \delta$, it follows from (a) and another use of Exercise 2.3.16 that
\[ \exp(it\beta_n) = \frac{\psi_n(t)}{\varphi_n(\alpha_n t)} \to \frac{\psi(t)}{\varphi(\alpha t)} \]
uniformly on $[-\delta, \delta]$. $\exp(it\beta_n)$ is the ch.f. of a point mass at $\beta_n$. Using (2.3.1) now as in the proof of Theorem 2.3.6, it follows that the sequence of distributions that are point masses at $\beta_n$ is tight, i.e., $\beta_n$ is bounded. If $\beta_{n_m} \to \beta$ then $\exp(it\beta) = \psi(t)/\varphi(\alpha t)$ for $|t| \le \delta$, so there can only be one subsequential limit.
Theorem 2.7.4 justiﬁes calling the distributions with characteristic functions given
by (2.7.11) or (2.7.10) stable laws. To complete the story, we should mention that
these are the only stable laws. Again, see Chapter 9 of Breiman (1968) or Gnedenko
and Kolmogorov (1954). The next example shows that it is sometimes useful to know
what all the possible limits are.
Example 2.7.5. The Holtsmark distribution ($\alpha = 3/2$, $\kappa = 0$). Suppose stars are distributed in space according to a Poisson process with density $t$ and their masses are i.i.d. Let $X_t$ be the x-component of the gravitational force at 0 when the density is $t$. A change of density $1 \to t$ corresponds to a change of length $1 \to t^{-1/3}$, and gravitational attraction follows an inverse square law so
\[ X_t =_d t^{2/3} X_1 \qquad (2.7.13) \]
If we imagine thinning the Poisson process by rolling an n-sided die, then Exercise 2.6.12 implies
\[ X_t =_d X_{t/n}^1 + \cdots + X_{t/n}^n \]
where the random variables on the right-hand side are independent and have the same distribution as $X_{t/n}$. It follows from Theorem 2.7.4 that $X_t$ has a stable law. The scaling property (2.7.13) gives $X_{t/n} =_d n^{-2/3} X_t$, so the normalizing constant for a sum of $n$ copies is $n^{2/3} = n^{1/\alpha}$, which implies $\alpha = 3/2$. Since $X_t =_d -X_t$, $\kappa = 0$.
Exercises
Exercise 2.7.4. Let Y be a stable law with κ = 1. Use the limit theorem Theorem
2.7.2 to conclude that Y ≥ 0 if α < 1.
Exercise 2.7.5. Let $X$ be symmetric stable with index $\alpha$. (i) Use (2.3.1) to show that $E|X|^p < \infty$ for $p < \alpha$. (ii) Use the second proof of (2.7.3) to show that $P(|X| \ge x) \ge Cx^{-\alpha}$ so $E|X|^\alpha = \infty$.
Exercise 2.7.6. Let $Y, Y_1, Y_2, \ldots$ be independent and have a stable law with index $\alpha$. Theorem 2.7.4 implies there are constants $\alpha_k$ and $\beta_k$ so that $Y_1 + \cdots + Y_k$ and $\alpha_k Y + \beta_k$ have the same distribution. Use the proof of Theorem 2.7.4, Theorem 2.7.2 and Exercise 2.7.2 to conclude that (i) $\alpha_k = k^{1/\alpha}$, (ii) if $\alpha < 1$ then $\beta_k = 0$.
Exercise 2.7.7. Let $Y$ be a stable law with index $\alpha < 1$ and $\kappa = 1$. Exercise 2.7.4 implies that $Y \ge 0$, so we can define its Laplace transform $\psi(\lambda) = E\exp(-\lambda Y)$. The previous exercise implies that for any integer $n \ge 1$ we have $\psi(\lambda)^n = \psi(n^{1/\alpha}\lambda)$. Use this to conclude $E\exp(-\lambda Y) = \exp(-c\lambda^\alpha)$.
Exercise 2.7.8. (i) Show that if $X$ is symmetric stable with index $\alpha$ and $Y \ge 0$ is an independent stable with index $\beta < 1$ then $XY^{1/\alpha}$ is symmetric stable with index $\alpha\beta$. (ii) Let $W_1$ and $W_2$ be independent standard normals. Check that $1/W_2^2$ has the density given in (2.7.12) and use this to conclude that $W_1/W_2$ has a Cauchy distribution.

2.8 Infinitely Divisible Distributions*

In the last section, we identified the distributions that can appear as the limit of normalized sums of i.i.d. r.v.'s. In this section, we will describe those that are limits of sums
\[ (*) \quad S_n = X_{n,1} + \cdots + X_{n,n} \]
where the $X_{n,m}$ are i.i.d. Note the verb "describe." We will prove almost nothing in this section, just state some of the most important facts to bring the reader up to cocktail party literacy.
A sufficient condition for $Z$ to be a limit of sums of the form $(*)$ is that $Z$ has an infinitely divisible distribution, i.e., for each $n$ there is an i.i.d. sequence $Y_{n,1}, \ldots, Y_{n,n}$ so that
\[ Z =_d Y_{n,1} + \cdots + Y_{n,n} \]
Our first result shows that this condition is also necessary.
Theorem 2.8.1. $Z$ is a limit of sums of type $(*)$ if and only if $Z$ has an infinitely divisible distribution.
Proof. As remarked above, we only have to prove necessity. Write
\[ S_{2n} = (X_{2n,1} + \cdots + X_{2n,n}) + (X_{2n,n+1} + \cdots + X_{2n,2n}) \equiv Y_n + Y_n' \]
The random variables $Y_n$ and $Y_n'$ are independent and have the same distribution. If $S_n \Rightarrow Z$ then the distributions of $Y_n$ are a tight sequence since
\[ P(Y_n > y)^2 = P(Y_n > y)P(Y_n' > y) \le P(S_{2n} > 2y) \]
and similarly $P(Y_n < -y)^2 \le P(S_{2n} < -2y)$. If we take a subsequence $n_k$ so that $Y_{n_k} \Rightarrow Y$ (and hence $Y_{n_k}' \Rightarrow Y'$) then $Z =_d Y + Y'$. A similar argument shows that $Z$ can be divided into $n > 2$ pieces and the proof is complete.
With Theorem 2.8.1 established, we turn now to examples. In the ﬁrst three cases,
the distribution is inﬁnitely divisible because it is a limit of sums of the form (∗). The
number gives the relevant limit theorem.
Example 2.8.1. Normal distribution. Theorem 2.4.1
Example 2.8.2. Stable Laws. Theorem 2.7.2
Example 2.8.3. Poisson distribution. Theorem 2.6.1
Example 2.8.4. Compound Poisson distribution. Let $\xi_1, \xi_2, \ldots$ be i.i.d. and $N(\lambda)$ be an independent Poisson r.v. with mean $\lambda$. Then $Z = \xi_1 + \cdots + \xi_{N(\lambda)}$ has an infinitely divisible distribution. (Let $X_{n,j} =_d \xi_1 + \cdots + \xi_{N(\lambda/n)}$.) For developments below, we would like to observe that if $\varphi(t) = E\exp(it\xi_i)$ then
\[ E\exp(itZ) = \sum_{n=0}^\infty e^{-\lambda} \frac{\lambda^n}{n!} \varphi(t)^n = \exp(-\lambda(1 - \varphi(t))) \qquad (2.8.1) \]
Exercise 2.8.1. Show that the gamma distribution is infinitely divisible.
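The identity (2.8.1) is just the exponential series, and it is easy to confirm numerically. The sketch below is illustrative only: the two choices of $\varphi$ (steps $\pm 1$ with probability 1/2 each, and exponential(1) steps) and the truncation at 200 terms are assumptions made for the example.

```python
import cmath
import math

def compound_poisson_chf_series(t, lam, phi, terms=200):
    """Truncated series sum_n e^{-lam} lam^n / n! * phi(t)^n from (2.8.1)."""
    total = 0j
    weight = math.exp(-lam)  # Poisson probability of n = 0
    power = 1 + 0j           # phi(t)^0
    p = phi(t)
    for n in range(terms):
        total += weight * power
        weight *= lam / (n + 1)  # Poisson probability of n + 1
        power *= p               # phi(t)^(n + 1)
    return total

def compound_poisson_chf_closed(t, lam, phi):
    """Closed form exp(-lam * (1 - phi(t))) from (2.8.1)."""
    return cmath.exp(-lam * (1 - phi(t)))

def phi_sign(t):
    """Ch.f. of a step that is +1 or -1 with probability 1/2 each."""
    return math.cos(t)

def phi_exponential(t):
    """Ch.f. of an exponential(1) step."""
    return 1 / (1 - 1j * t)

for phi in (phi_sign, phi_exponential):
    series = compound_poisson_chf_series(1.3, 3.0, phi)
    closed = compound_poisson_chf_closed(1.3, 3.0, phi)
    print(phi.__name__, abs(series - closed))  # rounding-error size
```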
The next two exercises give examples of distributions that are not infinitely divisible.
Exercise 2.8.2. Show that the distribution of a bounded r.v. $Z$ is infinitely divisible if and only if $Z$ is constant. Hint: Show $\text{var}(Z) = 0$.
Exercise 2.8.3. Show that if $\mu$ is infinitely divisible, its ch.f. $\varphi$ never vanishes. Hint: Look at $\psi = |\varphi|^2$, which is also infinitely divisible, to avoid taking nth roots of complex numbers, then use Exercise 2.3.20.
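Example 2.8.4 is a son of 2.8.3 but a father of 2.8.1 and 2.8.2. To explain this remark, we observe that if $\xi = \epsilon$ and $-\epsilon$ with probability 1/2 each then $\varphi(t) = (e^{i\epsilon t} + e^{-i\epsilon t})/2 = \cos(\epsilon t)$. So if $\lambda = \epsilon^{-2}$, then (2.8.1) implies
\[ E\exp(itZ) = \exp(-\epsilon^{-2}(1 - \cos(\epsilon t))) \to \exp(-t^2/2) \]
as $\epsilon \to 0$. In words, the normal distribution is a limit of compound Poisson distributions. To see that stable laws are also a special case (using the notation from the proof of Theorem 2.7.2), let
\[ I_n(\epsilon) = \{m \le n : |X_m| > \epsilon a_n\} \]
\[ \hat S_n(\epsilon) = \sum_{m \in I_n(\epsilon)} X_m \]
\[ \bar S_n(\epsilon) = S_n - \hat S_n(\epsilon) \]
If $\epsilon_n \to 0$ then $\bar S_n(\epsilon_n)/a_n \Rightarrow 0$. If $\epsilon$ is fixed then as $n \to \infty$ we have $|I_n(\epsilon)| \Rightarrow \text{Poisson}(\epsilon^{-\alpha})$ and $\hat S_n(\epsilon)/a_n \Rightarrow$ a compound Poisson distribution:
\[ E\exp(it\hat S_n(\epsilon)/a_n) \to \exp(-\epsilon^{-\alpha}\{1 - \psi^\epsilon(t)\}) \]
Combining the last two observations and using the proof of Theorem 2.7.2 shows that stable laws are limits of compound Poisson distributions. The formula (2.7.10) for the limiting ch.f.
\[ \exp\left( itc + \int_0^\infty \left( e^{itx} - 1 - \frac{itx}{1+x^2} \right) \theta\alpha x^{-(\alpha+1)}\,dx + \int_{-\infty}^0 \left( e^{itx} - 1 - \frac{itx}{1+x^2} \right) (1-\theta)\alpha |x|^{-(\alpha+1)}\,dx \right) \qquad (2.8.2) \]
helps explain: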
Theorem 2.8.2. Lévy-Khinchin Theorem. $Z$ has an infinitely divisible distribution if and only if its characteristic function has
\[ \log\varphi(t) = ict - \frac{\sigma^2 t^2}{2} + \int \left( e^{itx} - 1 - \frac{itx}{1+x^2} \right) \mu(dx) \]
where $\mu$ is a measure with $\mu(\{0\}) = 0$ and $\int \frac{x^2}{1+x^2}\,\mu(dx) < \infty$.
For a proof, see Breiman (1968), Section 9.5, or Feller II (1971), Section XVII.2. $\mu$ is called the Lévy measure of the distribution. Comparing with (2.8.2) and recalling the proof of Theorem 2.7.2 suggests the following interpretation of $\mu$: If $\sigma^2 = 0$ then $Z$ can be built up by making a Poisson process on $R$ with mean measure $\mu$ and then summing up the points. As in the case of stable laws, we have to sum the points in $[-\epsilon, \epsilon]^c$, subtract an appropriate constant, and let $\epsilon \to 0$.
Exercise 2.8.4. What is the Lévy measure for the limit ℵ in part (iii) of Exercise 2.4.13?
The theory of infinitely divisible distributions is simpler in the case of finite variance. In this case, we have:
Theorem 2.8.3. Kolmogorov's Theorem. $Z$ has an infinitely divisible distribution with mean 0 and finite variance if and only if its ch.f. has
\[ \log\varphi(t) = \int (e^{itx} - 1 - itx)\,x^{-2}\,\nu(dx) \]
Here the integrand is $-t^2/2$ at 0, $\nu$ is called the canonical measure and $\text{var}(Z) = \nu(R)$.
To explain the formula, note that if $Z_\lambda$ has a Poisson distribution with mean $\lambda$
\[ E\exp(itx(Z_\lambda - \lambda)) = \exp(\lambda(e^{itx} - 1 - itx)) \]
so the measure for $Z = x(Z_\lambda - \lambda)$ has $\nu(\{x\}) = \lambda x^2$.

2.9 Limit Theorems in $R^d$

Let $X = (X_1, \ldots, X_d)$ be a random vector. We define its distribution function by $F(x) = P(X \le x)$. Here $x \in R^d$, and $X \le x$ means $X_i \le x_i$ for $i = 1, \ldots, d$. As in one dimension, $F$ has three obvious properties:
(i) It is nondecreasing, i.e., if $x \le y$ then $F(x) \le F(y)$.
(ii) $\lim_{x\to\infty} F(x) = 1$, $\lim_{x_i\to-\infty} F(x) = 0$.
(iii) $F$ is right continuous, i.e., $\lim_{y\downarrow x} F(y) = F(x)$.
Here $x \to \infty$ means each coordinate $x_i$ goes to $\infty$, $x_i \to -\infty$ means we let $x_i \to -\infty$ keeping the other coordinates fixed, and $y \downarrow x$ means each coordinate $y_i \downarrow x_i$.
In one dimension, any function with properties (i)–(iii) is the distribution of some
random variable. See Theorem 1.1.2. In d ≥ 2, this is not the case. Suppose d = 2
and let a1 < b1 , a2 < b2 .
P (X ∈ (a1 , b1 ] × (a2 , b2 ]) = F (b1 , b2 ) − F (a1 , b2 ) − F (b1 , a2 ) + F (a1 , a2 )
so if F is going to be a distribution function the last quantity has to be ≥ 0. The
next example shows that this is not guaranteed by (i)–(iii).
\[ F(x_1, x_2) = \begin{cases} 1 & \text{if } x_1, x_2 \ge 1 \\ 2/3 & \text{if } x_1 \ge 1 \text{ and } 0 \le x_2 < 1 \\ 2/3 & \text{if } x_2 \ge 1 \text{ and } 0 \le x_1 < 1 \\ 0 & \text{otherwise} \end{cases} \]
If $0 < a_1, a_2 < 1 \le b_1, b_2 < \infty$, then
\[ F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2) = 1 - 2/3 - 2/3 + 0 = -1/3 \]
A little thought reveals that $F$ is the distribution function of the measure with
\[ \mu(\{(0,1)\}) = \mu(\{(1,0)\}) = 2/3 \qquad \mu(\{(1,1)\}) = -1/3 \]
To formulate the additional condition we need to guarantee that $F$ is the distribution function of a probability measure, let
\[ A = (a_1, b_1] \times \cdots \times (a_d, b_d] \]
\[ V = \{a_1, b_1\} \times \cdots \times \{a_d, b_d\} \]
$V$ = the vertices of the rectangle $A$. If $v \in V$, let
\[ \text{sgn}(v) = (-1)^{\#\text{ of } a\text{'s in } v} \]
The inclusion-exclusion formula implies
\[ P(X \in A) = \sum_{v \in V} \text{sgn}(v) F(v) \]
So if we use $\Delta_A F$ to denote the right-hand side, we need
(iv) $\Delta_A F \ge 0$ for all rectangles $A$.
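The vertex sum $\Delta_A F$ is easy to compute mechanically. The following sketch (function names are ours) implements the inclusion-exclusion formula in any dimension with exact rational arithmetic, recovers the value $-1/3$ for the counterexample above, and, for contrast, evaluates a rectangle under the distribution function of two independent uniform(0,1) variables, an example chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

def rectangle_mass(F, a, b):
    """Delta_A F: sum of sgn(v) * F(v) over the vertices v of A = (a_1,b_1] x ... x (a_d,b_d]."""
    total = Fraction(0)
    for choice in product((0, 1), repeat=len(a)):  # 0 picks a_i, 1 picks b_i
        vertex = tuple(b[i] if pick else a[i] for i, pick in enumerate(choice))
        sign = -1 if choice.count(0) % 2 else 1    # (-1)^(number of a's in v)
        total += sign * F(*vertex)
    return total

def F_counterexample(x1, x2):
    """The function from the text that satisfies (i)-(iii) but not (iv)."""
    if x1 >= 1 and x2 >= 1:
        return Fraction(1)
    if x1 >= 1 and 0 <= x2 < 1:
        return Fraction(2, 3)
    if x2 >= 1 and 0 <= x1 < 1:
        return Fraction(2, 3)
    return Fraction(0)

def F_independent_uniforms(x1, x2):
    """Distribution function of two independent uniform(0,1) variables."""
    def clamp(u):
        return min(max(u, 0), 1)
    return clamp(x1) * clamp(x2)

half = Fraction(1, 2)
print(rectangle_mass(F_counterexample, (half, half), (1, 1)))
print(rectangle_mass(F_independent_uniforms, (Fraction(1, 4), Fraction(1, 4)), (Fraction(3, 4), 1)))
```

The first value is negative, so no probability measure can have `F_counterexample` as its distribution function; the second is the area of the rectangle intersected with the unit square.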
The last condition guarantees that the measure assigned to each rectangle is $\ge 0$. A standard result from measure theory (see Theorem ??) now implies there is a unique probability measure with distribution $F$.
Exercise 2.9.1. If $F$ is the distribution of $(X_1, \ldots, X_d)$ then $F_i(x) = P(X_i \le x)$ are its marginal distributions. How can they be obtained from $F$?
Exercise 2.9.2. Let $F_1, \ldots, F_d$ be distributions on $R$. Show that for any $\alpha \in [-1, 1]$
\[ F(x_1, \ldots, x_d) = \left\{ 1 + \alpha \prod_{i=1}^d (1 - F_i(x_i)) \right\} \prod_{j=1}^d F_j(x_j) \]
is a d.f. with the given marginals. The case $\alpha = 0$ corresponds to independent r.v.'s.
Exercise 2.9.3. A distribution $F$ is said to have a density $f$ if
\[ F(x_1, \ldots, x_k) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} f(y)\,dy_k \ldots dy_1 \]
Show that if $f$ is continuous, $\partial^k F/\partial x_1 \ldots \partial x_k = f$.
If Fn and F are distribution functions on Rd , we say that Fn converges weakly
to F , and write Fn ⇒ F , if Fn (x) → F (x) at all continuity points of F . Our ﬁrst task
is to show that there are enough continuity points for this to be a sensible deﬁnition.
For a concrete example, consider
\[ F(x, y) = \begin{cases} 1 & \text{if } x \ge 0,\ y \ge 1 \\ y & \text{if } x \ge 0,\ 0 \le y < 1 \\ 0 & \text{otherwise} \end{cases} \]
$F$ is the distribution function of $(0, Y)$ where $Y$ is uniform on (0,1). Notice that this distribution has no atoms, but $F$ is discontinuous at $(0, y)$ when $y > 0$.
Keeping the last example in mind, observe that if xn < x, i.e., xn,i < xi for all
coordinates i, and xn ↑ x as n → ∞ then
F (x) − F (xn ) = P (X ≤ x) − P (X ≤ xn ) ↓ P (X ≤ x) − P (X < x)
In d = 2, the last expression is the probability X lies in
{(a, x2 ) : a ≤ x1 } ∪ {(x1 , b) : b ≤ x2 }
Let $H_c^i = \{x : x_i = c\}$ be the hyperplane where the ith coordinate is $c$. For each $i$, the $H_c^i$ are disjoint so $D^i = \{c : P(X \in H_c^i) > 0\}$ is at most countable. It is easy to see that if $x$ has $x_i \notin D^i$ for all $i$ then $F$ is continuous at $x$. This gives us more than enough points to reconstruct $F$.
As in Section 2.2, it will be useful to have several equivalent deﬁnitions of weak
convergence. In Chapter 7, we will need to know that this is valid for an arbitrary
metric space (S, ρ), so we will prove the result in that generality and insert another
equivalence that will be useful there. $f$ is said to be Lipschitz continuous if there is a constant $C$ so that $|f(x) - f(y)| \le C\rho(x, y)$.
Theorem 2.9.1. The following statements are equivalent to $X_n \Rightarrow X_\infty$.
(i) $Ef(X_n) \to Ef(X_\infty)$ for all bounded continuous $f$.
(ii) $Ef(X_n) \to Ef(X_\infty)$ for all bounded Lipschitz continuous $f$.
(iii) For all closed sets $K$, $\limsup_{n\to\infty} P(X_n \in K) \le P(X_\infty \in K)$.
(iv) For all open sets $G$, $\liminf_{n\to\infty} P(X_n \in G) \ge P(X_\infty \in G)$.
(v) For all sets $A$ with $P(X_\infty \in \partial A) = 0$, $\lim_{n\to\infty} P(X_n \in A) = P(X_\infty \in A)$.
(vi) Let $D_f$ = the set of discontinuities of $f$. For all bounded functions $f$ with $P(X_\infty \in D_f) = 0$, we have $Ef(X_n) \to Ef(X_\infty)$.
Proof. We will begin by showing that (i)–(vi) are equivalent.
(i) implies (ii): Trivial.
(ii) implies (iii): Let $\rho(x, K) = \inf\{\rho(x, y) : y \in K\}$, $\varphi_j(r) = (1 - jr)^+$, and $f_j(x) = \varphi_j(\rho(x, K))$. $f_j$ is Lipschitz continuous, has values in [0,1], and $\downarrow 1_K(x)$ as $j \uparrow \infty$. So
\[ \limsup_{n\to\infty} P(X_n \in K) \le \lim_{n\to\infty} Ef_j(X_n) = Ef_j(X_\infty) \downarrow P(X_\infty \in K) \quad \text{as } j \uparrow \infty \]
(iii) is equivalent to (iv): As in the proof of Theorem 2.2.5, this follows easily from two facts: $A$ is open if and only if $A^c$ is closed; $P(A) + P(A^c) = 1$.
(iii) and (iv) imply (v): Let $K = \bar A$, $G = A^o$, and reason as in the proof of Theorem 2.2.5.
(v) implies (vi): Suppose $|f(x)| \le K$ and pick $\alpha_0 < \alpha_1 < \ldots < \alpha_\ell$ so that $P(f(X_\infty) = \alpha_i) = 0$ for $0 \le i \le \ell$, $\alpha_0 < -K < K < \alpha_\ell$, and $\alpha_i - \alpha_{i-1} < \epsilon$. This is always possible since $\{\alpha : P(f(X_\infty) = \alpha) > 0\}$ is a countable set. Let $A_i = \{x : \alpha_{i-1} < f(x) \le \alpha_i\}$. $\partial A_i \subset \{x : f(x) \in \{\alpha_{i-1}, \alpha_i\}\} \cup D_f$, so $P(X_\infty \in \partial A_i) = 0$, and it follows from (v) that
\[ \sum_{i=1}^\ell \alpha_i P(X_n \in A_i) \to \sum_{i=1}^\ell \alpha_i P(X_\infty \in A_i) \]
The definition of the $\alpha_i$ implies
\[ 0 \le \sum_{i=1}^\ell \alpha_i P(X_n \in A_i) - Ef(X_n) \le \epsilon \quad \text{for } 1 \le n \le \infty \]
Since $\epsilon$ is arbitrary, it follows that $Ef(X_n) \to Ef(X_\infty)$.
(vi) implies (i): Trivial.
It remains to show that the six conditions are equivalent to weak convergence ($\Rightarrow$).
(v) implies ($\Rightarrow$): If $F$ is continuous at $x$, then $A = (-\infty, x_1] \times \ldots \times (-\infty, x_d]$ has $\mu(\partial A) = 0$, so $F_n(x) = P(X_n \in A) \to P(X_\infty \in A) = F(x)$.
($\Rightarrow$) implies (iv): Let $D^i = \{c : P(X_\infty \in H_c^i) > 0\}$ where $H_c^i = \{x : x_i = c\}$. We say a rectangle $A = (a_1, b_1] \times \ldots \times (a_d, b_d]$ is good if $a_i, b_i \notin D^i$ for all $i$. ($\Rightarrow$) implies that for all good rectangles $P(X_n \in A) \to P(X_\infty \in A)$. This is also true for $B$ that are a finite disjoint union of good rectangles. Now any open set $G$ is an increasing limit of $B_k$'s that are a finite disjoint union of good rectangles, so
\[ \liminf_{n\to\infty} P(X_n \in G) \ge \liminf_{n\to\infty} P(X_n \in B_k) = P(X_\infty \in B_k) \uparrow P(X_\infty \in G) \]
as $k \to \infty$. The proof is complete.
Remark. In Section 2.2, we proved that (i)–(v) are consequences of weak convergence
by constructing r.v’s with the given distributions so that Xn → X∞ a.s. This can be
done in Rd (or any complete separable metric space), but the construction is rather
messy. See Billingsley (1979), p. 337–340 for a proof in Rd .
Exercise 2.9.4. Let $X_n$ be random vectors. Show that if $X_n \Rightarrow X$ then the coordinates $X_{n,i} \Rightarrow X_i$.
A sequence of probability measures $\mu_n$ is said to be tight if for any $\epsilon > 0$, there is an $M$ so that $\liminf_{n\to\infty} \mu_n([-M, M]^d) \ge 1 - \epsilon$.
Theorem 2.9.2. If $\mu_n$ is tight, then there is a weakly convergent subsequence.
Proof. Let $F_n$ be the associated distribution functions, and let $q_1, q_2, \ldots$ be an enumeration of $Q^d$ = the points in $R^d$ with rational coordinates. By a diagonal argument like the one in the proof of Theorem 2.2.6, we can pick a subsequence so that $F_{n(k)}(q) \to G(q)$ for all $q \in Q^d$. Let
\[ F(x) = \inf\{G(q) : q \in Q^d,\ q > x\} \]
where $q > x$ means $q_i > x_i$ for all $i$. It is easy to see that $F$ is right continuous. To check that it is a distribution function, we observe that if $A$ is a rectangle with vertices in $Q^d$ then $\Delta_A F_n \ge 0$ for all $n$, so $\Delta_A G \ge 0$, and taking limits we see that the last conclusion holds for $F$ for all rectangles $A$. Tightness implies that $F$ has properties (i) and (ii) of a distribution function. We leave it to the reader to check that $F_n \Rightarrow F$. The proof of Theorem 2.2.6 works if you read inequalities such as $r_1 < r_2 < x < s$ as the corresponding relations between vectors.
The characteristic function of (X1 , . . . , Xd ) is ϕ(t) = E exp(it · X ) where t · X =
t1 X1 + · · · + td Xd is the usual dot product of two vectors.
Theorem 2.9.3. Inversion formula. If $A = [a_1, b_1] \times \ldots \times [a_d, b_d]$ with $\mu(\partial A) = 0$ then
\[ \mu(A) = \lim_{T\to\infty} (2\pi)^{-d} \int_{[-T,T]^d} \prod_{j=1}^d \psi_j(t_j)\,\varphi(t)\,dt \]
where $\psi_j(s) = (\exp(-isa_j) - \exp(-isb_j))/is$.
Proof. Fubini's theorem implies
\[ \int_{[-T,T]^d} \int \prod_{j=1}^d \psi_j(t_j)\exp(it_j x_j)\,\mu(dx)\,dt = \int \prod_{j=1}^d \int_{-T}^T \psi_j(t_j)\exp(it_j x_j)\,dt_j\,\mu(dx) \]
It follows from the proof of Theorem 2.3.4 that
\[ \int_{-T}^T \psi_j(t_j)\exp(it_j x_j)\,dt_j \to \pi\left( 1_{(a_j,b_j)}(x_j) + 1_{[a_j,b_j]}(x_j) \right) \]
so the desired conclusion follows from the bounded convergence theorem.
Exercise 2.9.5. Let ϕ be the ch.f. of a distribution F on R. What is the distribution
on Rd that corresponds to the ch.f. ψ (t1 , . . . , td ) = ϕ(t1 + · · · + td )?
Exercise 2.9.6. Show that random variables $X_1, \ldots, X_k$ are independent if and only if
\[ \varphi_{X_1,\ldots,X_k}(t) = \prod_{j=1}^k \varphi_{X_j}(t_j) \]
Theorem 2.9.4. Convergence theorem. Let $X_n$, $1 \le n \le \infty$ be random vectors with ch.f. $\varphi_n$. A necessary and sufficient condition for $X_n \Rightarrow X_\infty$ is that $\varphi_n(t) \to \varphi_\infty(t)$.
Proof. exp(it · x) is bounded and continuous, so if Xn ⇒ X∞ then ϕn (t) → ϕ∞ (t).
To prove the other direction it suﬃces, as in the proof of Theorem 2.3.6, to prove that
the sequence is tight. To do this, we observe that if we ﬁx θ ∈ Rd , then for all s ∈ R,
ϕn (sθ) → ϕ∞ (sθ), so it follows from Theorem 2.3.6, that the distributions of θ · Xn
are tight. Applying the last observation to the d unit vectors e1 , . . . , ed shows that
the distributions of Xn are tight and completes the proof.
Remark. As before, if ϕn (t) → ϕ∞ (t) with ϕ∞ (t) continuous at 0, then ϕ∞ (t) is the
ch.f. of some X∞ and Xn ⇒ X∞ .
Theorem 2.9.4 has an important corollary.
Theorem 2.9.5. Cramér-Wold device. A sufficient condition for $X_n \Rightarrow X_\infty$ is that $\theta \cdot X_n \Rightarrow \theta \cdot X_\infty$ for all $\theta \in R^d$.
Proof. The indicated condition implies $E\exp(i\theta \cdot X_n) \to E\exp(i\theta \cdot X_\infty)$ for all $\theta \in R^d$.
Theorem 2.9.5 leads immediately to
Theorem 2.9.6. The central limit theorem in $R^d$. Let $X_1, X_2, \ldots$ be i.i.d. random vectors with $EX_n = \mu$, and finite covariances
\[ \Gamma_{ij} = E((X_{n,i} - \mu_i)(X_{n,j} - \mu_j)) \]
If $S_n = X_1 + \cdots + X_n$ then $(S_n - n\mu)/n^{1/2} \Rightarrow \chi$, where $\chi$ has a multivariate normal distribution with mean 0 and covariance $\Gamma$, i.e.,
\[ E\exp(i\theta \cdot \chi) = \exp\left( -\sum_i \sum_j \theta_i \theta_j \Gamma_{ij}/2 \right) \]
Proof. By considering $X_n' = X_n - \mu$, we can suppose without loss of generality that $\mu = 0$. Let $\theta \in R^d$. $\theta \cdot X_n$ is a random variable with mean 0 and variance
\[ E\left( \sum_i \theta_i X_{n,i} \right)^2 = \sum_i \sum_j E(\theta_i \theta_j X_{n,i} X_{n,j}) = \sum_i \sum_j \theta_i \theta_j \Gamma_{ij} \]
so it follows from the one-dimensional central limit theorem and Theorem 2.9.5 that $S_n/n^{1/2} \Rightarrow \chi$ where
\[ E\exp(i\theta \cdot \chi) = \exp\left( -\sum_i \sum_j \theta_i \theta_j \Gamma_{ij}/2 \right) \]
which proves the desired result.
To illustrate the use of Theorem 2.9.6, we consider two examples. In each e1 , . . . , ed
are the d unit vectors.
Example 2.9.1. Simple random walk on $Z^d$. Let $X_1, X_2, \ldots$ be i.i.d. with
\[ P(X_n = +e_i) = P(X_n = -e_i) = 1/(2d) \quad \text{for } i = 1, \ldots, d \]
$EX_n^i = 0$ and if $i \ne j$ then $EX_n^i X_n^j = 0$ since both components cannot be nonzero simultaneously. Since $E(X_n^i)^2 = P(X_n = \pm e_i) = 1/d$, the covariance matrix is $\Gamma = (1/d)I$.
Example 2.9.2. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_n = e_i) = 1/6$ for $i = 1, 2, \ldots, 6$. In words, we are rolling a die and keeping track of the numbers that come up. $EX_{n,i} = 1/6$ and $EX_{n,i} X_{n,j} = 0$ for $i \ne j$, so $\Gamma_{ij} = (1/6)(5/6)$ when $i = j$ and $= -(1/6)^2$ when $i \ne j$. In this case, the limiting distribution is concentrated on $\{x : \sum_i x_i = 0\}$.
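The covariance matrix in Example 2.9.2 can be computed directly from the definition. The sketch below (the function name is ours) does this with exact rational arithmetic for a vector that is uniform on the unit vectors $e_1, \ldots, e_d$; with $d = 6$ it reproduces the entries above and shows that every row sums to zero, which is the reason the limit lives on the hyperplane $\sum_i x_i = 0$.

```python
from fractions import Fraction

def covariance_of_uniform_unit_vector(d):
    """Covariance matrix of X uniform on the unit vectors e_1, ..., e_d."""
    p = Fraction(1, d)
    mean = [p] * d
    gamma = [[Fraction(0)] * d for _ in range(d)]
    for outcome in range(d):  # X = e_outcome with probability 1/d
        x = [Fraction(int(i == outcome)) for i in range(d)]
        for i in range(d):
            for j in range(d):
                gamma[i][j] += p * (x[i] - mean[i]) * (x[j] - mean[j])
    return gamma

gamma = covariance_of_uniform_unit_vector(6)
print(gamma[0][0], gamma[0][1])        # diagonal and off-diagonal entries
print([sum(row) for row in gamma])     # row sums
```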
Our treatment of the central limit theorem would not be complete without some discussion of the multivariate normal distribution. We begin by observing that $\Gamma_{ij} = \Gamma_{ji}$ and if $EX_i = 0$ and $EX_i X_j = \Gamma_{ij}$
\[ \sum_i \sum_j \theta_i \theta_j \Gamma_{ij} = E\left( \sum_i \theta_i X_i \right)^2 \ge 0 \]
so $\Gamma$ is symmetric and nonnegative definite. A well-known result implies that there is an orthogonal matrix $U$ (i.e., one with $U^t U = I$, the identity matrix) so that $\Gamma = U^t V U$, where $V \ge 0$ is a diagonal matrix. Let $W$ be the nonnegative diagonal matrix with $W^2 = V$. If we let $A = WU$, then $\Gamma = A^t A$. Let $Y$ be a d-dimensional vector whose components are independent and have normal distributions with mean 0 and variance 1. If we view vectors as $1 \times d$ matrices and let $\chi = YA$, then $\chi$ has the desired normal distribution. To check this, observe that
\[ \theta \cdot YA = \sum_i \theta_i \sum_j Y_j A_{ji} \]
has a normal distribution with mean 0 and variance
\[ \sum_j \left( \sum_i A_{ji}\theta_i \right)^2 = \sum_j \left( \sum_i \theta_i A^t_{ij} \right)\left( \sum_k A_{jk}\theta_k \right) = \theta A^t A \theta^t = \theta\Gamma\theta^t \]
so $E(\exp(i\theta \cdot \chi)) = \exp(-(\theta\Gamma\theta^t)/2)$.
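The construction $\chi = YA$ with $\Gamma = A^t A$ translates directly into a sampling routine. The sketch below is an illustration under a substitution: instead of the orthogonal diagonalization used in the text, it takes $A = L^t$ where $L$ is the Cholesky factor of $\Gamma$, which also satisfies $A^t A = LL^t = \Gamma$ but requires $\Gamma$ to be positive definite. The function names and the $2 \times 2$ example matrix are ours.

```python
import math
import random

def cholesky(gamma):
    """Lower triangular L with L L^t = gamma (gamma must be positive definite)."""
    d = len(gamma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(gamma[i][i] - s)
            else:
                L[i][j] = (gamma[i][j] - s) / L[j][j]
    return L

def times_transpose(L):
    """Return the matrix product L L^t."""
    d = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)] for i in range(d)]

def sample_normal(gamma, rng):
    """One draw of chi = Y A with A = L^t, so chi_i = sum_j Y_j A_ji = sum_j Y_j L_ij."""
    L = cholesky(gamma)
    d = len(gamma)
    y = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent standard normals
    return [sum(y[j] * L[i][j] for j in range(d)) for i in range(d)]

gamma = [[2.0, 1.0], [1.0, 2.0]]
print(times_transpose(cholesky(gamma)))   # should reproduce gamma up to rounding
print(sample_normal(gamma, random.Random(0)))
```

For a degenerate covariance such as the one in Example 2.9.2, the Cholesky step fails and the diagonalization described in the text is the appropriate route.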
If the covariance matrix has rank $d$, we say that the normal distribution is nondegenerate. In this case, its density function is given by
\[ (2\pi)^{-d/2} (\det\Gamma)^{-1/2} \exp\left( -\sum_{i,j} y_i \Gamma^{-1}_{ij} y_j / 2 \right) \]
The joint distribution in degenerate cases can be computed by using a linear transformation to reduce to the nondegenerate case. For instance, in Example 2.9.2 we can look at the distribution of $(X_1, \ldots, X_5)$.
Exercise 2.9.7. Suppose $(X_1, \ldots, X_d)$ has a multivariate normal distribution with mean vector $\theta$ and covariance $\Gamma$. Show $X_1, \ldots, X_d$ are independent if and only if $\Gamma_{ij} = 0$ for $i \ne j$. In words, uncorrelated random variables with a joint normal distribution are independent.
Exercise 2.9.8. Show that $(X_1, \ldots, X_d)$ has a multivariate normal distribution with mean vector $\theta$ and covariance $\Gamma$ if and only if every linear combination $c_1 X_1 + \cdots + c_d X_d$ has a normal distribution with mean $c\theta^t$ and variance $c\Gamma c^t$.

Chapter 3 Random Walks

Let X1 , X2 , . . . be i.i.d. taking values in Rd and let Sn = X1 + . . . + Xn . Sn is a
random walk. In the last chapter, we were primarily concerned with the distribution
of Sn . In this one, we will look at properties of the sequence S1 (ω ), S2 (ω ), . . . For
example, does the last sequence return to (or near) 0 inﬁnitely often? The ﬁrst
section introduces stopping times, a concept that will be very important in this and
the next two chapters. After the first section is completed, the remaining three can be read in any order or skipped without much loss. The second section is not starred since it contains some basic facts about random walks.

3.1 Stopping Times

Most of the results in this section are valid for i.i.d. $X$'s taking values in some nice measurable space $(S, \mathcal{S})$ and will be proved in that generality. For several reasons, it is convenient to use the special probability space from the proof of Kolmogorov's extension theorem:
\[ \Omega = \{(\omega_1, \omega_2, \ldots) : \omega_i \in S\} \]
\[ \mathcal{F} = \mathcal{S} \times \mathcal{S} \times \ldots \]
\[ P = \mu \times \mu \times \ldots \quad \mu \text{ is the distribution of } X_i \]
\[ X_n(\omega) = \omega_n \]
So, throughout this section, we will suppose (without loss of generality) that our random variables are constructed on this special space.
Before taking up our main topic, we will prove a 0-1 law that, in the i.i.d. case, generalizes Kolmogorov's. To state the new 0-1 law we need two definitions. A finite permutation of $N = \{1, 2, \ldots\}$ is a map $\pi$ from $N$ onto $N$ so that $\pi(i) \ne i$ for only finitely many $i$. If $\pi$ is a finite permutation of $N$ and $\omega \in S^N$ we define $(\pi\omega)_i = \omega_{\pi(i)}$. In words, the coordinates of $\omega$ are rearranged according to $\pi$. Since $X_i(\omega) = \omega_i$ this is the same as rearranging the random variables. An event $A$ is permutable if $\pi^{-1}A \equiv \{\omega : \pi\omega \in A\}$ is equal to $A$ for any finite permutation $\pi$, or in other words, if its occurrence is not affected by rearranging the random variables. The collection of permutable events is a σ-field. It is called the exchangeable σ-field and denoted by $\mathcal{E}$.
To see the reason for interest in permutable events, suppose $S = R$ and let $S_n(\omega) = X_1(\omega) + \cdots + X_n(\omega)$. Two examples of permutable events are
(i) $\{\omega : S_n(\omega) \in B \text{ i.o.}\}$
(ii) $\{\omega : \limsup_{n\to\infty} S_n(\omega)/c_n \ge 1\}$
In each case, the event is permutable because $S_n(\omega) = S_n(\pi\omega)$ for large $n$. The list of examples can be enlarged considerably by observing:
(iii) All events in the tail σ-field $\mathcal{T}$ are permutable.
To see this, observe that if $A \in \sigma(X_{n+1}, X_{n+2}, \ldots)$ then the occurrence of $A$ is unaffected by a permutation of $X_1, \ldots, X_n$. (i) shows that the converse of (iii) is false. The next result shows that for an i.i.d. sequence there is no difference between $\mathcal{E}$ and $\mathcal{T}$. They are both trivial.
Theorem 3.1.1. Hewitt-Savage 0-1 law. If $X_1, X_2, \ldots$ are i.i.d. and $A \in \mathcal{E}$ then $P(A) \in \{0, 1\}$.
Proof. Let $A \in \mathcal{E}$. As in the proof of Kolmogorov's 0-1 law, we will show $A$ is independent of itself, i.e., $P(A) = P(A \cap A) = P(A)P(A)$ so $P(A) \in \{0, 1\}$. Let $A_n \in \sigma(X_1, \ldots, X_n)$ so that
\[ \text{(a)} \quad P(A_n \Delta A) \to 0 \]
Here $A \Delta B = (A - B) \cup (B - A)$ is the symmetric difference. The existence of the $A_n$'s is proved in Exercise A.3.1. $A_n$ can be written as $\{\omega : (\omega_1, \ldots, \omega_n) \in B_n\}$ with $B_n \in \mathcal{S}^n$. Let
\[ \pi(j) = \begin{cases} j + n & \text{if } 1 \le j \le n \\ j - n & \text{if } n + 1 \le j \le 2n \\ j & \text{if } j \ge 2n + 1 \end{cases} \]
Observing that $\pi^2$ is the identity (so we don't have to worry about whether to write $\pi$ or $\pi^{-1}$) and the coordinates are i.i.d. (so the permuted coordinates are) gives
\[ \text{(b)} \quad P(\omega : \omega \in A_n \Delta A) = P(\omega : \pi\omega \in A_n \Delta A) \]
Now $\{\omega : \pi\omega \in A\} = \{\omega : \omega \in A\}$, since $A$ is permutable, and
\[ \{\omega : \pi\omega \in A_n\} = \{\omega : (\omega_{n+1}, \ldots, \omega_{2n}) \in B_n\} \]
If we use $A_n'$ to denote the last event then we have
\[ \text{(c)} \quad \{\omega : \pi\omega \in A_n \Delta A\} = \{\omega : \omega \in A_n' \Delta A\} \]
Combining (b) and (c) gives
\[ \text{(d)} \quad P(A_n \Delta A) = P(A_n' \Delta A) \]
It is easy to see that
\[ |P(B) - P(C)| \le P(B \Delta C) \]
so (a) and (d) imply $P(A_n), P(A_n') \to P(A)$. Now $A - C \subset (A - B) \cup (B - C)$ and a similar inequality for $C - A$ imply $A \Delta C \subset (A \Delta B) \cup (B \Delta C)$. The last inequality, (d), and (a) imply
\[ P(A_n \Delta A_n') \le P(A_n \Delta A) + P(A \Delta A_n') \to 0 \]
The last result implies
\[ 0 \le P(A_n) - P(A_n \cap A_n') \le P(A_n \cup A_n') - P(A_n \cap A_n') = P(A_n \Delta A_n') \to 0 \]
so $P(A_n \cap A_n') \to P(A)$. But $A_n$ and $A_n'$ are independent, so
\[ P(A_n \cap A_n') = P(A_n)P(A_n') \to P(A)^2 \]
This shows $P(A) = P(A)^2$, and proves the theorem.
A typical application of Theorem 3.1.1 is
Theorem 3.1.2. For a random walk on R, there are only four possibilities, one of
which has probability one.
(i) Sn = 0 for all n.
(ii) Sn → ∞.
(iii) Sn → −∞.
(iv) −∞ = lim inf Sn < lim sup Sn = ∞.
Proof. Theorem 3.1.1 implies $\limsup S_n$ is a constant $c \in [-\infty, \infty]$. Let $S_n' = S_{n+1} - X_1$. Since $S_n'$ has the same distribution as $S_n$, it follows that $c = c - X_1$. If $c$ is finite, subtracting $c$ from both sides we conclude $X_1 \equiv 0$ and (i) occurs. Turning the last statement around, we see that if $X_1 \not\equiv 0$ then $c = -\infty$ or $\infty$. The same analysis applies to the liminf. Discarding the impossible combination $\limsup S_n = -\infty$ and $\liminf S_n = +\infty$, we have proved the result.
Exercise 3.1.1. Symmetric random walk. Let $X_1, X_2, \ldots \in \mathbf{R}$ be i.i.d. with a distribution that is symmetric about 0 and nondegenerate (i.e., $P(X_i = 0) < 1$). Show that we are in case (iv) of Theorem 3.1.2.
Exercise 3.1.2. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$ and $EX_i^2 = \sigma^2 \in (0, \infty)$. Use the central limit theorem to conclude that we are in case (iv) of Theorem 3.1.2. Later in Exercise 3.1.11 you will show that $EX_i = 0$ and $P(X_i = 0) < 1$ is sufficient.
The special case in which $P(X_i = 1) = P(X_i = -1) = 1/2$ is called simple random walk. Since a simple random walk cannot skip over any integers, it follows from either exercise above that with probability one it visits every integer infinitely many times.
Let Fn = σ (X1 , . . . , Xn ) = the information known at time n. A random variable
N taking values in {1, 2, . . .} ∪ {∞} is said to be a stopping time or an optional
random variable if for every n < ∞, {N = n} ∈ Fn . If we think of Sn as giving the
(logarithm of the) price of a stock at time n, and N as the time we sell it, then the last
deﬁnition says that the decision to sell at time n must be based on the information
known at that time. The last interpretation gives one explanation for the second
name. N is a time at which we can exercise an option to buy a stock. Chung prefers
the second name because N is “usually rather a momentary pause after which the
process proceeds again: time marches on!”
The canonical example of a stopping time is N = inf {n : Sn ∈ A}, the hitting
time of A. To check that this is a stopping time, we observe that
{N = n} = {S1 ∈ Ac , . . . , Sn−1 ∈ Ac , Sn ∈ A} ∈ Fn
Two concrete examples of hitting times that have appeared above are
Example 3.1.1. $N = \inf\{k : |S_k| \ge x\}$ from the proof of Theorem 1.8.2.
Example 3.1.2. If the $X_i \ge 0$ and $N_t = \sup\{n : S_n \le t\}$ is the random variable that first appeared in Example 1.7.1, then $N_t + 1 = \inf\{n : S_n > t\}$ is a stopping time.
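As a small illustration of the hitting-time definition, the sketch below computes $N = \inf\{n : S_n \in A\}$ along one fixed finite path. The names `hitting_time`, `steps`, and `in_target` are hypothetical helpers introduced here, and `None` stands in for $N = \infty$ when the target is not reached within the path. The last line checks the defining property of a stopping time on this example: the value of $N$ is already determined by the first $N$ steps.

```python
from itertools import accumulate

def hitting_time(steps, in_target):
    """Return the first n >= 1 with S_n in the target set, or None if the
    walk does not enter the target within len(steps) steps."""
    for n, s in enumerate(accumulate(steps), start=1):
        if in_target(s):
            return n
    return None

steps = [1, 1, -1, 1, 1, -1]          # partial sums: 1, 2, 1, 2, 3, 2
N = hitting_time(steps, lambda s: abs(s) >= 3)

# {N = n} is decided by the first n steps: truncating the path at N
# gives the same answer.
same_on_prefix = hitting_time(steps[:N], lambda s: abs(s) >= 3) == N
```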
The next result allows us to construct new examples from the old ones.
Exercise 3.1.3. If S and T are stopping times then S ∧ T and S ∨ T are stopping
times. Since constant times are stopping times, it follows that S ∧ n and S ∨ n are
stopping times.
Exercise 3.1.4. Suppose S and T are stopping times. Is S + T a stopping time?
Give a proof or a counterexample.
Associated with each stopping time N is a σ-field FN = the information known
at time N . Formally, FN is the collection of sets A that have A ∩ {N = n} ∈ Fn for
all n < ∞, i.e., when N = n, A must be measurable with respect to the information
known at time n. Trivial but important examples of sets in FN are {N ≤ n}, i.e., N
is measurable with respect to FN .
Exercise 3.1.5. Show that if $Y_n \in \mathcal{F}_n$ and $N$ is a stopping time, $Y_N \in \mathcal{F}_N$. As a corollary of this result we see that if $f : S \to \mathbf{R}$ is measurable, $T_n = \sum_{m \le n} f(X_m)$, and $M_n = \max_{m \le n} T_m$ then $T_N$ and $M_N \in \mathcal{F}_N$. An important special case is $S = \mathbf{R}$, $f(x) = x$.
Exercise 3.1.6. Show that if M ≤ N are stopping times then FM ⊂ FN .
Exercise 3.1.7. Show that if $L \le M$ and $A \in \mathcal{F}_L$ then
$$N = \begin{cases} L & \text{on } A \\ M & \text{on } A^c \end{cases}$$
is a stopping time.
Our first result about $\mathcal{F}_N$ is
Theorem 3.1.3. Let X1 , X2 , . . . be i.i.d., Fn = σ (X1 , . . . , Xn ) and N be a stopping
time. Conditional on {N < ∞}, {XN +n , n ≥ 1} is independent of FN and has the
same distribution as the original sequence.
Proof. By Theorem A.2.2 it is enough to show that if $A \in \mathcal{F}_N$ and $B_j \in \mathcal{S}$ for $1 \le j \le k$ then
$$P(A, N < \infty, X_{N+j} \in B_j, 1 \le j \le k) = P(A \cap \{N < \infty\}) \prod_{j=1}^k \mu(B_j)$$
where $\mu(B) = P(X_i \in B)$. The method ("divide and conquer") is one that we will see many times below. We break things down according to the value of $N$ in order to replace $N$ by $n$ and reduce to the case of a fixed time.
$$\begin{aligned} P(A, N = n, X_{N+j} \in B_j, 1 \le j \le k) &= P(A, N = n, X_{n+j} \in B_j, 1 \le j \le k) \\ &= P(A \cap \{N = n\}) \prod_{j=1}^k \mu(B_j) \end{aligned}$$
since $A \cap \{N = n\} \in \mathcal{F}_n$ and that σ-field is independent of $X_{n+1}, \ldots, X_{n+k}$. Summing over $n$ now gives the desired result.
To delve further into properties of stopping times, we recall we have supposed
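$\Omega = S^{\mathbf{N}}$ and define the shift $\theta : \Omega \to \Omega$ by
$$(\theta\omega)(n) = \omega(n + 1) \qquad n = 1, 2, \ldots$$
In words, we drop the first coordinate and shift the others one place to the left. The iterates of $\theta$ are defined by composition. Let $\theta^1 = \theta$, and for $k \ge 2$ let $\theta^k = \theta \circ \theta^{k-1}$. Clearly, $(\theta^k\omega)(n) = \omega(n + k)$, $n = 1, 2, \ldots$ To extend the last definition to stopping times, we let
$$\theta^N\omega = \begin{cases} \theta^n\omega & \text{on } \{N = n\} \\ \Delta & \text{on } \{N = \infty\} \end{cases}$$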
Here ∆ is an extra point that we add to Ω. According to the only joke in Blumenthal
and Getoor (1968), ∆ is a “cemetery or heaven depending upon your point of view.”
Seriously, ∆ is a convenience in making deﬁnitions like the next one.
Example 3.1.3. Returns to 0. For a concrete example of the use of θ, suppose
S = Rd and let
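$$\tau(\omega) = \inf\{n : \omega_1 + \cdots + \omega_n = 0\}$$
where $\inf \emptyset = \infty$, and we set $\tau(\Delta) = \infty$. If we let $\tau_2(\omega) = \tau(\omega) + \tau(\theta^\tau\omega)$ then on $\{\tau < \infty\}$,
$$\begin{aligned} \tau(\theta^\tau\omega) &= \inf\{n : (\theta^\tau\omega)_1 + \cdots + (\theta^\tau\omega)_n = 0\} \\ &= \inf\{n : \omega_{\tau+1} + \cdots + \omega_{\tau+n} = 0\} \\ \tau(\omega) + \tau(\theta^\tau\omega) &= \inf\{m > \tau : \omega_1 + \cdots + \omega_m = 0\} \end{aligned}$$
So $\tau_2$ is the time of the second visit to 0 (and thanks to the conventions $\theta^\infty\omega = \Delta$ and $\tau(\Delta) = \infty$, this is true for all $\omega$). The last computation generalizes easily to show that if we let
$$\tau_n(\omega) = \tau_{n-1}(\omega) + \tau(\theta^{\tau_{n-1}}\omega)$$
then $\tau_n$ is the time of the nth visit to 0.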
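The recursion for the return times can be traced on a finite path. The following is only a sketch under stated assumptions: a Python list plays the role of $\omega$, the hypothetical helper `shift` drops the first $k$ coordinates (the action of the $k$-fold shift), and `None` stands in for both $\infty$ and the cemetery point $\Delta$, so "no return within the path" is treated as "no return".

```python
def tau(omega):
    """First n >= 1 with omega_1 + ... + omega_n == 0, or None (standing in
    for infinity) if no such n occurs within the finite path."""
    total = 0
    for n, x in enumerate(omega, start=1):
        total += x
        if total == 0:
            return n
    return None

def shift(omega, k):
    """k-fold shift: drop the first k coordinates."""
    return omega[k:]

def return_times(omega):
    """Iterate tau_n = tau_{n-1} + tau(shifted path), starting from tau_0 = 0."""
    times, current = [], 0
    while True:
        step = tau(shift(omega, current))
        if step is None:          # plays the role of the cemetery state
            return times
        current += step
        times.append(current)

omega = [1, -1, 1, 1, -1, -1, -1, 1]   # partial sums: 1, 0, 1, 2, 1, 0, -1, 0
visits = return_times(omega)
```

On this path the partial sums vanish at times 2, 6 and 8, and the recursion recovers exactly those times by restarting the search from each return.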
If we have any stopping time $T$, we can define its iterates by $T_0 = 0$ and
$$T_n(\omega) = T_{n-1}(\omega) + T(\theta^{T_{n-1}}\omega) \quad \text{for } n \ge 1$$
If we assume $P = \mu \times \mu \times \cdots$ then
$$P(T_n < \infty) = P(T < \infty)^n \tag{3.1.1}$$
Proof. We will prove this by induction. The result is trivial when $n = 1$. Suppose now that it is valid for $n - 1$. Applying Theorem 3.1.3 to $N = T_{n-1}$, we see that $T(\theta^{T_{n-1}}) < \infty$ is independent of $T_{n-1} < \infty$, and has the same probability as $T < \infty$, so
$$\begin{aligned} P(T_n < \infty) &= P(T_{n-1} < \infty, T(\theta^{T_{n-1}}\omega) < \infty) \\ &= P(T_{n-1} < \infty)P(T < \infty) = P(T < \infty)^n \end{aligned}$$
by the induction hypothesis.
Letting $t_n = T(\theta^{T_{n-1}})$, we can extend Theorem 3.1.3 to
Theorem 3.1.4. Suppose $P(T < \infty) = 1$. Then the “random vectors”
$$V_n = (t_n, X_{T_{n-1}+1}, \ldots, X_{T_n})$$
are independent and identically distributed.
Proof. It is clear from Theorem 3.1.3 that $V_n$ and $V_1$ have the same distribution. The independence follows from Theorem 3.1.3 and induction since $V_1, \ldots, V_{n-1} \in \mathcal{F}(T_{n-1})$.
Example 3.1.4. Ladder variables. Let α(ω ) = inf {n : ω1 + · · · + ωn > 0} where
inf ∅ = ∞, and set α(∆) = ∞. Let α0 = 0 and let
αk (ω ) = αk−1 (ω ) + α(θαk−1 ω )
for k ≥ 1. At time αk , the random walk is at a record high value.
The next three exercises investigate these times.
Exercise 3.1.8. (i) If P (α < ∞) < 1 then P (sup Sn < ∞) = 1.
(ii) If P (α < ∞) = 1 then P (sup Sn = ∞) = 1.
Exercise 3.1.9. Let β = inf {n : Sn < 0}. Prove that the four possibilities in
Theorem 3.1.2 correspond to the four combinations of P (α < ∞) < 1 or = 1, and
P (β < ∞) < 1 or = 1.
Exercise 3.1.10. Let $S_0 = 0$, $\bar\beta = \inf\{n \ge 1 : S_n \le 0\}$ and
$$A^n_m = \{0 \ge S_m, S_1 \ge S_m, \ldots, S_{m-1} \ge S_m, S_m < S_{m+1}, \ldots, S_m < S_n\}$$
(i) Show $1 = \sum_{m=0}^n P(A^n_m) = \sum_{m=0}^n P(\alpha > m)P(\bar\beta > n - m)$.
(ii) Let $n \to \infty$ and conclude $E\alpha = 1/P(\bar\beta = \infty)$.
Exercise 3.1.11. (i) Combine the last exercise with the proof of (ii) in Exercise 3.1.8 to conclude that if $EX_i = 0$ then $P(\bar\beta = \infty) = 0$. (ii) Show that if we assume in addition that $P(X_i = 0) < 1$ then $P(\beta = \infty) = 0$ and Exercise 3.1.9 implies we are in case (iv) of Theorem 3.1.2.
A famous result about stopping times for random walks is:
Theorem 3.1.5. Wald's equation. Let $X_1, X_2, \ldots$ be i.i.d. with $E|X_i| < \infty$. If $N$ is a stopping time with $EN < \infty$ then $ES_N = EX_1 EN$.
Proof. First suppose the $X_i \ge 0$.
$$ES_N = \int S_N \, dP = \sum_{n=1}^\infty \int S_n 1_{\{N=n\}} \, dP = \sum_{n=1}^\infty \sum_{m=1}^n \int X_m 1_{\{N=n\}} \, dP$$
Since the $X_i \ge 0$, we can interchange the order of summation (i.e., use Fubini's theorem) to conclude that the last expression
$$= \sum_{m=1}^\infty \sum_{n=m}^\infty \int X_m 1_{\{N=n\}} \, dP = \sum_{m=1}^\infty \int X_m 1_{\{N \ge m\}} \, dP$$
Now $\{N \ge m\} = \{N \le m - 1\}^c \in \mathcal{F}_{m-1}$ and is independent of $X_m$, so the last expression
$$= \sum_{m=1}^\infty EX_m P(N \ge m) = EX_1 EN$$
To prove the result in general, we run the last argument backwards. If we have $EN < \infty$ then
$$\infty > \sum_{m=1}^\infty E|X_m| P(N \ge m) = \sum_{m=1}^\infty \sum_{n=m}^\infty \int |X_m| 1_{\{N=n\}} \, dP$$
The last formula shows that the double sum converges absolutely in one order, so Fubini's theorem gives
$$\sum_{m=1}^\infty \sum_{n=m}^\infty \int X_m 1_{\{N=n\}} \, dP = \sum_{n=1}^\infty \sum_{m=1}^n \int X_m 1_{\{N=n\}} \, dP$$
Using the independence of $\{N \ge m\} \in \mathcal{F}_{m-1}$ and $X_m$, and rewriting the last identity, it follows that
$$\sum_{m=1}^\infty EX_m P(N \ge m) = ES_N$$
Since the left-hand side is $EN \, EX_1$, the proof is complete.
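Wald's equation can be checked exactly in a small case by enumerating paths. The sketch below is an illustration under assumptions of our own choosing: steps are $\pm 1$ with $P(X_i = 1) = p$, and the stopping time is the bounded one $N = \min(\inf\{n : S_n = 1\}, \text{horizon})$, so $EN < \infty$ automatically. The function name `wald_check` and its parameters are hypothetical. Exact rational arithmetic avoids any simulation error.

```python
from fractions import Fraction
from itertools import product

def wald_check(p, horizon, stop_level=1):
    """Return (E[S_N], E[X_1] * E[N]) for
    N = min(first n with S_n == stop_level, horizon),
    where the steps are +1 with probability p and -1 with probability 1 - p."""
    q = 1 - p
    e_sn = Fraction(0)
    e_n = Fraction(0)
    # Enumerate every path of length `horizon`; N only looks at the steps
    # taken up to the moment of stopping.
    for steps in product((1, -1), repeat=horizon):
        prob = Fraction(1)
        for x in steps:
            prob *= p if x == 1 else q
        s, n = 0, 0
        for x in steps:
            s += x
            n += 1
            if s == stop_level:
                break
        e_sn += prob * s
        e_n += prob * n
    return e_sn, (p - q) * e_n

lhs, rhs = wald_check(Fraction(2, 3), 6)
```

Passing `p` as a `Fraction` keeps every probability exact, so the two sides can be compared with `==` rather than a tolerance.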
Exercise 3.1.12. Let X1 , X2 , . . . be i.i.d. uniform on (0,1), let Sn = X1 + · · · + Xn ,
and let T = inf {n : Sn > 1}. Show that P (T > n) = 1/n!, so ET = e and EST = e/2.
Example 3.1.5. Simple random walk. Let $X_1, X_2, \ldots$ be i.i.d. with $P(X_i = 1) = 1/2$ and $P(X_i = -1) = 1/2$. Let $a < 0 < b$ be integers and let $N = \inf\{n : S_n \notin (a, b)\}$. To apply Theorem 3.1.5, we have to check that $EN < \infty$. To do this, we observe that if $x \in (a, b)$, then
$$P(x + S_{b-a} \notin (a, b)) \ge 2^{-(b-a)}$$
since $b - a$ steps of size $+1$ in a row will take us out of the interval. Iterating the last inequality, it follows that
$$P(N > n(b - a)) \le \left(1 - 2^{-(b-a)}\right)^n$$
so $EN < \infty$. Applying Theorem 3.1.5 now gives $ES_N = 0$ or
$$bP(S_N = b) + aP(S_N = a) = 0$$
Since $P(S_N = b) + P(S_N = a) = 1$, it follows that $(b - a)P(S_N = b) = -a$, so
$$P(S_N = b) = \frac{-a}{b - a} \qquad P(S_N = a) = \frac{b}{b - a}$$
Letting $T_a = \inf\{n : S_n = a\}$, we can write the last conclusion as
$$P(T_a < T_b) = \frac{b}{b - a} \quad \text{for } a < 0 < b$$
Setting $b = M$ and letting $M \to \infty$ gives
$$P(T_a < \infty) \ge P(T_a < T_M) \to 1 \tag{3.1.2}$$
for all $a < 0$. From symmetry (and the fact that $T_0 \equiv 0$), it follows that
$$P(T_x < \infty) = 1 \quad \text{for all } x \in \mathbf{Z} \tag{3.1.3}$$
Our final fact about $T_x$ is that $ET_x = \infty$ for $x \ne 0$. To prove this, note that if $ET_x < \infty$ then Theorem 3.1.5 would imply
$$x = ES_{T_x} = EX_1 ET_x = 0$$
In Section 3.3, we will compute the distribution of $T_1$ and show that
$$P(T_1 > t) \sim C t^{-1/2}$$
Exercise 3.1.13. Asymmetric simple random walk. Let X1 , X2 , . . . be i.i.d. with
P (X1 = 1) = p > 1/2 and P (X1 = −1) = 1 − p, and let Sn = X1 + · · · + Xn . Let
α = inf {m : Sm > 0} and β = inf {n : Sn < 0}.
(i) Use Exercise 3.1.9 to conclude that P (α < ∞) = 1 and P (β < ∞) < 1.
(ii) If Y = inf Sn , then P (Y ≤ −k ) = P (β < ∞)k .
(iii) Apply Wald's equation to $\alpha \wedge n$ and let $n \to \infty$ to get $E\alpha = 1/EX_1 = 1/(2p - 1)$. Comparing with Exercise 3.1.10 shows $P(\bar\beta = \infty) = 2p - 1$.
Exercise 3.1.14. An optimal stopping problem. Let $X_n$, $n \ge 1$ be i.i.d. with $EX_1^+ < \infty$ and let
$$Y_n = \max_{1 \le m \le n} X_m - cn$$
That is, we are looking for a large value of $X$, but we have to pay $c > 0$ for each observation. (i) Let $T = \inf\{n : X_n > a\}$, $p = P(X_n > a)$, and compute $EY_T$. (ii) Let $\alpha$ (possibly $< 0$) be the unique solution of $E(X_1 - \alpha)^+ = c$. Show that $EY_T = \alpha$ in this case and use the inequality
$$Y_n \le \alpha + \sum_{m=1}^n \left((X_m - \alpha)^+ - c\right)$$
for $n \ge 1$ to conclude that if $\tau \ge 1$ is a stopping time with $E\tau < \infty$, then $EY_\tau \le \alpha$. The analysis above assumes that you have to play at least once. If the optimal $\alpha < 0$, then you shouldn't play at all.
Theorem 3.1.6. Wald's second equation. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_n = 0$ and $EX_n^2 = \sigma^2 < \infty$. If $T$ is a stopping time with $ET < \infty$ then $ES_T^2 = \sigma^2 ET$.
Proof. Using the definitions and then taking expected value
$$S_{T \wedge n}^2 = S_{T \wedge (n-1)}^2 + (2X_n S_{n-1} + X_n^2) 1_{(T \ge n)}$$
$$ES_{T \wedge n}^2 = ES_{T \wedge (n-1)}^2 + \sigma^2 P(T \ge n)$$
since $EX_n = 0$ and $X_n$ is independent of $S_{n-1}$ and $1_{(T \ge n)} \in \mathcal{F}_{n-1}$. [The expectation of $S_{n-1}X_n$ exists since both random variables are in $L^2$.] From the last equality and induction we get
$$ES_{T \wedge n}^2 = \sigma^2 \sum_{m=1}^n P(T \ge m)$$
$$E(S_{T \wedge n} - S_{T \wedge m})^2 = \sigma^2 \sum_{k=m+1}^n P(T \ge k)$$
The second equality follows from the first applied to $X_{m+1}, X_{m+2}, \ldots$. The second equality implies that $S_{T \wedge n}$ is a Cauchy sequence in $L^2$, so letting $n \to \infty$ in the first it follows that $ES_T^2 = \sigma^2 ET$.
Example 3.1.6. Simple random walk, II. Continuing Example 3.1.5 we investigate $N = \inf\{n : S_n \notin (a, b)\}$. We have shown that $EN < \infty$. Since $\sigma^2 = 1$ it follows from Theorem 3.1.6 and (3.1.2) that
$$EN = ES_N^2 = a^2 \frac{b}{b - a} + b^2 \frac{-a}{b - a} = -ab$$
If $b = L$ and $a = -L$, $EN = L^2$.
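The exit probability $-a/(b-a)$ and the expected exit time $-ab$ can also be recovered without Wald's equations, by conditioning on the first step. This is a different technique from the one in the text, offered only as a cross-check: with $h(x)$ the probability of reaching $b$ before $a$ from $x$, and $g(x)$ the expected exit time from $x$, one has $h(x) = (h(x-1) + h(x+1))/2$ and $g(x) = 1 + (g(x-1) + g(x+1))/2$ inside $(a, b)$. The helper name `solve_walk_equation` is hypothetical; it solves the resulting tridiagonal system exactly with rational arithmetic.

```python
from fractions import Fraction

def solve_walk_equation(a, b, source, left, right):
    """Solve u(x) = source + (u(x-1) + u(x+1)) / 2 for a < x < b with
    boundary values u(a) = left and u(b) = right (Thomas algorithm)."""
    half = Fraction(1, 2)
    xs = list(range(a + 1, b))
    n = len(xs)
    diag = [Fraction(1)] * n
    rhs = [Fraction(source)] * n
    rhs[0] += half * left
    rhs[-1] += half * right
    for i in range(1, n):                      # forward elimination
        w = -half / diag[i - 1]
        diag[i] += w * half
        rhs[i] -= w * rhs[i - 1]
    u = [Fraction(0)] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        u[i] = (rhs[i] + half * u[i + 1]) / diag[i]
    return dict(zip(xs, u))

a, b = -3, 5
hit_b = solve_walk_equation(a, b, 0, 0, 1)[0]      # P(S_N = b) starting from 0
exit_time = solve_walk_equation(a, b, 1, 0, 0)[0]  # EN starting from 0
```

For $a = -3$, $b = 5$ the formulas in the text predict $P(S_N = b) = 3/8$ and $EN = 15$.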
An amusing consequence of Theorem 3.1.6 is
Theorem 3.1.7. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_n = 0$ and $EX_n^2 = 1$, and let $T_c = \inf\{n \ge 1 : |S_n| > cn^{1/2}\}$. Then
$$ET_c \begin{cases} < \infty & \text{for } c < 1 \\ = \infty & \text{for } c \ge 1 \end{cases}$$
Proof. One half of this is easy. If $ET_c < \infty$ then Theorem 3.1.6 implies $ET_c = E(S_{T_c}^2) > c^2 ET_c$, a contradiction if $c \ge 1$. To prove the other direction, we let $\tau = T_c \wedge n$ and observe $S_{\tau-1}^2 \le c^2(\tau - 1)$, so using the Cauchy-Schwarz inequality
$$E\tau = ES_\tau^2 = ES_{\tau-1}^2 + 2E(S_{\tau-1}X_\tau) + EX_\tau^2 \le c^2 E\tau + 2c(E\tau \, EX_\tau^2)^{1/2} + EX_\tau^2$$
To complete the proof now, we will show
Lemma 3.1.8. If $T$ is a stopping time with $ET = \infty$ then
$$EX_{T \wedge n}^2 / E(T \wedge n) \to 0$$
Theorem 3.1.7 follows for if $\epsilon < 1 - c^2$ and $n$ is large, we will have $E\tau \le (c^2 + \epsilon)E\tau$, a contradiction.
Proof. We begin by writing
$$E(X_{T \wedge n}^2) = E(X_{T \wedge n}^2 ; X_{T \wedge n}^2 \le \epsilon(T \wedge n)) + \sum_{j=1}^n E(X_j^2 ; T \wedge n = j, X_j^2 > \epsilon j)$$
The first term is $\le \epsilon E(T \wedge n)$. To bound the second, choose $N \ge 1$ so that for $n \ge N$
$$\sum_{j=1}^n E(X_j^2 ; X_j^2 > \epsilon j) < n\epsilon$$
This is possible since the dominated convergence theorem implies $E(X_j^2 ; X_j^2 > \epsilon j) \to 0$ as $j \to \infty$. For the first part of the sum, we use a trivial bound
$$\sum_{j=1}^N E(X_j^2 ; T \wedge n = j, X_j^2 > \epsilon j) \le N EX_1^2$$
To bound the remainder of the sum, we note (i) $X_j^2 \ge 0$; (ii) $\{T \wedge n \ge j\}$ is $\in \mathcal{F}_{j-1}$ and hence is independent of $X_j^2 1_{(X_j^2 > \epsilon j)}$, (iii) use some trivial arithmetic, (iv) use Fubini's theorem and enlarge the range of $j$, (v) use the choice of $N$ and a trivial inequality
$$\begin{aligned} \sum_{j=N}^n E(X_j^2 ; T \wedge n = j, X_j^2 > \epsilon j) &\le \sum_{j=N}^n E(X_j^2 ; T \wedge n \ge j, X_j^2 > \epsilon j) \\ &= \sum_{j=N}^n P(T \wedge n \ge j) E(X_j^2 ; X_j^2 > \epsilon j) = \sum_{j=N}^n \sum_{k=j}^\infty P(T \wedge n = k) E(X_j^2 ; X_j^2 > \epsilon j) \\ &\le \sum_{k=N}^\infty \sum_{j=1}^k P(T \wedge n = k) E(X_j^2 ; X_j^2 > \epsilon j) \le \sum_{k=N}^\infty \epsilon k P(T \wedge n = k) \le \epsilon E(T \wedge n) \end{aligned}$$
Combining our estimates shows
$$EX_{T \wedge n}^2 \le 2\epsilon E(T \wedge n) + N EX_1^2$$
Letting $n \to \infty$ and noting $E(T \wedge n) \to \infty$, we have
$$\limsup_{n \to \infty} EX_{T \wedge n}^2 / E(T \wedge n) \le 2\epsilon$$
where $\epsilon$ is arbitrary.
3.2 Recurrence
Throughout this section, Sn will be a random walk, i.e., Sn = X1 + · · · + Xn where
X1 , X2 , . . . are i.i.d., and we will investigate the question mentioned at the beginning of
the chapter. Does the sequence S1 (ω ), S2 (ω ), . . . return to (or near) 0 inﬁnitely often?
The answer to the last question is either Yes or No, and the random walk is called
recurrent or transient accordingly. We begin with some deﬁnitions that formulate the
question precisely and a result that establishes a dichotomy between the two cases.
The number $x \in \mathbf{R}^d$ is said to be a recurrent value for the random walk $S_n$ if for every $\epsilon > 0$, $P(\|S_n - x\| < \epsilon \text{ i.o.}) = 1$. Here $\|x\| = \sup |x_i|$. The reader will see the reason for this choice of norm in the proof of Lemma 3.2.5. The Hewitt-Savage 0-1 law, Theorem 3.1.1, implies that if the last probability is $< 1$, it is 0. Our first result shows that to know the set of recurrent values, it is enough to check $x = 0$. A number $x$ is said to be a possible value of the random walk if for any $\epsilon > 0$, there is an $n$ so that $P(\|S_n - x\| < \epsilon) > 0$.
Theorem 3.2.1. The set $\mathcal{V}$ of recurrent values is either $\emptyset$ or a closed subgroup of $\mathbf{R}^d$. In the second case, $\mathcal{V} = \mathcal{U}$, the set of possible values.
Proof. Suppose $\mathcal{V} \ne \emptyset$. It is clear that $\mathcal{V}^c$ is open, so $\mathcal{V}$ is closed. To prove that $\mathcal{V}$ is a group, we will first show that
(∗) if $x \in \mathcal{U}$ and $y \in \mathcal{V}$ then $y - x \in \mathcal{V}$.
This statement has been formulated so that once it is established, the result follows easily. Let
$$p_{\delta,m}(z) = P(\|S_n - z\| \ge \delta \text{ for all } n \ge m)$$
If $y - x \notin \mathcal{V}$, there is an $\epsilon > 0$ and $m \ge 1$ so that $p_{2\epsilon,m}(y - x) > 0$. Since $x \in \mathcal{U}$, there is a $k$ so that $P(\|S_k - x\| < \epsilon) > 0$. Since
$$P(\|S_n - S_k - (y - x)\| \ge 2\epsilon \text{ for all } n \ge k + m) = p_{2\epsilon,m}(y - x)$$
and is independent of $\{\|S_k - x\| < \epsilon\}$, it follows that
$$p_{\epsilon,m+k}(y) \ge P(\|S_k - x\| < \epsilon) \, p_{2\epsilon,m}(y - x) > 0$$
contradicting $y \in \mathcal{V}$, so $y - x \in \mathcal{V}$.
To conclude $\mathcal{V}$ is a group when $\mathcal{V} \ne \emptyset$, let $q, r \in \mathcal{V}$, and observe: (i) taking $x = y = r$ in (∗) shows $0 \in \mathcal{V}$, (ii) taking $x = r$, $y = 0$ shows $-r \in \mathcal{V}$, and (iii) taking $x = -r$, $y = q$ shows $q + r \in \mathcal{V}$. To prove that $\mathcal{V} = \mathcal{U}$ now, observe that if $u \in \mathcal{U}$ taking $x = u$, $y = 0$ shows $-u \in \mathcal{V}$ and since $\mathcal{V}$ is a group, it follows that $u \in \mathcal{V}$.
If $\mathcal{V} = \emptyset$, the random walk is said to be transient, otherwise it is called recurrent.
Before plunging into the technicalities needed to treat a general random walk, we
begin by analyzing the special case Polya considered in 1921. Legend has it that
Polya thought of this problem while wandering around in a park near Zürich when
he noticed that he kept encountering the same young couple. History does not record
what the young couple thought.
Example 3.2.1. Simple random walk on $\mathbf{Z}^d$.
$$P(X_i = e_j) = P(X_i = -e_j) = 1/2d$$
for each of the $d$ unit vectors $e_j$. To analyze this case, we begin with a result that is valid for any random walk. Let $\tau_0 = 0$ and $\tau_n = \inf\{m > \tau_{n-1} : S_m = 0\}$ be the time of the nth return to 0. From (3.1.1), it follows that
$$P(\tau_n < \infty) = P(\tau_1 < \infty)^n$$
a fact that leads easily to:
Theorem 3.2.2. For any random walk, the following are equivalent:
(i) $P(\tau_1 < \infty) = 1$, (ii) $P(S_m = 0 \text{ i.o.}) = 1$, and (iii) $\sum_{m=0}^\infty P(S_m = 0) = \infty$.
Proof. If $P(\tau_1 < \infty) = 1$, then $P(\tau_n < \infty) = 1$ for all $n$ and $P(S_m = 0 \text{ i.o.}) = 1$. Let
$$V = \sum_{m=0}^\infty 1_{(S_m = 0)} = \sum_{n=0}^\infty 1_{(\tau_n < \infty)}$$
be the number of visits to 0, counting the visit at time 0. Taking expected value and using Fubini's theorem to put the expected value inside the sum:
$$EV = \sum_{m=0}^\infty P(S_m = 0) = \sum_{n=0}^\infty P(\tau_n < \infty) = \sum_{n=0}^\infty P(\tau_1 < \infty)^n = \frac{1}{1 - P(\tau_1 < \infty)}$$
The second equality shows (ii) implies (iii), and in combination with the last two shows that if (i) is false then (iii) is false (i.e., (iii) implies (i)).
Theorem 3.2.3. Simple random walk is recurrent in $d \le 2$ and transient in $d \ge 3$.
To steal a joke from Kakutani (U.C.L.A. colloquium talk): “A drunk man will eventually find his way home but a drunk bird may get lost forever.”
Proof. Let $\rho_d(m) = P(S_m = 0)$. $\rho_d(m)$ is 0 if $m$ is odd. From Theorem 2.1.3, we get $\rho_1(2n) \sim (\pi n)^{-1/2}$ as $n \to \infty$. This and Theorem 3.2.2 give the result in one dimension. Our next step is
Simple random walk is recurrent in two dimensions. Note that in order for $S_{2n} = 0$ we must for some $0 \le m \le n$ have $m$ up steps, $m$ down steps, $n - m$ to the left and $n - m$ to the right so
$$\rho_2(2n) = 4^{-2n} \sum_{m=0}^n \frac{(2n)!}{m! \, m! \, (n-m)! \, (n-m)!} = 4^{-2n} \binom{2n}{n} \sum_{m=0}^n \binom{n}{m} \binom{n}{n-m} = 4^{-2n} \binom{2n}{n}^2 = \rho_1(2n)^2$$
To see the next to last equality, consider choosing $n$ students from a class with $n$ boys and $n$ girls and observe that for some $0 \le m \le n$ you must choose $m$ boys and $n - m$ girls. Using the asymptotic formula $\rho_1(2n) \sim (\pi n)^{-1/2}$, we get $\rho_2(2n) \sim (\pi n)^{-1}$. Since $\sum n^{-1} = \infty$, the result follows from Theorem 3.2.2.
Remark. For a direct proof of $\rho_2(2n) = \rho_1(2n)^2$, note that if $T_n^1$ and $T_n^2$ are independent, one dimensional random walks then $T_n$ jumps from $x$ to $x + (1, 1)$, $x + (1, -1)$, $x + (-1, 1)$, and $x + (-1, -1)$ with equal probability, so rotating $T_n$ by 45 degrees and dividing by $\sqrt{2}$ gives $S_n$.
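The identity $\rho_2(2n) = \rho_1(2n)^2$ is easy to confirm for small $n$ with exact arithmetic. The sketch below evaluates the multinomial sum from the text directly; the function names `rho1` and `rho2` are hypothetical and take $n$ (not $2n$) as their argument.

```python
from fractions import Fraction
from math import comb, factorial

def rho1(n):
    """P(S_{2n} = 0) for simple random walk on Z."""
    return Fraction(comb(2 * n, n), 4 ** n)

def rho2(n):
    """P(S_{2n} = 0) on Z^2: sum over m up steps, m down steps,
    n - m left steps and n - m right steps."""
    total = sum(
        factorial(2 * n) // (factorial(m) ** 2 * factorial(n - m) ** 2)
        for m in range(n + 1)
    )
    return Fraction(total, 4 ** (2 * n))

identity_holds = all(rho2(n) == rho1(n) ** 2 for n in range(12))
```

The integer division is exact because each summand is a multinomial coefficient.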
Simple random walk is transient in three dimensions. Intuitively, this holds since the probability of being back at 0 after $2n$ steps is $\sim cn^{-3/2}$ and this is summable. We will not compute the probability exactly but will get an upper bound of the right order of magnitude. Again, since the number of steps in the directions $\pm e_i$ must be equal for $i = 1, 2, 3$
$$\begin{aligned} \rho_3(2n) &= 6^{-2n} \sum_{j,k} \frac{(2n)!}{(j! \, k! \, (n-j-k)!)^2} = 2^{-2n} \binom{2n}{n} \sum_{j,k} \left( 3^{-n} \frac{n!}{j! \, k! \, (n-j-k)!} \right)^2 \\ &\le 2^{-2n} \binom{2n}{n} \max_{j,k} 3^{-n} \frac{n!}{j! \, k! \, (n-j-k)!} \end{aligned}$$
where in the last inequality we have used the fact that if $a_{j,k}$ are $\ge 0$ and sum to 1 then $\sum_{j,k} a_{j,k}^2 \le \max_{j,k} a_{j,k}$. Our last step is to show
$$\max_{j,k} 3^{-n} \frac{n!}{j! \, k! \, (n-j-k)!} \le Cn^{-1}$$
To do this, we note that (a) if any of the numbers $j$, $k$ or $n - j - k$ is $< [n/3]$ increasing the smallest number and decreasing the largest number decreases the denominator (since $x(1 - x)$ is maximized at 1/2), so the maximum occurs when all three numbers are as close as possible to $n/3$; (b) Stirling's formula implies
$$\frac{n!}{j! \, k! \, (n-j-k)!} \sim \frac{n^n}{j^j k^k (n-j-k)^{n-j-k}} \cdot \sqrt{\frac{n}{jk(n-j-k)}} \cdot \frac{1}{2\pi}$$
Taking $j$ and $k$ within 1 of $n/3$ the first term on the right is $\le C3^n$, and the desired result follows.
Simple random walk is transient in $d > 3$. Let $T_n = (S_n^1, S_n^2, S_n^3)$, $N(0) = 0$ and $N(n) = \inf\{m > N(n-1) : T_m \ne T_{N(n-1)}\}$. It is easy to see that $T_{N(n)}$ is a three-dimensional simple random walk. Since $T_{N(n)}$ returns infinitely often to 0 with probability 0 and the first three coordinates are constant in between the $N(n)$, $S_n$ is transient.
Remark. Let $\pi_d = P(S_n = 0 \text{ for some } n \ge 1)$ be the probability simple random walk on $\mathbf{Z}^d$ returns to 0. The last display in the proof of Theorem 3.2.2 implies
$$\sum_{n=0}^\infty P(S_{2n} = 0) = \frac{1}{1 - \pi_d} \tag{3.2.1}$$
In $d = 3$, $P(S_{2n} = 0) \sim Cn^{-3/2}$ so $\sum_{n=N}^\infty P(S_{2n} = 0) \sim C'N^{-1/2}$, and the series converges rather slowly. For example, if we want to compute the return probability to 5 decimal places, we would need $10^{10}$ terms. At the end of the section, we will give another formula that leads very easily to accurate results.
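To get a feel for how slowly the series in (3.2.1) converges in $d = 3$, one can evaluate $\rho_3(2n)$ exactly from the multinomial sum and add up the first few dozen terms. This is a rough numerical sketch only; `rho3` and `partial_sum` are hypothetical names, and `rho3` takes $n$ (not $2n$) as its argument.

```python
from fractions import Fraction
from math import factorial

def rho3(n):
    """P(S_{2n} = 0) for simple random walk on Z^3, from the multinomial sum."""
    total = 0
    for j in range(n + 1):
        for k in range(n - j + 1):
            denom = (factorial(j) * factorial(k) * factorial(n - j - k)) ** 2
            total += factorial(2 * n) // denom
    return Fraction(total, 6 ** (2 * n))

def partial_sum(terms):
    """Sum of P(S_{2n} = 0) over 0 <= n < terms, as a float."""
    return float(sum(rho3(n) for n in range(terms)))

first_fifty = partial_sum(50)
```

Since every term is positive, each partial sum underestimates $1/(1 - \pi_3)$, and the gap shrinks only like $N^{-1/2}$.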
The rest of this section is devoted to proving the following facts about random walks:
• $S_n$ is recurrent in $d = 1$ if $S_n/n \to 0$ in probability.
• $S_n$ is recurrent in $d = 2$ if $S_n/n^{1/2} \Rightarrow$ a nondegenerate normal distribution.
• $S_n$ is transient in $d \ge 3$ if it is “truly three dimensional.”
To prove the last result we will give a necessary and sufficient condition for recurrence.
The first step in deriving these results is to generalize Theorem 3.2.2.
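Lemma 3.2.4. If $\sum_{n=1}^\infty P(\|S_n\| < \epsilon) < \infty$ then $P(\|S_n\| < \epsilon \text{ i.o.}) = 0$. If $\sum_{n=1}^\infty P(\|S_n\| < \epsilon) = \infty$ then $P(\|S_n\| < 2\epsilon \text{ i.o.}) = 1$.
Proof. The first conclusion follows from the Borel-Cantelli lemma. To prove the second, let $F = \{\|S_n\| < \epsilon \text{ i.o.}\}^c$. Breaking things down according to the last time $\|S_n\| < \epsilon$,
$$\begin{aligned} P(F) &= \sum_{m=0}^\infty P(\|S_m\| < \epsilon, \|S_n\| \ge \epsilon \text{ for all } n \ge m + 1) \\ &\ge \sum_{m=0}^\infty P(\|S_m\| < \epsilon, \|S_n - S_m\| \ge 2\epsilon \text{ for all } n \ge m + 1) \\ &= \sum_{m=0}^\infty P(\|S_m\| < \epsilon) \, \rho_{2\epsilon,1} \end{aligned}$$
where $\rho_{\delta,k} = P(\|S_n\| \ge \delta \text{ for all } n \ge k)$. Since $P(F) \le 1$, and
$$\sum_{m=0}^\infty P(\|S_m\| < \epsilon) = \infty$$
it follows that $\rho_{2\epsilon,1} = 0$. To extend this conclusion to $\rho_{2\epsilon,k}$ with $k \ge 2$, let
$$A_m = \{\|S_m\| < \epsilon, \|S_n\| \ge \epsilon \text{ for all } n \ge m + k\}$$
Since any $\omega$ can be in at most $k$ of the $A_m$, repeating the argument above gives
$$k \ge \sum_{m=0}^\infty P(A_m) \ge \sum_{m=0}^\infty P(\|S_m\| < \epsilon) \, \rho_{2\epsilon,k}$$
So $\rho_{2\epsilon,k} = P(\|S_n\| \ge 2\epsilon \text{ for all } n \ge k) = 0$, and since $k$ is arbitrary, the desired conclusion follows.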
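Our second step is to show that the convergence or divergence of the sums in Lemma 3.2.4 is independent of $\epsilon$. The previous proof works for any norm. For the next one, we need $\|x\| = \sup_i |x_i|$.
Lemma 3.2.5. Let $m$ be an integer $\ge 2$.
$$\sum_{n=0}^\infty P(\|S_n\| < m\epsilon) \le (2m)^d \sum_{n=0}^\infty P(\|S_n\| < \epsilon)$$
Proof. We begin by observing
$$\sum_{n=0}^\infty P(\|S_n\| < m\epsilon) \le \sum_{n=0}^\infty \sum_k P(S_n \in k\epsilon + [0, \epsilon)^d)$$
where the inner sum is over $k \in \{-m, \ldots, m - 1\}^d$. If we let
$$T_k = \inf\{\ell \ge 0 : S_\ell \in k\epsilon + [0, \epsilon)^d\}$$
then breaking things down according to the value of $T_k$ and using Fubini's theorem gives
$$\begin{aligned} \sum_{n=0}^\infty P(S_n \in k\epsilon + [0, \epsilon)^d) &= \sum_{n=0}^\infty \sum_{\ell=0}^n P(S_n \in k\epsilon + [0, \epsilon)^d, T_k = \ell) \\ &\le \sum_{\ell=0}^\infty \sum_{n=\ell}^\infty P(\|S_n - S_\ell\| < \epsilon, T_k = \ell) \end{aligned}$$
Since $\{T_k = \ell\}$ and $\{\|S_n - S_\ell\| < \epsilon\}$ are independent, the last sum
$$= \sum_{m=0}^\infty P(T_k = m) \sum_{j=0}^\infty P(\|S_j\| < \epsilon) \le \sum_{j=0}^\infty P(\|S_j\| < \epsilon)$$
Since there are $(2m)^d$ values of $k$ in $\{-m, \ldots, m - 1\}^d$, the proof is complete.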
Combining Lemmas 3.2.4 and 3.2.5 gives:
Theorem 3.2.6. The convergence (resp. divergence) of $\sum_n P(\|S_n\| < \epsilon)$ for a single value of $\epsilon > 0$ is sufficient for transience (resp. recurrence).
In $d = 1$, if $EX_i = \mu \ne 0$, then the strong law of large numbers implies $S_n/n \to \mu$ so $|S_n| \to \infty$ and $S_n$ is transient. As a converse, we have
Theorem 3.2.7. Chung-Fuchs theorem. Suppose $d = 1$. If the weak law of large numbers holds in the form $S_n/n \to 0$ in probability, then $S_n$ is recurrent.
Proof. Let $u_n(x) = P(|S_n| < x)$ for $x > 0$. Lemma 3.2.5 implies
$$\sum_{n=0}^\infty u_n(1) \ge \frac{1}{2m} \sum_{n=0}^\infty u_n(m) \ge \frac{1}{2m} \sum_{n=0}^{Am} u_n(n/A)$$
for any $A < \infty$ since $u_n(x) \ge 0$ and is increasing in $x$. By hypothesis $u_n(n/A) \to 1$, so letting $m \to \infty$ and noticing the right-hand side is $A/2$ times the average of the first $Am$ terms
$$\sum_{n=0}^\infty u_n(1) \ge A/2$$
Since $A$ is arbitrary, the sum must be $\infty$, and the desired conclusion follows from Theorem 3.2.6.
Theorem 3.2.8. If $S_n$ is a random walk in $\mathbf{R}^2$ and $S_n/n^{1/2} \Rightarrow$ a nondegenerate normal distribution then $S_n$ is recurrent.
Remark. The conclusion is also true if the limit is degenerate, but in that case the random walk is essentially one (or zero) dimensional, and the result follows from the Chung-Fuchs theorem.
Proof. Let $u(n, m) = P(\|S_n\| < m)$. Lemma 3.2.5 implies
$$\sum_{n=0}^\infty u(n, 1) \ge (4m^2)^{-1} \sum_{n=0}^\infty u(n, m)$$
If $m/\sqrt{n} \to c$ then
$$u(n, m) \to \int_{[-c,c]^2} n(x) \, dx$$
where $n(x)$ is the density of the limiting normal distribution. If we use $\rho(c)$ to denote the right-hand side and let $n = [\theta m^2]$, it follows that $u([\theta m^2], m) \to \rho(\theta^{-1/2})$. If we write
$$m^{-2} \sum_{n=0}^\infty u(n, m) = \int_0^\infty u([\theta m^2], m) \, d\theta$$
let $m \to \infty$, and use Fatou's lemma, we get
$$\liminf_{m \to \infty} \, (4m^2)^{-1} \sum_{n=0}^\infty u(n, m) \ge 4^{-1} \int_0^\infty \rho(\theta^{-1/2}) \, d\theta$$
Since the normal density is positive and continuous at 0
$$\rho(c) = \int_{[-c,c]^2} n(x) \, dx \sim n(0)(2c)^2$$
as $c \to 0$. So $\rho(\theta^{-1/2}) \sim 4n(0)/\theta$ as $\theta \to \infty$, the integral diverges, and backtracking to the first inequality in the proof it follows that $\sum_{n=0}^\infty u(n, 1) = \infty$, proving the result.
We come now to the promised necessary and sufficient condition for recurrence. Here $\varphi(t) = E\exp(it \cdot X_j)$ is the ch.f. of one step of the random walk.
Theorem 3.2.9. Let $\delta > 0$. $S_n$ is recurrent if and only if
$$\int_{(-\delta,\delta)^d} \mathrm{Re} \, \frac{1}{1 - \varphi(y)} \, dy = \infty$$
We will prove a weaker result:
Theorem 3.2.10. Let $\delta > 0$. $S_n$ is recurrent if and only if
$$\sup_{r < 1} \int_{(-\delta,\delta)^d} \mathrm{Re} \, \frac{1}{1 - r\varphi(y)} \, dy = \infty$$
Remark. Half of the work needed to get the first result from the second is trivial.
$$0 \le \mathrm{Re} \, \frac{1}{1 - r\varphi(y)} \to \mathrm{Re} \, \frac{1}{1 - \varphi(y)} \quad \text{as } r \to 1$$
so Fatou's lemma shows that if the integral is infinite, the walk is recurrent. The other direction is rather difficult: the second result is in Chung and Fuchs (1951), but a proof of the first result had to wait for Ornstein (1969) and Stone (1969) to solve the problem independently. Their proofs use a trick to reduce to the case where the increments have a density and then a second trick to deal with that case, so we will not give the details here. The reader can consult either of the sources cited or Port and Stone (1969), where the result is demonstrated for random walks on Abelian groups.
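Proof. The first ingredient in the solution is the
Lemma 3.2.11. Parseval relation. Let $\mu$ and $\nu$ be probability measures on $\mathbf{R}^d$ with ch.f.'s $\varphi$ and $\psi$.
$$\int \psi(t) \, \mu(dt) = \int \varphi(x) \, \nu(dx)$$
Proof. Since $e^{it \cdot x}$ is bounded, Fubini's theorem implies
$$\int \psi(t) \, \mu(dt) = \int\!\!\int e^{it \cdot x} \, \nu(dx) \, \mu(dt) = \int\!\!\int e^{it \cdot x} \, \mu(dt) \, \nu(dx) = \int \varphi(x) \, \nu(dx)$$
Our second ingredient is a little calculus.
Lemma 3.2.12. If $|x| \le \pi/3$ then $1 - \cos x \ge x^2/4$.
Proof. It suffices to prove the result for $x > 0$. If $z \le \pi/3$ then $\cos z \ge 1/2$,
$$\sin y = \int_0^y \cos z \, dz \ge \frac{y}{2}$$
$$1 - \cos x = \int_0^x \sin y \, dy \ge \int_0^x \frac{y}{2} \, dy = \frac{x^2}{4}$$
which proves the desired result.
From Example 2.3.5, we see that the density
$$\frac{\delta - |x|}{\delta^2} \quad \text{when } |x| \le \delta, \ 0 \text{ otherwise}$$
has ch.f. $2(1 - \cos \delta t)/(\delta t)^2$. Let $\mu_n$ denote the distribution of $S_n$. Using Lemma 3.2.12 (note $\pi/3 \ge 1$) and then Lemma 3.2.11, we have
$$\begin{aligned} P(\|S_n\| < 1/\delta) &\le 4^d \int \prod_{i=1}^d \frac{1 - \cos(\delta t_i)}{(\delta t_i)^2} \, \mu_n(dt) \\ &= 2^d \int_{(-\delta,\delta)^d} \prod_{i=1}^d \frac{\delta - |x_i|}{\delta^2} \, \varphi^n(x) \, dx \end{aligned}$$
Our next step is to sum from 0 to $\infty$. To be able to interchange the sum and the integral, we first multiply by $r^n$ where $r < 1$.
$$\sum_{n=0}^\infty r^n P(\|S_n\| < 1/\delta) \le 2^d \int_{(-\delta,\delta)^d} \prod_{i=1}^d \frac{\delta - |x_i|}{\delta^2} \, \frac{1}{1 - r\varphi(x)} \, dx$$
Symmetry dictates that the integral on the right is real, so we can take the real part without affecting its value. Letting $r \uparrow 1$ and using $(\delta - |x|)/\delta \le 1$
$$\sum_{n=0}^\infty P(\|S_n\| < 1/\delta) \le \left(\frac{2}{\delta}\right)^d \sup_{r < 1} \int_{(-\delta,\delta)^d} \mathrm{Re} \, \frac{1}{1 - r\varphi(x)} \, dx$$
and using Theorem 3.2.6 gives half of Theorem 3.2.10.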
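To prove the other direction, we begin by noting that Example 2.3.8 shows that the density $(1 - \cos(x/\delta))/(\pi x^2/\delta)$ has ch.f. $1 - |\delta t|$ when $|t| \le 1/\delta$, 0 otherwise. Using $1 \ge \prod_{i=1}^d (1 - |\delta x_i|)$ and then Lemma 3.2.11,
$$\begin{aligned} P(\|S_n\| < 1/\delta) &\ge \int_{(-1/\delta,1/\delta)^d} \prod_{i=1}^d (1 - |\delta x_i|) \, \mu_n(dx) \\ &= \int \prod_{i=1}^d \frac{1 - \cos(t_i/\delta)}{\pi t_i^2/\delta} \, \varphi^n(t) \, dt \end{aligned}$$
Multiplying by $r^n$ and summing gives
$$\sum_{n=0}^\infty r^n P(\|S_n\| < 1/\delta) \ge \int \prod_{i=1}^d \frac{1 - \cos(t_i/\delta)}{\pi t_i^2/\delta} \, \frac{1}{1 - r\varphi(t)} \, dt$$
The last integral is real, so its value is unaffected if we integrate only the real part of the integrand. If we do this and apply Lemma 3.2.12, we get
$$\sum_{n=0}^\infty r^n P(\|S_n\| < 1/\delta) \ge (4\pi\delta)^{-d} \int_{(-\delta,\delta)^d} \mathrm{Re} \, \frac{1}{1 - r\varphi(t)} \, dt$$
Letting $r \uparrow 1$ and using Theorem 3.2.6 now completes the proof of Theorem 3.2.10.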
We will now consider some examples. Our goal in d = 1 and d = 2 is to convince
you that the conditions in Theorems 3.2.7 and 3.2.8 are close to the best possible.
d = 1. Consider the symmetric stable laws that have ch.f. $\varphi(t) = \exp(-|t|^\alpha)$. To avoid using facts that we have not proved, we will obtain our conclusions from Theorem 3.2.10. It is not hard to use that form of the criterion in this case since
$$1 - r\varphi(t) \downarrow 1 - \exp(-|t|^\alpha) \quad \text{as } r \uparrow 1$$
$$1 - \exp(-|t|^\alpha) \sim |t|^\alpha \quad \text{as } t \to 0$$
From this, it follows that the corresponding random walk is transient for $\alpha < 1$ and recurrent for $\alpha \ge 1$. The case $\alpha > 1$ is covered by Theorem 3.2.7 since these random walks have mean 0. The result for $\alpha = 1$ is new because the Cauchy distribution does not satisfy $S_n/n \to 0$ in probability. The random walks with $\alpha < 1$ are interesting because Theorem 3.1.2 implies (see Exercise 3.1.1)
$$-\infty = \liminf S_n < \limsup S_n = \infty$$
but $P(|S_n| < M \text{ i.o.}) = 0$ for any $M < \infty$.
Remark. The stable law examples are misleading in one respect. Shepp (1964) has proved that recurrent random walks may have arbitrarily large tails. To be precise, given a function $\epsilon(x) \downarrow 0$ as $x \uparrow \infty$, there is a recurrent random walk with $P(|X_1| \ge x) \ge \epsilon(x)$ for large $x$.
d = 2. Let $\alpha < 2$, and let $\varphi(t) = \exp(-|t|^\alpha)$ where $|t| = (t_1^2 + t_2^2)^{1/2}$. $\varphi$ is the characteristic function of a random vector $(X_1, X_2)$ that has two nice properties:
(i) the distribution of $(X_1, X_2)$ is invariant under rotations,
(ii) $X_1$ and $X_2$ have symmetric stable laws with index $\alpha$.
Again, $1 - r\varphi(t) \downarrow 1 - \exp(-|t|^\alpha)$ as $r \uparrow 1$ and $1 - \exp(-|t|^\alpha) \sim |t|^\alpha$ as $t \to 0$. Changing to polar coordinates and noticing
$$2\pi \int_0^\delta dx \, x \, x^{-\alpha} < \infty$$
when $1 - \alpha > -1$ shows the random walks with ch.f. $\exp(-|t|^\alpha)$, $\alpha < 2$ are transient. When $p < \alpha$, we have $E|X_1|^p < \infty$ by Exercise 2.7.5, so these examples show that Theorem 3.2.8 is reasonably sharp.
d ≥ 3. The integral $\int_0^\delta dx \, x^{d-1} x^{-2} < \infty$, so if a random walk is recurrent in $d \ge 3$, its ch.f. must $\to 1$ faster than $t^2$. In Exercise 2.3.19, we observed that (in one dimension) if $\varphi(r) = 1 + o(r^2)$ then $\varphi(r) \equiv 1$. By considering $\varphi(r\theta)$ where $r$ is real and $\theta$ is a fixed vector, the last conclusion generalizes easily to $\mathbf{R}^d$, $d > 1$ and suggests that once we exclude walks that stay on a plane through 0, no three-dimensional random walks are recurrent.
A random walk in $\mathbf{R}^3$ is truly three-dimensional if the distribution of $X_1$ has $P(X_1 \cdot \theta \ne 0) > 0$ for all $\theta \ne 0$.
Theorem 3.2.13. No truly three-dimensional random walk is recurrent.
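Proof. We will deduce the result from Theorem 3.2.10. We begin with some arithmetic. If $z$ is complex, the conjugate of $1 - z$ is $1 - \bar{z}$, so
$$\frac{1}{1 - z} = \frac{1 - \bar{z}}{|1 - z|^2} \quad \text{and} \quad \mathrm{Re} \, \frac{1}{1 - z} = \frac{\mathrm{Re}(1 - z)}{|1 - z|^2}$$
If $z = a + bi$ with $a \le 1$, then using the previous formula and dropping the $b^2$ from the denominator
$$\mathrm{Re} \, \frac{1}{1 - z} = \frac{1 - a}{(1 - a)^2 + b^2} \le \frac{1}{1 - a}$$
Taking $z = r\varphi(t)$ and supposing for the second inequality that $0 \le \mathrm{Re} \, \varphi(t) \le 1$, we have
$$\mathrm{Re} \, \frac{1}{1 - r\varphi(t)} \le \frac{1}{\mathrm{Re}(1 - r\varphi(t))} \le \frac{1}{\mathrm{Re}(1 - \varphi(t))} \tag{a}$$
The last calculation shows that it is enough to estimate
$$\mathrm{Re}(1 - \varphi(t)) = \int \{1 - \cos(x \cdot t)\} \, \mu(dx) \ge \int_{|x \cdot t| < \pi/3} \frac{|x \cdot t|^2}{4} \, \mu(dx)$$
by Lemma 3.2.12. Writing $t = \rho\theta$ where $\theta \in S = \{x : |x| = 1\}$ gives
$$\mathrm{Re}(1 - \varphi(\rho\theta)) \ge \frac{\rho^2}{4} \int_{|x \cdot \theta| < \pi/3\rho} |x \cdot \theta|^2 \, \mu(dx) \tag{b}$$
Fatou's lemma implies that if we let $\rho \to 0$ and $\theta(\rho) \to \theta$, then
$$\liminf_{\rho \to 0} \int_{|x \cdot \theta(\rho)| < \pi/3\rho} |x \cdot \theta(\rho)|^2 \, \mu(dx) \ge \int |x \cdot \theta|^2 \, \mu(dx) > 0 \tag{c}$$
I claim this implies that for $\rho < \rho_0$
$$\inf_{\theta \in S} \int_{|x \cdot \theta| < \pi/3\rho} |x \cdot \theta|^2 \, \mu(dx) = C > 0 \tag{d}$$
To get the last conclusion, observe that if it is false, then for $\rho = 1/n$ there is a $\theta_n$ so that
$$\int_{|x \cdot \theta_n| < n\pi/3} |x \cdot \theta_n|^2 \, \mu(dx) \le 1/n$$
All the $\theta_n$ lie in $S$, a compact set, so if we pick a convergent subsequence we contradict (c). Combining (b) and (d) gives
$$\mathrm{Re}(1 - \varphi(\rho\theta)) \ge C\rho^2/4$$
Using the last result and (a) then changing to polar coordinates, we see that if $\delta$ is small (so $\mathrm{Re} \, \varphi(y) \ge 0$ on $(-\delta, \delta)^d$)
$$\begin{aligned} \int_{(-\delta,\delta)^d} \mathrm{Re} \, \frac{1}{1 - r\varphi(y)} \, dy &\le \int_0^{\delta\sqrt{d}} d\rho \, \rho^{d-1} \int d\theta \, \frac{1}{\mathrm{Re}(1 - \varphi(\rho\theta))} \\ &\le C \int_0^1 d\rho \, \rho^{d-3} < \infty \end{aligned}$$
when $d > 2$, so the desired result follows from Theorem 3.2.10.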
Remark. The analysis becomes much simpler when we consider random walks on $\mathbf{Z}^d$. The inversion formula given in Exercise 2.3.2 implies
$$P(S_n = 0) = (2\pi)^{-d}\int_{(-\pi,\pi)^d} \varphi^n(t)\,dt$$
Multiplying by $r^n$ and summing gives
$$\sum_{n=0}^\infty r^n P(S_n = 0) = (2\pi)^{-d}\int_{(-\pi,\pi)^d} \frac{1}{1-r\varphi(t)}\,dt$$
In the case of simple random walk in d = 3, $\varphi(t) = \frac13\sum_{j=1}^3 \cos t_j$ is real.
$$\frac{1}{1-r\varphi(t)} \uparrow \frac{1}{1-\varphi(t)} \quad\text{when } \varphi(t) > 0$$
$$0 \le \frac{1}{1-r\varphi(t)} \le 1 \quad\text{when } \varphi(t) \le 0$$
So, using the monotone and bounded convergence theorems
$$\sum_{n=0}^\infty P(S_n = 0) = (2\pi)^{-3}\int_{(-\pi,\pi)^3}\left(1 - \frac13\sum_{i=1}^3 \cos x_i\right)^{-1} dx$$
This integral was first evaluated by Watson in 1939 in terms of elliptic integrals, which could be found in tables. Glasser and Zucker (1977) showed that it was
$$(\sqrt6/32\pi^3)\,\Gamma(1/24)\Gamma(5/24)\Gamma(7/24)\Gamma(11/24) = 1.516386059137\ldots$$
so it follows from (3.2.1) that
$$\pi_3 = 0.340537329544\ldots$$
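The two constants quoted above can be checked numerically. The sketch below evaluates the Glasser and Zucker closed form with the standard library gamma function and then converts it to a return probability, assuming that (3.2.1) is the relation "return probability = 1 − 1/(expected number of visits to 0)". The function name is hypothetical.

```python
import math

def watson_integral():
    """Glasser-Zucker closed form for the expected number of visits to 0 in d = 3."""
    gammas = (math.gamma(1 / 24) * math.gamma(5 / 24)
              * math.gamma(7 / 24) * math.gamma(11 / 24))
    return math.sqrt(6) / (32 * math.pi ** 3) * gammas

expected_visits = watson_integral()
# Assumption: (3.2.1) relates the return probability to the expected number of visits.
return_probability = 1 - 1 / expected_visits
print(expected_visits, return_probability)
```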
For numerical results in 4 ≤ d ≤ 9, see Kondo and Hara (1987).
3.3 Visits to 0, Arcsine Laws*
In the last section, we took a broad look at the recurrence of random walks. In this
section, we will take a deep look at one example: simple random walk (on Z). To steal
a line from Chung, “We shall treat this by combinatorial methods as an antidote to
the analytic skulduggery above.” The developments here follow Chapter III of Feller,
vol. I. To facilitate discussion, we will think of the sequence S1 , S2 , . . . , Sn as being
represented by a polygonal line with segments (k − 1, Sk−1 ) → (k, Sk ). A path is
a polygonal line that is a possible outcome of simple random walk. To count the
number of paths from (0,0) to (n, x), it is convenient to introduce a and b deﬁned by:
a = (n + x)/2 is the number of positive steps in the path and b = (n − x)/2 is the
number of negative steps. Notice that n = a + b and x = a − b. If −n ≤ x ≤ n and
n − x is even, the a and b deﬁned above are nonnegative integers, and the number of
paths from (0,0) to (n, x) is
$$(\ast)\qquad N_{n,x} = \binom{n}{a}$$
Otherwise, the number of paths is 0.
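As a sanity check on (∗), the following sketch compares the binomial formula with a brute-force enumeration of all ±1 step sequences. The function names are illustrative and not part of the text.

```python
from itertools import product
from math import comb

def count_paths_brute(n, x):
    """Count +-1 step sequences of length n that end at height x."""
    return sum(1 for steps in product((1, -1), repeat=n) if sum(steps) == x)

def count_paths_formula(n, x):
    """N_{n,x} = C(n, a) with a = (n + x)/2, and 0 when a is not a valid integer."""
    if abs(x) > n or (n - x) % 2 != 0:
        return 0
    return comb(n, (n + x) // 2)
```

The parity test mirrors the condition "n − x is even" in the text; enumeration is only feasible for small n since there are 2^n step sequences.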
Figure 3.1: Reflection Principle
Theorem 3.3.1. Reﬂection principle. If x, y > 0 then the number of paths from
(0, x) to (n, y ) that are 0 at some time is equal to the number of paths from (0, −x)
to (n, y ).
Proof. Suppose $(0, s_0), (1, s_1), \ldots, (n, s_n)$ is a path from (0, x) to (n, y) that is 0 at some time. Let $K = \inf\{k : s_k = 0\}$. Let $s'_k = -s_k$ for $k \le K$, $s'_k = s_k$ for $K \le k \le n$. Then $(k, s'_k)$, $0 \le k \le n$, is a path from (0, −x) to (n, y). Conversely, if $(0, t_0), (1, t_1), \ldots, (n, t_n)$ is a path from (0, −x) to (n, y) then it must cross 0. Let $K = \inf\{k : t_k = 0\}$. Let $t'_k = -t_k$ for $k \le K$, $t'_k = t_k$ for $K \le k \le n$. Then $(k, t'_k)$, $0 \le k \le n$, is a path from (0, x) to (n, y) that is 0 at time K. The last two observations set up a one-to-one correspondence between the two classes of paths, so their numbers must be equal.
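The reflection map in the proof can be written out directly. The sketch below (hypothetical helper names, brute-force enumeration, small n only) reflects each path up to its first zero and checks that the image of the paths from (0, x) to (n, y) that touch 0 is exactly the set of paths from (0, −x) to (n, y).

```python
from itertools import product

def paths(n, start, end):
    """All paths (s_0, ..., s_n) with +-1 steps from (0, start) to (n, end)."""
    result = []
    for steps in product((1, -1), repeat=n):
        path = [start]
        for step in steps:
            path.append(path[-1] + step)
        if path[-1] == end:
            result.append(tuple(path))
    return result

def reflect_until_first_zero(path):
    """Flip the sign of the path up to its first visit to 0 (time K in the proof)."""
    k = path.index(0)
    return tuple(-s for s in path[:k]) + path[k:]

def check_reflection(n, x, y):
    """True if reflection maps the touching paths onto the paths from (0, -x)."""
    touching = {p for p in paths(n, x, y) if 0 in p}
    from_below = set(paths(n, -x, y))
    return {reflect_until_first_zero(p) for p in touching} == from_below
```

Because reflecting twice returns the original path, the map is injective, so equality of the two sets gives equality of the two counts.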
From Theorem 3.3.1 we get a result ﬁrst proved in 1878.
Theorem 3.3.2. The Ballot Theorem. Suppose that in an election candidate
A gets α votes, and candidate B gets β votes where β < α. The probability that
throughout the counting A always leads B is (α − β )/(α + β ).
Proof. Let $x = \alpha - \beta$, $n = \alpha + \beta$. Clearly, there are as many such outcomes as there are paths from (1,1) to (n, x) that are never 0. The reflection principle implies that the number of paths from (1,1) to (n, x) that are 0 at some time = the number of paths from (1,−1) to (n, x), so by (∗) the number of paths from (1,1) to (n, x) that are never 0 is
$$\begin{aligned} N_{n-1,x-1} - N_{n-1,x+1} &= \binom{n-1}{\alpha-1} - \binom{n-1}{\alpha} \\ &= \frac{(n-1)!}{(\alpha-1)!\,(n-\alpha)!} - \frac{(n-1)!}{\alpha!\,(n-\alpha-1)!} \\ &= \frac{\alpha-(n-\alpha)}{n}\cdot\frac{n!}{\alpha!\,(n-\alpha)!} = \frac{\alpha-\beta}{\alpha+\beta}\,N_{n,x} \end{aligned}$$
since $n = \alpha + \beta$, this proves the desired result.
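A brute-force check of the ballot theorem for small vote counts, assuming all orderings of the α votes for A and β votes for B are equally likely. The function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations

def ballot_probability(alpha, beta):
    """Fraction of counting orders in which A is strictly ahead after every vote."""
    n = alpha + beta
    favourable = total = 0
    for a_positions in combinations(range(n), alpha):
        a_set = set(a_positions)
        lead, always_ahead = 0, True
        for i in range(n):
            lead += 1 if i in a_set else -1
            if lead <= 0:
                always_ahead = False
                break
        total += 1
        favourable += always_ahead
    return Fraction(favourable, total)
```

Exact rational arithmetic is used so the comparison with (α − β)/(α + β) involves no rounding.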
Using the ballot theorem, we can compute the distribution of the time to hit 0 for
simple random walk.
Lemma 3.3.3. $P(S_1 \neq 0, \ldots, S_{2n} \neq 0) = P(S_{2n} = 0)$.
Proof. $P(S_1 > 0, \ldots, S_{2n} > 0) = \sum_{r=1}^\infty P(S_1 > 0, \ldots, S_{2n-1} > 0, S_{2n} = 2r)$. From the proof of Theorem 3.3.2, we see that the number of paths from (0,0) to (2n, 2r) that are never 0 at positive times (= the number of paths from (1,1) to (2n, 2r) that are never 0) is
$$N_{2n-1,2r-1} - N_{2n-1,2r+1}$$
If we let $p_{n,x} = P(S_n = x)$ then this implies
$$P(S_1 > 0, \ldots, S_{2n-1} > 0, S_{2n} = 2r) = \frac12\,(p_{2n-1,2r-1} - p_{2n-1,2r+1})$$
Summing from r = 1 to ∞ gives
$$P(S_1 > 0, \ldots, S_{2n} > 0) = \frac12\,p_{2n-1,1} = \frac12\,P(S_{2n} = 0)$$
Symmetry implies $P(S_1 < 0, \ldots, S_{2n} < 0) = (1/2)P(S_{2n} = 0)$, and the proof is complete.
Let $R = \inf\{m \ge 1 : S_m = 0\}$. Combining Lemma 3.3.3 with Theorem 2.1.2 gives
$$P(R > 2n) = P(S_{2n} = 0) \sim \pi^{-1/2} n^{-1/2} \tag{3.3.1}$$
Since $P(R > x)/P(|R| > x) = 1$, it follows from Theorem 2.7.4 that R is in the domain of attraction of the stable law with α = 1/2 and κ = 1. This implies that if $R_n$ is the time of the nth return to 0 then $R_n/n^2 \Rightarrow Y$, the indicated stable law. In Example 2.7.2, we considered $\tau = T_1$ where $T_x = \inf\{n : S_n = x\}$. Since $S_1 \in \{-1, 1\}$ and $T_1 =_d T_{-1}$, $R =_d 1 + T_1$, and it follows that $T_n/n^2 \Rightarrow Y$, the same stable law.
In Example 7.6.6, we will use this observation to show that the limit has the same
distribution as the hitting time of 1 for Brownian motion, which has a density given
in (7.4.8).
This completes our discussion of visits to 0. We turn now to the arcsine laws. The
ﬁrst one concerns
L2n = sup{m ≤ 2n : Sm = 0}
It is remarkably easy to compute the distribution of L2n .
Lemma 3.3.4. Let $u_{2m} = P(S_{2m} = 0)$. Then $P(L_{2n} = 2k) = u_{2k}u_{2n-2k}$.
Proof. $P(L_{2n} = 2k) = P(S_{2k} = 0, S_{2k+1} \neq 0, \ldots, S_{2n} \neq 0)$, so the desired result follows from Lemma 3.3.3.
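Lemma 3.3.4 can be confirmed exactly for small n by enumerating all $2^{2n}$ paths. This is a minimal sketch with hypothetical function names; it is not how the lemma is proved.

```python
from fractions import Fraction
from itertools import product
from math import comb

def u(two_m):
    """u_{2m} = P(S_{2m} = 0) = 2^{-2m} C(2m, m) for simple random walk."""
    m = two_m // 2
    return Fraction(comb(2 * m, m), 4 ** m)

def last_zero_distribution(n):
    """Exact law of L_{2n} = sup{m <= 2n : S_m = 0}, by enumerating all paths."""
    counts = {2 * k: 0 for k in range(n + 1)}
    for steps in product((1, -1), repeat=2 * n):
        position, last_zero = 0, 0
        for time, step in enumerate(steps, start=1):
            position += step
            if position == 0:
                last_zero = time
        counts[last_zero] += 1
    return {m: Fraction(c, 4 ** n) for m, c in counts.items()}
```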
Theorem 3.3.5. Arcsine law for the last visit to 0. For 0 < a < b < 1,
$$P(a \le L_{2n}/2n \le b) \to \int_a^b \pi^{-1}(x(1-x))^{-1/2}\,dx$$
To see the reason for the name, substitute $y = x^{1/2}$, $dy = (1/2)x^{-1/2}\,dx$ in the integral to obtain
$$\int_{\sqrt a}^{\sqrt b} \frac{2}{\pi}(1-y^2)^{-1/2}\,dy = \frac{2}{\pi}\{\arcsin(\sqrt b) - \arcsin(\sqrt a)\}$$
Since L2n is the time of the last zero before 2n, it is surprising that the answer is
symmetric about 1/2. The symmetry of the limit distribution implies
P (L2n /2n ≤ 1/2) → 1/2
In gambling terms, if two people were to bet $1 on a coin ﬂip every day of the year,
then with probability 1/2, one of the players will be ahead from July 1 to the end of
the year, an event that would undoubtedly cause the other player to complain about
his bad luck.
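To see the arcsine limit emerge numerically, one can sum the exact probabilities $u_{2k}u_{2n-2k}$ from Lemma 3.3.4 and compare with the arcsine formula. The sketch below uses the recursion $u_{2m} = u_{2m-2}(2m-1)/(2m)$, which follows from $u_{2m} = 2^{-2m}\binom{2m}{m}$; the function names are illustrative.

```python
from math import asin, pi, sqrt

def return_probabilities(n):
    """[u_0, u_2, ..., u_{2n}] via u_{2m} = u_{2m-2} (2m - 1) / (2m)."""
    u = [1.0]
    for m in range(1, n + 1):
        u.append(u[-1] * (2 * m - 1) / (2 * m))
    return u

def last_visit_probability(n, a, b):
    """Exact P(a <= L_{2n}/2n <= b), using P(L_{2n} = 2k) = u_{2k} u_{2n-2k}."""
    u = return_probabilities(n)
    return sum(u[k] * u[n - k] for k in range(n + 1) if a <= k / n <= b)

def arcsine_limit(a, b):
    """(2/pi) {arcsin(sqrt(b)) - arcsin(sqrt(a))}."""
    return (2 / pi) * (asin(sqrt(b)) - asin(sqrt(a)))
```

For example, `last_visit_probability(2000, 0.2, 0.7)` and `arcsine_limit(0.2, 0.7)` should be close, and `last_visit_probability(2000, 0.0, 0.5)` should be near 1/2.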
Proof of Theorem 3.3.5. From the asymptotic formula for $u_{2n}$, it follows that if $k/n \to x$ then
$$nP(L_{2n} = 2k) \to \pi^{-1}(x(1-x))^{-1/2}$$
To get from this to the desired result, we let $2na_n$ = the smallest even integer ≥ 2na, let $2nb_n$ = the largest even integer ≤ 2nb, and let $f_n(x) = nP(L_{2n} = 2k)$ for $2k/2n \le x < 2(k+1)/2n$ so we can write
$$P(a \le L_{2n}/2n \le b) = \sum_{k=na_n}^{nb_n} P(L_{2n} = 2k) = \int_{a_n}^{b_n+1/n} f_n(x)\,dx$$
Our first result implies that uniformly on compact sets
$$f_n(x) \to f(x) = \pi^{-1}(x(1-x))^{-1/2}$$
The uniformity of the convergence implies
$$\sup_{a_n \le x \le b_n+1/n} f_n(x) \to \sup_{a \le x \le b} f(x) < \infty$$
if 0 < a ≤ b < 1, so the bounded convergence theorem gives
$$\int_{a_n}^{b_n+1/n} f_n(x)\,dx \to \int_a^b f(x)\,dx$$
The next result deals directly with the amount of time one player is ahead.
Theorem 3.3.6. Arcsine law for time above 0. Let $\pi_{2n}$ be the number of segments $(k-1, S_{k-1}) \to (k, S_k)$ that lie above the axis (i.e., in $\{(x, y) : y \ge 0\}$), and let $u_m = P(S_m = 0)$. Then
$$P(\pi_{2n} = 2k) = u_{2k}u_{2n-2k}$$
and consequently, if 0 < a < b < 1
$$P(a \le \pi_{2n}/2n \le b) \to \int_a^b \pi^{-1}(x(1-x))^{-1/2}\,dx$$
Remark. Since $\pi_{2n} =_d L_{2n}$, the second conclusion follows from the proof of Theorem 3.3.5. The reader should note that the limiting density $\pi^{-1}(x(1-x))^{-1/2}$ has a minimum at x = 1/2, and → ∞ as x → 0 or 1. An equal division of steps between the positive and negative side is therefore the least likely possibility, and completely one-sided divisions have the highest probability.
Proof. Let β2k,2n denote the probability of interest. We will prove β2k,2n = u2k u2n−2k
by induction. When n = 1, it is clear that
β0,2 = β2,2 = 1/2 = u0 u2
For a general n, first suppose k = n. From the proof of Lemma 3.3.3, we have
$$\begin{aligned} \tfrac12 u_{2n} &= P(S_1 > 0, \ldots, S_{2n} > 0) \\ &= P(S_1 = 1, S_2 - S_1 \ge 0, \ldots, S_{2n} - S_1 \ge 0) \\ &= \tfrac12 P(S_1 \ge 0, \ldots, S_{2n-1} \ge 0) \\ &= \tfrac12 P(S_1 \ge 0, \ldots, S_{2n} \ge 0) = \tfrac12 \beta_{2n,2n} \end{aligned}$$
The next to last equality follows from the observation that if S2n−1 ≥ 0 then S2n−1 ≥
1, and hence S2n ≥ 0.
The last computation proves the result for k = n. Since β0,2n = β2n,2n , the result
is also true when k = 0. Suppose now that 1 ≤ k ≤ n − 1. In this case, if R is the time
of the ﬁrst return to 0, then R = 2m with 0 < m < n. Letting f2m = P (R = 2m)
and breaking things up according to whether the ﬁrst excursion was on the positive
or negative side gives
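$$\beta_{2k,2n} = \frac12\sum_{m=1}^{k} f_{2m}\,\beta_{2k-2m,2n-2m} + \frac12\sum_{m=1}^{n-k} f_{2m}\,\beta_{2k,2n-2m}$$
Using the induction hypothesis, it follows that
$$\beta_{2k,2n} = \frac12\,u_{2n-2k}\sum_{m=1}^{k} f_{2m}\,u_{2k-2m} + \frac12\,u_{2k}\sum_{m=1}^{n-k} f_{2m}\,u_{2n-2k-2m}$$
By considering the time of the first return to 0, we see
$$u_{2k} = \sum_{m=1}^{k} f_{2m}\,u_{2k-2m} \qquad u_{2n-2k} = \sum_{m=1}^{n-k} f_{2m}\,u_{2n-2k-2m}$$
and the desired result follows.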
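The identity $P(\pi_{2n} = 2k) = u_{2k}u_{2n-2k}$ can also be checked by enumeration for small n. In this sketch (hypothetical names) a segment is counted as lying above the axis when both endpoints are ≥ 0, which for ±1 steps is the same as saying at least one endpoint is strictly positive.

```python
from fractions import Fraction
from itertools import product
from math import comb

def u(two_m):
    """u_{2m} = 2^{-2m} C(2m, m)."""
    m = two_m // 2
    return Fraction(comb(2 * m, m), 4 ** m)

def time_above_distribution(n):
    """Exact law of pi_{2n}, the number of segments lying in {y >= 0}."""
    counts = {2 * k: 0 for k in range(n + 1)}
    for steps in product((1, -1), repeat=2 * n):
        position, above = 0, 0
        for step in steps:
            new_position = position + step
            if position > 0 or new_position > 0:  # both endpoints are >= 0
                above += 1
            position = new_position
        counts[above] += 1
    return {m: Fraction(c, 4 ** n) for m, c in counts.items()}
```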
Our derivation of Theorem 3.3.6 relied heavily on special properties of simple random walk. There is a closely related result due to E. Sparre-Andersen that is valid for very general random walks. However, notice that the hypothesis (ii) in the next result excludes simple random walk.
Theorem 3.3.7. Let $\nu_n = |\{k : 1 \le k \le n, S_k > 0\}|$. Then
(i) $P(\nu_n = k) = P(\nu_k = k)P(\nu_{n-k} = 0)$
(ii) If the distribution of $X_1$ is symmetric and $P(S_m = 0) = 0$ for all m ≥ 1, then
$$P(\nu_n = k) = u_{2k}u_{2n-2k}$$
where $u_{2m} = 2^{-2m}\binom{2m}{m}$ is the probability simple random walk is 0 at time 2m.
(iii) Under the hypotheses of (ii),
$$P(a \le \nu_n/n \le b) \to \int_a^b \pi^{-1}(x(1-x))^{-1/2}\,dx \quad\text{for } 0 < a < b < 1$$
Proof. Taking things in reverse order, (iii) is an immediate consequence of (ii) and the
proof of Theorem 3.3.5. Our next step is to show (ii) follows from (i) by induction.
When n = 1, our assumptions imply $P(\nu_1 = 0) = 1/2 = u_0u_2$. If n > 1 and 1 ≤ k < n, then (i) and the induction hypothesis imply
$$P(\nu_n = k) = u_{2k}u_0 \cdot u_0u_{2n-2k} = u_{2k}u_{2n-2k}$$
since $u_0 = 1$. To handle the cases k = 0 and k = n, we note that Lemma 3.3.4 implies
$$\sum_{k=0}^n u_{2k}u_{2n-2k} = 1$$
We have $\sum_{k=0}^n P(\nu_n = k) = 1$ and our assumptions imply $P(\nu_n = 0) = P(\nu_n = n)$, so these probabilities must be equal to $u_0u_{2n}$.
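Part (ii) says the law of $\nu_n$ does not depend on the step distribution, as long as it is symmetric and continuous. A quick Monte Carlo illustration with standard normal steps (an arbitrary choice made here, not one from the text) is sketched below; the function names and the seed are hypothetical.

```python
import random
from math import comb

def u(two_m):
    """u_{2m} = 2^{-2m} C(2m, m)."""
    m = two_m // 2
    return comb(2 * m, m) / 4 ** m

def simulate_positive_count(n, trials, seed=0):
    """Estimate P(nu_n = k), k = 0..n, for a walk with N(0, 1) increments."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        s, positive = 0.0, 0
        for _ in range(n):
            s += rng.gauss(0.0, 1.0)
            positive += s > 0
        counts[positive] += 1
    return [c / trials for c in counts]
```

With n = 4 the estimates should be close to $u_{2k}u_{8-2k}$, which is largest at k = 0 and k = 4 and smallest at k = 2.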
The proof of (i) is tricky and requires careful definitions since we are not supposing $X_1$ is symmetric or that $P(S_m = 0) = 0$. Let $\nu'_n = |\{k : 1 \le k \le n, S_k \le 0\}| = n - \nu_n$.
$$M_n = \max_{0 \le j \le n} S_j \qquad \ell_n = \min\{j : 0 \le j \le n, S_j = M_n\}$$
$$M'_n = \min_{0 \le j \le n} S_j \qquad \ell'_n = \max\{j : 0 \le j \le n, S_j = M'_n\}$$
The first symmetry is straightforward.
Lemma 3.3.8. $(\ell_n, S_n)$ and $(n - \ell'_n, S_n)$ have the same distribution.
Proof. If we let $T_k = S_n - S_{n-k} = X_n + \cdots + X_{n-k+1}$, then $T_k$, $0 \le k \le n$ has the same distribution as $S_k$, $0 \le k \le n$. Clearly,
$$\max_{0 \le k \le n} T_k = S_n - \min_{0 \le k \le n} S_{n-k}$$
and the set of k for which the extrema are attained are the same.
The second symmetry is much less obvious.
Lemma 3.3.9. $(\ell_n, S_n)$ and $(\nu_n, S_n)$ have the same distribution.
$(\ell'_n, S_n)$ and $(\nu'_n, S_n)$ have the same distribution.
Remark. (i) follows from Lemma 3.3.9 and the trivial observation
$$P(\ell_n = k) = P(\ell_k = k)P(\ell_{n-k} = 0)$$
so once Lemma 3.3.9 is established, the proof of Theorem 3.3.7 will be complete.
Proof. When n = 1, $\{\ell_1 = 0\} = \{S_1 \le 0\} = \{\nu_1 = 0\}$, and $\{\ell'_1 = 0\} = \{S_1 > 0\} = \{\nu'_1 = 0\}$. We shall prove the general case by induction, supposing that both statements have been proved when n is replaced by n − 1. Let
$$G(y) = P(\ell_{n-1} = k, S_{n-1} \le y) \qquad H(y) = P(\nu_{n-1} = k, S_{n-1} \le y)$$
On $\{S_n \le 0\}$, we have $\ell_{n-1} = \ell_n$, and $\nu_{n-1} = \nu_n$ so if $F(y) = P(X_1 \le y)$ then for $x \le 0$
$$P(\ell_n = k, S_n \le x) = \int F(x-y)\,dG(y) = \int F(x-y)\,dH(y) = P(\nu_n = k, S_n \le x) \tag{3.3.2}$$
On $\{S_n > 0\}$, we have $\ell'_{n-1} = \ell'_n$, and $\nu'_{n-1} = \nu'_n$, so repeating the last computation shows that for $x \ge 0$
$$P(\ell'_n = n-k, S_n > x) = P(\nu'_n = n-k, S_n > x)$$
Since $(\ell_n, S_n)$ has the same distribution as $(n - \ell'_n, S_n)$ and $\nu'_n = n - \nu_n$, it follows that for $x \ge 0$
$$P(\ell_n = k, S_n > x) = P(\nu_n = k, S_n > x)$$
Setting x = 0 in the last result and (3.3.2) and adding gives
$$P(\ell_n = k) = P(\nu_n = k)$$
Subtracting the last two equations and combining the result with (3.3.2) gives
$$P(\ell_n = k, S_n \le x) = P(\nu_n = k, S_n \le x)$$
for all x. Since $(\ell_n, S_n)$ has the same distribution as $(n - \ell'_n, S_n)$ and $\nu'_n = n - \nu_n$, it follows that
$$P(\ell'_n = n-k, S_n > x) = P(\nu'_n = n-k, S_n > x)$$
for all x. This completes the proof of Lemma 3.3.9 and hence of Theorem 3.3.7.
3.4 Renewal Theory*
Let $\xi_1, \xi_2, \ldots$ be i.i.d. positive random variables with distribution F and define a
sequence of times by T0 = 0, and Tk = Tk−1 + ξk for k ≥ 1. As explained in Section
1.7, we think of ξi as the lifetime of the ith light bulb, and Tk is the time the k th
bulb burns out. A second interpretation from Section 2.6 is that Tk is the time of
arrival of the k th customer. To have a neutral terminology, we will refer to the Tk as
renewals. The term renewal refers to the fact that the process “starts afresh” at Tk ,
i.e., $\{T_{k+j} - T_k, j \ge 1\}$ has the same distribution as $\{T_j, j \ge 1\}$.
Figure 3.2: Renewal sequence.
Departing slightly from the notation in Sections 1.7 and 2.6, we let Nt = inf {k :
Tk > t}. Nt is the number of renewals in [0, t], counting the renewal at time 0. In
Theorem 1.7.6, we showed that
Theorem 3.4.1. As t → ∞, Nt /t → 1/µ a.s. where µ = Eξi ∈ (0, ∞] and 1/∞ = 0.
Our ﬁrst result concerns the asymptotic behavior of U (t) = ENt .
Theorem 3.4.2. As t → ∞, U (t)/t → 1/µ.
Proof. We will apply Wald's equation to the stopping time $N_t$. The first step is to show that $P(\xi_i > 0) > 0$ implies $EN_t < \infty$. To do this, pick δ > 0 so that $P(\xi_i > \delta) = \epsilon > 0$ and pick K so that Kδ ≥ t. Since K consecutive $\xi_i$'s that are > δ will make $T_n > t$, we have
$$P(N_t > mK) \le (1 - \epsilon^K)^m$$
and $EN_t < \infty$. If µ < ∞, applying Wald's equation now gives
$$\mu EN_t = ET_{N_t} \ge t$$
so U(t) ≥ t/µ. The last inequality is trivial when µ = ∞ so it holds in general.
Turning to the upper bound, we observe that if $P(\xi_i \le c) = 1$, then repeating the last argument shows $\mu EN_t = ET_{N_t} \le t + c$, and the result holds for bounded distributions. If we let $\bar\xi_i = \xi_i \wedge c$ and define $\bar T_n$ and $\bar N_t$ in the obvious way then
$$EN_t \le E\bar N_t \le (t + c)/E(\bar\xi_i)$$
Letting t → ∞ and then c → ∞ gives $\limsup_{t\to\infty} EN_t/t \le 1/\mu$, and the proof is complete.
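A small Monte Carlo experiment illustrates Theorem 3.4.2. The sketch below estimates $U(t) = EN_t$ with $N_t = \inf\{k : T_k > t\}$ for a user-supplied lifetime sampler; the function name, the seed, and the choice of uniform(0, 2) lifetimes in the usage note are assumptions made for illustration.

```python
import random

def estimate_renewal_function(t, draw_lifetime, trials, seed=0):
    """Monte Carlo estimate of U(t) = E N_t, where N_t = inf{k : T_k > t}."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        time, k = 0.0, 0
        while time <= t:           # keep adding lifetimes until T_k > t
            time += draw_lifetime(rng)
            k += 1
        total += k
    return total / trials
```

For lifetimes uniform on (0, 2), so that µ = 1, the ratio `estimate_renewal_function(50, lambda rng: rng.uniform(0, 2), 2000) / 50` should be close to 1/µ = 1.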
Exercise 3.4.1. Show that t/E (ξi ∧ t) ≤ U (t) ≤ 2t/E (ξi ∧ t).
Exercise 3.4.2. Deduce Theorem 3.4.2 from Theorem 3.4.1 by showing
$$\limsup_{t\to\infty} E(N_t/t)^2 < \infty.$$
Hint: Use a comparison like the one in the proof of Theorem 3.4.2.
Exercise 3.4.3. Customers arrive at times of a Poisson process with rate 1. If the
server is occupied, they leave. (Think of a public telephone or prostitute.) If not,
they enter service and require a service time with a distribution F that has mean µ.
Show that the times at which customers enter service are a renewal process with mean
µ + 1, and use Theorem 3.4.1 to conclude that the asymptotic fraction of customers
served is 1/(µ + 1).
To take a closer look at when the renewals occur, we let
$$U(A) = \sum_{n=0}^\infty P(T_n \in A)$$
U is called the renewal measure. We absorb the old definition, $U(t) = EN_t$, into
the new one by regarding U (t) as shorthand for U ([0, t]). This should not cause problems since U (t) is the distribution function for the renewal measure. The asymptotic
behavior of U (t) depends upon whether the distribution F is arithmetic, i.e., concentrated on {δ, 2δ, 3δ, . . .} for some δ > 0, or nonarithmetic, i.e., not arithmetic.
We will treat the ﬁrst case in Chapter 5 as an application of Markov chains, so we
will restrict our attention to the second case here.
Theorem 3.4.3. Blackwell’s renewal theorem. If F is nonarithmetic then
U ([t, t + h]) → h/µ
as t → ∞.
We will prove the result in the case µ < ∞ by “coupling” following Lindvall (1977)
and Athreya, McDonald, and Ney (1978). To set the stage for the proof, we need a
deﬁnition and some preliminary computations. If T0 ≥ 0 is independent of ξ1 , ξ2 , . . .
and has distribution G, then Tk = Tk−1 + ξk , k ≥ 1 deﬁnes a delayed renewal
process, and G is the delay distribution. If we let Nt = inf {k : Tk > t} as before
and set V (t) = ENt , then breaking things down according to the value of T0 gives
$$V(t) = \int_0^t U(t-s)\,dG(s) \tag{3.4.1}$$
The last integral, and all similar expressions below, is intended to include the contribution of any mass G has at 0. If we let U(r) = 0 for r < 0, then the last equation can be written as V = U ∗ G, where ∗ denotes convolution.
Applying similar reasoning to U gives
$$U(t) = 1 + \int_0^t U(t-s)\,dF(s) \tag{3.4.2}$$
or, introducing convolution notation,
$$U = 1_{[0,\infty)}(t) + U * F.$$
Convolving each side with G (and recalling G ∗ U = U ∗ G) gives
$$V = G * U = G + V * F \tag{3.4.3}$$
We know U(t) ∼ t/µ. Our next step is to find a G so that V(t) = t/µ. Plugging what we want into (3.4.3) gives
$$t/\mu = G(t) + \int_0^t \frac{t-y}{\mu}\,dF(y) \quad\text{so}\quad G(t) = t/\mu - \int_0^t \frac{t-y}{\mu}\,dF(y)$$
The integration-by-parts formula is
$$\int_0^t K(y)\,dH(y) = H(t)K(t) - H(0)K(0) - \int_0^t H(y)\,dK(y)$$
If we let $H(y) = (y-t)/\mu$ and $K(y) = 1 - F(y)$, then
$$\frac{1}{\mu}\int_0^t 1 - F(y)\,dy = \frac{t}{\mu} - \int_0^t \frac{t-y}{\mu}\,dF(y)$$
so we have
$$G(t) = \frac{1}{\mu}\int_0^t 1 - F(y)\,dy \tag{3.4.4}$$
It is comforting to note that $\mu = \int_{[0,\infty)} 1 - F(y)\,dy$, so the last formula defines a
we call the result the stationary renewal process. Something very special happens
when F (t) = 1 − exp(−λt), t ≥ 0 where λ > 0 (i.e., the renewal process is a rate λ
Poisson process). In this case, µ = 1/λ so G(t) = F (t).
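Formula (3.4.4) is easy to evaluate numerically. The sketch below approximates the integral with the midpoint rule (a choice made here for simplicity) and can be used to confirm that G = F in the exponential case; the function name and the `steps` parameter are hypothetical.

```python
from math import exp

def stationary_delay_cdf(F, mu, t, steps=10000):
    """G(t) = (1/mu) * integral_0^t (1 - F(y)) dy, by the midpoint rule."""
    if t <= 0:
        return 0.0
    width = t / steps
    integral = sum(1.0 - F((i + 0.5) * width) for i in range(steps)) * width
    return integral / mu
```

For lifetimes uniform on (0, 1), where µ = 1/2 and F(y) = y on [0, 1], the same function gives G(t) = 2t − t² for 0 ≤ t ≤ 1, which differs from F.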
Proof of Theorem 3.4.3 for µ < ∞. Let $T_n$ be a renewal process (with $T_0 = 0$) and $T'_n$ be an independent stationary renewal process. Our first goal is to find J and K so that $|T_J - T'_K| < \epsilon$ and the increments $\{T_{J+i} - T_J, i \ge 1\}$ and $\{T'_{K+i} - T'_K, i \ge 1\}$ are i.i.d. sequences independent of what has come before.
Let $\eta_1, \eta_2, \ldots$ and $\eta'_1, \eta'_2, \ldots$ be i.i.d. independent of $T_n$ and $T'_n$ and take the values 0 and 1 with probability 1/2 each. Let $\nu_n = \eta_1 + \cdots + \eta_n$ and $\nu'_n = 1 + \eta'_1 + \cdots + \eta'_n$, $S_n = T_{\nu_n}$ and $S'_n = T'_{\nu'_n}$. The increments of $S_n - S'_n$ are 0 with probability at least 1/4, and the support of their distribution is symmetric and contains the support of the $\xi_k$ so if the distribution of the $\xi_k$ is nonarithmetic the random walk $S_n - S'_n$ is irreducible. Since the increments of $S_n - S'_n$ have mean 0, $N = \inf\{n : |S_n - S'_n| < \epsilon\}$ has P(N < ∞) = 1, and we can let $J = \nu_N$ and $K = \nu'_N$. Let
$$T''_n = \begin{cases} T_n & \text{if } J \ge n \\ T_J + T'_{K+(n-J)} - T'_K & \text{if } J < n \end{cases}$$
In other words, the increments $T''_{J+i} - T''_J$ are the same as $T'_{K+i} - T'_K$ for i ≥ 1.
Figure 3.3: Coupling of renewal processes.
It is easy to see from the construction that $T_n$ and $T''_n$ have the same distribution. If we let
$$N'[s,t] = |\{n : T'_n \in [s,t]\}| \quad\text{and}\quad N''[s,t] = |\{n : T''_n \in [s,t]\}|$$
be the number of renewals in [s, t] in the two processes, then on $\{T_J \le t\}$
$$N''[t, t+h] = N'[t + T'_K - T_J,\; t + h + T'_K - T_J] \;\begin{cases} \ge N'[t+\epsilon, t+h-\epsilon] \\ \le N'[t-\epsilon, t+h+\epsilon] \end{cases}$$
To relate the expected number of renewals in the two processes, we observe that even if we condition on the location of all the renewals in [0, s], the expected number of renewals in [s, s + t] is at most U(t), since the worst thing that could happen is to have a renewal at time s. Combining the last two observations, we see that if ε < h/2 (so [t + ε, t + h − ε] has positive length)
$$U([t, t+h]) = EN''[t, t+h] \ge E(N'[t+\epsilon, t+h-\epsilon];\, T_J \le t) \ge \frac{h-2\epsilon}{\mu} - P(T_J > t)\,U(h)$$
since $EN'[t+\epsilon, t+h-\epsilon] = (h-2\epsilon)/\mu$ and $\{T_J > t\}$ is determined by the renewals of T in [0, t] and the renewals of T′ in [0, t + ε]. For the other direction, we observe
$$U([t, t+h]) \le E(N'[t-\epsilon, t+h+\epsilon];\, T_J \le t) + E(N''[t, t+h];\, T_J > t) \le \frac{h+2\epsilon}{\mu} + P(T_J > t)\,U(h)$$
The desired result now follows from the fact that $P(T_J > t) \to 0$ and ε < h/2 is arbitrary.
Proof of Theorem 3.4.3 for µ = ∞. In this case, there is no stationary renewal process,
so we have to resort to other methods. Let
$$\beta = \limsup_{t\to\infty} U(t, t+1] = \lim_{k\to\infty} U(t_k, t_k+1]$$
for some sequence $t_k \to \infty$. We want to prove that β = 0, for then by addition the previous conclusion holds with 1 replaced by any integer n and, by monotonicity, with n replaced by any h < n, and this gives us the result in Theorem 3.4.3. Fix i and let
$$a_{k,j} = \int_{(j-1,j]} U(t_k - y, t_k + 1 - y]\,dF^{i*}(y)$$
By considering the location of $T_i$ we get
$$\text{(a)}\qquad \lim_{k\to\infty}\sum_{j=1}^\infty a_{k,j} = \lim_{k\to\infty}\int U(t_k - y, t_k + 1 - y]\,dF^{i*}(y) = \beta$$
Since β is the lim sup, we must have
$$\text{(b)}\qquad \limsup_{k\to\infty} a_{k,j} \le \beta\cdot P(T_i \in (j-1, j])$$
We want to conclude from (a) and (b) that
$$\text{(c)}\qquad \liminf_{k\to\infty} a_{k,j} \ge \beta\cdot P(T_i \in (j-1, j])$$
To do this, we observe that by considering the location of the first renewal in (j − 1, j]
$$\text{(d)}\qquad 0 \le a_{k,j} \le U(1)\,P(T_i \in (j-1, j])$$
(c) is trivial when β = 0 so we can suppose β > 0. To argue by contradiction, suppose there exist $j_0$ and ε > 0 so that
$$\liminf_{k\to\infty} a_{k,j_0} \le \beta\cdot\{P(T_i \in (j_0-1, j_0]) - \epsilon\}$$
Pick $k_n \to \infty$ so that
$$a_{k_n,j_0} \to \beta\cdot\{P(T_i \in (j_0-1, j_0]) - \epsilon\}$$
Using (d), we can pick $J \ge j_0$ so that
$$\limsup_{n\to\infty}\sum_{j=J+1}^\infty a_{k_n,j} \le U(1)\sum_{j=J+1}^\infty P(T_i \in (j-1, j]) \le \beta\epsilon/2$$
Now an easy argument shows
$$\limsup_{n\to\infty}\sum_{j=1}^J a_{k_n,j} \le \sum_{j=1}^J \limsup_{n\to\infty} a_{k_n,j} \le \beta\left(\sum_{j=1}^J P(T_i \in (j-1, j]) - \epsilon\right)$$
by (b) and our assumption. Adding the last two results shows
$$\limsup_{n\to\infty}\sum_{j=1}^\infty a_{k_n,j} \le \beta(1 - \epsilon/2)$$
which contradicts (a), and proves (c).
Now, if j − 1 < y ≤ j, we have
$$U(t_k - y, t_k + 1 - y] \le U(t_k - j, t_k + 2 - j]$$
so using (c) it follows that for j with $P(T_i \in (j-1, j]) > 0$, we must have
$$\liminf_{k\to\infty} U(t_k - j, t_k + 2 - j] \ge \beta$$
Summing over i, we see that the last conclusion is true when U(j − 1, j] > 0.
The support of U is closed under addition. (If x is in the support of $F^{m*}$ and y is in the support of $F^{n*}$ then x + y is in the support of $F^{(m+n)*}$.) We have assumed F is nonarithmetic, so U(j − 1, j] > 0 for $j \ge j_0$. Letting $r_k = t_k - j_0$ and considering the location of the last renewal in $[0, r_k]$ and the index of the $T_i$ gives
$$1 = \sum_{i=0}^\infty \int_0^{r_k} (1 - F(r_k - y))\,dF^{i*}(y) = \int_0^{r_k} (1 - F(r_k - y))\,dU(y) \ge \sum_{n=1}^\infty (1 - F(2n))\,U(r_k - 2n, r_k + 2 - 2n]$$
Since $\liminf_{k\to\infty} U(r_k - 2n, r_k + 2 - 2n] \ge \beta$ and
$$\sum_{n=0}^\infty (1 - F(2n)) \ge \mu/2 = \infty$$
β must be 0, and the proof is complete.
Remark. Following Lindvall (1977), we have based the proof for µ = ∞ on part of
Feller’s (1961) proof of the discrete renewal theorem (i.e., for arithmetic distributions).
See Freedman (1971b) p. 22–25 for an account of Feller’s proof. Purists can ﬁnd a
proof that does everything by coupling in Thorisson (1987).
Our next topic is the renewal equation: H = h + H ∗ F. Two cases we have seen in (3.4.2) and (3.4.3) are:
Example 3.4.1. h ≡ 1: $U(t) = 1 + \int_0^t U(t-s)\,dF(s)$
Example 3.4.2. h(t) = G(t): $V(t) = G(t) + \int_0^t V(t-s)\,dF(s)$
The last equation is valid for an arbitrary delay distribution. If we let G be the distribution in (3.4.4) and subtract the last two equations, we get
Example 3.4.3. H(t) = U(t) − t/µ satisfies the renewal equation with $h(t) = \frac{1}{\mu}\int_t^\infty 1 - F(s)\,ds$.
Last but not least, we have an example that is a typical application of the renewal equation.
Example 3.4.4. Let x > 0 be fixed, and let $H(t) = P(T_{N(t)} - t > x)$. By considering the value of $T_1$, we get
$$H(t) = (1 - F(t+x)) + \int_0^t H(t-s)\,dF(s)$$
The examples above should provide motivation for:
Theorem 3.4.4. If h is bounded then the function
$$H(t) = \int_0^t h(t-s)\,dU(s)$$
is the unique solution of the renewal equation that is bounded on bounded intervals.
Proof. Let $U_n(A) = \sum_{m=0}^n P(T_m \in A)$ and
$$H_n(t) = \int_0^t h(t-s)\,dU_n(s) = \sum_{m=0}^n (h * F^{m*})(t)$$
Here, $F^{m*}$ is the distribution of $T_m$, and we have extended the definition of h by setting h(r) = 0 for r < 0. From the last expression, it should be clear that
$$H_{n+1} = h + H_n * F$$
The fact that U(t) < ∞ implies $U(t) - U_n(t) \to 0$. Since h is bounded,
$$|H_n(t) - H(t)| \le \|h\|_\infty\,|U(t) - U_n(t)|$$
and $H_n(t) \to H(t)$ uniformly on bounded intervals. To estimate the convolution, we note that
$$|H_n * F(t) - H * F(t)| \le \sup_{s \le t}|H_n(s) - H(s)| \le \|h\|_\infty\,|U(t) - U_n(t)|$$
since $U - U_n = \sum_{m=n+1}^\infty F^{m*}$ is increasing in t. Letting n → ∞ in $H_{n+1} = h + H_n * F$, we see that H is a solution of the renewal equation that is bounded on bounded intervals.
To prove uniqueness, we observe that if $H_1$ and $H_2$ are two solutions, then $K = H_1 - H_2$ satisfies K = K ∗ F. If K is bounded on bounded intervals, iterating gives $K = K * F^{n*} \to 0$ as n → ∞, so $H_1 = H_2$.
The proof of Theorem 3.4.4 is valid when $F(\infty) = P(\xi_i < \infty) < 1$. In this case,
we have a terminating renewal process. After a geometric number of trials with
mean 1/(1 − F (∞)), Tn = ∞. This “trivial case” has some interesting applications.
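The renewal equation H = h + H ∗ F of Theorem 3.4.4 can be solved approximately on a grid by forward substitution. The sketch below is one possible discretisation, not a method from the text: it replaces F by the distribution of lifetimes rounded up to a multiple of `dt`, assumes F(0) = 0, and therefore carries an O(dt) error. The function name and parameters are hypothetical, and F may be defective.

```python
from math import exp

def solve_renewal_equation(h, F, t_max, dt):
    """Approximate H = h + H * F on the grid 0, dt, 2 dt, ..., t_max.

    Lifetimes are rounded up to a multiple of dt, so the convolution becomes
    a finite sum and H can be computed one grid point at a time.
    """
    n = int(round(t_max / dt))
    dF = [0.0] + [F(j * dt) - F((j - 1) * dt) for j in range(1, n + 1)]
    H = []
    for i in range(n + 1):
        H.append(h(i * dt) + sum(H[i - j] * dF[j] for j in range(1, i + 1)))
    return H
```

With h ≡ 1 and F(t) = 1 − e^{−t}, the last entry of `solve_renewal_equation(lambda t: 1.0, lambda t: 1 - exp(-t), 2.0, 0.01)` approximates U(2) = 1 + 2 = 3 for the rate 1 Poisson process.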
Example 3.4.5. Pedestrian delay. A chicken wants to cross a road (we won't ask why) on which the traffic is a Poisson process with rate λ. She needs one unit of time with no arrival to safely cross the road. Let M = inf{t ≥ 0 : there are no arrivals in (t, t + 1]} be the waiting time until she starts to cross the street. By considering the time of the first arrival, we see that H(t) = P(M ≤ t) satisfies
$$H(t) = e^{-\lambda} + \int_0^1 H(t-y)\,\lambda e^{-\lambda y}\,dy$$
Comparing with Example 3.4.1 and using Theorem 3.4.4, we see that
$$H(t) = e^{-\lambda}\sum_{n=0}^\infty F^{n*}(t)$$
We could have gotten this answer without renewal theory by noting
$$P(M \le t) = \sum_{n=0}^\infty P(T_n \le t, T_{n+1} = \infty)$$
The last representation allows us to compute the mean of M. Let µ be the mean of the interarrival time given that it is < 1, and note that the lack of memory property of the exponential distribution implies
$$\int_0^1 x\lambda e^{-\lambda x}\,dx = \int_0^\infty - \int_1^\infty = \frac{1}{\lambda} - \left(1 + \frac{1}{\lambda}\right)e^{-\lambda}$$
so that, dividing by the probability $1 - e^{-\lambda}$ of the conditioning event, $\mu = \{1/\lambda - (1 + 1/\lambda)e^{-\lambda}\}/(1 - e^{-\lambda})$. Then, by considering the number of renewals in our terminating renewal process,
$$EM = \sum_{n=0}^\infty e^{-\lambda}(1 - e^{-\lambda})^n\,n\mu = (e^\lambda - 1)\mu$$
since if X is a geometric with success probability $e^{-\lambda}$ then EM = µE(X − 1).
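The mean delay can be checked by simulation. In the sketch below, M is built as the sum of the interarrival gaps that are shorter than 1, stopping at the first gap of length at least 1, and µ in the formula is taken to be the conditional mean of a gap given that it is < 1. The function names and the seed are hypothetical.

```python
import random
from math import exp

def simulate_mean_delay(lam, trials, seed=0):
    """Monte Carlo estimate of EM for Poisson traffic with rate lam."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        waited = 0.0
        while True:
            gap = rng.expovariate(lam)
            if gap >= 1.0:       # a gap long enough to cross
                break
            waited += gap
        total += waited
    return total / trials

def mean_delay_formula(lam):
    """(e^lam - 1) * mu, with mu the mean gap conditioned on being < 1."""
    unnormalised = 1 / lam - (1 + 1 / lam) * exp(-lam)  # integral_0^1 x lam e^{-lam x} dx
    mu = unnormalised / (1 - exp(-lam))
    return (exp(lam) - 1) * mu
```

Simplifying the formula gives EM = (e^λ − 1 − λ)/λ, so for λ = 1 the mean delay is e − 2.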
Example 3.4.6. Cramér's estimates of ruin. Consider an insurance company that collects money at rate c and experiences i.i.d. claims at the arrival times of a Poisson process $N_t$ with rate 1. If its initial capital is x, its wealth at time t is
$$W_x(t) = x + ct - \sum_{m=1}^{N_t} Y_m$$
Here $Y_1, Y_2, \ldots$ are i.i.d. with distribution G and mean µ. Let
$$R(x) = P(W_x(t) \ge 0 \text{ for all } t)$$
be the probability of never going bankrupt starting with capital x. By considering the time and size of the first claim:
$$\text{(a)}\qquad R(x) = \int_0^\infty e^{-s}\int_0^{x+cs} R(x + cs - y)\,dG(y)\,ds$$
This does not look much like a renewal equation, but with some ingenuity it can be transformed into one. Changing variables t = x + cs
$$R(x)e^{-x/c} = \int_x^\infty e^{-t/c}\int_0^t R(t-y)\,dG(y)\,\frac{dt}{c}$$
Differentiating w.r.t. x and then multiplying by $e^{x/c}$,
$$R'(x) = \frac{1}{c}R(x) - \int_0^x R(x-y)\,dG(y)\cdot\frac{1}{c}$$
Integrating x from 0 to w
$$\text{(b)}\qquad R(w) - R(0) = \frac{1}{c}\int_0^w R(x)\,dx - \frac{1}{c}\int_0^w\int_0^x R(x-y)\,dG(y)\,dx$$
Interchanging the order of integration in the double integral, letting
$$S(w) = \int_0^w R(x)\,dx$$
using dG = −d(1 − G), and then integrating by parts
$$\begin{aligned} -\frac{1}{c}\int_0^w\int_y^w R(x-y)\,dx\,dG(y) &= -\frac{1}{c}\int_0^w S(w-y)\,dG(y) \\ &= \frac{1}{c}\int_0^w S(w-y)\,d(1-G)(y) = \frac{1}{c}\left\{-S(w) + \int_0^w (1-G(y))R(w-y)\,dy\right\} \end{aligned}$$
Plugging this into (b), we finally have a renewal equation:
$$\text{(c)}\qquad R(w) = R(0) + \int_0^w R(w-y)\,\frac{1-G(y)}{c}\,dy$$
It took some cleverness to arrive at the last equation, but it is straightforward to analyze. First, we dismiss a trivial case. If µ > c,
$$\frac{1}{t}\left(ct - \sum_{m=1}^{N_t} Y_m\right) \to c - \mu < 0 \quad\text{a.s.}$$
so R(x) ≡ 0. When µ < c,
$$F(x) = \int_0^x \frac{1-G(y)}{c}\,dy$$
is a defective probability distribution with F(∞) = µ/c. Our renewal equation can be written as
$$\text{(d)}\qquad R = R(0) + R * F$$
so comparing with Example 3.4.1 and using Theorem 3.4.4 tells us R(w) = R(0)U(w). To complete the solution, we have to compute the constant R(0). Letting w → ∞ and noticing R(w) → 1, $U(w) \to (1 - F(\infty))^{-1} = (1 - \mu/c)^{-1}$, we have R(0) = 1 − µ/c.
The basic fact about solutions of the renewal equation (in the nonterminating case) is:
Theorem 3.4.5. The renewal theorem. If F is nonarithmetic and h is directly Riemann integrable then as t → ∞
$$H(t) \to \frac{1}{\mu}\int_0^\infty h(s)\,ds$$
Intuitively, this holds since Theorem 3.4.4 implies
$$H(t) = \int_0^t h(t-s)\,dU(s)$$
and Theorem 3.4.3 implies dU(s) → ds/µ as s → ∞. We will define directly Riemann integrable in a minute. We will start doing the proof and then figure out what we need to assume.
Proof. Suppose
$$h(s) = \sum_{k=0}^\infty a_k 1_{[k\delta,(k+1)\delta)}(s)$$
where $\sum_{k=0}^\infty |a_k| < \infty$. Since U([t, t + δ]) ≤ U([0, δ]) < ∞, it follows easily from Theorem 3.4.3 that
$$\int_0^t h(t-s)\,dU(s) = \sum_{k=0}^\infty a_k\,U((t-(k+1)\delta, t-k\delta]) \to \frac{1}{\mu}\sum_{k=0}^\infty a_k\delta$$
(Pick K so that $\sum_{k \ge K}|a_k| \le \epsilon/2U([0,\delta])$ and then T so that
$$|a_k|\cdot|U((t-(k+1)\delta, t-k\delta]) - \delta/\mu| \le \frac{\epsilon}{2K}$$
for t ≥ T and 0 ≤ k < K.) If h is an arbitrary function on [0, ∞), we let
$$I^\delta = \sum_{k=0}^\infty \delta\,\sup\{h(x) : x \in [k\delta,(k+1)\delta)\}$$
$$I_\delta = \sum_{k=0}^\infty \delta\,\inf\{h(x) : x \in [k\delta,(k+1)\delta)\}$$
be upper and lower Riemann sums approximating the integral of h over [0, ∞). Comparing h with the obvious upper and lower bounds that are constant on [kδ, (k + 1)δ) and using the result for the special case,
$$\frac{I_\delta}{\mu} \le \liminf_{t\to\infty}\int_0^t h(t-s)\,dU(s) \le \limsup_{t\to\infty}\int_0^t h(t-s)\,dU(s) \le \frac{I^\delta}{\mu}$$
If $I^\delta$ and $I_\delta$ both approach the same finite limit I as δ → 0, then h is said to be directly Riemann integrable, and it follows that
$$\int_0^t h(t-s)\,dU(s) \to I/\mu$$
Remark. The word "direct" in the name refers to the fact that while the Riemann integral over [0, ∞) is usually defined as the limit of integrals over [0, a], we are approximating the integral over [0, ∞) directly.
In checking the new hypothesis in Theorem 3.4.5, the following result is useful.
Lemma 3.4.6. If h(x) ≥ 0 is decreasing with h(0) < ∞ and $\int_0^\infty h(x)\,dx < \infty$, then h is directly Riemann integrable.
Proof. Because h is decreasing, $I^\delta = \sum_{k=0}^\infty \delta h(k\delta)$ and $I_\delta = \sum_{k=0}^\infty \delta h((k+1)\delta)$. So
$$I^\delta \ge \int_0^\infty h(x)\,dx \ge I_\delta = I^\delta - h(0)\delta$$
proving the desired result.
The last result suﬃces for all our applications, so we leave it to the reader to do
Exercise 3.4.4. If h ≥ 0 is continuous then h is directly Riemann integrable if and
only if I δ < ∞ for some δ > 0 (and hence for all δ > 0).
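For a decreasing h the upper and lower sums are simple to compute, which makes the squeeze in Lemma 3.4.6 easy to observe. The sketch below truncates the infinite sums at a finite `cutoff` (an approximation introduced here, harmless when h decays quickly); the function name is hypothetical.

```python
from math import exp

def direct_riemann_sums(h, delta, cutoff):
    """Upper and lower sums for a decreasing h >= 0, truncated at x = cutoff."""
    cells = int(cutoff / delta)
    upper = sum(delta * h(k * delta) for k in range(cells))        # sup on each cell
    lower = sum(delta * h((k + 1) * delta) for k in range(cells))  # inf on each cell
    return upper, lower
```

For h(x) = e^{−x} the two sums differ by h(0)δ = δ (up to the truncation) and both approach the integral 1 as δ → 0.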
Returning now to our examples, we skip the ﬁrst two because, in those cases,
h(t) → 1 as t → ∞, so h is not integrable in any sense.
Example 3.4.7. Continuation of Example 3.4.3. $h(t) = \frac{1}{\mu}\int_{[t,\infty)} 1 - F(s)\,ds$. h is decreasing, h(0) = 1, and
$$\mu\int_0^\infty h(t)\,dt = \int_0^\infty\int_t^\infty 1 - F(s)\,ds\,dt = \int_0^\infty\int_0^s 1 - F(s)\,dt\,ds = \int_0^\infty s(1 - F(s))\,ds = E(\xi_i^2/2)$$
So, if $\nu \equiv E(\xi_i^2) < \infty$, it follows from Lemma 3.4.6, Theorem 3.4.5, and the formula in Example 3.4.3 that
$$0 \le U(t) - t/\mu \to \nu/2\mu^2 \quad\text{as } t \to \infty$$
When the renewal process is a rate λ Poisson process, i.e., $P(\xi_i > t) = e^{-\lambda t}$, N(t) − 1 has a Poisson distribution with mean λt, so U(t) = 1 + λt. According to Feller, Vol. II (1971), p. 385, if the $\xi_i$ are uniform on (0,1), then
$$U(t) = \sum_{k=0}^n (-1)^k e^{t-k}(t-k)^k/k! \quad\text{for } n \le t \le n+1$$
As he says, the exact expression "reveals little about the nature of U. The asymptotic formula 0 ≤ U(t) − 2t → 2/3 is much more interesting."
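Feller's exact expression is easy to evaluate, and doing so shows how quickly U(t) − 2t settles at 2/3. The sketch below simply transcribes the formula quoted above for lifetimes uniform on (0, 1); the function name is hypothetical.

```python
from math import exp, factorial, floor

def renewal_function_uniform(t):
    """Feller's formula for U(t) when the lifetimes are uniform on (0, 1)."""
    n = floor(t)
    return sum((-1) ** k * exp(t - k) * (t - k) ** k / factorial(k)
               for k in range(n + 1))
```

On [0, 1] the formula reduces to U(t) = e^t. The alternating sum involves cancellation between large terms, so this direct evaluation is only reliable for moderate t.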
Example 3.4.8. Continuation of Example 3.4.4. h(t) = 1 − F(t + x). Again, h is decreasing, but this time h(0) ≤ 1 and the integral of h is finite when $\mu = E(\xi_i) < \infty$. Applying Lemma 3.4.6 and Theorem 3.4.5 now gives
$$P(T_{N(t)} - t > x) \to \frac{1}{\mu}\int_0^\infty h(s)\,ds = \frac{1}{\mu}\int_x^\infty 1 - F(t)\,dt$$
so (when µ < ∞) the distribution of the residual waiting time $T_{N(t)} - t$ converges to the delay distribution that produces the stationary renewal process. This fact also follows from our proof of Theorem 3.4.3.
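The limit for the residual waiting time can be observed by simulation. The sketch below uses lifetimes uniform on (0, 1), a choice made here for illustration, for which µ = 1/2 and the limit $\frac{1}{\mu}\int_x^1 (1 - s)\,ds$ equals $(1 - x)^2$ for 0 ≤ x ≤ 1. The function name and seed are hypothetical.

```python
import random

def residual_lifetime_tail(t, x, trials, seed=0):
    """Monte Carlo estimate of P(T_{N(t)} - t > x) for uniform(0, 1) lifetimes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        time = 0.0
        while time <= t:          # stop at the first renewal after time t
            time += rng.random()
        hits += (time - t) > x
    return hits / trials
```

For t = 20 and x = 0.5 the estimate should be close to (1 − 0.5)² = 0.25.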
Using the method employed to study Example 3.4.4, one can analyze various other
aspects of the asymptotic behavior of renewal processes. To avoid repeating ourselves,
we assume throughout that F is nonarithmetic, and in problems where the mean
appears we assume it is finite.
Exercise 3.4.5. Let $A_t = t - T_{N(t)-1}$ be the “age” at time t, i.e., the amount of time
since the last renewal. If we fix x > 0 then $H(t) = P(A_t > x)$ satisfies the renewal
equation
\[ H(t) = (1 - F(t)) \cdot 1_{(x,\infty)}(t) + \int_0^t H(t-s)\,dF(s) \]
so $P(A_t > x) \to \frac{1}{\mu} \int_{(x,\infty)} (1 - F(t))\,dt$, which is the limit distribution for the residual
lifetime $B_t = T_{N(t)} - t$.
Remark. The last result can be derived from Example 3.4.4 by noting that if t > x
then $P(A_t \ge x) = P(B_{t-x} > x) = P(\text{no renewal in } (t-x, t])$. To check the placement
of the strict inequality, recall $N_t = \inf\{k : T_k > t\}$ so we always have $A_s \ge 0$ and
$B_s > 0$.
Exercise 3.4.6. Use the renewal equation in the last problem and Theorem 3.4.4 to
conclude that if T is a rate λ Poisson process, then $A_t$ has the same distribution as $\xi_i \wedge t$.
Exercise 3.4.7. Let At = t − TN (t)−1 and Bt = TN (t) − t. Show that
\[ P(A_t > x, B_t > y) \to \frac{1}{\mu} \int_{x+y}^\infty (1 - F(t))\,dt \]
Exercise 3.4.8. Alternating renewal process. Let $\xi_1, \xi_2, \ldots > 0$ be i.i.d. with
distribution F1 and let η1 , η2 , . . . > 0 be i.i.d. with distribution F2 . Let T0 = 0 and
for k ≥ 1 let Sk = Tk−1 + ξk and Tk = Sk + ηk . In words, we have a machine that
works for an amount of time ξk , breaks down, and then requires ηk units of time to
be repaired. Let F = F1 ∗ F2 and let H (t) be the probability the machine is working
at time t. Show that if F is nonarithmetic then as t → ∞
H (t) → µ1 /(µ1 + µ2 )
where µi is the mean of Fi .
Exercise 3.4.9. Write a renewal equation for H (t) = P ( number of renewals in [0, t]
is odd) and use the renewal theorem to show that H (t) → 1/2. Note: This is a special
case of the previous exercise.
Exercise 3.4.10. Renewal densities. Show that if F (t) has a directly Riemann
integrable density function f(t), then $V = U - 1_{[0,\infty)}$ has a density v that satisfies
\[ v(t) = f(t) + \int_0^t v(t-s)\,dF(s) \]
Use the renewal theorem to conclude that if f is directly Riemann integrable then
v (t) → 1/µ as t → ∞.
Finally, we have an example that would have been given right after (4.1) but was
delayed because we had not yet deﬁned a delayed renewal process.
Example 3.4.9. Patterns in coin tossing. Let Xn , n ≥ 1 take values H and T
with probability 1/2 each. Let T0 = 0 and Tm = inf {n > Tm−1 : (Xn , . . . , Xn+k−1 ) =
(i1 , . . . , ik )} where (i1 , . . . , ik ) is some pattern of heads and tails. It is easy to see
that the Tj form a delayed renewal process, i.e., tj = Tj − Tj −1 are independent for
j ≥ 1 and identically distributed for j ≥ 2. To see that the distribution of t1 may be
diﬀerent, let (i1 , i2 , i3 ) = (H, H, H ). In this case, P (t1 = 1) = 1/8, P (t2 = 1) = 1/2.
Exercise 3.4.11. (i) Show that for any pattern of length k, $E t_j = 2^k$ for j ≥ 2.
(ii) Compute $E t_1$ when the pattern is HH, and when it is HT. Hint: For HH, observe
\[ E t_1 = P(HH) + P(HT) E(t_1 + 2) + P(T) E(t_1 + 1) \]

Chapter 4 Martingales

A martingale Xn can be thought of as the fortune at time n of a player who is betting
on a fair game; submartingales (supermartingales) as the outcome of betting on a
favorable (unfavorable) game. There are two basic facts about martingales. The ﬁrst is
that you cannot make money betting on them (see Theorem 4.2.5), and in particular if
you choose to stop playing at some bounded time N then your expected winnings EXN
are equal to your initial fortune X0 . (We are supposing for the moment that X0 is not
random.) Our second fact, Theorem 4.2.8, concerns submartingales. To use a heuristic
we learned from Mike Brennan, “They are the stochastic analogues of nondecreasing
sequences and so if they are bounded above (to be precise, $\sup_n E X_n^+ < \infty$) they
converge almost surely.” As the material in Section 4.3 shows, this result has diverse
applications. Later sections give suﬃcient conditions for martingales to converge in
Lp , p > 1 (Section 4.4) and in L1 (Section 4.5); consider martingales indexed by n ≤ 0
(Section 4.6); and give suﬃcient conditions for EXN = EX0 to hold for unbounded
stopping times (Section 4.7). The last result is quite useful for studying the behavior
of random walks and other systems.

4.1 Conditional Expectation

We begin with a definition that is important for this chapter and the next one. After
giving the definition, we will consider several examples to explain it. Given are a
probability space $(\Omega, \mathcal{F}_o, P)$, a σ-field $\mathcal{F} \subset \mathcal{F}_o$, and a random variable $X \in \mathcal{F}_o$ with
$E|X| < \infty$. We define the conditional expectation of X given $\mathcal{F}$, $E(X|\mathcal{F})$, to be
any random variable Y that has
(i) $Y \in \mathcal{F}$, i.e., is $\mathcal{F}$-measurable
(ii) for all $A \in \mathcal{F}$, $\int_A X\,dP = \int_A Y\,dP$
Any Y satisfying (i) and (ii) is said to be a version of $E(X|\mathcal{F})$. The first thing to be
settled is that the conditional expectation exists and is unique. We tackle the second
claim first but start with a technical point.
Lemma 4.1.1. If Y satisfies (i) and (ii), then it is integrable.
Proof. Letting $A = \{Y > 0\} \in \mathcal{F}$, using (ii) twice, and then adding
\[ \int_A Y\,dP = \int_A X\,dP \le \int_A |X|\,dP \]
\[ \int_{A^c} -Y\,dP = \int_{A^c} -X\,dP \le \int_{A^c} |X|\,dP \]
So we have $E|Y| \le E|X|$.
Uniqueness. If $Y'$ also satisfies (i) and (ii) then
\[ \int_A Y\,dP = \int_A Y'\,dP \quad\text{for all } A \in \mathcal{F} \]
Taking $A = \{Y - Y' \ge \epsilon > 0\}$, we see
\[ 0 = \int_A X - X\,dP = \int_A Y - Y'\,dP \ge \epsilon P(A) \]
so P(A) = 0. Since this holds for all ε we have $Y \le Y'$ a.s., and interchanging the
roles of Y and $Y'$, we have $Y = Y'$ a.s. Technically, all equalities such as $Y = E(X|\mathcal{F})$
should be written as $Y = E(X|\mathcal{F})$ a.s., but we have ignored this point in previous
chapters and will continue to do so.
Exercise 4.1.1. Generalize the last argument to show that if $X_1 = X_2$ on $B \in \mathcal{F}$
then $E(X_1|\mathcal{F}) = E(X_2|\mathcal{F})$ a.s. on B.
Existence. To start, we recall ν is said to be absolutely continuous with respect
to µ (abbreviated $\nu \ll \mu$) if µ(A) = 0 implies ν(A) = 0, and we use Theorem A.8.6:
Radon-Nikodym Theorem. Let µ and ν be σ-finite measures on $(\Omega, \mathcal{F})$. If $\nu \ll \mu$,
there is a function $f \in \mathcal{F}$ so that for all $A \in \mathcal{F}$
\[ \int_A f\,d\mu = \nu(A) \]
f is usually denoted dν/dµ and called the Radon-Nikodym derivative.
The last theorem easily gives the existence of conditional expectation. Suppose
first that X ≥ 0. Let µ = P and
\[ \nu(A) = \int_A X\,dP \quad\text{for } A \in \mathcal{F} \]
The dominated convergence theorem implies ν is a measure (see Exercise A.5.8) and
the definition of the integral implies $\nu \ll \mu$. The Radon-Nikodym derivative $d\nu/d\mu \in \mathcal{F}$
and for any $A \in \mathcal{F}$ has
\[ \int_A X\,dP = \nu(A) = \int_A \frac{d\nu}{d\mu}\,dP \]
Taking A = Ω, we see that dν/dµ ≥ 0 is integrable, and we have shown that dν/dµ
is a version of $E(X|\mathcal{F})$.
To treat the general case now, write $X = X^+ - X^-$, let $Y_1 = E(X^+|\mathcal{F})$ and
$Y_2 = E(X^-|\mathcal{F})$. Now $Y_1 - Y_2 \in \mathcal{F}$ is integrable, and for all $A \in \mathcal{F}$ we have
\[ \int_A X\,dP = \int_A X^+\,dP - \int_A X^-\,dP = \int_A Y_1\,dP - \int_A Y_2\,dP = \int_A (Y_1 - Y_2)\,dP \]
This shows $Y_1 - Y_2$ is a version of $E(X|\mathcal{F})$ and completes the proof.

4.1.1 Examples

Intuitively, we think of $\mathcal{F}$ as describing the information we have at our disposal: for
each $A \in \mathcal{F}$, we know whether or not A has occurred. $E(X|\mathcal{F})$ is then our “best
guess” of the value of X given the information we have. Some examples should help
to clarify this and connect $E(X|\mathcal{F})$ with other definitions of conditional expectation.
Example 4.1.1. If $X \in \mathcal{F}$, then $E(X|\mathcal{F}) = X$; i.e., if we know X then our “best
guess” is X itself. Since X always satisfies (ii), the only thing that can keep X from
being $E(X|\mathcal{F})$ is condition (i). A special case of this example is X = c, where c is a
constant.
Example 4.1.2. At the other extreme from perfect information is no information.
Suppose X is independent of $\mathcal{F}$, i.e., for all $B \in \mathcal{R}$ and $A \in \mathcal{F}$
\[ P(\{X \in B\} \cap A) = P(X \in B) P(A) \]
We claim that, in this case, $E(X|\mathcal{F}) = EX$; i.e., if you don’t know anything about X,
then the best guess is the mean EX. To check the definition, note that $EX \in \mathcal{F}$ so
(i) holds. To verify (ii), we observe that if $A \in \mathcal{F}$ then since X and $1_A \in \mathcal{F}$ are independent,
Theorem 1.4.9 implies
\[ \int_A X\,dP = E(X 1_A) = EX\,E 1_A = \int_A EX\,dP \]
The reader should note that here and in what follows the game is “guess and verify.”
We come up with a formula for the conditional expectation and then check that it
satisﬁes (i) and (ii).
Example 4.1.3. In this example, we relate the new definition of conditional expectation to the first one taught in an undergraduate probability course. Suppose
$\Omega_1, \Omega_2, \ldots$ is a finite or infinite partition of Ω into disjoint sets, each of which has
positive probability, and let $\mathcal{F} = \sigma(\Omega_1, \Omega_2, \ldots)$ be the σ-field generated by these sets.
Then
\[ E(X|\mathcal{F}) = \frac{E(X; \Omega_i)}{P(\Omega_i)} \quad\text{on } \Omega_i \]
In words, the information in $\Omega_i$ tells us which element of the partition our outcome
lies in and given this information, the best guess for X is the average value of X over
$\Omega_i$. To prove our guess is correct, observe that the proposed formula is constant on
each $\Omega_i$, so it is measurable with respect to $\mathcal{F}$. To verify (ii), it is enough to check
the equality for $A = \Omega_i$, but this is trivial:
\[ \int_{\Omega_i} \frac{E(X; \Omega_i)}{P(\Omega_i)}\,dP = E(X; \Omega_i) = \int_{\Omega_i} X\,dP \]
A degenerate but important special case is $\mathcal{F} = \{\emptyset, \Omega\}$, the trivial σ-field. In this
case, $E(X|\mathcal{F}) = EX$.
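The partition formula is easy to compute by hand on a finite sample space, and a short program makes the “guess and verify” step concrete. The following Python sketch is ours, not part of the text: the helper `cond_exp_partition` and the die example are hypothetical illustrations, and exact rational arithmetic is used so that property (ii) can be checked with equality.

```python
from fractions import Fraction

def cond_exp_partition(prob, X, partition):
    """Return E(X | sigma(partition)) as a dict: outcome -> value.

    prob: dict outcome -> probability; X: dict outcome -> value;
    partition: list of disjoint sets of outcomes covering the sample space,
    each with positive probability.
    """
    result = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)         # P(Omega_i)
        e_block = sum(X[w] * prob[w] for w in block)  # E(X; Omega_i)
        for w in block:
            result[w] = e_block / p_block             # constant on Omega_i
    return result

# Illustrative setup (an assumption of this sketch): a fair die, X = face value,
# partitioned into odd and even faces.
prob = {w: Fraction(1, 6) for w in range(1, 7)}
X = {w: Fraction(w) for w in range(1, 7)}
partition = [{1, 3, 5}, {2, 4, 6}]
Y = cond_exp_partition(prob, X, partition)

# Property (ii) on the generating sets: the integrals of X and Y agree.
for block in partition:
    assert sum(X[w] * prob[w] for w in block) == sum(Y[w] * prob[w] for w in block)
print(Y[1], Y[2])  # average face value on the odd and on the even faces
```

Taking the partition to be the single set {1, ..., 6} reproduces the degenerate case in which the conditional expectation is the constant EX.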
To continue the connection with undergraduate notions, let
\[ P(A|\mathcal{G}) = E(1_A|\mathcal{G}) \]
\[ P(A|B) = P(A \cap B)/P(B) \]
and observe that in the last example $P(A|\mathcal{F}) = P(A|\Omega_i)$ on $\Omega_i$.
Example 4.1.4. Bayes’ formula. Let $G \in \mathcal{G}$ and show that
\[ P(G|A) = \int_G P(A|\mathcal{G})\,dP \Big/ \int_\Omega P(A|\mathcal{G})\,dP \]
When $\mathcal{G}$ is the σ-field generated by a partition, this reduces to the usual Bayes’ formula
\[ P(G_i|A) = P(A|G_i)P(G_i) \Big/ \sum_j P(A|G_j)P(G_j) \]
The definition of conditional expectation given a σ-field contains conditioning on
a random variable as a special case. We define
\[ E(X|Y) = E(X|\sigma(Y)) \]
where σ(Y) is the σ-field generated by Y.
Example 4.1.5. To continue making connection with definitions of conditional expectation from undergraduate probability, suppose X and Y have joint density f(x, y),
i.e.,
\[ P((X, Y) \in B) = \int_B f(x, y)\,dx\,dy \quad\text{for } B \in \mathcal{R}^2 \]
and suppose for simplicity that $\int f(x, y)\,dx > 0$ for all y. We claim that in this case,
if $E|g(X)| < \infty$ then $E(g(X)|Y) = h(Y)$, where
\[ h(y) = \int g(x) f(x, y)\,dx \Big/ \int f(x, y)\,dx \]
To “guess” this formula, note that treating the probability densities P(Y = y) as if
they were real probabilities
\[ P(X = x|Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f(x, y)}{\int f(x, y)\,dx} \]
so, integrating against the conditional probability density, we have
\[ E(g(X)|Y = y) = \int g(x) P(X = x|Y = y)\,dx \]
To “verify” the proposed formula now, observe $h(Y) \in \sigma(Y)$ so (i) holds. To check
(ii), observe that if $A \in \sigma(Y)$ then $A = \{\omega : Y(\omega) \in B\}$ for some $B \in \mathcal{R}$, so
\[ E(h(Y); A) = \int_B \int h(y) f(x, y)\,dx\,dy = \int_B \int g(x) f(x, y)\,dx\,dy = E(g(X) 1_B(Y)) = E(g(X); A) \]
Remark. To drop the assumption that $\int f(x, y)\,dx > 0$, define h by
\[ h(y) \int f(x, y)\,dx = \int g(x) f(x, y)\,dx \]
(i.e., h can be anything where $\int f(x, y)\,dx = 0$), and observe this is enough for the
proof.
Example 4.1.6. Suppose X and Y are independent. Let ϕ be a function with
$E|\varphi(X, Y)| < \infty$ and let $g(x) = E(\varphi(x, Y))$. We will now show that
\[ E(\varphi(X, Y)|X) = g(X) \]
Proof. It is clear that $g(X) \in \sigma(X)$. To check (ii), note that if $A \in \sigma(X)$ then
$A = \{X \in C\}$, so using the change of variables formula (Theorem 1.3.9) and the
fact that the distribution of (X, Y) is product measure (Theorem 1.4.7), then the
definition of g, and change of variables again,
\[ \int_A \varphi(X, Y)\,dP = E\{\varphi(X, Y) 1_C(X)\} = \int\!\!\int \varphi(x, y) 1_C(x)\,\nu(dy)\,\mu(dx) = \int 1_C(x) g(x)\,\mu(dx) = \int_A g(X)\,dP \]
which proves the desired result.
Example 4.1.7. Borel’s paradox. Let X be a randomly chosen point on the earth,
let θ be its longitude, and ϕ be its latitude. It is customary to take θ ∈ [0, 2π ) and
ϕ ∈ (−π/2, π/2] but we can equally well take θ ∈ [0, π ) and ϕ ∈ (−π, π ]. In words,
the new longitude speciﬁes the great circle on which the point lies and then ϕ gives
the angle.
At ﬁrst glance it might seem that if X is uniform on the globe then θ and the angle
ϕ on the great circle should both be uniform over their possible values. θ is uniform
but ϕ is not. The paradox completely evaporates once we realize that in the new or
in the traditional formulation ϕ is independent of θ, so the conditional distribution is
the unconditional one, which is not uniform since there is more land near the equator
than near the North Pole.

4.1.2 Properties

Conditional expectation has many of the same properties that ordinary expectation
does.
Theorem 4.1.2. (a) Conditional expectation is linear:
\[ E(aX + Y|\mathcal{F}) = aE(X|\mathcal{F}) + E(Y|\mathcal{F}) \tag{4.1.1} \]
(b) If X ≤ Y then
\[ E(X|\mathcal{F}) \le E(Y|\mathcal{F}) \tag{4.1.2} \]
(c) If $X_n \ge 0$ and $X_n \uparrow X$ with EX < ∞ then
\[ E(X_n|\mathcal{F}) \uparrow E(X|\mathcal{F}) \tag{4.1.3} \]
Remark. By applying the last result to $Y_1 - Y_n$, we see that if $Y_n \downarrow Y$ and we have
$E|Y_1|, E|Y| < \infty$, then $E(Y_n|\mathcal{F}) \downarrow E(Y|\mathcal{F})$.
Proof. To prove (a), we need to check that the right-hand side is a version of the left.
It clearly is $\mathcal{F}$-measurable. To check (ii), we observe that if $A \in \mathcal{F}$ then by linearity
of the integral and the defining properties of $E(X|\mathcal{F})$ and $E(Y|\mathcal{F})$,
\[ \int_A \{aE(X|\mathcal{F}) + E(Y|\mathcal{F})\}\,dP = a\int_A E(X|\mathcal{F})\,dP + \int_A E(Y|\mathcal{F})\,dP = a\int_A X\,dP + \int_A Y\,dP = \int_A aX + Y\,dP \]
which proves (4.1.1).
Using the definition
\[ \int_A E(X|\mathcal{F})\,dP = \int_A X\,dP \le \int_A Y\,dP = \int_A E(Y|\mathcal{F})\,dP \]
Letting $A = \{E(X|\mathcal{F}) - E(Y|\mathcal{F}) \ge \epsilon > 0\}$, we see that the indicated set has probability 0 for all ε > 0, and we have proved (4.1.2).
Let $Y_n = X - X_n$. It suffices to show that $E(Y_n|\mathcal{F}) \downarrow 0$. Since $Y_n \downarrow$, (4.1.2) implies
$Z_n \equiv E(Y_n|\mathcal{F}) \downarrow$ a limit $Z_\infty$. If $A \in \mathcal{F}$ then
\[ \int_A Z_n\,dP = \int_A Y_n\,dP \]
Letting n → ∞, noting $Y_n \downarrow 0$, and using the dominated convergence theorem gives
that $\int_A Z_\infty\,dP = 0$ for all $A \in \mathcal{F}$, so $Z_\infty \equiv 0$.
Exercise 4.1.2. Prove Chebyshev’s inequality. If a > 0 then
\[ P(|X| \ge a \mid \mathcal{F}) \le a^{-2} E(X^2 \mid \mathcal{F}) \]
Exercise 4.1.3. Suppose X ≥ 0 and EX = ∞. (There is nothing to prove when
EX < ∞.) Show there is a unique $\mathcal{F}$-measurable Y with 0 ≤ Y ≤ ∞ so that
\[ \int_A X\,dP = \int_A Y\,dP \quad\text{for all } A \in \mathcal{F} \]
Hint: Let $X_M = X \wedge M$, $Y_M = E(X_M|\mathcal{F})$, and let M → ∞.
Theorem 4.1.3. If ϕ is convex and $E|X|, E|\varphi(X)| < \infty$ then
\[ \varphi(E(X|\mathcal{F})) \le E(\varphi(X)|\mathcal{F}) \tag{4.1.4} \]
Proof. If ϕ is linear, the result is trivial, so we will suppose ϕ is not linear. We
do this so that if we let $S = \{(a, b) : a, b \in \mathbb{Q}, ax + b \le \varphi(x) \text{ for all } x\}$, then
$\varphi(x) = \sup\{ax + b : (a, b) \in S\}$. See the proof of Theorem 1.3.2 for more details. If
$\varphi(x) \ge ax + b$ then (4.1.2) and (4.1.1) imply
\[ E(\varphi(X)|\mathcal{F}) \ge a E(X|\mathcal{F}) + b \quad\text{a.s.} \]
Taking the sup over (a, b) ∈ S gives
\[ E(\varphi(X)|\mathcal{F}) \ge \varphi(E(X|\mathcal{F})) \quad\text{a.s.} \]
which proves the desired result.
Remark. Here we have written a.s. by the inequalities to stress that there is an
exceptional set for each a, b so we have to take the sup over a countable set.
Exercise 4.1.4. Imitate the proof in the remark after Theorem A.5.2 to prove the
conditional Cauchy-Schwarz inequality.
\[ E(XY|\mathcal{G})^2 \le E(X^2|\mathcal{G}) E(Y^2|\mathcal{G}) \]
Theorem 4.1.4. Conditional expectation is a contraction in $L^p$, p ≥ 1.
Proof. (4.1.4) implies $|E(X \mid \mathcal{F})|^p \le E(|X|^p \mid \mathcal{F})$. Taking expected values gives
\[ E(|E(X \mid \mathcal{F})|^p) \le E(E(|X|^p \mid \mathcal{F})) = E|X|^p \]
In the last equality, we have used an identity that is an immediate consequence of
the definition (use property (ii) in the definition with A = Ω).
\[ E(E(Y|\mathcal{F})) = E(Y) \tag{4.1.5} \]
Conditional expectation also has properties, like (4.1.5), that have no analogue for
“ordinary” expectation.
Theorem 4.1.5. If $\mathcal{F} \subset \mathcal{G}$ and $E(X|\mathcal{G}) \in \mathcal{F}$ then $E(X|\mathcal{F}) = E(X|\mathcal{G})$.
Proof. By assumption $E(X|\mathcal{G}) \in \mathcal{F}$. To check the other part of the definition we note
that if $A \in \mathcal{F} \subset \mathcal{G}$ then
\[ \int_A X\,dP = \int_A E(X|\mathcal{G})\,dP \]
Theorem 4.1.6. If $\mathcal{F}_1 \subset \mathcal{F}_2$ then (i) $E(E(X|\mathcal{F}_1)|\mathcal{F}_2) = E(X|\mathcal{F}_1)$
(ii) $E(E(X|\mathcal{F}_2)|\mathcal{F}_1) = E(X|\mathcal{F}_1)$.
In words, the smaller σ-field always wins. As the proof will show, the first equality
is trivial. The second is easy to prove, but in combination with Theorem 4.1.7 is
a powerful tool for computing conditional expectations. I have seen it used several
times to prove results that are false.
Proof. Once we notice that $E(X|\mathcal{F}_1) \in \mathcal{F}_2$, (i) follows from Example 4.1.1. To prove
(ii), notice that $E(X|\mathcal{F}_1) \in \mathcal{F}_1$, and if $A \in \mathcal{F}_1 \subset \mathcal{F}_2$ then
\[ \int_A E(X|\mathcal{F}_1)\,dP = \int_A X\,dP = \int_A E(X|\mathcal{F}_2)\,dP \]
Exercise 4.1.5. Give an example on Ω = {a, b, c} in which
\[ E(E(X|\mathcal{F}_1)|\mathcal{F}_2) \ne E(E(X|\mathcal{F}_2)|\mathcal{F}_1) \]
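When the σ-fields are nested, as Theorem 4.1.6 requires, both iterated conditional expectations can be computed exactly on a finite space. The sketch below is our own illustration in Python: the helper `cond_exp`, the four-point sample space, and the values of X are assumptions chosen for the example, with each σ-field represented by the partition that generates it.

```python
from fractions import Fraction

def cond_exp(prob, X, partition):
    """E(X | sigma(partition)) on a finite sample space, as a dict outcome -> value."""
    out = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        avg = sum(X[w] * prob[w] for w in block) / p
        for w in block:
            out[w] = avg
    return out

# Hypothetical setup: four equally likely outcomes and nested partitions.
prob = {w: Fraction(1, 4) for w in "abcd"}
X = {"a": Fraction(1), "b": Fraction(3), "c": Fraction(5), "d": Fraction(11)}
F1 = [{"a", "b"}, {"c", "d"}]        # coarser partition
F2 = [{"a"}, {"b"}, {"c", "d"}]      # finer partition, so sigma(F1) is inside sigma(F2)

inner_first = cond_exp(prob, cond_exp(prob, X, F1), F2)   # E(E(X|F1)|F2)
outer_first = cond_exp(prob, cond_exp(prob, X, F2), F1)   # E(E(X|F2)|F1)
target = cond_exp(prob, X, F1)                            # E(X|F1)
assert inner_first == target and outer_first == target
print(target["a"], target["c"])
```

Both orders of conditioning return the conditional expectation given the coarser partition, in line with “the smaller σ-field always wins.” The sketch deliberately uses nested partitions, so it says nothing about the non-nested situation that Exercise 4.1.5 asks about.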
The next result shows that for conditional expectation with respect to $\mathcal{F}$, random
variables $X \in \mathcal{F}$ are like constants. They can be brought outside the “integral.”
Theorem 4.1.7. If $X \in \mathcal{F}$ and $E|Y|, E|XY| < \infty$ then
\[ E(XY|\mathcal{F}) = X E(Y|\mathcal{F}). \]
Proof. The right-hand side $\in \mathcal{F}$, so we have to check (ii). To do this, we use the usual
four-step procedure. First, suppose $X = 1_B$ with $B \in \mathcal{F}$. In this case, if $A \in \mathcal{F}$
\[ \int_A 1_B E(Y|\mathcal{F})\,dP = \int_{A \cap B} E(Y|\mathcal{F})\,dP = \int_{A \cap B} Y\,dP = \int_A 1_B Y\,dP \]
so (ii) holds. The last result extends to simple X by linearity. If X, Y ≥ 0, let $X_n$
be simple random variables that ↑ X, and use the monotone convergence theorem to
conclude that
\[ \int_A X E(Y|\mathcal{F})\,dP = \int_A XY\,dP \]
To prove the result in general, split X and Y into their positive and negative parts.
Exercise 4.1.6. Show that when $E|X|$, $E|Y|$, and $E|XY|$ are finite, each statement
implies the next one and give examples with X, Y ∈ {−1, 0, 1} a.s. that show the
reverse implications are false: (i) X and Y are independent, (ii) $E(Y|X) = EY$, (iii)
E(XY) = EX EY.
Theorem 4.1.8. Suppose $EX^2 < \infty$. $E(X|\mathcal{F})$ is the variable $Y \in \mathcal{F}$ that minimizes
the “mean square error” $E(X - Y)^2$.
Remark. This result gives a “geometric interpretation” of $E(X|\mathcal{F})$. $L^2(\mathcal{F}_o) = \{Y \in \mathcal{F}_o : EY^2 < \infty\}$ is a Hilbert space, and $L^2(\mathcal{F})$ is a closed subspace. In this case,
$E(X|\mathcal{F})$ is the projection of X onto $L^2(\mathcal{F})$, i.e., the point in the subspace closest to
X.
[Figure 4.1: Conditional expectation as projection in $L^2$.]
Proof. We begin by observing that if $Z \in L^2(\mathcal{F})$, then Theorem 4.1.7 implies
\[ Z E(X|\mathcal{F}) = E(ZX|\mathcal{F}) \]
($E|XZ| < \infty$ by the Cauchy-Schwarz inequality.) Taking expected values gives
\[ E(Z E(X|\mathcal{F})) = E(E(ZX|\mathcal{F})) = E(ZX) \]
or, rearranging,
\[ E[Z(X - E(X|\mathcal{F}))] = 0 \quad\text{for } Z \in L^2(\mathcal{F}) \]
If $Y \in L^2(\mathcal{F})$ and $Z = E(X|\mathcal{F}) - Y$ then
\[ E(X - Y)^2 = E\{X - E(X|\mathcal{F}) + Z\}^2 = E\{X - E(X|\mathcal{F})\}^2 + EZ^2 \]
since the cross-product term vanishes. From the last formula, it is easy to see $E(X - Y)^2$
is minimized when Z = 0.
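The decomposition in the proof can be verified exactly on a small example. The following Python sketch is an illustration under our own assumptions: a fair die, a two-block partition generating the σ-field, and an arbitrary competing guess `other` that is constant on each block. The helper names `cond_exp` and `mse` are ours.

```python
from fractions import Fraction

def cond_exp(prob, X, partition):
    """E(X | sigma(partition)) on a finite sample space, as a dict outcome -> value."""
    out = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        avg = sum(X[w] * prob[w] for w in block) / p
        for w in block:
            out[w] = avg
    return out

def mse(prob, X, Y):
    """Mean square error E(X - Y)^2."""
    return sum(prob[w] * (X[w] - Y[w]) ** 2 for w in prob)

prob = {w: Fraction(1, 6) for w in range(1, 7)}
X = {w: Fraction(w) for w in range(1, 7)}
partition = [{1, 2}, {3, 4, 5, 6}]

best = cond_exp(prob, X, partition)
# A competing guess that is also constant on each block (values chosen arbitrarily).
other = {w: Fraction(1) if w <= 2 else Fraction(5) for w in prob}
Z = {w: best[w] - other[w] for w in prob}
EZ2 = sum(prob[w] * Z[w] ** 2 for w in prob)

# Orthogonality: E[Z (X - E(X|F))] = 0, so the cross-product term vanishes.
assert sum(prob[w] * Z[w] * (X[w] - best[w]) for w in prob) == 0
# Pythagorean identity from the proof.
assert mse(prob, X, other) == mse(prob, X, best) + EZ2
print(mse(prob, X, best), mse(prob, X, other))
```

Any other block-constant guess can be substituted for `other`; the identity shows its mean square error exceeds that of `best` by exactly $EZ^2$.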
Exercise 4.1.7. Show that if $\mathcal{G} \subset \mathcal{F}$ and $EX^2 < \infty$ then
\[ E(\{X - E(X|\mathcal{F})\}^2) + E(\{E(X|\mathcal{F}) - E(X|\mathcal{G})\}^2) = E(\{X - E(X|\mathcal{G})\}^2) \]
Dropping the second term on the left, we get an inequality that says geometrically,
the larger the subspace the closer the projection is, or statistically, more information
means a smaller mean square error. An important special case occurs when $\mathcal{G} = \{\emptyset, \Omega\}$.
Exercise 4.1.8. Let $\text{var}(X|\mathcal{F}) = E(X^2|\mathcal{F}) - E(X|\mathcal{F})^2$. Show that
\[ \text{var}(X) = E(\text{var}(X|\mathcal{F})) + \text{var}(E(X|\mathcal{F})) \]
Exercise 4.1.9. Let $Y_1, Y_2, \ldots$ be i.i.d. with mean µ and variance $\sigma^2$, N an independent positive integer valued r.v. with $EN^2 < \infty$ and $X = Y_1 + \cdots + Y_N$. Show that
$\text{var}(X) = \sigma^2 EN + \mu^2 \text{var}(N)$. To understand and help remember the formula, think
about the two special cases in which N or Y is constant.
Exercise 4.1.10. Show that if X and Y are random variables with $E(Y|\mathcal{G}) = X$ and
$EY^2 = EX^2 < \infty$, then X = Y a.s.
Exercise 4.1.11. The result in the last exercise implies that if $EY^2 < \infty$ and $E(Y|\mathcal{G})$
has the same distribution as Y, then $E(Y|\mathcal{G}) = Y$ a.s. Prove this under the assumption
$E|Y| < \infty$. Hint: The trick is to prove that $\text{sgn}(X) = \text{sgn}(E(X|\mathcal{G}))$ a.s., and then
take X = Y − c to get the desired result.

4.1.3 Regular Conditional Probabilities*

Let $(\Omega, \mathcal{F}, P)$ be a probability space, $X : (\Omega, \mathcal{F}) \to (S, \mathcal{S})$ a measurable map, and $\mathcal{G}$
a σ-field $\subset \mathcal{F}$. $\mu : \Omega \times \mathcal{S} \to [0, 1]$ is said to be a regular conditional distribution
for X given $\mathcal{G}$ if
(i) For each A, $\omega \to \mu(\omega, A)$ is a version of $P(X \in A|\mathcal{G})$.
(ii) For a.e. ω, $A \to \mu(\omega, A)$ is a probability measure on $(S, \mathcal{S})$.
When S = Ω and X is the identity map, µ is called a regular conditional probability.
Exercise 4.1.12. Continuation of Example 1.4. Suppose X and Y have a joint
density f(x, y) > 0. Let
\[ \mu(y, A) = \int_A f(x, y)\,dx \Big/ \int f(x, y)\,dx \]
Show that µ(Y(ω), A) is a r.c.d. for X given σ(Y).
Regular conditional distributions are useful because they allow us to simultaneously compute the conditional expectation of all functions of X and to generalize
properties of ordinary expectation in a more straightforward way.
Exercise 4.1.13. Let µ(ω, A) be a r.c.d. for X given $\mathcal{F}$, and let $f : (S, \mathcal{S}) \to (\mathbb{R}, \mathcal{R})$
have $E|f(X)| < \infty$. Start with simple functions and show that
\[ E(f(X)|\mathcal{F}) = \int \mu(\omega, dx) f(x) \quad\text{a.s.} \]
Exercise 4.1.14. Use regular conditional probability to get the conditional Hölder
inequality from the unconditional one, i.e., show that if p, q ∈ (1, ∞) with 1/p + 1/q = 1 then
\[ E(|XY| \mid \mathcal{G}) \le E(|X|^p \mid \mathcal{G})^{1/p} E(|Y|^q \mid \mathcal{G})^{1/q} \]
Unfortunately, r.c.d.’s do not always exist. The first example was due to Dieudonné
(1948). See Doob (1953), p. 624, or Faden (1985) for more recent developments.
Without going into the details of the example, it is easy to see the source of the
problem. If $A_1, A_2, \ldots$ are disjoint, then (4.1.1) and (4.1.3) imply
\[ P(X \in \cup_n A_n|\mathcal{G}) = \sum_n P(X \in A_n|\mathcal{G}) \quad\text{a.s.} \]
but if $\mathcal{S}$ contains enough countable collections of disjoint sets, the exceptional sets
may pile up. Fortunately,
Theorem 4.1.9. r.c.d.’s exist if $(S, \mathcal{S})$ is nice.
Proof. By definition, there is a 1-1 map $\varphi : S \to \mathbb{R}$ so that ϕ and $\varphi^{-1}$ are measurable.
Using monotonicity (4.1.2) and throwing away a countable collection of null sets, we
find there is a set $\Omega_o$ with $P(\Omega_o) = 1$ and a family of random variables G(q, ω), $q \in \mathbb{Q}$
so that q → G(q, ω) is nondecreasing and ω → G(q, ω) is a version of $P(\varphi(X) \le q|\mathcal{G})$.
Let $F(x, \omega) = \inf\{G(q, \omega) : q > x\}$. The notation may remind the reader of the proof
of (2.5) in Chapter 2. The argument given there shows F is a distribution function.
Since $G(q_n, \omega) \downarrow F(x, \omega)$, the remark after Theorem 4.1.2 implies that F(x, ω) is a
version of $P(\varphi(X) \le x|\mathcal{G})$.
Now, for each $\omega \in \Omega_o$, there is a unique measure ν(ω, ·) on $(\mathbb{R}, \mathcal{R})$ so that
$\nu(\omega, (-\infty, x]) = F(x, \omega)$. To check that for each $B \in \mathcal{R}$, ν(ω, B) is a version of
$P(\varphi(X) \in B|\mathcal{G})$, we observe that the class of B for which this statement is true (this
includes the measurability of ω → ν(ω, B)) is a λ-system that contains all sets of the
form $(a_1, b_1] \cup \cdots \cup (a_k, b_k]$ where $-\infty \le a_i < b_i \le \infty$, so the desired result follows from
the π − λ theorem. To extract the desired r.c.d., notice that if $A \in \mathcal{S}$ and B = ϕ(A),
then $B = (\varphi^{-1})^{-1}(A) \in \mathcal{R}$, and set µ(ω, A) = ν(ω, B).
The following generalization of Theorem 4.1.9 will be needed in Section 5.1.
Exercise 4.1.15. Suppose X and Y take values in a nice space $(S, \mathcal{S})$ and $\mathcal{G} = \sigma(Y)$.
There is a function $\mu : S \times \mathcal{S} \to [0, 1]$ so that
(i) for each A, µ(Y(ω), A) is a version of $P(X \in A|\mathcal{G})$
(ii) for a.e. ω, A → µ(Y(ω), A) is a probability measure on $(S, \mathcal{S})$.

4.2 Martingales, Almost Sure Convergence

In this section we will define martingales and their cousins supermartingales and submartingales, and take the first steps in developing their theory. Let $\mathcal{F}_n$ be a filtration,
i.e., an increasing sequence of σ-fields. A sequence $X_n$ is said to be adapted to $\mathcal{F}_n$ if
$X_n \in \mathcal{F}_n$ for all n. If $X_n$ is a sequence with
(i) $E|X_n| < \infty$,
(ii) $X_n$ is adapted to $\mathcal{F}_n$,
(iii) $E(X_{n+1}|\mathcal{F}_n) = X_n$ for all n,
then X is said to be a martingale (with respect to $\mathcal{F}_n$). If in the last definition, =
is replaced by ≤ or ≥, then X is said to be a supermartingale or submartingale,
respectively.
Example 4.2.1. Simple random walk. Consider the successive tosses of a fair
coin and let $\xi_n = 1$ if the nth toss is heads and $\xi_n = -1$ if the nth toss is tails. Let
$X_n = \xi_1 + \cdots + \xi_n$ and $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$ for n ≥ 1, $X_0 = 0$ and $\mathcal{F}_0 = \{\emptyset, \Omega\}$. I
claim that $X_n$, n ≥ 0, is a martingale with respect to $\mathcal{F}_n$. To prove this, we observe
that $X_n \in \mathcal{F}_n$, $E|X_n| < \infty$, and $\xi_{n+1}$ is independent of $\mathcal{F}_n$, so using the linearity of
conditional expectation, (4.1.1), and Example 4.1.2,
\[ E(X_{n+1}|\mathcal{F}_n) = E(X_n|\mathcal{F}_n) + E(\xi_{n+1}|\mathcal{F}_n) = X_n + E\xi_{n+1} = X_n \]
Note that, in this example, $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$ and $\mathcal{F}_n$ is the smallest filtration that
$X_n$ is adapted to. In what follows, when the filtration is not mentioned, we will take
$\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$.
Exercise 4.2.1. Suppose $X_n$ is a martingale w.r.t. $\mathcal{G}_n$ and let $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$.
Then $\mathcal{G}_n \supset \mathcal{F}_n$ and $X_n$ is a martingale w.r.t. $\mathcal{F}_n$.
Example 4.2.2. Superharmonic functions. If the coin tosses considered above
have $P(\xi_n = 1) \le 1/2$ then the computation just completed shows $E(X_{n+1}|\mathcal{F}_n) \le X_n$, i.e., $X_n$ is a supermartingale. In this case, $X_n$ corresponds to betting on an
unfavorable game so there is nothing “super” about a supermartingale. The name
comes from the fact that if f is superharmonic (i.e., f has continuous derivatives of
order ≤ 2 and $\partial^2 f/\partial x_1^2 + \cdots + \partial^2 f/\partial x_d^2 \le 0$), then
\[ f(x) \ge \frac{1}{|B(0, r)|} \int_{B(x,r)} f(y)\,dy \]
where $B(x, r) = \{y : |x - y| \le r\}$ is the ball of radius r, and $|B(0, r)|$ is the volume of
the ball of radius r.
Exercise 4.2.2. Suppose f is superharmonic on $\mathbb{R}^d$. Let $\xi_1, \xi_2, \ldots$ be i.i.d. uniform
on B(0, 1), and define $S_n$ by $S_n = S_{n-1} + \xi_n$ for n ≥ 1 and $S_0 = x$. Show that
$X_n = f(S_n)$ is a supermartingale.
Our ﬁrst result is an immediate consequence of the deﬁnition of a supermartingale.
We could take the conclusion of the result as the deﬁnition of supermartingale, but
then the deﬁnition would be harder to check.
Theorem 4.2.1. If $X_n$ is a supermartingale then for n > m, $E(X_n|\mathcal{F}_m) \le X_m$.
Proof. The definition gives the result for n = m + 1. Suppose n = m + k with k ≥ 2.
By Theorem 4.1.6,
\[ E(X_{m+k}|\mathcal{F}_m) = E(E(X_{m+k}|\mathcal{F}_{m+k-1})|\mathcal{F}_m) \le E(X_{m+k-1}|\mathcal{F}_m) \]
by the definition and (4.1.2). The desired result now follows by induction.
Theorem 4.2.2. (i) If $X_n$ is a submartingale then for n > m, $E(X_n|\mathcal{F}_m) \ge X_m$.
(ii) If $X_n$ is a martingale then for n > m, $E(X_n|\mathcal{F}_m) = X_m$.
Proof. To prove (i), note that $-X_n$ is a supermartingale and use (4.1.1). For (ii),
observe that $X_n$ is a supermartingale and a submartingale.
Remark. The idea in the proof of Theorem 4.2.2 can be used many times below. To
keep from repeating ourselves, we will just state the result for either supermartingales
or submartingales and leave it to the reader to translate the result for the other two.
Theorem 4.2.3. If $X_n$ is a martingale w.r.t. $\mathcal{F}_n$ and ϕ is a convex function with
$E|\varphi(X_n)| < \infty$ for all n then $\varphi(X_n)$ is a submartingale w.r.t. $\mathcal{F}_n$. Consequently, if
p ≥ 1 and $E|X_n|^p < \infty$ for all n, then $|X_n|^p$ is a submartingale w.r.t. $\mathcal{F}_n$.
Proof. By Jensen’s inequality and the definition
\[ E(\varphi(X_{n+1})|\mathcal{F}_n) \ge \varphi(E(X_{n+1}|\mathcal{F}_n)) = \varphi(X_n) \]
Theorem 4.2.4. If $X_n$ is a submartingale w.r.t. $\mathcal{F}_n$ and ϕ is an increasing convex
function with $E|\varphi(X_n)| < \infty$ for all n, then $\varphi(X_n)$ is a submartingale w.r.t. $\mathcal{F}_n$.
Consequently (i) If $X_n$ is a submartingale then $(X_n - a)^+$ is a submartingale. (ii) If
$X_n$ is a supermartingale then $X_n \wedge a$ is a supermartingale.
Proof. By Jensen’s inequality and the assumptions
\[ E(\varphi(X_{n+1})|\mathcal{F}_n) \ge \varphi(E(X_{n+1}|\mathcal{F}_n)) \ge \varphi(X_n) \]
Exercise 4.2.3. Give an example of a submartingale $X_n$ so that $X_n^2$ is a supermartingale. Hint: $X_n$ does not have to be random.
Let $\mathcal{F}_n$, n ≥ 0 be a filtration. $H_n$, n ≥ 1 is said to be a predictable sequence if
$H_n \in \mathcal{F}_{n-1}$ for all n ≥ 1. In words, the value of $H_n$ may be predicted (with certainty)
from the information available at time n − 1. In this section, we will be thinking of
Hn as the amount of money a gambler will bet at time n. This can be based on the
outcomes at times 1, . . . , n − 1 but not on the outcome at time n!
Once we start thinking of Hn as a gambling system, it is natural to ask how much
money we would make if we used it. For concreteness, let us suppose that the game
consists of ﬂipping a coin and that for each dollar you bet you win one dollar when
the coin comes up heads and lose your dollar when the coin comes up tails. Let Xn
be the net amount of money you would have won at time n if you had bet one dollar
each time. If you bet according to a gambling system H then your winnings at time
n would be
\[ (H \cdot X)_n = \sum_{m=1}^n H_m (X_m - X_{m-1}) \]
since $X_m - X_{m-1} = +1$ or −1 when the mth toss results in a win or loss, respectively.
Let $\xi_m = X_m - X_{m-1}$. A famous gambling system called the “martingale” is
defined by $H_1 = 1$ and for n ≥ 2, $H_n = 2H_{n-1}$ if $\xi_{n-1} = -1$ and $H_n = 1$ if $\xi_{n-1} = 1$.
In words, we double our bet when we lose, so that if we lose k times and then win,
our net winnings will be $-1 - 2 - \cdots - 2^{k-1} + 2^k = 1$. This system seems to provide us
with a “sure thing” as long as $P(\xi_m = 1) > 0$. However, the next result says there is
no system for beating an unfavorable game.
Theorem 4.2.5. Let $X_n$, n ≥ 0, be a supermartingale. If $H_n \ge 0$ is predictable and
each $H_n$ is bounded then $(H \cdot X)_n$ is a supermartingale.
Proof. Using the fact that conditional expectation is linear, $(H \cdot X)_n \in \mathcal{F}_n$, $H_n \in \mathcal{F}_{n-1}$, and Theorem 4.1.7, we have
\[ E((H \cdot X)_{n+1}|\mathcal{F}_n) = (H \cdot X)_n + E(H_{n+1}(X_{n+1} - X_n)|\mathcal{F}_n) = (H \cdot X)_n + H_{n+1} E((X_{n+1} - X_n)|\mathcal{F}_n) \le (H \cdot X)_n \]
since $E((X_{n+1} - X_n)|\mathcal{F}_n) \le 0$ and $H_{n+1} \ge 0$.
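For a finite number of tosses, the expected winnings of the doubling system can be computed exactly by enumerating every sequence of outcomes. The Python sketch below is our own illustration of Theorem 4.2.5 in that setting; the function names and the choice of eight tosses are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def doubling_winnings(xis):
    """(H . X)_n for the doubling ("martingale") system on steps xis in {+1, -1}."""
    total, bet = 0, 1                       # H_1 = 1
    for xi in xis:
        total += bet * xi                   # H_m * (X_m - X_{m-1})
        bet = 2 * bet if xi == -1 else 1    # next bet uses only past tosses
    return total

def expected_winnings(n, p):
    """Exact E(H . X)_n when each step is +1 with probability p (a Fraction)."""
    expect = Fraction(0)
    for xis in product((1, -1), repeat=n):
        weight = Fraction(1)
        for xi in xis:
            weight *= p if xi == 1 else 1 - p
        expect += weight * doubling_winnings(xis)
    return expect

print(doubling_winnings((-1, -1, -1, 1)))      # lose three times, then win
print(expected_winnings(8, Fraction(1, 2)))    # fair coin
print(expected_winnings(8, Fraction(2, 5)))    # unfavorable coin
```

With a fair coin the expected winnings after a fixed number of tosses are exactly zero, and with an unfavorable coin they are negative, even though the system wins one unit on every losing streak that is eventually broken.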
Remark. The same result is obviously true for submartingales and for martingales
(in the last case, without the restriction Hn ≥ 0).
The notion of a stopping time, introduced in Section 3.1, is closely related to
the concept of a gambling system. Recall that a random variable N is said to be a
stopping time if $\{N = n\} \in \mathcal{F}_n$ for all n < ∞. If you think of N as the time a
gambler stops gambling, then the condition above says that the decision to stop at
time n must be measurable with respect to the information he has at that time. If we
let $H_n = 1_{\{N \ge n\}}$, then $\{N \ge n\} = \{N \le n - 1\}^c \in \mathcal{F}_{n-1}$, so $H_n$ is predictable, and it
follows from Theorem 4.2.5 that $(H \cdot X)_n = X_{N \wedge n} - X_0$ is a supermartingale. Since the
constant sequence $Y_n = X_0$ is a supermartingale and the sum of two supermartingales
is also, we have:
Theorem 4.2.6. If N is a stopping time and $X_n$ is a supermartingale, then $X_{N \wedge n}$
is a supermartingale.
Although you cannot make money with gambling systems, you can prove theorems
with them. Suppose $X_n$, n ≥ 0, is a submartingale. Let a < b, let $N_0 = -1$, and for
k ≥ 1 let
\[ N_{2k-1} = \inf\{m > N_{2k-2} : X_m \le a\} \]
\[ N_{2k} = \inf\{m > N_{2k-1} : X_m \ge b\} \]
The $N_j$ are stopping times and $\{N_{2k-1} < m \le N_{2k}\} = \{N_{2k-1} \le m - 1\} \cap \{N_{2k} \le m - 1\}^c \in \mathcal{F}_{m-1}$, so
\[ H_m = \begin{cases} 1 & \text{if } N_{2k-1} < m \le N_{2k} \text{ for some } k \\ 0 & \text{otherwise} \end{cases} \]
defines a predictable sequence. $X(N_{2k-1}) \le a$ and $X(N_{2k}) \ge b$, so between times
$N_{2k-1}$ and $N_{2k}$, $X_m$ crosses from below a to above b. $H_m$ is a gambling system that
tries to take advantage of these “upcrossings.” In stock market terms, we buy when
$X_m \le a$ and sell when $X_m \ge b$, so every time an upcrossing is completed, we make
a profit of ≥ (b − a). Finally, $U_n = \sup\{k : N_{2k} \le n\}$ is the number of upcrossings
completed by time n.
Theorem 4.2.7. Upcrossing inequality. If $X_m$, m ≥ 0, is a submartingale then
\[ (b - a) E U_n \le E(X_n - a)^+ - E(X_0 - a)^+ \]
[Figure 4.2: Upcrossings of (a, b). Lines indicate increments that are included in
$(H \cdot X)_n$. In $Y_n$ the points < a are moved up to a.]
Proof. Let $Y_m = a + (X_m - a)^+$. By Theorem 4.2.4, $Y_m$ is a submartingale. Clearly,
it upcrosses [a, b] the same number of times that $X_m$ does, and we have $(b - a) U_n \le (H \cdot Y)_n$, since each upcrossing results in a profit ≥ (b − a) and a final incomplete
upcrossing (if there is one) makes a nonnegative contribution to the right-hand side.
Let $K_m = 1 - H_m$. Clearly, $Y_n - Y_0 = (H \cdot Y)_n + (K \cdot Y)_n$, and it follows from
Theorem 4.2.5 that $E(K \cdot Y)_n \ge E(K \cdot Y)_0 = 0$ so $E(H \cdot Y)_n \le E(Y_n - Y_0)$, proving
the desired inequality.
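The definitions of $N_{2k-1}$, $N_{2k}$ and $U_n$ translate directly into a small state machine, and for simple random walk the two sides of the upcrossing inequality can then be computed exactly by enumeration. The Python sketch below is our own illustration; the function names, the walk started at 0, and the choices a = −1, b = 1, n = 10 are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def upcrossings(path, a, b):
    """Number of upcrossings of [a, b] completed by path[0..n] (U_n in the text)."""
    count, below = 0, False
    for x in path:
        if not below and x <= a:
            below = True          # an odd-indexed time N_{2k-1}: path at or below a
        elif below and x >= b:
            below = False         # an even-indexed time N_{2k}: upcrossing completed
            count += 1
    return count

def both_sides(n, a, b):
    """(b - a) E U_n and E(X_n - a)^+ - E(X_0 - a)^+ for simple random walk from 0."""
    weight = Fraction(1, 2 ** n)
    lhs = rhs = Fraction(0)
    for steps in product((1, -1), repeat=n):
        path = [0]
        for s in steps:
            path.append(path[-1] + s)
        lhs += weight * (b - a) * upcrossings(path, a, b)
        rhs += weight * (max(path[-1] - a, 0) - max(path[0] - a, 0))
    return lhs, rhs

lhs, rhs = both_sides(10, -1, 1)
print(lhs, rhs)
assert lhs <= rhs   # the upcrossing inequality for this martingale
```

Symmetric simple random walk is a martingale, hence a submartingale, so the theorem guarantees the final assertion; the enumeration only confirms it for one choice of a, b, and n.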
We have proved the result in its classical form, even though this is a little misleading. The key fact is that $E(K \cdot Y)_n \ge 0$, i.e., no matter how hard you try you can’t
lose money betting on a submartingale. From the upcrossing inequality, we easily get
Theorem 4.2.8. Martingale convergence theorem. If $X_n$ is a submartingale
with $\sup E X_n^+ < \infty$ then as n → ∞, $X_n$ converges a.s. to a limit X with $E|X| < \infty$.
Proof. Since $(X - a)^+ \le X^+ + |a|$, Theorem 4.2.7 implies that
\[ E U_n \le (|a| + E X_n^+)/(b - a) \]
As n ↑ ∞, $U_n \uparrow U$ the number of upcrossings of [a, b] by the whole sequence, so if
$\sup E X_n^+ < \infty$ then EU < ∞ and hence U < ∞ a.s. Since the last conclusion holds
for all rational a and b,
\[ \cup_{a,b \in \mathbb{Q}} \{\liminf X_n < a < b < \limsup X_n\} \quad\text{has probability 0} \]
and hence $\limsup X_n = \liminf X_n$ a.s., i.e., $\lim X_n$ exists a.s. Fatou’s lemma guarantees $EX^+ \le \liminf E X_n^+ < \infty$, so X < ∞ a.s. To see X > −∞, we observe
that
\[ E X_n^- = E X_n^+ - E X_n \le E X_n^+ - E X_0 \]
(since $X_n$ is a submartingale), so another application of Fatou’s lemma shows
\[ E X^- \le \liminf_{n\to\infty} E X_n^- \le \sup_n E X_n^+ - E X_0 < \infty \]
and completes the proof.
Remark. To prepare for the proof of (6.1), the reader should note that we have
shown that if the number of upcrossings of (a, b) by Xn is ﬁnite for all a, b ∈ Q, then
the limit of Xn exists.
An important special case of Theorem 4.2.8 is
Theorem 4.2.9. If $X_n \ge 0$ is a supermartingale then as $n \to \infty$, $X_n \to X$ a.s. and $EX \le EX_0$.
Proof. $Y_n = -X_n \le 0$ is a submartingale with $EY_n^+ = 0$. Since $EX_0 \ge EX_n$, the inequality follows from Fatou's lemma.
In the next section, we will give several applications of the last two results. We close this one by giving two "counterexamples."
Example 4.2.3. The ﬁrst shows that the assumptions of Theorem 4.2.9 (or 4.2.8)
do not guarantee convergence in L1 . Let Sn be a symmetric simple random walk with
S0 = 1, i.e., Sn = Sn−1 + ξn where ξ1 , ξ2 , . . . are i.i.d. with P (ξi = 1) = P (ξi = −1) =
1/2. Let N = inf {n : Sn = 0} and let Xn = SN ∧n . Theorem 4.2.6 implies that Xn is
a nonnegative martingale. Theorem 4.2.9 implies Xn converges to a limit X∞ < ∞
that must be ≡ 0, since convergence to k > 0 is impossible. (If Xn = k > 0 then
Xn+1 = k ± 1.) Since EXn = EX0 = 1 for all n and X∞ = 0, convergence cannot
occur in L1 .
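The failure of $L^1$ convergence in Example 4.2.3 can be seen in exact arithmetic. The sketch below is an illustration only (the function name `stopped_walk_distribution` is hypothetical). It computes the law of $X_n = S_{N \wedge n}$ by stepping the absorbed walk forward with rational probabilities: the mean stays equal to 1 while the mass at 0 keeps growing.

```python
from fractions import Fraction


def stopped_walk_distribution(n):
    """Exact law of X_n = S_{N ^ n} for simple random walk with S_0 = 1,
    absorbed the first time it hits 0. Returns {value: probability}."""
    dist = {1: Fraction(1)}
    for _ in range(n):
        new = {}
        for state, p in dist.items():
            if state == 0:
                # already absorbed: stay at 0
                new[0] = new.get(0, 0) + p
            else:
                # fair +1 / -1 step
                for nxt in (state - 1, state + 1):
                    new[nxt] = new.get(nxt, 0) + p / 2
        dist = new
    return dist


law = stopped_walk_distribution(50)
mean = sum(k * p for k, p in law.items())   # equals E X_50
mass_at_zero = law[0]                        # equals P(N <= 50)
```

Since $X_\infty = 0$, we have $E|X_n - X_\infty| = EX_n = 1$ for every $n$, which is what the exact mean reports.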
Example 4.2.3 is an important counterexample to keep in mind as you read the
rest of this chapter. The next two are not as important.
Example 4.2.4. We will now give an example of a martingale with $X_k \to 0$ in probability but not a.s. Let $X_0 = 0$. When $X_{k-1} = 0$, let $X_k = 1$ or $-1$ with probability $1/(2k)$ each and $= 0$ with probability $1 - 1/k$. When $X_{k-1} \ne 0$, let $X_k = kX_{k-1}$ with probability $1/k$ and $= 0$ with probability $1 - 1/k$. From the construction, $P(X_k = 0) = 1 - 1/k$ so $X_k \to 0$ in probability. On the other hand, the second Borel-Cantelli lemma implies $P(X_k = 0 \text{ for } k \ge K) = 0$, and values in $(-1, 1) - \{0\}$ are impossible, so $X_k$ does not converge to 0 a.s.
Exercise 4.2.4. Give an example of a martingale Xn with Xn → −∞ a.s. Hint: Let
Xn = ξ1 + · · · + ξn , where the ξi are independent (but not identically distributed)
with Eξi = 0.
Our ﬁnal result is useful in reducing questions about submartingales to questions
about martingales.
Theorem 4.2.10. Doob’s decomposition. Any submartingale Xn , n ≥ 0, can be
written in a unique way as Xn = Mn + An , where Mn is a martingale and An is a
predictable increasing sequence with A0 = 0.
Proof. We want $X_n = M_n + A_n$, $E(M_n \mid \mathcal{F}_{n-1}) = M_{n-1}$, and $A_n \in \mathcal{F}_{n-1}$. So we must have
$$E(X_n \mid \mathcal{F}_{n-1}) = E(M_n \mid \mathcal{F}_{n-1}) + E(A_n \mid \mathcal{F}_{n-1}) = M_{n-1} + A_n = X_{n-1} - A_{n-1} + A_n$$
and it follows that
(a) $A_n - A_{n-1} = E(X_n \mid \mathcal{F}_{n-1}) - X_{n-1}$
(b) $M_n = X_n - A_n$
Now $A_0 = 0$ and $M_0 = X_0$ by assumption, so we have $A_n$ and $M_n$ defined for all time, and we have proved uniqueness. To check that our recipe works, we observe that $A_n - A_{n-1} \ge 0$ since $X_n$ is a submartingale and induction shows $A_n \in \mathcal{F}_{n-1}$. To see that $M_n$ is a martingale, we use (b), $A_n \in \mathcal{F}_{n-1}$ and (a):
$$E(M_n \mid \mathcal{F}_{n-1}) = E(X_n - A_n \mid \mathcal{F}_{n-1}) = E(X_n \mid \mathcal{F}_{n-1}) - A_n = X_{n-1} - A_{n-1} = M_{n-1}$$
which completes the proof.
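Formulas (a) and (b) are a recipe that can be carried out explicitly when the conditional expectation is a finite average. The sketch below assumes a setting that is not in the text: $X_k$ is a function `x` of the first $k$ steps of a fair $\pm 1$ coin-tossing path, so $E(X_n \mid \mathcal{F}_{n-1})$ is the average over the two possible next steps. The name `doob_decomposition` and the two example submartingales are illustrative choices.

```python
from fractions import Fraction


def doob_decomposition(x, path):
    """Doob decomposition along one path of fair +1/-1 steps.

    x(steps) returns X_k for a tuple of k steps. Returns lists (M, A)
    with X_k = M[k] + A[k], built from formulas (a) and (b).
    """
    A = [Fraction(0)]
    M = [Fraction(x(()))]
    for n in range(1, len(path) + 1):
        prefix = tuple(path[: n - 1])
        # E(X_n | F_{n-1}) is the average over the two equally likely next steps
        cond_mean = Fraction(x(prefix + (1,)) + x(prefix + (-1,)), 2)
        A.append(A[-1] + cond_mean - x(prefix))           # formula (a)
        M.append(Fraction(x(tuple(path[:n]))) - A[-1])    # formula (b)
    return M, A


def square(steps):
    return sum(steps) ** 2      # X_n = S_n^2


def abs_walk(steps):
    return abs(sum(steps))      # X_n = |S_n|


M_sq, A_sq = doob_decomposition(square, (1, 1, -1, 1))
M_abs, A_abs = doob_decomposition(abs_walk, (1, -1, -1, 1))
```

For $X_n = S_n^2$ the sketch returns $A_n = n$, so $M_n = S_n^2 - n$. For $X_n = |S_n|$ the increasing part grows by one exactly at the steps that start from 0.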
Exercise 4.2.5. Let $X_n = \sum_{m \le n} 1_{B_m}$ and suppose $B_n \in \mathcal{F}_n$. What is the Doob decomposition for $X_n$?

Exercises

4.2.6. Let $\xi_1, \xi_2, \ldots$ be independent with $E\xi_i = 0$ and $\text{var}(\xi_m) = \sigma_m^2 < \infty$, and let $s_n^2 = \sum_{m=1}^n \sigma_m^2$. Then $S_n^2 - s_n^2$ is a martingale.
4.2.7. If $\xi_1, \xi_2, \ldots$ are independent and have $E\xi_i = 0$ then
$$X_n^{(k)} = \sum_{1 \le i_1 < \cdots < i_k \le n} \xi_{i_1} \cdots \xi_{i_k}$$
is a martingale. When $k = 2$ and $S_n = \xi_1 + \cdots + \xi_n$, $2X_n^{(2)} = S_n^2 - \sum_{m \le n} \xi_m^2$.
4.2.8. Generalize (i) of Theorem 4.2.4 by showing that if $X_n$ and $Y_n$ are submartingales w.r.t. $\mathcal{F}_n$ then $X_n \vee Y_n$ is also.
4.2.9. Let $Y_1, Y_2, \ldots$ be nonnegative i.i.d. random variables with $EY_m = 1$ and $P(Y_m = 1) < 1$. (i) Show that $X_n = \prod_{m \le n} Y_m$ defines a martingale. (ii) Use Theorem 4.2.9 and an argument by contradiction to show $X_n \to 0$ a.s. (iii) Use the strong law of large numbers to conclude $(1/n) \log X_n \to c < 0$.
4.2.10. Suppose $y_n > -1$ for all $n$ and $\sum |y_n| < \infty$. Show that $\prod_{m=1}^\infty (1 + y_m)$ exists.
4.2.11. Let $X_n$ and $Y_n$ be positive integrable and adapted to $\mathcal{F}_n$. Suppose
$$E(X_{n+1} \mid \mathcal{F}_n) \le (1 + Y_n) X_n$$
with $\sum Y_n < \infty$ a.s. Prove that $X_n$ converges a.s. to a finite limit by finding a closely related supermartingale to which Theorem 4.2.9 can be applied.
4.2.12. Use the random walks in Exercise 4.2.2 to conclude that in $d \le 2$, nonnegative superharmonic functions must be constant. The example $f(x) = |x|^{2-d}$ shows this is false in $d > 2$.
4.2.13. The switching principle. Suppose $X_n^1$ and $X_n^2$ are supermartingales with respect to $\mathcal{F}_n$, and $N$ is a stopping time so that $X_N^1 \ge X_N^2$. Then
$$Y_n = X_n^1 1_{(N > n)} + X_n^2 1_{(N \le n)} \quad \text{is a supermartingale.}$$
$$Z_n = X_n^1 1_{(N \ge n)} + X_n^2 1_{(N < n)} \quad \text{is a supermartingale.}$$
4.2.14. Dubins' inequality. For every positive supermartingale $X_n$, $n \ge 0$, the number of upcrossings $U$ of $[a, b]$ satisfies
$$P(U \ge k) \le \left(\frac{a}{b}\right)^k E \min(X_0/a, 1)$$
To prove this, we let $N_0 = -1$ and for $j \ge 1$ let
$$N_{2j-1} = \inf\{m > N_{2j-2} : X_m \le a\}$$
$$N_{2j} = \inf\{m > N_{2j-1} : X_m \ge b\}$$
Let $Y_n = 1$ for $0 \le n < N_1$ and for $j \ge 1$
$$Y_n = \begin{cases} (b/a)^{j-1} (X_n/a) & \text{for } N_{2j-1} \le n < N_{2j} \\ (b/a)^j & \text{for } N_{2j} \le n < N_{2j+1} \end{cases}$$
(i) Use the switching principle in the previous exercise and induction to show that $Z_n^j = Y_{n \wedge N_j}$ is a supermartingale. (ii) Use $EY_{n \wedge N_{2k}} \le EY_0$ and let $n \to \infty$ to get Dubins' inequality.

4.3 Examples

In this section, we will apply the martingale convergence theorem to generalize the second Borel-Cantelli lemma and to study Polya's urn scheme, Radon-Nikodym derivatives, and branching processes. The four topics are independent of each other and are taken up in the order indicated.

4.3.1 Bounded Increments

Our first result shows that martingales with bounded increments either converge or oscillate between $+\infty$ and $-\infty$.
Theorem 4.3.1. Let $X_1, X_2, \ldots$ be a martingale with $|X_{n+1} - X_n| \le M < \infty$. Let
$$C = \{\lim X_n \text{ exists and is finite}\}$$
$$D = \{\limsup X_n = +\infty \text{ and } \liminf X_n = -\infty\}$$
Then $P(C \cup D) = 1$.
Proof. Since Xn − X0 is a martingale, we can without loss of generality suppose that
X0 = 0. Let 0 < K < ∞ and let N = inf {n : Xn ≤ −K }. Xn∧N is a martingale with
Xn∧N ≥ −K −M a.s. so applying Theorem 4.2.9 to Xn∧N +K +M shows lim Xn exists
on {N = ∞}. Letting K → ∞, we see that the limit exists on {lim inf Xn > −∞}.
Applying the last conclusion to −Xn , we see that lim Xn exists on {lim sup Xn < ∞}
and the proof is complete.
Exercise 4.3.1. Let $X_n$, $n \ge 0$, be a submartingale with $\sup X_n < \infty$. Let $\xi_n = X_n - X_{n-1}$ and suppose $E(\sup \xi_n^+) < \infty$. Show that $X_n$ converges a.s.
Exercise 4.3.2. Give an example of a martingale $X_n$ with $\sup_n |X_n| < \infty$ and $P(X_n = a \text{ i.o.}) = 1$ for $a = -1, 0, 1$. This example shows that it is not enough to have $\sup |X_{n+1} - X_n| < \infty$ in Theorem 4.3.1.
Exercise 4.3.3. (Assumes familiarity with ﬁnite state Markov chains.) Fine tune
the example for the previous problem so that P (Xn = 0) → 1 − 2p and P (Xn = −1),
P (Xn = 1) → p, where p is your favorite number in (0, 1), i.e., you are asked to do
this for one value of p that you may choose. This example shows that a martingale
can converge in distribution without converging a.s. (or in probability).
Exercise 4.3.4. Let $X_n$ and $Y_n$ be positive integrable and adapted to $\mathcal{F}_n$. Suppose $E(X_{n+1} \mid \mathcal{F}_n) \le X_n + Y_n$, with $\sum Y_n < \infty$ a.s. Prove that $X_n$ converges a.s. to a finite limit. Hint: Let $N = \inf\{k : \sum_{m=1}^k Y_m > M\}$, and stop your supermartingale at time $N$.
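Theorem 4.3.2. Second Borel-Cantelli lemma, II. Let $\mathcal{F}_n$, $n \ge 0$ be a filtration with $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $A_n$, $n \ge 1$ a sequence of events with $A_n \in \mathcal{F}_n$. Then
$$\{A_n \text{ i.o.}\} = \left\{ \sum_{n=1}^\infty P(A_n \mid \mathcal{F}_{n-1}) = \infty \right\}$$
Proof. If we let $X_0 = 0$ and $X_n = \sum_{m=1}^n \left( 1_{A_m} - P(A_m \mid \mathcal{F}_{m-1}) \right)$ for $n \ge 1$ then $X_n$ is a martingale with $|X_n - X_{n-1}| \le 1$. Using the notation of Theorem 4.3.1 we have:
$$\text{on } C, \quad \sum_{n=1}^\infty 1_{A_n} = \infty \ \text{ if and only if } \ \sum_{n=1}^\infty P(A_n \mid \mathcal{F}_{n-1}) = \infty$$
$$\text{on } D, \quad \sum_{n=1}^\infty 1_{A_n} = \infty \ \text{ and } \ \sum_{n=1}^\infty P(A_n \mid \mathcal{F}_{n-1}) = \infty$$
Since $P(C \cup D) = 1$, the result follows.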
Exercise 4.3.5. Let $p_m \in [0, 1)$. Use the Borel-Cantelli lemmas to show that
$$\prod_{m=1}^\infty (1 - p_m) = 0 \quad \text{if and only if} \quad \sum_{m=1}^\infty p_m = \infty.$$
Exercise 4.3.6. Show $\sum_{n=2}^\infty P(A_n \mid \cap_{m=1}^{n-1} A_m^c) = \infty$ implies $P(\cap_{m=1}^\infty A_m^c) = 0$.

4.3.2 Polya's Urn Scheme

An urn contains $r$ red and $g$ green balls. At each time we draw a ball out, then replace it, and add $c$ more balls of the color drawn. Let $X_n$ be the fraction of green balls after the $n$th draw. To check that $X_n$ is a martingale, note that if there are $i$ red balls and $j$ green balls at time $n$, then
$$X_{n+1} = \begin{cases} (j + c)/(i + j + c) & \text{with probability } j/(i + j) \\ j/(i + j + c) & \text{with probability } i/(i + j) \end{cases}$$
and we have
$$\frac{j + c}{i + j + c} \cdot \frac{j}{i + j} + \frac{j}{i + j + c} \cdot \frac{i}{i + j} = \frac{(j + c + i)j}{(i + j + c)(i + j)} = \frac{j}{i + j}$$
Since $X_n \ge 0$, Theorem 4.2.9 implies that $X_n \to X_\infty$ a.s. To compute the distribution of the limit, we observe (a) the probability of getting green on the first $m$ draws then red on the next $\ell = n - m$ draws is
$$\frac{g}{g + r} \cdot \frac{g + c}{g + r + c} \cdots \frac{g + (m-1)c}{g + r + (m-1)c} \cdot \frac{r}{g + r + mc} \cdots \frac{r + (\ell - 1)c}{g + r + (n-1)c}$$
and (b) any other outcome of the first $n$ draws with $m$ green balls drawn and $\ell$ red balls drawn has the same probability since the denominator remains the same and the numerator is permuted. Consider the special case $c = 1$, $g = 1$, $r = 1$. Let $G_n$ be the number of green balls after the $n$th draw has been completed and the new ball has been added. It follows from (a) and (b) that
$$P(G_n = m + 1) = \binom{n}{m} \frac{m!(n-m)!}{(n+1)!} = \frac{1}{n+1}$$
so $X_\infty$ has a uniform distribution on (0,1).
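The counting argument in (a) and (b) can be cross-checked by stepping the urn forward with exact rational probabilities. The following sketch is an illustration only (the function name `polya_green_distribution` is hypothetical). It returns the law of the number of green draws among the first $n$, from which both the uniform law for $c = g = r = 1$ and the martingale identity $EX_n = g/(g + r)$ can be verified.

```python
from fractions import Fraction


def polya_green_distribution(g, r, c, n):
    """Exact law of the number of green draws in the first n draws of a
    Polya urn started with g green and r red balls, adding c per draw."""
    dist = {0: Fraction(1)}          # key: green draws so far
    for t in range(n):
        new = {}
        total = g + r + t * c        # balls in the urn before draw t + 1
        for m, p in dist.items():
            green = g + m * c        # green balls in the urn right now
            new[m + 1] = new.get(m + 1, 0) + p * Fraction(green, total)
            new[m] = new.get(m, 0) + p * Fraction(total - green, total)
        dist = new
    return dist


uniform_case = polya_green_distribution(1, 1, 1, 6)

# Mean fraction of green balls after 5 draws for an arbitrary urn (g=3, r=2, c=4).
urn = polya_green_distribution(3, 2, 4, 5)
mean_fraction = sum(p * Fraction(3 + m * 4, 3 + 2 + 5 * 4) for m, p in urn.items())
```

With $c = g = r = 1$ every value of $m$ from 0 to $n$ has probability $1/(n+1)$, matching the display above, and `mean_fraction` equals the initial fraction $3/5$.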
If we suppose that $c = 1$, $g = 2$, and $r = 1$, then
$$P(G_n = m + 2) = \frac{n!}{m!(n-m)!} \cdot \frac{(m+1)!(n-m)!}{(n+2)!/2} \to 2x$$
if $n \to \infty$ and $m/n \to x$. In general, the distribution of $X_\infty$ has density
$$\frac{\Gamma((g + r)/c)}{\Gamma(g/c)\Gamma(r/c)} \, x^{(g/c) - 1} (1 - x)^{(r/c) - 1}$$
This is the beta distribution with parameters $g/c$ and $r/c$. In Example 4.4.5 we will see that the limit behavior changes drastically if, in addition to the $c$ balls of the color chosen, we always add one ball of the opposite color.

4.3.3 Radon-Nikodym Derivatives

Let $\mu$ be a finite measure and $\nu$ a probability measure on $(\Omega, \mathcal{F})$. Let $\mathcal{F}_n \uparrow \mathcal{F}$ be $\sigma$-fields (i.e., $\sigma(\cup \mathcal{F}_n) = \mathcal{F}$). Let $\mu_n$ and $\nu_n$ be the restrictions of $\mu$ and $\nu$ to $\mathcal{F}_n$.
Theorem 4.3.3. Suppose $\mu_n \ll \nu_n$ for all $n$. Let $X_n = d\mu_n/d\nu_n$ and let $X = \limsup X_n$. Then
$$\mu(A) = \int_A X \, d\nu + \mu(A \cap \{X = \infty\})$$
Remark. $\mu_r(A) \equiv \int_A X \, d\nu$ is a measure $\ll \nu$. Since Theorem 4.2.9 implies $\nu(X = \infty) = 0$, $\mu_s(A) \equiv \mu(A \cap \{X = \infty\})$ is singular w.r.t. $\nu$. Thus $\mu = \mu_r + \mu_s$ gives the Lebesgue decomposition of $\mu$ (see Theorem A.8.5), and $X_\infty = d\mu_r/d\nu$, $\nu$-a.s. Here and in the proof we need to keep track of the measure to which the a.s. refers.
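Proof. As the reader can probably anticipate:
Lemma 4.3.4. $X_n$ (defined on $(\Omega, \mathcal{F}, \nu)$) is a martingale w.r.t. $\mathcal{F}_n$.
Proof. We observe that, by definition, $X_n \in \mathcal{F}_n$. Let $A \in \mathcal{F}_n$. Since $X_n \in \mathcal{F}_n$ and $\nu_n$ is the restriction of $\nu$ to $\mathcal{F}_n$
$$\int_A X_n \, d\nu = \int_A X_n \, d\nu_n$$
Using the definition of $X_n$ and Exercise A.8.7
$$\int_A X_n \, d\nu_n = \mu_n(A) = \mu(A)$$
the last equality holding since $A \in \mathcal{F}_n$ and $\mu_n$ is the restriction of $\mu$ to $\mathcal{F}_n$. If $A \in \mathcal{F}_{m-1} \subset \mathcal{F}_m$, using the last result for $n = m$ and $n = m - 1$ gives
$$\int_A X_m \, d\nu = \mu(A) = \int_A X_{m-1} \, d\nu$$
so $E(X_m \mid \mathcal{F}_{m-1}) = X_{m-1}$.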
Since $X_n$ is a nonnegative martingale, Theorem 4.2.9 implies that $X_n \to X$ $\nu$-a.s. We want to check that the equality in the theorem holds. Dividing $\mu(A)$ by $\mu(\Omega)$, we can without loss of generality suppose $\mu$ is a probability measure. Let $\rho = (\mu + \nu)/2$, $\rho_n = (\mu_n + \nu_n)/2 =$ the restriction of $\rho$ to $\mathcal{F}_n$. Let $Y_n = d\mu_n/d\rho_n$, $Z_n = d\nu_n/d\rho_n$. $Y_n, Z_n \ge 0$ and $Y_n + Z_n = 2$ (by Exercise A.8.6), so $Y_n$ and $Z_n$ are bounded martingales with limits $Y$ and $Z$. As the reader can probably guess,
$$(\ast) \qquad Y = d\mu/d\rho \qquad Z = d\nu/d\rho$$
It suffices to prove the first equality. From the proof of Lemma 4.3.4, if $A \in \mathcal{F}_m \subset \mathcal{F}_n$
$$\mu(A) = \int_A Y_n \, d\rho \to \int_A Y \, d\rho$$
by the bounded convergence theorem. The last computation shows that
$$\mu(A) = \int_A Y \, d\rho \quad \text{for all } A \in \mathcal{G} = \cup_m \mathcal{F}_m$$
$\mathcal{G}$ is a $\pi$-system, so the $\pi$-$\lambda$ theorem implies the equality is valid for all $A \in \mathcal{F} = \sigma(\mathcal{G})$ and $(\ast)$ is proved.
It follows from Exercises A.8.8 and A.8.9 that $X_n = Y_n/Z_n$. At this point, the reader can probably leap to the conclusion that $X = Y/Z$. To get there carefully, note $Y + Z = 2$ $\rho$-a.s., so $\rho(Y = 0, Z = 0) = 0$. Having ruled out $0/0$ we have $X = Y/Z$ $\rho$-a.s. (Recall $X \equiv \limsup X_n$.) Let $W = (1/Z) \cdot 1_{(Z > 0)}$. Using $(\ast)$, then $1 = ZW + 1_{(Z = 0)}$, we have
$$\text{(a)} \qquad \mu(A) = \int_A Y \, d\rho = \int_A YWZ \, d\rho + \int_A 1_{(Z = 0)} Y \, d\rho$$
Now $(\ast)$ implies $d\nu = Z \, d\rho$, and it follows from the definitions that
$$YW = X 1_{(Z > 0)} = X \quad \nu\text{-a.s.}$$
the second equality holding since $\nu(\{Z = 0\}) = 0$. Combining things, we have
$$\text{(b)} \qquad \int_A YWZ \, d\rho = \int_A X \, d\nu$$
To handle the other term, we note that $(\ast)$ implies $d\mu = Y \, d\rho$, and it follows from the definitions that $\{X = \infty\} = \{Z = 0\}$ $\mu$-a.s. so
$$\text{(c)} \qquad \int_A 1_{(Z = 0)} Y \, d\rho = \int_A 1_{(X = \infty)} \, d\mu$$
Combining (a), (b), and (c) gives the desired result.
Example 4.3.1. Suppose $\mathcal{F}_n = \sigma(I_{k,n} : 0 \le k < K_n)$ where for each $n$, $I_{k,n}$ is a partition of $\Omega$, and the $(n+1)$th partition is a refinement of the $n$th. In this case, the condition $\mu_n \ll \nu_n$ is $\nu(I_{k,n}) = 0$ implies $\mu(I_{k,n}) = 0$, and the martingale $X_n = \mu(I_{k,n})/\nu(I_{k,n})$ on $I_{k,n}$ is an approximation to the Radon-Nikodym derivative. For a concrete example, consider $\Omega = [0, 1)$, $I_{k,n} = [k2^{-n}, (k+1)2^{-n})$ for $0 \le k < 2^n$, and $\nu =$ Lebesgue measure.
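The dyadic case can be computed directly. The sketch below assumes a specific measure that is not in the text: $\mu$ has distribution function $F(t) = t^2$ on $[0, 1)$, i.e., density $2x$ with respect to Lebesgue measure. The name `dyadic_martingale` is likewise hypothetical. It lists the value of $X_n$ on each $I_{k,n}$, so one can see both the averaging property behind the martingale and the approach to the density.

```python
from fractions import Fraction


def dyadic_martingale(F, n):
    """Values of X_n = mu(I_{k,n}) / nu(I_{k,n}) for k = 0, ..., 2^n - 1,
    where mu([0, t)) = F(t) and nu is Lebesgue measure on [0, 1)."""
    width = Fraction(1, 2**n)
    return [(F((k + 1) * width) - F(k * width)) / width for k in range(2**n)]


def F(t):
    # Assumed example: mu has density 2x on [0, 1).
    return t * t


x4 = dyadic_martingale(F, 4)
x5 = dyadic_martingale(F, 5)

# Value of X_10 on the dyadic interval containing the point 1/3.
point = Fraction(1, 3)
x10_at_point = dyadic_martingale(F, 10)[int(point * 2**10)]
```

Each level-4 value is the average of its two level-5 children, which is the martingale property under Lebesgue measure, and `x10_at_point` is within $2^{-10}$ of the density value $2/3$.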
Exercise 4.3.7. Check by direct computation that the $X_n$ in Example 4.3.1 is a martingale. Show that if we drop the condition $\mu_n \ll \nu_n$ and set $X_n = 0$ when $\nu(I_{k,n}) = 0$, then $E(X_{n+1} \mid \mathcal{F}_n) \le X_n$.
Exercise 4.3.8. Apply Theorem 4.3.3 to Example 4.3.1 to get a "probabilistic" proof of the Radon-Nikodym theorem. To be precise, suppose $\mathcal{F}$ is countably generated (i.e., there is a sequence of sets $A_n$ so that $\mathcal{F} = \sigma(A_n : n \ge 1)$) and show that if $\mu$ and $\nu$ are $\sigma$-finite measures and $\mu \ll \nu$, then there is a function $g$ so that $\mu(A) = \int_A g \, d\nu$.
Remark. Before you object to this as circular reasoning (the Radon-Nikodym theorem was used to define conditional expectation!), observe that the conditional expectations that are needed for Example 4.3.1 have elementary definitions.
Kakutani dichotomy for infinite product measures. Let $\mu$ and $\nu$ be measures on sequence space $(\mathbb{R}^{\mathbb{N}}, \mathcal{R}^{\mathbb{N}})$ that make the coordinates $\xi_n(\omega) = \omega_n$ independent. Let $F_n(x) = \mu(\xi_n \le x)$, $G_n(x) = \nu(\xi_n \le x)$. Suppose $F_n \ll G_n$ and let $q_n = dF_n/dG_n$. Let $\mathcal{F}_n = \sigma(\xi_m : m \le n)$, let $\mu_n$ and $\nu_n$ be the restrictions of $\mu$ and $\nu$ to $\mathcal{F}_n$, and let
$$X_n = \frac{d\mu_n}{d\nu_n} = \prod_{m=1}^n q_m.$$
Theorem 4.3.3 implies that $X_n \to X$ $\nu$-a.s. $\sum_{m=1}^\infty \log(q_m) > -\infty$ is a tail event, so the Kolmogorov 0-1 law implies
$$\nu(X = 0) \in \{0, 1\} \qquad (4.3.1)$$
and it follows from Theorem 4.3.3 that either $\mu \ll \nu$ or $\mu \perp \nu$. The next result gives a concrete criterion for which of the two alternatives occurs.
Theorem 4.3.5. $\mu \ll \nu$ or $\mu \perp \nu$, according as $\prod_{m=1}^\infty \int \sqrt{q_m} \, dG_m > 0$ or $= 0$.
Proof. Jensen's inequality and Exercise A.8.7 imply
$$\left( \int \sqrt{q_m} \, dG_m \right)^2 \le \int q_m \, dG_m = \int dF_m = 1$$
so the infinite product of the integrals is well defined and $\le 1$. Let
$$X_n = \prod_{m \le n} q_m(\omega_m)$$
as above, and recall that $X_n \to X$ $\nu$-a.s. If the infinite product is 0 then
$$\int X_n^{1/2} \, d\nu = \prod_{m=1}^n \int \sqrt{q_m} \, dG_m \to 0$$
Fatou's lemma implies
$$\int X^{1/2} \, d\nu \le \liminf_{n \to \infty} \int X_n^{1/2} \, d\nu = 0$$
so $X = 0$ $\nu$-a.s., and Theorem 4.3.3 implies $\mu \perp \nu$. To prove the other direction, let $Y_n = X_n^{1/2}$. Now $\int q_m \, dG_m = 1$, so if we use $E$ to denote expected value with respect to $\nu$, then $EY_m^2 = EX_m = 1$, so
$$E(Y_{n+k} - Y_n)^2 = E(X_{n+k} + X_n - 2X_n^{1/2} X_{n+k}^{1/2}) = 2\left(1 - \prod_{m=n+1}^{n+k} \int \sqrt{q_m} \, dG_m\right)$$
Now $|a - b| = |a^{1/2} - b^{1/2}| \cdot (a^{1/2} + b^{1/2})$, so using Cauchy-Schwarz and the fact $(a + b)^2 \le 2a^2 + 2b^2$ gives
$$E|X_{n+k} - X_n| = E(|Y_{n+k} - Y_n|(Y_{n+k} + Y_n)) \le \left( E(Y_{n+k} - Y_n)^2 \, E(Y_{n+k} + Y_n)^2 \right)^{1/2} \le \left( 4E(Y_{n+k} - Y_n)^2 \right)^{1/2}$$
From the last two equations, it follows that if the infinite product is $> 0$, then $X_n$ converges to $X$ in $L^1(\nu)$, so $\nu(X = 0) < 1$, (4.3.1) implies the probability is 0, and the desired result follows from Theorem 4.3.3.
Bernoulli product measures. For the next three exercises, suppose $F_n$, $G_n$ are concentrated on $\{0, 1\}$ and have $F_n(0) = 1 - \alpha_n$, $G_n(0) = 1 - \beta_n$.
Exercise 4.3.9. (i) Use Theorem 4.3.5 to find a necessary and sufficient condition for $\mu \ll \nu$. (ii) Suppose that $0 < \epsilon \le \alpha_n, \beta_n \le 1 - \epsilon < 1$. Show that in this case the condition is simply $\sum (\alpha_n - \beta_n)^2 < \infty$.
Exercise 4.3.10. Show that if $\sum \alpha_n < \infty$ and $\sum \beta_n = \infty$ in the previous exercise then $\mu \perp \nu$. This shows that the condition $\sum (\alpha_n - \beta_n)^2 < \infty$ is not sufficient for $\mu \ll \nu$ in general.
Exercise 4.3.11. Suppose $0 < \alpha_n, \beta_n < 1$. Show that $\sum |\alpha_n - \beta_n| < \infty$ is sufficient for $\mu \ll \nu$ in general.

4.3.4 Branching Processes

Let $\xi_i^n$, $i, n \ge 1$, be i.i.d. nonnegative integer-valued random variables. Define a sequence $Z_n$, $n \ge 0$ by $Z_0 = 1$ and
$$Z_{n+1} = \begin{cases} \xi_1^{n+1} + \cdots + \xi_{Z_n}^{n+1} & \text{if } Z_n > 0 \\ 0 & \text{if } Z_n = 0 \end{cases} \qquad (4.3.2)$$
$Z_n$ is called a Galton-Watson process. The idea behind the definitions is that $Z_n$ is the number of individuals in the $n$th generation, and each member of the $n$th generation gives birth independently to an identically distributed number of children. $p_k = P(\xi_i^n = k)$ is called the offspring distribution.
Lemma 4.3.6. Let $\mathcal{F}_n = \sigma(\xi_i^m : i \ge 1, 1 \le m \le n)$ and $\mu = E\xi_i^m \in (0, \infty)$. Then $Z_n/\mu^n$ is a martingale w.r.t. $\mathcal{F}_n$.
Proof. Clearly, $Z_n \in \mathcal{F}_n$.
$$E(Z_{n+1} \mid \mathcal{F}_n) = \sum_{k=1}^\infty E(Z_{n+1} 1_{\{Z_n = k\}} \mid \mathcal{F}_n)$$
by the linearity of conditional expectation, (4.1.1), and the monotone convergence theorem, (4.1.3). On $\{Z_n = k\}$, $Z_{n+1} = \xi_1^{n+1} + \cdots + \xi_k^{n+1}$, so the sum is
$$\sum_{k=1}^\infty E((\xi_1^{n+1} + \cdots + \xi_k^{n+1}) 1_{\{Z_n = k\}} \mid \mathcal{F}_n) = \sum_{k=1}^\infty 1_{\{Z_n = k\}} E(\xi_1^{n+1} + \cdots + \xi_k^{n+1} \mid \mathcal{F}_n)$$
by Theorem 4.1.7. Since each $\xi_j^{n+1}$ is independent of $\mathcal{F}_n$, the last expression
$$= \sum_{k=1}^\infty 1_{\{Z_n = k\}} k\mu = \mu Z_n$$
Dividing both sides by $\mu^{n+1}$ now gives the desired result.
Remark. The reader should notice that in the proof of Lemma 4.3.6 we broke things down according to the value of $Z_n$ to get rid of the random index. A simpler way of doing the last argument (that we will use in the future) is to use Exercise 4.1.1 to conclude that on $\{Z_n = k\}$
$$E(Z_{n+1} \mid \mathcal{F}_n) = E(\xi_1^{n+1} + \cdots + \xi_k^{n+1} \mid \mathcal{F}_n) = k\mu = \mu Z_n$$
$Z_n/\mu^n$ is a nonnegative martingale, so Theorem 4.2.9 implies $Z_n/\mu^n \to$ a limit a.s. We begin by identifying cases when the limit is trivial.
Theorem 4.3.7. If $\mu < 1$ then $Z_n = 0$ for all $n$ sufficiently large, so $Z_n/\mu^n \to 0$.
Proof. $E(Z_n/\mu^n) = E(Z_0) = 1$, so $E(Z_n) = \mu^n$. Now $Z_n \ge 1$ on $\{Z_n > 0\}$ so
$$P(Z_n > 0) \le E(Z_n; Z_n > 0) = E(Z_n) = \mu^n \to 0$$
exponentially fast if $\mu < 1$.
The last answer should be intuitive. If each individual on the average gives birth
to less than one child, the species will die out. The next result shows that after we
exclude the trivial case in which each individual has exactly one child, the same result
holds when µ = 1.
Theorem 4.3.8. If $\mu = 1$ and $P(\xi_i^m = 1) < 1$ then $Z_n = 0$ for all $n$ sufficiently large.
Proof. When $\mu = 1$, $Z_n$ is itself a nonnegative martingale. Since $Z_n$ is integer valued and by Theorem 4.2.9 converges to an a.s. finite limit $Z_\infty$, we must have $Z_n = Z_\infty$ for large $n$. If $P(\xi_i^m = 1) < 1$ and $k > 0$ then $P(Z_n = k \text{ for all } n \ge N) = 0$ for any $N$, so we must have $Z_\infty \equiv 0$.
Figure 4.3: Generating function for Binomial(3,1/2).
When $\mu \le 1$, the limit of $Z_n/\mu^n$ is 0 because the branching process dies out. Our next step is to show that if $\mu > 1$ then $P(Z_n > 0 \text{ for all } n) > 0$. For $s \in [0, 1]$, let $\varphi(s) = \sum_{k \ge 0} p_k s^k$ where $p_k = P(\xi_i^m = k)$. $\varphi$ is the generating function for the offspring distribution $p_k$.
Theorem 4.3.9. $P(Z_n = 0 \text{ for some } n) = \rho$, the unique fixed point of $\varphi$ in $[0, 1)$.
Proof. Differentiating and referring to Theorem A.9.2 for the justification gives for $s < 1$
$$\varphi'(s) = \sum_{k=1}^\infty k p_k s^{k-1} \ge 0$$
$$\varphi''(s) = \sum_{k=2}^\infty k(k-1) p_k s^{k-2} \ge 0$$
So $\varphi$ is increasing and convex, and $\lim_{s \uparrow 1} \varphi'(s) = \sum_{k=1}^\infty k p_k = \mu$. Our interest in $\varphi$ stems from the following facts.
(a) If $\theta_m = P(Z_m = 0)$ then $\theta_m = \sum_{k=0}^\infty p_k (\theta_{m-1})^k$.
Proof of (a). If $Z_1 = k$, an event with probability $p_k$, then $Z_m = 0$ if and only if all $k$ families die out in the remaining $m - 1$ units of time, an independent event with probability $\theta_{m-1}^k$. Summing over the disjoint possibilities for each $k$ gives the desired result.
(b) If $\varphi'(1) = \mu > 1$ there is a unique $\rho < 1$ so that $\varphi(\rho) = \rho$.
Proof of (b). $\varphi(0) \ge 0$, $\varphi(1) = 1$, and $\varphi'(1) > 1$, so $\varphi(1 - \epsilon) < 1 - \epsilon$ for small $\epsilon$. The last two observations imply the existence of a fixed point. To see it is unique, observe that $\mu > 1$ implies $p_k > 0$ for some $k > 1$, so $\varphi''(\theta) > 0$ for $\theta > 0$. Since $\varphi$ is strictly convex, it follows that if $\rho < 1$ is a fixed point, then $\varphi(x) < x$ for $x \in (\rho, 1)$.
(c) As $m \uparrow \infty$, $\theta_m \uparrow \rho$.
Proof of (c). $\theta_0 = 0$, $\varphi(\rho) = \rho$, and $\varphi$ is increasing, so induction implies $\theta_m$ is increasing and $\theta_m \le \rho$. Let $\theta_\infty = \lim \theta_m$. Taking limits in $\theta_m = \varphi(\theta_{m-1})$, we see $\theta_\infty = \varphi(\theta_\infty)$. Since $\theta_\infty \le \rho$, it follows that $\theta_\infty = \rho$.
Combining (a)-(c) shows $P(Z_n = 0 \text{ for some } n) = \lim \theta_n = \rho < 1$ and proves Theorem 4.3.9.
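Facts (a) and (c) give a practical way to approximate $\rho$: iterate $\theta_m = \varphi(\theta_{m-1})$ from $\theta_0 = 0$. The sketch below uses an offspring distribution chosen only for illustration, $p_0 = 1/4$, $p_1 = 1/4$, $p_2 = 1/2$, which has $\mu = 5/4 > 1$ and fixed-point equation $2s^2 - 3s + 1 = 0$, so $\rho = 1/2$. The function name `extinction_probabilities` is hypothetical.

```python
def extinction_probabilities(p, m_max):
    """Return [theta_0, ..., theta_{m_max}] with theta_m = P(Z_m = 0),
    computed from theta_m = phi(theta_{m-1}) and theta_0 = 0.

    p[k] is the offspring probability p_k (finitely many nonzero terms).
    """
    def phi(s):
        # generating function of the offspring distribution
        return sum(pk * s**k for k, pk in enumerate(p))

    thetas = [0.0]
    for _ in range(m_max):
        thetas.append(phi(thetas[-1]))
    return thetas


# Assumed example distribution: p_0 = 1/4, p_1 = 1/4, p_2 = 1/2.
thetas = extinction_probabilities([0.25, 0.25, 0.5], 200)
rho_estimate = thetas[-1]
```

The iterates increase toward $1/2$. Since $\varphi'(1/2) = 3/4$ for this distribution, the error shrinks by roughly that factor per step.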
The last result shows that when µ > 1, the limit of Zn /µn has a chance of being
nonzero. The best result on this question is due to Kesten and Stigum:
Theorem 4.3.10. $W = \lim Z_n/\mu^n$ is not $\equiv 0$ if and only if $\sum p_k k \log k < \infty$.
For a proof, see Athreya and Ney (1972), p. 24–29. In the next section, we will show that $\sum k^2 p_k < \infty$ is sufficient for a nontrivial limit.
Exercise 4.3.12. Show that if $P(\lim Z_n/\mu^n = 0) < 1$ then it is $= \rho$ and hence
$$\{\lim Z_n/\mu^n > 0\} = \{Z_n > 0 \text{ for all } n\} \quad \text{a.s.}$$
Exercise 4.3.13. Galton and Watson, who invented the process that bears their names, were interested in the survival of family names. Suppose each family has exactly 3 children but coin flips determine their sex. In the 1800s, only male children kept the family name so following the male offspring leads to a branching process with $p_0 = 1/8$, $p_1 = 3/8$, $p_2 = 3/8$, $p_3 = 1/8$. Compute the probability $\rho$ that the family name will die out when $Z_0 = 1$.

4.4 Doob's Inequality, Convergence in $L^p$

We begin by proving a consequence of Theorem 4.2.6.
Theorem 4.4.1. If Xn is a submartingale and N is a stopping time with P (N ≤
k ) = 1 then
EX0 ≤ EXN ≤ EXk
Remark. Let Sn be a simple random walk with S0 = 1 and let N = inf {n : Sn = 0}.
(See Example 4.2.3 for more details.) ES0 = 1 > 0 = ESN so the ﬁrst inequality
need not hold for unbounded stopping times. In Section 4.7 we will give conditions
that guarantee EX0 ≤ EXN for unbounded N.
Proof. Theorem 4.2.6 implies XN ∧n is a submartingale, so it follows that
EX0 = EXN ∧0 ≤ EXN ∧k = EXN
To prove the other inequality, let Kn = 1{N <n} = 1{N ≤n−1} . Kn is predictable, so
Theorem 4.2.5 implies (K · X )n = Xn − XN ∧n is a submartingale and it follows that
EXk − EXN = E (K · X )k ≥ E (K · X )0 = 0
Exercise 4.4.1. Show that if j ≤ k then E (Xj ; N = j ) ≤ E (Xk ; N = j ) and sum
over j to get a second proof of EXN ≤ EXk .
Exercise 4.4.2. Generalize the proof of Theorem 4.4.1 to show that if Xn is a submartingale and M ≤ N are stopping times with P (N ≤ k ) = 1 then EXM ≤ EXN .
Exercise 4.4.3. Use the stopping times from Exercise 3.1.7 to strengthen the conclusion of the previous exercise to $E(X_N \mid \mathcal{F}_M) \ge X_M$.
We will see below that Theorem 4.4.1 is very useful. The ﬁrst indication of this is:
Theorem 4.4.2. Doob's inequality. Let $X_m$ be a submartingale,
$$\bar{X}_n = \max_{0 \le m \le n} X_m^+$$
$\lambda > 0$, and $A = \{\bar{X}_n \ge \lambda\}$. Then
$$\lambda P(A) \le EX_n 1_A \le EX_n^+$$
Proof. Let $N = \inf\{m : X_m \ge \lambda \text{ or } m = n\}$. Since $X_N \ge \lambda$ on $A$,
$$\lambda P(A) \le EX_N 1_A \le EX_n 1_A$$
The second inequality in this display follows from the fact that Theorem 4.4.1 implies $EX_N \le EX_n$ and we have $X_N = X_n$ on $A^c$. The second inequality in the theorem is trivial, so the proof is complete.
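As a sanity check on the statement, both inequalities can be verified exactly for a small case by enumerating paths. The sketch below assumes a setting chosen for illustration: the submartingale $X_m = S_m^2$ for simple random walk with $S_0 = 0$, over all $2^n$ equally likely paths. The name `doob_check` is hypothetical. It returns the three quantities $\lambda P(A)$, $EX_n 1_A$, and $EX_n^+$.

```python
from fractions import Fraction
from itertools import product


def doob_check(n, lam):
    """Exact values of (lam * P(A), E[X_n 1_A], E[X_n^+]) for X_m = S_m^2,
    where S is simple random walk from 0 and A = {max_{0<=m<=n} X_m >= lam}."""
    lhs = mid = rhs = Fraction(0)
    weight = Fraction(1, 2**n)           # each path has probability 2^{-n}
    for steps in product((1, -1), repeat=n):
        s, running_max = 0, 0            # X_0 = 0 is included in the maximum
        for step in steps:
            s += step
            running_max = max(running_max, s * s)
        x_n = s * s                      # X_n is already nonnegative
        rhs += weight * x_n
        if running_max >= lam:           # the path lies in A
            lhs += weight * lam
            mid += weight * x_n
    return lhs, mid, rhs


results = {lam: doob_check(10, lam) for lam in (1, 4, 9, 25)}
```

For each $\lambda$ the three values come out in increasing order, and the last one equals $ES_{10}^2 = 10$.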
Example 4.4.1. Random walks. If we let $S_n = \xi_1 + \cdots + \xi_n$ where the $\xi_m$ are independent and have $E\xi_m = 0$, $\sigma_m^2 = E\xi_m^2 < \infty$, then Theorem 4.2.3 implies $X_n = S_n^2$ is a submartingale. If we let $\lambda = x^2$ and apply Theorem 4.4.2 to $X_n$, we get Kolmogorov's maximal inequality, Theorem 1.8.2:
$$P\left( \max_{1 \le m \le n} |S_m| \ge x \right) \le x^{-2} \, \text{var}(S_n)$$
Using martingales, one can also prove a lower bound on the maximum that can be used instead of the central limit theorem in our proof of the necessity of the conditions in the three series theorem. (See Example 2.4.7.)
Exercise 4.4.4. Suppose in addition to the conditions introduced above that $|\xi_m| \le K$ and let $s_n^2 = \sum_{m \le n} \sigma_m^2$. Exercise 4.2.6 implies that $S_n^2 - s_n^2$ is a martingale. Use this and Theorem 4.4.1 to conclude
$$P\left( \max_{1 \le m \le n} |S_m| \le x \right) \le (x + K)^2 / \text{var}(S_n)$$
Exercise 4.4.5. Let $X_n$ be a martingale with $X_0 = 0$ and $EX_n^2 < \infty$. Show that
$$P\left( \max_{1 \le m \le n} X_m \ge \lambda \right) \le EX_n^2 / (EX_n^2 + \lambda^2)$$
Hint: Use the fact that $(X_n + c)^2$ is a submartingale and optimize over $c$.
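Integrating the inequality in Theorem 4.4.2 gives:
Theorem 4.4.3. $L^p$ maximum inequality. If $X_n$ is a submartingale then for $1 < p < \infty$,
$$E(\bar{X}_n^p) \le \left( \frac{p}{p-1} \right)^p E(X_n^+)^p$$
Consequently, if $Y_n$ is a martingale and $Y_n^* = \max_{0 \le m \le n} |Y_m|$,
$$E|Y_n^*|^p \le \left( \frac{p}{p-1} \right)^p E(|Y_n|^p)$$
Proof. The second inequality follows by applying the first to $X_n = |Y_n|$. To prove the first we will, for reasons that will become clear in a moment, work with $\bar{X}_n \wedge M$ rather than $\bar{X}_n$. Since $\{\bar{X}_n \wedge M \ge \lambda\}$ is always $\{\bar{X}_n \ge \lambda\}$ or $\emptyset$, this does not change the application of Theorem 4.4.2. Using Lemma 1.5.8, Theorem 4.4.2, Fubini's theorem, and a little calculus gives
$$E((\bar{X}_n \wedge M)^p) = \int_0^\infty p\lambda^{p-1} P(\bar{X}_n \wedge M \ge \lambda) \, d\lambda$$
$$\le \int_0^\infty p\lambda^{p-1} \left( \lambda^{-1} \int X_n^+ 1_{(\bar{X}_n \wedge M \ge \lambda)} \, dP \right) d\lambda$$
$$= \int X_n^+ \int_0^{\bar{X}_n \wedge M} p\lambda^{p-2} \, d\lambda \, dP$$
$$= \frac{p}{p-1} \int X_n^+ (\bar{X}_n \wedge M)^{p-1} \, dP$$
If we let $q = p/(p-1)$ be the exponent conjugate to $p$ and apply Hölder's inequality, Theorem 1.3.3, we see that the above
$$\le q \, (E|X_n^+|^p)^{1/p} (E|\bar{X}_n \wedge M|^p)^{1/q}$$
If we divide both sides of the last inequality by $(E|\bar{X}_n \wedge M|^p)^{1/q}$, which is finite because of the truncation at $M$, and raise both sides to the $p$th power, we get
$$E(|\bar{X}_n \wedge M|^p) \le \left( \frac{p}{p-1} \right)^p E(X_n^+)^p$$
Letting $M \to \infty$ and using the monotone convergence theorem gives the desired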
result.
Example 4.4.2. Theorem 4.4.3 is false when $p = 1$. Again, the counterexample is provided by Example 4.2.3. Let $S_n$ be a simple random walk starting from $S_0 = 1$, $N = \inf\{n : S_n = 0\}$, and $X_n = S_{N \wedge n}$. Theorem 4.4.1 implies $EX_n = ES_{N \wedge n} = ES_0 = 1$ for all $n$. Using hitting probabilities for simple random walk, (3.1.2) with $a = -1$, $b = M - 1$, we have
$$P\left( \max_m X_m \ge M \right) = \frac{1}{M}$$
so $E(\max_m X_m) = \sum_{M=1}^\infty P(\max_m X_m \ge M) = \sum_{M=1}^\infty 1/M = \infty$. The monotone convergence theorem implies that $E \max_{m \le n} X_m \uparrow \infty$ as $n \uparrow \infty$.
The next result gives an extension of Theorem 4.4.2 to p = 1. Since this is not one
of the most important results, the proof is left to the reader.
Theorem 4.4.4. Let $X_n$ be a submartingale and $\log^+ x = \max(\log x, 0)$.
$$E\bar{X}_n \le (1 - e^{-1})^{-1} \{1 + E(X_n^+ \log^+(X_n^+))\}$$
Remark. The last result is almost the best possible condition for $\sup |X_n| \in L^1$. Gundy has shown that if $X_n$ is a positive martingale that has $X_{n+1} \le CX_n$ and $EX_0 \log^+ X_0 < \infty$, then $E(\sup X_n) < \infty$ implies $\sup E(X_n \log^+ X_n) < \infty$. For a proof, see Neveu (1975) p. 71–73.
Exercise 4.4.6. Prove Theorem 4.4.4 by carrying out the following steps: (i) Imitate the proof of 4.4.2 but use the trivial bound $P(A) \le 1$ for $\lambda \le 1$ to show
$$E(\bar{X}_n \wedge M) \le 1 + \int X_n^+ \log(\bar{X}_n \wedge M) \, dP$$
(ii) Use calculus to show $a \log b \le a \log a + b/e \le a \log^+ a + b/e$.
Theorem 4.4.5. $L^p$ convergence theorem. If $X_n$ is a martingale with $\sup E|X_n|^p < \infty$ where $p > 1$, then $X_n \to X$ a.s. and in $L^p$.
Proof. $(EX_n^+)^p \le (E|X_n|)^p \le E|X_n|^p$, so it follows from the martingale convergence theorem (4.2.8) that $X_n \to X$ a.s. The second conclusion in Theorem 4.4.3 implies
$$E\left( \sup_{0 \le m \le n} |X_m| \right)^p \le \left( \frac{p}{p-1} \right)^p E|X_n|^p$$
Letting $n \to \infty$ and using the monotone convergence theorem implies $\sup |X_n| \in L^p$. Since $|X_n - X|^p \le (2 \sup |X_n|)^p$, it follows from the dominated convergence theorem that $E|X_n - X|^p \to 0$.
The most important special case of the results in this section occurs when p = 2.
To treat this case, the next two results are useful.
Theorem 4.4.6. Orthogonality of martingale increments. Let $X_n$ be a martingale with $EX_n^2 < \infty$ for all $n$. If $m \le n$ and $Y \in \mathcal{F}_m$ has $EY^2 < \infty$ then
$$E((X_n - X_m)Y) = 0$$
Proof. The Cauchy-Schwarz inequality implies $E|(X_n - X_m)Y| < \infty$. Using (4.1.5), Theorem 4.1.7, and the definition of a martingale,
$$E((X_n - X_m)Y) = E[E((X_n - X_m)Y \mid \mathcal{F}_m)] = E[Y E((X_n - X_m) \mid \mathcal{F}_m)] = 0$$
Theorem 4.4.7. Conditional variance formula. If $X_n$ is a martingale with $EX_n^2 < \infty$ for all $n$,
$$E((X_n - X_m)^2 \mid \mathcal{F}_m) = E(X_n^2 \mid \mathcal{F}_m) - X_m^2.$$
Remark. This is the conditional analogue of $E(X - EX)^2 = EX^2 - (EX)^2$ and is proved in exactly the same way.
Proof. Using the linearity of conditional expectation and then Theorem 4.1.7, we have
$$E(X_n^2 - 2X_n X_m + X_m^2 \mid \mathcal{F}_m) = E(X_n^2 \mid \mathcal{F}_m) - 2X_m E(X_n \mid \mathcal{F}_m) + X_m^2$$
$$= E(X_n^2 \mid \mathcal{F}_m) - 2X_m^2 + X_m^2$$
which gives the desired result.
Exercise 4.4.7. Let $X_n$ and $Y_n$ be martingales with $EX_n^2 < \infty$ and $EY_n^2 < \infty$.
$$EX_n Y_n - EX_0 Y_0 = \sum_{m=1}^n E(X_m - X_{m-1})(Y_m - Y_{m-1})$$
The next two results generalize Theorems 1.8.3 and 1.8.7 from Chapter 1. Let $X_n$, $n \ge 0$, be a martingale and let $\xi_n = X_n - X_{n-1}$ for $n \ge 1$.
Exercise 4.4.8. If $EX_0^2$, $\sum_{m=1}^\infty E\xi_m^2 < \infty$ then $X_n \to X_\infty$ a.s. and in $L^2$.
Exercise 4.4.9. If $b_m \uparrow \infty$ and $\sum_{m=1}^\infty E\xi_m^2 / b_m^2 < \infty$ then $X_n/b_n \to 0$ a.s. In particular, if $E\xi_n^2 \le K < \infty$ and $\sum_{m=1}^\infty b_m^{-2} < \infty$ then $X_n/b_n \to 0$ a.s.
Example 4.4.3. Branching processes. We continue the study begun at the end of the last section. Using the notation introduced there, we suppose $\mu = E(\xi_i^m) > 1$ and $\text{var}(\xi_i^m) = \sigma^2 < \infty$. Let $X_n = Z_n/\mu^n$. Taking $m = n - 1$ in Theorem 4.4.7 and rearranging, we have
$$E(X_n^2 \mid \mathcal{F}_{n-1}) = X_{n-1}^2 + E((X_n - X_{n-1})^2 \mid \mathcal{F}_{n-1})$$
To compute the second term, we observe
$$E((X_n - X_{n-1})^2 \mid \mathcal{F}_{n-1}) = E((Z_n/\mu^n - Z_{n-1}/\mu^{n-1})^2 \mid \mathcal{F}_{n-1}) = \mu^{-2n} E((Z_n - \mu Z_{n-1})^2 \mid \mathcal{F}_{n-1})$$
It follows from Exercise 4.1.1 that on $\{Z_{n-1} = k\}$,
$$E((Z_n - \mu Z_{n-1})^2 \mid \mathcal{F}_{n-1}) = E\left( \left( \sum_{i=1}^k \xi_i^n - \mu k \right)^2 \,\Big|\, \mathcal{F}_{n-1} \right) = k\sigma^2 = Z_{n-1}\sigma^2$$
Combining the last three equations gives
$$EX_n^2 = EX_{n-1}^2 + E(Z_{n-1}\sigma^2/\mu^{2n}) = EX_{n-1}^2 + \sigma^2/\mu^{n+1}$$
since $E(Z_{n-1}/\mu^{n-1}) = EZ_0 = 1$. Now $EX_0^2 = 1$, so $EX_1^2 = 1 + \sigma^2/\mu^2$, and induction gives
$$EX_n^2 = 1 + \sigma^2 \sum_{k=2}^{n+1} \mu^{-k}$$
This shows $\sup EX_n^2 < \infty$, so $X_n \to X$ in $L^2$, and hence $EX_n \to EX$. $EX_n = 1$ for all $n$, so $EX = 1$ and $X$ is not $\equiv 0$. It follows from Exercise 4.3.12 that $\{X > 0\} = \{Z_n > 0 \text{ for all } n\}$.

4.4.1 Square Integrable Martingales*

For the rest of this section, we will suppose
$$X_n \text{ is a martingale with } X_0 = 0 \text{ and } EX_n^2 < \infty \text{ for all } n$$
Theorem 4.2.3 implies $X_n^2$ is a submartingale. It follows from Doob's decomposition Theorem 4.2.10 that we can write $X_n^2 = M_n + A_n$, where $M_n$ is a martingale, and from formulas in Theorems 4.2.10 and 4.4.7 that
$$A_n = \sum_{m=1}^n \left( E(X_m^2 \mid \mathcal{F}_{m-1}) - X_{m-1}^2 \right) = \sum_{m=1}^n E((X_m - X_{m-1})^2 \mid \mathcal{F}_{m-1})$$
$A_n$ is called the increasing process associated with $X_n$. $A_n$ can be thought of as a path by path measurement of the variance at time $n$, and $A_\infty = \lim A_n$ as the total variance in the path. Theorems 4.4.9 and 4.4.10 describe the behavior of the martingale on $\{A_\infty < \infty\}$ and $\{A_\infty = \infty\}$, respectively. The key to the proof of the first result is the following:
Theorem 4.4.8. $E\left( \sup_m |X_m|^2 \right) \le 4EA_\infty$.
Proof. Applying the $L^2$ maximum inequality (Theorem 4.4.3) to $X_n$ gives
$$E\left( \sup_{0 \le m \le n} |X_m|^2 \right) \le 4EX_n^2 = 4EA_n$$
since $EX_n^2 = EM_n + EA_n$ and $EM_n = EM_0 = EX_0^2 = 0$. Using the monotone convergence theorem now gives the desired result.
Theorem 4.4.9. $\lim_{n \to \infty} X_n$ exists and is finite a.s. on $\{A_\infty < \infty\}$.
Proof. Let $a > 0$. Since $A_{n+1} \in \mathcal{F}_n$, $N = \inf\{n : A_{n+1} > a^2\}$ is a stopping time. Applying Theorem 4.4.8 to $X_{N \wedge n}$ and noticing $A_{N \wedge n} \le a^2$ gives
$$E\left( \sup_n |X_{N \wedge n}|^2 \right) \le 4a^2$$
so the $L^2$ convergence theorem (4.4.5) implies that $\lim X_{N \wedge n}$ exists and is finite a.s. Since $a$ is arbitrary, the desired result follows.
The next result is a variation on the theme of Exercise 4.4.9.
Theorem 4.4.10. Let $f \ge 1$ be increasing with $\int_0^\infty f(t)^{-2} \, dt < \infty$. Then $X_n/f(A_n) \to 0$ a.s. on $\{A_\infty = \infty\}$.
Proof. $H_m = f(A_m)^{-1}$ is bounded and predictable, so Theorem 4.2.5 implies
$$Y_n \equiv (H \cdot X)_n = \sum_{m=1}^n \frac{X_m - X_{m-1}}{f(A_m)} \quad \text{is a martingale}$$
If $B_n$ is the increasing process associated with $Y_n$ then
$$B_{n+1} - B_n = E((Y_{n+1} - Y_n)^2 \mid \mathcal{F}_n) = E\left( \frac{(X_{n+1} - X_n)^2}{f(A_{n+1})^2} \,\Big|\, \mathcal{F}_n \right) = \frac{A_{n+1} - A_n}{f(A_{n+1})^2}$$
since $f(A_{n+1}) \in \mathcal{F}_n$. Our hypotheses on $f$ imply that
$$\sum_{n=0}^\infty \frac{A_{n+1} - A_n}{f(A_{n+1})^2} \le \sum_{n=0}^\infty \int_{[A_n, A_{n+1})} f(t)^{-2} \, dt < \infty$$
so it follows from Theorem 4.4.9 that $Y_n \to Y_\infty$, and the desired conclusion follows from Kronecker's lemma, Theorem 1.8.5.
Example 4.4.4. Let $\epsilon > 0$ and $f(t) = (t \log^{1+\epsilon} t)^{1/2} \vee 1$. Then $f$ satisfies the hypotheses of Theorem 4.4.10. Let $\xi_1, \xi_2, \ldots$ be independent with $E\xi_m = 0$ and $E\xi_m^2 = \sigma_m^2$. In this case, $X_n = \xi_1 + \cdots + \xi_n$ is a square integrable martingale with $A_n = \sigma_1^2 + \cdots + \sigma_n^2$, so if $\sum_{i=1}^\infty \sigma_i^2 = \infty$, Theorem 4.4.10 implies $X_n/f(A_n) \to 0$, generalizing Theorem 1.8.7.
From Theorem 4.4.10 we get a result due to Dubins and Freedman (1965) that
extends our two previous versions in Theorems 1.6.6 and 4.3.2.
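Theorem 4.4.11. Second Borel-Cantelli Lemma, III. Suppose $B_n$ is adapted to $\mathcal{F}_n$ and let $p_n = P(B_n \mid \mathcal{F}_{n-1})$. Then
$$\sum_{m=1}^n 1_{B_m} \Big/ \sum_{m=1}^n p_m \to 1 \quad \text{a.s. on } \left\{ \sum_{m=1}^\infty p_m = \infty \right\}$$
Proof. Define a martingale by $X_0 = 0$ and $X_n - X_{n-1} = 1_{B_n} - P(B_n \mid \mathcal{F}_{n-1})$ for $n \ge 1$ so that we have
$$\left( \sum_{m=1}^n 1_{B_m} \Big/ \sum_{m=1}^n p_m \right) - 1 = X_n \Big/ \sum_{m=1}^n p_m$$
The increasing process associated with $X_n$ has
$$A_n - A_{n-1} = E((X_n - X_{n-1})^2 \mid \mathcal{F}_{n-1}) = E((1_{B_n} - p_n)^2 \mid \mathcal{F}_{n-1}) = p_n - p_n^2 \le p_n$$
On $\{A_\infty < \infty\}$, $X_n \to$ a finite limit by Theorem 4.4.9, so on $\{A_\infty < \infty\} \cap \{\sum_m p_m = \infty\}$
$$X_n \Big/ \sum_{m=1}^n p_m \to 0$$
$\{A_\infty = \infty\} = \{\sum_m p_m(1 - p_m) = \infty\} \subset \{\sum_m p_m = \infty\}$, so on $\{A_\infty = \infty\}$ the desired conclusion follows from Theorem 4.4.10 with $f(t) = t \vee 1$.
Remark. The trivial example $B_n = \Omega$ for all $n$ shows we may have $A_\infty < \infty$ and $\sum_m p_m = \infty$ a.s.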
Example 4.4.5. Bernard Friedman's urn. Consider a variant of Polya's urn (see Section 4.3) in which we add $a$ balls of the color drawn and $b$ balls of the opposite color where $a \ge 0$ and $b > 0$. We will show that if we start with $g$ green balls and $r$ red balls, where $g, r > 0$, then the fraction of green balls $g_n \to 1/2$. Let $G_n$ and $R_n$ be the number of green and red balls after the $n$th draw is completed. Let $B_n$ be the event that the $n$th ball drawn is green, and let $D_n$ be the number of green balls drawn in the first $n$ draws. It follows from Theorem 4.4.11 that
$$(\star) \qquad D_n \Big/ \sum_{m=1}^n g_{m-1} \to 1 \quad \text{a.s. on } \left\{ \sum_{m=1}^\infty g_{m-1} = \infty \right\}$$
which always holds since $g_m \ge g/(g + r + (a + b)m)$. At this point, the argument breaks into three cases.
Case 1. $a = b = c$. In this case, the result is trivial since we always add $c$ balls of each color.
Case 2. $a > b$. We begin with the observation
$$(*) \qquad g_{n+1} = \frac{G_{n+1}}{G_{n+1} + R_{n+1}} = \frac{g + aD_n + b(n - D_n)}{g + r + n(a + b)}$$
If $\limsup_{n\to\infty} g_n \le x$ then $(\star)$ implies $\limsup_{n\to\infty} D_n/n \le x$ and (since $a > b$)
$$\limsup_{n\to\infty} g_{n+1} \le \frac{ax + b(1 - x)}{a + b} = \frac{b + (a - b)x}{a + b}$$
The right-hand side is a linear function with slope $< 1$ and fixed point at $1/2$, so starting with the trivial upper bound $x = 1$ and iterating we conclude that $\limsup g_n \le 1/2$. Interchanging the roles of red and green shows $\liminf_{n\to\infty} g_n \ge 1/2$, and the result follows.
Case 3. $a < b$. The result is easier to believe in this case since we are adding more balls of the type not drawn but is a little harder to prove. The trouble is that when $b > a$ and $D_n \le xn$, the right-hand side of $(*)$ is maximized by taking $D_n = 0$, so we need to also use the fact that if $r_n$ is the fraction of red balls, then
$$r_{n+1} = \frac{R_{n+1}}{G_{n+1} + R_{n+1}} = \frac{r + bD_n + a(n - D_n)}{g + r + n(a + b)}$$
Combining this with the formula for $g_{n+1}$, it follows that if $\limsup_{n\to\infty} g_n \le x$ and $\limsup_{n\to\infty} r_n \le y$ then
$$\limsup_{n\to\infty} g_n \le \frac{a(1 - y) + by}{a + b} = \frac{a + (b - a)y}{a + b}$$
$$\limsup_{n\to\infty} r_n \le \frac{bx + a(1 - x)}{a + b} = \frac{a + (b - a)x}{a + b}$$
Starting with the trivial bounds $x = 1$, $y = 1$ and iterating (observe the two upper bounds are always the same), we conclude as in Case 2 that both limsups are $\le 1/2$.
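The iteration used in Cases 2 and 3 can be traced numerically. The following is only an illustrative sketch: the function names and the parameter choices $a = 3$, $b = 1$ and $a = 1$, $b = 3$ are assumptions made for the example, and we take $a > 0$ so that the slope $|a - b|/(a + b)$ is strictly less than 1. It applies the map $x \mapsto (\min(a,b) + |a - b|\,x)/(a + b)$ starting from the trivial bound $x = 1$, in exact rational arithmetic.

```python
from fractions import Fraction

def bound_map(x, a, b):
    """One step of the bound iteration x -> (min(a,b) + |a-b| x) / (a+b)."""
    lo, hi = min(a, b), max(a, b)
    return (Fraction(lo) + (hi - lo) * x) / (a + b)

def iterate_bound(a, b, steps):
    """Start from the trivial bound x = 1 and apply bound_map `steps` times."""
    x = Fraction(1)
    for _ in range(steps):
        x = bound_map(x, a, b)
    return x

# Hypothetical parameters: Case 2 with (a, b) = (3, 1), Case 3 with (1, 3).
for a, b in ((3, 1), (1, 3)):
    for steps in (0, 1, 5, 20):
        print(a, b, steps, float(iterate_bound(a, b, steps)))
```

For these parameters the slope is $1/2$, so the distance from the fixed point $1/2$ halves at every step, which is the geometric convergence the proof relies on.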
Remark. B. Friedman (1949) considered a number of diﬀerent urn models. The
result above is due to Freedman (1965), who proved the result by diﬀerent methods.
The proof above is due to Ornstein and comes from a remark in Freedman’s paper.
Theorem 4.4.8 came from using Theorem 4.4.3. If we use Theorem 4.4.2 instead,
we get a slightly better result.
Theorem 4.4.12. $E\left(\sup_n |X_n|\right) \le 3EA_\infty^{1/2}$.
Proof. As in the proof of Theorem 4.4.9 we let $a > 0$ and let $N = \inf\{n : A_{n+1} > a^2\}$. This time, however, our starting point is
$$P\left(\sup_m |X_m| > a\right) \le P(N < \infty) + P\left(\sup_m |X_{N\wedge m}| > a\right)$$
$P(N < \infty) = P(A_\infty > a^2)$. To bound the second term, we apply Theorem 4.4.2 to $X_{N\wedge m}^2$ with $\lambda = a^2$ to get
$$P\left(\sup_{m \le n} |X_{N\wedge m}| > a\right) \le a^{-2} EX_{N\wedge n}^2 = a^{-2} EA_{N\wedge n} \le a^{-2} E(A_\infty \wedge a^2)$$
Letting $n \to \infty$ in the last inequality, substituting the result in the first one, and integrating gives
$$\int_0^\infty P\left(\sup_m |X_m| > a\right) da \le \int_0^\infty P(A_\infty > a^2)\,da + \int_0^\infty a^{-2} E(A_\infty \wedge a^2)\,da$$
Since $P(A_\infty > a^2) = P(A_\infty^{1/2} > a)$, the first integral is $EA_\infty^{1/2}$. For the second, we use Lemma 1.5.8 (in the first and fourth steps), Fubini's theorem, and calculus to get
$$\int_0^\infty a^{-2} E(A_\infty \wedge a^2)\,da = \int_0^\infty a^{-2} \int_0^{a^2} P(A_\infty > b)\,db\,da$$
$$= \int_0^\infty P(A_\infty > b) \int_{\sqrt{b}}^\infty a^{-2}\,da\,db = \int_0^\infty b^{-1/2} P(A_\infty > b)\,db = 2EA_\infty^{1/2}$$
which completes the proof.
Exercise 4.4.10. Let $\xi_1, \xi_2, \ldots$ be i.i.d. with $E\xi_i = 0$ and $E\xi_i^2 < \infty$. Let $S_n = \xi_1 + \cdots + \xi_n$. Theorem 4.4.1 implies that for any stopping time $N$, $ES_{N\wedge n} = 0$. Use Theorem 4.4.12 to conclude that if $EN^{1/2} < \infty$ then $ES_N = 0$.
Remark. Let $\xi_i$ in Exercise 4.4.10 take the values $\pm 1$ with equal probability, and let $T = \inf\{n : S_n = -1\}$. Since $S_T = -1$ does not have mean 0, it follows that $ET^{1/2} = \infty$. If we recall from (3.3.1) that $P(T > t) \sim Ct^{-1/2}$, we see that the result in Exercise 4.4.10 is almost the best possible.

4.5 Uniform Integrability, Convergence in L1

In this section, we will give necessary and sufficient conditions for a martingale to converge in $L^1$. The key to this is the following definition. A collection of random variables $X_i$, $i \in I$, is said to be uniformly integrable if
$$\lim_{M\to\infty} \left( \sup_{i \in I} E(|X_i|; |X_i| > M) \right) = 0$$
If we pick $M$ large enough so that the sup $< 1$, it follows that
$$\sup_{i \in I} E|X_i| \le M + 1 < \infty$$
This remark will be useful several times below.
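To make the definition concrete, the sketch below evaluates $\sup_n E(|X_n|; |X_n| > M)$ exactly for two discrete families. Both families and all function names are assumptions chosen for illustration: `spike` takes the value $n$ with probability $1/n$ (bounded in $L^1$ but not uniformly integrable), while `thin_spike` takes the value $n$ with probability $1/n^2$ (the law of $X_n = Y_n Z_n$ in Example 4.5.1, which is uniformly integrable). The sup is taken over $1 \le n \le n_{\max}$ only, so it is a finite truncation of the quantity in the definition.

```python
from fractions import Fraction

def tail_expectation(values_probs, M):
    """E(|X|; |X| > M) for a discrete law given as (value, probability) pairs."""
    return sum(abs(v) * p for v, p in values_probs if abs(v) > M)

def spike(n):
    """X_n = n with probability 1/n, else 0, so E|X_n| = 1 for every n."""
    return [(n, Fraction(1, n)), (0, 1 - Fraction(1, n))]

def thin_spike(n):
    """X_n = n with probability 1/n^2, else 0."""
    return [(n, Fraction(1, n * n)), (0, 1 - Fraction(1, n * n))]

def sup_tail(family, M, n_max):
    """sup over 1 <= n <= n_max of E(|X_n|; |X_n| > M)."""
    return max(tail_expectation(family(n), M) for n in range(1, n_max + 1))

for M in (1, 10, 100):
    print(M, sup_tail(spike, M, 1000), sup_tail(thin_spike, M, 1000))
```

For `spike` the tail expectation equals 1 whenever $n > M$, so the sup does not decrease with $M$; for `thin_spike` it equals $1/n$ for $n > M$, so the sup is $1/(M+1)$ and tends to 0.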
A trivial example of a uniformly integrable family is a collection of random variables that are dominated by an integrable random variable, i.e., $|X_i| \le Y$ where $EY < \infty$. Our first result gives an interesting example that shows that uniformly integrable families can be very large.
Theorem 4.5.1. Given a probability space $(\Omega, \mathcal{F}_o, P)$ and an $X \in L^1$, then $\{E(X \mid \mathcal{F}) : \mathcal{F} \text{ is a } \sigma\text{-field} \subset \mathcal{F}_o\}$ is uniformly integrable.
Proof. If $A_n$ is a sequence of sets with $P(A_n) \to 0$ then the dominated convergence theorem implies $E(|X|; A_n) \to 0$. From the last result, it follows that if $\epsilon > 0$, we can pick $\delta > 0$ so that if $P(A) \le \delta$ then $E(|X|; A) \le \epsilon$. (If not, there are sets $A_n$ with $P(A_n) \le 1/n$ and $E(|X|; A_n) > \epsilon$, a contradiction.)
Pick $M$ large enough so that $E|X|/M \le \delta$. Jensen's inequality and the definition of conditional expectation imply
$$E(|E(X \mid \mathcal{F})|; |E(X \mid \mathcal{F})| > M) \le E(E(|X| \mid \mathcal{F}); E(|X| \mid \mathcal{F}) > M) = E(|X|; E(|X| \mid \mathcal{F}) > M)$$
since $\{E(|X| \mid \mathcal{F}) > M\} \in \mathcal{F}$. Using Chebyshev's inequality and recalling the definition of $M$, we have
$$P\{E(|X| \mid \mathcal{F}) > M\} \le E\{E(|X| \mid \mathcal{F})\}/M = E|X|/M \le \delta$$
So, by the choice of $\delta$, we have
$$E(|E(X \mid \mathcal{F})|; |E(X \mid \mathcal{F})| > M) \le \epsilon \quad \text{for all } \mathcal{F}$$
Since $\epsilon$ was arbitrary, the collection is uniformly integrable.

A common way to check uniform integrability is to use:
Exercise 4.5.1. Let $\varphi \ge 0$ be any function with $\varphi(x)/x \to \infty$ as $x \to \infty$, e.g., $\varphi(x) = x^p$ with $p > 1$ or $\varphi(x) = x \log^+ x$. If $E\varphi(|X_i|) \le C$ for all $i \in I$, then $\{X_i : i \in I\}$ is uniformly integrable.
The relevance of uniform integrability to convergence in L1 is explained by:
Theorem 4.5.2. If Xn → X in probability then the following are equivalent:
(i) {Xn : n ≥ 0} is uniformly integrable.
(ii) Xn → X in L1 .
(iii) $E|X_n| \to E|X| < \infty$.
Proof. (i) implies (ii). Let
$$\varphi_M(x) = \begin{cases} M & \text{if } x \ge M \\ x & \text{if } |x| \le M \\ -M & \text{if } x \le -M \end{cases}$$
The triangle inequality implies
$$|X_n - X| \le |X_n - \varphi_M(X_n)| + |\varphi_M(X_n) - \varphi_M(X)| + |\varphi_M(X) - X|$$
Since $|\varphi_M(Y) - Y| = (|Y| - M)^+ \le |Y| 1_{(|Y| > M)}$, taking expected value gives
$$E|X_n - X| \le E|\varphi_M(X_n) - \varphi_M(X)| + E(|X_n|; |X_n| > M) + E(|X|; |X| > M)$$
Theorem 1.6.4 implies that $\varphi_M(X_n) \to \varphi_M(X)$ in probability, so the first term $\to 0$ by the bounded convergence theorem. (See Exercise 1.6.3.) If $\epsilon > 0$ and $M$ is large, uniform integrability implies that the second term $\le \epsilon$. To bound the third term, we observe that uniform integrability implies $\sup E|X_n| < \infty$, so Fatou's lemma (in the form given in Exercise 1.6.2) implies $E|X| < \infty$, and by making $M$ larger we can make the third term $\le \epsilon$. Combining the last three facts shows $\limsup E|X_n - X| \le 2\epsilon$. Since $\epsilon$ is arbitrary, this proves (ii).
(ii) implies (iii). Jensen's inequality implies
$$|E|X_n| - E|X|| \le E||X_n| - |X|| \le E|X_n - X| \to 0$$
(iii) implies (i). Let
$$\psi_M(x) = \begin{cases} x & \text{on } [0, M-1] \\ 0 & \text{on } [M, \infty) \\ \text{linear} & \text{on } [M-1, M] \end{cases}$$
The dominated convergence theorem implies that if $M$ is large, $E|X| - E\psi_M(|X|) \le \epsilon/2$. As in the first part of the proof, the bounded convergence theorem implies $E\psi_M(|X_n|) \to E\psi_M(|X|)$, so using (iii) we get that if $n \ge n_0$
$$E(|X_n|; |X_n| > M) \le E|X_n| - E\psi_M(|X_n|) \le E|X| - E\psi_M(|X|) + \epsilon/2 < \epsilon$$
By choosing $M$ larger, we can make $E(|X_n|; |X_n| > M) \le \epsilon$ for $0 \le n < n_0$, so $X_n$ is uniformly integrable.

We are now ready to state the main theorems of this section. We have already done all the work, so the proofs are short.
Theorem 4.5.3. For a submartingale, the following are equivalent:
(i) It is uniformly integrable.
(ii) It converges a.s. and in L1 .
(iii) It converges in L1 .
Proof. (i) implies (ii). Uniform integrability implies $\sup E|X_n| < \infty$ so the martingale convergence theorem implies $X_n \to X$ a.s., and Theorem 4.5.2 implies $X_n \to X$ in $L^1$. (ii) implies (iii). Trivial. (iii) implies (i). $X_n \to X$ in $L^1$ implies $X_n \to X$ in probability (see Lemma 1.5.2), so this follows from Theorem 4.5.2.

Before proving the analogue of Theorem 4.5.3 for martingales, we will isolate two parts of the argument that will be useful later.
Lemma 4.5.4. If integrable random variables $X_n \to X$ in $L^1$ then
$$E(X_n; A) \to E(X; A)$$
Proof. $|EX_m 1_A - EX 1_A| \le E|X_m 1_A - X 1_A| \le E|X_m - X| \to 0$
Lemma 4.5.5. If a martingale $X_n \to X$ in $L^1$ then $X_n = E(X \mid \mathcal{F}_n)$.
Proof. The martingale property implies that if $m > n$, $E(X_m \mid \mathcal{F}_n) = X_n$, so if $A \in \mathcal{F}_n$, $E(X_n; A) = E(X_m; A)$. Lemma 4.5.4 implies $E(X_m; A) \to E(X; A)$, so we have $E(X_n; A) = E(X; A)$ for all $A \in \mathcal{F}_n$. Recalling the definition of conditional expectation, it follows that $X_n = E(X \mid \mathcal{F}_n)$.
Theorem 4.5.6. For a martingale, the following are equivalent:
(i) It is uniformly integrable.
(ii) It converges a.s. and in L1 .
(iii) It converges in L1 .
(iv) There is an integrable random variable $X$ so that $X_n = E(X \mid \mathcal{F}_n)$.
Proof. (i) implies (ii). Since martingales are also submartingales, this follows from
Theorem 4.5.3. (ii) implies (iii). Trivial. (iii) implies (iv). Follows from Lemma
4.5.5. (iv) implies (i). This follows from Theorem 4.5.1.
The next result is related to Lemma 4.5.5 but goes in the other direction.
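Theorem 4.5.7. Suppose $\mathcal{F}_n \uparrow \mathcal{F}_\infty$, i.e., $\mathcal{F}_n$ is an increasing sequence of $\sigma$-fields and $\mathcal{F}_\infty = \sigma(\cup_n \mathcal{F}_n)$. As $n \to \infty$,
$$E(X \mid \mathcal{F}_n) \to E(X \mid \mathcal{F}_\infty) \quad \text{a.s. and in } L^1$$
Proof. The first step is to note that if $m > n$ then Theorem 4.1.6 implies
$$E(E(X \mid \mathcal{F}_m) \mid \mathcal{F}_n) = E(X \mid \mathcal{F}_n)$$
so $Y_n = E(X \mid \mathcal{F}_n)$ is a martingale. Theorem 4.5.1 implies that $Y_n$ is uniformly integrable, so Theorem 4.5.6 implies that $Y_n$ converges a.s. and in $L^1$ to a limit $Y_\infty$. The definition of $Y_n$ and Lemma 4.5.5 imply $E(X \mid \mathcal{F}_n) = Y_n = E(Y_\infty \mid \mathcal{F}_n)$, and hence
$$\int_A X\,dP = \int_A Y_\infty\,dP \quad \text{for all } A \in \mathcal{F}_n$$
Since $X$ and $Y_\infty$ are integrable, and $\cup_n \mathcal{F}_n$ is a $\pi$-system, the $\pi$-$\lambda$ theorem implies that the last result holds for all $A \in \mathcal{F}_\infty$. Since $Y_\infty \in \mathcal{F}_\infty$, it follows that $Y_\infty = E(X \mid \mathcal{F}_\infty)$.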
Exercise 4.5.2. Let $Z_1, Z_2, \ldots$ be i.i.d. with $E|Z_i| < \infty$, let $\theta$ be an independent r.v. with finite mean, and let $Y_i = Z_i + \theta$. If $Z_i$ is normal(0,1) then in statistical terms we have a sample from a normal population with variance 1 and unknown mean. The distribution of $\theta$ is called the prior distribution, and $P(\theta \in \cdot \mid Y_1, \ldots, Y_n)$ is called the posterior distribution after $n$ observations. Show that $E(\theta \mid Y_1, \ldots, Y_n) \to \theta$ a.s.
In the next two exercises, $\Omega = [0, 1)$, $I_{k,n} = [k2^{-n}, (k+1)2^{-n})$, and $\mathcal{F}_n = \sigma(I_{k,n} : 0 \le k < 2^n)$.
Exercise 4.5.3. $f$ is said to be Lipschitz continuous if $|f(t) - f(s)| \le K|t - s|$ for $0 \le s, t < 1$. Show that $X_n = (f((k+1)2^{-n}) - f(k2^{-n}))/2^{-n}$ on $I_{k,n}$ defines a martingale, $X_n \to X_\infty$ a.s. and in $L^1$, and
$$f(b) - f(a) = \int_a^b X_\infty(\omega)\,d\omega$$
Exercise 4.5.4. Suppose $f$ is integrable on $[0,1)$. $E(f \mid \mathcal{F}_n)$ is a step function and $\to f$ in $L^1$. From this it follows immediately that if $\epsilon > 0$, there is a step function $g$ on $[0,1]$ with $\int |f - g|\,dx < \epsilon$. This approximation is much simpler than the bare-hands approach we used in Exercise A.4.3, but of course we are using a lot of machinery.
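The dyadic filtration makes conditional expectations easy to compute: given $\mathcal{F}_n$, a point of $I_{k,n}$ falls in each of the two halves $I_{2k,n+1}$ and $I_{2k+1,n+1}$ with probability $1/2$. The sketch below uses this to check the martingale property of the difference quotients in Exercise 4.5.3 for one sample function. It is a sanity check rather than a proof, and the function names and the choice $f(t) = t^2$ are assumptions made for the example.

```python
from fractions import Fraction

def X(f, n, k):
    """Difference quotient of f on I_{k,n} = [k 2^-n, (k+1) 2^-n)."""
    h = Fraction(1, 2 ** n)
    return (f((k + 1) * h) - f(k * h)) / h

def conditional_mean_of_next(f, n, k):
    """E(X_{n+1} | F_n) on I_{k,n}: average over the two halves of I_{k,n}."""
    return (X(f, n + 1, 2 * k) + X(f, n + 1, 2 * k + 1)) / 2

f = lambda t: t * t  # hypothetical Lipschitz function on [0, 1)

# Martingale property E(X_{n+1} | F_n) = X_n, checked on every dyadic interval.
ok = all(conditional_mean_of_next(f, n, k) == X(f, n, k)
         for n in range(6) for k in range(2 ** n))
print(ok)
```

For this $f$, $X_n = (2k+1)2^{-n}$ on $I_{k,n}$, which approaches $f'(t) = 2t$ as the intervals shrink, in line with the identity for $f(b) - f(a)$ in the exercise.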
An immediate consequence of Theorem 4.5.7 is:
Theorem 4.5.8. Lévy's 0-1 law. If $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ and $A \in \mathcal{F}_\infty$ then $E(1_A \mid \mathcal{F}_n) \to 1_A$ a.s.
To steal a line from Chung: “The reader is urged to ponder over the meaning of this
result and judge for himself whether it is obvious or incredible.” We will now argue
for the two points of view.
“It is obvious.” $1_A \in \mathcal{F}_\infty$, and $\mathcal{F}_n \uparrow \mathcal{F}_\infty$, so our best guess of $1_A$ given the information in $\mathcal{F}_n$ should approach $1_A$ (the best guess given $\mathcal{F}_\infty$).
“It is incredible.” Let $X_1, X_2, \ldots$ be independent and suppose $A \in \mathcal{T}$, the tail $\sigma$-field. For each $n$, $A$ is independent of $\mathcal{F}_n$, so $E(1_A \mid \mathcal{F}_n) = P(A)$. As $n \to \infty$, the left-hand side converges to $1_A$ a.s., so $P(A) = 1_A$ a.s., and it follows that $P(A) \in \{0, 1\}$, i.e., we have proved Kolmogorov's 0-1 law.
The last argument may not show that Theorem 4.5.8 is “too unusual or improbable
to be possible,” but this and other applications of Theorem 4.5.8 below show that it
is a very useful result.
Exercise 4.5.5. Let $X_n$ be r.v.'s taking values in $[0, \infty)$. Let $D = \{X_n = 0 \text{ for some } n \ge 1\}$ and assume
$$P(D \mid X_1, \ldots, X_n) \ge \delta(x) > 0 \quad \text{a.s. on } \{X_n \le x\}$$
Use Theorem 4.5.8 to conclude that $P(D \cup \{\lim_n X_n = \infty\}) = 1$.
Exercise 4.5.6. Let $Z_n$ be a branching process with offspring distribution $p_k$ (see the end of Section 4.3 for definitions). Use the last result to show that if $p_0 > 0$ then $P(\lim_n Z_n = 0 \text{ or } \infty) = 1$.
Exercise 4.5.7. Let $X_n \in [0, 1]$ be adapted to $\mathcal{F}_n$. Let $\alpha, \beta > 0$ with $\alpha + \beta = 1$ and suppose
$$P(X_{n+1} = \alpha + \beta X_n \mid \mathcal{F}_n) = X_n \qquad P(X_{n+1} = \beta X_n \mid \mathcal{F}_n) = 1 - X_n$$
Show $P(\lim_n X_n = 0 \text{ or } 1) = 1$ and if $X_0 = \theta$ then $P(\lim_n X_n = 1) = \theta$.
A more technical consequence of Theorem 4.5.7 is:
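Theorem 4.5.9. Dominated convergence theorem for conditional expectations. Suppose $Y_n \to Y$ a.s. and $|Y_n| \le Z$ for all $n$ where $EZ < \infty$. If $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ then
$$E(Y_n \mid \mathcal{F}_n) \to E(Y \mid \mathcal{F}_\infty) \quad \text{a.s.}$$
Proof. Let $W_N = \sup\{|Y_n - Y_m| : n, m \ge N\}$. $W_N \le 2Z$, so $EW_N < \infty$. Using monotonicity (4.1.2) and applying Theorem 4.5.7 to $W_N$ gives
$$\limsup_{n\to\infty} E(|Y_n - Y| \mid \mathcal{F}_n) \le \lim_{n\to\infty} E(W_N \mid \mathcal{F}_n) = E(W_N \mid \mathcal{F}_\infty)$$
The last result is true for all $N$ and $W_N \downarrow 0$ as $N \uparrow \infty$, so (4.1.3) implies $E(W_N \mid \mathcal{F}_\infty) \downarrow 0$, and Jensen's inequality gives us
$$|E(Y_n \mid \mathcal{F}_n) - E(Y \mid \mathcal{F}_n)| \le E(|Y_n - Y| \mid \mathcal{F}_n) \to 0 \quad \text{a.s. as } n \to \infty$$
Theorem 4.5.7 implies $E(Y \mid \mathcal{F}_n) \to E(Y \mid \mathcal{F}_\infty)$ a.s. The desired result follows from the last two conclusions and the triangle inequality.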
Exercise 4.5.8. Show that if $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ and $Y_n \to Y$ in $L^1$ then $E(Y_n \mid \mathcal{F}_n) \to E(Y \mid \mathcal{F}_\infty)$ in $L^1$.
Example 4.5.1. Suppose $X_1, X_2, \ldots$ are uniformly integrable and $\to X$ a.s. Theorem 4.5.2 implies $X_n \to X$ in $L^1$ and combining this with Exercise 4.5.8 shows $E(X_n \mid \mathcal{F}) \to E(X \mid \mathcal{F})$ in $L^1$. We will now show that $E(X_n \mid \mathcal{F})$ need not converge a.s. Let $Y_1, Y_2, \ldots$ and $Z_1, Z_2, \ldots$ be independent r.v.'s with
$$P(Y_n = 1) = 1/n \qquad P(Y_n = 0) = 1 - 1/n$$
$$P(Z_n = n) = 1/n \qquad P(Z_n = 0) = 1 - 1/n$$
Let $X_n = Y_n Z_n$. $P(X_n > 0) = 1/n^2$ so the Borel-Cantelli lemma implies $X_n \to 0$ a.s. $E(X_n; |X_n| \ge 1) = n/n^2$, so $X_n$ is uniformly integrable. Let $\mathcal{F} = \sigma(Y_1, Y_2, \ldots)$.
$$E(X_n \mid \mathcal{F}) = Y_n E(Z_n \mid \mathcal{F}) = Y_n EZ_n = Y_n$$
Since $Y_n \to 0$ in $L^1$ but not a.s., the same is true for $E(X_n \mid \mathcal{F})$.

4.6 Backwards Martingales

A backwards martingale (some authors call them reversed) is a martingale indexed by the negative integers, i.e., $X_n$, $n \le 0$, adapted to an increasing sequence of $\sigma$-fields $\mathcal{F}_n$ with
$$E(X_{n+1} \mid \mathcal{F}_n) = X_n \quad \text{for } n \le -1$$
Because the $\sigma$-fields decrease as $n \downarrow -\infty$, the convergence theory for backwards martingales is particularly simple.
Theorem 4.6.1. $X_{-\infty} = \lim_{n\to-\infty} X_n$ exists a.s. and in $L^1$.
Proof. Let $U_n$ be the number of upcrossings of $[a, b]$ by $X_{-n}, \ldots, X_0$. The upcrossing inequality, Theorem 4.2.7, implies $(b - a)EU_n \le E(X_0 - a)^+$. Letting $n \to \infty$ and using the monotone convergence theorem, we have $EU_\infty < \infty$, so by the remark after the proof of Theorem 4.2.8, the limit exists a.s. The martingale property implies $X_n = E(X_0 \mid \mathcal{F}_n)$, so Theorem 4.5.1 implies $X_n$ is uniformly integrable and Theorem 4.5.2 tells us that the convergence occurs in $L^1$.
Exercise 4.6.1. Show that if $X_0 \in L^p$ the convergence occurs in $L^p$.
The next result identiﬁes the limit in Theorem 4.6.1.
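Theorem 4.6.2. If $X_{-\infty} = \lim_{n\to-\infty} X_n$ and $\mathcal{F}_{-\infty} = \cap_n \mathcal{F}_n$, then $X_{-\infty} = E(X_0 \mid \mathcal{F}_{-\infty})$.
Proof. Clearly, $X_{-\infty} \in \mathcal{F}_{-\infty}$. $X_n = E(X_0 \mid \mathcal{F}_n)$, so if $A \in \mathcal{F}_{-\infty} \subset \mathcal{F}_n$ then
$$\int_A X_n\,dP = \int_A X_0\,dP$$
Theorem 4.6.1 and Lemma 4.5.4 imply $E(X_n; A) \to E(X_{-\infty}; A)$, so
$$\int_A X_{-\infty}\,dP = \int_A X_0\,dP$$
for all $A \in \mathcal{F}_{-\infty}$, proving the desired conclusion.
The next result is Theorem 4.5.7 backwards.
Theorem 4.6.3. If $\mathcal{F}_n \downarrow \mathcal{F}_{-\infty}$ as $n \downarrow -\infty$ (i.e., $\mathcal{F}_{-\infty} = \cap_n \mathcal{F}_n$), then
$$E(Y \mid \mathcal{F}_n) \to E(Y \mid \mathcal{F}_{-\infty}) \quad \text{a.s. and in } L^1$$
Proof. $X_n = E(Y \mid \mathcal{F}_n)$ is a backwards martingale, so Theorems 4.6.1 and 4.6.2 imply that as $n \downarrow -\infty$, $X_n \to X_{-\infty}$ a.s. and in $L^1$, where
$$X_{-\infty} = E(X_0 \mid \mathcal{F}_{-\infty}) = E(E(Y \mid \mathcal{F}_0) \mid \mathcal{F}_{-\infty}) = E(Y \mid \mathcal{F}_{-\infty})$$
Exercise 4.6.2. Prove the backwards analogue of Theorem 4.5.9. Suppose $Y_n \to Y_{-\infty}$ a.s. as $n \to -\infty$ and $|Y_n| \le Z$ a.s. where $EZ < \infty$. If $\mathcal{F}_n \downarrow \mathcal{F}_{-\infty}$, then $E(Y_n \mid \mathcal{F}_n) \to E(Y_{-\infty} \mid \mathcal{F}_{-\infty})$ a.s.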
Even though the convergence theory for backwards martingales is easy, there are
some nice applications. For the rest of the section, we return to the special space
utilized in Section 3.1, so we can utilize deﬁnitions given there. That is, we suppose
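$$\Omega = \{(\omega_1, \omega_2, \ldots) : \omega_i \in S\}$$
$$\mathcal{F} = \mathcal{S} \times \mathcal{S} \times \cdots$$
$$X_n(\omega) = \omega_n$$
Let $\mathcal{E}_n$ be the $\sigma$-field generated by events that are invariant under permutations that leave $n+1, n+2, \ldots$ fixed and let $\mathcal{E} = \cap_n \mathcal{E}_n$ be the exchangeable $\sigma$-field.

Example 4.6.1. Strong law of large numbers. Let $\xi_1, \xi_2, \ldots$ be i.i.d. with $E|\xi_i| < \infty$. Let $S_n = \xi_1 + \cdots + \xi_n$, let $X_{-n} = S_n/n$, and let
$$\mathcal{F}_{-n} = \sigma(S_n, S_{n+1}, S_{n+2}, \ldots) = \sigma(S_n, \xi_{n+1}, \xi_{n+2}, \ldots)$$
To compute $E(X_{-n} \mid \mathcal{F}_{-n-1})$, we observe that if $j, k \le n+1$, symmetry implies $E(\xi_j \mid \mathcal{F}_{-n-1}) = E(\xi_k \mid \mathcal{F}_{-n-1})$, so
$$E(\xi_{n+1} \mid \mathcal{F}_{-n-1}) = \frac{1}{n+1} \sum_{k=1}^{n+1} E(\xi_k \mid \mathcal{F}_{-n-1}) = \frac{1}{n+1} E(S_{n+1} \mid \mathcal{F}_{-n-1}) = \frac{S_{n+1}}{n+1}$$
Since $X_{-n} = (S_{n+1} - \xi_{n+1})/n$, it follows that
$$E(X_{-n} \mid \mathcal{F}_{-n-1}) = E(S_{n+1}/n \mid \mathcal{F}_{-n-1}) - E(\xi_{n+1}/n \mid \mathcal{F}_{-n-1}) = \frac{S_{n+1}}{n} - \frac{S_{n+1}}{n(n+1)} = \frac{S_{n+1}}{n+1} = X_{-n-1}$$
The last computation shows $X_{-n}$ is a backwards martingale, so it follows from Theorems 4.6.1 and 4.6.2 that $\lim_{n\to\infty} S_n/n = E(X_{-1} \mid \mathcal{F}_{-\infty})$. Since $\mathcal{F}_{-n} \subset \mathcal{E}_n$, $\mathcal{F}_{-\infty} \subset \mathcal{E}$. The Hewitt-Savage 0-1 law ((1.1) in Chapter 3) says $\mathcal{E}$ is trivial, so we have
$$\lim_{n\to\infty} S_n/n = E(X_{-1}) \quad \text{a.s.}$$
Example 4.6.2. Ballot theorem. Let $\{\xi_j, 1 \le j \le n\}$ be i.i.d. nonnegative integer-valued r.v.'s, let $S_k = \xi_1 + \cdots + \xi_k$, and let $G = \{S_j < j \text{ for } 1 \le j \le n\}$. Then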
$$P(G \mid S_n) = (1 - S_n/n)^+ \qquad (4.6.1)$$
Remark. To explain the name, let $\xi_1, \xi_2, \ldots, \xi_n$ be i.i.d. and take values 0 or 2 with probability 1/2 each. Interpreting 0's and 2's as votes for candidates $A$ and $B$, we see that $G = \{A \text{ leads } B \text{ throughout the counting}\}$ so if $n = \alpha + \beta$
$$P(G \mid B \text{ gets } \beta \text{ votes}) = \left(1 - \frac{2\beta}{n}\right)^+ = \frac{\alpha - \beta}{\alpha + \beta}$$
the result in Theorem 3.3.2.
Proof. The result is trivial when $S_n \ge n$, so suppose $S_n < n$. Computations in Example 4.6.1 show that $X_{-j} = S_j/j$ is a martingale w.r.t. $\mathcal{F}_{-j} = \sigma(S_j, \ldots, S_n)$. Let $T = \inf\{k \ge -n : X_k \ge 1\}$ and set $T = -1$ if the set is $\emptyset$. We claim that $X_T = 1$ on $G^c$. To check this, note that if $S_{j+1} < j+1$ then $S_j \le S_{j+1} \le j$. Since $G \subset \{T = -1\}$ and $S_1 < 1$ implies $S_1 = 0$, we have $X_T = 0$ on $G$. Noting $\mathcal{F}_{-n} = \sigma(S_n)$ and using Exercise 4.4.3, we see that on $\{S_n < n\}$
$$P(G^c \mid S_n) = E(X_T \mid \mathcal{F}_{-n}) = X_{-n} = S_n/n$$
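Formula (4.6.1) can be checked by brute force in the voting setting of the Remark. The sketch below is a check for small $n$ only, and the function name is an assumption made for the example: it enumerates all $2^n$ equally likely sequences with $\xi_j \in \{0, 2\}$, groups them by the value of $S_n$, and computes the exact conditional probability of $G$.

```python
from fractions import Fraction
from itertools import product

def ballot_conditional(n):
    """Return {s: P(G | S_n = s)} for xi_j uniform on {0, 2}, by enumeration."""
    hits, totals = {}, {}
    for xs in product((0, 2), repeat=n):
        partial, in_G = 0, True
        for j, x in enumerate(xs, start=1):
            partial += x
            if partial >= j:       # G requires S_j < j for every j
                in_G = False
        # After the loop, `partial` equals S_n.
        totals[partial] = totals.get(partial, 0) + 1
        hits[partial] = hits.get(partial, 0) + in_G
    return {s: Fraction(hits[s], totals[s]) for s in totals}

n = 8
matches = all(prob == max(Fraction(0), 1 - Fraction(s, n))
              for s, prob in ballot_conditional(n).items())
print(matches)
```

Since all sequences have the same probability, the conditional probability is a ratio of counts, so no floating-point error enters the comparison.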
Example 4.6.3. Hewitt-Savage 0-1 law. If $X_1, X_2, \ldots$ are i.i.d. and $A \in \mathcal{E}$ then $P(A) \in \{0, 1\}$.
The key to the new proof is:
Lemma 4.6.4. Suppose $X_1, X_2, \ldots$ are i.i.d. and let
$$A_n(\varphi) = \frac{1}{(n)_k} \sum_i \varphi(X_{i_1}, \ldots, X_{i_k})$$
where the sum is over all sequences of distinct integers $1 \le i_1, \ldots, i_k \le n$ and
$$(n)_k = n(n-1) \cdots (n-k+1)$$
is the number of such sequences. If $\varphi$ is bounded, $A_n(\varphi) \to E\varphi(X_1, \ldots, X_k)$ a.s.
Proof. $A_n(\varphi) \in \mathcal{E}_n$, so
$$A_n(\varphi) = E(A_n(\varphi) \mid \mathcal{E}_n) = \frac{1}{(n)_k} \sum_i E(\varphi(X_{i_1}, \ldots, X_{i_k}) \mid \mathcal{E}_n) = E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}_n)$$
since all the terms in the sum are the same. Theorem 4.6.3 with $\mathcal{F}_{-m} = \mathcal{E}_m$ for $m \ge 1$ implies that
$$E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}_n) \to E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E})$$
We want to show that the limit is $E(\varphi(X_1, \ldots, X_k))$. The first step is to observe that there are $k(n-1)_{k-1}$ terms in $A_n(\varphi)$ involving $X_1$ and $\varphi$ is bounded so if we let $\sum_{1 \in i}$ denote the sum over sequences that contain 1,
$$\left| \frac{1}{(n)_k} \sum_{1 \in i} \varphi(X_{i_1}, \ldots, X_{i_k}) \right| \le \frac{k(n-1)_{k-1}}{(n)_k} \sup|\varphi| \to 0$$
This shows that
$$E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}) \in \sigma(X_2, X_3, \ldots)$$
Repeating the argument for $2, 3, \ldots, k$ shows
$$E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}) \in \sigma(X_{k+1}, X_{k+2}, \ldots)$$
Intuitively, if the conditional expectation of a r.v. is independent of the r.v. then
$$\text{(a)} \qquad E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}) = E(\varphi(X_1, \ldots, X_k))$$
To show this, we prove:
(b) If $EX^2 < \infty$ and $E(X \mid \mathcal{G}) \in \mathcal{F}$ with $X$ independent of $\mathcal{F}$ then $E(X \mid \mathcal{G}) = EX$.
Proof. Let $Y = E(X \mid \mathcal{G})$ and note that Theorem 4.1.4 implies $EY^2 \le EX^2 < \infty$. By independence, $EXY = EX\,EY = (EY)^2$ since $EY = EX$. From the geometric interpretation of conditional expectation, Theorem 4.1.8, $E((X - Y)Y) = 0$, so $EY^2 = EXY = (EY)^2$ and $\mathrm{var}(Y) = EY^2 - (EY)^2 = 0$.
(a) holds for all bounded $\varphi$, so $\mathcal{E}$ is independent of $\mathcal{G}_k = \sigma(X_1, \ldots, X_k)$. Since this holds for all $k$, and $\cup_k \mathcal{G}_k$ is a $\pi$-system that contains $\Omega$, (4.1) in Chapter 1 implies $\mathcal{E}$ is independent of $\sigma(\cup_k \mathcal{G}_k) \supset \mathcal{E}$, and we get the usual 0-1 law punch line. If $A \in \mathcal{E}$, it is independent of itself, and hence $P(A) = P(A \cap A) = P(A)P(A)$, i.e., $P(A) \in \{0, 1\}$.
Example 4.6.4. de Finetti's Theorem. A sequence $X_1, X_2, \ldots$ is said to be exchangeable if for each $n$ and permutation $\pi$ of $\{1, \ldots, n\}$, $(X_1, \ldots, X_n)$ and $(X_{\pi(1)}, \ldots, X_{\pi(n)})$ have the same distribution.

Theorem 4.6.5. de Finetti's Theorem. If $X_1, X_2, \ldots$ are exchangeable then conditional on $\mathcal{E}$, $X_1, X_2, \ldots$ are independent and identically distributed.
Proof. Repeating the first calculation in the proof of Lemma 4.6.4 and using the notation introduced there shows that for any exchangeable sequence:
$$A_n(\varphi) = E(A_n(\varphi) \mid \mathcal{E}_n) = \frac{1}{(n)_k} \sum_i E(\varphi(X_{i_1}, \ldots, X_{i_k}) \mid \mathcal{E}_n) = E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}_n)$$
since all the terms in the sum are the same. Again, Theorem 4.6.3 implies that
$$A_n(\varphi) \to E(\varphi(X_1, \ldots, X_k) \mid \mathcal{E}) \qquad (4.6.2)$$
This time, however, $\mathcal{E}$ may be nontrivial, so we cannot hope to show that the limit is $E(\varphi(X_1, \ldots, X_k))$.
Let $f$ and $g$ be bounded functions on $\mathbf{R}^{k-1}$ and $\mathbf{R}$, respectively. If we let $I_{n,k}$ be the set of all sequences of distinct integers $1 \le i_1, \ldots, i_k \le n$, then
$$(n)_{k-1} A_n(f) \cdot n A_n(g) = \sum_{i \in I_{n,k-1}} f(X_{i_1}, \ldots, X_{i_{k-1}}) \sum_m g(X_m)$$
$$= \sum_{i \in I_{n,k}} f(X_{i_1}, \ldots, X_{i_{k-1}}) g(X_{i_k}) + \sum_{i \in I_{n,k-1}} \sum_{j=1}^{k-1} f(X_{i_1}, \ldots, X_{i_{k-1}}) g(X_{i_j})$$
If we let $\varphi(x_1, \ldots, x_k) = f(x_1, \ldots, x_{k-1}) g(x_k)$, note that
$$\frac{(n)_{k-1}\, n}{(n)_k} = \frac{n}{n-k+1} \quad \text{and} \quad \frac{(n)_{k-1}}{(n)_k} = \frac{1}{n-k+1}$$
then rearrange, we have
$$A_n(\varphi) = \frac{n}{n-k+1} A_n(f) A_n(g) - \frac{1}{n-k+1} \sum_{j=1}^{k-1} A_n(\varphi_j)$$
where $\varphi_j(x_1, \ldots, x_{k-1}) = f(x_1, \ldots, x_{k-1}) g(x_j)$. Applying (4.6.2) to $\varphi$, $f$, $g$, and all the $\varphi_j$ gives
$$E(f(X_1, \ldots, X_{k-1}) g(X_k) \mid \mathcal{E}) = E(f(X_1, \ldots, X_{k-1}) \mid \mathcal{E})\, E(g(X_k) \mid \mathcal{E})$$
It follows by induction that
$$E\left( \prod_{j=1}^k f_j(X_j) \,\Big|\, \mathcal{E} \right) = \prod_{j=1}^k E(f_j(X_j) \mid \mathcal{E})$$
When the $X_i$ take values in a nice space, there is a regular conditional distribution for $(X_1, X_2, \ldots)$ given $\mathcal{E}$, and the sequence can be represented as a mixture of i.i.d. sequences. Hewitt and Savage (1956) call the sequence presentable in this case. For the usual measure theoretic problems, the last result is not valid when the $X_i$ take values in an arbitrary measure space. See Dubins and Freedman (1979) and Freedman (1980) for counterexamples.
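The rearrangement in the proof of Theorem 4.6.5 that expresses $A_n(\varphi)$ through $A_n(f)$, $A_n(g)$ and the $A_n(\varphi_j)$ is a purely combinatorial identity, valid for any fixed values of $X_1, \ldots, X_n$, so it can be verified directly. In the sketch below the sample values and the functions `f` and `g` are arbitrary choices made for illustration, and the helper `A` is a hypothetical name for the average $A_n(\cdot)$.

```python
from fractions import Fraction
from itertools import permutations

def A(phi, xs, k):
    """A_n(phi): average of phi over all ordered k-tuples of distinct indices."""
    tuples = list(permutations(range(len(xs)), k))
    return Fraction(sum(phi(*(xs[i] for i in idx)) for idx in tuples), len(tuples))

xs = [3, 1, 4, 1, 5, 9]           # hypothetical sample, n = 6
n, k = len(xs), 3
f = lambda x1, x2: x1 * x1 + x2   # hypothetical f on R^{k-1}
g = lambda x: 2 * x + 1           # hypothetical g on R
phi = lambda x1, x2, x3: f(x1, x2) * g(x3)
# phi_j(x_1, ..., x_{k-1}) = f(x_1, ..., x_{k-1}) g(x_j), indexed from 0 here
phis = [lambda x1, x2, j=j: f(x1, x2) * g((x1, x2)[j]) for j in range(k - 1)]

lhs = A(phi, xs, k)
rhs = (Fraction(n, n - k + 1) * A(f, xs, k - 1) * A(g, xs, 1)
       - Fraction(1, n - k + 1) * sum(A(p, xs, k - 1) for p in phis))
print(lhs == rhs)
```

As $n \to \infty$ with $k$ fixed, the first coefficient tends to 1 and the second to 0, which is why the correction terms $A_n(\varphi_j)$ vanish in the limit.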
The simplest special case of Theorem 4.6.5 occurs when the $X_i \in \{0, 1\}$. In this case

Theorem 4.6.6. If $X_1, X_2, \ldots$ are exchangeable and take values in $\{0, 1\}$ then there is a probability distribution on $[0, 1]$ so that
$$P(X_1 = 1, \ldots, X_k = 1, X_{k+1} = 0, \ldots, X_n = 0) = \int_0^1 \theta^k (1 - \theta)^{n-k}\,dF(\theta)$$
This result is useful for people concerned about the foundations of statistics (see Section 3.7 of Savage (1972)), since from the palatable assumption of symmetry one gets the powerful conclusion that the sequence is a mixture of i.i.d. sequences. Theorem 4.6.6 has been proved in a variety of different ways. See Feller, Vol. II (1971), p. 228–229 for a proof that is related to the moment problem. Diaconis and Freedman (1980) have a nice proof that starts with the trivial observation that the distribution of a finite exchangeable sequence $X_m$, $1 \le m \le n$ has the form $p_0 H_{0,n} + \cdots + p_n H_{n,n}$ where $H_{m,n}$ is “drawing without replacement from an urn with $m$ ones and $n - m$ zeros.” If $m \to \infty$ and $m/n \to p$ then $H_{m,n}$ approaches product measure with density $p$. Theorem 4.6.6 follows easily from this, and one can get bounds on the rate of convergence.
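A standard instance of Theorem 4.6.6, not worked out in the text above and offered here only as an illustration, is Polya's urn started with one ball of each color and adding one ball of the color drawn: the indicators of drawing color 1 are exchangeable, and the mixing distribution $F$ is uniform on $[0,1]$, for which $\int_0^1 \theta^k (1-\theta)^{n-k}\,d\theta = k!\,(n-k)!/(n+1)!$. The sketch below, with hypothetical function names, compares the two sides exactly.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def polya_prob(bits):
    """P(X_1..X_n = bits) for Polya's urn started with 1 ball of each color,
    adding one ball of the drawn color (X_m = 1 if draw m is color 1)."""
    ones, zeros, prob = 1, 1, Fraction(1)
    for bit in bits:
        prob *= Fraction(ones if bit else zeros, ones + zeros)
        if bit:
            ones += 1
        else:
            zeros += 1
    return prob

def uniform_mixture(k, n):
    """Integral over [0,1] of theta^k (1-theta)^(n-k) d(theta)."""
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

n = 6
agree = all(polya_prob(bits) == uniform_mixture(sum(bits), n)
            for bits in product((0, 1), repeat=n))
print(agree)
```

The check runs over every 0-1 string of length $n$, so it also confirms exchangeability: the probability depends on the string only through its number of ones.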
Exercises
4.6.3. Prove directly from the definition that if $X_1, X_2, \ldots \in \{0, 1\}$ are exchangeable
$$P(X_1 = 1, \ldots, X_k = 1 \mid S_n = m) = \binom{n-k}{n-m} \Big/ \binom{n}{m}$$
4.6.4. If $X_1, X_2, \ldots \in \mathbf{R}$ are exchangeable with $EX_i^2 < \infty$ then $E(X_1 X_2) \ge 0$.
4.6.5. Use the first few lines of the proof of Lemma 4.6.4 to conclude that if $X_1, X_2, \ldots$ are i.i.d. with $EX_i = \mu$ and $\mathrm{var}(X_i) = \sigma^2 < \infty$ then
$$\binom{n}{2}^{-1} \sum_{1 \le i < j \le n} (X_i - X_j)^2 \to 2\sigma^2$$

4.7 Optional Stopping Theorems

In this section, we will prove a number of results that allow us to conclude that if $X_n$ is a submartingale and $M \le N$ are stopping times, then $EX_M \le EX_N$. Example 4.2.3 shows that this is not always true, but Exercise 4.4.2 shows this is true if $N$ is bounded, so our attention will be focused on the case of unbounded $N$.
Theorem 4.7.1. If $X_n$ is a uniformly integrable submartingale then for any stopping time $N$, $X_{N\wedge n}$ is uniformly integrable.
Proof. $X_n^+$ is a submartingale, so Theorem 4.4.1 implies $EX_{N\wedge n}^+ \le EX_n^+$. Since $X_n^+$ is uniformly integrable, it follows from the remark after the definition that
$$\sup_n EX_{N\wedge n}^+ \le \sup_n EX_n^+ < \infty$$
Using the martingale convergence theorem (4.2.8) now gives $X_{N\wedge n} \to X_N$ a.s. (here $X_\infty = \lim_n X_n$) and $E|X_N| < \infty$. With this established, the rest is easy. We write
$$E(|X_{N\wedge n}|; |X_{N\wedge n}| > K) = E(|X_N|; |X_N| > K, N \le n) + E(|X_n|; |X_n| > K, N > n)$$
Since $E|X_N| < \infty$ and $X_n$ is uniformly integrable, if $K$ is large then each term is $< \epsilon/2$.
From the last computation in the proof of Theorem 4.7.1, we get:
Theorem 4.7.2. If $E|X_N| < \infty$ and $X_n 1_{(N > n)}$ is uniformly integrable, then $X_{N\wedge n}$ is uniformly integrable.
From Theorem 4.7.1, we immediately get:
Theorem 4.7.3. If Xn is a uniformly integrable submartingale then for any stopping
time N ≤ ∞, we have EX0 ≤ EXN ≤ EX∞ , where X∞ = lim Xn .
Proof. Theorem 4.4.1 implies EX0 ≤ EXN ∧n ≤ EXn . Letting n → ∞ and observing
that Theorem 4.7.1 and 4.5.3 imply XN ∧n → XN and Xn → X∞ in L1 gives the
desired result.
From Theorem 4.7.3, we get the following useful corollary.
Theorem 4.7.4. Optional Stopping Theorem. If $L \le M$ are stopping times and $Y_{M\wedge n}$ is a uniformly integrable submartingale, then $EY_L \le EY_M$ and
$$Y_L \le E(Y_M \mid \mathcal{F}_L)$$
Proof. Use the inequality $EX_N \le EX_\infty$ in Theorem 4.7.3 with $X_n = Y_{M\wedge n}$ and $N = L$. To prove the second result, let $A \in \mathcal{F}_L$ and
$$N = \begin{cases} L & \text{on } A \\ M & \text{on } A^c \end{cases}$$
is a stopping time by Exercise 3.1.7. Using the first result now shows $EY_N \le EY_M$. Since $N = M$ on $A^c$, it follows from the last inequality and the definition of conditional expectation that
$$E(Y_L; A) \le E(Y_M; A) = E(E(Y_M \mid \mathcal{F}_L); A)$$
Taking $A_\epsilon = \{Y_L - E(Y_M \mid \mathcal{F}_L) > \epsilon\}$, we conclude $P(A_\epsilon) = 0$ for all $\epsilon > 0$ and the desired result follows.

The last result is the one we use the most (usually the first inequality with $L = 0$). Theorem 4.7.2 is useful in checking the hypothesis. A typical application is the following generalization of Wald's equation, Theorem 3.1.5.
Theorem 4.7.5. Suppose $X_n$ is a submartingale and $E(|X_{n+1} - X_n| \mid \mathcal{F}_n) \le B$ a.s. If $N$ is a stopping time with $EN < \infty$ then $X_{N\wedge n}$ is uniformly integrable and hence $EX_N \ge EX_0$.
Remark. As usual, using the last result twice shows that if $X$ is a martingale then $EX_N = EX_0$. To recover Wald's equation, let $S_n$ be a random walk, let $\mu = E(S_n - S_{n-1})$, and apply the martingale result to $X_n = S_n - n\mu$.
Proof. We begin by observing that
$$|X_{N\wedge n}| \le |X_0| + \sum_{m=0}^\infty |X_{m+1} - X_m| 1_{(N > m)}$$
To prove uniform integrability, it suffices to show that the right-hand side has finite expectation for then $|X_{N\wedge n}|$ is dominated by an integrable r.v. Now, $\{N > m\} \in \mathcal{F}_m$, so
$$E(|X_{m+1} - X_m|; N > m) = E(E(|X_{m+1} - X_m| \mid \mathcal{F}_m); N > m) \le BP(N > m)$$
and $E \sum_{m=0}^\infty |X_{m+1} - X_m| 1_{(N > m)} \le B \sum_{m=0}^\infty P(N > m) = BEN < \infty$.

Before we delve further into applications, we pause to prove one last stopping theorem that does not require uniform integrability.
Theorem 4.7.6. If $X_n$ is a nonnegative supermartingale and $N \le \infty$ is a stopping time, then $EX_0 \ge EX_N$ where $X_\infty = \lim X_n$, which exists by Theorem 4.2.9.
Proof. By Theorem 4.4.1, $EX_0 \ge EX_{N\wedge n}$. The monotone convergence theorem implies
$$E(X_N; N < \infty) = \lim_{n\to\infty} E(X_N; N \le n)$$
and Fatou's lemma implies
$$E(X_N; N = \infty) \le \liminf_{n\to\infty} E(X_n; N > n)$$
Adding the last two lines and using our first observation,
$$EX_N \le \liminf_{n\to\infty} EX_{N\wedge n} \le EX_0$$
Exercise 4.7.1. If $X_n \ge 0$ is a supermartingale then $P(\sup X_n > \lambda) \le EX_0/\lambda$.
Applications to random walks. For the rest of the section, including all the
exercises below, ξ1 , ξ2 , . . . are i.i.d., Sn = ξ1 + · · · + ξn , and Fn = σ (ξ1 , . . . , ξn ).
Theorem 4.7.7. Asymmetric simple random walk refers to the special case in which $P(\xi_i = 1) = p$ and $P(\xi_i = -1) = q \equiv 1 - p$ with $p \ne q$. Without loss of generality we assume $1/2 < p < 1$.
(a) If $\varphi(x) = \{(1-p)/p\}^x$ then $\varphi(S_n)$ is a martingale.
(b) If we let $T_x = \inf\{n : S_n = x\}$ then for $a < 0 < b$
$$P(T_a < T_b) = \frac{\varphi(b) - \varphi(0)}{\varphi(b) - \varphi(a)}$$
(c) If $a < 0$ then $P(\min_n S_n \le a) = P(T_a < \infty) = \{(1-p)/p\}^{-a}$.
(d) If $b > 0$ then $P(T_b < \infty) = 1$ and $ET_b = b/(2p - 1)$.
Proof. Since $S_n$ and $\xi_{n+1}$ are independent, Example 4.1.6 implies that on $\{S_n = m\}$,
$$E(\varphi(S_{n+1}) \mid \mathcal{F}_n) = p \cdot \left(\frac{1-p}{p}\right)^{m+1} + (1-p) \left(\frac{1-p}{p}\right)^{m-1} = \{1 - p + p\} \left(\frac{1-p}{p}\right)^m = \varphi(S_n)$$
which proves (a).
Let $N = T_a \wedge T_b$. We showed in Example 3.1.5 that $N < \infty$. Since $\varphi(S_{N\wedge n})$ is bounded, it is uniformly integrable and Theorem 4.7.4 with $L = 0$, $M = N$ implies
$$\varphi(0) = E\varphi(S_N) = P(T_a < T_b)\varphi(a) + P(T_b < T_a)\varphi(b)$$
Using $P(T_a < T_b) + P(T_b < T_a) = 1$ and solving gives (b).
Letting $b \to \infty$ and noting $\varphi(b) \to 0$ gives the result in (c), since $T_a < \infty$ if and only if $T_a < T_b$ for some $b$. To start to prove (d) we note that $\varphi(a) \to \infty$ as $a \to -\infty$, so $P(T_b < \infty) = 1$. For the second conclusion, we note that $X_n = S_n - (p - q)n$ is a martingale. Since $T_b \wedge n$ is a bounded stopping time, Theorem 4.4.1 implies
$$0 = E(S_{T_b \wedge n} - (p - q)(T_b \wedge n))$$
Now $b \ge S_{T_b \wedge n} \ge \min_m S_m$ and (c) implies $E(\inf_m S_m) > -\infty$, so the dominated convergence theorem implies $ES_{T_b \wedge n} \to ES_{T_b}$ as $n \to \infty$. The monotone convergence theorem implies $E(T_b \wedge n) \uparrow ET_b$, so we have $b = (p - q)ET_b$.
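Part (b) of Theorem 4.7.7 can be cross-checked without martingales. The hitting probability $h(x) = P_x(T_a < T_b)$ satisfies the first-step equations $h(x) = p\,h(x+1) + q\,h(x-1)$ for $a < x < b$ with $h(a) = 1$, $h(b) = 0$, and the sketch below solves this linear system exactly and compares $h(0)$ with the formula. The function names, the Gauss-Jordan solver, and the parameters $p = 2/3$, $a = -3$, $b = 4$ are assumptions chosen for illustration.

```python
from fractions import Fraction

def hit_a_before_b_formula(p, a, b):
    """P(T_a < T_b) from Theorem 4.7.7(b), with phi(x) = ((1-p)/p)**x."""
    r = (1 - p) / p
    phi = lambda x: r ** x
    return (phi(b) - phi(0)) / (phi(b) - phi(a))

def hit_a_before_b_linear(p, a, b):
    """Solve h(x) = p h(x+1) + q h(x-1), h(a) = 1, h(b) = 0; return h(0)."""
    q = 1 - p
    xs = list(range(a + 1, b))              # interior states
    n = len(xs)
    M = [[Fraction(0)] * (n + 1) for _ in range(n)]   # augmented matrix
    for i, x in enumerate(xs):
        M[i][i] = Fraction(1)
        if x + 1 < b:
            M[i][i + 1] = -p                # h(b) = 0 contributes nothing
        if x - 1 > a:
            M[i][i - 1] = -q
        else:
            M[i][n] += q                    # boundary value h(a) = 1
    for c in range(n):                      # Gauss-Jordan elimination
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return M[xs.index(0)][n]

p = Fraction(2, 3)                          # hypothetical parameter, 1/2 < p < 1
print(hit_a_before_b_formula(p, -3, 4), hit_a_before_b_linear(p, -3, 4))
```

Taking $b$ large in `hit_a_before_b_formula` gives values approaching $\{(1-p)/p\}^{-a}$, the limit used to prove (c).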
Remark. The reader should study the technique in this proof of (d) because it is
useful in a number of situations (e.g., the exercises below). We apply Theorem 4.4.1 to
the bounded stopping time Tb ∧ n, then let n → ∞, and use appropriate convergence
theorems. Here this is an alternative to showing that XTb ∧n is uniformly integrable.
Exercise 4.7.2. Let $S_n$ be an asymmetric simple random walk with $1/2 < p < 1$, and let $\sigma^2 = \mathrm{var}(\xi_i) = 1 - (p - q)^2 = 4pq$. Use the fact that $X_n = (S_n - (p - q)n)^2 - \sigma^2 n$ is a martingale to show $\mathrm{var}(T_b) = b\sigma^2/(p - q)^3$.
Exercise 4.7.3. Let $S_n$ be a symmetric simple random walk starting at 0, and let
$T = \inf\{n : S_n \notin (-a, a)\}$ where a is an integer. (i) Use the fact that $S_n^2 - n$ is a martingale
to show that $ET = a^2$. (ii) Find constants b and c so that $Y_n = S_n^4 - 6nS_n^2 + bn^2 + cn$
is a martingale, and use this to compute $ET^2$.
The last ﬁve exercises are devoted to the study of exponential martingales.
Exercise 4.7.4. Suppose $\xi_i$ is not constant. Let $\varphi(\theta) = E\exp(\theta\xi_1) < \infty$ for
$\theta \in (-\delta, \delta)$, and let $\psi(\theta) = \log\varphi(\theta)$. (i) $X_n^\theta = \exp(\theta S_n - n\psi(\theta))$ is a martingale. (ii) $\psi$ is
strictly convex. (iii) Show $E\sqrt{X_n^\theta} \to 0$ and conclude that $X_n^\theta \to 0$ a.s.
Exercise 4.7.5. Let $S_n$ be asymmetric simple random walk with p ≥ 1/2. Let
$T_1 = \inf\{n : S_n = 1\}$. Use the martingale of Exercise 4.7.4 to conclude (i) if θ > 0 then
$1 = e^\theta E\varphi(\theta)^{-T_1}$, where $\varphi(\theta) = pe^\theta + qe^{-\theta}$ and q = 1 − p. (ii) Set $pe^\theta + qe^{-\theta} = 1/s$
and then solve for $x = e^{-\theta}$ to get
$$Es^{T_1} = (1 - \{1 - 4pqs^2\}^{1/2})/2qs$$
Exercise 4.7.6. Suppose $\varphi(\theta_o) = E\exp(\theta_o\xi_1) = 1$ for some $\theta_o < 0$ and $\xi_i$ is not
constant. It follows from the result in Exercise 4.7.4 that $X_n = \exp(\theta_o S_n)$ is a
martingale. Let $T = \inf\{n : S_n \notin (a, b)\}$ and $Y_n = X_{n\wedge T}$. Use Theorem 4.7.4 to
conclude that $EX_T = 1$ and $P(S_T \le a) \le \exp(-\theta_o a)$.
Exercise 4.7.7. Suppose the $\xi_i$ are integer valued with $P(\xi_i < -1) = 0$ and $E\xi_i > 0$.
Show that ϕ(θo ) = E exp(θo ξ1 ) = 1 for some θo < 0. Use the martingale Xn =
exp(θo Sn ) to conclude that P (ST ≤ a) = exp(−θo a).
Exercise 4.7.8. Let Sn be the total assets of an insurance company at the end of
year n. In year n, premiums totaling c > 0 are received and claims ζn are paid where
ζn is Normal(µ, σ 2 ) and µ < c. To be precise, if ξn = c − ζn then Sn = Sn−1 + ξn . The
company is ruined if its assets drop to 0 or less. Show that if S0 > 0 is nonrandom,
then
P ( ruin ) ≤ exp(−2(c − µ)S0 /σ 2 )
Exercise 4.7.9. Let $Z_n$ be a branching process with offspring distribution $p_k$, defined
in part d of Section 4.3, and let $\varphi(\theta) = \sum_k p_k\theta^k$. Suppose ρ < 1 has $\varphi(\rho) = \rho$. Show
that $\rho^{Z_n}$ is a martingale and use this to conclude $P(Z_n = 0 \text{ for some } n \ge 1 \mid Z_0 = x) = \rho^x$.

Chapter 5 Markov Chains

The main object of study in this chapter is (temporally homogeneous) Markov chains
on a countable state space S. That is, a sequence of r.v.'s $X_n$, n ≥ 0, with
$$P(X_{n+1} = j \mid \mathcal{F}_n) = p(X_n, j)$$
where $\mathcal{F}_n = \sigma(X_0, \ldots, X_n)$, $p(i, j) \ge 0$ and $\sum_j p(i, j) = 1$. The theory focuses on the
asymptotic behavior of $p^n(i, j) \equiv P(X_n = j \mid X_0 = i)$. The basic results are that
$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} p^m(i, j) \quad \text{exists always}$$
and under a mild assumption called aperiodicity:
$$\lim_{n\to\infty} p^n(i, j) \quad \text{exists}$$
In nice situations, i.e., $X_n$ is irreducible and positive recurrent, the limits above are
a probability distribution that is independent of the starting state i. In words, the
chain converges to equilibrium as n → ∞. One of the attractions of Markov chain
theory is that these powerful conclusions come out of assumptions that are satisfied
in a large number of examples.

5.1 Definitions

Let $(S, \mathcal{S})$ be a measurable space.
A function $p : S \times \mathcal{S} \to \mathbb{R}$ is said to be a transition probability if:
(i) For each $x \in S$, $A \to p(x, A)$ is a probability measure on $(S, \mathcal{S})$.
(ii) For each $A \in \mathcal{S}$, $x \to p(x, A)$ is a measurable function.
We say $X_n$ is a Markov chain (w.r.t. $\mathcal{F}_n$) with transition probability p if
$$P(X_{n+1} \in B \mid \mathcal{F}_n) = p(X_n, B)$$
Given a transition probability p and an initial distribution µ on $(S, \mathcal{S})$, we can
define a consistent set of finite dimensional distributions by
$$P(X_j \in B_j,\ 0 \le j \le n) = \int_{B_0} \mu(dx_0) \int_{B_1} p(x_0, dx_1) \cdots \int_{B_n} p(x_{n-1}, dx_n) \tag{5.1.1}$$
If we suppose that $(S, \mathcal{S})$ is nice, Kolmogorov's extension theorem, Theorem 1.4.13, allows us to construct a probability measure $P_\mu$ on sequence space $(S^{\{0,1,\ldots\}}, \mathcal{S}^{\{0,1,\ldots\}})$
so that the coordinate maps $X_n(\omega) = \omega_n$ have the desired distributions.
Notation. When µ = δx , a point mass at x, we use Px as an abbreviation for Pδx .
The measures Px are the basic objects because, once they are deﬁned, we can deﬁne
the Pµ (even for inﬁnite measures µ) by
$$P_\mu(A) = \int \mu(dx)\, P_x(A)$$
Our next step is to show
Theorem 5.1.1. Xn is a Markov chain (with respect to Fn = σ (X0 , X1 , . . . , Xn ))
with transition probability p.
Proof. To prove this, we let $A = \{X_0 \in B_0, X_1 \in B_1, \ldots, X_n \in B_n\}$, $B_{n+1} = B$, and
observe that using the definition of the integral, the definition of A, and the definition
of $P_\mu$
$$\int_A 1_{(X_{n+1} \in B)}\, dP_\mu = P_\mu(A, X_{n+1} \in B)$$
$$= P_\mu(X_0 \in B_0, X_1 \in B_1, \ldots, X_n \in B_n, X_{n+1} \in B)$$
$$= \int_{B_0} \mu(dx_0) \int_{B_1} p(x_0, dx_1) \cdots \int_{B_n} p(x_{n-1}, dx_n)\, p(x_n, B_{n+1})$$
We would like to assert that the last expression is
$$= \int_A p(X_n, B)\, dP_\mu$$
To do this, replace $p(x_n, B_{n+1})$ by a general function $f(x_n)$. If f is an indicator function,
the desired equality is true. Linearity implies that it is valid for simple functions, and
the bounded convergence theorem implies that it is valid for bounded measurable f,
e.g., $f(x) = p(x, B_{n+1})$.
The collection of sets for which
$$\int_A 1_{(X_{n+1} \in B)}\, dP_\mu = \int_A p(X_n, B)\, dP_\mu$$
holds is a λ-system, and the collection for which it has been proved is a π-system,
so it follows from the π − λ theorem, Theorem 1.4.2, that the equality is true for all
$A \in \mathcal{F}_n$. This shows that
$$P(X_{n+1} \in B \mid \mathcal{F}_n) = p(X_n, B)$$
and proves the desired result.
At this point, we have shown that given a sequence of transition probabilities and
an initial distribution, we can construct a Markov chain. Conversely,
Theorem 5.1.2. If $X_n$ is a Markov chain with transition probabilities p and initial
distribution µ, then the finite dimensional distributions are given by (5.1.1).
Proof. Our first step is to show that if $X_n$ has transition probability p then for any
bounded measurable f
$$E(f(X_{n+1}) \mid \mathcal{F}_n) = \int p(X_n, dy) f(y) \tag{5.1.2}$$
The desired conclusion is a consequence of the next result. Let $\mathcal{H}$ = the collection of
bounded functions for which the identity holds.
Theorem 5.1.3. Monotone class theorem. Let $\mathcal{A}$ be a π-system that contains Ω
and let $\mathcal{H}$ be a collection of real-valued functions that satisfies:
(i) If $A \in \mathcal{A}$, then $1_A \in \mathcal{H}$.
(ii) If $f, g \in \mathcal{H}$, then $f + g$ and $cf \in \mathcal{H}$ for any real number c.
(iii) If $f_n \in \mathcal{H}$ are nonnegative and increase to a bounded function f, then $f \in \mathcal{H}$.
Then $\mathcal{H}$ contains all bounded functions measurable with respect to $\sigma(\mathcal{A})$.
Proof. The assumption $\Omega \in \mathcal{A}$, (ii), and (iii) imply that $\mathcal{G} = \{A : 1_A \in \mathcal{H}\}$ is a
λ-system so by (i) and the π − λ theorem, Theorem 1.4.2, $\mathcal{G} \supset \sigma(\mathcal{A})$. (ii) implies $\mathcal{H}$
contains all simple functions, and (iii) implies that $\mathcal{H}$ contains all bounded measurable
functions.
Returning to our main topic, we observe that familiar properties of conditional
expectation and (5.1.2) imply
$$E\left(\prod_{m=0}^{n} f_m(X_m)\right) = E\,E\left(\prod_{m=0}^{n} f_m(X_m) \,\Big|\, \mathcal{F}_{n-1}\right)$$
$$= E\left(\prod_{m=0}^{n-1} f_m(X_m)\, E(f_n(X_n) \mid \mathcal{F}_{n-1})\right)$$
$$= E\left(\prod_{m=0}^{n-1} f_m(X_m) \int p(X_{n-1}, dy) f_n(y)\right)$$
The last integral is a bounded measurable function of $X_{n-1}$, so it follows by induction
that if µ is the distribution of $X_0$, then
$$E\left(\prod_{m=0}^{n} f_m(X_m)\right) = \int \mu(dx_0) f_0(x_0) \int p(x_0, dx_1) f_1(x_1) \cdots \int p(x_{n-1}, dx_n) f_n(x_n) \tag{5.1.3}$$
that is, the finite dimensional distributions coincide with those in (5.1.1).
With Theorem 5.1.2 established, it follows that we can describe a Markov chain
by giving a transition probability p. Having done this, we can and will suppose that
the random variables $X_n$ are the coordinate maps ($X_n(\omega) = \omega_n$) on sequence space
$$(\Omega_o, \mathcal{F}) = (S^{\{0,1,\ldots\}}, \mathcal{S}^{\{0,1,\ldots\}})$$
We choose this representation because it gives us two advantages in investigating the
Markov chain: (i) For each initial distribution µ we have a measure $P_\mu$ defined by
(5.1.1) that makes $X_n$ a Markov chain with $P_\mu(X_0 \in A) = \mu(A)$. (ii) We have the
shift operators $\theta_n$ defined in Section 3.1: $(\theta_n\omega)(m) = \omega_{m+n}$.

5.2 Examples

Having introduced the framework in which we will investigate things, we can finally
give some more examples.
Example 5.2.1. Random walk. Let $\xi_1, \xi_2, \ldots \in \mathbb{R}^d$ be independent with distribution µ. Let $X_0 = x \in \mathbb{R}^d$ and let $X_n = X_0 + \xi_1 + \cdots + \xi_n$. Then $X_n$ is a Markov
chain with transition probability
$$p(x, A) = \mu(A - x)$$
where $A - x = \{y - x : y \in A\}$.
To prove this we will use an extension of Example 4.1.6.
Lemma 5.2.1. Let X and Y take values in $(S, \mathcal{S})$. Suppose $\mathcal{F}$ and Y are independent.
Let $X \in \mathcal{F}$, $\varphi$ be a function with $E|\varphi(X, Y)| < \infty$ and let $g(x) = E(\varphi(x, Y))$. Then
$$E(\varphi(X, Y) \mid \mathcal{F}) = g(X)$$
Proof. Suppose ﬁrst that φ(x, y ) = 1A (x)1B (y ) and let C ∈ F .
E (ϕ(X, Y ); C ) = P ({X ∈ A} ∩ C ∩ {Y ∈ B })
= P ({X ∈ A} ∩ C )P ({Y ∈ B })
since {X ∈ A} ∩ C ∈ F and {Y ∈ B } are independent. g (x) = 1A (x)P (Y ∈ B ), so
the above
= E (g (X ); C )
We now apply the monotone class theorem, Theorem 5.1.3. Let A be the subsets
of S × S of the form A × B with A, B ∈ S . A is a π system that contains Ω. Let H be
the collection of φ for which the result holds. We have shown (i). Properties (ii) and
(iii) follow from the bounded convergence theorem which completes the proof.
To get the desired result from Lemma 5.2.1, we let F = Fn , X = Xn , Y = ξn+1 ,
and φ(x, y ) = 1{x+y∈A} . In this case g (x) = µ(A − x) and the desired result follows.
In the next four examples, S is a countable set and $\mathcal{S}$ = all subsets of S. Let
$p(i, j) \ge 0$ and suppose $\sum_j p(i, j) = 1$ for all i. Intuitively, $p(i, j) = P(X_{n+1} = j \mid X_n = i)$.
From p(i, j) we can define a transition probability by
$$p(i, A) = \sum_{j \in A} p(i, j)$$
In each case, we will not be as formal in checking the Markov property, but simply
give the transition probability and leave the rest to the reader. The details are much
simpler because all we have to show is that
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = p(i, j)$$
and these are elementary conditional probabilities.
Example 5.2.2. Branching processes. S = {0, 1, 2, . . .}
$$p(i, j) = P\left(\sum_{m=1}^{i} \xi_m = j\right)$$
where $\xi_1, \xi_2, \ldots$ are i.i.d. nonnegative integer-valued random variables. In words, each
of the i individuals at time n (or in generation n) gives birth to an independent and
identically distributed number of offspring.
To make the connection with our earlier discussion of branching processes, do:
Exercise 5.2.1. Let Zn be the process deﬁned in (4.3.2). Check that Zn is a Markov
chain with the indicated transition probability.
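The transition probability in Example 5.2.2 is the law of a sum of i independent copies of the offspring variable, so row i can be computed by convolving the offspring distribution with itself i times. The following is a minimal sketch under that reading; the helper names and the offspring distribution used here are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def convolve(u, v):
    """Law of the sum of two independent nonnegative integer variables."""
    out = [Fraction(0)] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def branching_row(offspring, i):
    """Row i of p: the law of xi_1 + ... + xi_i (list index = value j)."""
    row = [Fraction(1)]  # the empty sum equals 0 with probability 1
    for _ in range(i):
        row = convolve(row, offspring)
    return row


# Hypothetical offspring law: P(xi=0)=1/4, P(xi=1)=1/2, P(xi=2)=1/4.
offspring = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

print(branching_row(offspring, 0))  # state 0 is absorbing: p(0, 0) = 1
print(branching_row(offspring, 2))  # p(2, j) for j = 0, ..., 4
```

Row 0 comes out as a point mass at 0, which matches the fact that a population of size 0 stays at 0.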
Example 5.2.3. Renewal chain. S = {0, 1, 2, . . .}, $f_k \ge 0$, and $\sum_{k=1}^{\infty} f_k = 1$.
$$p(0, j) = f_{j+1} \quad \text{for } j \ge 0$$
$$p(i, i - 1) = 1 \quad \text{for } i \ge 1$$
$$p(i, j) = 0 \quad \text{otherwise}$$
To explain the definition, let $\xi_1, \xi_2, \ldots$ be i.i.d. with $P(\xi_m = j) = f_j$, let $T_0 = i_0$ and
for k ≥ 1 let $T_k = T_{k-1} + \xi_k$. $T_k$ is the time of the kth arrival in a renewal process
that has its first arrival at time $i_0$. Let
$$Y_m = \begin{cases} 1 & \text{if } m \in \{T_0, T_1, T_2, \ldots\} \\ 0 & \text{otherwise} \end{cases}$$
and let $X_n = \inf\{m - n : m \ge n, Y_m = 1\}$. $Y_m = 1$ if a renewal occurs at time m,
and $X_n$ is the amount of time until the first renewal ≥ n.
An example should help clarify the definition:

Yn   0  0  0  1  0  0  1  1  0  0  0  0  1
Xn   3  2  1  0  2  1  0  0  4  3  2  1  0

It is clear that if $X_n = i > 0$ then $X_{n+1} = i - 1$. When $X_n = 0$, we have $T_{N_n} = n$,
where $N_n = \inf\{k : T_k \ge n\}$ is a stopping time, so Theorem 3.1.3 implies $\xi_{N_n+1}$
is independent of $\sigma(X_0, \xi_1, \ldots, \xi_{N_n}) \supset \sigma(X_0, \ldots, X_n)$. We have $p(0, j) = f_{j+1}$ since
$\xi_{N_n+1} = j + 1$ implies $X_{n+1} = j$.
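The definition $X_n = \inf\{m - n : m \ge n, Y_m = 1\}$ can be turned into a short backward scan over a finite record of renewal indicators. The sketch below (the function name is an assumption made for illustration) recomputes the $X_n$ row of the example from its $Y_n$ row.

```python
def residual_times(y):
    """Compute X_n = inf{m - n : m >= n, Y_m = 1} by scanning backward.

    Entries after the last renewal in the record are left as None,
    because the next renewal time is not visible in a finite record.
    """
    x = [None] * len(y)
    next_renewal = None
    for n in range(len(y) - 1, -1, -1):
        if y[n] == 1:
            next_renewal = n
        if next_renewal is not None:
            x[n] = next_renewal - n
    return x


# The Y_n row from the example in the text.
y = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(residual_times(y))
```

Each entry either drops by one from the previous entry or, right after a 0, jumps to the next interarrival time minus one, which is the behavior encoded in p(i, i − 1) = 1 and $p(0, j) = f_{j+1}$.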
Example 5.2.4. M/G/1 queue. In this model, customers arrive according to
a Poisson process with rate λ. (M is for Markov and refers to the fact that in a
Poisson process the number of arrivals in disjoint time intervals is independent.) Each
customer requires an independent amount of service with distribution F . (G is for
general service distribution. 1 indicates that there is one server.) Let Xn be the
number of customers waiting in the queue at the time the nth customer enters service.
To be precise, when X0 = x, the chain starts with x people waiting in line and
customer 0 just beginning her service.
To understand the definition the following picture is useful:
[Figure 5.1: Realization of the M/G/1 queue, with X0 = 0, ξ1 = 1, X1 = 1, ξ2 = −1, X2 = 0, ξ3 = −1, X3 = 0. Black dots indicate the times at which the customers enter service.]
To define our Markov chain $X_n$, let
$$a_k = \int_0^\infty e^{-\lambda t} \frac{(\lambda t)^k}{k!}\, dF(t)$$
be the probability that k customers arrive during a service time. Let $\xi_1, \xi_2, \ldots$ be
i.i.d. with $P(\xi_i = k - 1) = a_k$. We think of $\xi_i$ as the net number of customers to
arrive during the ith service time, subtracting one for the customer who completed
service, so we define $X_n$ by
$$X_{n+1} = (X_n + \xi_{n+1})^+ \tag{5.2.1}$$
The positive part only takes effect when $X_n = 0$ and $\xi_{n+1} = -1$ (e.g., $X_2 = 0$,
$\xi_3 = -1$) and reflects the fact that when the queue has size 0 and no one arrives
during the service time the next queue size is 0, since we do not start counting until
the next customer arrives and then the queue length will be 0.
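The recursion (5.2.1) is easy to iterate by hand or by machine. As a small illustration (the function name is hypothetical), the following sketch applies it to the increments shown in Figure 5.1, namely ξ1 = 1, ξ2 = −1, ξ3 = −1 starting from X0 = 0.

```python
def queue_path(x0, xis):
    """Iterate X_{n+1} = (X_n + xi_{n+1})^+ from (5.2.1)."""
    path = [x0]
    for xi in xis:
        # The positive part only matters when the sum would be -1.
        path.append(max(path[-1] + xi, 0))
    return path


# Increments read off Figure 5.1.
print(queue_path(0, [1, -1, -1]))
```

The last step is the case discussed above: the queue is empty, no one arrives during the service time, and the positive part keeps the state at 0.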
It is easy to see that the sequence defined in (5.2.1) is a Markov chain with transition
probability
$$p(0, 0) = a_0 + a_1$$
$$p(j, j - 1 + k) = a_k \quad \text{if } j \ge 1 \text{ or } k > 1$$
The formula for $a_k$ is rather complicated, and its exact form is not important, so we
will simplify things by assuming only that $a_k > 0$ for all k ≥ 0 and $\sum_{k \ge 0} a_k = 1$.
[Figure 5.2: Physical motivation for the Ehrenfest chain.]
Example 5.2.5. Ehrenfest chain. S = {0, 1, . . . , r}
$$p(k, k + 1) = (r - k)/r$$
$$p(k, k - 1) = k/r$$
$$p(i, j) = 0 \quad \text{otherwise}$$
In words, there is a total of r balls in two urns; k in the first and r − k in the second.
We pick one of the r balls at random and move it to the other urn. Ehrenfest used
this to model the division of air molecules between two chambers (of equal size and
shape) that are connected by a small hole. For an interesting account of this chain,
see Kac (1947a).
Example 5.2.6. Birth and death chains. S = {0, 1, 2, . . .}. These chains are
defined by the restriction p(i, j) = 0 when |i − j| > 1. The fact that these processes
cannot jump over any integers makes it particularly easy to compute things for them.
That should be enough examples for the moment. We conclude this section with
some simple calculations. For a Markov chain on a countable state space, (5.1.1) says
$$P_\mu(X_k = i_k,\ 0 \le k \le n) = \mu(i_0) \prod_{m=1}^{n} p(i_{m-1}, i_m)$$
When n = 1
$$P_\mu(X_1 = j) = \sum_i \mu(i) p(i, j) = \mu p(j)$$
i.e., the product of the row vector µ with the matrix p. When n = 2,
$$P_i(X_2 = k) = \sum_j p(i, j) p(j, k) = p^2(i, k)$$
i.e., the second power of the matrix p. Combining the two formulas and generalizing
$$P_\mu(X_n = j) = \sum_i \mu(i) p^n(i, j) = \mu p^n(j)$$
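These formulas translate directly into code for a finite state space: the probability of a path is a product along the path, and the distribution at time n is obtained by multiplying the row vector µ by the matrix p a total of n times. The sketch below assumes a two-state chain of the form that appears in Exercise 5.2.3, with the illustrative values α = 1/4 and β = 1/2; the function names are hypothetical.

```python
from fractions import Fraction


def path_probability(mu, p, path):
    """P_mu(X_k = i_k, 0 <= k <= n) = mu(i_0) * prod p(i_{m-1}, i_m)."""
    prob = mu[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= p[i][j]
    return prob


def distribution_at(mu, p, n):
    """Row vector mu p^n, computed by n vector-matrix products."""
    size = len(p)
    for _ in range(n):
        mu = [sum(mu[i] * p[i][j] for i in range(size)) for j in range(size)]
    return mu


# Illustrative two-state chain on {0, 1} (values are assumptions).
alpha, beta = Fraction(1, 4), Fraction(1, 2)
p = [[1 - alpha, alpha], [beta, 1 - beta]]
mu = [Fraction(1), Fraction(0)]  # start at state 0

print(path_probability(mu, p, [0, 0, 1]))
print(distribution_at(mu, p, 3))
```

Multiplying the vector n times avoids forming the matrix power explicitly, which is enough when only the distribution from one initial law is wanted.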
Exercises
5.2.2. Suppose S = {1, 2, 3} and
$$p = \begin{pmatrix} .1 & 0 & .9 \\ .7 & .3 & 0 \\ 0 & .4 & .6 \end{pmatrix}$$
Compute $p^2(1, 2)$ and $p^3(2, 3)$ by considering the different ways to get from 1 to 2 in
two steps and from 2 to 3 in three steps.
5.2.3. Suppose S = {0, 1} and
$$p = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}$$
Use induction to show that
$$P_\mu(X_n = 0) = \frac{\beta}{\alpha + \beta} + (1 - \alpha - \beta)^n \left\{\mu(0) - \frac{\beta}{\alpha + \beta}\right\}$$
5.2.4. Let $\xi_0, \xi_1, \ldots$ be i.i.d. ∈ {H, T}, taking each value with probability 1/2. Show
that Xn = (ξn , ξn+1 ) is a Markov chain and compute its transition probability p.
What is p2 ?
5.2.5. Brother-sister mating. In this scheme, two animals are mated, and among
their direct descendants two individuals of opposite sex are selected at random. These
animals are mated and the process continues. Suppose each individual can be one of
three genotypes AA, Aa, aa, and suppose that the type of the offspring is determined
by selecting a letter from each parent. With these rules, the pair of genotypes in the
nth generation is a Markov chain with six states:
AA, AA    AA, Aa    AA, aa    Aa, Aa    Aa, aa    aa, aa
Compute its transition probability.
5.2.6. Bernoulli-Laplace model of diffusion. Suppose two urns, which we will
call left and right, have m balls each. b (which we will assume is ≤ m) balls are black,
and 2m − b are white. At each time, we pick one ball from each urn and interchange
them. Let the state at time n be the number of black balls in the left urn. Compute
the transition probability.
5.2.7. Let ξ1 , ξ2 , . . . be i.i.d. ∈ {1, 2, . . . , N } and taking each value with probability
1/N . Show that Xn = {ξ1 , . . . , ξn } is a Markov chain and compute its transition
probability.
5.2.8. Let ξ1 , ξ2 , . . . be i.i.d. ∈ {−1, 1}, taking each value with probability 1/2. Let
$S_0 = 0$, $S_n = \xi_1 + \cdots + \xi_n$ and $X_n = \max\{S_m : 0 \le m \le n\}$. Show that $X_n$ is not a
Markov chain.
5.2.9. Let θ, U1 , U2 , ... be independent and uniform on (0, 1). Let Xi = 1 if Ui ≤ θ,
= −1 if Ui > θ, and let Sn = X1 + · · · + Xn . In words, we ﬁrst pick θ according to
the uniform distribution and then ﬂip a coin with probability θ of heads to generate
a random walk. Compute $P(X_{n+1} = 1 \mid X_1, \ldots, X_n)$ and conclude $S_n$ is a temporally
inhomogeneous Markov chain. This is due to the fact that "$S_n$ is a sufficient statistic
for estimating θ."

5.3 Extensions of the Markov Property

If $X_n$ is a Markov chain with transition probability p, then by definition,
$$P(X_{n+1} \in B \mid \mathcal{F}_n) = p(X_n, B)$$
In this section, we will prove two extensions of the last equality in which {Xn+1 ∈ B }
is replaced by a bounded function of the future, h(Xn , Xn+1 , . . .), and n is replaced by
a stopping time N . These results, especially the second, will be the keys to developing
the theory of Markov chains.
As mentioned in Section 5.1, we can and will suppose that the Xn are the coordinate maps on sequence space
(Ωo , F ) = (S {0,1,...} , S {0,1,...} )
Fn = σ (X0 , X1 , . . . , Xn ), and for each initial distribution µ we have a measure Pµ
deﬁned by (5.1.1) that makes Xn a Markov chain with Pµ (X0 ∈ A) = µ(A). Deﬁne
the shift operators θn : Ωo → Ωo by (θn ω )(m) = ω (m + n).
Theorem 5.3.1. The Markov property. Let Y : Ωo → R be bounded and measurable.
$$E_\mu(Y \circ \theta_n \mid \mathcal{F}_n) = E_{X_n} Y$$
Remark. Here the subscript µ on the left-hand side indicates that the conditional
expectation is taken with respect to $P_\mu$. The right-hand side is the function $\varphi(x) = E_x Y$
evaluated at $x = X_n$. To make the connection with the introduction of this
section, let
Y (ω ) = h(ω0 , ω1 , . . .)
We denote the function by Y , a letter usually used for random variables, because
that’s exactly what Y is, a measurable function deﬁned on our probability space Ωo .
Proof. We begin by proving the result in a special case and then use the π − λ and
monotone class theorems to get the general result. Let $A = \{\omega : \omega_0 \in A_0, \ldots, \omega_m \in A_m\}$
and $g_0, \ldots, g_n$ be bounded and measurable. Applying (5.1.3) with $f_k = 1_{A_k}$ for
k < m, $f_m = 1_{A_m} g_0$, and $f_k = g_{k-m}$ for m < k ≤ m + n gives
$$E_\mu\left(\prod_{k=0}^{n} g_k(X_{m+k}); A\right) = \int_{A_0} \mu(dx_0) \int_{A_1} p(x_0, dx_1) \cdots \int_{A_m} p(x_{m-1}, dx_m)\, g_0(x_m)$$
$$\cdot \int p(x_m, dx_{m+1}) g_1(x_{m+1}) \cdots \int p(x_{m+n-1}, dx_{m+n}) g_n(x_{m+n})$$
$$= E_\mu\left(E_{X_m}\left(\prod_{k=0}^{n} g_k(X_k)\right); A\right)$$
The collection of sets for which the last formula holds is a λ-system, and the collection
for which it has been proved is a π-system, so using the π − λ theorem, Theorem 1.4.2,
shows that the last identity holds for all $A \in \mathcal{F}_m$.
Fix $A \in \mathcal{F}_m$ and let $\mathcal{H}$ be the collection of bounded measurable Y for which
$$(*) \quad E_\mu(Y \circ \theta_m; A) = E_\mu(E_{X_m} Y; A)$$
The last computation shows that (∗) holds when
$$Y(\omega) = \prod_{0 \le k \le n} g_k(\omega_k)$$
To finish the proof, we will apply the monotone class theorem, Theorem 5.1.3. Let $\mathcal{A}$
be the collection of sets of the form $\{\omega : \omega_0 \in A_0, \ldots, \omega_k \in A_k\}$. $\mathcal{A}$ is a π-system, so
taking $g_k = 1_{A_k}$ shows (i) holds. $\mathcal{H}$ clearly has properties (ii) and (iii), so Theorem
5.1.3 implies that $\mathcal{H}$ contains the bounded functions measurable w.r.t. $\sigma(\mathcal{A})$, and the
proof is complete.
Exercise 5.3.1. Use the Markov property to show that if $A \in \sigma(X_0, \ldots, X_n)$ and
$B \in \sigma(X_n, X_{n+1}, \ldots)$, then for any initial distribution µ
$$P_\mu(A \cap B \mid X_n) = P_\mu(A \mid X_n) P_\mu(B \mid X_n)$$
In words, the past and future are conditionally independent given the present.
Hint: Write the left-hand side as $E_\mu(E_\mu(1_A 1_B \mid \mathcal{F}_n) \mid X_n)$.
The next two results illustrate the use of Theorem 5.3.1. We will see many other
applications below.
Theorem 5.3.2. Chapman-Kolmogorov equation.
$$P_x(X_{m+n} = z) = \sum_y P_x(X_m = y) P_y(X_n = z)$$
Proof. $P_x(X_{n+m} = z) = E_x(P_x(X_{n+m} = z \mid \mathcal{F}_m)) = E_x(P_{X_m}(X_n = z))$ by the Markov
property, Theorem 5.3.1, since $1_{(X_n = z)} \circ \theta_m = 1_{(X_{n+m} = z)}$.
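On a finite state space the Chapman-Kolmogorov equation says that the (m + n)-step transition matrix is the product of the m-step and n-step matrices. The following sketch checks this with exact arithmetic on a hypothetical three-state transition matrix; the matrix and the helper names are assumptions made for illustration only.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(p, n):
    """n-step transition matrix p^n (p^0 is the identity)."""
    size = len(p)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Hypothetical transition matrix on three states; each row sums to 1.
F = Fraction
p = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 4), F(1, 2), F(1, 4)],
    [F(0), F(1, 2), F(1, 2)],
]

m, n = 2, 3
lhs = mat_pow(p, m + n)                      # entries P_x(X_{m+n} = z)
rhs = mat_mul(mat_pow(p, m), mat_pow(p, n))  # sum over y of p^m(x,y) p^n(y,z)
print(lhs == rhs)
```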
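Theorem 5.3.3. Let $X_n$ be a Markov chain and suppose
$$P\left(\bigcup_{m=n+1}^{\infty} \{X_m \in B_m\} \,\Big|\, X_n\right) \ge \delta > 0 \quad \text{on } \{X_n \in A_n\}$$
Then $P(\{X_n \in A_n \text{ i.o.}\} - \{X_n \in B_n \text{ i.o.}\}) = 0$.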
Remark. To quote Chung, “The intuitive meaning of the preceding theorem has
been given by Doeblin as follows: if the chance of a pedestrian’s getting run over is
greater than δ > 0 each time he crosses a certain street, then he will not be crossing
it indeﬁnitely (since he will be killed ﬁrst)!”
Proof. Let Λn = {Xn+1 ∈ Bn+1 } ∪ {Xn+2 ∈ Bn+2 } ∪ . . .
Λ = ∩Λn = {Xn ∈ Bn i.o.}
and Γ = {Xn ∈ An i.o.}. Let Fn = σ (X0 , X1 , . . . , Xn ) and F∞ = σ (∪Fn ). Using
the Markov property and the dominated convergence theorem for conditional expectations, Theorem 4.5.9,
$$E(1_{\Lambda_n} \mid X_n) = E(1_{\Lambda_n} \mid \mathcal{F}_n) \to E(1_\Lambda \mid \mathcal{F}_\infty) = 1_\Lambda$$
On Γ, the left-hand side is ≥ δ i.o. This is only possible if Γ ⊂ Λ.
Exercise 5.3.2. A state a is called absorbing if Pa (X1 = a) = 1. Let D = {Xn = a
for some n ≥ 1} and let h(x) = Px (D). (i) Use Theorem 5.3.3 to conclude that
h(Xn ) → 0 a.s. on Dc . Here a.s. means Pµ a.s. for any initial distribution µ. (ii)
Obtain the result in Exercise 4.5.5 as a special case.
We are now ready for our second extension of the Markov property. Recall N is
said to be a stopping time if $\{N = n\} \in \mathcal{F}_n$. As in Chapter 3, let
$$\mathcal{F}_N = \{A : A \cap \{N = n\} \in \mathcal{F}_n \text{ for all } n\}$$
be the information known at time N, and let
$$\theta_N\omega = \begin{cases} \theta_n\omega & \text{on } \{N = n\} \\ \Delta & \text{on } \{N = \infty\} \end{cases}$$
where ∆ is an extra point that we add to $\Omega_o$. In the next result and its applications,
we will explicitly restrict our attention to {N < ∞}, so the reader does not have to
worry about the second part of the definition of $\theta_N$.
Theorem 5.3.4. Strong Markov property. Suppose that for each n, $Y_n : \Omega_o \to \mathbb{R}$
is measurable and $|Y_n| \le M$ for all n. Then
$$E_\mu(Y_N \circ \theta_N \mid \mathcal{F}_N) = E_{X_N} Y_N \quad \text{on } \{N < \infty\}$$
where the right-hand side is $\varphi(x, n) = E_x Y_n$ evaluated at $x = X_N$, n = N.
Proof. Let $A \in \mathcal{F}_N$. Breaking things down according to the value of N,
$$E_\mu(Y_N \circ \theta_N; A \cap \{N < \infty\}) = \sum_{n=0}^{\infty} E_\mu(Y_n \circ \theta_n; A \cap \{N = n\})$$
Since $A \cap \{N = n\} \in \mathcal{F}_n$, using Theorem 5.3.1 now converts the right side into
$$\sum_{n=0}^{\infty} E_\mu(E_{X_n} Y_n; A \cap \{N = n\}) = E_\mu(E_{X_N} Y_N; A \cap \{N < \infty\})$$
Remark. The reader should notice that the proof is trivial. All we do is break things
down according to the value of N , replace N by n, apply the Markov property, and
reverse the process. This is the standard technique for proving results about stopping
times.
The next example illustrates the use of Theorem 5.3.4, and explains why we want
to allow the Y that we apply to the shifted path to depend on n.
Theorem 5.3.5. Reﬂection principle. Let ξ1 , ξ2 , . . . be independent and identically
distributed with a distribution that is symmetric about 0. Let Sn = ξ1 + · · · + ξn . If
a > 0 then
$$P\left(\sup_{m \le n} S_m > a\right) \le 2P(S_n > a)$$
Remark. First, a trivial comment: The strictness of the inequality is not important.
If the result holds for >, it holds for ≥ and vice versa.
A second more important one: We do the proof in two steps because that is how
formulas like this are derived in practice. First, one computes intuitively and then
figures out how to extract the desired formula from Theorem 5.3.4.
Proof in words. First note that if Z has a distribution that is symmetric about 0,
then
$$P(Z \ge 0) \ge P(Z > 0) + \tfrac{1}{2} P(Z = 0) = \tfrac{1}{2}$$
If we let $N = \inf\{m \le n : S_m > a\}$ (with inf ∅ = ∞), then on {N < ∞}, $S_n - S_N$ is
independent of $S_N$ and has $P(S_n - S_N \ge 0) \ge 1/2$. So
$$P(S_n > a) \ge \tfrac{1}{2} P(N \le n)$$
Proof. Let $Y_m(\omega) = 1$ if m ≤ n and $\omega_{n-m} > a$, $Y_m(\omega) = 0$ otherwise. The definition
of $Y_m$ is chosen so that $(Y_N \circ \theta_N)(\omega) = 1$ if $\omega_n > a$ (and hence N ≤ n), and = 0
otherwise. The strong Markov property implies
$$E_0(Y_N \circ \theta_N \mid \mathcal{F}_N) = E_{S_N} Y_N \quad \text{on } \{N < \infty\} = \{N \le n\}$$
To evaluate the right-hand side, we note that if y > a, then
$$E_y Y_m = P_y(S_{n-m} > a) \ge P_y(S_{n-m} \ge y) \ge 1/2$$
So integrating over {N ≤ n} and using the definition of conditional expectation gives
$$\tfrac{1}{2} P(N \le n) \le E_0(E_0(Y_N \circ \theta_N \mid \mathcal{F}_N); N \le n) = E_0(Y_N \circ \theta_N; N \le n)$$
since $\{N \le n\} \in \mathcal{F}_N$. Recalling that $Y_N \circ \theta_N = 1_{\{S_n > a\}}$, the last quantity
$$= E_0(1_{\{S_n > a\}}; N \le n) = P_0(S_n > a)$$
since $\{S_n > a\} \subset \{N \le n\}$.
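For a concrete check of the reflection principle, one can take the special case of steps ±1 with probability 1/2 each (an assumption; the theorem covers any symmetric step distribution) and enumerate all $2^n$ paths. The sketch below, with a hypothetical function name, returns both sides of the inequality as exact fractions.

```python
from fractions import Fraction
from itertools import product


def reflection_check(n, a):
    """Return (P(max_{m<=n} S_m > a), 2 P(S_n > a)) for the +/-1 walk."""
    exceeded = ended_above = 0
    for steps in product((-1, 1), repeat=n):
        s = running_max = 0
        for step in steps:
            s += step
            running_max = max(running_max, s)
        exceeded += running_max > a   # the path went above a at some time
        ended_above += s > a          # the path is above a at time n
    total = 2 ** n
    return Fraction(exceeded, total), 2 * Fraction(ended_above, total)


for a in range(1, 6):
    lhs, rhs = reflection_check(8, a)
    print(a, lhs, rhs, lhs <= rhs)
```

Enumeration is only feasible for small n, but it makes the two events in the proof, {N ≤ n} and {S_n > a}, easy to inspect path by path.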
Exercises
The next five exercises concern the hitting times
$$\tau_A = \inf\{n \ge 0 : X_n \in A\} \qquad \tau_y = \tau_{\{y\}}$$
$$T_A = \inf\{n \ge 1 : X_n \in A\} \qquad T_y = T_{\{y\}}$$
To keep the two definitions straight, note that the symbol τ is smaller than T. Some
of the results below are valid for a general S but for simplicity
we will suppose throughout that S is countable.
5.3.3. First entrance decomposition. Let $T_y = \inf\{n \ge 1 : X_n = y\}$. Show that
$$p^n(x, y) = \sum_{m=1}^{n} P_x(T_y = m)\, p^{n-m}(y, y)$$
5.3.4. Show that $\sum_{m=0}^{n} P_x(X_m = x) \ge \sum_{m=k}^{n+k} P_x(X_m = x)$.
5.3.5. Suppose that S − C is finite and for each x ∈ S − C, $P_x(\tau_C < \infty) > 0$. Then
there is an N < ∞ and ε > 0 so that $P_y(\tau_C > kN) \le (1 - \epsilon)^k$.
5.3.6. Let $h(x) = P_x(\tau_A < \tau_B)$. Suppose A ∩ B = ∅, S − (A ∪ B) is finite, and
$P_x(\tau_{A \cup B} < \infty) > 0$ for all x ∈ S − (A ∪ B). (i) Show that
$$(*) \quad h(x) = \sum_y p(x, y) h(y) \quad \text{for } x \notin A \cup B$$
(ii) Show that if h satisfies (∗) then $h(X(n \wedge \tau_{A \cup B}))$ is a martingale. (iii) Use this and
Exercise 5.3.5 to conclude that $h(x) = P_x(\tau_A < \tau_B)$ is the only solution of (∗) that is
1 on A and 0 on B.
5.3.7. Let $X_n$ be a Markov chain with S = {0, 1, . . . , N} and suppose that $X_n$ is a
martingale and $P_x(\tau_0 \wedge \tau_N < \infty) > 0$ for all x. (i) Show that 0 and N are absorbing
states, i.e., p(0, 0) = p(N, N) = 1. (ii) Show $P_x(\tau_N < \tau_0) = x/N$.
5.3.8. Wright-Fisher model. Suppose S = {0, 1, . . . , N} and consider
$$p(i, j) = \binom{N}{j} (i/N)^j (1 - i/N)^{N-j}$$
Show that this chain satisfies the hypotheses of Exercise 5.3.7.
5.3.9. In brother-sister mating described in Exercise 5.2.5, AA, AA and aa, aa are
absorbing states. Show that the number of A's in the pair is a martingale and use
this to compute the probability of getting absorbed in AA, AA starting from each of
the states.
5.3.10. Let $\tau_A = \inf\{n \ge 0 : X_n \in A\}$ and $g(x) = E_x\tau_A$. Suppose that S − A is finite
and for each x ∈ S − A, $P_x(\tau_A < \infty) > 0$. (i) Show that
$$(*) \quad g(x) = 1 + \sum_y p(x, y) g(y) \quad \text{for } x \notin A$$
(ii) Show that if g satisfies (∗), $g(X(n \wedge \tau_A)) + n \wedge \tau_A$ is a martingale. (iii) Use this
to conclude that $g(x) = E_x\tau_A$ is the only solution of (∗) that is 0 on A.
5.3.11. Let ξ0 , ξ1 , . . . be i.i.d. ∈ {H, T }, taking each value with probability 1/2, and
let Xn = (ξn , ξn+1 ) be the Markov chain from Exercise 5.2.4. Let N1 = inf {n ≥ 0 :
(ξn , ξn+1 ) = (H, H )}. Use the results in the last exercise to compute EN1 . [No, there
is no missing subscript on E , but you will need to ﬁrst compute g (x).]
5.3.12. Consider simple random walk $S_n$, the Markov chain with p(x, x + 1) = 1/2
and p(x, x − 1) = 1/2. Let $\tau = \min\{n : S_n \notin (0, N)\}$. Use the result from Exercise to
show that $E_x\tau = x(N - x)$.

5.4 Recurrence and Transience

In this section and the next two, we will consider only Markov chains on a countable
state space. Let $T_y^0 = 0$, and for k ≥ 1, let
$$T_y^k = \inf\{n > T_y^{k-1} : X_n = y\}$$
$T_y^k$ is the time of the kth return to y. The reader should note that $T_y^1 > 0$ so any
visit at time 0 does not count. We adopt this convention so that if we let $T_y = T_y^1$
and $\rho_{xy} = P_x(T_y < \infty)$, then
Theorem 5.4.1. $P_x(T_y^k < \infty) = \rho_{xy}\rho_{yy}^{k-1}$.
Intuitively, in order to make k visits to y, we first have to go from x to y and then
return k − 1 times to y.
Proof. When k = 1, the result is trivial, so we suppose k ≥ 2. Let Y(ω) = 1 if $\omega_n = y$
for some n ≥ 1, Y(ω) = 0 otherwise. If $N = T_y^{k-1}$ then $Y \circ \theta_N = 1$ if $T_y^k < \infty$. The
strong Markov property, Theorem 5.3.4, implies
$$E_x(Y \circ \theta_N \mid \mathcal{F}_N) = E_{X_N} Y \quad \text{on } \{N < \infty\}$$
On {N < ∞}, $X_N = y$, so the right-hand side is $P_y(T_y < \infty) = \rho_{yy}$, and it follows
that
$$P_x(T_y^k < \infty) = E_x(Y \circ \theta_N; N < \infty) = E_x(E_x(Y \circ \theta_N \mid \mathcal{F}_N); N < \infty)$$
$$= E_x(\rho_{yy}; N < \infty) = \rho_{yy} P_x(T_y^{k-1} < \infty)$$
The result now follows by induction.
A state y is said to be recurrent if $\rho_{yy} = 1$ and transient if $\rho_{yy} < 1$. If y is
recurrent, Theorem 5.4.1 implies $P_y(T_y^k < \infty) = 1$ for all k, so $P_y(X_n = y \text{ i.o.}) = 1$.
Exercise 5.4.1. Suppose y is recurrent and for k ≥ 0, let $R_k = T_y^k$ be the time
of the kth return to y, and for k ≥ 1 let $r_k = R_k - R_{k-1}$ be the kth interarrival
time. Use the strong Markov property to conclude that under $P_y$, the vectors
$v_k = (r_k, X_{R_{k-1}}, \ldots, X_{R_k - 1})$, k ≥ 1 are i.i.d.
If y is transient and we let $N(y) = \sum_{n=1}^{\infty} 1_{(X_n = y)}$ be the number of visits to y at
positive times, then
$$E_x N(y) = \sum_{k=1}^{\infty} P_x(N(y) \ge k) = \sum_{k=1}^{\infty} P_x(T_y^k < \infty) = \sum_{k=1}^{\infty} \rho_{xy}\rho_{yy}^{k-1} = \frac{\rho_{xy}}{1 - \rho_{yy}} < \infty \tag{5.4.1}$$
Combining the last computation with our result for recurrent states gives a result
that generalizes Theorem 3.2.2.
Theorem 5.4.2. y is recurrent if and only if Ey N (y ) = ∞.
Exercise 5.4.2. Let a ∈ S, $f_n = P_a(T_a = n)$, and $u_n = P_a(X_n = a)$. (i) Show that
$u_n = \sum_{1 \le m \le n} f_m u_{n-m}$. (ii) Let $u(s) = \sum_{n \ge 0} u_n s^n$, $f(s) = \sum_{n \ge 1} f_n s^n$, and show
u(s) = 1/(1 − f(s)). Setting s = 1 gives (5.4.1) for x = y = a.
Exercise 5.4.3. Consider asymmetric simple random walk on Z, i.e., we have
p(i, i + 1) = p, p(i, i − 1) = q = 1 − p. In this case,
$$p^{2m}(0, 0) = \binom{2m}{m} p^m q^m \quad \text{and} \quad p^{2m+1}(0, 0) = 0$$
(i) Use the Taylor series expansion for $h(x) = (1 - x)^{-1/2}$ to show
$u(s) = (1 - 4pqs^2)^{-1/2}$ and use the last exercise to conclude $f(s) = 1 - (1 - 4pqs^2)^{1/2}$. (ii) Set
s = 1 to get the probability the random walk will return to 0 and check that this is
the same as the answer given in part (c) of Theorem 4.7.7.
The next result shows that recurrence is contagious.
Theorem 5.4.3. If x is recurrent and ρxy > 0 then y is recurrent and ρyx = 1.
Proof. We will first show $\rho_{yx} = 1$ by showing that if $\rho_{xy} > 0$ and $\rho_{yx} < 1$ then
$\rho_{xx} < 1$. Let $K = \inf\{k : p^k(x, y) > 0\}$. There is a sequence $y_1, \ldots, y_{K-1}$ so that
$$p(x, y_1) p(y_1, y_2) \cdots p(y_{K-1}, y) > 0$$
Since K is minimal, $y_i \ne x$ for 1 ≤ i ≤ K − 1. If $\rho_{yx} < 1$, we have
$$P_x(T_x = \infty) \ge p(x, y_1) p(y_1, y_2) \cdots p(y_{K-1}, y)(1 - \rho_{yx}) > 0$$
a contradiction. So $\rho_{yx} = 1$.
To prove that y is recurrent, observe that $\rho_{yx} > 0$ implies there is an L so that
$p^L(y, x) > 0$. Now
$$p^{L+n+K}(y, y) \ge p^L(y, x)\, p^n(x, x)\, p^K(x, y)$$
Summing over n, we see
$$\sum_{n=1}^{\infty} p^{L+n+K}(y, y) \ge p^L(y, x)\, p^K(x, y) \sum_{n=1}^{\infty} p^n(x, x) = \infty$$
so Theorem 5.4.2 implies y is recurrent.
Exercise 5.4.4. Use the strong Markov property to show that ρxz ≥ ρxy ρyz .
The next fact will help us identify recurrent states in examples. First we need
two deﬁnitions. C is closed if x ∈ C and ρxy > 0 implies y ∈ C . The name comes
from the fact that if C is closed and x ∈ C then Px (Xn ∈ C ) = 1 for all n. D is
irreducible if x, y ∈ D implies ρxy > 0.
Theorem 5.4.4. Let C be a ﬁnite closed set. Then C contains a recurrent state. If
C is irreducible then all states in C are recurrent.
Proof. In view of Theorem 5.4.3, it suffices to prove the first claim. Suppose it is
false. Then for all y ∈ C, $\rho_{yy} < 1$ and $E_x N(y) = \rho_{xy}/(1 - \rho_{yy})$, but this is ridiculous
since it implies
$$\infty > \sum_{y \in C} E_x N(y) = \sum_{y \in C} \sum_{n=1}^{\infty} p^n(x, y) = \sum_{n=1}^{\infty} \sum_{y \in C} p^n(x, y) = \sum_{n=1}^{\infty} 1 = \infty$$
The first inequality follows from the fact that C is finite and the last equality from
the fact that C is closed.
To illustrate the use of the last result consider:
Example 5.4.1. A Seven-state chain. Consider the transition probability:

      1    2    3    4    5    6    7
1    .3    0    0    0   .7    0    0
2    .1   .2   .3   .4    0    0    0
3     0    0   .5   .5    0    0    0
4     0    0    0   .5    0   .5    0
5    .6    0    0    0   .4    0    0
6     0    0    0    0    0   .2   .8
7     0    0    0    1    0    0    0

To identify the states that are recurrent and those that are transient, we begin by
drawing a graph that will contain an arc from i to j if p(i, j) > 0 and i ≠ j. We do
not worry about drawing the self-loops corresponding to states with p(i, i) > 0 since
such transitions cannot help the chain get somewhere new.
In the case under consideration we draw arcs from 1 → 5, 2 → 1, 2 → 3, 2 → 4,
3 → 4, 4 → 6, 4 → 7, 5 → 1, 6 → 4, 6 → 7, 7 → 4.
[Figure: graph of the seven-state chain showing these arcs.]
(i) $\rho_{21} > 0$ and $\rho_{12} = 0$ so 2 must be transient, or we would contradict Theorem 5.4.3.
Similarly, $\rho_{34} > 0$ and $\rho_{43} = 0$ so 3 must be transient.
(ii) {1, 5} and {4, 6, 7} are irreducible closed sets, so Theorem 5.4.4 implies these
states are recurrent.
The last reasoning can be used to identify transient and recurrent states when
S is ﬁnite since for x ∈ S either: (i) there is a y with ρxy > 0 and ρyx = 0 and x
must be transient, or (ii) ρxy > 0 implies ρyx > 0 . In case (ii), Exercise 5.4.4 implies
Cx = {y : ρxy > 0} is an irreducible closed set. (If y, z ∈ Cx then ρyz ≥ ρyx ρxz > 0. If
ρyw > 0 then ρxw ≥ ρxy ρyw > 0, so w ∈ Cx .) So Theorem 5.4.4 implies x is recurrent.
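The dichotomy just described can be sketched in a few lines of Python. This is only an illustration, assuming the matrix of Example 5.4.1 is read with rows as the current state; the helper names `reachable` and `classify` are hypothetical, not from the text.

```python
# Transition matrix of Example 5.4.1 (row = current state, column = next state).
P = [
    [.3, 0, 0, 0, .7, 0, 0],
    [.1, .2, .3, .4, 0, 0, 0],
    [0, 0, .5, .5, 0, 0, 0],
    [0, 0, 0, .5, 0, .5, 0],
    [.6, 0, 0, 0, .4, 0, 0],
    [0, 0, 0, 0, 0, .2, .8],
    [0, 0, 0, 1, 0, 0, 0],
]

def reachable(P, x):
    """Indices y with rho_xy > 0, i.e. reachable from x in one or more steps."""
    seen, stack = set(), [x]
    while stack:
        i = stack.pop()
        for j, pij in enumerate(P[i]):
            if pij > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def classify(P):
    """Split a finite state space into recurrent and transient states (1-based labels)."""
    reach = [reachable(P, x) for x in range(len(P))]
    recurrent, transient = [], []
    for x in range(len(P)):
        # Case (ii): rho_xy > 0 implies rho_yx > 0; otherwise case (i) applies.
        if all(x in reach[y] for y in reach[x]):
            recurrent.append(x + 1)
        else:
            transient.append(x + 1)
    return recurrent, transient

recurrent, transient = classify(P)
print("recurrent:", recurrent, "transient:", transient)
```

The search only uses which entries are positive, so the same sketch applies to any finite transition matrix.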
Exercise 5.4.5. Show that in the Ehrenfest chain (Example 5.2.5), all states are
recurrent.
Example 5.4.1 motivates the following:
Theorem 5.4.5. Decomposition theorem. Let R = {x : ρxx = 1} be the recurrent
states of a Markov chain. R can be written as ∪i Ri , where each Ri is closed and
irreducible.
Remark. This result shows that for the study of recurrent states we can, without
loss of generality, consider a single irreducible closed set.
Proof. If x ∈ R let Cx = {y : ρxy > 0}. By Theorem 5.4.3, Cx ⊂ R, and if y ∈ Cx
then ρyx > 0. From this it follows easily that either Cx ∩ Cy = ∅ or Cx = Cy. To
prove the last claim, suppose Cx ∩ Cy ≠ ∅. If z ∈ Cx ∩ Cy then ρxy ≥ ρxz ρzy > 0,
so if w ∈ Cy we have ρxw ≥ ρxy ρyw > 0 and it follows that Cx ⊃ Cy . Interchanging
the roles of x and y gives Cy ⊃ Cx , and we have proved our claim. If we let Ri be a
listing of the sets that appear as some Cx , we have the desired decomposition.
The rest of this section is devoted to examples. Speciﬁcally we concentrate on the
question: How do we tell whether a state is recurrent or transient? Reasoning based
on Theorem 5.4.3 works occasionally when S is inﬁnite.
Example 5.4.2. Branching process. If the probability of no children is positive
then ρk0 > 0 and ρ0k = 0 for k ≥ 1, so Theorem 5.4.3 implies all states k ≥ 1 are
transient. The state 0 has p(0, 0) = 1 and is recurrent. It is called an absorbing
state to reﬂect the fact that once the chain enters 0, it remains there for all time.
If S is inﬁnite and irreducible, all that Theorem 5.4.3 tells us is that either all
the states are recurrent or all are transient, and we are left to ﬁgure out which case
occurs.
Example 5.4.3. Renewal chain. Since p(i, i − 1) = 1 for i ≥ 1, it is clear that
ρi0 = 1 for all i ≥ 1 and hence also for i = 0, i.e., 0 is recurrent. If we recall that
p(0, j ) = fj +1 and suppose that {k : fk > 0} is unbounded, then ρ0i > 0 for all i and
all states are recurrent. If K = sup{k : fk > 0} < ∞ then {0, 1, . . . , K − 1} is an
irreducible closed set of recurrent states and all states k ≥ K are transient.
Example 5.4.4. Birth and death chains on {0, 1, 2, . . .}. Let
$$p(i,i+1) = p_i \qquad p(i,i-1) = q_i \qquad p(i,i) = r_i$$
where q0 = 0. Let N = inf{n : Xn = 0}. To analyze this example, we are going to
define a function ϕ so that ϕ(X_{N∧n}) is a martingale. We start by setting ϕ(0) = 0
and ϕ(1) = 1. For the martingale property to hold when Xn = k ≥ 1, we must have
$$\varphi(k) = p_k\varphi(k+1) + r_k\varphi(k) + q_k\varphi(k-1)$$
Using rk = 1 − (pk + qk), we can rewrite the last equation as
$$q_k(\varphi(k)-\varphi(k-1)) = p_k(\varphi(k+1)-\varphi(k))$$
or
$$\varphi(k+1)-\varphi(k) = \frac{q_k}{p_k}\,(\varphi(k)-\varphi(k-1))$$
Here and in what follows, we suppose that pk, qk > 0 for k ≥ 1. Otherwise, the chain
is not irreducible. Since ϕ(1) − ϕ(0) = 1, iterating the last result gives
$$\varphi(m+1)-\varphi(m) = \prod_{j=1}^m \frac{q_j}{p_j} \quad\text{for } m \ge 1$$
$$\varphi(n) = \sum_{m=0}^{n-1}\prod_{j=1}^m \frac{q_j}{p_j} \quad\text{for } n \ge 1$$
if we interpret the product as 1 when m = 0. Let Tc = inf{n ≥ 1 : Xn = c}. Now I
claim that:
Theorem 5.4.6. If a < x < b then
$$P_x(T_a<T_b) = \frac{\varphi(b)-\varphi(x)}{\varphi(b)-\varphi(a)} \qquad P_x(T_b<T_a) = \frac{\varphi(x)-\varphi(a)}{\varphi(b)-\varphi(a)}$$
Proof. If we let T = Ta ∧ Tb then ϕ(X_{n∧T}) is a bounded martingale and T < ∞ a.s.
by Theorem 5.3.3, so ϕ(x) = Ex ϕ(XT ) by Theorem 4.7.4. Since XT ∈ {a, b} a.s.,
ϕ(x) = ϕ(a)Px (Ta < Tb ) + ϕ(b)[1 − Px (Ta < Tb )]
and solving gives the indicated formula.
Remark. The answer and the proof should remind the reader of Example 3.1.5 and
Theorem 4.7.7. To help remember the formula, observe that for any α and β , if
we let ψ (x) = αϕ(x) + β then ψ (Xn∧T ) is also a martingale and the answer we get
using ψ must be the same. The last observation explains why the answer is a ratio of
diﬀerences. To help remember which one, observe that the answer is 1 if x = a and 0
if x = b.
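As a quick sanity check of Theorem 5.4.6, the following sketch builds ϕ from its increments and verifies that x ↦ Px(Ta < Tb) has the right boundary values and satisfies the one-step equation. The constant rates pk = 2/3, qk = 1/3 and the helper names are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

def phi_values(p, q, n_max):
    """Return [phi(0), ..., phi(n_max)] with phi(0) = 0, phi(1) = 1 and
    phi(m+1) - phi(m) = prod_{j=1}^m q(j)/p(j)."""
    phi = [Fraction(0), Fraction(1)]
    incr = Fraction(1)
    for m in range(1, n_max):
        incr *= Fraction(q(m)) / Fraction(p(m))
        phi.append(phi[-1] + incr)
    return phi

def prob_hit_a_first(phi, a, x, b):
    """P_x(T_a < T_b) as given by Theorem 5.4.6."""
    return (phi[b] - phi[x]) / (phi[b] - phi[a])

# Hypothetical chain: p_k = 2/3, q_k = 1/3, r_k = 0 for k >= 1.
p = lambda k: Fraction(2, 3)
q = lambda k: Fraction(1, 3)
phi = phi_values(p, q, 10)
h = [prob_hit_a_first(phi, 0, x, 10) for x in range(11)]
print([str(v) for v in h])
```

Exact rational arithmetic is used so that the one-step equation h(x) = p h(x + 1) + q h(x − 1) can be checked with equality rather than a tolerance.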
Letting a = 0 and b = M in Theorem 5.4.6 gives
Px (T0 > TM ) = ϕ(x)/ϕ(M )
Letting M → ∞ and observing that TM ≥ M − x, Px a.s. we have proved:
Theorem 5.4.7. 0 is recurrent if and only if ϕ(M) → ∞ as M → ∞, i.e.,
$$\varphi(\infty) \equiv \sum_{m=0}^\infty\prod_{j=1}^m \frac{q_j}{p_j} = \infty$$
If ϕ(∞) < ∞ then Px(T0 = ∞) = ϕ(x)/ϕ(∞).
We will now see what Theorem 5.4.7 says about some concrete cases.
Example 5.4.5. Asymmetric simple random walk. Suppose pj = p and qj =
1 − p for j ≥ 1. In this case,
$$\varphi(n) = \sum_{m=0}^{n-1}\left(\frac{1-p}{p}\right)^m$$
From Theorem 5.4.7, it follows that 0 is recurrent if and only if p ≤ 1/2, and if
p > 1/2, then
$$P_x(T_0<\infty) = \frac{\varphi(\infty)-\varphi(x)}{\varphi(\infty)} = \left(\frac{1-p}{p}\right)^x$$
Exercise 5.4.6. A gambler is playing roulette and betting $1 on black each time.
The probability she wins $1 is 18/38, and the probability she loses $1 is 20/38. (i)
Calculate the probability that starting with $20 she reaches $40 before losing her
money. (ii) Use the fact that Xn + 2n/38 is a martingale to calculate E (T40 ∧ T0 ).
Example 5.4.6. To probe the boundary between recurrence and transience, suppose
pj = 1/2 + εj where εj ∼ C j^{−α} as j → ∞, and qj = 1 − pj. A little arithmetic shows
$$\frac{q_j}{p_j} = \frac{1/2-\epsilon_j}{1/2+\epsilon_j} = 1 - \frac{2\epsilon_j}{1/2+\epsilon_j} \approx 1 - 4Cj^{-\alpha} \quad\text{for large } j$$
Case 1: α > 1. It is easy to show that if 0 < δj < 1, then ∏_j (1 − δj) > 0 if and only
if Σ_j δj < ∞ (see Exercise 4.3.5), so if α > 1, ∏_{j≤k} (qj/pj) ↓ a positive limit, and 0
is recurrent.
Case 2: α < 1. Using the fact that log(1 − δ) ∼ −δ as δ → 0, we see that
$$\log\prod_{j=1}^k q_j/p_j \sim -\sum_{j=1}^k 4Cj^{-\alpha} \sim -\frac{4C}{1-\alpha}\,k^{1-\alpha} \quad\text{as } k\to\infty$$
so, for k ≥ K, $\prod_{j=1}^k q_j/p_j \le \exp(-2Ck^{1-\alpha}/(1-\alpha))$ and
$\sum_{k=0}^\infty\prod_{j=1}^k q_j/p_j < \infty$, and hence 0 is transient.
Case 3: α = 1. Repeating the argument for Case 2 shows $\log\prod_{j=1}^k q_j/p_j \sim -4C\log k$.
So, if C > 1/4, 0 is transient, and if C < 1/4, 0 is recurrent. The case C = 1/4 can
go either way.
Example 5.4.7. M/G/1 queue. Let µ = Σ_k k a_k be the mean number of customers
that arrive during one service time. We will now show that if µ > 1, the chain is
transient (i.e., all states are), but if µ ≤ 1, it is recurrent. For the case µ > 1,
we observe that if ξ1 , ξ2 , . . . are i.i.d. with P (ξm = j ) = aj +1 for j ≥ −1 and Sn =
ξ1 +· · ·+ξn , then X0 +Sn and Xn behave the same until time N = inf {n : X0 +Sn = 0}.
When µ > 1, Eξm = µ − 1 > 0, so Sn → ∞ a.s., and inf Sn > −∞ a.s. It follows from
the last observation that if x is large, Px (N < ∞) < 1, and the chain is transient.
To deal with the case µ ≤ 1, we observe that it follows from arguments in the
last paragraph that Xn∧N is a supermartingale. Let T = inf {n : Xn ≥ M }. Since
Xn∧N is a nonnegative supermartingale, using Theorem 4.7.6 at time τ = T ∧ N , and
observing Xτ ≥ M on {T < N }, Xτ = 0 on {N < T } gives
x ≥ M Px (T < N )
Letting M → ∞ shows Px (N < ∞) = 1, so the chain is recurrent.
Remark. There is another way of seeing that the M/G/1 queue is transient when
µ > 1. If we consider the customers that arrive during a person’s service time to be
her children, then we get a branching process. Results in Section 4.3 imply that when
µ ≤ 1 the branching process dies out with probability one (i.e., the queue becomes
empty), so the chain is recurrent. When µ > 1, Theorem 4.3.9 implies Px(T0 < ∞) =
ρ^x, where ρ is the unique fixed point ∈ (0, 1) of the function ϕ(θ) = Σ_{k=0}^∞ a_k θ^k.
The next result encapsulates the techniques we used for birth and death chains
and the M/G/1 queue.
Theorem 5.4.8. Suppose S is irreducible, and ϕ ≥ 0 with Ex ϕ(X1) ≤ ϕ(x) for
x ∉ F, a finite set, and ϕ(x) → ∞ as x → ∞, i.e., {x : ϕ(x) ≤ M} is finite for any
M < ∞, then the chain is recurrent.
Proof. Let τ = inf {n > 0 : Xn ∈ F }. Our assumptions imply that Yn = ϕ(Xn∧τ )
is a supermartingale. Let TM = inf {n > 0 : Xn ∈ F or ϕ(Xn ) > M }. Since
{x : ϕ(x) ≤ M} is finite and the chain is irreducible, TM < ∞ a.s. Using Theorem
4.7.6 now, we see that
$$\varphi(x) \ge E_x\varphi(X_{T_M}) \ge M P_x(T_M<\tau)$$
since ϕ(X_{TM}) ≥ M when TM < τ. Letting M → ∞, we see that Px(τ < ∞) = 1 for all
x ∉ F. So Py(Xn ∈ F i.o.) = 1 for all y ∈ S, and since F is finite, Py(Xn = z i.o.) = 1
for some z ∈ F.
Exercise 5.4.7. Show that if we replace “ϕ(x) → ∞” by “ϕ(x) → 0” in the last
theorem and assume that ϕ(x) > 0 for x ∈ F , then we can conclude that the chain is
transient.
Exercise 5.4.8. Let Xn be a birth and death chain with pj − 1/2 ∼ C/j as j → ∞
and qj = 1 − pj . (i) Show that if we take C < 1/4 then we can pick α > 0 so that
ϕ(x) = xα satisﬁes the hypotheses of Theorem 5.4.8. (ii) Show that when C > 1/4,
we can take α < 0 and apply Exercise 5.4.7.
Remark. An advantage of the method of Exercise 5.4.8 over that of Example 5.4.6
is that it applies if we assume Px(|X1 − x| ≤ M) = 1 and Ex(X1 − x) ∼ 2C/x.
Exercise 5.4.9. f is said to be superharmonic if f(x) ≥ Σ_y p(x, y)f(y), or equivalently
f(Xn) is a supermartingale. Suppose p is irreducible. Show that p is recurrent
if and only if every nonnegative superharmonic function is constant.
Exercise 5.4.10. M/M/∞ queue. Consider a telephone system with an inﬁnite
number of lines. Let Xn = the number of lines in use at time n, and suppose
$$X_{n+1} = \sum_{m=1}^{X_n}\xi_{n,m} + Y_{n+1}$$
where the ξn,m are i.i.d. with P(ξn,m = 1) = p and P(ξn,m = 0) = 1 − p, and Yn is an
independent i.i.d. sequence of Poisson mean λ r.v.’s. In words, for each conversation
we flip a coin with probability p of heads to see if it continues for another minute.
Meanwhile, a Poisson mean λ number of conversations start between time n and n + 1.
Use Theorem 5.4.8 with ϕ(x) = x to show that the chain is recurrent for any p < 1.

5.5 Stationary Measures

A measure µ is said to be a stationary measure if
$$\sum_x \mu(x)p(x,y) = \mu(y)$$
The last equation says Pµ(X1 = y) = µ(y). Using the Markov property and induction,
it follows that Pµ (Xn = y ) = µ(y ) for all n ≥ 1. If µ is a probability measure, we
call µ a stationary distribution, and it represents a possible equilibrium for the
chain. That is, if X0 has distribution µ then so does Xn for all n ≥ 1. If we stretch
our imagination a little, we can also apply this interpretation when µ is an inﬁnite
measure. (When the total mass is ﬁnite, we can divide by µ(S ) to get a stationary
distribution.) Before getting into the theory, we consider some examples.
Example 5.5.1. Random walk. S = Z^d. p(x, y) = f(y − x), where f(z) ≥ 0 and
Σ_z f(z) = 1. In this case, µ(x) ≡ 1 is a stationary measure since
$$\sum_x p(x,y) = \sum_x f(y-x) = 1$$
A transition probability that has Σ_x p(x, y) = 1 is called doubly stochastic. This is
obviously a necessary and sufficient condition for µ(x) ≡ 1 to be a stationary measure.
Example 5.5.2. Asymmetric simple random walk. S = Z.
$$p(x,x+1) = p \qquad p(x,x-1) = q = 1-p$$
By the last example, µ(x) ≡ 1 is a stationary measure. When p ≠ q, µ(x) = (p/q)^x is
a second one. To check this, we observe that
$$\sum_x \mu(x)p(x,y) = \mu(y+1)p(y+1,y) + \mu(y-1)p(y-1,y)$$
$$= (p/q)^{y+1}q + (p/q)^{y-1}p = (p/q)^y[p+q] = (p/q)^y$$
Example 5.5.3. The Ehrenfest chain. S = {0, 1, . . . , r}.
$$p(k,k+1) = (r-k)/r \qquad p(k,k-1) = k/r$$
In this case, $\mu(x) = 2^{-r}\binom{r}{x}$ is a stationary distribution. One can check this without
pencil and paper by observing that µ corresponds to flipping r coins to determine
which urn each ball is to be placed in, and the transitions of the chain correspond
to picking a coin at random and turning it over. Alternatively, you can pick up your
pencil and check that µ(k + 1)p(k + 1, k) + µ(k − 1)p(k − 1, k) = µ(k).
Example 5.5.4. Birth and death chains. S = {0, 1, 2, . . .}
$$p(x,x+1) = p_x \qquad p(x,x) = r_x \qquad p(x,x-1) = q_x$$
with q0 = 0 and p(i, j) = 0 otherwise. In this case, there is the measure
$$\mu(x) = \prod_{k=1}^x \frac{p_{k-1}}{q_k}$$
which has
$$\mu(x)p(x,x+1) = p_x\prod_{k=1}^x\frac{p_{k-1}}{q_k} = \mu(x+1)p(x+1,x)$$
Since p(x, y) = 0 when |x − y| > 1, it follows that
$$\mu(x)p(x,y) = \mu(y)p(y,x) \quad\text{for all } x, y \qquad (5.5.1)$$
Summing over x gives
$$\sum_x \mu(x)p(x,y) = \mu(y)$$
so (5.5.1) is stronger than being a stationary measure. (5.5.1) asserts that the amount
of mass that moves from x to y in one jump is exactly the same as the amount that
moves from y to x. A measure µ that satisﬁes (5.5.1) is said to be a reversible
measure. Since Examples 5.5.2 and 5.5.3 are birth and death chains, they have
reversible measures. In Example 5.5.1 (random walks), µ(x) ≡ 1 is a reversible
measure if and only if p(x, y ) = p(y, x).
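To make the product formula of Example 5.5.4 concrete, here is a small sketch that evaluates µ(x) = ∏ p_{k−1}/q_k for the Ehrenfest chain and checks (5.5.1) directly. The choice r = 6 and the function name `birth_death_measure` are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def birth_death_measure(p, q, n):
    """Return [mu(0), ..., mu(n)] with mu(x) = prod_{k=1}^x p(k-1)/q(k)."""
    mu = [Fraction(1)]
    for x in range(1, n + 1):
        mu.append(mu[-1] * Fraction(p(x - 1)) / Fraction(q(x)))
    return mu

r = 6                               # number of balls (illustrative)
p = lambda k: Fraction(r - k, r)    # p(k, k+1) for the Ehrenfest chain
q = lambda k: Fraction(k, r)        # p(k, k-1) for the Ehrenfest chain
mu = birth_death_measure(p, q, r)

# Detailed balance (5.5.1) between neighbouring states.
balanced = all(mu[x] * p(x) == mu[x + 1] * q(x + 1) for x in range(r))

# Normalizing gives the binomial stationary distribution of Example 5.5.3.
pi = [m / sum(mu) for m in mu]
print(balanced, [str(v) for v in pi])
```

Here the unnormalized measure comes out as the binomial coefficients, so dividing by the total mass 2^r recovers the formula quoted in Example 5.5.3.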
The next exercise explains the name “reversible.”
Exercise 5.5.1. Let µ be a stationary measure and suppose X0 has “distribution”
µ. Then Ym = Xn−m , 0 ≤ m ≤ n is a Markov chain with initial measure µ and
transition probability
q (x, y ) = µ(y )p(y, x)/µ(x)
q is called the dual transition probability. If µ is a reversible measure then q = p.
Exercise 5.5.2. Find the stationary distribution for the Bernoulli-Laplace model of
diﬀusion from Exercise 5.2.6.
Example 5.5.5. Random walks on graphs. A graph is described by giving a
countable set of vertices S and an adjacency matrix aij that has aij = 1 if i and j are
adjacent and 0 otherwise. To have an undirected graph with no loops, we suppose
aij = aji and aii = 0. If we suppose that
$$\mu(i) = \sum_j a_{ij} < \infty \quad\text{and let}\quad p(i,j) = a_{ij}/\mu(i)$$
then p is a transition probability that corresponds to picking an edge at random and
jumping to the other end. It is clear from the definition that
$$\mu(i)p(i,j) = a_{ij} = a_{ji} = \mu(j)p(j,i)$$
so µ is a reversible measure for p. A little thought reveals that if we assume only that
$$a_{ij} = a_{ji} \ge 0, \qquad \mu(i) = \sum_j a_{ij} < \infty \quad\text{and}\quad p(i,j) = a_{ij}/\mu(i)$$
the same conclusion is valid. This is the most general example because if µ is a
reversible measure for p, we can let aij = µ(i)p(i, j).
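The construction of Example 5.5.5 is easy to try on a small graph. The sketch below uses a hypothetical four-vertex graph (edges 0-1, 1-2, 1-3, 2-3) and checks that the degree measure is reversible and hence stationary.

```python
from fractions import Fraction

# Adjacency matrix of a small undirected graph without loops (illustrative).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]
n = len(A)

mu = [sum(row) for row in A]                      # mu(i) = sum_j a_ij, the degree of i
P = [[Fraction(A[i][j], mu[i]) for j in range(n)] for i in range(n)]

# Reversibility: mu(i) p(i, j) = mu(j) p(j, i) for all i, j.
reversible = all(mu[i] * P[i][j] == mu[j] * P[j][i]
                 for i in range(n) for j in range(n))

# Stationarity follows by summing over i.
mu_P = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

# The graph is finite, so mu can be normalized to a stationary distribution.
pi = [Fraction(m, sum(mu)) for m in mu]
print(reversible, mu_P, [str(v) for v in pi])
```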
Reviewing the last ﬁve examples might convince you that most chains have reversible measures. This is a false impression. The M/G/1 queue has no reversible
measures because if x > y + 1, p(x, y ) = 0 but p(y, x) > 0. The renewal chain has
similar problems.
Theorem 5.5.1. Suppose p is irreducible. A necessary and sufficient condition for
the existence of a reversible measure is that (i) p(x, y) > 0 implies p(y, x) > 0, and
(ii) for any loop x0, x1, . . . , xn = x0 with ∏_{1≤i≤n} p(xi, xi−1) > 0,
$$\prod_{i=1}^n \frac{p(x_{i-1},x_i)}{p(x_i,x_{i-1})} = 1$$
Proof. To prove the necessity of this cycle condition, due to Kolmogorov, we note
that irreducibility implies that any stationary measure has µ(x) > 0 for all x, so
(5.5.1) implies (i) holds. To check (ii), note that (5.5.1) implies that for the sequences
considered above
$$\prod_{i=1}^n \frac{p(x_{i-1},x_i)}{p(x_i,x_{i-1})} = \prod_{i=1}^n \frac{\mu(x_i)}{\mu(x_{i-1})} = 1$$
To prove sufficiency, fix a ∈ S, set µ(a) = 1, and if x0 = a, x1, . . . , xn = x is a sequence
with ∏_{1≤i≤n} p(xi, xi−1) > 0 (irreducibility implies such a sequence will exist), we let
$$\mu(x) = \prod_{i=1}^n \frac{p(x_{i-1},x_i)}{p(x_i,x_{i-1})}$$
The cycle condition guarantees that the last definition is independent of the path. To
check (5.5.1) now, observe that if p(y, x) > 0 then adding xn+1 = y to the end of a
path to x we have
$$\mu(x)\,\frac{p(x,y)}{p(y,x)} = \mu(y)$$
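The cycle condition can be checked by hand on three-state chains. The sketch below compares a hypothetical "biased rotation" chain, which fails the condition, with a random walk on a weighted triangle, which satisfies it; both matrices and the helper `cycle_ratio` are illustrative assumptions.

```python
from fractions import Fraction as F

def cycle_ratio(P, loop):
    """Product of p(x_{i-1}, x_i) / p(x_i, x_{i-1}) along loop = [x_0, ..., x_n], x_n = x_0."""
    ratio = F(1)
    for a, b in zip(loop, loop[1:]):
        ratio *= P[a][b] / P[b][a]
    return ratio

# Biased rotation on {0, 1, 2}: clockwise steps are twice as likely as counterclockwise.
biased = [[F(0), F(2, 3), F(1, 3)],
          [F(1, 3), F(0), F(2, 3)],
          [F(2, 3), F(1, 3), F(0)]]

# Random walk on a triangle with symmetric edge weights, as in Example 5.5.5.
weights = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
walk = [[F(w, sum(row)) for w in row] for row in weights]

print(cycle_ratio(biased, [0, 1, 2, 0]))   # differs from 1: no reversible measure
print(cycle_ratio(walk, [0, 1, 2, 0]))     # equals 1: consistent with reversibility
```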
Only special chains have reversible measures, but as the next result shows, many
Markov chains have stationary measures.
Theorem 5.5.2. Let x be a recurrent state, and let T = inf{n ≥ 1 : Xn = x}. Then
$$\mu_x(y) = E_x\left(\sum_{n=0}^{T-1} 1_{\{X_n=y\}}\right) = \sum_{n=0}^\infty P_x(X_n=y,\ T>n)$$
defines a stationary measure.
Proof. This is called the “cycle trick.” The proof in words is simple. µx(y) is the
expected number of visits to y in {0, . . . , T − 1}. µx p(y) ≡ Σ_z µx(z)p(z, y) is the
expected number of visits to y in {1, . . . , T}, which is = µx(y) since XT = X0 = x.
To translate this intuition into a proof, let $\bar p_n(x,y) = P_x(X_n=y,\ T>n)$ and use
Fubini’s theorem to get
$$\sum_y \mu_x(y)p(y,z) = \sum_{n=0}^\infty\sum_y \bar p_n(x,y)p(y,z)$$
Case 1. z ≠ x.
$$\sum_y \bar p_n(x,y)p(y,z) = \sum_y P_x(X_n=y,\ T>n,\ X_{n+1}=z) = P_x(T>n+1,\ X_{n+1}=z) = \bar p_{n+1}(x,z)$$
so $\sum_{n=0}^\infty\sum_y \bar p_n(x,y)p(y,z) = \sum_{n=0}^\infty \bar p_{n+1}(x,z) = \mu_x(z)$ since $\bar p_0(x,z) = 0$.
Case 2. z = x.
$$\sum_y \bar p_n(x,y)p(y,x) = \sum_y P_x(X_n=y,\ T>n,\ X_{n+1}=x) = P_x(T=n+1)$$
so $\sum_{n=0}^\infty\sum_y \bar p_n(x,y)p(y,x) = \sum_{n=0}^\infty P_x(T=n+1) = 1 = \mu_x(x)$ since by definition
Px(T = 0) = 0.
Remark. If x is transient, then we have µx p(z) ≤ µx(z) with equality for all z ≠ x.
Technical Note. To show that we are not cheating, we should prove that µx (y ) < ∞
for all y . First, observe that µx p = µx implies µx pn = µx for all n ≥ 1, and µx (x) = 1,
so if pn (y, x) > 0 then µx (y ) < ∞. Since the last result is true for all n, we see that
µx (y ) < ∞ whenever ρyx > 0, but this is good enough. By Theorem 5.4.3, when x
is recurrent ρxy > 0 implies ρyx > 0, and it follows from the argument above that
µx (y ) < ∞. If ρxy = 0 then µx (y ) = 0.
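The series defining µx in Theorem 5.5.2 can be summed numerically for a small chain. The sketch below does this for a hypothetical three-state chain by iterating the taboo probabilities P_x(X_n = y, T > n); the truncation at `n_terms` terms is an approximation that is harmless here because the terms decay geometrically.

```python
def cycle_measure(P, x, n_terms=2000):
    """Approximate mu_x(y) = sum_n P_x(X_n = y, T > n) by a truncated series."""
    n = len(P)
    v = [1.0 if y == x else 0.0 for y in range(n)]   # n = 0 term: point mass at x
    mu = v[:]
    for _ in range(n_terms):
        # One more step without returning to x: the entry at x is forced to 0.
        v = [0.0 if z == x else sum(v[y] * P[y][z] for y in range(n))
             for z in range(n)]
        mu = [m + w for m, w in zip(mu, v)]
    return mu

# Hypothetical chain on {0, 1, 2}.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

mu0 = cycle_measure(P, 0)
mu0_P = [sum(mu0[y] * P[y][z] for y in range(3)) for z in range(3)]
print(mu0, mu0_P)
```

For this chain the result is close to (1, 2, 1), and applying p leaves it unchanged up to rounding, as the theorem asserts.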
Exercise 5.5.3. Use the construction in the proof of Theorem 5.5.2 to show that
µ(j) = Σ_{k≥j} f_{k+1} defines a stationary measure for the renewal chain (Example 5.2.3).
Theorem 5.5.2 allows us to construct a stationary measure for each closed set of
recurrent states. Conversely, we have:
Theorem 5.5.3. If p is irreducible and recurrent (i.e., all states are) then the stationary measure is unique up to constant multiples.
Proof. Let ν be a stationary measure and let a ∈ S.
$$\nu(z) = \sum_y \nu(y)p(y,z) = \nu(a)p(a,z) + \sum_{y\ne a}\nu(y)p(y,z)$$
Using the last identity to replace ν(y) on the right-hand side,
$$\nu(z) = \nu(a)p(a,z) + \sum_{y\ne a}\nu(a)p(a,y)p(y,z) + \sum_{x\ne a}\sum_{y\ne a}\nu(x)p(x,y)p(y,z)$$
$$= \nu(a)P_a(X_1=z) + \nu(a)P_a(X_1\ne a,\ X_2=z) + P_\nu(X_0\ne a,\ X_1\ne a,\ X_2=z)$$
Continuing in the obvious way, we get
$$\nu(z) = \nu(a)\sum_{m=1}^n P_a(X_k\ne a,\ 1\le k<m,\ X_m=z) + P_\nu(X_j\ne a,\ 0\le j<n,\ X_n=z)$$
The last term is ≥ 0. Letting n → ∞ gives ν(z) ≥ ν(a)µa(z), where µa is the measure
defined in Theorem 5.5.2 for x = a. It follows from Theorem 5.5.2 that µa is a
stationary measure with µa(a) = 1. (Here we are summing from 1 to T rather
than from 0 to T − 1.) To turn the ≥ in the last equation into =, we observe
$$\nu(a) = \sum_x \nu(x)p^n(x,a) \ge \nu(a)\sum_x \mu_a(x)p^n(x,a) = \nu(a)\mu_a(a) = \nu(a)$$
Since ν(x) ≥ ν(a)µa(x) and the left and right-hand sides are equal we must have
ν(x) = ν(a)µa(x) whenever p^n(x, a) > 0. Since p is irreducible, it follows that ν(x) =
ν(a)µa(x) for all x ∈ S, and the proof is complete.
Theorems 5.5.2 and 5.5.3 make a good team. The first result gives us a formula for
a stationary measure we call µx, and the second shows it is unique up to constant
multiples. Together they allow us to derive a lot of formulas.
Exercise 5.5.4. Let wxy = Px(Ty < Tx). Show that µx(y) = wxy/wyx.
Exercise 5.5.5. Show that if p is irreducible and recurrent then
µx (y )µy (z ) = µx (z )
Exercise 5.5.6. Use Theorems 5.5.2 and 5.5.3 to show that for simple random walk,
(i) the expected number of visits to k between successive visits to 0 is 1 for all k , and
(ii) if we start from k the expected number of visits to k before hitting 0 is 2k .
Exercise 5.5.7. Another proof of Theorem 5.5.3. Suppose p is irreducible and
recurrent and let µ be the stationary measure constructed in Theorem 5.5.2. µ(x) > 0
for all x, and
$$q(x,y) = \mu(y)p(y,x)/\mu(x) \ge 0$$
defines a “dual” transition probability. (See Exercise 5.5.1.) (i) Show that q is irreducible
and recurrent. (ii) Suppose ν(y) ≥ Σ_x ν(x)p(x, y) (i.e., ν is an excessive
measure) and let h(x) = ν(x)/µ(x). Verify that h(y) ≥ Σ_x q(y, x)h(x) and use
Exercise 5.4.9 to conclude that h is constant, i.e., ν = cµ.
Remark. The last result is stronger than Theorem 5.5.3 since it shows that in
the recurrent case any excessive measure is a constant multiple of one stationary
measure. The remark after the proof of Theorem 5.5.2 shows that if p is irreducible
and transient, there is an excessive measure for each x ∈ S.
Having examined the existence and uniqueness of stationary measures, we turn
our attention now to stationary distributions, i.e., probability measures π with
πp = π . Stationary measures may exist for transient chains, e.g., random walks in
d ≥ 3, but
Theorem 5.5.4. If there is a stationary distribution then all states y that have π (y ) >
0 are recurrent.
Proof. Since πp^n = π, Fubini’s theorem implies
$$\sum_x \pi(x)\sum_{n=1}^\infty p^n(x,y) = \sum_{n=1}^\infty \pi(y) = \infty$$
when π(y) > 0. Using Theorem 5.4.2 now gives
$$\infty = \sum_x \pi(x)\,\frac{\rho_{xy}}{1-\rho_{yy}} \le \frac{1}{1-\rho_{yy}}$$
since ρxy ≤ 1 and π is a probability measure. So ρyy = 1.
Theorem 5.5.5. If p is irreducible and has stationary distribution π , then
π (x) = 1/Ex Tx
Remark. Recycling Chung’s quote regarding Theorem 4.5.8, we note that the proof
will make π(x) = 1/Ex Tx obvious, but it seems incredible that
$$\sum_x \frac{1}{E_xT_x}\,p(x,y) = \frac{1}{E_yT_y}$$
Proof. Irreducibility implies π(x) > 0 so all states are recurrent by Theorem 5.5.4.
From Theorem 5.5.2,
$$\mu_x(y) = \sum_{n=0}^\infty P_x(X_n=y,\ T_x>n)$$
defines a stationary measure with µx(x) = 1, and Fubini’s theorem implies
$$\sum_y \mu_x(y) = \sum_{n=0}^\infty P_x(T_x>n) = E_xT_x$$
By Theorem 5.5.3, the stationary measure is unique up to constant multiples, so
π(x) = µx(x)/Ex Tx. Since µx(x) = 1 by definition, the desired result follows.
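The identity π(x) = 1/Ex Tx can be verified by hand for a two-state chain, where both sides have closed forms. The sketch below does the arithmetic exactly; the values of a = p(0, 1) and b = p(1, 0) are illustrative assumptions.

```python
from fractions import Fraction as F

a, b = F(1, 3), F(1, 5)        # p(0,1) = a and p(1,0) = b (illustrative values)
P = [[1 - a, a], [b, 1 - b]]

# Stationary distribution of the two-state chain.
pi = [b / (a + b), a / (a + b)]
stationary = all(sum(pi[x] * P[x][y] for x in range(2)) == pi[y] for y in range(2))

# Mean return times by first-step analysis: from state 1 the time to reach 0 is
# geometric with success probability b, so E_1 T_0 = 1/b, and symmetrically E_0 T_1 = 1/a.
E0T0 = 1 + a * (1 / b)
E1T1 = 1 + b * (1 / a)

print(stationary, pi[0] == 1 / E0T0, pi[1] == 1 / E1T1)
```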
Exercise 5.5.8. Compute the expected number of moves it takes a knight to return
to its initial position if it starts in a corner of the chessboard, assuming there are no
other pieces on the board, and each time it chooses a move at random from its legal
moves. (Note: A chessboard is {0, 1, . . . , 7}^2. A knight’s move is L-shaped; two steps
in one direction followed by one step in a perpendicular direction.)
If a state x has Ex Tx < ∞, it is said to be positive recurrent. A recurrent
state with Ex Tx = ∞ is said to be null recurrent. Theorem 5.6.1 will explain these
names. The next result helps us identify positive recurrent states.
Theorem 5.5.6. If p is irreducible then the following are equivalent:
(i) Some x is positive recurrent.
(ii) There is a stationary distribution.
(iii) All states are positive recurrent.
Proof. (i) implies (ii). If x is positive recurrent then
$$\pi(y) = \sum_{n=0}^\infty P_x(X_n=y,\ T_x>n)/E_xT_x$$
defines a stationary distribution.
(ii) implies (iii). Theorem 5.5.5 implies π (y ) = 1/Ey Ty , and irreducibility tells us
π (y ) > 0 for all y , so Ey Ty < ∞.
(iii) implies (i). Trivial.
Exercise 5.5.9. Suppose p is irreducible and positive recurrent. Then Ex Ty < ∞
for all x, y.
Exercise 5.5.10. Suppose p is irreducible and has a stationary measure µ with
Σ_x µ(x) = ∞. Then p is not positive recurrent.
Theorem 5.5.6 shows that being positive recurrent is a class property. If it holds
for one state in an irreducible set, then it is true for all. Turning to our examples, since
µ(x) ≡ 1 is a stationary measure, Exercise 5.5.10 implies that random walks (Example
5.5.1) are never positive recurrent. Random walks on graphs (Example 5.5.5) are
irreducible if and only if the graph is connected. Since µ(i) ≥ 1 in the connected case,
we have positive recurrence if and only if the graph is ﬁnite. The Ehrenfest chain
(Example 5.5.3) is positive recurrent. To see this note that the state space is finite,
so there is a stationary distribution and the conclusion follows from Theorem 5.5.6.
A renewal chain is irreducible if {k : fk > 0} is unbounded (see Example 5.4.3). It is
positive recurrent (i.e., all the states are) if and only if E0 T0 = Σ_k k f_k < ∞.
Birth and death chains (Example 5.5.4) have a stationary distribution if and only
if
$$\sum_x \prod_{k=1}^x \frac{p_{k-1}}{q_k} < \infty$$
By Theorem 5.4.7, the chain is recurrent if and only if
$$\sum_{m=0}^\infty\prod_{j=1}^m \frac{q_j}{p_j} = \infty$$
When pj = p and qj = (1 − p) for j ≥ 1, there is a stationary distribution if and
only if p < 1/2 and the chain is transient when p > 1/2. In Section 5.4, we probed
the boundary between recurrence and transience by looking at examples with pj =
1/2 + εj, where εj ∼ C j^{−α} as j → ∞ and C, α ∈ (0, ∞). Since εj ≥ 0 and hence
p_{j−1}/q_j ≥ 1 for large j, none of these chains have stationary distributions. If we look
at chains with pj = 1/2 − εj, then all we have done is interchange the roles of p and
q , and results from the last section imply that the chain is positive recurrent when
α < 1, or α = 1 and C > 1/4.
Example 5.5.6. M/G/1 queue. Let µ = Σ_k k a_k be the mean number of customers
that arrive during one service time. In Example 5.4.7, we showed that the chain is
recurrent if and only if µ ≤ 1. We will now show that the chain is positive recurrent
if and only if µ < 1. First, suppose that µ < 1. When Xn > 0, the chain behaves like
a random walk that has jumps with mean µ − 1, so if N = inf {n ≥ 0 : Xn = 0} then
XN ∧n − (µ − 1)(N ∧ n) is a martingale. If X0 = x > 0 then the martingale property
implies
x = Ex XN ∧n + (1 − µ)Ex (N ∧ n) ≥ (1 − µ)Ex (N ∧ n)
since XN ∧n ≥ 0, and it follows that Ex N ≤ x/(1 − µ).
To prove that there is equality, observe that Xn decreases by at most one each
time and for x ≥ 1, Ex T_{x−1} = E1 T0, so Ex N = cx. To identify the constant, observe
that
$$E_1N = 1 + \sum_{k=0}^\infty a_kE_kN$$
so c = 1 + µc and c = 1/(1 − µ). If X0 = 0 then p(0, 0) = a0 + a1 and p(0, k − 1) = ak
for k ≥ 2. By considering what happens on the first jump, we see that (the first term
may look wrong, but recall k − 1 = 0 when k = 1)
$$E_0T_0 = 1 + \sum_{k=1}^\infty a_k\,\frac{k-1}{1-\mu} = 1 + \frac{\mu-(1-a_0)}{1-\mu} = \frac{a_0}{1-\mu} < \infty$$
This shows that the chain is positive recurrent if µ < 1. To prove the converse,
observe that the arguments above show that if E0 T0 < ∞ then Ek N < ∞ for all k ,
Ek N = ck , and c = 1/(1 − µ), which is impossible if µ ≥ 1.
The last result when combined with Theorem 5.5.2 and 5.5.5 allows us to conclude
that the stationary distribution has π (0) = (1 − µ)/a0 . This may not seem like much,
but the equations in πp = π are:
$$\pi(0) = \pi(0)(a_0+a_1) + \pi(1)a_0$$
$$\pi(1) = \pi(0)a_2 + \pi(1)a_1 + \pi(2)a_0$$
$$\pi(2) = \pi(0)a_3 + \pi(1)a_2 + \pi(2)a_1 + \pi(3)a_0$$
or, in general, for j ≥ 1
$$\pi(j) = \sum_{i=0}^{j+1}\pi(i)\,a_{j+1-i}$$
The equations have a “triangular” form, so knowing π(0), we can solve for π(1), π(2), . . .
The ﬁrst expression,
π (1) = π (0)(1 − (a0 + a1 ))/a0
is simple, but the formulas get progressively messier, and there is no nice closed form
solution.
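The triangular structure translates directly into a forward recursion. The sketch below solves for π(1), π(2), . . . starting from π(0) = (1 − µ)/a0; the arrival distribution a = (1/2, 1/3, 1/6) with finite support and the function name `mg1_stationary` are illustrative assumptions.

```python
from fractions import Fraction as F

def mg1_stationary(a, n_terms):
    """Return [pi(0), ..., pi(n_terms - 1)] for the M/G/1 chain, where a[k] = a_k
    and a_k = 0 beyond the end of the list. Requires n_terms >= 2 and mu < 1."""
    ak = lambda k: a[k] if k < len(a) else F(0)
    mu = sum(k * a[k] for k in range(len(a)))
    assert mu < 1, "positive recurrence needs mu < 1"
    pi = [(1 - mu) / a[0]]
    # First equation: pi(0) = pi(0)(a_0 + a_1) + pi(1) a_0.
    pi.append(pi[0] * (1 - a[0] - ak(1)) / a[0])
    # Equation j >= 1: pi(j) = sum_{i=0}^{j+1} pi(i) a_{j+1-i}, solved for pi(j+1).
    for j in range(1, n_terms - 1):
        known = sum(pi[i] * ak(j + 1 - i) for i in range(j + 1))
        pi.append((pi[j] - known) / a[0])
    return pi

a = [F(1, 2), F(1, 3), F(1, 6)]     # hypothetical a_0, a_1, a_2; mu = 2/3
pi = mg1_stationary(a, 8)
print([str(v) for v in pi])
```

In this particular example at most one customer is added per step, so the solution happens to be geometric; for general a_k no such closed form is expected, as the text notes.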
Exercise 5.5.11. Let ξ1, ξ2, . . . be i.i.d. with P(ξm = k) = a_{k+1} for k ≥ −1, let
Sn = x + ξ1 + · · · + ξn, where x ≥ 0, and let
$$X_n = S_n + \left(\min_{m\le n} S_m\right)^-$$
(5.2.1) shows that Xn has the same distribution as the M/G/1 queue starting from
X0 = x. Use this representation to conclude that if µ = Σ_k k a_k < 1, then as n → ∞
$$\frac{1}{n}\,\bigl|\{m\le n : X_{m-1}=0,\ \xi_m=-1\}\bigr| \to (1-\mu) \quad\text{a.s.}$$
and hence π(0) = (1 − µ)/a0 as proved above.
Example 5.5.7. M/M/∞ queue. In this chain, introduced in Exercise 5.4.10,
$$X_{n+1} = \sum_{m=1}^{X_n}\xi_{n,m} + Y_{n+1}$$
where ξn,m are i.i.d. Bernoulli with mean p and Yn+1 is an independent Poisson mean
λ. It follows from properties of the Poisson distribution that if Xn is Poisson with
mean µ, then Xn+1 is Poisson with mean µp + λ. Setting µ = µp + λ, we ﬁnd that a
Poisson distribution with mean µ = λ/(1 − p) is a stationary distribution.
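The claim that Poisson(λ/(1 − p)) is stationary can be checked numerically by applying one step of the chain to a truncated Poisson distribution. The values p = 0.5 and λ = 1.5, the truncation level 60, and the helper names are illustrative assumptions; the truncation error is far below the tolerance used.

```python
from math import comb, exp

def poisson_pmf(mean, n_max):
    """Poisson probabilities for 0, ..., n_max."""
    pmf = [exp(-mean)]
    for k in range(1, n_max + 1):
        pmf.append(pmf[-1] * mean / k)
    return pmf

def one_step(dist, p, lam):
    """Distribution of X_{n+1} on 0..n_max when X_n has (truncated) distribution dist."""
    n_max = len(dist) - 1
    # Each of the X_n conversations continues independently with probability p.
    thinned = [0.0] * (n_max + 1)
    for x, px in enumerate(dist):
        for k in range(x + 1):
            thinned[k] += px * comb(x, k) * p ** k * (1 - p) ** (x - k)
    # Add an independent Poisson(lam) number of new conversations.
    arrivals = poisson_pmf(lam, n_max)
    return [sum(thinned[k] * arrivals[j - k] for k in range(j + 1))
            for j in range(n_max + 1)]

p, lam = 0.5, 1.5
pi = poisson_pmf(lam / (1 - p), 60)
next_pi = one_step(pi, p, lam)
max_diff = max(abs(u - v) for u, v in zip(pi, next_pi))
print(max_diff)
```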
There is a general result that handles Examples 5.5.6 and 5.5.7 and is useful in a
number of other situations. This will be developed in the next two exercises.
Exercise 5.5.12. Let Xn ≥ 0 be a Markov chain and suppose Ex X1 ≤ x − ε for
x > K, where ε > 0. Let Yn = Xn + nε and τ = inf{n : Xn ≤ K}. Y_{n∧τ} is a positive
supermartingale and the optional stopping theorem implies Ex τ ≤ x/ε.
Exercise 5.5.13. Suppose that Xn has state space {0, 1, 2, . . .}, the conditions of the
last exercise hold when K = 0, and E0 X1 < ∞. Then 0 is positive recurrent. We
leave it to the reader to formulate and prove a similar result when K > 0.
To close the section, we will give a self-contained proof of
Theorem 5.5.7. If p is irreducible and has a stationary distribution π then any other
stationary measure is a multiple of π .
Remark. This result is a consequence of Theorems 5.5.4 and 5.5.3, but we
find the method of proof amusing.
Proof. Since p is irreducible, π(x) > 0 for all x. Let ϕ be a concave function that is
bounded on (0, ∞), e.g., ϕ(x) = x/(x + 1). Define the entropy of µ by
$$E(\mu) = \sum_y \varphi\left(\frac{\mu(y)}{\pi(y)}\right)\pi(y)$$
The reason for the name will become clear during the proof.
$$E(\mu p) = \sum_y \varphi\left(\sum_x \frac{\mu(x)p(x,y)}{\pi(y)}\right)\pi(y) = \sum_y \varphi\left(\sum_x \frac{\mu(x)}{\pi(x)}\cdot\frac{\pi(x)p(x,y)}{\pi(y)}\right)\pi(y)$$
$$\ge \sum_y\sum_x \varphi\left(\frac{\mu(x)}{\pi(x)}\right)\frac{\pi(x)p(x,y)}{\pi(y)}\,\pi(y)$$
since ϕ is concave, and ν(x) = π(x)p(x, y)/π(y) is a probability distribution. Since
the π(y)’s cancel and Σ_y p(x, y) = 1, the last expression = E(µ), and we have shown
E(µp) ≥ E(µ), i.e., the entropy of an arbitrary initial measure µ is increased by an
application of p.
If p(x, y) > 0 for all x and y, and µp = µ, it follows that µ(x)/π(x) must be
constant, for otherwise there would be strict inequality in the application of Jensen’s
inequality. To get from the last special case to the general result, observe that if p is
irreducible
$$\bar p(x,y) = \sum_{n=1}^\infty 2^{-n}p^n(x,y) > 0 \quad\text{for all } x, y$$
and µp = µ implies $\mu\bar p = \mu$.

5.6 Asymptotic Behavior

The first topic in this section is to investigate the asymptotic behavior of p^n(x, y). If y
is transient, Σ_n p^n(x, y) < ∞, so p^n(x, y) → 0 as n → ∞. To deal with the recurrent
states, we let
$$N_n(y) = \sum_{m=1}^n 1_{\{X_m=y\}}$$
be the number of visits to y by time n.
Theorem 5.6.1. Suppose y is recurrent. For any x ∈ S, as n → ∞
$$\frac{N_n(y)}{n} \to \frac{1}{E_yT_y}\,1_{\{T_y<\infty\}} \quad P_x\text{ a.s.}$$
Here 1/∞ = 0.
Proof. Suppose first that we start at y. Let R(k) = min{n ≥ 1 : Nn(y) = k} = the
time of the kth return to y. Let tk = R(k) − R(k − 1), where R(0) = 0. Since we have
assumed X0 = y, t1, t2, . . . are i.i.d. and the strong law of large numbers implies
$$R(k)/k \to E_yT_y \quad P_y\text{ a.s.}$$
Since R(Nn(y)) ≤ n < R(Nn(y) + 1),
$$\frac{R(N_n(y))}{N_n(y)} \le \frac{n}{N_n(y)} < \frac{R(N_n(y)+1)}{N_n(y)+1}\cdot\frac{N_n(y)+1}{N_n(y)}$$
Letting n → ∞, and recalling Nn(y) → ∞ a.s. since y is recurrent, we have
$$\frac{n}{N_n(y)} \to E_yT_y \quad P_y\text{ a.s.}$$
To generalize now to x ≠ y, observe that if Ty = ∞ then Nn(y) = 0 for all n and
hence
$$N_n(y)/n \to 0 \quad\text{on } \{T_y=\infty\}$$
The strong Markov property implies that conditional on {Ty < ∞}, t2, t3, . . . are
i.i.d. and have Px(tk = n) = Py(Ty = n), so
$$R(k)/k = t_1/k + (t_2+\cdots+t_k)/k \to 0 + E_yT_y \quad P_x\text{ a.s.}$$
Repeating the proof for the case x = y shows
$$N_n(y)/n \to 1/E_yT_y \quad P_x\text{ a.s. on } \{T_y<\infty\}$$
and combining this with the result for {Ty = ∞} completes the proof.
Remark. Theorem 5.6.1 should help explain the terms positive and null recurrent.
If we start from x, then in the ﬁrst case the asymptotic fraction of time spent at x is
positive and in the second case it is 0.
Since 0 ≤ Nn (y )/n ≤ 1, it follows from the bounded convergence theorem that
Ex Nn(y)/n → Ex(1_{Ty<∞}/Ey Ty), so
$$\frac{1}{n}\sum_{m=1}^n p^m(x,y) \to \rho_{xy}/E_yT_y \qquad (5.6.1)$$
The last result was proved for recurrent y but also holds for transient y, since in that
case, Ey Ty = ∞, and the limit is 0, since Σ_m p^m(x, y) < ∞.
(5.6.1) shows that the sequence pn (x, y ) always converges in the Cesaro sense. The
next example shows that pn (x, y ) need not converge.
Example 5.6.1.
$$p = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad p^2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad p^3 = p,\ p^4 = p^2,\ \ldots$$
A similar problem also occurs in the Ehrenfest chain. In that case, if X0 is even, then
X1 is odd, X2 is even, . . . so pn (x, x) = 0 unless n is even. It is easy to construct
examples with pn (x, x) = 0 unless n is a multiple of 3 or 17 or . . .
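For the two-state chain of Example 5.6.1 the contrast between ordinary and Cesaro convergence is easy to see by computing matrix powers. The sketch below is only an illustration; the helper `mat_mul` and the cutoff of 100 steps are assumptions made for the example.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0, 1], [1, 0]]
powers = [P]
for _ in range(99):
    powers.append(mat_mul(powers[-1], P))   # powers[m - 1] holds the m-step matrix

returns = [Pm[0][0] for Pm in powers]       # m-step return probabilities, m = 1..100
cesaro = F(sum(returns), len(returns))      # (1/n) sum_{m=1}^n p^m(0, 0)

print(returns[:6], cesaro)
```

The return probabilities alternate between 0 and 1 and so do not converge, while their average equals 1/2 = 1/E0 T0, in line with (5.6.1).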
Theorem 5.6.4 below will show that this “periodicity” is the only thing that can
prevent the convergence of the pn (x, y ). First, we need a deﬁnition and two preliminary results. Let x be a recurrent state, let Ix = {n ≥ 1 : pn (x, x) > 0}, and let dx
be the greatest common divisor of Ix . dx is called the period of x. The ﬁrst result
says that the period is a class property.
Lemma 5.6.2. If ρxy > 0 then dy = dx.
Proof. Let K and L be such that p^K(x, y) > 0 and p^L(y, x) > 0. (x is recurrent, so
ρyx > 0.)
$$p^{K+L}(y,y) \ge p^L(y,x)\,p^K(x,y) > 0$$
so dy divides K + L, abbreviated dy | (K + L). Let n be such that p^n(x, x) > 0.
$$p^{K+n+L}(y,y) \ge p^L(y,x)\,p^n(x,x)\,p^K(x,y) > 0$$
so dy | (K + n + L), and hence dy | n. Since n ∈ Ix is arbitrary, dy | dx. Interchanging
the roles of y and x gives dx | dy, and hence dx = dy.
If a chain is irreducible and dx = 1 it is said to be aperiodic. The easiest way to
check this is to ﬁnd a state with p(x, x) > 0. The M/G/1 queue has ak > 0 for all
k ≥ 0, so it has this property. The renewal chain is aperiodic if g.c.d.{k : fk > 0} = 1.
Lemma 5.6.3. If dx = 1 then pm (x, x) > 0 for m ≥ m0 .
Proof by example. Suppose 4, 7 ∈ Ix . pm+n (x, x) ≥ pm (x, x)pn (x, x) so Ix is closed
under addition, i.e., if m, n ∈ Ix then m + n ∈ Ix . A little calculation shows that in
the example
\[ I_x \supset \{4, 7, 8, 11, 12, 14, 15, 16, 18, 19, 20, 21, \ldots\} \]
so the result is true with $m_0 = 18$. (Once $I_x$ contains four consecutive integers, it will
contain all the rest.)
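The "little calculation" in the proof by example can be reproduced mechanically. This is a sketch under stated assumptions: the helper names `additive_closure` and `threshold` and the search limit of 40 are illustrative choices, not from the text.

```python
def additive_closure(generators, limit):
    """All integers in [1, limit] that are sums of one or more generators."""
    reachable = [False] * (limit + 1)
    for n in range(1, limit + 1):
        reachable[n] = any(
            n == g or (n > g and reachable[n - g]) for g in generators
        )
    return [n for n in range(1, limit + 1) if reachable[n]]

def threshold(generators, limit):
    """Smallest m0 such that every m in [m0, limit] is in the closure."""
    members = set(additive_closure(generators, limit))
    m0 = limit + 1
    for m in range(limit, 0, -1):
        if m not in members:
            break
        m0 = m
    return m0

closure = additive_closure([4, 7], 40)
m0 = threshold([4, 7], 40)
```

Here `closure` begins 4, 7, 8, 11, 12, 14, 15, 16, 18, ... and `m0` is 18, matching the display above; 17 is the largest integer that is not a sum of 4's and 7's.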
Proof. Our ﬁrst goal is to prove that Ix contains two consecutive integers. Let n0 ,
n0 + k ∈ Ix . If k = 1, we are done. If not, then since the greatest common divisor of
Ix is 1, there is an n1 ∈ Ix so that k is not a divisor of n1 . Write n1 = mk + r with
0 < r < k . Since Ix is closed under addition, (m + 1)(n0 + k ) > (m + 1)n0 + n1 are
both in Ix . Their diﬀerence is
\[ k(m+1) - n_1 = k - r < k \]
Repeating the last argument (at most $k$ times), we eventually arrive at a pair of
consecutive integers $N, N+1 \in I_x$. It is now easy to show that the result holds
for $m_0 = N^2$. Let $m \ge N^2$ and write $m - N^2 = kN + r$ with $0 \le r < N$. Then
$m = r + N^2 + kN = r(1+N) + (N - r + k)N \in I_x$.

Theorem 5.6.4. Convergence theorem. Suppose p is irreducible, aperiodic (i.e.,
all states have $d_x = 1$), and has stationary distribution π. Then, as n → ∞,
$p^n(x,y) \to \pi(y)$.
Proof. Let $S^2 = S \times S$. Define a transition probability $\bar p$ on $S \times S$ by
\[ \bar p((x_1,y_1),(x_2,y_2)) = p(x_1,x_2)\,p(y_1,y_2) \]
i.e., each coordinate moves independently. Our first step is to check that $\bar p$ is irreducible.
This may seem like a silly thing to do first, but this is the only step that
requires aperiodicity. Since p is irreducible, there are K, L, so that $p^K(x_1,x_2) > 0$
and $p^L(y_1,y_2) > 0$. From Lemma 5.6.3 it follows that if M is large $p^{L+M}(x_2,x_2) > 0$
and $p^{K+M}(y_2,y_2) > 0$, so
\[ \bar p^{K+L+M}((x_1,y_1),(x_2,y_2)) > 0 \]
Our second step is to observe that since the two coordinates are independent,
$\bar\pi(a,b) = \pi(a)\pi(b)$ defines a stationary distribution for $\bar p$, and Theorem 5.5.4 implies
that for $\bar p$ all states are recurrent. Let $(X_n, Y_n)$ denote the chain on $S \times S$, and let
T be the first time that this chain hits the diagonal $\{(y,y) : y \in S\}$. Let $T_{(x,x)}$ be
the hitting time of (x, x). Since $\bar p$ is irreducible and recurrent, $T_{(x,x)} < \infty$ a.s. and
hence $T < \infty$ a.s. The final step is to observe that on $\{T \le n\}$, the two coordinates
Xn and Yn have the same distribution. By considering the time and place of the ﬁrst
intersection and then using the Markov property,
\begin{align*}
P(X_n = y, T \le n) &= \sum_{m=1}^{n}\sum_x P(T = m, X_m = x, X_n = y) \\
&= \sum_{m=1}^{n}\sum_x P(T = m, X_m = x)\,P(X_n = y \mid X_m = x) \\
&= \sum_{m=1}^{n}\sum_x P(T = m, Y_m = x)\,P(Y_n = y \mid Y_m = x) = P(Y_n = y, T \le n)
\end{align*}
To ﬁnish up, we observe that
P (Xn = y ) = P (Yn = y, T ≤ n) + P (Xn = y, T > n)
≤ P (Yn = y ) + P (Xn = y, T > n)
and similarly, P (Yn = y ) ≤ P (Xn = y ) + P (Yn = y, T > n). So
\[ |P(X_n = y) - P(Y_n = y)| \le P(X_n = y, T > n) + P(Y_n = y, T > n) \]
and summing over y gives
\[ \sum_y |P(X_n = y) - P(Y_n = y)| \le 2P(T > n) \]
If we let X0 = x and let Y0 have the stationary distribution π, then Yn has distribution
π, and it follows that
\[ \sum_y |p^n(x,y) - \pi(y)| \le 2P(T > n) \to 0 \]
proving the desired result. If we recall the definition of the total variation distance
given in Section 2.6, the last conclusion can be written as
\[ \|p^n(x,\cdot) - \pi(\cdot)\| \le P(T > n) \to 0 \]
At ﬁrst glance, it may seem strange to prove the convergence theorem by running
independent copies of the chain. An approach that is slightly more complicated but
explains better what is happening is to define
\[
q((x_1,y_1),(x_2,y_2)) =
\begin{cases}
p(x_1,x_2)\,p(y_1,y_2) & \text{if } x_1 \ne y_1 \\
p(x_1,x_2) & \text{if } x_1 = y_1,\ x_2 = y_2 \\
0 & \text{otherwise}
\end{cases}
\]
In words, the two coordinates move independently until they hit and then move
together. It is easy to see from the definition that each coordinate is a copy of the
original process. If $T'$ is the hitting time of the diagonal for the new chain $(X'_n, Y'_n)$,
then $X'_n = Y'_n$ on $\{T' \le n\}$, so it is clear that
\[ \sum_y |P(X'_n = y) - P(Y'_n = y)| \le 2P(X'_n \ne Y'_n) = 2P(T' > n) \]
On the other hand, $T$ and $T'$ have the same distribution so $P(T' > n) \to 0$, and the
conclusion follows as before. The technique used in the last proof is called coupling.
Generally, this term refers to building two sequences Xn and Yn on the same space
to conclude that Xn converges in distribution by showing $P(X_n \ne Y_n) \to 0$, or more
generally, that for some metric ρ, ρ(Xn, Yn) → 0 in probability.
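The coupling bound can be checked exactly on a small chain by tracking the mass of the independent pair that has not yet reached the diagonal. The sketch below is an illustration only: the three-state matrix `p`, its stationary vector `pi`, and the convention that T is the first time n ≥ 0 with Xn = Yn are assumptions, not taken from the text.

```python
def step(dist, p):
    """One step of a distribution (row vector) under transition matrix p."""
    s = len(p)
    return [sum(dist[a] * p[a][b] for a in range(s)) for b in range(s)]

def coupling_check(p, pi, x, n_max):
    """Return (tv, tail): tv[n] is the total variation distance between
    p^n(x, .) and pi, and tail[n] is P(T > n), for n = 0..n_max."""
    s = len(p)
    dist = [1.0 if a == x else 0.0 for a in range(s)]  # law of X_n
    # off[a][b] = P(X_n = a, Y_n = b, T > n), coordinates independent
    off = [[dist[a] * pi[b] if a != b else 0.0 for b in range(s)]
           for a in range(s)]
    tv, tail = [], []
    for _ in range(n_max + 1):
        tv.append(0.5 * sum(abs(dist[y] - pi[y]) for y in range(s)))
        tail.append(sum(map(sum, off)))
        dist = step(dist, p)
        new = [[0.0] * s for _ in range(s)]
        for a in range(s):
            for b in range(s):
                if off[a][b] > 0.0:
                    for a2 in range(s):
                        for b2 in range(s):
                            if a2 != b2:  # mass landing on the diagonal is dropped
                                new[a2][b2] += off[a][b] * p[a][a2] * p[b][b2]
        off = new
    return tv, tail

p = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]  # stationary for p (checked in the usage note)
tv, tail = coupling_check(p, pi, x=0, n_max=30)
```

One can confirm that `step(pi, p)` returns `pi`, and then compare `tv[n]` with `tail[n]` entry by entry; both sequences decay geometrically for this chain.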
Finite state space
The convergence theorem is much easier when the state space is ﬁnite.
Exercise 5.6.1. Show that if S is ﬁnite and p is irreducible and aperiodic, then there
is an m so that pm (x, y ) > 0 for all x, y .
Exercise 5.6.2. Show that if S is finite, p is irreducible and aperiodic, and T is the
coupling time defined in the proof of Theorem 5.6.4 then $P(T > n) \le Cr^n$ for some $r < 1$
and $C < \infty$. So the convergence to equilibrium occurs exponentially rapidly in this
case. Hint: First consider the case in which p(x, y) > 0 for all x and y and reduce the
general case to this one by looking at a power of p.
Exercise 5.6.3. For any transition matrix p, define
\[ \alpha_n = \sup_{i,j} \frac{1}{2}\sum_k |p^n(i,k) - p^n(j,k)| \]
The 1/2 is there because for any i and j we can define r.v.'s X and Y so that
$P(X = k) = p^n(i,k)$, $P(Y = k) = p^n(j,k)$, and
\[ P(X \ne Y) = (1/2)\sum_k |p^n(i,k) - p^n(j,k)| \]
Show that $\alpha_{m+n} \le \alpha_n\alpha_m$. Here you may find that the coupling interpretation keeps
you from getting lost in the algebra. Using Lemma 1.9.1 in Chapter 1, we can conclude
that
\[ \frac{1}{n}\log\alpha_n \to \inf_{m\ge 1}\frac{1}{m}\log\alpha_m \]
so if $\alpha_m < 1$ for some m, it approaches 0 exponentially fast.

As the last two exercises show, Markov chains on finite state spaces converge
exponentially fast to their stationary distributions. In applications, however, it is
important to have rates of convergence. The next two problems are a taste of an
exciting research area.
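The quantity $\alpha_n$ from Exercise 5.6.3 is easy to tabulate for a small matrix, which gives a numerical sanity check of the submultiplicative inequality (it is not a proof). In this sketch the three-state matrix and the helper names are assumptions chosen for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    s = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]

def alpha(pn):
    """sup over i, j of (1/2) * sum_k |pn(i, k) - pn(j, k)|."""
    s = len(pn)
    return max(0.5 * sum(abs(pn[i][k] - pn[j][k]) for k in range(s))
               for i in range(s) for j in range(s))

def alpha_sequence(p, n_max):
    """[alpha_1, ..., alpha_{n_max}] computed from successive powers of p."""
    out, pn = [], p
    for _ in range(n_max):
        out.append(alpha(pn))
        pn = mat_mul(pn, p)
    return out

p = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
alphas = alpha_sequence(p, 12)  # alphas[n - 1] holds alpha_n
```

Comparing `alphas[m + n - 1]` with `alphas[m - 1] * alphas[n - 1]` over all pairs illustrates the inequality, and the geometric decay of the list illustrates the exponential rate.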
Example 5.6.2. Shuffling cards. The state of a deck of n cards can be represented
by a permutation, π(i) giving the location of the ith card. Consider the following
method of mixing the deck up. The top card is removed and inserted under one of
the n − 1 cards that remain. I claim that by following the bottom card of the deck
we can see that it takes about n log n moves to mix up the deck. This card stays
at the bottom until the first time ($T_1$) a card is inserted below it. It is easy to see
that when the kth card is inserted below the original bottom card (at time $T_k$), all
k! arrangements of the cards below are equally likely, so at time $\tau_n = T_{n-1} + 1$ all n!
arrangements are equally likely. If we let $T_0 = 0$ and $t_k = T_k - T_{k-1}$ for 1 ≤ k ≤ n − 1,
then these r.v.'s are independent, and $t_k$ has a geometric distribution with success
probability k/(n − 1). These waiting times are the same as the ones in the coupon
collector's problem (Example 1.5.3), so $\tau_n/(n \log n) \to 1$ in probability as n → ∞.
For more on card shuﬄing, see Aldous and Diaconis (1986).
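Since $t_k$ is geometric with success probability k/(n − 1), its mean is (n − 1)/k, so $E\tau_n = 1 + (n-1)\sum_{k=1}^{n-1} 1/k$. The sketch below evaluates this mean and compares it with n log n; the function names are illustrative assumptions, and the computation concerns only the expectation, not the convergence in probability.

```python
import math

def expected_tau(n):
    """E[tau_n] = 1 + sum_{k=1}^{n-1} (n - 1) / k for a deck of n >= 2 cards."""
    return 1.0 + sum((n - 1) / k for k in range(1, n))

def ratio(n):
    """E[tau_n] divided by n log n."""
    return expected_tau(n) / (n * math.log(n))
```

The ratio approaches 1 slowly, because the harmonic sum exceeds log n by roughly Euler's constant, so the correction is of order 1/log n.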
Example 5.6.3. Random walk on the hypercube. Consider $\{0,1\}^d$ as a graph
with edges connecting each pair of points that differ in only one coordinate. Let
$X_n$ be a random walk on $\{0,1\}^d$ that stays put with probability 1/2 and jumps
to one of its d neighbors with probability 1/2d each. Let $Y_n$ be another copy of
the chain in which $Y_0$ (and hence $Y_n$, n ≥ 1) is uniformly distributed on $\{0,1\}^d$.
We construct a coupling of $X_n$ and $Y_n$ by letting $U_1, U_2, \ldots$ be i.i.d. uniform on
{1, 2, . . . , d}, and letting $V_1, V_2, \ldots$ be independent i.i.d. uniform on {0, 1}. At time n,
the $U_n$th coordinates of X and Y are each set equal to $V_n$. The other coordinates are
unchanged. Let $T_d = \inf\{m : \{U_1, \ldots, U_m\} = \{1, 2, \ldots, d\}\}$. When $n \ge T_d$, $X_n = Y_n$.
Results for the coupon collector's problem (Example 1.5.3) show that $T_d/(d \log d) \to 1$
in probability as d → ∞.
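The coupling in this example is simple to simulate. The following sketch is an illustration under assumptions: the function name `coupled_walks`, the starting point X0 = origin, 0-based coordinate labels, and the seed are not from the text.

```python
import random

def coupled_walks(d, steps, seed=0):
    """Run the coupled pair for `steps` steps; return (X, Y, T_d)."""
    rng = random.Random(seed)
    x = [0] * d                               # X_0 = origin (assumed start)
    y = [rng.randrange(2) for _ in range(d)]  # Y_0 uniform on {0,1}^d
    seen = set()                              # coordinates refreshed so far
    t_d = None                                # first time every coordinate has been refreshed
    for n in range(1, steps + 1):
        u = rng.randrange(d)                  # U_n, labelled 0..d-1 here
        v = rng.randrange(2)                  # V_n
        x[u] = v                              # both walks set coordinate U_n to V_n
        y[u] = v
        seen.add(u)
        if t_d is None and len(seen) == d:
            t_d = n
    return x, y, t_d

x, y, t_d = coupled_walks(d=10, steps=500)
```

Whenever `t_d` is not `None`, every coordinate has been overwritten with a common value, so the two returned states coincide, as the example asserts for n ≥ T_d.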
Exercises
5.6.4. Strong law for additive functionals. Suppose p is irreducible and has
stationary distribution π. Let f be a function that has $\sum_y |f(y)|\pi(y) < \infty$. Let $T_x^k$ be
the time of the kth return to x. (i) Show that
\[ V_k^f = f(X(T_x^k)) + \cdots + f(X(T_x^{k+1} - 1)), \quad k \ge 1, \ \text{are i.i.d.} \]
with $E|V_k^f| < \infty$. (ii) Let $K_n = \inf\{k : T_x^k \ge n\}$ and show that
\[ \frac{1}{n}\sum_{m=1}^{K_n} V_m^f \to \frac{EV_1^f}{E_xT_x^1} = \sum_y f(y)\pi(y) \quad P_\mu\text{-a.s.} \]
(iii) Show that $\max_{1\le m\le n} V_m^{|f|}/n \to 0$ and conclude
\[ \frac{1}{n}\sum_{m=1}^{n} f(X_m) \to \sum_y f(y)\pi(y) \quad P_\mu\text{-a.s.} \]
for any initial distribution µ.
5.6.5. Central limit theorem for additive functionals. Suppose in addition to
the conditions in Exercise 5.6.4 that $\sum_y f(y)\pi(y) = 0$, and $E_x(V_k^{|f|})^2 < \infty$. (i)
Use the random index central limit theorem (Exercise 4.6 in Chapter 2) to conclude
that for any initial distribution µ
\[ \frac{1}{\sqrt{n}}\sum_{m=1}^{K_n} V_m^f \Rightarrow c\chi \quad \text{under } P_\mu \]
(ii) Show that $\max_{1\le m\le n} V_m^{|f|}/\sqrt{n} \to 0$ in probability and conclude
\[ \frac{1}{\sqrt{n}}\sum_{m=1}^{n} f(X_m) \Rightarrow c\chi \quad \text{under } P_\mu \]
5.6.6. Ratio Limit Theorems. Theorem 5.6.1 does not say much in the null
recurrent case. To get a more informative limit theorem, suppose that y is recurrent
and m is the (unique up to constant multiples) stationary measure on $C_y = \{z : \rho_{yz} > 0\}$.
Let $N_n(z) = |\{m \le n : X_m = z\}|$. Break up the path at successive returns
to y and show that $N_n(z)/N_n(y) \to m(z)/m(y)$ $P_x$-a.s. for all $x, z \in C_y$. Note that
$n \to N_n(z)$ is increasing, so this is much easier than the previous problem.
5.6.7. We got (5.6.1) from Theorem 5.6.1 by taking expected value. This does not
work for the ratio in the previous exercise, so we need another approach. Suppose
$z \ne y$. (i) Let $\bar p_n(x,z) = P_x(X_n = z, T_y > n)$ and decompose $p^m(x,z)$ according to
the value of $J = \sup\{j \in [1,m) : X_j = y\}$ to get
\[ \sum_{m=1}^{n} p^m(x,z) = \sum_{m=1}^{n} \bar p_m(x,z) + \sum_{j=1}^{n-1} p^j(x,y)\sum_{k=1}^{n-j} \bar p_k(y,z) \]
(ii) Show that
\[ \frac{\sum_{m=1}^{n} p^m(x,z)}{\sum_{m=1}^{n} p^m(x,y)} \to \frac{m(z)}{m(y)} \]

5.7 Periodicity, Tail σ-field*

Lemma 5.7.1. Suppose p is irreducible, recurrent, and all states have period d. Fix
$x \in S$, and for each $y \in S$, let $K_y = \{n \ge 1 : p^n(x,y) > 0\}$. (i) There is an
$r_y \in \{0, 1, \ldots, d-1\}$ so that if $n \in K_y$ then $n = r_y \bmod d$, i.e., the difference $n - r_y$
is a multiple of d. (ii) Let $S_r = \{y : r_y = r\}$ for $0 \le r < d$. If $y \in S_i$, $z \in S_j$, and
$p^n(y,z) > 0$ then $n = (j - i) \bmod d$. (iii) $S_0, S_1, \ldots, S_{d-1}$ are irreducible classes for
$p^d$, and all states have period 1.
Proof. (i) Let m(y) be such that $p^{m(y)}(y,x) > 0$. If $n \in K_y$ then $p^{n+m(y)}(x,x)$ is
positive so $d \mid (n + m(y))$. Let $r_y = (d - m(y)) \bmod d$. (ii) Let m, n be such that $p^n(y,z)$,
$p^m(x,y) > 0$. Since $p^{n+m}(x,z) > 0$, it follows from (i) that $n + m = j \bmod d$.
Since $m = i \bmod d$, the result follows. The irreducibility in (iii) follows immediately
from (ii). The aperiodicity follows from the definition of the period as the g.c.d. of
$\{n \ge 1 : p^n(x,x) > 0\}$.
A partition of the state space S0 , S1 , . . . , Sd−1 satisfying (ii) in Lemma 5.7.1 is
called a cyclic decomposition of the state space. Except for the choice of the set
to put ﬁrst, it is unique. (Pick an x ∈ S . It lies in some Sj , but once the value of j
is known, irreducibility and (ii) allow us to calculate all the sets.)
Exercise 5.7.1. Find the decomposition for the Markov chain with transition probability

        1    2    3    4    5    6    7
   1    0    0    0   .5   .5    0    0
   2   .3    0    0    0    0    0   .7
   3    0    0    0    0    0    0    1
   4    0    0    1    0    0    0    0
   5    0    0    1    0    0    0    0
   6    0    1    0    0    0    0    0
   7    0    0    0   .4    0   .6    0

Theorem 5.7.2. Convergence theorem, periodic case. Suppose p is irreducible,
has a stationary distribution π , and all states have period d. Let x ∈ S , and let
S0 , S1 , . . . , Sd−1 be the cyclic decomposition of the state space with x ∈ S0 . If y ∈ Sr
then
\[ \lim_{m\to\infty} p^{md+r}(x,y) = \pi(y)d \]
Proof. If $y \in S_0$ then using (iii) in Lemma 5.7.1 and applying Theorem 5.6.4 to $p^d$
shows
\[ \lim_{m\to\infty} p^{md}(x,y) \ \text{exists} \]
To identify the limit, we note that (5.6.1) implies
\[ \frac{1}{n}\sum_{m=1}^{n} p^m(x,y) \to \pi(y) \]
and (ii) of Lemma 5.7.1 implies $p^m(x,y) = 0$ unless $d \mid m$, so the limit in the first
display must be π(y)d. If $y \in S_r$ with $1 \le r < d$ then
\[ p^{md+r}(x,y) = \sum_{z\in S_r} p^r(x,z)\,p^{md}(z,y) \]
Since $y, z \in S_r$ it follows from the first case in the proof that $p^{md}(z,y) \to \pi(y)d$ as
m → ∞. $p^{md}(z,y) \le 1$, and $\sum_z p^r(x,z) = 1$, so the result follows from the dominated
convergence theorem.
Let $F'_n = \sigma(X_{n+1}, X_{n+2}, \ldots)$ and $T = \cap_n F'_n$ be the tail σ-field. The next result
is due to Orey. The proof we give is from Blackwell and Freedman (1964).
Theorem 5.7.3. Suppose p is irreducible, recurrent, and all states have period d. Then
$T = \sigma(\{X_0 \in S_r\} : 0 \le r < d)$.
Remark. To be precise, if µ is any initial distribution and A ∈ T then there is an r
so that $A = \{X_0 \in S_r\}$ $P_\mu$-a.s.
Proof. We build up to the general result in three steps.
Case 1. Suppose P (X0 = x) = 1. Let T0 = 0, and for n ≥ 1, let Tn = inf {m > Tn−1 :
Xm = x} be the time of the nth return to x. Let
Vn = (X (Tn−1 ), . . . , X (Tn − 1))
The vectors $V_n$ are i.i.d. by Exercise 5.4.1, and the tail σ-field is contained in the
exchangeable field of the $V_n$, so the Hewitt-Savage 0-1 law (Theorem 3.1.1, proved
there for r.v.'s taking values in a general measurable space) implies that T is trivial
in this case.
Case 2. Suppose that the initial distribution is concentrated on one cyclic class, say
S0 . If A ∈ T then Px (A) ∈ {0, 1} for each x by case 1. If Px (A) = 0 for all x ∈ S0
then Pµ (A) = 0. Suppose Py (A) > 0, and hence = 1, for some y ∈ S0 . Let z ∈ S0 .
Since $p^d$ is irreducible and aperiodic on $S_0$, there is an n so that $p^n(z,y) > 0$ and
$p^n(y,y) > 0$. If we write $1_A = 1_B \circ \theta_n$ then the Markov property implies
\[ 1 = P_y(A) = E_y(E_y(1_B \circ \theta_n \mid F_n)) = E_y(E_{X_n} 1_B) \]
so $P_y(B) = 1$. Another application of the Markov property gives
\[ P_z(A) = E_z(E_{X_n} 1_B) \ge p^n(z,y) > 0 \]
so $P_z(A) = 1$, and since $z \in S_0$ is arbitrary, $P_\mu(A) = 1$.
General Case. From case 2, we see that $P(A \mid X_0 = y) \equiv 1$ or $\equiv 0$ on each cyclic class.
This implies that either $\{X_0 \in S_r\} \subset A$ or $\{X_0 \in S_r\} \cap A = \emptyset$ $P_\mu$-a.s. Conversely, it
is clear that $\{X_0 \in S_r\} = \{X_{nd} \in S_r \text{ i.o.}\} \in T$, and the proof is complete.
The next result will help us identify the tail σ-field in transient examples.
Theorem 5.7.4. Suppose X0 has initial distribution µ. The equations
\[ h(X_n, n) = E_\mu(Z \mid F_n) \quad \text{and} \quad Z = \lim_{n\to\infty} h(X_n, n) \]
set up a 1-1 correspondence between bounded Z ∈ T and bounded space-time harmonic
functions, i.e., bounded $h : S \times \{0, 1, \ldots\} \to \mathbf{R}$, so that $h(X_n, n)$ is a martingale.
Proof. Let Z ∈ T, write $Z = Y_n \circ \theta_n$, and let $h(x,n) = E_xY_n$.
\[ E_\mu(Z \mid F_n) = E_\mu(Y_n \circ \theta_n \mid F_n) = h(X_n, n) \]
by the Markov property, so $h(X_n, n)$ is a martingale. Conversely, if $h(X_n, n)$ is a
bounded martingale, using Theorems 4.2.8 and 4.5.6 shows $h(X_n, n) \to Z \in T$ as
n → ∞, and $h(X_n, n) = E_\mu(Z \mid F_n)$.
Exercise 5.7.2. A random variable Z with $Z = Z \circ \theta$, and hence $= Z \circ \theta_n$ for all n,
is called invariant. Show there is a 1-1 correspondence between bounded invariant
random variables and bounded harmonic functions. We will have more to say about
invariant r.v.'s in Section 6.1.
Example 5.7.1. Simple random walk in d dimensions. We begin by constructing
a coupling for this process. Let $i_1, i_2, \ldots$ be i.i.d. uniform on {1, . . . , d}. Let
$\xi_1, \xi_2, \ldots$ and $\eta_1, \eta_2, \ldots$ be i.i.d. uniform on {−1, 1}. Let $e_j$ be the jth unit vector.
Construct a coupled pair of d-dimensional simple random walks by
\begin{align*}
X_n &= X_{n-1} + e(i_n)\xi_n \\
Y_n &= \begin{cases}
Y_{n-1} + e(i_n)\xi_n & \text{if } X_{n-1}^{i_n} = Y_{n-1}^{i_n} \\
Y_{n-1} + e(i_n)\eta_n & \text{if } X_{n-1}^{i_n} \ne Y_{n-1}^{i_n}
\end{cases}
\end{align*}
In words, the coordinate that changes is always the same in the two walks, and once
they agree in one coordinate, future movements in that direction are the same. It is
easy to see that if $X_0^i - Y_0^i$ is even for 1 ≤ i ≤ d, then the two random walks will hit
with probability one.
Let $L_0 = \{z \in \mathbf{Z}^d : z^1 + \cdots + z^d \text{ is even}\}$ and $L_1 = \mathbf{Z}^d - L_0$. Although we
have only defined the notion for the recurrent case, it should be clear that $L_0, L_1$ is
the cyclic decomposition of the state space for simple random walk. If $S_n \in L_i$ then
$S_{n+1} \in L_{1-i}$ and $p^2$ is irreducible on each $L_i$. To couple two random walks starting
from $x, y \in L_i$, let them run independently until the first time all the coordinate
differences are even, and then use the last coupling. In the remaining case, $x \in L_0$,
$y \in L_1$, coupling is impossible.
The next result should explain our interest in coupling two d-dimensional simple
random walks.
Theorem 5.7.5. For d-dimensional simple random walk,
\[ T = \sigma(\{X_0 \in L_i\},\ i = 0, 1) \]
Proof. Let $x, y \in L_i$, and let $X_n, Y_n$ be a realization of the coupling defined above
for $X_0 = x$ and $Y_0 = y$. Let h(x, n) be a bounded space-time harmonic function.
The martingale property implies $h(x,0) = E_xh(X_n, n)$. If $|h| \le C$, it follows from the
coupling that
\[ |h(x,0) - h(y,0)| = |Eh(X_n,n) - Eh(Y_n,n)| \le 2C\,P(X_n \ne Y_n) \to 0 \]
so h(x, 0) is constant on $L_0$ and $L_1$. Applying the last result to $h'(x,m) = h(x, n+m)$,
we see that $h(x,n) = a_n^i$ on $L_i$. The martingale property implies $a_n^i = a_{n+1}^{1-i}$, and the
desired result follows from Theorem 5.7.4.
Example 5.7.2. Ornstein's coupling. Let p(x, y) = f(y − x) be the transition
probability for an irreducible aperiodic random walk on Z. To prove that the tail
σ-field is trivial, pick M large enough so that the random walk generated by the
probability distribution $f_M(x)$ with $f_M(x) = c_Mf(x)$ for $|x| \le M$ and $f_M(x) = 0$ for
$|x| > M$ is irreducible and aperiodic. Let $Z_1, Z_2, \ldots$ be i.i.d. with distribution f and
let $W_1, W_2, \ldots$ be i.i.d. with distribution $f_M$. Let $X_n = X_{n-1} + Z_n$ for n ≥ 1. If
$X_{n-1} = Y_{n-1}$, we set $X_n = Y_n$. Otherwise, we let
\[
Y_n = \begin{cases}
Y_{n-1} + Z_n & \text{if } |Z_n| > M \\
Y_{n-1} + W_n & \text{if } |Z_n| \le M
\end{cases}
\]
In words, the big jumps are taken in parallel and the small jumps are independent. The
recurrence of one-dimensional random walks with mean 0 implies $P(X_n \ne Y_n) \to 0$.
Repeating the proof of Theorem 5.7.5, we see that T is trivial.
The tail σ-field in Theorem 5.7.5 is essentially the same as in Theorem 5.7.3. To
get a more interesting T, we look at:
Example 5.7.3. Random walk on a tree. To facilitate definitions, we will consider
the system as a random walk on a group with 3 generators a, b, c that have $a^2 = b^2 = c^2 = e$,
the identity element. To form the random walk, let $\xi_1, \xi_2, \ldots$ be i.i.d. with
$P(\xi_n = x) = 1/3$ for x = a, b, c, and let $X_n = X_{n-1}\xi_n$. (This is equivalent to a random
walk on the tree in which each vertex has degree 3 but the algebraic formulation is
convenient for computations.) Let $L_n$ be the length of the word $X_n$ when it has been
reduced as much as possible, with $L_n = 0$ if $X_n = e$. The reduction can be done
as we go along. If the last letter of $X_{n-1}$ is the same as $\xi_n$, we erase it, otherwise
we add the new letter. It is easy to see that $L_n$ is a Markov chain with a transition
probability that has p(0, 1) = 1 and
\[ p(j, j-1) = 1/3 \qquad p(j, j+1) = 2/3 \quad \text{for } j \ge 1 \]
As n → ∞, $L_n \to \infty$. From this, it follows easily that the word $X_n$ has a limit in the
sense that the ith letter $X_n^i$ stays the same for large n. Let $X_\infty$ be the limiting word,
i.e., $X_\infty^i = \lim X_n^i$. $T \supset \sigma(X_\infty^i, i \ge 1)$, but it is easy to see that this is not all. If
$S_0$ = the words of even length, and $S_1 = S_0^c$, then $X_n \in S_i$ implies $X_{n+1} \in S_{1-i}$, so
$\{X_0 \in S_0\} \in T$. Can the reader prove that we have now found all of T? As Fermat
once said, "I have a proof but it won't fit in the margin."
Remark. This time the solution does not involve elliptic curves but uses "h-paths."
See Furstenberg (1970) or decode the following: "Condition on the exit point (the
infinite word). Then the resulting RW is an h-process, which moves closer to the
boundary with probability 2/3 and farther with probability 1/3 (1/6 each to the two
possibilities). Two such random walks couple, provided they have same parity." The
quote is from Robin Pemantle, who says he consulted Itai Benjamini and Yuval
Peres.

5.8 General State Space*

In this section, we will generalize the results from Sections 5.4–5.6 to a collection of
Markov chains with uncountable state space called Harris chains. The developments
here are motivated by three ideas. First, the proofs for countable state space work if there
is one point in the state space that the chain hits with probability one. (Think,
for example, about the construction of the stationary measure via the cycle trick.)
Second, a recurrent Harris chain can be modiﬁed to contain such a point. Third,
the collection of Harris chains is a comfortable level of generality; broad enough to
contain a large number of interesting examples, yet restrictive enough to allow for a
rich theory.
We say that a Markov chain $X_n$ is a Harris chain if we can find sets $A, B \in \mathcal{S}$,
a function q with $q(x,y) \ge \epsilon > 0$ for $x \in A$, $y \in B$, and a probability measure ρ
concentrated on B so that:
(i) If $\tau_A = \inf\{n \ge 0 : X_n \in A\}$, then $P_z(\tau_A < \infty) > 0$ for all $z \in S$.
(ii) If $x \in A$ and $C \subset B$ then $p(x,C) \ge \int_C q(x,y)\,\rho(dy)$.
To explain the definition we turn to some examples:
Example 5.8.1. Countable state space. If S is countable and there is a point a
with $\rho_{xa} > 0$ for all x (a condition slightly weaker than irreducibility) then we can
take A = {a}, B = {b}, where b is any state with p(a, b) > 0, $\rho = \delta_b$ the point mass
at b, and q(a, b) = p(a, b).
Conversely, if S is countable and $(A', B')$ is a pair for which (i) and (ii) hold, then
we can without loss of generality reduce $B'$ to a single point b. Having done this, if
we set A = {b}, pick c so that p(b, c) > 0, and set B = {c}, then (i) and (ii) hold with
A and B both singletons.
Example 5.8.2. Chains with continuous densities. Suppose $X_n \in \mathbf{R}^d$ is a
Markov chain with a transition probability that has p(x, dy) = p(x, y) dy where
$(x,y) \to p(x,y)$ is continuous. Pick $(x_0, y_0)$ so that $p(x_0, y_0) > 0$. Let A and B
be open sets around $x_0$ and $y_0$ that are small enough so that $p(x,y) \ge \epsilon > 0$ on
A × B. If we let $\rho(C) = |B \cap C|/|B|$, where |B| is the Lebesgue measure of B, then
(ii) holds. If (i) holds, then $X_n$ is a Harris chain.
For concrete examples, consider:
(a) Diffusion processes are a large class of examples that lie outside the scope
of this book, but are too important to ignore. When things are nice, specifically,
if the generator of X has Hölder continuous coefficients satisfying suitable growth
conditions, see the Appendix of Dynkin (1965), then P(X1 ∈ dy) = p(x, y) dy, and p
satisfies the conditions above.
(b) ARMAP's. Let $\xi_1, \xi_2, \ldots$ be i.i.d. and $V_n = \theta V_{n-1} + \xi_n$. $V_n$ is called an
autoregressive moving average process or armap for short. We call $V_n$ a
smooth armap if the distribution of $\xi_n$ has a continuous density g. In this case
p(x, dy) = g(y − θx) dy with (x, y) → g(y − θx) continuous.
In analyzing the behavior of armap's there are a number of cases to consider
depending on the nature of the support of $\xi_n$. We call $V_n$ a simple armap if the
density function for $\xi_n$ is positive at all points in R. In this case we can take
A = B = [−1/2, 1/2] with ρ = the restriction of Lebesgue measure.
(c) The discrete Ornstein-Uhlenbeck process is a special case of (a) and (b). Let
$\xi_1, \xi_2, \ldots$ be i.i.d. standard normals and let $V_n = \theta V_{n-1} + \xi_n$. The Ornstein-Uhlenbeck
process is a diffusion process $\{V_t, t \in [0, \infty)\}$ that models the velocity of a particle
suspended in a liquid. See, e.g., Breiman (1968) Section 16.1. Looking at $V_t$ at integer
times (and dividing by a constant to make the variance 1) gives a Markov chain with
the indicated distributions.
Example 5.8.3. GI/G/1 queue, or storage model. Let ξ1 , ξ2 , . . . be i.i.d. and
deﬁne Wn inductively by Wn = (Wn−1 + ξn )+ . If P (ξn < 0) > 0 then we can take
A = B = {0} and (i) and (ii) hold. To explain the ﬁrst name in the title, consider a
queueing system in which customers arrive at times of a renewal process, i.e., at times
0 = T0 < T1 < T2 . . . with ζn = Tn − Tn−1 , n ≥ 1 i.i.d. Let ηn , n ≥ 0, be the amount
of service time the nth customer requires and let ξn = ηn−1 − ζn . I claim that Wn is
the amount of time the nth customer has to wait to enter service. To see this, notice
that the (n − 1)th customer adds ηn−1 to the server’s workload, and if the server is
busy at all times in [Tn−1 , Tn ), he reduces his workload by ζn . If Wn−1 + ηn−1 < ζn
then the server has enough time to ﬁnish his work and the next arriving customer will
ﬁnd an empty queue.
The second name in the title refers to the fact that Wn can be used to model the
contents of a storage facility. For an intuitive description, consider water reservoirs.
We assume that rain storms occur at times of a renewal process {Tn : n ≥ 1}, that
the nth rainstorm contributes an amount of water ηn , and that water is consumed at
constant rate c. If we let ζn = Tn − Tn−1 as before, and ξn = ηn−1 − cζn , then Wn
gives the amount of water in the reservoir just before the nth rainstorm.
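The recursion $W_n = (W_{n-1} + \xi_n)^+$ is short enough to write out directly. In this sketch the function name `waiting_times`, the exponential service and interarrival distributions, their means, and the seed are all assumptions made for illustration; the example itself places no such restriction on $\xi_n$.

```python
import random

def waiting_times(xis, w0=0.0):
    """Return [W_1, W_2, ...] where W_n = max(W_{n-1} + xi_n, 0) and W_0 = w0."""
    w, out = w0, []
    for xi in xis:
        w = max(w + xi, 0.0)
        out.append(w)
    return out

rng = random.Random(1)
# Assumed inputs: service times eta with mean 0.8 and interarrival times zeta
# with mean 1, both exponential, so xi = eta - zeta has mean -0.2 < 0.
xis = [rng.expovariate(1 / 0.8) - rng.expovariate(1.0) for _ in range(1000)]
ws = waiting_times(xis)
```

On the hand-picked input `[1.0, -0.5, -1.0, 2.0]` the output follows the partial sums 1.0, 0.5 until they would go negative, resets to 0, and then restarts from there.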
History Lesson. Doeblin was the first to prove results for Markov chains on general
state space. He supposed that there was an n so that $p^n(x,C) \ge \epsilon\rho(C)$ for all $x \in S$
and $C \subset S$. See Doob (1953), Section V.5, for an account of his results. Harris (1956)
generalized Doeblin's result by observing that it was enough to have a set A so that (i)
holds and the chain viewed on A ($Y_k = X(T_A^k)$, where $T_A^k = \inf\{n > T_A^{k-1} : X_n \in A\}$
and $T_A^0 = 0$) satisfies Doeblin's condition. Our formulation, as well as most of the
proofs in this section, follows Athreya and Ney (1978). For a nice description of the
"traditional approach," see Revuz (1984).
Given a Harris chain on $(S, \mathcal{S})$, we will construct a Markov chain $\bar X_n$ with transition
probability $\bar p$ on $(\bar S, \bar{\mathcal{S}})$, where $\bar S = S \cup \{\alpha\}$ and $\bar{\mathcal{S}} = \{B, B \cup \{\alpha\} : B \in \mathcal{S}\}$. The
aim, as advertised earlier, is to manufacture a point α that the process hits with
probability 1 in the recurrent case.
\begin{align*}
\text{If } x \in S - A: \quad & \bar p(x,C) = p(x,C) \ \text{for } C \in \mathcal{S} \\
\text{If } x \in A: \quad & \bar p(x,\{\alpha\}) = \epsilon, \qquad \bar p(x,C) = p(x,C) - \epsilon\rho(C) \ \text{for } C \in \mathcal{S} \\
\text{If } x = \alpha: \quad & \bar p(\alpha,D) = \int \rho(dx)\,\bar p(x,D) \ \text{for } D \in \bar{\mathcal{S}}
\end{align*}
Intuitively, $\bar X_n = \alpha$ corresponds to $X_n$ being distributed on B according to ρ. Here,
and in what follows, we will reserve A and B for the special sets that occur in the
definition and use C and D for generic elements of $\mathcal{S}$. We will often simplify notation
by writing $\bar p(x,\alpha)$ instead of $\bar p(x,\{\alpha\})$, µ(α) instead of µ({α}), etc.
Our next step is to prove three technical lemmas that will help us develop the
theory below. Deﬁne a transition probability v by
\[ v(x,\{x\}) = 1 \ \text{if } x \in S \qquad v(\alpha, C) = \rho(C) \]
In words, v leaves mass in S alone but returns the mass at α to S and distributes it
according to ρ.
Lemma 5.8.1. $v\bar p = \bar p$ and $\bar pv = p$.
Proof. Before giving the proof, we would like to remind the reader that measures
multiply the transition probability on the left, i.e., in the first case we want to show
$\mu v\bar p = \mu\bar p$. If we first make a transition according to v and then one according to $\bar p$,
this amounts to one transition according to $\bar p$, since only mass at α is affected by v
and
\[ \bar p(\alpha, D) = \int \rho(dx)\,\bar p(x,D) \]
The second equality also follows easily from the definition. In words, if $\bar p$ acts first
and then v, then v returns the mass at α to where it came from.
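On a finite state space the construction of $\bar p$ and v can be carried out with matrices, which gives a concrete check of Lemma 5.8.1. The sketch below is illustrative only: the three-state matrix, the choice A = {0}, ρ = point mass at state 1, ε = 0.3, and the convention that α is the last index are assumptions.

```python
def mat_mul(a, b):
    """Product of matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_chain(p, in_A, rho, eps):
    """Return (p_bar, v) on S plus the extra point alpha (last index)."""
    s = len(p)
    p_bar = []
    for x in range(s):
        if in_A[x]:
            # x in A: send mass eps to alpha and remove eps * rho from p(x, .)
            row = [p[x][c] - eps * rho[c] for c in range(s)] + [eps]
        else:
            row = list(p[x]) + [0.0]
        p_bar.append(row)
    # p_bar(alpha, D) = sum_x rho(x) p_bar(x, D)
    p_bar.append([sum(rho[x] * p_bar[x][c] for x in range(s))
                  for c in range(s + 1)])
    # v fixes each x in S and spreads the mass at alpha according to rho
    v = [[1.0 if c == x else 0.0 for c in range(s + 1)] for x in range(s)]
    v.append(list(rho) + [0.0])
    return p_bar, v

p = [[0.2, 0.5, 0.3], [0.4, 0.1, 0.5], [0.3, 0.3, 0.4]]
in_A = [True, False, False]  # A = {0}
rho = [0.0, 1.0, 0.0]        # rho = point mass at 1, so B = {1}
eps = 0.3                    # requires p(0, 1) >= eps
p_bar, v = split_chain(p, in_A, rho, eps)
vp = mat_mul(v, p_bar)       # compare with p_bar
pv = mat_mul(p_bar, v)       # rows indexed by S: compare with p
```

In this finite setting `vp` reproduces `p_bar` entry by entry, and the rows of `pv` indexed by S reproduce `p` with no mass left at α.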
From Lemma 5.8.1, it follows easily that we have:
Lemma 5.8.2. Let $Y_n$ be an inhomogeneous Markov chain with $p_{2k} = v$ and $p_{2k+1} = \bar p$.
Then $\bar X_n = Y_{2n}$ is a Markov chain with transition probability $\bar p$ and $X_n = Y_{2n+1}$
is a Markov chain with transition probability p.
Lemma 5.8.2 shows that there is an intimate relationship between the asymptotic
behavior of $X_n$ and of $\bar X_n$. To quantify this, we need a definition. If f is a bounded
measurable function on S, let $\bar f = vf$, i.e., $\bar f(x) = f(x)$ for $x \in S$ and $\bar f(\alpha) = \int f\,d\rho$.
Lemma 5.8.3. If µ is a probability measure on $(S, \mathcal{S})$ then
\[ E_\mu f(X_n) = \bar E_\mu \bar f(\bar X_n) \]
Proof. Observe that if $X_n$ and $\bar X_n$ are constructed as in Lemma 5.8.2, and $P(\bar X_0 \in S) = 1$
then $X_0 = \bar X_0$ and $X_n$ is obtained from $\bar X_n$ by making a transition according
to v.
The last three lemmas will allow us to obtain results for $X_n$ from those for $\bar X_n$. We
turn now to the task of generalizing the results of Sections 5.4–5.6 to $\bar X_n$. To facilitate
comparison with the results for countable state space, we will break this section into
four subsections, the ﬁrst three of which correspond to Sections 5.4–5.6. In the fourth
subsection, we take an in depth look at the GI/G/1 queue. Before developing the
theory, we will give one last example that explains why some of the statements are
messy.
Example 5.8.4. Perverted O.U. process. Take the discrete O.U. process from
part (c) of Example 5.8.2 and modify the transition probability at the integers x ≥ 2
so that
\[ p(x,\{x+1\}) = 1 - x^{-2}, \qquad p(x,A) = x^{-2}|A| \ \text{for } A \subset (0,1) \]
p is the transition probability of a Harris chain, but
\[ P_2(X_n = n + 2 \ \text{for all } n) > 0 \]
I can sympathize with the reader who thinks that such chains will not arise "in
applications," but it seems easier (and better) to adapt the theory to include them
than to modify the assumptions to exclude them.

5.8.1 Recurrence and Transience

We begin with the dichotomy between recurrence and transience. Let
$R = \inf\{n \ge 1 : \bar X_n = \alpha\}$. If $P_\alpha(R < \infty) = 1$ then we call the chain recurrent, otherwise we
call it transient. Let $R_1 = R$ and for k ≥ 2, let $R_k = \inf\{n > R_{k-1} : \bar X_n = \alpha\}$ be
the time of the kth return to α. The strong Markov property implies
$P_\alpha(R_k < \infty) = P_\alpha(R < \infty)^k$, so $P_\alpha(\bar X_n = \alpha \text{ i.o.}) = 1$ in the recurrent case and = 0 in the transient
case. It is easy to generalize Theorem 5.4.2 to the current setting.
Exercise 5.8.1. $\bar X_n$ is recurrent if and only if $\sum_{n=1}^{\infty} \bar p^n(\alpha,\alpha) = \infty$.
The next result generalizes Lemma 5.4.3.
Theorem 5.8.4. Let $\lambda(C) = \sum_{n=1}^{\infty} 2^{-n}\bar p^n(\alpha, C)$. In the recurrent case, if λ(C) > 0
then $P_\alpha(\bar X_n \in C \text{ i.o.}) = 1$. For λ-a.e. x, $P_x(R < \infty) = 1$.
Proof. The first conclusion follows from Lemma 5.3.3. For the second let
$D = \{x : P_x(R < \infty) < 1\}$ and observe that if $\bar p^n(\alpha, D) > 0$ for some n, then
\[ P_\alpha(\bar X_m = \alpha \ \text{i.o.}) \le \int \bar p^n(\alpha, dx)\,P_x(R < \infty) < 1 \]
Remark. Example 5.8.4 shows that we cannot expect to have $P_x(R < \infty) = 1$ for
all x. To see that even when the state space is countable, we need not hit every point
starting from α, do
Exercise 5.8.2. If Xn is a recurrent Harris chain on a countable state space, then S
can only have one irreducible set of recurrent states but may have a nonempty set of
transient states. For a concrete example, consider a branching process in which the
probability of no children p0 > 0 and set A = B = {0}.
Exercise 5.8.3. Suppose $X_n$ is a recurrent Harris chain. Show that if $(A', B')$ is
another pair satisfying the conditions of the definition, then Theorem 5.8.4 implies
$P_\alpha(\bar X_n \in A' \text{ i.o.}) = 1$, so the recurrence or transience does not depend on the choice
of (A, B).
As in Section 5.4, we need special methods to determine whether an example is
recurrent or transient.
Exercise 5.8.4. In the GI/G/1 queue, the waiting time Wn and the random walk
Sn = X0 + ξ1 + · · · + ξn agree until N = inf {n : Sn < 0}, and at this time WN = 0. Use
this observation as we did in Example 5.4.7 to show that Example 5.8.3 is recurrent
when Eξn ≤ 0 and transient when Eξn > 0.
Exercise 5.8.5. Let $V_n$ be a simple smooth armap with $E|\xi_i| < \infty$. Show that if
θ < 1 then $E_x|V_1| \le |x|$ for $|x| \ge M$. Use this and ideas from the proof of Theorem
5.4.8 to show that the chain is recurrent in this case.
Exercise 5.8.6. Let $V_n$ be an armap (not necessarily smooth or simple) and suppose
θ > 1. Let γ ∈ (1, θ) and observe that if x > 0 then $P_x(V_1 < \gamma x) \le C/((\theta - \gamma)x)$, so
if x is large, $P_x(V_n \ge \gamma^n x \text{ for all } n) > 0$.
Remark. In the case θ = 1 the chain Vn discussed in the last two exercises is a
random walk with mean 0 and hence recurrent.
Exercise 5.8.7. In the discrete O.U. process, $X_{n+1}$ is normal with mean $\theta X_n$ and variance 1. What happens to the recurrence and transience if instead $Y_{n+1}$ is normal with mean 0 and variance $\beta^2 |Y_n|$?
5.8.2 Stationary Measures
Theorem 5.8.5. In the recurrent case, there is a stationary measure.
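Proof. Let $R = \inf\{n \ge 1 : \bar X_n = \alpha\}$, and let
\[ \bar\mu(C) = E_\alpha\left( \sum_{n=0}^{R-1} 1_{\{\bar X_n \in C\}} \right) = \sum_{n=0}^{\infty} P_\alpha(\bar X_n \in C, R > n) \]
Repeating the proof of Theorem 5.5.2 shows that $\bar\mu \bar p = \bar\mu$. If we let $\mu = \bar\mu v$ then it follows from Lemma 5.8.1 that $\bar\mu v p = \bar\mu \bar p v = \bar\mu v$, so $\mu p = \mu$.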
Exercise 5.8.8. Let $G_{k,\delta} = \{x : \bar p^k(x, \alpha) \ge \delta\}$. Show that $\bar\mu(G_{k,\delta}) \le 2k/\delta$ and use this to conclude that $\bar\mu$ and hence $\mu$ is $\sigma$-finite.
Exercise 5.8.9. Let $\lambda$ be the measure defined in Theorem 5.8.5. Show that $\bar\mu \ll \lambda$ and $\lambda \ll \bar\mu$.
Exercise 5.8.10. Let $V_n$ be an armap (not necessarily smooth or simple) with $\theta < 1$ and $E \log^+ |\xi_n| < \infty$. Show that $\sum_{m \ge 0} \theta^m \xi_m$ converges a.s. and defines a stationary distribution for $V_n$.
Exercise 5.8.11. In the GI/G/1 queue, the waiting time Wn and the random walk
Sn = X0 + ξ1 + · · · + ξn agree until N = inf {n : Sn < 0}, and at this time WN = 0.
Use this observation as we did in Example 5.5.6 to show that if Eξn < 0, EN < ∞
and hence there is a stationary distribution.
To investigate uniqueness of the stationary measure, we begin with:
Lemma 5.8.6. If $\nu$ is a $\sigma$-finite stationary measure for $p$, then $\nu(A) < \infty$ and $\bar\nu = \nu \bar p$ is a stationary measure for $\bar p$ with $\bar\nu(\alpha) < \infty$.
Proof. We will first show that $\nu(A) < \infty$. If $\nu(A) = \infty$ then part (ii) of the definition implies $\nu(C) = \infty$ for all sets $C$ with $\rho(C) > 0$. If $B = \cup_i B_i$ with $\nu(B_i) < \infty$ then $\rho(B_i) = 0$ by the last observation and $\rho(B) = 0$ by countable subadditivity, a contradiction. So $\nu(A) < \infty$ and $\bar\nu(\alpha) = \nu \bar p(\alpha) = \epsilon\nu(A) < \infty$. Using the fact that $\nu p = \nu$, we find
\[ \nu \bar p(C) = \nu(C) - \epsilon\nu(A)\rho(B \cap C) \]
the last subtraction being well-defined since $\nu(A) < \infty$, and it follows that $\bar\nu v = \nu$. To check $\bar\nu \bar p = \bar\nu$, we observe that Lemma 5.8.1 and the last result imply $\bar\nu \bar p = \bar\nu v \bar p = \nu \bar p = \bar\nu$.
Theorem 5.8.7. Suppose $p$ is recurrent. If $\nu$ is a $\sigma$-finite stationary measure then $\nu = \bar\nu(\alpha)\mu$, where $\mu$ is the measure constructed in the proof of Theorem 5.8.5.
Proof. By Lemma 5.8.6, it suffices to prove that if $\bar\nu$ is a stationary measure for $\bar p$ with $\bar\nu(\alpha) < \infty$ then $\bar\nu = \bar\nu(\alpha)\bar\mu$. Repeating the proof of Theorem 5.5.3 with $a = \alpha$, it is easy to show that $\bar\nu(C) \ge \bar\nu(\alpha)\bar\mu(C)$. Continuing to compute as in that proof:
\[ \bar\nu(\alpha) = \int \bar\nu(dx)\, \bar p^n(x, \alpha) \ge \bar\nu(\alpha) \int \bar\mu(dx)\, \bar p^n(x, \alpha) = \bar\nu(\alpha)\bar\mu(\alpha) = \bar\nu(\alpha) \]
Let $S_n = \{x : p^n(x, \alpha) > 0\}$. By assumption, $\cup_n S_n = S$. If $\bar\nu(D) > \bar\nu(\alpha)\bar\mu(D)$ for some $D$, then $\bar\nu(D \cap S_n) > \bar\nu(\alpha)\bar\mu(D \cap S_n)$ for some $n$, and it follows that $\bar\nu(\alpha) > \bar\nu(\alpha)$, a contradiction.
5.8.3 Convergence Theorem
We say that a recurrent Harris chain $X_n$ is aperiodic if g.c.d. $\{n \ge 1 : p^n(\alpha, \alpha) > 0\} = 1$. This occurs, for example, if we can take $A = B$ in the definition, for then $p(\alpha, \alpha) > 0$.
Theorem 5.8.8. Let $X_n$ be an aperiodic recurrent Harris chain with stationary distribution $\pi$. If $P_x(R < \infty) = 1$ then as $n \to \infty$,
\[ \| p^n(x, \cdot) - \pi(\cdot) \| \to 0 \]
Note. Here $\| \cdot \|$ denotes the total variation distance between the measures. Lemma 5.8.4 guarantees that $\pi$ a.e. $x$ satisfies the hypothesis.
Proof. In view of Lemma 5.8.3, it suffices to prove the result for $\bar p$. We begin by observing that the existence of a stationary probability measure and the uniqueness result in Theorem 5.8.7 imply that the measure constructed in Theorem 5.8.5 has $E_\alpha R = \bar\mu(S) < \infty$. As in the proof of Theorem 5.6.4, we let $X_n$ and $Y_n$ be independent copies of the chain with initial distributions $\delta_x$ and $\pi$, respectively, and let $\tau = \inf\{n \ge 0 : X_n = Y_n = \alpha\}$. For $m \ge 0$, let $S_m$ (resp. $T_m$) be the times at which $X_n$ (resp. $Y_n$) visit $\alpha$ for the $(m+1)$th time. $S_m - T_m$ is a random walk with mean 0 steps, so $M = \inf\{m \ge 1 : S_m = T_m\} < \infty$ a.s., and it follows that this is true for $\tau$ as well. The computations in the proof of Theorem 5.6.4 show $|P(X_n \in C) - P(Y_n \in C)| \le P(\tau > n)$. Since this is true for all $C$, $\| p^n(x, \cdot) - \pi(\cdot) \| \le P(\tau > n)$, and the proof is complete.
Exercise 5.8.12. Use Exercise 5.8.1 and imitate the proof of Theorem 5.5.4 to show
that a Harris chain with a stationary distribution must be recurrent.
Exercise 5.8.13. Show that an armap with $\theta < 1$ and $E \log^+ |\xi_n| < \infty$ converges in distribution as $n \to \infty$. Hint: Recall the construction of $\pi$ in Exercise 5.8.10.
5.8.4 GI/G/1 queue
For the rest of the section, we will concentrate on the GI/G/1 queue. Let $\xi_1, \xi_2, \ldots$ be i.i.d., let $W_n = (W_{n-1} + \xi_n)^+$, and let $S_n = \xi_1 + \cdots + \xi_n$. Recall $\xi_n = \eta_{n-1} - \zeta_n$, where the $\eta$'s are service times, $\zeta$'s are the interarrival times, and suppose $E\xi_n < 0$ so that Exercise 5.8.11 implies there is a stationary distribution.
Exercise 5.8.14. Let $m_n = \min(S_0, S_1, \ldots, S_n)$, where $S_n$ is the random walk defined above. (i) Show that $S_n - m_n =_d W_n$. (ii) Let $\xi'_m = \xi_{n+1-m}$ for $1 \le m \le n$ and $S'_k = \xi'_1 + \cdots + \xi'_k$. Show that $S_n - m_n = \max(S'_0, S'_1, \ldots, S'_n)$. (iii) Conclude that as $n \to \infty$ we have $W_n \Rightarrow M \equiv \max(S_0, S_1, S_2, \ldots)$.
Explicit formulas for the distribution of M are in general diﬃcult to obtain. However, this can be done if either the arrival or service distribution is exponential. One
reason for this is:
Exercise 5.8.15. Suppose $X, Y \ge 0$ are independent and $P(X > x) = e^{-\lambda x}$. Show that $P(X - Y > x) = a e^{-\lambda x}$, where $a = P(X - Y > 0)$.
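The identity in Exercise 5.8.15 can be checked numerically before it is proved. The following is a minimal Monte Carlo sketch, not a proof: the choice of $Y$ uniform on $(0, 2)$, the rate $\lambda = 1.5$, the sample size, and the function name `tail_of_difference` are all assumptions made for illustration.

```python
import math
import random

def tail_of_difference(lam, sample_y, x, n=200_000, seed=0):
    """Monte Carlo estimate of P(X - Y > x), X ~ Exp(lam) independent of Y >= 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        if rng.expovariate(lam) - sample_y(rng) > x:
            hits += 1
    return hits / n

lam = 1.5                                      # assumed rate of X
sample_y = lambda rng: rng.uniform(0.0, 2.0)   # assumed law of Y (any Y >= 0 works)

a = tail_of_difference(lam, sample_y, 0.0)     # estimate of a = P(X - Y > 0)
for x in (0.5, 1.0):
    estimate = tail_of_difference(lam, sample_y, x)
    predicted = a * math.exp(-lam * x)         # a * e^{-lam x} from the exercise
    print(x, round(estimate, 4), round(predicted, 4))
```

The two printed columns should agree up to sampling error; for this particular $Y$ the constant is $a = E e^{-\lambda Y} = (1 - e^{-2\lambda})/(2\lambda)$.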
Example 5.8.5. Exponential service time. Suppose $P(\eta_n > x) = e^{-\beta x}$ and $E\zeta_n > E\eta_n$. Let $T = \inf\{n : S_n > 0\}$ and $L = S_T$, setting $L = -\infty$ if $T = \infty$. The lack of memory property of the exponential distribution implies that $P(L > x) = r e^{-\beta x}$, where $r = P(T < \infty)$. To compute the distribution of the maximum, $M$, let $T_1 = T$ and let $T_k = \inf\{n > T_{k-1} : S_n > S_{T_{k-1}}\}$ for $k \ge 2$. (1.3) in Chapter 3 implies that if $T_k < \infty$ then $S(T_{k+1}) - S(T_k) =_d L$ and is independent of $S(T_k)$. Using this and breaking things down according to the value of $K = \inf\{k : L_{k+1} = -\infty\}$, we see that for $x > 0$ the density function
\[ P(M = x) = \sum_{k=1}^{\infty} r^k (1 - r) e^{-\beta x} \beta^k x^{k-1}/(k-1)! = \beta r (1 - r) e^{-\beta x (1 - r)} \]
To complete the calculation, we need to calculate $r$. To do this, let
\[ \phi(\theta) = E \exp(\theta \xi_n) = E \exp(\theta \eta_{n-1})\, E \exp(-\theta \zeta_n) \]
which is finite for $0 < \theta < \beta$ since $\zeta_n \ge 0$ and $\eta_{n-1}$ has an exponential distribution. It is easy to see that
\[ \phi'(0) = E\xi_n < 0 \qquad \lim_{\theta \uparrow \beta} \phi(\theta) = \infty \]
so there is a $\theta \in (0, \beta)$ with $\phi(\theta) = 1$. Exercise 4.7.4 implies $\exp(\theta S_n)$ is a martingale. Theorem 4.4.1 implies $1 = E \exp(\theta S_{T \wedge n})$. Letting $n \to \infty$ and noting that $(S_n | T = n)$ has an exponential distribution and $S_n \to -\infty$ on $\{T = \infty\}$, we have
\[ 1 = r \int_0^{\infty} e^{\theta x} \beta e^{-\beta x}\, dx = \frac{r\beta}{\beta - \theta} \]
Example 5.8.6. Poisson arrivals. Suppose $P(\zeta_n > x) = e^{-\alpha x}$ and $E\zeta_n > E\eta_n$.
Let $\bar S_n = -S_n$. Reversing time as in (ii) of Exercise 5.8.14, we see (for $n \ge 1$)
\[ P\left( \max_{0 \le k < n} \bar S_k < \bar S_n \in A \right) = P\left( \min_{1 \le k \le n} \bar S_k > 0, \bar S_n \in A \right) \]
Let $\psi_n(A)$ be the common value of the last two expressions and let $\psi(A) = \sum_{n \ge 0} \psi_n(A)$. $\psi_n(A)$ is the probability the random walk reaches a new maximum (or ladder height, see Example 3.1.4) in $A$ at time $n$, so $\psi(A)$ is the expected number of ladder points in $A$ with $\psi(\{0\}) = 1$. Letting the random walk take one more step
\[ P\left( \min_{1 \le k \le n} \bar S_k > 0, \bar S_{n+1} \le x \right) = \int F(x - z)\, d\psi_n(z) \]
The last identity is valid for $n = 0$ if we interpret the left-hand side as $F(x)$. Let $\tau = \inf\{n \ge 1 : \bar S_n \le 0\}$ and $x \le 0$. Integrating by parts on the right-hand side and then summing over $n \ge 0$ gives
\[ P(\bar S_\tau \le x) = \sum_{n=0}^{\infty} P\left( \min_{1 \le k \le n} \bar S_k > 0, \bar S_{n+1} \le x \right) = \int_{y \le x} \psi[0, x - y]\, dF(y) \qquad (5.8.1) \]
The limit $y \le x$ comes from the fact that $\psi((-\infty, 0)) = 0$.
Let $\bar\xi_n = \bar S_n - \bar S_{n-1} = -\xi_n$. Exercise 5.8.15 implies $P(\bar\xi_n > x) = a e^{-\alpha x}$. Let $\bar T = \inf\{n : \bar S_n > 0\}$. $E\bar\xi_n > 0$ so $P(\bar T < \infty) = 1$. Let $J = \bar S_{\bar T}$. As in the previous example, $P(J > x) = e^{-\alpha x}$. Let $V_n = J_1 + \cdots + J_n$. $V_n$ is a rate $\alpha$ Poisson process, so $\psi[0, x - y] = 1 + \alpha(x - y)$ for $x - y \ge 0$. Using (5.8.1) now and integrating by parts gives
\[ P(\bar S_\tau \le x) = \int_{y \le x} (1 + \alpha(x - y))\, dF(y) = F(x) + \alpha \int_{-\infty}^{x} F(y)\, dy \quad \text{for } x \le 0 \qquad (5.8.2) \]
Since $P(\bar S_n = 0) = 0$ for $n \ge 1$, $-\bar S_\tau$ has the same distribution as $S_T$, where $T = \inf\{n : S_n > 0\}$. Combining this with part (ii) of Exercise 5.8.14 gives a “formula” for $P(M > x)$. Straightforward but somewhat tedious calculations show that if $B(s) = E \exp(-s\eta_n)$, then
\[ E \exp(-sM) = \frac{(1 - \alpha \cdot E\eta) s}{s - \alpha + \alpha B(s)} \]
a result known as the Pollaczek-Khintchine formula. The computations we omitted can be found in Billingsley (1979) on p. 277 or several times in Feller, Vol. II (1971).
Chapter 6 Ergodic Theorems
$X_n$, $n \ge 0$, is said to be a stationary sequence if for each $k \ge 1$ it has the same distribution as the shifted sequence $X_{n+k}$, $n \ge 0$. The basic fact about these sequences, called the ergodic theorem, is that if $E|f(X_0)| < \infty$ then
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} f(X_m) \quad \text{exists a.s.} \]
If $X_n$ is ergodic (a generalization of the notion of irreducibility for Markov chains) then the limit is $Ef(X_0)$. Sections 6.1 and 6.2 develop the theory needed to prove the ergodic theorem. In Section 6.3, we apply the ergodic theorem to study the recurrence of random walks with increments that are stationary sequences, finding remarkable generalizations of the i.i.d. case. In Section 6.4, we prove a subadditive ergodic theorem, which is illustrated by the examples in Sections 6.4 and 6.5.
6.1 Definitions and Examples
$X_0, X_1, \ldots$ is said to be a stationary sequence if for every $k$, the shifted sequence $\{X_{k+n}, n \ge 0\}$ has the same distribution, i.e., for each $m$, $(X_0, \ldots, X_m)$ and $(X_k, \ldots, X_{k+m})$ have the same distribution. We begin by giving four examples that will be our constant companions.
Example 6.1.1. X0 , X1 , . . . are i.i.d.
Example 6.1.2. Let $X_n$ be a Markov chain with transition probability $p(x, A)$ and stationary distribution $\pi$, i.e., $\pi(A) = \int \pi(dx)\, p(x, A)$. If $X_0$ has distribution $\pi$ then $X_0, X_1, \ldots$ is a stationary sequence. A special case to keep in mind for counterexamples is the chain with state space $S = \{0, 1\}$ and transition probability $p(x, \{1 - x\}) = 1$. In this case, the stationary distribution has $\pi(0) = \pi(1) = 1/2$ and $(X_0, X_1, \ldots) = (0, 1, 0, 1, \ldots)$ or $(1, 0, 1, 0, \ldots)$ with probability 1/2 each.
Example 6.1.3. Rotation of the circle. Let Ω = [0, 1), F = Borel subsets, P =
Lebesgue measure. Let θ ∈ (0, 1), and for n ≥ 0, let Xn (ω ) = (ω + nθ) mod 1, where
x mod 1 = x − [x], [x] being the greatest integer ≤ x. To see the reason for the name,
map [0, 1) into C by x → exp(2πix). This example is a special case of the last one.
Let p(x, {y }) = 1 if y = (x + θ) mod 1.
To make new examples from old, we can use:
Theorem 6.1.1. If $X_0, X_1, \ldots$ is a stationary sequence and $g : \mathbf{R}^{\{0,1,\ldots\}} \to \mathbf{R}$ is measurable then $Y_k = g(X_k, X_{k+1}, \ldots)$ is a stationary sequence.
Proof. If $x \in \mathbf{R}^{\{0,1,\ldots\}}$, let $g_k(x) = g(x_k, x_{k+1}, \ldots)$, and if $B \in \mathcal{R}^{\{0,1,\ldots\}}$ let
\[ A = \{x : (g_0(x), g_1(x), \ldots) \in B\} \]
To check stationarity now, we observe:
P (ω : (Y0 , Y1 , . . .) ∈ B ) = P (ω : (X0 , X1 , . . .) ∈ A)
= P (ω : (Xk , Xk+1 , . . .) ∈ A)
= P (ω : (Yk , Yk+1 , . . .) ∈ B )
which proves the desired result.
Example 6.1.4. Bernoulli shift. $\Omega = [0, 1)$, $\mathcal{F}$ = Borel subsets, $P$ = Lebesgue measure. $Y_0(\omega) = \omega$ and for $n \ge 1$, let $Y_n(\omega) = (2 Y_{n-1}(\omega)) \bmod 1$. This example is a special case of Theorem 6.1.1. Let $X_0, X_1, \ldots$ be i.i.d. with $P(X_i = 0) = P(X_i = 1) = 1/2$, and let $g(x) = \sum_{i=0}^{\infty} x_i 2^{-(i+1)}$. The name comes from the fact that multiplying by 2 shifts the $X$'s to the left. This example is also a special case of Example 6.1.2. Let $p(x, \{y\}) = 1$ if $y = (2x) \bmod 1$.
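The remark that multiplying by 2 shifts the digits to the left can be seen concretely with exact rational arithmetic. The sketch below is illustrative only: it works with a finite digit string (an assumption; the example itself uses infinite i.i.d. sequences), and the names `g` and `doubling_map` are chosen here for convenience.

```python
from fractions import Fraction

def g(bits):
    """Map a finite 0/1 string (x_0, x_1, ...) to sum_i x_i 2^{-(i+1)}."""
    return sum((Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits)), Fraction(0))

def doubling_map(y):
    """The map y -> (2y) mod 1 on [0, 1)."""
    return (2 * y) % 1

bits = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical digit string
y = g(bits)
for k in range(len(bits)):
    # After k applications of the doubling map, the first k digits are gone.
    assert y == g(bits[k:])
    y = doubling_map(y)
print(y)   # all digits have been shifted out
```

Using `Fraction` avoids floating-point rounding, so the equality between the iterated map and the shifted digit string is exact.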
Examples 6.1.3 and 6.1.4 are special cases of the following situation.
Example 6.1.5. Let $(\Omega, \mathcal{F}, P)$ be a probability space. A measurable map $\varphi : \Omega \to \Omega$ is said to be measure preserving if $P(\varphi^{-1} A) = P(A)$ for all $A \in \mathcal{F}$. Let $\varphi^n$ be the $n$th iterate of $\varphi$ defined inductively by $\varphi^n = \varphi(\varphi^{n-1})$ for $n \ge 1$, where $\varphi^0(\omega) = \omega$. We claim that if $X \in \mathcal{F}$, then $X_n(\omega) = X(\varphi^n \omega)$ defines a stationary sequence. To check this, let $B \in \mathcal{R}^{n+1}$ and $A = \{\omega : (X_0(\omega), \ldots, X_n(\omega)) \in B\}$. Then
\[ P((X_k, \ldots, X_{k+n}) \in B) = P(\varphi^k \omega \in A) = P(\omega \in A) = P((X_0, \ldots, X_n) \in B) \]
The last example is more than an important example. In fact, it is the only example!
If $Y_0, Y_1, \ldots$ is a stationary sequence taking values in a nice space, Kolmogorov’s extension theorem, Theorem A.7.1, allows us to construct a measure $P$ on sequence space $(S^{\{0,1,\ldots\}}, \mathcal{S}^{\{0,1,\ldots\}})$, so that the sequence $X_n(\omega) = \omega_n$ has the same distribution as that of $\{Y_n, n \ge 0\}$. If we let $\varphi$ be the shift operator, i.e., $\varphi(\omega_0, \omega_1, \ldots) = (\omega_1, \omega_2, \ldots)$, and let $X(\omega) = \omega_0$, then $\varphi$ is measure preserving and $X_n(\omega) = X(\varphi^n \omega)$.
In some situations, e.g., in the proof of Theorem 6.3.3 below, it is useful to observe:
Theorem 6.1.2. Any stationary sequence $\{X_n, n \ge 0\}$ can be embedded in a two-sided stationary sequence $\{Y_n : n \in \mathbf{Z}\}$.
Proof. We observe that
\[ P(Y_{-m} \in A_0, \ldots, Y_n \in A_{m+n}) = P(X_0 \in A_0, \ldots, X_{m+n} \in A_{m+n}) \]
is a consistent set of finite dimensional distributions, so a trivial generalization of the Kolmogorov extension theorem implies there is a measure $P$ on $(S^{\mathbf{Z}}, \mathcal{S}^{\mathbf{Z}})$ so that the variables $Y_n(\omega) = \omega_n$ have the desired distributions.
In view of the observations above, it suffices to give our definitions and prove our results in the setting of Example 6.1.5. Thus, our basic set up consists of
$(\Omega, \mathcal{F}, P)$                      a probability space
$\varphi$                                       a map that preserves $P$
$X_n(\omega) = X(\varphi^n \omega)$             where $X$ is a random variable
We will now give some important definitions. Here and in what follows we assume $\varphi$ is measure-preserving. A set $A \in \mathcal{F}$ is said to be invariant if $\varphi^{-1} A = A$. (Here, as usual, two sets are considered to be equal if their symmetric difference has probability 0.) Some authors call $A$ almost invariant if $P(A \Delta \varphi^{-1}(A)) = 0$. We call such sets invariant and call $B$ invariant in the strict sense if $B = \varphi^{-1}(B)$.
Exercise 6.1.1. Show that the class of invariant events $\mathcal{I}$ is a $\sigma$-field, and $X \in \mathcal{I}$ if and only if $X$ is invariant, i.e., $X \circ \varphi = X$ a.s.
Exercise 6.1.2. (i) Let $A$ be any set, let $B = \cup_{n=0}^{\infty} \varphi^{-n}(A)$. Show $\varphi^{-1}(B) \subset B$. (ii) Let $B$ be any set with $\varphi^{-1}(B) \subset B$ and let $C = \cap_{n=0}^{\infty} \varphi^{-n}(B)$. Show that $\varphi^{-1}(C) = C$. (iii) Show that $A$ is almost invariant if and only if there is a $C$ invariant in the strict sense with $P(A \Delta C) = 0$.
A measurepreserving transformation on (Ω, F , P ) is said to be ergodic if I is
trivial, i.e., for every A ∈ I , P (A) ∈ {0, 1}. If ϕ is not ergodic then the space can
be split into two sets A and Ac , each having positive measure so that ϕ(A) = A and
ϕ(Ac ) = Ac . In words, ϕ is not “irreducible.”
To investigate further the meaning of ergodicity, we return to our examples, renumbering them because the new focus is on checking ergodicity.
Example 6.1.6. i.i.d. sequence. We begin by observing that if $\Omega = \mathbf{R}^{\{0,1,\ldots\}}$ and $\varphi$ is the shift operator, then an invariant set $A$ has $\{\omega : \omega \in A\} = \{\omega : \varphi\omega \in A\} \in \sigma(X_1, X_2, \ldots)$. Iterating gives
\[ A \in \bigcap_{n=1}^{\infty} \sigma(X_n, X_{n+1}, \ldots) = \mathcal{T}, \quad \text{the tail } \sigma\text{-field} \]
so $\mathcal{I} \subset \mathcal{T}$. For an i.i.d. sequence, Kolmogorov’s 0-1 law implies $\mathcal{T}$ is trivial, so $\mathcal{I}$ is trivial and the sequence is ergodic (i.e., when the corresponding measure is put on sequence space $\Omega = \mathbf{R}^{\{0,1,2,\ldots\}}$ the shift is).
Example 6.1.7. Markov chains. Suppose the state space S is countable and
the stationary distribution has π (x) > 0 for all x ∈ S . By Theorems 5.5.4 and
5.4.5, all states are recurrent, and we can write S = ∪Ri , where the Ri are disjoint
irreducible closed sets. If X0 ∈ Ri then with probability one, Xn ∈ Ri for all n ≥ 1
so {ω : X0 (ω ) ∈ Ri } ∈ I . The last observation shows that if the Markov chain
is not irreducible then the sequence is not ergodic. To prove the converse, observe
that if A ∈ I , 1A ◦ θn = 1A where θn (ω0 , ω1 , . . .) = (ωn , ωn+1 , . . .). So if we let
Fn = σ (X0 , . . . , Xn ), the shift invariance of 1A and the Markov property imply
\[ E_\pi(1_A | \mathcal{F}_n) = E_\pi(1_A \circ \theta_n | \mathcal{F}_n) = h(X_n) \]
where $h(x) = E_x 1_A$. Lévy’s 0-1 law implies that the left-hand side converges to $1_A$ as $n \to \infty$. If $X_n$ is irreducible and recurrent then for any $y \in S$, the right-hand side $= h(y)$ i.o., so either $h(x) \equiv 0$ or $h(x) \equiv 1$, and $P_\pi(A) \in \{0, 1\}$. This
example also shows that I and T may be diﬀerent. When the transition probability
p is irreducible I is trivial, but if all the states have period d > 1, T is not. In
Theorem 5.7.3, we showed that if S0 , . . . , Sd−1 is the cyclic decomposition of S , then
$\mathcal{T} = \sigma(\{X_0 \in S_r\} : 0 \le r < d)$.
Example 6.1.8. Rotation of the circle is not ergodic if $\theta = m/n$ where $m < n$ are positive integers. If $B$ is a Borel subset of $[0, 1/n)$ and
\[ A = \bigcup_{k=0}^{n-1} (B + k/n) \]
then $A$ is invariant. Conversely, if $\theta$ is irrational, then $\varphi$ is ergodic. To prove this, we need a fact from Fourier analysis. If $f$ is a measurable function on $[0, 1)$ with $\int f^2(x)\, dx < \infty$, then $f$ can be written as $f(x) = \sum_k c_k e^{2\pi i k x}$ where the equality is in the sense that as $K \to \infty$
\[ \sum_{k=-K}^{K} c_k e^{2\pi i k x} \to f(x) \quad \text{in } L^2[0, 1) \]
and this is possible for only one choice of the coefficients $c_k = \int f(x) e^{-2\pi i k x}\, dx$. Now
\[ f(\varphi(x)) = \sum_k c_k e^{2\pi i k (x + \theta)} = \sum_k (c_k e^{2\pi i k \theta}) e^{2\pi i k x} \]
The uniqueness of the coefficients $c_k$ implies that $f(\varphi(x)) = f(x)$ if and only if $c_k (e^{2\pi i k \theta} - 1) = 0$ for all $k$. If $\theta$ is irrational, this implies $c_k = 0$ for $k \ne 0$, so $f$ is constant. Applying the last result to $f = 1_A$ with $A \in \mathcal{I}$ shows that $A = \emptyset$ or $[0, 1)$ a.s.
Exercise 6.1.3. A direct proof of ergodicity. (i) Show that if $\theta$ is irrational, $x_n = n\theta \bmod 1$ is dense in $[0, 1)$. Hint: All the $x_n$ are distinct, so for any $N < \infty$, $|x_n - x_m| \le 1/N$ for some $m < n \le N$. (ii) Use Exercise A.3.1 to show that if $A$ is a Borel set with $|A| > 0$, then for any $\delta > 0$ there is an interval $J = [a, b)$ so that $|A \cap J| > (1 - \delta)|J|$. (iii) Combine this with (i) to conclude $P(A) = 1$.
Example 6.1.9. Bernoulli shift is ergodic. To prove this, we recall that the stationary sequence $Y_n(\omega) = \varphi^n(\omega)$ can be represented as
\[ Y_n = \sum_{m=0}^{\infty} 2^{-(m+1)} X_{n+m} \]
where $X_0, X_1, \ldots$ are i.i.d. with $P(X_k = 1) = P(X_k = 0) = 1/2$, and use the following fact:
Theorem 6.1.3. Let $g : \mathbf{R}^{\{0,1,\ldots\}} \to \mathbf{R}$ be measurable. If $X_0, X_1, \ldots$ is an ergodic stationary sequence, then $Y_k = g(X_k, X_{k+1}, \ldots)$ is ergodic.
Proof. Suppose $X_0, X_1, \ldots$ is defined on sequence space with $X_n(\omega) = \omega_n$. If $B$ has $\{\omega : (Y_0, Y_1, \ldots) \in B\} = \{\omega : (Y_1, Y_2, \ldots) \in B\}$ then $A = \{\omega : (Y_0, Y_1, \ldots) \in B\}$ is shift invariant, so $P(A) \in \{0, 1\}$ by the ergodicity of the $X$ sequence.
Exercise 6.1.4. Use Fourier analysis as in Example 6.1.8 to prove that Example 6.1.4 is ergodic.
Exercises
6.1.5. Continued fractions. Let $\varphi(x) = 1/x - [1/x]$ for $x \in (0, 1)$ and $A(x) = [1/x]$, where $[1/x]$ = the largest integer $\le 1/x$. $a_n = A(\varphi^n x)$, $n = 0, 1, 2, \ldots$ gives the continued fraction representation of $x$, i.e.,
\[ x = 1/(a_0 + 1/(a_1 + 1/(a_2 + 1/\ldots))) \]
Show that $\varphi$ preserves $\mu(A) = \frac{1}{\log 2} \int_A \frac{dx}{1 + x}$ for $A \subset (0, 1)$.
Remark. In his (1959) monograph, Kac claimed that it was “entirely trivial” to check that $\varphi$ is ergodic but retracted his claim in a later footnote. We leave it to the reader to construct a proof or look up the answer in Ryll-Nardzewski (1951). Chapter 9 of Lévy (1937) is devoted to this topic and is still interesting reading today.
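The invariance claimed in the continued fraction exercise can be checked numerically on intervals $(0, t)$. The sketch below rests on the observation (stated here as the working assumption of the check) that $\varphi(x) \in (0, t)$ exactly when $x \in (1/(k+t), 1/k)$ for some integer $k \ge 1$; the function names and the truncation level are illustrative choices.

```python
import math

def gauss_measure(a, b):
    """mu([a, b)) = (1/log 2) * integral from a to b of dx/(1+x)."""
    return math.log((1 + b) / (1 + a)) / math.log(2)

def preimage_measure(t, terms=100_000):
    """mu of phi^{-1}((0, t)), truncated to the first `terms` branches.

    phi(x) = 1/x - [1/x] maps each interval (1/(k+t), 1/k) onto (0, t).
    """
    return sum(gauss_measure(1 / (k + t), 1 / k) for k in range(1, terms + 1))

for t in (0.1, 0.5, 0.9):
    print(t, preimage_measure(t), gauss_measure(0.0, t))
```

The two printed values agree up to the truncation error of the series, which is of order $t/\text{terms}$. Since intervals $(0, t)$ generate the Borel sets, agreement on them is what the exercise asks one to prove in general.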
6.1.6. Independent blocks. Let $X_1, X_2, \ldots$ be a stationary sequence. Let $n < \infty$ and let $Y_1, Y_2, \ldots$ be a sequence so that $(Y_{nk+1}, \ldots, Y_{n(k+1)})$, $k \ge 0$ are i.i.d. and $(Y_1, \ldots, Y_n) = (X_1, \ldots, X_n)$. Finally, let $\nu$ be uniformly distributed on $\{1, 2, \ldots, n\}$, independent of $Y$, and let $Z_m = Y_{\nu + m}$ for $m \ge 1$. Show that $Z$ is stationary and ergodic.
6.2 Birkhoff’s Ergodic Theorem
Throughout this section, $\varphi$ is a measure-preserving transformation on $(\Omega, \mathcal{F}, P)$. See Example 6.1.5 for details. We begin by proving a result that is usually referred to as:
Theorem 6.2.1. The ergodic theorem. For any $X \in L^1$,
\[ \frac{1}{n} \sum_{m=0}^{n-1} X(\varphi^m \omega) \to E(X | \mathcal{I}) \quad \text{a.s. and in } L^1 \]
This result due to Birkhoff (1931) is sometimes called the pointwise or individual ergodic theorem because of the a.s. convergence in the conclusion. When the sequence is ergodic, the limit is the mean $EX$. In this case, if we take $X = 1_A$, it follows that the asymptotic fraction of time $\varphi^m \omega \in A$ is $P(A)$.
The proof we give is based on an odd integration inequality due to Yosida and Kakutani (1939). We follow Garsia (1965). The proof is not intuitive, but none of the steps are difficult.
Lemma 6.2.2. Maximal ergodic lemma. Let $X_j(\omega) = X(\varphi^j \omega)$, $S_k(\omega) = X_0(\omega) + \ldots + X_{k-1}(\omega)$, and $M_k(\omega) = \max(0, S_1(\omega), \ldots, S_k(\omega))$. Then $E(X; M_k > 0) \ge 0$.
Proof. If $j \le k$ then $M_k(\varphi\omega) \ge S_j(\varphi\omega)$, so adding $X(\omega)$ gives
\[ X(\omega) + M_k(\varphi\omega) \ge X(\omega) + S_j(\varphi\omega) = S_{j+1}(\omega) \]
and rearranging we have
\[ X(\omega) \ge S_{j+1}(\omega) - M_k(\varphi\omega) \quad \text{for } j = 1, \ldots, k \]
Trivially, $X(\omega) \ge S_1(\omega) - M_k(\varphi\omega)$, since $S_1(\omega) = X(\omega)$ and $M_k(\varphi\omega) \ge 0$. Therefore
\[ E(X(\omega); M_k > 0) \ge \int_{\{M_k > 0\}} \max(S_1(\omega), \ldots, S_k(\omega)) - M_k(\varphi\omega)\, dP = \int_{\{M_k > 0\}} M_k(\omega) - M_k(\varphi\omega)\, dP \]
Now $M_k(\omega) = 0$ and $M_k(\varphi\omega) \ge 0$ on $\{M_k > 0\}^c$, so the last expression is
\[ \ge \int M_k(\omega) - M_k(\varphi\omega)\, dP = 0 \]
since $\varphi$ is measure preserving.
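Because the maximal ergodic lemma holds for every measure-preserving map, it can be tested exactly on a finite example. The sketch below assumes the simplest such setting, which is not taken from the text: $\Omega = \{0, \ldots, n-1\}$ with uniform $P$ and the cyclic shift $\varphi(\omega) = (\omega + 1) \bmod n$, with $X$ an arbitrary integer-valued function. The function name is hypothetical.

```python
import random

def maximal_lemma_lhs(x, k):
    """E(X; M_k > 0) for phi(w) = (w + 1) mod n on {0,...,n-1} with uniform P.

    x[w] is the value X(w); integer values keep the arithmetic exact.
    """
    n = len(x)
    total = 0
    for w in range(n):
        s, m = 0, 0
        for j in range(k):
            s += x[(w + j) % n]   # S_{j+1}(w) = X(w) + ... + X(phi^j w)
            m = max(m, s)         # running value of M_{j+1}(w)
        if m > 0:                 # w belongs to the event {M_k > 0}
            total += x[w]
    return total / n

rng = random.Random(1)
for _ in range(200):
    n, k = rng.randint(1, 12), rng.randint(1, 15)
    x = [rng.randint(-5, 5) for _ in range(n)]
    assert maximal_lemma_lhs(x, k) >= 0
print("E(X; M_k > 0) >= 0 held in all trials")
```

Note that $X$ itself may have negative mean; the lemma only asserts nonnegativity of the integral over the event $\{M_k > 0\}$.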
Proof of Theorem 6.2.1. $E(X | \mathcal{I})$ is invariant under $\varphi$ (see Exercise 6.1.1), so letting $X' = X - E(X | \mathcal{I})$ we can assume without loss of generality that $E(X | \mathcal{I}) = 0$. Let $\bar X = \limsup S_n/n$, let $\epsilon > 0$, and let $D = \{\omega : \bar X(\omega) > \epsilon\}$. Our goal is to prove that $P(D) = 0$. $\bar X(\varphi\omega) = \bar X(\omega)$, so $D \in \mathcal{I}$. Let
\[ X^*(\omega) = (X(\omega) - \epsilon) 1_D(\omega) \qquad S_n^*(\omega) = X^*(\omega) + \ldots + X^*(\varphi^{n-1}\omega) \]
\[ M_n^*(\omega) = \max(0, S_1^*(\omega), \ldots, S_n^*(\omega)) \qquad F_n = \{M_n^* > 0\} \]
\[ F = \cup_n F_n = \left\{ \sup_{k \ge 1} S_k^*/k > 0 \right\} \]
Since $X^*(\omega) = (X(\omega) - \epsilon) 1_D(\omega)$ and $D = \{\limsup S_k/k > \epsilon\}$, it follows that
\[ F = \left\{ \sup_{k \ge 1} S_k/k > \epsilon \right\} \cap D = D \]
Lemma 6.2.2 implies that $E(X^*; F_n) \ge 0$. Since $E|X^*| \le E|X| + \epsilon < \infty$, the dominated convergence theorem implies $E(X^*; F_n) \to E(X^*; F)$, and it follows that $E(X^*; F) \ge 0$. The last conclusion looks innocent, but $F = D \in \mathcal{I}$, so it implies
\[ 0 \le E(X^*; D) = E(X - \epsilon; D) = E(E(X | \mathcal{I}); D) - \epsilon P(D) = -\epsilon P(D) \]
since $E(X | \mathcal{I}) = 0$. The last inequality implies that
\[ 0 = P(D) = P(\limsup S_n/n > \epsilon) \]
and since $\epsilon > 0$ is arbitrary, it follows that $\limsup S_n/n \le 0$. Applying the last result to $-X$ shows that $S_n/n \to 0$ a.s.
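The clever part of the proof is over and the rest is routine. To prove that convergence occurs in $L^1$, let
\[ X'_M(\omega) = X(\omega) 1_{(|X(\omega)| \le M)} \quad \text{and} \quad X''_M(\omega) = X(\omega) - X'_M(\omega) \]
The part of the ergodic theorem we have proved implies
\[ \frac{1}{n} \sum_{m=0}^{n-1} X'_M(\varphi^m \omega) \to E(X'_M | \mathcal{I}) \quad \text{a.s.} \]
Since $X'_M$ is bounded, the bounded convergence theorem implies
\[ E\left| \frac{1}{n} \sum_{m=0}^{n-1} X'_M(\varphi^m \omega) - E(X'_M | \mathcal{I}) \right| \to 0 \]
To handle $X''_M$, we observe
\[ E\left| \frac{1}{n} \sum_{m=0}^{n-1} X''_M(\varphi^m \omega) \right| \le \frac{1}{n} \sum_{m=0}^{n-1} E|X''_M(\varphi^m \omega)| = E|X''_M| \]
and $E|E(X''_M | \mathcal{I})| \le E E(|X''_M| \,|\, \mathcal{I}) = E|X''_M|$. So
\[ E\left| \frac{1}{n} \sum_{m=0}^{n-1} X''_M(\varphi^m \omega) - E(X''_M | \mathcal{I}) \right| \le 2E|X''_M| \]
and it follows that
\[ \limsup_{n \to \infty} E\left| \frac{1}{n} \sum_{m=0}^{n-1} X(\varphi^m \omega) - E(X | \mathcal{I}) \right| \le 2E|X''_M| \]
As $M \to \infty$, $E|X''_M| \to 0$ by the dominated convergence theorem, which completes the proof.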
Exercise 6.2.1. Show that if $X \in L^p$ with $p > 1$ then the convergence in Theorem 6.2.1 occurs in $L^p$.
Exercise 6.2.2. (i) Show that if $g_n(\omega) \to g(\omega)$ a.s. and $E(\sup_k |g_k(\omega)|) < \infty$, then
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} g_m(\varphi^m \omega) = E(g | \mathcal{I}) \quad \text{a.s.} \]
(ii) Show that if we suppose only that $g_n \to g$ in $L^1$, we get $L^1$ convergence.
Before turning to examples, we would like to prove a useful result that is a simple
consequence of Lemma 6.2.2:
Theorem 6.2.3. Wiener’s maximal inequality. Let $X_j(\omega) = X(\varphi^j \omega)$, $S_k(\omega) = X_0(\omega) + \cdots + X_{k-1}(\omega)$, $A_k(\omega) = S_k(\omega)/k$, and $D_k = \max(A_1, \ldots, A_k)$. If $\alpha > 0$ then
\[ P(D_k > \alpha) \le \alpha^{-1} E|X| \]
Proof. Let $B = \{D_k > \alpha\}$. Applying Lemma 6.2.2 to $X' = X - \alpha$, with $X'_j(\omega) = X'(\varphi^j \omega)$, $S'_k = X'_0(\omega) + \cdots + X'_{k-1}$, and $M'_k = \max(0, S'_1, \ldots, S'_k)$ we conclude that $E(X'; M'_k > 0) \ge 0$. Since $\{M'_k > 0\} = \{D_k > \alpha\} \equiv B$, it follows that
\[ E|X| \ge \int_B X\, dP \ge \int_B \alpha\, dP = \alpha P(B) \]
Exercise 6.2.3. Use Theorem 6.2.3 and the truncation argument at the end of the proof of Theorem 6.2.1 to conclude that if Theorem 6.2.1 holds for bounded r.v.’s, then it holds whenever $E|X| < \infty$.
Our next step is to see what Theorem 6.2.1 says about our examples.
Example 6.2.1. i.i.d. sequences. Since $\mathcal{I}$ is trivial, the ergodic theorem implies that
\[ \frac{1}{n} \sum_{m=0}^{n-1} X_m \to EX_0 \quad \text{a.s. and in } L^1 \]
The a.s. convergence is the strong law of large numbers.
Remark. We can prove the $L^1$ convergence in the law of large numbers without invoking the ergodic theorem. To do this, note that
\[ \frac{1}{n} \sum_{m=1}^{n} X_m^+ \to EX^+ \ \text{a.s.} \qquad E\left( \frac{1}{n} \sum_{m=1}^{n} X_m^+ \right) = EX^+ \]
and use Theorem 4.5.2 to conclude that $\frac{1}{n} \sum_{m=1}^{n} X_m^+ \to EX^+$ in $L^1$. A similar result for the negative part and the triangle inequality now give the desired result.
Example 6.2.2. Markov chains. Let $X_n$ be an irreducible Markov chain on a countable state space that has a stationary distribution $\pi$. Let $f$ be a function with
\[ \sum_x |f(x)| \pi(x) < \infty \]
In Example 6.1.7, we showed that $\mathcal{I}$ is trivial, so applying the ergodic theorem to $f(X_0(\omega))$ gives
\[ \frac{1}{n} \sum_{m=0}^{n-1} f(X_m) \to \sum_x f(x) \pi(x) \quad \text{a.s. and in } L^1 \]
For another proof of the almost sure convergence, see Exercise 5.6.4.
Example 6.2.3. Rotation of the circle. $\Omega = [0, 1)$, $\varphi(\omega) = (\omega + \theta) \bmod 1$. Suppose that $\theta \in (0, 1)$ is irrational, so that by a result in Section 6.1 $\mathcal{I}$ is trivial. If we set $X(\omega) = 1_A(\omega)$, with $A$ a Borel subset of $[0, 1)$, then the ergodic theorem implies
\[ \frac{1}{n} \sum_{m=0}^{n-1} 1_{(\varphi^m \omega \in A)} \to |A| \quad \text{a.s.} \]
where $|A|$ denotes the Lebesgue measure of $A$. The last result for $\omega = 0$ is usually called Weyl’s equidistribution theorem, although Bohl and Sierpinski should also get credit. For the history and a nonprobabilistic proof, see Hardy and Wright (1959), p. 390–393.
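The equidistribution statement is easy to observe numerically. The following sketch computes the fraction of the first $n$ points $m\theta \bmod 1$ that land in an interval $[a, b)$; the choice $\theta = \sqrt{2} - 1$, the interval, and the function name are assumptions made for illustration, and floating-point arithmetic stands in for exact rotation.

```python
import math

def visit_fraction(theta, a, b, n, omega=0.0):
    """Fraction of m in {0,...,n-1} with (omega + m*theta) mod 1 in [a, b)."""
    count = 0
    for m in range(n):
        x = (omega + m * theta) % 1.0
        if a <= x < b:
            count += 1
    return count / n

theta = math.sqrt(2) - 1          # an irrational rotation angle
for n in (100, 10_000, 100_000):
    print(n, visit_fraction(theta, 0.2, 0.5, n))   # should approach b - a = 0.3
```

Replacing `theta` by a rational such as 1/4 makes the orbit periodic, and the fraction then depends on which of the finitely many orbit points fall in the interval rather than on its length.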
To recover the number theoretic result, we will now show that:
Theorem 6.2.4. If $A = [a, b)$ then the exceptional set is $\emptyset$.
Proof. Let $A_k = [a + 1/k, b - 1/k)$. If $b - a > 2/k$, the ergodic theorem implies
\[ \frac{1}{n} \sum_{m=0}^{n-1} 1_{A_k}(\varphi^m \omega) \to b - a - \frac{2}{k} \]
for $\omega \in \Omega_k$ with $P(\Omega_k) = 1$. Let $G = \cap \Omega_k$, where the intersection is over integers $k$ with $b - a > 2/k$. $P(G) = 1$ so $G$ is dense in $[0, 1)$. If $x \in [0, 1)$ and $\omega_k \in G$ with $|\omega_k - x| < 1/k$, then $\varphi^m \omega_k \in A_k$ implies $\varphi^m x \in A$, so
\[ \liminf_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} 1_A(\varphi^m x) \ge b - a - \frac{2}{k} \]
for all large enough $k$. Noting that $k$ is arbitrary and applying similar reasoning to $A^c$ shows
\[ \frac{1}{n} \sum_{m=0}^{n-1} 1_A(\varphi^m x) \to b - a \]
Example 6.2.4. Benford’s law. As Gelfand first observed, the equidistribution theorem says something interesting about $2^m$. Let $\theta = \log_{10} 2$, $1 \le k \le 9$, and $A_k = [\log_{10} k, \log_{10}(k + 1))$ where $\log_{10} y$ is the logarithm of $y$ to the base 10. Taking $x = 0$ in the last result, we have
\[ \frac{1}{n} \sum_{m=0}^{n-1} 1_{A_k}(\varphi^m 0) \to \log_{10}\left( \frac{k + 1}{k} \right) \]
A little thought reveals that the first digit of $2^m$ is $k$ if and only if $m\theta \bmod 1 \in A_k$. The numerical values of the limiting probabilities are
k       1      2      3      4      5      6      7      8      9
limit   .3010  .1761  .1249  .0969  .0792  .0669  .0580  .0512  .0458
The limit distribution on $\{1, \ldots, 9\}$ is called Benford’s (1938) law, although it was discovered by Newcomb (1881). As Raimi (1976) explains, in many tables the observed frequency with which $k$ appears as a first digit is approximately $\log_{10}((k + 1)/k)$. Some of the many examples that are supposed to follow Benford’s law are: census populations of 3259 counties, 308 numbers from Reader’s Digest, areas of 335 rivers, 342 addresses of American Men of Science. The next table compares the percentages of the observations in the first five categories to Benford’s law:
First digit        1     2     3     4     5
Census             33.9  20.4  14.2  8.1   7.2
Reader’s Digest    33.4  18.5  12.4  7.5   7.1
Rivers             31.0  16.4  10.7  11.3  7.2
Benford’s Law      30.1  17.6  12.5  9.7   7.9
Addresses          28.9  19.2  12.6  8.8   8.5
The fits are far from perfect, but in each case Benford’s law matches the general shape of the observed distribution.
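The claim about the first digits of $2^m$ can be checked directly with exact integer arithmetic. This is a small illustrative sketch; the function name and the cutoff of 5000 powers are arbitrary choices, not taken from the text.

```python
import math

def first_digit_frequencies(n):
    """Frequencies of the leading decimal digit among 2^0, 2^1, ..., 2^(n-1)."""
    counts = [0] * 10
    power = 1
    for _ in range(n):
        counts[int(str(power)[0])] += 1   # leading digit of the exact integer
        power *= 2
    return [c / n for c in counts]

freqs = first_digit_frequencies(5000)
for k in range(1, 10):
    # observed frequency next to the Benford limit log10((k+1)/k)
    print(k, round(freqs[k], 4), round(math.log10((k + 1) / k), 4))
```

Python integers have arbitrary precision, so the leading digits are exact and no rounding in $m\theta \bmod 1$ is involved.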
Example 6.2.5. Bernoulli shift. $\Omega = [0, 1)$, $\varphi(\omega) = (2\omega) \bmod 1$. Let $i_1, \ldots, i_k \in \{0, 1\}$, let $r = i_1 2^{-1} + \cdots + i_k 2^{-k}$, and let $X(\omega) = 1$ if $r \le \omega < r + 2^{-k}$. In words, $X(\omega) = 1$ if the first $k$ digits of the binary expansion of $\omega$ are $i_1, \ldots, i_k$. The ergodic theorem implies that
\[ \frac{1}{n} \sum_{m=0}^{n-1} X(\varphi^m \omega) \to 2^{-k} \quad \text{a.s.} \]
i.e., in almost every $\omega \in [0, 1)$ the pattern $i_1, \ldots, i_k$ occurs with its expected frequency. Since there are only a countable number of patterns of finite length, it follows that almost every $\omega \in [0, 1)$ is normal, i.e., all patterns occur with their expected frequency. This is the binary version of Borel’s (1909) normal number theorem.
6.3 Recurrence
In this section, we will study the recurrence properties of stationary sequences. Our first result is an application of the ergodic theorem. Let $X_1, X_2, \ldots$ be a stationary sequence taking values in $\mathbf{R}^d$, let $S_k = X_1 + \cdots + X_k$, let $A = \{S_k \ne 0 \text{ for all } k \ge 1\}$, and let $R_n = |\{S_1, \ldots, S_n\}|$ be the number of points visited at time $n$. Kesten, Spitzer, and Whitman, see Spitzer (1964), p. 40, proved the next result when the $X_i$ are i.i.d. In that case, $\mathcal{I}$ is trivial, so the limit is $P(A)$.
Theorem 6.3.1. As $n \to \infty$, $R_n/n \to E(1_A | \mathcal{I})$ a.s.
Proof. Suppose $X_1, X_2, \ldots$ are constructed on $(\mathbf{R}^d)^{\{0,1,\ldots\}}$ with $X_n(\omega) = \omega_n$, and let $\varphi$ be the shift operator. It is clear that
\[ R_n \ge \sum_{m=1}^{n} 1_A(\varphi^m \omega) \]
since the right-hand side $= |\{m : 1 \le m \le n, S_\ell \ne S_m \text{ for all } \ell > m\}|$. Using the ergodic theorem now gives
\[ \liminf_{n \to \infty} R_n/n \ge E(1_A | \mathcal{I}) \quad \text{a.s.} \]
To prove the opposite inequality, let $A_k = \{S_1 \ne 0, S_2 \ne 0, \ldots, S_k \ne 0\}$. It is clear that
\[ R_n \le k + \sum_{m=1}^{n-k} 1_{A_k}(\varphi^m \omega) \]
since the sum on the right-hand side $= |\{m : 1 \le m \le n - k, S_\ell \ne S_m \text{ for } m < \ell \le m + k\}|$. Using the ergodic theorem now gives
\[ \limsup_{n \to \infty} R_n/n \le E(1_{A_k} | \mathcal{I}) \]
As $k \uparrow \infty$, $A_k \downarrow A$, so the monotone convergence theorem for conditional expectations, (c) in Theorem 4.1.2, implies
\[ E(1_{A_k} | \mathcal{I}) \downarrow E(1_A | \mathcal{I}) \quad \text{as } k \uparrow \infty \]
and the proof is complete.
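In the i.i.d. case Theorem 6.3.1 says $R_n/n \to P(A)$, the probability of never returning to 0. A quick simulation illustrates this for asymmetric simple random walk on $\mathbf{Z}$, where it is a standard fact that the no-return probability equals $|2p - 1|$ when $P(X_i = 1) = p$. The walk, the value $p = 0.7$, and the function name below are assumptions chosen for the sketch.

```python
import random

def range_fraction(p, n, seed=0):
    """R_n / n for simple random walk with P(step = +1) = p, P(step = -1) = 1 - p."""
    rng = random.Random(seed)
    s = 0
    visited = set()
    for _ in range(n):
        s += 1 if rng.random() < p else -1
        visited.add(s)            # collects the distinct values among S_1, ..., S_n
    return len(visited) / n

p = 0.7
print(range_fraction(p, 200_000), abs(2 * p - 1))   # the two numbers should be close
```

For $p = 1/2$ the same function returns values that drift toward 0 as $n$ grows, in line with the recurrence of the symmetric walk.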
Exercise 6.3.1. Let $g_n = P(S_1 \ne 0, \ldots, S_n \ne 0)$ for $n \ge 1$ and $g_0 = 1$. Show that $ER_n = \sum_{m=1}^{n} g_{m-1}$.
From Theorem 6.3.1, we get a result about the recurrence of random walks with stationary increments that is (for integer valued random walks) a generalization of the Chung-Fuchs theorem, 3.2.7.
Theorem 6.3.2. Let $X_1, X_2, \ldots$ be a stationary sequence taking values in $\mathbf{Z}$ with $E|X_i| < \infty$. Let $S_n = X_1 + \cdots + X_n$, and let $A = \{S_1 \ne 0, S_2 \ne 0, \ldots\}$. (i) If $E(X_1 | \mathcal{I}) = 0$ then $P(A) = 0$. (ii) If $P(A) = 0$ then $P(S_n = 0 \text{ i.o.}) = 1$.
Remark. In words, mean zero implies recurrence. The condition $E(X_1 | \mathcal{I}) = 0$ is needed to rule out trivial examples that have mean 0 but are a combination of a sequence with positive and negative means, e.g., $P(X_n = 1 \text{ for all } n) = P(X_n = -1 \text{ for all } n) = 1/2$.
Proof. If $E(X_1 | \mathcal{I}) = 0$ then the ergodic theorem implies $S_n/n \to 0$ a.s. Now
\[ \limsup_{n \to \infty} \left( \max_{1 \le k \le n} |S_k|/n \right) = \limsup_{n \to \infty} \left( \max_{K \le k \le n} |S_k|/n \right) \le \max_{k \ge K} |S_k|/k \]
for any $K$ and the right-hand side $\downarrow 0$ as $K \uparrow \infty$. The last conclusion leads easily to
\[ \lim_{n \to \infty} \left( \max_{1 \le k \le n} |S_k| \right) \Big/ n = 0 \]
Since $R_n \le 1 + 2 \max_{1 \le k \le n} |S_k|$ it follows that $R_n/n \to 0$ and Theorem 6.3.1 implies $P(A) = 0$.
Let $F_j = \{S_i \ne 0 \text{ for } i < j, S_j = 0\}$ and $G_{j,k} = \{S_{j+i} - S_j \ne 0 \text{ for } i < k, S_{j+k} - S_j = 0\}$. $P(A) = 0$ implies that $\sum_k P(F_k) = 1$. Stationarity implies $P(G_{j,k}) = P(F_k)$, and for fixed $j$ the $G_{j,k}$ are disjoint, so $\cup_k G_{j,k} = \Omega$ a.s. It follows that
\[ \sum_k P(F_j \cap G_{j,k}) = P(F_j) \quad \text{and} \quad \sum_{j,k} P(F_j \cap G_{j,k}) = 1 \]
On $F_j \cap G_{j,k}$, $S_j = 0$ and $S_{j+k} = 0$, so we have shown $P(S_n = 0 \text{ at least two times}) = 1$. Repeating the last argument shows $P(S_n = 0 \text{ at least } k \text{ times}) = 1$ for all $k$, and the proof is complete.
Exercise 6.3.2. Imitate the proof of (i) in Theorem 6.3.2 to show that if we assume
P (Xi > 1) = 0, EXi > 0, and the sequence Xi is ergodic in addition to the hypotheses
of Theorem 6.3.2, then P (A) = EXi .
Remark. This result was proved for asymmetric simple random walk in Exercise
3.1.13. It is interesting to note that we can use martingale theory to prove a result
for random walks that do not skip over integers on the way down, see Exercise 4.7.7.
Extending the reasoning in the proof of part (ii) of Theorem 6.3.2 gives a result
of Kac (1947b). Let X0 , X1 , . . . be a stationary sequence taking values in (S, S ). Let
A ∈ S , let T0 = 0, and for n ≥ 1, let Tn = inf {m > Tn−1 : Xm ∈ A} be the time of
the nth return to A.
Theorem 6.3.3. If $P(X_n \in A \text{ at least once}) = 1$, then under $P(\cdot | X_0 \in A)$, $t_n = T_n - T_{n-1}$ is a stationary sequence with $E(T_1 | X_0 \in A) = 1/P(X_0 \in A)$.
Remark. If $X_n$ is an irreducible Markov chain on a countable state space $S$ starting from its stationary distribution $\pi$, and $A = \{x\}$, then Theorem 6.3.3 says $E_x T_x = 1/\pi(x)$, which is Theorem 5.5.5. Theorem 6.3.3 extends that result to an arbitrary $A \subset S$ and drops the assumption that $X_n$ is a Markov chain.
Proof. We first show that under $P(\cdot \mid X_0 \in A)$, $t_1, t_2, \ldots$ is stationary. To cut down on . . .'s, we will only show that
$$P(t_1 = m, t_2 = n \mid X_0 \in A) = P(t_2 = m, t_3 = n \mid X_0 \in A)$$
It will be clear that the same proof works for any finite-dimensional distribution. Our first step is to extend $\{X_n, n \ge 0\}$ to a two-sided stationary sequence $\{X_n, n \in \mathbf{Z}\}$ using Theorem 6.1.2. Let $C_k = \{X_{-1} \notin A, \ldots, X_{-k+1} \notin A, X_{-k} \in A\}$.
$$\Big( \cup_{k=1}^K C_k \Big)^c = \{X_k \notin A \text{ for } -K \le k \le -1\}$$
The last event has the same probability as $\{X_k \notin A \text{ for } 1 \le k \le K\}$, so letting $K \to \infty$, we get $P(\cup_{k=1}^\infty C_k) = 1$. To prove the desired stationarity, we let $I_{j,k} = \{i \in [j,k] : X_i \in A\}$ and observe that
$$\begin{aligned}
P(t_2 = m, t_3 = n, X_0 \in A) &= \sum_{\ell=1}^\infty P(X_0 \in A, t_1 = \ell, t_2 = m, t_3 = n) \\
&= \sum_{\ell=1}^\infty P(I_{0,\ell+m+n} = \{0, \ell, \ell+m, \ell+m+n\}) \\
&= \sum_{\ell=1}^\infty P(I_{-\ell,m+n} = \{-\ell, 0, m, m+n\}) \\
&= \sum_{\ell=1}^\infty P(C_\ell, X_0 \in A, t_1 = m, t_2 = n)
\end{aligned}$$
To complete the proof, we compute
$$\begin{aligned}
E(t_1 \mid X_0 \in A) &= \sum_{k=1}^\infty P(t_1 \ge k \mid X_0 \in A) = P(X_0 \in A)^{-1} \sum_{k=1}^\infty P(t_1 \ge k, X_0 \in A) \\
&= P(X_0 \in A)^{-1} \sum_{k=1}^\infty P(C_k) = 1/P(X_0 \in A)
\end{aligned}$$
since the $C_k$ are disjoint and their union has probability 1.
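The conclusion of Theorem 6.3.3 can be checked exactly on a small example. The sketch below is illustrative only: the four-state transition matrix `P` and the helper names `solve`, `stationary`, and `kac_check` are assumptions made for this example, not objects from the text. It starts an irreducible Markov chain from its stationary distribution, computes the mean return time to a set $A$ conditional on $X_0 \in A$ by solving the usual hitting-time equations in exact rational arithmetic, and compares it with $1/P(X_0 \in A)$.

```python
from fractions import Fraction as F

def solve(M, b):
    """Solve M x = b exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def stationary(P):
    """Stationary distribution: pi P = pi together with sum(pi) = 1."""
    n = len(P)
    M = [[P[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n)]
    M[-1] = [F(1)] * n              # replace one redundant equation by normalization
    return solve(M, [F(0)] * (n - 1) + [F(1)])

def kac_check(P, A):
    """Return (E(T_1 | X_0 in A), 1 / P(X_0 in A)) for the stationary chain."""
    n = len(P)
    pi = stationary(P)
    out = [y for y in range(n) if y not in A]
    # h[y] = expected time to hit A from y outside A: h = 1 + P_restricted h
    M = [[(1 if y == z else 0) - P[y][z] for z in out] for y in out]
    h = dict(zip(out, solve(M, [F(1)] * len(out))))
    pi_A = sum(pi[x] for x in A)
    mean_return = sum(
        pi[x] * (1 + sum(P[x][y] * h[y] for y in out)) for x in A
    ) / pi_A
    return mean_return, 1 / pi_A

# Hypothetical example chain (all entries positive, rows sum to 1).
P = [
    [F(1, 2), F(1, 4), F(1, 8), F(1, 8)],
    [F(1, 3), F(1, 3), F(1, 6), F(1, 6)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
    [F(1, 10), F(2, 5), F(3, 10), F(1, 5)],
]
```

Calling `kac_check(P, {0})` or `kac_check(P, {1, 3})` returns two equal fractions. Using `Fraction` rather than floats makes the identity hold exactly instead of up to rounding.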
In the next two exercises, we continue to use the notation of Theorem 6.3.3.
Exercise 6.3.3. Show that if $P(X_n \in A \text{ at least once}) = 1$ and $A \cap B = \emptyset$ then
$$E\Big( \sum_{1\le m\le T_1} 1_{(X_m \in B)} \,\Big|\, X_0 \in A \Big) = \frac{P(X_0 \in B)}{P(X_0 \in A)}$$
When $A = \{x\}$ and $X_n$ is a Markov chain, this is the "cycle trick" for defining a stationary measure. See Theorem 5.5.2.
Exercise 6.3.4. Consider the special case in which $X_n \in \{0, 1\}$, and let $\bar P = P(\cdot \mid X_0 = 1)$. Here $A = \{1\}$ and so $T_1 = \inf\{m > 0 : X_m = 1\}$. Show $P(T_1 = n) = \bar P(T_1 \ge n)/\bar E T_1$. When $t_1, t_2, \ldots$ are i.i.d., this reduces to the formula for the first waiting time in a stationary renewal process.
In checking the hypotheses of Kac's theorem, a result Poincaré proved in 1899 is useful. First, we need a definition. Let $T_A = \inf\{n \ge 1 : \varphi^n(\omega) \in A\}$.
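Theorem 6.3.4. Suppose $\varphi : \Omega \to \Omega$ preserves $P$, that is, $P \circ \varphi^{-1} = P$. (i) $T_A < \infty$ a.s. on $A$, that is, $P(\omega \in A, T_A = \infty) = 0$. (ii) $\{\varphi^n(\omega) \in A \text{ i.o.}\} \supset A$. (iii) If $\varphi$ is ergodic and $P(A) > 0$, then $P(\varphi^n(\omega) \in A \text{ i.o.}) = 1$.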
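As a concrete illustration of Theorem 6.3.4, consider rotation of the circle, $\varphi(x) = x + \theta \bmod 1$, which preserves Lebesgue measure on $[0,1)$. The sketch below is an assumption-laden example rather than anything from the text: the angle `theta` (the golden-ratio conjugate), the set $A = [0, 0.1)$, and the function name `return_times` are all choices made here. It follows one orbit started in $A$, records the gaps between successive visits to $A$, and averages them; by Kac's theorem (Theorem 6.3.3) one expects the average to be near $1/P(A) = 10$.

```python
import math

def return_times(theta, width, n_steps, x0=0.0):
    """Gaps between successive visits of x -> (x + theta) mod 1 to A = [0, width)."""
    x, last, gaps = x0, 0, []
    for n in range(1, n_steps + 1):
        x = (x + theta) % 1.0
        if x < width:                 # the orbit is back in A at time n
            gaps.append(n - last)
            last = n
    return gaps

theta = (math.sqrt(5) - 1) / 2        # an irrational rotation angle (assumed)
gaps = return_times(theta, 0.1, 200_000)
mean_gap = sum(gaps) / len(gaps)      # empirical mean return time to A
```

Every gap is finite, which is conclusion (i) along this orbit, and the orbit keeps returning for as long as it is followed, in line with (ii) and (iii).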
Remark. Note that in (i) and (ii) we assume only that $\varphi$ is measure-preserving. Extrapolating from Markov chain theory, the conclusions can be "explained" by noting that: (i) the existence of a stationary distribution implies the sequence is recurrent, and (ii) since we start in $A$ we do not have to assume irreducibility. Conclusion (iii) is, of course, a consequence of the ergodic theorem, but as the self-contained proof below indicates, it is a much simpler fact.

Proof. Let $B = \{\omega \in A, T_A = \infty\}$. A little thought shows that if $\omega \in \varphi^{-m}B$ then $\varphi^m(\omega) \in A$, but $\varphi^n(\omega) \notin A$ for $n > m$, so the $\varphi^{-m}B$ are pairwise disjoint. The fact that $\varphi$ is measure-preserving implies $P(\varphi^{-m}B) = P(B)$, so we must have $P(B) = 0$ (or $P$ would have infinite mass). To prove (ii), note that for any $k$, $\varphi^k$ is measure-preserving, so (i) implies
$$0 = P(\omega \in A,\ \varphi^{nk}(\omega) \notin A \text{ for all } n \ge 1) \ge P(\omega \in A,\ \varphi^m(\omega) \notin A \text{ for all } m \ge k)$$
Since the last probability is 0 for all $k$, (ii) follows. Finally, for (iii), note that $B \equiv \{\omega : \varphi^n(\omega) \in A \text{ i.o.}\}$ is invariant and $\supset A$ by (ii), so $P(B) > 0$, and it follows from ergodicity that $P(B) = 1$.

6.4 A Subadditive Ergodic Theorem*

In this section we will prove Liggett's (1985) version of Kingman's (1968)
Theorem 6.4.1. Subadditive ergodic theorem. Suppose $X_{m,n}$, $0 \le m < n$ satisfy:
(i) $X_{0,m} + X_{m,n} \ge X_{0,n}$
(ii) $\{X_{nk,(n+1)k}, n \ge 1\}$ is a stationary sequence for each $k$.
(iii) The distribution of $\{X_{m,m+k}, k \ge 1\}$ does not depend on $m$.
(iv) $EX^+_{0,1} < \infty$ and for each $n$, $EX_{0,n} \ge \gamma_0 n$, where $\gamma_0 > -\infty$.
Then
(a) $\lim_{n\to\infty} EX_{0,n}/n = \inf_m EX_{0,m}/m \equiv \gamma$
(b) $X = \lim_{n\to\infty} X_{0,n}/n$ exists a.s. and in $L^1$, so $EX = \gamma$.
(c) If all the stationary sequences in (ii) are ergodic then $X = \gamma$ a.s.
Remark. Kingman assumed (iv), but instead of (i)–(iii) he assumed that $X_{\ell,m} + X_{m,n} \ge X_{\ell,n}$ for all $\ell < m < n$ and that the distribution of $\{X_{m+k,n+k}, 0 \le m < n\}$ does not depend on $k$. In two of the four applications in the next section, these stronger conditions do not hold.
Before giving the proof, which is somewhat lengthy, we will consider several examples for motivation. Since the validity of (ii) and (iii) in each case is clear, we will only check (i) and (iv). The first example shows that Theorem 6.4.1 contains the ergodic theorem, Theorem 6.2.1, as a special case.
Example 6.4.1. Stationary sequences. Suppose $\xi_1, \xi_2, \ldots$ is a stationary sequence with $E|\xi_k| < \infty$, and let $X_{m,n} = \xi_{m+1} + \cdots + \xi_n$. Then $X_{0,n} = X_{0,m} + X_{m,n}$, and (iv) holds.
Example 6.4.2. Range of random walk. Suppose $\xi_1, \xi_2, \ldots$ is a stationary sequence and let $S_n = \xi_1 + \cdots + \xi_n$. Let $X_{m,n} = |\{S_{m+1}, \ldots, S_n\}|$, the number of distinct values taken. It is clear that $X_{0,m} + X_{m,n} \ge X_{0,n}$. $0 \le X_{0,n} \le n$, so (iv) holds. Applying Theorem 6.4.1 now gives $X_{0,n}/n \to X$ a.s. and in $L^1$, but it does not tell us what the limit is.
Exercise 6.4.1. Suppose $\xi_1, \xi_2, \ldots$ is ergodic in Example 6.4.2. Use (c) and (a) of Theorem 6.4.1 to conclude that $|\{S_1, \ldots, S_n\}|/n \to P(\text{no return to } 0)$.
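The limit in Exercise 6.4.1 can be watched numerically in the simplest case. The sketch below is an illustration under assumptions made here, not part of the text: the steps are i.i.d. with $P(\xi = 1) = p$ and $P(\xi = -1) = 1 - p$, the function name `range_fraction` and the seed are arbitrary, and the value $P(\text{no return to } 0) = |2p - 1|$ is the standard formula for asymmetric simple random walk. It counts the distinct values among $S_1, \ldots, S_n$ and divides by $n$.

```python
import random

def range_fraction(p, n, seed=0):
    """Fraction |{S_1, ..., S_n}| / n for a +-1 random walk with P(step = +1) = p."""
    rng = random.Random(seed)
    s, visited = 0, set()
    for _ in range(n):
        s += 1 if rng.random() < p else -1
        visited.add(s)                # record S_k for k = 1, ..., n
    return len(visited) / n

ratio = range_fraction(0.7, 200_000)  # compare with |2 * 0.7 - 1| = 0.4
```

For $p = 1/2$ the same function should produce values drifting toward 0 as $n$ grows, reflecting recurrence.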
Example 6.4.3. Longest common subsequences. Given are ergodic stationary sequences $X_1, X_2, X_3, \ldots$ and $Y_1, Y_2, Y_3, \ldots$ Let $L_{m,n} = \max\{K : X_{i_k} = Y_{j_k}$ for $1 \le k \le K$, where $m < i_1 < i_2 \ldots < i_K \le n$ and $m < j_1 < j_2 \ldots < j_K \le n\}$. It is clear that
$$L_{0,m} + L_{m,n} \le L_{0,n}$$
so $X_{m,n} = -L_{m,n}$ is subadditive. $0 \le L_{0,n} \le n$ so (iv) holds. Applying Theorem 6.4.1 now, we conclude that
$$L_{0,n}/n \to \gamma = \sup_{m\ge 1} E(L_{0,m}/m)$$
Exercise 6.4.2. Suppose that in the last example $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ are i.i.d. and take the values 0 and 1 with probability 1/2 each. (a) Compute $EL_{0,1}$ and $EL_{0,2}/2$ to get lower bounds on $\gamma$. (b) Show $\gamma < 1$ by computing the expected number of $i$ and $j$ sequences of length $K = an$ with the desired property.
Remark. Chvátal and Sankoff (1975) have shown $0.727273 \le \gamma \le 0.866595$.
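To get a feel for the constant $\gamma$ in Example 6.4.3, one can compute $L_{0,n}/n$ for a single pair of simulated fair-coin sequences. The sketch below uses the textbook dynamic program for the longest common subsequence. The function name `lcs_length`, the seed, and the length `n = 1000` are choices made for this illustration, and a single sample at finite $n$ is only a rough indication of the limit.

```python
import random

def lcs_length(x, y):
    """Length of the longest common subsequence of x and y, in O(len(x) * len(y))."""
    prev = [0] * (len(y) + 1)          # row for the empty prefix of x
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop one symbol
        prev = cur
    return prev[-1]

rng = random.Random(1)
n = 1000
x = [rng.randint(0, 1) for _ in range(n)]
y = [rng.randint(0, 1) for _ in range(n)]
ratio = lcs_length(x, y) / n           # one sample of L_{0,n} / n
```

Since $\gamma = \sup_m E(L_{0,m}/m)$, the expected value of `ratio` is at most $\gamma$ for every $n$, so samples like this sit slightly below the limit on average.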
Our final example shows that the convergence in (a) of Theorem 6.4.1 may occur arbitrarily slowly.
Example 6.4.4. Suppose $X_{m,m+k} = f(k) \ge 0$, where $f(k)/k$ is decreasing.
$$\begin{aligned}
X_{0,n} = f(n) &= m\,\frac{f(n)}{n} + (n-m)\,\frac{f(n)}{n} \\
&\le m\,\frac{f(m)}{m} + (n-m)\,\frac{f(n-m)}{n-m} = X_{0,m} + X_{m,n}
\end{aligned}$$
The examples above should provide enough motivation for now. In the next section, we will give four more applications of Theorem 6.4.1.
Proof of Theorem 6.4.1. There are four steps. The first, second, and fourth date back to Kingman (1968). The half dozen proofs of subadditive ergodic theorems that exist all do the crucial third step in a different way. Here we use the approach of S. Leventhal (1988), who in turn based his proof on Katznelson and Weiss (1982).

Step 1. The first thing to check is that $E|X_{0,n}| \le Cn$. To do this, we note that (i) implies $X^+_{0,m} + X^+_{m,n} \ge X^+_{0,n}$. Repeatedly using the last inequality and invoking (iii) gives $EX^+_{0,n} \le nEX^+_{0,1} < \infty$. Since $|x| = 2x^+ - x$, it follows from (iv) that
$$E|X_{0,n}| \le 2EX^+_{0,n} - EX_{0,n} \le Cn < \infty$$
Let $a_n = EX_{0,n}$. (i) and (iii) imply that
$$a_m + a_{n-m} \ge a_n \tag{6.4.1}$$
From this, it follows easily that
$$a_n/n \to \inf_{m\ge 1} a_m/m \equiv \gamma \tag{6.4.2}$$
To prove this, we observe that the liminf is clearly $\ge \gamma$, so all we have to show is that the limsup $\le a_m/m$ for any $m$. The last fact is easy, for if we write $n = km + \ell$ with $0 \le \ell < m$, then repeated use of (6.4.1) gives $a_n \le ka_m + a_\ell$. Dividing by $n = km + \ell$ gives
$$\frac{a_n}{n} \le \frac{km}{km+\ell} \cdot \frac{a_m}{m} + \frac{a_\ell}{n}$$
Letting $n \to \infty$ and recalling $0 \le \ell < m$ gives (6.4.2) and proves (a) in Theorem 6.4.1.
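The argument in Step 1 is the classical subadditivity lemma for sequences, and it is easy to watch it work on a concrete sequence. The sketch below is only an illustration: the sequence $a_n = 2n + \sqrt{n}$ and the names `a` and `proof_bound` are assumptions made here. It checks (6.4.1) on a finite range, checks the bound $a_n/n \le \frac{km}{km+\ell}\cdot\frac{a_m}{m} + \frac{a_\ell}{n}$ with the convention $a_0 = 0$, and evaluates $a_n/n$ for a large $n$ to see it approach $\inf_m a_m/m = 2$.

```python
import math

def a(n):
    """A subadditive sequence (assumed for illustration): a_n = 2n + sqrt(n), a_0 = 0."""
    return 2 * n + math.sqrt(n)

def proof_bound(n, m):
    """Right-hand side of the bound in Step 1, with n = k*m + l and 0 <= l < m."""
    k, l = divmod(n, m)
    return (k * m / (k * m + l)) * (a(m) / m) + a(l) / n

# (6.4.1): a_m + a_{n-m} >= a_n on a finite range of indices
subadditive = all(
    a(m) + a(n - m) >= a(n) for n in range(2, 200) for m in range(1, n)
)

# the bound obtained by writing n = k*m + l (small tolerance for rounding)
bound_holds = all(
    a(n) / n <= proof_bound(n, m) + 1e-12
    for n in range(1, 300) for m in range(1, n + 1)
)

ratio_large = a(10**8) / 10**8   # a_n / n for a large n; the infimum is 2
```

Here $a_n/n = 2 + n^{-1/2}$ decreases to its infimum without attaining it, the same kind of slow convergence that Example 6.4.4 points to.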
Step 2. Making repeated use of (i), we get
$$\begin{aligned}
X_{0,n} &\le X_{0,km} + X_{km,n} \\
X_{0,n} &\le X_{0,(k-1)m} + X_{(k-1)m,km} + X_{km,n}
\end{aligned}$$
and so on until the first term on the right is $X_{0,m}$. Dividing by $n = km + \ell$ then gives
$$\frac{X_{0,n}}{n} \le \frac{k}{km+\ell} \cdot \frac{X_{0,m} + \cdots + X_{(k-1)m,km}}{k} + \frac{X_{km,n}}{n}$$
Using (ii) and the ergodic theorem now gives that
$$\frac{X_{0,m} + \cdots + X_{(k-1)m,km}}{k} \to A_m \quad\text{a.s. and in } L^1 \tag{6.4.3}$$
where $A_m = E(X_{0,m} \mid \mathcal{I}_m)$ and the subscript indicates that $\mathcal{I}_m$ is the shift invariant σ-field for the sequence $X_{(k-1)m,km}$, $k \ge 1$. The exact formula for the limit is not important, but we will need to know later that $EA_m = EX_{0,m}$.
If we fix $\ell$ and let $\epsilon > 0$, then (iii) implies
$$\sum_{k=1}^\infty P(X_{km,km+\ell} > (km+\ell)\epsilon) \le \sum_{k=1}^\infty P(X_{0,\ell} > k\epsilon) < \infty$$
since $EX^+_{0,\ell} < \infty$ by the result at the beginning of Step 1. The last two observations imply
$$\bar X \equiv \limsup_{n\to\infty} X_{0,n}/n \le A_m/m \tag{6.4.4}$$
Taking expected values now gives $E\bar X \le E(X_{0,m}/m)$, and taking the infimum over $m$, we have $E\bar X \le \gamma$. Note that if all the stationary sequences in (ii) are ergodic, we have $\bar X \le \gamma$.
Remark. If (i)–(iii) hold, $EX^+_{0,1} < \infty$, and $\inf EX_{0,m}/m = -\infty$, then it follows from the last argument that $X_{0,n}/n \to -\infty$ a.s. as $n \to \infty$.

Step 3. The next step is to let
$$\underline{X} = \liminf_{n\to\infty} X_{0,n}/n$$
and show that $E\underline{X} \ge \gamma$. Since $\infty > EX_{0,1} \ge \gamma \ge \gamma_0 > -\infty$, and we have shown in Step 2 that $E\bar X \le \gamma$, it will follow that $\underline{X} = \bar X$, i.e., the limit of $X_{0,n}/n$ exists a.s. Let
$$\underline{X}_m = \liminf_{n\to\infty} X_{m,m+n}/n$$
(i) implies
$$X_{0,m+n} \le X_{0,m} + X_{m,m+n}$$
Dividing both sides by $n$ and letting $n \to \infty$ gives $\underline{X} \le \underline{X}_m$ a.s. However, (iii) implies that $\underline{X}_m$ and $\underline{X}$ have the same distribution so $\underline{X} = \underline{X}_m$ a.s.
Let $\epsilon > 0$ and let $Z = \epsilon + (\underline{X} \vee -M)$. Since $\underline{X} \le \bar X$ and $E\bar X \le \gamma < \infty$ by Step 2, $E|Z| < \infty$. Let
$$Y_{m,n} = X_{m,n} - (n-m)Z$$
$Y$ satisfies (i)–(iv), since $Z_{m,n} = -(n-m)Z$ does, and has
$$\underline{Y} \equiv \liminf_{n\to\infty} Y_{0,n}/n \le -\epsilon \tag{6.4.5}$$
Let $T_m = \min\{n \ge 1 : Y_{m,m+n} \le 0\}$. (iii) implies $T_m =_d T_0$ and
$$E(Y_{m,m+1}; T_m > N) = E(Y_{0,1}; T_0 > N)$$
(6.4.5) implies that $P(T_0 < \infty) = 1$, so we can pick $N$ large enough so that
$$E(Y_{0,1}; T_0 > N) \le \epsilon$$
Let
$$S_m = \begin{cases} T_m & \text{on } \{T_m \le N\} \\ 1 & \text{on } \{T_m > N\} \end{cases}$$
This is not a stopping time but there is nothing special about stopping times for a stationary sequence! Let
$$\xi_m = \begin{cases} 0 & \text{on } \{T_m \le N\} \\ Y_{m,m+1} & \text{on } \{T_m > N\} \end{cases}$$
Since $Y(m, m+T_m) \le 0$ always and we have $S_m = 1$, $Y_{m,m+1} > 0$ on $\{T_m > N\}$, we have $Y(m, m+S_m) \le \xi_m$ and $\xi_m \ge 0$. Let $R_0 = 0$, and for $k \ge 1$, let $R_k = R_{k-1} + S(R_{k-1})$. Let $K = \max\{k : R_k \le n\}$. From (i), it follows that
$$Y(0,n) \le Y(R_0,R_1) + \cdots + Y(R_{K-1},R_K) + Y(R_K,n)$$
Since $\xi_m \ge 0$ and $n - R_K \le N$, the last quantity is
$$\le \sum_{m=0}^{n-1} \xi_m + \sum_{j=1}^{N} |Y_{n-j,n-j+1}|$$
Here we have used (i) on $Y(R_K, n)$. Dividing both sides by $n$, taking expected values, and letting $n \to \infty$ gives
$$\limsup_{n\to\infty} EY_{0,n}/n \le E\xi_0 \le E(Y_{0,1}; T_0 > N) \le \epsilon$$
It follows from (a) and the definition of $Y_{0,n}$ that
$$\gamma = \lim_{n\to\infty} EX_{0,n}/n \le 2\epsilon + E(\underline{X} \vee -M)$$
Since $\epsilon > 0$ and $M$ are arbitrary, it follows that $E\underline{X} \ge \gamma$ and Step 3 is complete.

Step 4. It only remains to prove convergence in $L^1$. Let $\Gamma_m = A_m/m$ be the limit in
(6.4.4), recall $E\Gamma_m = E(X_{0,m}/m)$, and let $\Gamma = \inf \Gamma_m$. Observing that $|z| = 2z^+ - z$ (consider two cases $z \ge 0$ and $z < 0$), we can write
$$E|X_{0,n}/n - \Gamma| = 2E(X_{0,n}/n - \Gamma)^+ - E(X_{0,n}/n - \Gamma) \le 2E(X_{0,n}/n - \Gamma)^+$$
since
$$E(X_{0,n}/n) \ge \gamma = \inf E\Gamma_m \ge E\Gamma$$
Using the trivial inequality $(x+y)^+ \le x^+ + y^+$ and noticing $\Gamma_m \ge \Gamma$ now gives
$$E(X_{0,n}/n - \Gamma)^+ \le E(X_{0,n}/n - \Gamma_m)^+ + E(\Gamma_m - \Gamma)$$
Now $E\Gamma_m \to \gamma$ as $m \to \infty$ and $E\Gamma \ge E\bar X \ge E\underline{X} \ge \gamma$ by steps 2 and 3, so $E\Gamma = \gamma$, and it follows that $E(\Gamma_m - \Gamma)$ is small if $m$ is large. To bound the other term, observe that (i) implies
$$E(X_{0,n}/n - \Gamma_m)^+ \le E\left( \frac{X(0,m) + \cdots + X((k-1)m,km)}{km+\ell} - \Gamma_m \right)^+ + E\left( \frac{X(km,n)}{n} \right)^+$$
The second term $= E(X^+_{0,\ell}/n) \to 0$ as $n \to \infty$. For the first, we observe $y^+ \le |y|$, and the ergodic theorem (recall $A_m = m\Gamma_m$ in (6.4.3)) implies
$$E\left| \frac{X(0,m) + \cdots + X((k-1)m,km)}{km} - \Gamma_m \right| \to 0$$
so the proof of Theorem 6.4.1 is complete.

6.5 Applications*

In this section, we will give four applications of our subadditive ergodic theorem, 6.4.1. These examples are independent of each other and can be read in any order. In the last two, we encounter situations to which Liggett's version applies but Kingman's version does not.
Example 6.5.1. Products of random matrices. Suppose $A_1, A_2, \ldots$ is a stationary sequence of $k \times k$ matrices with positive entries and let
$$\alpha_{m,n}(i,j) = (A_{m+1} \cdots A_n)(i,j),$$
i.e., the entry in row $i$ of column $j$ of the product. It is clear that
$$\alpha_{0,m}(1,1)\,\alpha_{m,n}(1,1) \le \alpha_{0,n}(1,1)$$
so if we let $X_{m,n} = -\log \alpha_{m,n}(1,1)$ then $X_{0,m} + X_{m,n} \ge X_{0,n}$. To check (iv), we observe that
$$\prod_{m=1}^n A_m(1,1) \le \alpha_{0,n}(1,1) \le k^{n-1} \prod_{m=1}^n \Big( \sup_{i,j} A_m(i,j) \Big)$$
or taking logs
$$-\sum_{m=1}^n \log A_m(1,1) \ge X_{0,n} \ge -(n \log k) - \sum_{m=1}^n \log \Big( \sup_{i,j} A_m(i,j) \Big)$$
So if $E \log A_m(1,1) > -\infty$ then $EX^+_{0,1} < \infty$, and if
$$E \log \Big( \sup_{i,j} A_m(i,j) \Big) < \infty$$
then $EX^-_{0,n} \le \gamma_0 n$. If we observe that
$$P\Big( \log \Big( \sup_{i,j} A_m(i,j) \Big) \ge x \Big) \le \sum_{i,j} P(\log A_m(i,j) \ge x)$$
we see that it is enough to assume that
$$(\ast)\qquad E|\log A_m(i,j)| < \infty \quad\text{for all } i, j$$
When $(\ast)$ holds, applying Theorem 6.4.1 gives $X_{0,n}/n \to X$ a.s. Using the strict positivity of the entries, it is easy to improve that result to
$$\frac{1}{n} \log \alpha_{0,n}(i,j) \to -X \quad\text{a.s. for all } i, j \tag{6.5.1}$$
a result first proved by Furstenberg and Kesten (1960).
The key to the proof above was the fact that $\alpha_{0,n}(1,1)$ was supermultiplicative. An alternative approach is to let
$$\|A\| = \max_i \sum_j |A(i,j)| = \max\{\|xA\|_1 : \|x\|_1 = 1\}$$
where $(xA)_j = \sum_i x_i A(i,j)$ and $\|x\|_1 = |x_1| + \cdots + |x_k|$. From the second definition, it is clear that $\|AB\| \le \|A\| \cdot \|B\|$, so if we let
$$\beta_{m,n} = \|A_{m+1} \cdots A_n\|$$
and $Y_{m,n} = \log \beta_{m,n}$, then $Y_{m,n}$ is subadditive. It is easy to use (6.5.1) to show that
$$\frac{1}{n} \log \|A_{m+1} \cdots A_n\| \to -X \quad\text{a.s.}$$
where $X$ is the limit of $X_{0,n}/n$. To see the advantage in having two proofs of the same result, we observe that if $A_1, A_2, \ldots$ is an i.i.d. sequence, then $X$ is constant, and we can get upper and lower bounds by observing
$$\sup_{m\ge 1} (E \log \alpha_{0,m})/m = -X = \inf_{m\ge 1} (E \log \beta_{0,m})/m$$
Remark. Oseledĕc (1968) proved a result which gives the asymptotic behavior of all of the eigenvalues of $A$. As Raghunathan (1979) and Ruelle (1979) have observed, this result can also be obtained from Theorem 6.4.1. See Krengel (1985) or the papers cited for details. Furstenberg and Kesten (1960) and later Ishitani (1977) have proved central limit theorems:
$$(\log \alpha_{0,n}(1,1) - \mu n)/n^{1/2} \Rightarrow \sigma\chi$$
where $\chi$ has the standard normal distribution. For more about products of random matrices, see Cohen, Kesten, and Newman (1985).
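The two routes to the limit in Example 6.5.1, through the entry $\alpha_{0,n}(1,1)$ and through the norm $\beta_{0,n}$, can be compared numerically. The sketch below is a hedged illustration: the choice of i.i.d. $2 \times 2$ matrices with entries uniform on $[1,2]$, the function name `lyapunov_estimates`, and the seed are all assumptions made here. To avoid floating-point overflow the running product is rescaled at each step and the logarithm of the scale is accumulated separately.

```python
import math
import random

def lyapunov_estimates(n, k=2, seed=0):
    """Return ((1/n) log alpha_{0,n}(1,1), (1/n) log ||A_1 ... A_n||)."""
    rng = random.Random(seed)
    prod = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    log_scale = 0.0                      # log of the factor divided out so far
    for _ in range(n):
        a = [[rng.uniform(1.0, 2.0) for _ in range(k)] for _ in range(k)]
        prod = [
            [sum(prod[i][l] * a[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)
        ]
        top = max(max(row) for row in prod)
        prod = [[v / top for v in row] for row in prod]   # rescale to max entry 1
        log_scale += math.log(top)
    entry = (log_scale + math.log(prod[0][0])) / n
    # ||A|| = max_i sum_j |A(i,j)|; entries are positive so no abs is needed
    norm = (log_scale + math.log(max(sum(row) for row in prod))) / n
    return entry, norm

entry_rate, norm_rate = lyapunov_estimates(5000)
```

Both numbers estimate $-X$. For this entry distribution they must lie between $\log 2$ and $\log 4$, since every entry of the product of $n$ such matrices is at least $2^{n-1}$ and every norm is at most $4^n$.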
Example 6.5.2. Increasing sequences in random permutations. Let $\pi$ be a permutation of $\{1, 2, \ldots, n\}$ and let $\ell(\pi)$ be the length of the longest increasing sequence in $\pi$. That is, the largest $k$ for which there are integers $i_1 < i_2 \ldots < i_k$ so that $\pi(i_1) < \pi(i_2) < \ldots < \pi(i_k)$. Hammersley (1970) attacked this problem by putting a rate one Poisson process in the plane, and for $s < t \in [0,\infty)$, letting $Y_{s,t}$ denote the length of the longest increasing path lying in the square $R_{s,t}$ with vertices $(s,s)$, $(s,t)$, $(t,t)$, and $(t,s)$. That is, the largest $k$ for which there are points $(x_i, y_i)$ in the Poisson process with $s < x_1 < \ldots < x_k < t$ and $s < y_1 < \ldots < y_k < t$. It is clear that $Y_{0,m} + Y_{m,n} \le Y_{0,n}$. Applying Theorem 6.4.1 to $-Y_{0,n}$ shows
$$Y_{0,n}/n \to \gamma \equiv \sup_{m\ge 1} EY_{0,m}/m \quad\text{a.s.}$$
For each $k$, $Y_{nk,(n+1)k}$, $n \ge 0$ is i.i.d., so the limit is constant. We will show that $\gamma < \infty$ in Exercise 6.5.2.
To get from the result about the Poisson process back to the random permutation problem, let $\tau(n)$ be the smallest value of $t$ for which there are $n$ points in $R_{0,t}$. Let the $n$ points in $R_{0,\tau(n)}$ be written as $(x_i, y_i)$ where $0 < x_1 < x_2 \ldots < x_n \le \tau(n)$ and let $\pi_n$ be the unique permutation of $\{1, 2, \ldots, n\}$ so that $y_{\pi_n(1)} < y_{\pi_n(2)} \ldots < y_{\pi_n(n)}$. It is clear that $Y_{0,\tau(n)} = \ell(\pi_n)$. An easy argument shows:
Lemma 6.5.1. $\tau(n)/\sqrt{n} \to 1$ a.s.
Proof. Let $S_n$ be the number of points in $R_{0,\sqrt{n}}$. $S_n - S_{n-1}$ are independent Poisson r.v.'s with mean 1, so the strong law of large numbers implies $S_n/n \to 1$ a.s. If $\epsilon > 0$ then for large $n$, $S_{n(1-\epsilon)} < n < S_{n(1+\epsilon)}$ and hence $\sqrt{(1-\epsilon)n} \le \tau(n) \le \sqrt{(1+\epsilon)n}$.
It follows from Lemma 6.5.1 and the monotonicity of $m \to Y_{0,m}$ that
$$n^{-1/2}\,\ell(\pi_n) \to \gamma \quad\text{a.s.}$$
Hammersley (1970) has a proof that $\pi/2 \le \gamma \le e$, and Kingman (1973) shows that $1.59 < \gamma < 2.49$. See Exercises 6.5.1 and 6.5.2. Subsequent work on the random permutation problem, see Logan and Shepp (1977) and Vershik and Kerov (1977), has shown that $\gamma = 2$.
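The limit $n^{-1/2}\ell(\pi_n) \to 2$ in Example 6.5.2 is easy to probe by simulation. The sketch below is one possible way to do it, not the method of the text: it computes the longest increasing subsequence with the standard patience-sorting technique (binary search over the smallest possible tail of an increasing subsequence of each length). The names `lis_length`, `perm`, the seed, and the size `n` are choices made for this illustration.

```python
import bisect
import math
import random

def lis_length(seq):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    tails = []   # tails[i] = smallest last element of an increasing subsequence of length i+1
    for v in seq:
        i = bisect.bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)      # v extends the longest subsequence found so far
        else:
            tails[i] = v         # v gives a smaller tail for length i+1
    return len(tails)

rng = random.Random(0)
n = 100_000
perm = list(range(1, n + 1))
rng.shuffle(perm)                # a uniformly random permutation of {1, ..., n}
ratio = lis_length(perm) / math.sqrt(n)
```

At this size the ratio is typically a little below 2, which matches the fact that $\gamma$ is a supremum and the convergence is from below in expectation.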
Exercise 6.5.1. Given a rate one Poisson process in $[0,\infty) \times [0,\infty)$, let $(X_1, Y_1)$ be the point that minimizes $x + y$. Let $(X_2, Y_2)$ be the point in $[X_1,\infty) \times [Y_1,\infty)$ that minimizes $x + y$, and so on. Use this construction to show that $\gamma \ge (8/\pi)^{1/2} > 1.59$.
Exercise 6.5.2. Let $\pi_n$ be a random permutation of $\{1, \ldots, n\}$ and let $J^n_k$ be the number of subsets of $\{1, \ldots, n\}$ of size $k$ so that the associated $\pi_n(j)$ form an increasing subsequence. Compute $EJ^n_k$ and take $k \sim \alpha n^{1/2}$ to conclude $\gamma \le e$.
Remark. Kingman improved this by observing that if $\ell(\pi_n) \ge \ell$ then $J^n_k \ge \binom{\ell}{k}$. Using this with the bound on $EJ^n_k$ and taking $\ell \sim \beta n^{1/2}$ and $k \sim \alpha n^{1/2}$, he showed $\gamma < 2.49$.

Example 6.5.3. Age-dependent branching processes. This is a variation of the
branching process introduced in Section 4.3.4 in which each individual lives for an
amount of time with distribution F before producing k oﬀspring with probability pk .
The description of the process is completed by supposing that the process starts with
one individual in generation 0 who is born at time 0, and when this particle dies, its
oﬀspring start independent copies of the original process.
Suppose $p_0 = 0$, let $X_{0,m}$ be the birth time of the first member of generation $m$, and let $X_{m,n}$ be the time lag necessary for that individual to have an offspring in generation $n$. In case of ties, pick an individual at random from those in generation $m$ born at time $X_{0,m}$. It is clear that $X_{0,n} \le X_{0,m} + X_{m,n}$. Since $X_{0,n} \ge 0$, (iv) holds if we assume $F$ has finite mean. Applying Theorem 6.4.1 now, it follows that
$$X_{0,n}/n \to \gamma \quad\text{a.s.}$$
The limit is constant because the sequences $\{X_{nk,(n+1)k}, n \ge 0\}$ are i.i.d.
Remark. The inequality $X_{\ell,m} + X_{m,n} \ge X_{\ell,n}$ is false when $\ell > 0$, because if we call $i_m$ the individual that determines the value of $X_{m,n}$ for $n > m$, then $i_m$ may not be a descendant of $i_\ell$.
As usual, one has to use other methods to identify the constant. Let $t_1, t_2, \ldots$ be i.i.d. with distribution $F$, let $T_n = t_1 + \cdots + t_n$, and $\mu = \sum k p_k$. Let $Z_n(an)$ be the number of individuals in generation $n$ born by time $an$. Each individual in generation $n$ has probability $P(T_n \le an)$ to be born by time $an$, and the times are independent of the offspring numbers so
$$EZ_n(an) = EE(Z_n(an) \mid Z_n) = E(Z_n P(T_n \le an)) = \mu^n P(T_n \le an)$$
By results in Section 1.9, $n^{-1} \log P(T_n \le an) \to -c(a)$ as $n \to \infty$. If $\log \mu - c(a) < 0$ then Chebyshev's inequality and the Borel-Cantelli lemma imply $P(Z_n(an) \ge 1 \text{ i.o.}) = 0$. Conversely, if $EZ_n(an) > 1$ for some $n$, then we can define a supercritical branching process $Y_m$ that consists of the offspring in generation $mn$ that are descendants of individuals in $Y_{m-1}$ in generation $(m-1)n$ that are born less than $an$ units of time after their parents. This shows that with positive probability, $X_{0,mn} \le mna$ for all $m$. Combining the last two observations with the fact that $c(a)$ is strictly increasing gives
$$\gamma = \inf\{a : \log \mu - c(a) > 0\}$$
The last result is from Biggins (1977). See his (1978) and (1979) papers for extensions and refinements. Kingman (1975) has an approach to the problem via martingales:
Exercise 6.5.3. Let $\varphi(\theta) = E\exp(-\theta t_i)$ and
$$Y_n = (\mu\varphi(\theta))^{-n} \sum_{i=1}^{Z_n} \exp(-\theta T_n(i))$$
where the sum is over individuals in generation $n$ and $T_n(i)$ is the $i$th person's birth time. Show that $Y_n$ is a nonnegative martingale and use this to conclude that if $\exp(-\theta a)/\mu\varphi(\theta) > 1$, then $P(X_{0,n} \le an) \to 0$. A little thought reveals that this bound is the same as the answer in the last exercise.
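The formula $\gamma = \inf\{a : \log\mu - c(a) > 0\}$ from Example 6.5.3 can be evaluated numerically once $c$ is known. The sketch below works under assumptions that are not part of the text: lifetimes are exponential with mean one, so the rate function for $P(T_n \le an)$ with $0 < a \le 1$ is the standard Cramér expression $c(a) = a - 1 - \log a$, and every individual has exactly two offspring, so $\mu = 2$. The function names `rate` and `first_birth_speed` are hypothetical.

```python
import math

def rate(a):
    """c(a) = a - 1 - log(a): large deviations rate for mean-one exponentials, 0 < a <= 1."""
    return a - 1.0 - math.log(a)

def first_birth_speed(mu, lo=1e-9, hi=1.0, tol=1e-12):
    """Bisection for the root of log(mu) - c(a) on (0, 1], assuming mu > 1.

    On (0, 1] the function a -> log(mu) - c(a) is increasing, negative near 0
    and equal to log(mu) > 0 at a = 1, so the root is inf{a : log(mu) - c(a) > 0}.
    """
    def f(a):
        return math.log(mu) - rate(a)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return hi

gamma = first_birth_speed(2.0)   # binary splitting with Exp(1) lifetimes (assumed)
```

Bisection is enough here because the function is monotone on the search interval. Other lifetime distributions would need their own rate function in place of `rate`.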
Example 6.5.4. First passage percolation. Consider $\mathbf{Z}^d$ as a graph with edges connecting each $x, y \in \mathbf{Z}^d$ with $|x - y| = 1$. Assign an independent nonnegative random variable $\tau(e)$ to each edge that represents the time required to traverse the edge going in either direction. If $e$ is the edge connecting $x$ and $y$, let $\tau(x,y) = \tau(y,x) = \tau(e)$. If $x_0 = x, x_1, \ldots, x_n = y$ is a path from $x$ to $y$, i.e., a sequence with $|x_m - x_{m-1}| = 1$ for $1 \le m \le n$, we define the travel time for the path to be $\tau(x_0,x_1) + \cdots + \tau(x_{n-1},x_n)$. Define the passage time from $x$ to $y$, $t(x,y)$ = the infimum of the travel times over all paths from $x$ to $y$. Let $z \in \mathbf{Z}^d$ and let $X_{m,n} = t(mu, nu)$, where $u = (1, 0, \ldots, 0)$.
Clearly $X_{0,m} + X_{m,n} \ge X_{0,n}$. $X_{0,n} \ge 0$ so if $E\tau(x,y) < \infty$ then (iv) holds, and Theorem 6.4.1 implies that $X_{0,n}/n \to X$ a.s. To see that the limit is constant, enumerate the edges in some order $e_1, e_2, \ldots$ and observe that $X$ is measurable with respect to the tail σ-field of the i.i.d. sequence $\tau(e_1), \tau(e_2), \ldots$
Remark. It is not hard to see that the assumption of finite first moment can be weakened. If $\tau$ has distribution $F$ with
$$(\ast)\qquad \int_0^\infty (1 - F(x))^{2d}\,dx < \infty$$
i.e., the minimum of $2d$ independent copies has finite mean, then by finding $2d$ disjoint paths from 0 to $u = (1, 0, \ldots, 0)$, one concludes that $E\tau(0,u) < \infty$ and Theorem 6.4.1 can be applied. The condition $(\ast)$ is also necessary for $X_{0,n}/n$ to converge to a finite limit. If $(\ast)$ fails and $Y_n$ is the minimum of $t(e)$ over all the edges from $\nu$, then
$$\limsup_{n\to\infty} X_{0,n}/n \ge \limsup_{n\to\infty} Y_n/n = \infty \quad\text{a.s.}$$
Above we considered the point-to-point passage time. A second object of
interest is the point-to-line passage time:
$$a_n = \inf\{t(0,x) : x_1 = n\}$$
Unfortunately, it does not seem to be possible to embed this sequence in a subadditive family. To see the difficulty, let $\bar t(0,x)$ be the infimum of travel times over paths from 0 to $x$ that lie in $\{y : y_1 \ge 0\}$, let
$$\bar a_m = \inf\{\bar t(0,x) : x_1 = m\}$$
and let $\bar x_m$ be a point at which the infimum is achieved. We leave to the reader the highly nontrivial task of proving that such a point exists; see Smythe and Wierman (1978) for a proof. If we let $\bar a_{m,n}$ be the infimum of travel times over all paths that start at $\bar x_m$, stay in $\{y : y_1 \ge m\}$, and end on $\{y : y_1 = n\}$, then $\bar a_{m,n}$ is independent of $\bar a_m$ and
$$\bar a_m + \bar a_{m,n} \ge \bar a_n$$
The last inequality is true without the half-space restriction, but the independence is not and without the half-space restriction, we cannot get the stationarity properties needed to apply Theorem 6.4.1.
Remark. The family $\bar a_{m,n}$ is another example where $\bar a_{\ell,m} + \bar a_{m,n} \ge \bar a_{\ell,n}$ need not hold for $\ell > 0$.
A second approach to limit theorems for $a_m$ is to prove a result about the set of
points that can be reached by time $t$: $\xi_t = \{x : t(0,x) \le t\}$. Cox and Durrett (1981) have shown
Theorem 6.5.2. For any passage time distribution $F$ with $F(0) = 0$, there is a convex set $A$ so that for any $\epsilon > 0$ we have with probability one
$$\xi_t \subset (1+\epsilon)tA \quad\text{for all } t \text{ sufficiently large}$$
and $|\xi_t^c \cap (1-\epsilon)tA \cap \mathbf{Z}^d|/t^d \to 0$ as $t \to \infty$.
Ignoring the boring details of how to state things precisely, the last result says $\xi_t/t \to A$ a.s. It implies that $a_n/n \to \gamma$ a.s., where $\gamma = 1/\sup\{x_1 : x \in A\}$. (Use the convexity and reflection symmetry of $A$.) When the distribution has finite mean (or satisfies the weaker condition in the remark above), $\gamma$ is the limit of $t(0,nu)/n$. Without any assumptions, $t(0,nu)/n \to \gamma$ in probability. For more details, see the paper cited above. Kesten (1986) and (1987) are good sources for more about first-passage percolation.
Exercise 6.5.4. Oriented first-passage percolation. Consider a graph with vertices $\{(m,n) \in \mathbf{Z}^2 : m + n \text{ is even and } n \le 0\}$, and oriented edges connecting $(m,n)$ to $(m+1,n-1)$ and $(m,n)$ to $(m-1,n-1)$. Assign i.i.d. exponential mean one r.v.'s to each edge. Thinking of the number on edge $e$ as giving the time it takes water to travel down the edge, define $t(m,n)$ = the time at which the fluid first reaches $(m,n)$, and $a_n = \inf_m \{t(m,-n)\}$. Show that as $n \to \infty$, $a_n/n$ converges to a limit $\gamma$ a.s.
Exercise 6.5.5. Continuing with the set up in the last exercise: (i) Show γ ≤ 1/2
by considering a1 . (ii) Get a positive lower bound on γ by looking at the expected
number of paths down to {(m, −n) : −n ≤ m ≤ n} with passage time ≤ an and using
results from Section 1.9.
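The quantity $a_n$ in Exercises 6.5.4 and 6.5.5 is straightforward to simulate, because the oriented graph can be processed one level at a time. The sketch below is an illustration under assumptions made here: the fluid is taken to start at $(0,0)$ at time 0, each vertex $(m,n)$ feeds $(m-1,n-1)$ and $(m+1,n-1)$, and the function name `oriented_passage`, the seed, and the depth are arbitrary choices. It keeps the first arrival time at each vertex of the current level and returns the minimum over the last level.

```python
import random

def oriented_passage(levels, seed=0):
    """Simulated a_n: earliest arrival time at depth n = levels, starting from (0, 0)."""
    rng = random.Random(seed)
    times = {0: 0.0}                       # horizontal position -> first arrival time
    for _ in range(levels):
        nxt = {}
        for m, t in times.items():
            for child in (m - 1, m + 1):   # the two oriented edges leaving (m, n)
                arrival = t + rng.expovariate(1.0)   # fresh Exp(1) time per edge
                if arrival < nxt.get(child, float("inf")):
                    nxt[child] = arrival
        times = nxt
    return min(times.values())

n = 400
ratio = oriented_passage(n) / n            # one sample of a_n / n
```

A single run gives only a rough idea of $\gamma$, but it should fall below the upper bound $1/2$ from part (i) of Exercise 6.5.5 and above the first-moment lower bound from part (ii).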
Remark. If we replace the graph in Exercise 6.5.4 by a binary tree, then we get a problem equivalent to the first birth problem (Example 6.5.3) for $p_2 = 1$, $P(t_i > x) = e^{-x}$. In that case, the lower bound obtained by the methods of part (ii) of Exercise 6.5.5 was sharp, but in this case it is not.

Chapter 7 Brownian Motion
Brownian motion is a process of tremendous practical and theoretical signiﬁcance.
It originated (a) as a model of the phenomenon observed by Robert Brown in 1828
that “pollen grains suspended in water perform a continual swarming motion,” and
(b) in Bachelier’s (1900) work as a model of the stock market. These are just two
of many systems that Brownian motion has been used to model. On the theoretical
side, Brownian motion is a Gaussian Markov process with stationary independent
increments. It lies in the intersection of three important classes of processes and is a
fundamental example in each theory.
The first part of this chapter develops properties of Brownian motion. In Section 7.1, we define Brownian motion and investigate continuity properties of its paths. In Section 7.2, we prove the Markov property and a related 0-1 law. In Section 7.3, we define stopping times and prove the strong Markov property. In Section 7.4, we take a close look at the zero set of Brownian motion. In Section 7.5, we introduce some martingales associated with Brownian motion and use them to obtain information about its properties.
The second part of this chapter applies Brownian motion to some of the problems considered in Chapters 1 and 2. In Section 7.6, we embed random walks into Brownian motion to prove Donsker's theorem, a far-reaching generalization of the central limit theorem. In Section 7.7, we extend Donsker's theorem to martingales satisfying "Lindeberg-Feller conditions." In Section 7.8, we show that the discrepancy between the empirical distribution and the true distribution when suitably magnified converges to Brownian bridge. In Section 7.9, we prove laws of the iterated logarithm for Brownian motion and random walks with finite variance. The last three sections
depend on Section 7.6 but are independent of each other and can be read in any order.

7.1 Definition and Construction

A one-dimensional Brownian motion is a real-valued process $B_t$, $t \ge 0$ that has the following properties:
(a) If $t_0 < t_1 < \ldots < t_n$ then $B(t_0), B(t_1) - B(t_0), \ldots, B(t_n) - B(t_{n-1})$ are independent.
(b) If $s, t \ge 0$ then
$$P(B(s+t) - B(s) \in A) = \int_A (2\pi t)^{-1/2} \exp(-x^2/2t)\,dx$$
(c) With probability one, $t \to B_t$ is continuous.
(a) says that $B_t$ has independent increments. (b) says that the increment $B(s+t) - B(s)$ has a normal distribution with mean 0 and variance $t$. (c) is self-explanatory. Thinking of Brown's pollen grain (c) is certainly reasonable. (a) and (b) can be justified by noting that the movement of the pollen grain is due to the net effect of the bombardment of millions of water molecules, so by the central limit theorem, the displacement in any one interval should have a normal distribution, and the displacements in two disjoint intervals should be independent.
Two immediate consequences of the deﬁnition that will be useful many times are:
Translation invariance. {Bt − B0 , t ≥ 0} is independent of B0 and has the same
distribution as a Brownian motion with B0 = 0.
Proof. Let A1 = σ (B0 ) and A2 be the events of the form
{B (t1 ) − B (t0 ) ∈ A1 , . . . , B (tn ) − B (tn−1 ) ∈ An }.
The Ai are π systems that are independent, so the desired result follows from the
π − λ theorem 1.4.2.
The Brownian scaling relation. If $B_0 = 0$ then for any $t > 0$,
$$\{B_{st}, s \ge 0\} \stackrel{d}{=} \{t^{1/2}B_s, s \ge 0\} \tag{7.1.1}$$
To be precise, the two families of r.v.'s have the same finite dimensional distributions, i.e., if $s_1 < \ldots < s_n$ then
$$(B_{s_1 t}, \ldots, B_{s_n t}) \stackrel{d}{=} (t^{1/2}B_{s_1}, \ldots, t^{1/2}B_{s_n})$$
Proof. To check this when $n = 1$, we note that $t^{1/2}$ times a normal with mean 0 and variance $s$ is a normal with mean 0 and variance $st$. The result for $n > 1$ follows from independent increments.
A second equivalent definition of Brownian motion starting from $B_0 = 0$, that we will occasionally find useful is that $B_t$, $t \ge 0$, is a real-valued process satisfying
(a′) $B(t)$ is a Gaussian process (i.e., all its finite dimensional distributions are multivariate normal).
(b′) $EB_s = 0$ and $EB_sB_t = s \wedge t$.
(c′) With probability one, $t \to B_t$ is continuous.
It is easy to see that (a) and (b) imply (a′). To get (b′) from (a) and (b), suppose $s < t$ and write
$$EB_sB_t = E(B_s^2) + E(B_s(B_t - B_s)) = s$$
The converse is even easier. (a′) and (b′) specify the finite dimensional distributions of $B_t$, which by the last calculation must agree with the ones defined in (a) and (b).
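The covariance formula $EB_sB_t = s \wedge t$ can be sanity-checked by Monte Carlo. The sketch below is only an illustration under assumptions made here: it builds the pair $(B_s, B_t)$ for $s < t$ directly from properties (a) and (b), taking $B_0 = 0$, $B_s \sim N(0,s)$ and an independent increment $B_t - B_s \sim N(0, t-s)$, and the function name `cov_estimate`, the sample size, and the seed are arbitrary. It averages the product over many samples.

```python
import math
import random

def cov_estimate(s, t, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[B_s B_t] for 0 < s < t, with B_0 = 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        b_s = math.sqrt(s) * rng.gauss(0.0, 1.0)            # B_s ~ N(0, s)
        b_t = b_s + math.sqrt(t - s) * rng.gauss(0.0, 1.0)  # add independent increment
        total += b_s * b_t
    return total / n_samples

estimate = cov_estimate(0.5, 2.0)   # compare with min(0.5, 2.0) = 0.5
```

The cross term $E(B_s(B_t - B_s))$ averages out to roughly zero in the simulation, which is the independence step in the calculation above.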
The first question that must be addressed in any treatment of Brownian motion is, "Is there a process with these properties?" The answer is "Yes," of course, or this chapter would not exist. For pedagogical reasons, we will pursue an approach that leads to a dead end and then retreat a little to rectify the difficulty. Fix an $x \in \mathbf{R}$ and for each $0 < t_1 < \ldots < t_n$, define a measure on $\mathbf{R}^n$ by
$$\mu_{x,t_1,\ldots,t_n}(A_1 \times \ldots \times A_n) = \int_{A_1} dx_1 \cdots \int_{A_n} dx_n \prod_{m=1}^n p_{t_m - t_{m-1}}(x_{m-1}, x_m) \tag{7.1.2}$$
where $A_i \in \mathcal{R}$, $x_0 = x$, $t_0 = 0$, and
$$p_t(a,b) = (2\pi t)^{-1/2} \exp(-(b-a)^2/2t)$$
From the formula above, it is easy to see that for fixed $x$ the family $\mu$ is a consistent set of finite dimensional distributions (f.d.d.'s), that is, if $\{s_1, \ldots, s_{n-1}\} \subset \{t_1, \ldots, t_n\}$ and $t_j \notin \{s_1, \ldots, s_{n-1}\}$ then
$$\mu_{x,s_1,\ldots,s_{n-1}}(A_1 \times \cdots \times A_{n-1}) = \mu_{x,t_1,\ldots,t_n}(A_1 \times \cdots \times A_{j-1} \times \mathbf{R} \times A_j \times \cdots \times A_{n-1})$$
This is clear when $j = n$. To check the equality when $1 \le j < n$, it is enough to show that
$$\int p_{t_j - t_{j-1}}(x,y)\, p_{t_{j+1} - t_j}(y,z)\,dy = p_{t_{j+1} - t_{j-1}}(x,z)$$
By translation invariance, we can without loss of generality assume $x = 0$, but all this says is that the sum of independent normals with mean 0 and variances $t_j - t_{j-1}$ and $t_{j+1} - t_j$ has a normal distribution with mean 0 and variance $t_{j+1} - t_{j-1}$.
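The convolution identity behind the consistency check, $\int p_s(x,y)\,p_t(y,z)\,dy = p_{s+t}(x,z)$, can also be verified numerically. The sketch below is a minimal check, assuming a plain trapezoid rule on a wide truncated interval; the function names `p` and `convolve`, the truncation `half_width`, the step count, and the sample values of $s$, $t$, $x$, $z$ are all choices made here. It compares the numerical integral with the closed form.

```python
import math

def p(t, a, b):
    """Gaussian transition density p_t(a, b) = (2 pi t)^(-1/2) exp(-(b - a)^2 / 2t)."""
    return math.exp(-(b - a) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)

def convolve(s, t, x, z, half_width=40.0, steps=20_000):
    """Trapezoid-rule approximation of the integral of p_s(x, y) p_t(y, z) over y."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        y = -half_width + i * h
        w = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        total += w * p(s, x, y) * p(t, y, z)
    return total * h

lhs = convolve(0.7, 1.3, 0.2, -1.1)   # numerical integral over y
rhs = p(0.7 + 1.3, 0.2, -1.1)         # closed form p_{s+t}(x, z)
```

The trapezoid rule is very accurate for smooth, rapidly decaying integrands like this one, so `lhs` and `rhs` should agree to many digits.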
With the consistency of f.d.d.'s verified, we get our first construction of Brownian motion:
Theorem 7.1.1. Let $\Omega_o = \{\text{functions } \omega : [0,\infty) \to \mathbf{R}\}$ and $\mathcal{F}_o$ be the σ-field generated by the finite dimensional sets $\{\omega : \omega(t_i) \in A_i \text{ for } 1 \le i \le n\}$, where $A_i \in \mathcal{R}$. For each $x \in \mathbf{R}$, there is a unique probability measure $\nu_x$ on $(\Omega_o, \mathcal{F}_o)$ so that $\nu_x\{\omega : \omega(0) = x\} = 1$ and when $0 < t_1 < \ldots < t_n$
$$\nu_x\{\omega : \omega(t_i) \in A_i\} = \mu_{x,t_1,\ldots,t_n}(A_1 \times \cdots \times A_n) \tag{7.1.3}$$
This follows from a generalization of Kolmogorov's extension theorem, (7.1) in the Appendix. We will not bother with the details since at this point we are at the dead end referred to above. If $C = \{\omega : t \to \omega(t) \text{ is continuous}\}$ then $C \notin \mathcal{F}_o$, that is, $C$ is not a measurable set. The easiest way of proving $C \notin \mathcal{F}_o$ is to do:
Exercise 7.1.1. $A \in \mathcal{F}_o$ if and only if there is a sequence of times $t_1, t_2, \ldots$ in $[0,\infty)$ and a $B \in \mathcal{R}^{\{1,2,\ldots\}}$ so that $A = \{\omega : (\omega(t_1), \omega(t_2), \ldots) \in B\}$. In words, all events in $\mathcal{F}_o$ depend on only countably many coordinates.
The above problem is easy to solve. Let Q2 = {m2−n : m, n ≥ 0} be the dyadic
rationals. If Ωq = {ω : Q2 → R} and Fq is the σ ﬁeld generated by the ﬁnite
dimensional sets, then enumerating the rationals q1 , q2 , . . . and applying Kolmogorov’s
extension theorem shows that we can construct a probability νx on (Ωq , Fq ) so that
νx {ω : ω (0) = x} = 1 and (7.1.3) holds when the ti ∈ Q2 . To extend Bt to a process
deﬁned on [0, ∞), we will show:
Theorem 7.1.2. Let T < ∞ and x ∈ R. νx assigns probability one to paths ω :
Q2 → R that are uniformly continuous on Q2 ∩ [0, T ].
Remark. It will take quite a bit of work to prove Theorem 7.1.2. Before taking on
that task, we will attend to the last measure theoretic detail: We tidy things up by
moving our probability measures to (C, C ), where C = {continuous ω : [0, ∞) → R}
and C is the σ ﬁeld generated by the coordinate maps t → ω (t). To do this, we
observe that the map ψ that takes a uniformly continuous point in Ωq to its unique
continuous extension in C is measurable, and we set
$$P_x = \nu_x \circ \psi^{-1}$$
Our construction guarantees that $B_t(\omega) = \omega_t$ has the right finite dimensional distributions for $t \in Q_2$. Continuity of paths and a simple limiting argument show that
this is true when $t \in [0, \infty)$. Finally, the reader should note that, as in the case of
Markov chains, we have one set of random variables Bt (ω ) = ω (t), and a family of
probability measures Px , x ∈ R, so that under Px , Bt is a Brownian motion with
Px (B0 = x) = 1.
Proof. By translation invariance and scaling (7.1.1), we can without loss of generality
suppose B0 = 0 and prove the result for T = 1. In this case, part (b) of the deﬁnition
and the scaling relation imply
$$E_0(B_t - B_s)^4 = E_0 B_{t-s}^4 = C(t-s)^2$$
where $C = E_0 B_1^4 < \infty$. From the last observation, we get the desired uniform
continuity by using the following result due to Kolmogorov.
Theorem 7.1.3. Suppose $E|X_s - X_t|^\beta \le K|t - s|^{1+\alpha}$ where $\alpha, \beta > 0$. If $\gamma < \alpha/\beta$
then with probability one there is a constant $C(\omega)$ so that
$$|X(q) - X(r)| \le C|q - r|^\gamma \quad \text{for all } q, r \in Q_2 \cap [0, 1]$$
Proof. Let $\gamma < \alpha/\beta$, $\eta > 0$, $I_n = \{(i, j) : 0 \le i \le j \le 2^n,\ 0 < j - i \le 2^{n\eta}\}$ and
$G_n = \{|X(j2^{-n}) - X(i2^{-n})| \le ((j - i)2^{-n})^\gamma \text{ for all } (i, j) \in I_n\}$. Since $a^\beta P(|Y| > a) \le E|Y|^\beta$, we have
$$P(G_n^c) \le \sum_{(i,j) \in I_n} ((j - i)2^{-n})^{-\beta\gamma}\, E|X(j2^{-n}) - X(i2^{-n})|^\beta$$
Using our assumption and then noticing the number of $(i, j) \in I_n$ is $\le 2^n 2^{n\eta}$, we see
that the right-hand side is (for the second step note $\alpha - \beta\gamma > 0$)
$$\le K \sum_{(i,j) \in I_n} ((j - i)2^{-n})^{-\beta\gamma+1+\alpha} \le K 2^n 2^{n\eta} (2^{n\eta} 2^{-n})^{-\beta\gamma+1+\alpha} = K 2^{-n\lambda}$$
where $\lambda = (1 - \eta)(1 + \alpha - \beta\gamma) - (1 + \eta)$. Since $\gamma < \alpha/\beta$, we can pick $\eta$ small enough
so that $\lambda > 0$. To complete the proof now, we will show:
Lemma 7.1.4. Let $A = 3 \cdot 2^{(1-\eta)\gamma}/(1 - 2^{-\gamma})$. On $H_N = \cap_{n=N}^\infty G_n$ we have
$|X(q) - X(r)| \le A|q - r|^\gamma$ for $q, r \in Q_2 \cap [0, 1]$ with $|q - r| < 2^{-(1-\eta)N}$.
To show that Lemma 7.1.4 implies Theorem 7.1.3 we note that the trivial inequality
$$P(H_N^c) \le \sum_{n=N}^\infty P(G_n^c) \le K \sum_{n=N}^\infty 2^{-n\lambda} = K 2^{-N\lambda}/(1 - 2^{-\lambda})$$
implies that $|X(q) - X(r)| \le A|q - r|^\gamma$ for $q, r \in Q_2$ with $|q - r| < \delta(\omega)$. To extend this
to $q, r \in Q_2 \cap [0, 1]$, let $s_0 = q < s_1 < \ldots < s_n = r$ with $|s_i - s_{i-1}| < \delta(\omega)$ and use the
triangle inequality to conclude $|X(q) - X(r)| \le C(\omega)|q - r|^\gamma$ where $C(\omega) = 1 + \delta(\omega)^{-1}$.
Proof of Lemma 7.1.4. Let $q, r \in Q_2 \cap [0, 1]$ with $0 < r - q < 2^{-(1-\eta)N}$. Pick $m \ge N$
so that
$$2^{-(m+1)(1-\eta)} \le r - q < 2^{-m(1-\eta)}$$
and write
$$r = j2^{-m} + 2^{-r(1)} + \cdots + 2^{-r(\ell)}, \qquad q = i2^{-m} - 2^{-q(1)} - \cdots - 2^{-q(k)}$$
where $m < r(1) < \cdots < r(\ell)$ and $m < q(1) < \cdots < q(k)$. Now $0 < r - q < 2^{-m(1-\eta)}$,
so $(j - i) < 2^{m\eta}$, and it follows that on $H_N$
$$|X(i2^{-m}) - X(j2^{-m})| \le ((2^{m\eta})2^{-m})^\gamma \tag{a}$$
On $H_N$, it follows from the triangle inequality that
$$|X(q) - X(i2^{-m})| \le \sum_{h=1}^{k} (2^{-q(h)})^\gamma \le \sum_{h=m}^{\infty} (2^{-\gamma})^h \le C_\gamma 2^{-\gamma m} \tag{b}$$
where $C_\gamma = 1/(1 - 2^{-\gamma}) > 1$. Repeating the last computation shows
$$|X(r) - X(j2^{-m})| \le C_\gamma 2^{-\gamma m} \tag{c}$$
Combining (a)–(c) gives
$$|X(q) - X(r)| \le 3C_\gamma 2^{-\gamma m(1-\eta)} \le 3C_\gamma 2^{(1-\eta)\gamma} |r - q|^\gamma$$
since $2^{-m(1-\eta)} \le 2^{1-\eta}|r - q|$. This completes the proof of Lemma 7.1.4 and hence of
Theorems 7.1.3 and 7.1.2.
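The choice of η in the proof of Theorem 7.1.3 can be made concrete. Writing g = α − βγ, the exponent simplifies to λ = g − η(2 + g), so λ > 0 exactly when η < g/(2 + g). The following is a minimal Python sketch of that arithmetic; the function names and the parameter values (β = 4, α = 1, γ = 1/8) are our own illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def lam(alpha, beta, gamma, eta):
    """Exponent lambda = (1 - eta)(1 + alpha - beta*gamma) - (1 + eta)."""
    return (1 - eta) * (1 + alpha - beta * gamma) - (1 + eta)

def eta_threshold(alpha, beta, gamma):
    """Value of eta at which lambda vanishes; lambda > 0 for smaller eta."""
    gap = alpha - beta * gamma      # positive exactly when gamma < alpha/beta
    return gap / (2 + gap)

# Hypothetical parameters: beta = 4, alpha = 1 (a fourth-moment bound)
# and gamma = 1/8, which is below alpha/beta = 1/4.
alpha, beta, gamma = Fraction(1), Fraction(4), Fraction(1, 8)
eta_star = eta_threshold(alpha, beta, gamma)
eta = eta_star / 2                  # any eta below the threshold works
print("threshold:", eta_star, "lambda at eta/2:", lam(alpha, beta, gamma, eta))
```

Exact rational arithmetic is used so that the sign of λ is not blurred by rounding.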
The scaling relation, (7.1.1), implies
$$E|B_t - B_s|^{2m} = C_m |t - s|^m \quad \text{where } C_m = E|B_1|^{2m}$$
so using Theorem 7.1.3 with $\beta = 2m$, $\alpha = m - 1$ and letting $m \to \infty$ gives a result of
Wiener (1923).
Theorem 7.1.5. Brownian paths are Hölder continuous with exponent $\gamma$ for any
$\gamma < 1/2$.
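To see why letting m → ∞ reaches every exponent below 1/2, one can tabulate the ratio α/β = (m − 1)/(2m) from the argument above. The sketch below does this, and also computes the constants C_m using the standard fact that the 2m-th moment of a standard normal is (2m − 1)!!. The function names are our own and purely illustrative.

```python
from fractions import Fraction

def normal_even_moment(m):
    """E[B_1^(2m)] for a standard normal: (2m-1)!! = 1*3*5*...*(2m-1)."""
    result = 1
    for odd in range(1, 2 * m, 2):
        result *= odd
    return result

def holder_bound(m):
    """alpha/beta with beta = 2m and alpha = m - 1, as in the text."""
    return Fraction(m - 1, 2 * m)

for m in (2, 3, 10, 100, 1000):
    print(m, holder_bound(m), float(holder_bound(m)))
print("C_2 =", normal_even_moment(2), " C_3 =", normal_even_moment(3))
```

The bound increases with m and stays strictly below 1/2, which is why the theorem covers all γ < 1/2 but not γ = 1/2.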
It is easy to show:
Theorem 7.1.6. With probability one, Brownian paths are not Lipschitz continuous
(and hence not diﬀerentiable) at any point.
Remark. The nondifferentiability of Brownian paths was discovered by Paley, Wiener,
and Zygmund (1933). Paley died in 1933 at the age of 26 in a skiing accident while
the paper was in press. The proof we are about to give is due to Dvoretzky, Erdős,
and Kakutani (1961).
Proof. Fix a constant $C < \infty$ and let $A_n = \{\omega$ : there is an $s \in [0, 1]$ so that
$|B_t - B_s| \le C|t - s|$ when $|t - s| \le 3/n\}$. For $1 \le k \le n - 2$, let
$$Y_{k,n} = \max\left\{ \left| B\!\left(\tfrac{k+j}{n}\right) - B\!\left(\tfrac{k+j-1}{n}\right) \right| : j = 0, 1, 2 \right\}$$
$$B_n = \{\text{at least one } Y_{k,n} \le 5C/n\}$$
The triangle inequality implies $A_n \subset B_n$. The worst case is $s = 1$. We pick $k = n - 2$
and observe
$$\left| B\!\left(\tfrac{n-3}{n}\right) - B\!\left(\tfrac{n-2}{n}\right) \right| \le \left| B\!\left(\tfrac{n-3}{n}\right) - B(1) \right| + \left| B(1) - B\!\left(\tfrac{n-2}{n}\right) \right| \le C(3/n + 2/n)$$
Using $A_n \subset B_n$ and the scaling relation (7.1.1) gives
$$P(A_n) \le P(B_n) \le nP(|B(1/n)| \le 5C/n)^3 = nP(|B(1)| \le 5C/n^{1/2})^3 \le n\{(10C/n^{1/2}) \cdot (2\pi)^{-1/2}\}^3$$
since $\exp(-x^2/2) \le 1$. Letting $n \to \infty$ shows $P(A_n) \to 0$. Noticing $n \to A_n$ is
increasing shows $P(A_n) = 0$ for all $n$ and completes the proof.
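The final bound in the proof equals (10C)^3 (2π)^{-3/2} n^{-1/2}, so it decays like n^{-1/2}; the three independent increments are what beat the factor n. A small Python sketch of this, with a hypothetical value C = 1 and a function name of our own choosing:

```python
import math

def nondiff_bound(n, C):
    """Upper bound n * ((10*C / sqrt(n)) / sqrt(2*pi))**3 on P(A_n)."""
    return n * ((10.0 * C / math.sqrt(n)) / math.sqrt(2.0 * math.pi)) ** 3

C = 1.0  # hypothetical Lipschitz constant, chosen only for illustration
for n in (10**2, 10**4, 10**6, 10**8):
    print(n, nondiff_bound(n, C))
```

For small n the bound exceeds 1 and carries no information; only the limit n → ∞ matters for the argument.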
Exercise 7.1.2. Looking at the proof of Theorem 7.1.6 carefully shows that if $\gamma > 5/6$
then $B_t$ is not Hölder continuous with exponent $\gamma$ at any point in [0,1]. Show, by
considering $k$ increments instead of 3, that the last conclusion is true for all $\gamma > 1/2 + 1/k$.
The next result is more evidence that the sample paths of Brownian motion behave
locally like $\sqrt{t}$.
Exercise 7.1.3. Fix $t$ and let $\Delta_{m,n} = B(tm2^{-n}) - B(t(m - 1)2^{-n})$. Compute
$$E\left( \sum_{m \le 2^n} \Delta_{m,n}^2 - t \right)^2$$
and use Borel-Cantelli to conclude that $\sum_{m \le 2^n} \Delta_{m,n}^2 \to t$ a.s. as $n \to \infty$.
Remark. The last result is true if we consider a sequence of partitions $\Pi_1 \subset \Pi_2 \subset \ldots$
with mesh $\to 0$. See Freedman (1971a) p. 42–46. However, the true quadratic
variation, defined as the sup over all partitions, is ∞.
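The convergence of the dyadic sums of squared increments to t is easy to observe numerically. The sketch below simulates the 2^n increments directly as independent N(0, t2^{-n}) variables. It is only an illustration under stated assumptions: the seed and function name are arbitrary, and each n uses a fresh set of increments rather than refinements of one path, so it shows concentration around t rather than almost sure convergence along a single path.

```python
import math
import random

def dyadic_quadratic_variation(t, n, rng):
    """Sum of squared increments of a simulated Brownian path over the
    partition of [0, t] into 2**n equal pieces."""
    pieces = 2 ** n
    sd = math.sqrt(t / pieces)          # each increment is N(0, t / 2**n)
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(pieces))

rng = random.Random(12345)              # fixed seed, chosen arbitrarily
t = 1.0
for n in (4, 8, 12, 14):
    print(n, dyadic_quadratic_variation(t, n, rng))
```

The variance of the sum is 2t^2 2^{-n}, which is summable in n; that is the computation the exercise asks for before applying Borel-Cantelli.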
Multidimensional Brownian motion
All of the results in this section have been for one-dimensional Brownian motion.
To define a $d$-dimensional Brownian motion starting at $x \in \mathbf{R}^d$ we let $B_t^1, \ldots, B_t^d$ be
independent Brownian motions with $B_0^i = x_i$. As in the case $d = 1$ these are realized as
probability measures $P_x$ on $(C, \mathcal{C})$ where $C = \{\text{continuous } \omega : [0, \infty) \to \mathbf{R}^d\}$ and $\mathcal{C}$ is
the σ-field generated by the coordinate maps. Since the coordinates are independent,
it is easy to see that the finite dimensional distributions satisfy (7.1.2) with transition
probability
$$p_t(x, y) = (2\pi t)^{-d/2} \exp(-|y - x|^2/2t) \tag{7.1.4}$$
7.2 Markov Property, Blumenthal’s 0–1 Law
Intuitively, the Markov property says “if $s \ge 0$ then $B(t + s) - B(s)$, $t \ge 0$ is a
Brownian motion that is independent of what happened before time s.” The ﬁrst
step in making this into a precise statement is to explain what we mean by “what
happened before time s.” The ﬁrst thing that comes to mind is
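$$\mathcal{F}_s^o = \sigma(B_r : r \le s)$$
For reasons that will become clear as we go along, it is convenient to replace $\mathcal{F}_s^o$ by
$$\mathcal{F}_s^+ = \cap_{t>s} \mathcal{F}_t^o$$
The fields $\mathcal{F}_s^+$ are nicer because they are right continuous:
$$\cap_{t>s} \mathcal{F}_t^+ = \cap_{t>s} \left( \cap_{u>t} \mathcal{F}_u^o \right) = \cap_{u>s} \mathcal{F}_u^o = \mathcal{F}_s^+$$
In words, the $\mathcal{F}_s^+$ allow us an “infinitesimal peek at the future,” i.e., $A \in \mathcal{F}_s^+$ if it is
in $\mathcal{F}_{s+\epsilon}^o$ for any $\epsilon > 0$. If $f(u) > 0$ for all $u > 0$, then in $d = 1$ the random variable
$$\limsup_{t \downarrow s} \frac{B_t - B_s}{f(t - s)}$$
is measurable with respect to $\mathcal{F}_s^+$ but not $\mathcal{F}_s^o$. We will see below that there are no
interesting examples, i.e., $\mathcal{F}_s^+$ and $\mathcal{F}_s^o$ are the same (up to null sets).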
To state the Markov property, we need some notation. Recall that we have a
family of measures Px , x ∈ Rd , on (C, C ) so that under Px , Bt (ω ) = ω (t) is a
Brownian motion starting at x. For s ≥ 0, we deﬁne the shift transformation
θs : C → C by
$$(\theta_s \omega)(t) = \omega(s + t) \quad \text{for } t \ge 0$$
In words, we cut off the part of the path before time $s$ and then shift the path so that
time $s$ becomes time 0.
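The shift transformation is just composition with a time translation, which a few lines of Python can make explicit. In this minimal sketch a path is modeled as a function of t ≥ 0; the helper name `shift` and the deterministic sample path are our own hypothetical choices for illustration.

```python
def shift(s, omega):
    """Return the shifted path theta_s(omega): t -> omega(s + t), t >= 0."""
    def shifted(t):
        if t < 0:
            raise ValueError("paths are only defined for t >= 0")
        return omega(s + t)
    return shifted

# A hypothetical deterministic path standing in for omega, for illustration.
omega = lambda t: t * t - 3.0 * t

theta2 = shift(2.0, omega)
print(theta2(0.0), omega(2.0))          # time 2 of omega becomes time 0
print(shift(1.0, shift(2.0, omega))(0.5), shift(3.0, omega)(0.5))
```

The second line printed illustrates the semigroup property: shifting by 2 and then by 1 gives the same path as shifting by 3.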
Theorem 7.2.1. Markov property. If $s \ge 0$ and $Y$ is bounded and $\mathcal{C}$ measurable,
then for all $x \in \mathbf{R}^d$
$$E_x(Y \circ \theta_s \mid \mathcal{F}_s^+) = E_{B_s} Y$$
where the right-hand side is the function $\varphi(x) = E_x Y$ evaluated at $x = B_s$.
Proof. By the definition of conditional expectation, what we need to show is that
$$E_x(Y \circ \theta_s; A) = E_x(E_{B_s} Y; A) \quad \text{for all } A \in \mathcal{F}_s^+ \tag{7.2.1}$$
We will begin by proving the result for a carefully chosen special case and then
use the monotone class theorem (MCT) to get the general case. Suppose $Y(\omega) = \prod_{1 \le m \le n} f_m(\omega(t_m))$, where $0 < t_1 < \ldots < t_n$ and the $f_m$ are bounded and measurable.
Let $0 < h < t_1$, let $0 < s_1 < \ldots < s_k \le s + h$, and let $A = \{\omega : \omega(s_j) \in A_j, 1 \le j \le k\}$,
where $A_j \in \mathcal{R}$ for $1 \le j \le k$. From the definition of Brownian motion, it follows that
$$E_x(Y \circ \theta_s; A) = \int_{A_1} dx_1\, p_{s_1}(x, x_1) \int_{A_2} dx_2\, p_{s_2 - s_1}(x_1, x_2) \cdots \int_{A_k} dx_k\, p_{s_k - s_{k-1}}(x_{k-1}, x_k) \int dy\, p_{s+h-s_k}(x_k, y)\, \varphi(y, h)$$
where
$$\varphi(y, h) = \int dy_1\, p_{t_1 - h}(y, y_1) f_1(y_1) \ldots \int dy_n\, p_{t_n - t_{n-1}}(y_{n-1}, y_n) f_n(y_n)$$
For more details, see the proof of (5.1.3), which applies without change here. Using
that identity on the right-hand side, we have
$$E_x(Y \circ \theta_s; A) = E_x(\varphi(B_{s+h}, h); A) \tag{7.2.2}$$
The last equality holds for all finite dimensional sets $A$ so the π−λ theorem, Theorem
1.4.2, implies that it is valid for all $A \in \mathcal{F}_{s+h}^o \supset \mathcal{F}_s^+$.
It is easy to see by induction on $n$ that
$$\psi(y_1) = f_1(y_1) \int dy_2\, p_{t_2 - t_1}(y_1, y_2) f_2(y_2) \ldots \int dy_n\, p_{t_n - t_{n-1}}(y_{n-1}, y_n) f_n(y_n)$$
is bounded and measurable. Letting $h \downarrow 0$ and using the dominated convergence
theorem shows that if $x_h \to x$, then
$$\varphi(x_h, h) = \int dy_1\, p_{t_1 - h}(x_h, y_1)\, \psi(y_1) \to \varphi(x, 0)$$
as $h \downarrow 0$. Using (7.2.2) and the bounded convergence theorem now gives
$$E_x(Y \circ \theta_s; A) = E_x(\varphi(B_s, 0); A)$$
for all $A \in \mathcal{F}_s^+$. This shows that (7.2.1) holds when $Y = \prod_{1 \le m \le n} f_m(\omega(t_m))$ and the $f_m$
are bounded and measurable.
The desired conclusion now follows from the monotone class theorem, 5.1.3. Let
$\mathcal{H}$ = the collection of bounded functions for which (7.2.1) holds. $\mathcal{H}$ clearly has properties
(ii) and (iii). Let $\mathcal{A}$ be the collection of sets of the form $\{\omega : \omega(t_j) \in A_j\}$, where
$A_j \in \mathcal{R}$. The special case treated above shows (i) holds and the desired conclusion
follows.
The next two exercises give typical applications of the Markov property. In Section
7.4, we will use these equalities to compute the distributions of $L$ and $R$.
Exercise 7.2.1. Let $T_0 = \inf\{s > 0 : B_s = 0\}$ and let $R = \inf\{t > 1 : B_t = 0\}$. $R$ is
for right or return. Use the Markov property at time 1 to get
$$P_x(R > 1 + t) = \int p_1(x, y) P_y(T_0 > t)\, dy \tag{7.2.3}$$
Exercise 7.2.2. Let $T_0 = \inf\{s > 0 : B_s = 0\}$ and let $L = \sup\{t \le 1 : B_t = 0\}$. $L$ is
for left or last. Use the Markov property at time $0 < t < 1$ to conclude
$$P_0(L \le t) = \int p_t(0, y) P_y(T_0 > 1 - t)\, dy \tag{7.2.4}$$
The reader will see many applications of the Markov property below, so we turn
our attention now to a “triviality” that has surprising consequences. Since
$$E_x(Y \circ \theta_s \mid \mathcal{F}_s^+) = E_{B(s)} Y \in \mathcal{F}_s^o$$
it follows from (1.1) in Chapter 5 that
$$E_x(Y \circ \theta_s \mid \mathcal{F}_s^+) = E_x(Y \circ \theta_s \mid \mathcal{F}_s^o)$$
From the last equation, it is a short step to:
Theorem 7.2.2. If $Z \in \mathcal{C}$ is bounded then for all $s \ge 0$ and $x \in \mathbf{R}^d$,
$$E_x(Z \mid \mathcal{F}_s^+) = E_x(Z \mid \mathcal{F}_s^o)$$
Proof. As in the proof of Theorem 7.2.1, it suffices to prove the result when
$$Z = \prod_{m=1}^{n} f_m(B(t_m))$$
and the $f_m$ are bounded and measurable. In this case, $Z$ can be written as $X(Y \circ \theta_s)$,
where $X \in \mathcal{F}_s^o$ and $Y$ is $\mathcal{C}$ measurable, so
$$E_x(Z \mid \mathcal{F}_s^+) = X E_x(Y \circ \theta_s \mid \mathcal{F}_s^+) = X E_{B_s} Y \in \mathcal{F}_s^o$$
and the proof is complete.
If we let $Z \in \mathcal{F}_s^+$, then Theorem 7.2.2 implies $Z = E_x(Z \mid \mathcal{F}_s^o) \in \mathcal{F}_s^o$, so the two
σ-fields are the same up to null sets. At first glance, this conclusion is not exciting.
The fun starts when we take $s = 0$ in Theorem 7.2.2 to get:
Theorem 7.2.3. Blumenthal’s 0–1 law. If $A \in \mathcal{F}_0^+$ then for all $x \in \mathbf{R}^d$, $P_x(A) \in \{0, 1\}$.
Proof. Using $A \in \mathcal{F}_0^+$, Theorem 7.2.2, and the fact that $\mathcal{F}_0^o = \sigma(B_0)$ is trivial under $P_x$ gives
$$1_A = E_x(1_A \mid \mathcal{F}_0^+) = E_x(1_A \mid \mathcal{F}_0^o) = P_x(A) \quad P_x \text{ a.s.}$$
This shows that the indicator function $1_A$ is a.s. equal to the number $P_x(A)$, and the
result follows.
In words, the last result says that the germ field, $\mathcal{F}_0^+$, is trivial. This result
is very useful in studying the local behavior of Brownian paths. For the rest of the
section we restrict our attention to $d = 1$.
Theorem 7.2.4. If $\tau = \inf\{t \ge 0 : B_t > 0\}$ then $P_0(\tau = 0) = 1$.
Proof. $P_0(\tau \le t) \ge P_0(B_t > 0) = 1/2$ since the normal distribution is symmetric
about 0. Letting $t \downarrow 0$, we conclude
$$P_0(\tau = 0) = \lim_{t \downarrow 0} P_0(\tau \le t) \ge 1/2$$
so it follows from Theorem 7.2.3 that $P_0(\tau = 0) = 1$.
Since Brownian motion starting from 0 must hit (0, ∞) immediately, symmetry implies that it must also
hit (−∞, 0) immediately. Since t → Bt is continuous, this forces:
Theorem 7.2.5. If T0 = inf {t > 0 : Bt = 0} then P0 (T0 = 0) = 1.
A corollary of Theorem 7.2.5 is:
Exercise 7.2.3. If a < b, then with probability one there is a local maximum of Bt
in (a, b). So the set of local maxima of Bt is almost surely a dense set.
Another typical application of Theorem 7.2.3 is:
Exercise 7.2.4. (i) Suppose $f(t) > 0$ for all $t > 0$. Use Theorem 7.2.3 to conclude
that $\limsup_{t \downarrow 0} B(t)/f(t) = c$, $P_0$ a.s., where $c \in [0, \infty]$ is a constant. (ii) Show that
if $f(t) = \sqrt{t}$ then $c = \infty$, so with probability one Brownian paths are not Hölder
continuous of order 1/2 at 0.
Remark. Let $H_\gamma(\omega)$ be the set of times at which the path $\omega \in C$ is Hölder continuous
of order $\gamma$. Theorem 7.1.5 shows that $P(H_\gamma = [0, \infty)) = 1$ for $\gamma < 1/2$. Exercise 7.1.2 shows that
$P(H_\gamma = \emptyset) = 1$ for $\gamma > 1/2$. The last exercise shows $P(t \in H_{1/2}) = 0$ for each $t$, but
B. Davis (1983) has shown $P(H_{1/2} \ne \emptyset) = 1$.
Theorem 7.2.3 concerns the behavior of Bt as t → 0. By using a trick, we can use
this result to get information about the behavior as t → ∞.
Theorem 7.2.6. If Bt is a Brownian motion starting at 0, then so is the process
deﬁned by X0 = 0 and Xt = tB (1/t) for t > 0.
Proof. Here we will check the second definition of Brownian motion. To do this, we
note: (i) If $0 < t_1 < \ldots < t_n$, then $(X(t_1), \ldots, X(t_n))$ has a multivariate normal
distribution with mean 0. (ii) $EX_s = 0$ and if $s < t$ then, since $1/t < 1/s$,
$$E(X_s X_t) = st\, E(B(1/s)B(1/t)) = st \cdot (1/t) = s$$
For (iii) we note that $X$ is clearly continuous at $t \ne 0$.
To handle $t = 0$, we begin by observing that the strong law of large numbers
implies $B_n/n \to 0$ as $n \to \infty$ through the integers. To handle values in between
integers, we note that Kolmogorov’s inequality, (8.2) in Chapter 1, implies
$$P\left( \sup_{0 < k \le 2^m} |B(n + k2^{-m}) - B_n| > n^{2/3} \right) \le n^{-4/3} E(B_{n+1} - B_n)^2$$
Letting $m \to \infty$, we have
$$P\left( \sup_{u \in [n, n+1]} |B_u - B_n| > n^{2/3} \right) \le n^{-4/3}$$
Since $\sum_n n^{-4/3} < \infty$, the Borel-Cantelli lemma implies $B_u/u \to 0$ as $u \to \infty$. Taking
$u = 1/t$, we have $X_t \to 0$ as $t \to 0$.
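The covariance computation in step (ii) can be checked mechanically: for Brownian motion started at 0 the covariance of B_s and B_t is min(s, t), so the covariance of X_s and X_t is st · min(1/s, 1/t), which should again be min(s, t). The sketch below verifies this on a small grid with exact rational arithmetic; the function names and the grid are our own illustrative choices.

```python
from fractions import Fraction

def bm_cov(s, t):
    """Covariance E[B_s B_t] = min(s, t) for Brownian motion started at 0."""
    return min(s, t)

def inverted_cov(s, t):
    """Covariance of X_s = s*B(1/s) and X_t = t*B(1/t) for s, t > 0."""
    return s * t * bm_cov(1 / s, 1 / t)

times = [Fraction(1, 3), Fraction(1, 2), Fraction(7, 4), Fraction(2)]
for s in times:
    for t in times:
        assert inverted_cov(s, t) == bm_cov(s, t)
print("covariances of X and B agree on the grid")
```

Since both processes are Gaussian with mean 0, matching covariances is all that is needed for the finite dimensional distributions to agree.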
Theorem 7.2.6 allows us to relate the behavior of $B_t$ as $t \to \infty$ and as $t \to 0$.
Combining this idea with Blumenthal’s 0–1 law leads to a very useful result. Let
$$\mathcal{F}_t = \sigma(B_s : s \ge t) = \text{the future at time } t$$
$$\mathcal{T} = \cap_{t \ge 0} \mathcal{F}_t = \text{the tail σ-field.}$$
Theorem 7.2.7. If $A \in \mathcal{T}$ then either $P_x(A) \equiv 0$ or $P_x(A) \equiv 1$.
Remark. Notice that this is stronger than the conclusion of Blumenthal’s 0–1 law.
The examples $A = \{\omega : \omega(0) \in D\}$ show that for $A$ in the germ σ-field $\mathcal{F}_0^+$, the value
of $P_x(A)$, $1_D(x)$ in this case, may depend on $x$.
Proof. Since the tail σ-field of $B$ is the same as the germ σ-field for $X$, it follows that
$P_0(A) \in \{0, 1\}$. To improve this to the conclusion given, observe that $A \in \mathcal{F}_1$, so $1_A$
can be written as $1_D \circ \theta_1$. Applying the Markov property gives
$$P_x(A) = E_x(1_D \circ \theta_1) = E_x(E_x(1_D \circ \theta_1 \mid \mathcal{F}_1)) = E_x(E_{B_1} 1_D) = \int (2\pi)^{-1/2} \exp(-(y - x)^2/2) P_y(D)\, dy$$
Taking $x = 0$, we see that if $P_0(A) = 0$, then $P_y(D) = 0$ for a.e. $y$ with respect to
Lebesgue measure, and using the formula again shows $P_x(A) = 0$ for all $x$. To handle
the case $P_0(A) = 1$, observe that $A^c \in \mathcal{T}$ and $P_0(A^c) = 0$, so the last result implies
$P_x(A^c) = 0$ for all $x$.
The next result is a typical application of Theorem 7.2.7.
Theorem 7.2.8. Let $B_t$ be a one-dimensional Brownian motion starting at 0. Then
with probability 1,
$$\limsup_{t \to \infty} B_t/\sqrt{t} = \infty \qquad \liminf_{t \to \infty} B_t/\sqrt{t} = -\infty$$
Proof. Let $K < \infty$. By Exercise 1.6.6 and scaling
$$P_0(B_n/\sqrt{n} \ge K \text{ i.o.}) \ge \limsup_{n \to \infty} P_0(B_n \ge K\sqrt{n}) = P_0(B_1 \ge K) > 0$$
so the 0–1 law in Theorem 7.2.7 implies the probability is 1. Since $K$ is arbitrary, this
proves the first result. The second one follows from symmetry.
From Theorem 7.2.8, translation invariance, and the continuity of Brownian paths
it follows that we have:
Theorem 7.2.9. Let $B_t$ be a one-dimensional Brownian motion and let $A = \cap_n \{B_t = 0 \text{ for some } t \ge n\}$. Then $P_x(A) = 1$ for all $x$.
In words, one-dimensional Brownian motion is recurrent. For any starting point $x$,
it will return to 0 “infinitely often,” i.e., there is a sequence of times $t_n \uparrow \infty$ so that
$B_{t_n} = 0$. We have to be careful with the interpretation of the phrase in quotes since
starting from 0, $B_t$ will hit 0 infinitely many times by time $\epsilon > 0$.
Last rites. With our discussion of Blumenthal’s 0–1 law complete, the distinction
between $\mathcal{F}_s^+$ and $\mathcal{F}_s^o$ is no longer important, so we will make one final improvement
in our σ-fields and remove the superscripts. Let
$$\mathcal{N}_x = \{A : A \subset D \text{ with } P_x(D) = 0\}$$
$$\mathcal{F}_s^x = \sigma(\mathcal{F}_s^+ \cup \mathcal{N}_x)$$
$$\mathcal{F}_s = \cap_x \mathcal{F}_s^x$$
$\mathcal{N}_x$ are the null sets and $\mathcal{F}_s^x$ are the completed σ-fields for $P_x$. Since we do not
want the filtration to depend on the initial state, we take the intersection of all the
σ-fields. The reader should note that it follows from the definition that the $\mathcal{F}_s$ are
right-continuous.
7.3 Stopping Times, Strong Markov Property
Generalizing a definition in Section 3.1, we call a random variable $S$ taking values in
[0, ∞] a stopping time if for all t ≥ 0, {S < t} ∈ Ft . In the last deﬁnition, we have
obviously made a choice between {S < t} and {S ≤ t}. This makes a big diﬀerence
in discrete time but none in continuous time (for a right continuous ﬁltration Ft ) :
If {S ≤ t} ∈ Ft then {S < t} = ∪n {S ≤ t − 1/n} ∈ Ft .
If {S < t} ∈ Ft then {S ≤ t} = ∩n {S < t + 1/n} ∈ Ft .
The ﬁrst conclusion requires only that t → Ft is increasing. The second relies on the
fact that t → Ft is right continuous. Theorems 7.3.2 and 7.3.3 below show that when
checking something is a stopping time, it is nice to know that the two deﬁnitions are
equivalent.
Theorem 7.3.1. If G is an open set and T = inf {t ≥ 0 : Bt ∈ G} then T is a
stopping time.
Proof. Since G is open and t → Bt is continuous, {T < t} = ∪q<t {Bq ∈ G}, where
the union is over all rational q , so {T < t} ∈ Ft . Here we need to use the rationals to
get a countable union, and hence a measurable set.
Theorem 7.3.2. If Tn is a sequence of stopping times and Tn ↓ T then T is a stopping
time.
Proof. {T < t} = ∪n {Tn < t}.
Theorem 7.3.3. If Tn is a sequence of stopping times and Tn ↑ T then T is a stopping
time.
Proof. {T ≤ t} = ∩n {Tn ≤ t}.
Theorem 7.3.4. If K is a closed set and T = inf {t ≥ 0 : Bt ∈ K } then T is a
stopping time.
Proof. Let $B(x, r) = \{y : |y - x| < r\}$, let $G_n = \cup_{x \in K} B(x, 1/n)$ and let $T_n = \inf\{t \ge 0 : B_t \in G_n\}$. Since $G_n$ is open, it follows from Theorem 7.3.1 that $T_n$ is a stopping
time. I claim that as $n \uparrow \infty$, $T_n \uparrow T$. To prove this, notice that $T \ge T_n$ for all $n$,
so $\lim T_n \le T$. To prove $T \le \lim T_n$, we can suppose that $T_n \uparrow t < \infty$. Since
$B(T_n) \in \bar{G}_n$ for all $n$ and $B(T_n) \to B(t)$, it follows that $B(t) \in K$ and $T \le t$.
Exercise 7.3.1. Let $S$ be a stopping time and let $S_n = ([2^n S] + 1)/2^n$ where $[x]$ =
the largest integer $\le x$. That is,
$$S_n = (m + 1)2^{-n} \quad \text{if } m2^{-n} \le S < (m + 1)2^{-n}$$
In words, we stop at the first time of the form $k2^{-n}$ after $S$ (i.e., $> S$). From the
verbal description, it should be clear that $S_n$ is a stopping time. Prove that it is.
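The dyadic approximation in Exercise 7.3.1 is easy to compute for a fixed value of S(ω). The sketch below evaluates the formula from the exercise for a hypothetical value S = 5/7 and a few values of n; the function name is our own. It illustrates the three properties used later: the approximation is strictly greater than S, it is within 2^{-n} of S, and it is nonincreasing in n.

```python
import math
from fractions import Fraction

def dyadic_upper(S, n):
    """S_n = (floor(2**n * S) + 1) / 2**n, the first dyadic time k/2**n > S."""
    return Fraction(math.floor(2 ** n * S) + 1, 2 ** n)

S = Fraction(5, 7)                      # hypothetical value of S(omega)
approximations = [dyadic_upper(S, n) for n in range(1, 11)]
for n, s_n in enumerate(approximations, start=1):
    print(n, s_n, float(s_n - S))
```

Note that the strict inequality holds even when S is itself dyadic, which is what makes the event that S_n equals k2^{-n} depend only on information strictly before time k2^{-n}.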
Exercise 7.3.2. If S and T are stopping times, then S ∧ T = min{S, T }, S ∨ T =
max{S, T }, and S + T are also stopping times. In particular, if t ≥ 0, then S ∧ t,
S ∨ t, and S + t are stopping times.
Exercise 7.3.3. Let $T_n$ be a sequence of stopping times. Show that
$$\sup_n T_n, \quad \inf_n T_n, \quad \limsup_n T_n, \quad \liminf_n T_n$$
are stopping times.
Theorems 7.3.4 and 7.3.1 will take care of all the hitting times we will consider.
Our next goal is to state and prove the strong Markov property. To do this, we need
to generalize two deﬁnitions from Section 3.1. Given a nonnegative random variable
S (ω ) we deﬁne the random shift θS , which “cuts oﬀ the part of ω before S (ω ) and
then shifts the path so that time S (ω ) becomes time 0.” In symbols, we set
$$(\theta_S \omega)(t) = \begin{cases} \omega(S(\omega) + t) & \text{on } \{S < \infty\} \\ \Delta & \text{on } \{S = \infty\} \end{cases}$$
where $\Delta$ is an extra point we add to $C$. As in Section 5.3, we will usually explicitly
restrict our attention to {S < ∞}, so the reader does not have to worry about the
second half of the deﬁnition.
The second quantity FS , “the information known at time S ,” is a little more
subtle. Imitating the discrete time deﬁnition from Section 3.1, we let
FS = {A : A ∩ {S ≤ t} ∈ Ft for all t ≥ 0}
In words, this makes the reasonable demand that the part of A that lies in {S ≤ t}
should be measurable with respect to the information available at time t. Again we
have made a choice between ≤ t and < t, but as in the case of stopping times, this
makes no diﬀerence, and it is useful to know that the two deﬁnitions are equivalent.
Exercise 7.3.4. Show that when Ft is right continuous, the last deﬁnition is unchanged if we replace {S ≤ t} by {S < t}.
For practice with the deﬁnition of FS , do:
Exercise 7.3.5. Let S be a stopping time, let A ∈ FS , and let R = S on A and
R = ∞ on Ac . Show that R is a stopping time.
Exercise 7.3.6. Let S and T be stopping times.
(i) {S < t}, {S > t}, {S = t} are in FS .
(ii) {S < T }, {S > T }, and {S = T } are in FS (and in FT ).
Most of the properties of FN derived in Section 3.1 carry over to continuous time.
The next two will be useful below. The ﬁrst is intuitively obvious: at a later time we
have more information.
Theorem 7.3.5. If S ≤ T are stopping times then FS ⊂ FT .
Proof. If A ∈ FS then A ∩ {T ≤ t} = (A ∩ {S ≤ t}) ∩ {T ≤ t} ∈ Ft .
Theorem 7.3.6. If Tn ↓ T are stopping times then FT = ∩F (Tn ).
Proof. Theorem 7.3.5 implies F (Tn ) ⊃ FT for all n. To prove the other inclusion, let
A ∈ ∩F (Tn ). Since A∩{Tn < t} ∈ Ft and Tn ↓ T , it follows that A∩{T < t} ∈ Ft .
The last result allows you to prove something that is obvious from the verbal
deﬁnition.
Exercise 7.3.7. $B_S \in \mathcal{F}_S$, i.e., the value of $B_S$ is measurable with respect to the
information known at time $S$! To prove this, let $S_n = ([2^n S] + 1)/2^n$ be the stopping
times defined in Exercise 7.3.1. Show $B(S_n) \in \mathcal{F}_{S_n}$, then let $n \to \infty$ and use Theorem
7.3.6.
We are now ready to state the strong Markov property, which says that the Markov
property holds at stopping times.
Theorem 7.3.7. Strong Markov property. Let $(s, \omega) \to Y_s(\omega)$ be bounded and
$\mathcal{R} \times \mathcal{C}$ measurable. If $S$ is a stopping time, then for all $x \in \mathbf{R}^d$
$$E_x(Y_S \circ \theta_S \mid \mathcal{F}_S) = E_{B(S)} Y_S \quad \text{on } \{S < \infty\}$$
where the right-hand side is the function $\varphi(x, t) = E_x Y_t$ evaluated at $x = B(S)$, $t = S$.
Remark. The only facts about Brownian motion used here are that it is a Markov
process, and if f is bounded and continuous then x → Ex f (Bt ) is continuous.
Proof. We first prove the result under the assumption that there is a sequence of
times $t_n \uparrow \infty$, so that $P_x(S < \infty) = \sum_n P_x(S = t_n)$. In this case, the proof is basically
the same as the proof of Theorem 5.3.4. We break things down according to the
value of $S$, apply the Markov property, and put the pieces back together. If we let
$Z_n = Y_{t_n}(\omega)$ and $A \in \mathcal{F}_S$, then
$$E_x(Y_S \circ \theta_S; A \cap \{S < \infty\}) = \sum_{n=1}^{\infty} E_x(Z_n \circ \theta_{t_n}; A \cap \{S = t_n\})$$
Now if $A \in \mathcal{F}_S$, $A \cap \{S = t_n\} = (A \cap \{S \le t_n\}) - (A \cap \{S \le t_{n-1}\}) \in \mathcal{F}_{t_n}$, so it follows
from the Markov property that the above sum is
$$= \sum_{n=1}^{\infty} E_x(E_{B(t_n)} Z_n; A \cap \{S = t_n\}) = E_x(E_{B(S)} Y_S; A \cap \{S < \infty\})$$
To prove the result in general, we let $S_n = ([2^n S] + 1)/2^n$ be the stopping time defined
in Exercise 7.3.1. To be able to let $n \to \infty$, we restrict our attention to $Y$’s of the
form
$$Y_s(\omega) = f_0(s) \prod_{m=1}^{n} f_m(\omega(t_m)) \tag{$*$}$$
where $0 < t_1 < \ldots < t_n$ and $f_0, \ldots, f_n$ are bounded and continuous. If $f$ is bounded
and continuous then the dominated convergence theorem implies that
$$x \to \int dy\, p_t(x, y) f(y)$$
is continuous. From this and induction, it follows that
$$\varphi(x, s) = E_x Y_s = f_0(s) \int dy_1\, p_{t_1}(x, y_1) f_1(y_1) \ldots \int dy_n\, p_{t_n - t_{n-1}}(y_{n-1}, y_n) f_n(y_n)$$
is bounded and continuous.
Having assembled the necessary ingredients, we can now complete the proof. Let
$A \in \mathcal{F}_S$. Since $S \le S_n$, Theorem 7.3.5 implies $A \in \mathcal{F}(S_n)$. Applying the special case
proved above to $S_n$ and observing that $\{S_n < \infty\} = \{S < \infty\}$ gives
$$E_x(Y_{S_n} \circ \theta_{S_n}; A \cap \{S < \infty\}) = E_x(\varphi(B(S_n), S_n); A \cap \{S < \infty\})$$
Now, as $n \to \infty$, $S_n \downarrow S$, $B(S_n) \to B(S)$, $\varphi(B(S_n), S_n) \to \varphi(B(S), S)$ and
$$Y_{S_n} \circ \theta_{S_n} \to Y_S \circ \theta_S$$
so the bounded convergence theorem implies that the result holds when $Y$ has the
form given in ($*$).
To complete the proof now, we will apply the monotone class theorem. As in the
proof of Theorem 7.2.1, we let $\mathcal{H}$ be the collection of $Y$ for which
$$E_x(Y_S \circ \theta_S; A) = E_x(E_{B(S)} Y_S; A) \quad \text{for all } A \in \mathcal{F}_S$$
and it is easy to see that (ii) and (iii) hold. This time, however, we take $\mathcal{A}$ to be
the sets of the form $A = G_0 \times \{\omega : \omega(s_j) \in G_j, 1 \le j \le k\}$, where the $G_j$ are
open sets. To verify (i), we note that if $K_j = G_j^c$ and $f_j^n(x) = 1 \wedge n\rho(x, K_j)$, where
$\rho(x, K) = \inf\{|x - y| : y \in K\}$ then $f_j^n$ are continuous functions with $f_j^n \uparrow 1_{G_j}$ as
$n \uparrow \infty$. The facts that
$$Y_s^n(\omega) = f_0^n(s) \prod_{j=1}^{k} f_j^n(\omega(s_j)) \in \mathcal{H}$$
and (iii) holds for $\mathcal{H}$ imply that $1_A \in \mathcal{H}$. This verifies (i) in the monotone class
theorem and completes the proof.
7.4 Path Properties
In this section, we will use the strong Markov property to derive properties of the
zero set $\{t : B_t = 0\}$, the hitting times $T_a = \inf\{t : B_t = a\}$, and $\max_{0 \le s \le t} B_s$ for one
dimensional Brownian motion.
7.4.1 Zeros of Brownian Motion
Let $R_t = \inf\{u > t : B_u = 0\}$ and let $T_0 = \inf\{u > 0 : B_u = 0\}$. Now Theorem 7.2.9
implies $P_x(R_t < \infty) = 1$, so $B(R_t) = 0$ and the strong Markov property and Theorem
7.2.5 imply
$$P_x(T_0 \circ \theta_{R_t} > 0 \mid \mathcal{F}_{R_t}) = P_0(T_0 > 0) = 0$$
Taking expected value of the last equation, we see that
$$P_x(T_0 \circ \theta_{R_t} > 0 \text{ for some rational } t) = 0$$
From this, it follows that if a point $u \in Z(\omega) \equiv \{t : B_t(\omega) = 0\}$ is isolated on the left
(i.e., there is a rational $t < u$ so that $(t, u) \cap Z(\omega) = \emptyset$), then it is, with probability
one, a decreasing limit of points in $Z(\omega)$. This shows that the closed set $Z(\omega)$ has
no isolated points and hence must be uncountable. For the last step, see Hewitt and
Stromberg (1965), page 72.
If we let $|Z(\omega)|$ denote the Lebesgue measure of $Z(\omega)$ then Fubini’s theorem implies
$$E_x(|Z(\omega) \cap [0, T]|) = \int_0^T P_x(B_t = 0)\, dt = 0$$
So $Z(\omega)$ is a set of measure zero.
The last four observations show that $Z$ is like the Cantor set that is obtained by
removing (1/3, 2/3) from [0, 1] and then repeatedly removing the middle third from
the intervals that remain. The Cantor set is bigger however. Its Hausdorff dimension
is log 2/ log 3, while $Z$ has dimension 1/2.
7.4.2 Hitting times
Theorem 7.4.1. Under $P_0$, $\{T_a, a \ge 0\}$ has stationary independent increments.
Proof. The first step is to notice that if $0 < a < b$ then
$$T_b \circ \theta_{T_a} = T_b - T_a,$$
so if $f$ is bounded and measurable, the strong Markov property, 7.3.7 and translation
invariance imply
$$E_0(f(T_b - T_a) \mid \mathcal{F}_{T_a}) = E_0(f(T_b) \circ \theta_{T_a} \mid \mathcal{F}_{T_a}) = E_a f(T_b) = E_0 f(T_{b-a})$$
To show that the increments are independent, let $a_0 < a_1 < \ldots < a_n$, let $f_i$, $1 \le i \le n$
be bounded and measurable, and let $F_i = f_i(T_{a_i} - T_{a_{i-1}})$. Conditioning on $\mathcal{F}_{T_{a_{n-1}}}$
and using the preceding calculation we have
$$E_0\left( \prod_{i=1}^{n} F_i \right) = E_0\left( \prod_{i=1}^{n-1} F_i \cdot E_0(F_n \mid \mathcal{F}_{T_{a_{n-1}}}) \right) = E_0\left( \prod_{i=1}^{n-1} F_i \right) E_0 F_n$$
By induction, it follows that $E_0 \prod_{i=1}^{n} F_i = \prod_{i=1}^{n} E_0 F_i$, which implies the desired
conclusion.
The scaling relation (7.1.1) implies
$$T_a \overset{d}{=} a^2 T_1 \tag{7.4.1}$$
Combining Theorem 7.4.1 and (7.4.1), we see that $t_k = T_k - T_{k-1}$ are i.i.d. and
$$\frac{t_1 + \cdots + t_n}{n^2} \to T_1$$
so using Theorem 2.7.4, we see that $T_a$ has a stable law. Since we are dividing by $n^2$
and $T_a \ge 0$, the index $\alpha = 1/2$ and the skewness parameter $\kappa = 1$, see (2.7.11).
Without knowing the theory mentioned in the previous paragraph, it is easy to
determine the Laplace transform
$$\varphi_a(\lambda) = E_0 \exp(-\lambda T_a) \quad \text{for } a \ge 0$$
and reach the same conclusion. To do this, we start by observing that Theorem 7.4.1
implies
$$\varphi_x(\lambda)\varphi_y(\lambda) = \varphi_{x+y}(\lambda).$$
It follows easily from this that
$$\varphi_a(\lambda) = \exp(-ac(\lambda)) \tag{7.4.2}$$
Proof. Let $c(\lambda) = -\log \varphi_1(\lambda)$ so (7.4.2) holds when $a = 1$. Using the previous identity
with $x = y = 2^{-m}$ and induction gives the result for $a = 2^{-m}$, $m \ge 1$. Then, letting
$x = k2^{-m}$ and $y = 2^{-m}$ we get the result for $a = (k + 1)2^{-m}$ with $k \ge 1$. Finally, to
extend to $a \in [0, \infty)$, note that $a \to \varphi_a(\lambda)$ is decreasing.
To identify $c(\lambda)$, we observe that (7.4.1) implies
$$E \exp(-T_a) = E \exp(-a^2 T_1)$$
so $ac(1) = c(a^2)$, i.e., $c(\lambda) = c(1)\sqrt{\lambda}$. Since all of our arguments also apply to $\sigma B_t$ we
cannot hope to compute $c(1)$. Theorem 7.5.7 will show
$$E_0(\exp(-\lambda T_a)) = \exp(-a\sqrt{2\lambda}) \tag{7.4.3}$$
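Formula (7.4.3) can be checked numerically against the density of the hitting time that is derived later in this section in (7.4.8), namely a(2πs^3)^{-1/2} exp(−a^2/2s). The following is a rough Python sketch using a midpoint rule; the function names, the test values λ = a = 1, the truncation point, and the step count are all our own assumptions, and the check is a sanity test rather than a proof.

```python
import math

def hitting_density(s, a):
    """Density a * (2*pi*s**3)**(-1/2) * exp(-a**2 / (2*s)) of T_a at s > 0."""
    return a / math.sqrt(2.0 * math.pi * s ** 3) * math.exp(-a * a / (2.0 * s))

def laplace_numeric(lam, a, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of E exp(-lam * T_a) on [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        total += math.exp(-lam * s) * hitting_density(s, a)
    return total * h

lam, a = 1.0, 1.0                       # hypothetical test values
print(laplace_numeric(lam, a), math.exp(-a * math.sqrt(2.0 * lam)))
```

The integrand vanishes rapidly near 0 and decays exponentially at infinity, so truncating the integral at a moderate upper limit loses very little.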
Our next goal is to compute the distribution of the hitting times Ta . This application of the strong Markov property shows why we want to allow the function Y that
we apply to the shifted path to depend on the stopping time S.
Example 7.4.1. Reflection principle. Let $a > 0$ and let $T_a = \inf\{t : B_t = a\}$.
Then
$$P_0(T_a < t) = 2P_0(B_t \ge a) \tag{7.4.4}$$
Intuitive proof. We observe that if $B_s$ hits $a$ at some time $s < t$, then the strong
Markov property implies that $B_t - B(T_a)$ is independent of what happened before
time $T_a$. The symmetry of the normal distribution and $P_a(B_u = a) = 0$ for $u > 0$
then imply
$$P_0(T_a < t, B_t > a) = \frac{1}{2} P_0(T_a < t) \tag{7.4.5}$$
Rearranging the last equation and using $\{B_t > a\} \subset \{T_a < t\}$ gives
$$P_0(T_a < t) = 2P_0(T_a < t, B_t > a) = 2P_0(B_t > a)$$
Figure 7.1: Proof by picture of the reflection principle.
Proof. To make the intuitive proof rigorous, we only have to prove (7.4.5). To extract
this from the strong Markov property, Theorem 7.3.7, we let
$$Y_s(\omega) = \begin{cases} 1 & \text{if } s < t,\ \omega(t - s) > a \\ 0 & \text{otherwise} \end{cases}$$
We do this so that if we let $S = \inf\{s < t : B_s = a\}$ with $\inf \emptyset = \infty$, then
$$Y_S(\theta_S \omega) = \begin{cases} 1 & \text{if } S < t,\ B_t > a \\ 0 & \text{otherwise} \end{cases}$$
and the strong Markov property implies
$$E_0(Y_S \circ \theta_S \mid \mathcal{F}_S) = \varphi(B_S, S) \quad \text{on } \{S < \infty\} = \{T_a < t\}$$
where $\varphi(x, s) = E_x Y_s$. $B_S = a$ on $\{S < \infty\}$ and $\varphi(a, s) = 1/2$ if $s < t$, so taking
expected values gives
$$P_0(T_a < t, B_t > a) = E_0(Y_S \circ \theta_S; S < \infty) = E_0(E_0(Y_S \circ \theta_S \mid \mathcal{F}_S); S < \infty) = E_0(1/2; T_a < t)$$
which proves (7.4.5).
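The reflection argument has an exact discrete analogue that can be verified by direct computation. For a simple symmetric random walk S_n with steps ±1 and an integer level a > 0, reflecting the path after the first visit to a gives P(max_{k≤n} S_k ≥ a) = 2P(S_n > a) + P(S_n = a). The sketch below checks this identity with exact rational arithmetic. It is our own illustration of the same idea in discrete time, not a statement about Brownian motion itself, and the function name is hypothetical.

```python
from fractions import Fraction

def walk_hit_and_end(n, a):
    """Exact P(max_{k<=n} S_k >= a), P(S_n > a), P(S_n = a) for a simple
    symmetric random walk, by dynamic programming over (position, hit)."""
    half = Fraction(1, 2)
    states = {(0, False): Fraction(1)}  # start at 0, level a > 0 not yet hit
    for _ in range(n):
        nxt = {}
        for (pos, hit), prob in states.items():
            for step in (-1, 1):
                new_pos = pos + step
                key = (new_pos, hit or new_pos >= a)
                nxt[key] = nxt.get(key, Fraction(0)) + half * prob
        states = nxt
    p_hit = sum(p for (pos, hit), p in states.items() if hit)
    p_above = sum(p for (pos, hit), p in states.items() if pos > a)
    p_at = sum(p for (pos, hit), p in states.items() if pos == a)
    return p_hit, p_above, p_at

p_hit, p_above, p_at = walk_hit_and_end(n=20, a=3)
print(p_hit, 2 * p_above + p_at)
```

In the Brownian case the term corresponding to P(S_n = a) disappears because P_0(B_t = a) = 0, which is why (7.4.4) has only the factor 2.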
Exercise 7.4.1. Generalize the proof of (7.4.5) to conclude that if $u < v \le a$ then
$$P_0(T_a < t, u < B_t < v) = P_0(2a - v < B_t < 2a - u) \tag{7.4.6}$$
This should be obvious from the picture in Figure 7.1. Your task is to extract this
from the strong Markov property.
Letting $(u, v)$ shrink down to $x$ in (7.4.6) we have for $x < a$
$$P_0(T_a < t, B_t = x) = p_t(0, 2a - x)$$
$$P_0(T_a > t, B_t = x) = p_t(0, x) - p_t(0, 2a - x) \tag{7.4.7}$$
i.e., the (subprobability) density for $B_t$ on the two indicated events. Since $\{T_a < t\} = \{M_t > a\}$, differentiating with respect to $a$ gives the joint density
$$f_{(M_t, B_t)}(a, x) = \frac{2(2a - x)}{\sqrt{2\pi t^3}}\, e^{-(2a - x)^2/2t}$$
Using (7.4.4), we can compute the probability density of $T_a$. We begin by noting
that
$$P(T_a \le t) = 2P_0(B_t \ge a) = 2 \int_a^\infty (2\pi t)^{-1/2} \exp(-x^2/2t)\, dx$$
then change variables $x = (t^{1/2} a)/s^{1/2}$ to get
$$P_0(T_a \le t) = 2 \int_t^0 (2\pi t)^{-1/2} \exp(-a^2/2s) \left( -t^{1/2} a/2s^{3/2} \right) ds = \int_0^t (2\pi s^3)^{-1/2}\, a \exp(-a^2/2s)\, ds \tag{7.4.8}$$
Using the last formula, we can compute:
Example 7.4.2. The distribution of $L = \sup\{t \le 1 : B_t = 0\}$. By (7.2.4),
$$P_0(L \le s) = \int_{-\infty}^{\infty} p_s(0, x) P_x(T_0 > 1 - s)\, dx$$
$$= 2 \int_0^\infty (2\pi s)^{-1/2} \exp(-x^2/2s) \int_{1-s}^\infty (2\pi r^3)^{-1/2}\, x \exp(-x^2/2r)\, dr\, dx$$
$$= \frac{1}{\pi} \int_{1-s}^\infty (sr^3)^{-1/2} \int_0^\infty x \exp(-x^2(r + s)/2rs)\, dx\, dr$$
$$= \frac{1}{\pi} \int_{1-s}^\infty (sr^3)^{-1/2}\, rs/(r + s)\, dr$$
Our next step is to let $t = s/(r + s)$ to convert the integral over $r \in [1 - s, \infty)$ into
one over $t \in [0, s]$. $dt = -s/(r + s)^2\, dr$, so to make the calculations easier we first
rewrite the integral as
$$= \frac{1}{\pi} \int_{1-s}^\infty \left( \frac{(r + s)^2}{rs} \right)^{1/2} \frac{s}{(r + s)^2}\, dr$$
and then change variables to get
$$P_0(L \le s) = \frac{1}{\pi} \int_0^s (t(1 - t))^{-1/2}\, dt = \frac{2}{\pi} \arcsin(\sqrt{s}) \tag{7.4.9}$$
The arcsin may remind the reader of the limit theorem for $L_{2n} = \sup\{m \le 2n : S_m = 0\}$ given in Theorem 3.3.5. We will see in Section 7.6 that our new result is a
consequence of the old one.
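The last equality in (7.4.9) says that (2/π) arcsin(√s) is an antiderivative of (1/π)(t(1 − t))^{-1/2}. A quick numerical sanity check is to compare a central difference quotient of the distribution function with the integrand. The sketch below does this at a few hypothetical points; the function names and the step size are our own choices.

```python
import math

def arcsine_cdf(s):
    """P_0(L <= s) = (2/pi) * arcsin(sqrt(s)) for 0 <= s <= 1."""
    return 2.0 / math.pi * math.asin(math.sqrt(s))

def arcsine_density(t):
    """Integrand (1/pi) * (t * (1 - t))**(-1/2) for 0 < t < 1."""
    return 1.0 / (math.pi * math.sqrt(t * (1.0 - t)))

eps = 1e-6                              # step for a central difference
for s in (0.1, 0.25, 0.5, 0.9):
    slope = (arcsine_cdf(s + eps) - arcsine_cdf(s - eps)) / (2.0 * eps)
    print(s, slope, arcsine_density(s))
```

The density is smallest at s = 1/2 and blows up at both endpoints, so the last zero before time 1 is more likely to be near 0 or near 1 than near the middle.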
Exercise 7.4.2. Use (7.2.3) Show that R = inf {t > 1 : Bt = 0} has probability
density
P0 (R = 1 + t) = 1/(πt1/2 (1 + t)) 316 CHAPTER 7. BROWNIAN MOTION 7.4.3 L´vy’s Modulus of Continuity
e Let osc(δ ) = sup{Bs − Bt  : s, t ∈ [0, 1], t − s < δ }.
Theorem 7.4.2. With probability 1,
$$\limsup_{\delta \to 0}\ \mathrm{osc}(\delta)/(\delta \log(1/\delta))^{1/2} \le 6$$
Remark. The constant 6 is not the best possible because the end of the proof is sloppy. Lévy (1937) showed
$$\limsup_{\delta \to 0}\ \mathrm{osc}(\delta)/(\delta \log(1/\delta))^{1/2} = \sqrt{2}$$
See McKean (1969), p. 14-16, or Itô and McKean (1965), p. 36-38, where a sharper result due to Chung, Erdős and Sirao (1959) is proved. In contrast, if we look at the behavior at a single point, (9.5) below shows
$$\limsup_{t \to 0}\ |B_t| / \sqrt{2t \log\log(1/t)} = 1 \quad \text{a.s.}$$
Proof. Let $I_{m,n} = [m 2^{-n}, (m + 1) 2^{-n}]$, and $\Delta_{m,n} = \sup\{|B_t - B(m 2^{-n})| : t \in I_{m,n}\}$.
From (7.4.4) and the scaling relation, it follows that
$$P(\Delta_{m,n} \ge a 2^{-n/2}) \le 4 P(B(2^{-n}) \ge a 2^{-n/2}) = 4 P(B(1) \ge a) \le 4 \exp(-a^2/2)$$
by Theorem 1.1.4 if $a \ge 1$. If $\epsilon > 0$, $b = 2(1 + \epsilon)(\log 2)$, and $a_n = (bn)^{1/2}$, then the last result implies
$$P(\Delta_{m,n} \ge a_n 2^{-n/2} \text{ for some } m \le 2^n) \le 2^n \cdot 4 \exp(-bn/2) = 4 \cdot 2^{-n\epsilon}$$
so the Borel-Cantelli lemma implies that if $n \ge N(\omega)$, $\Delta_{m,n} \le (bn)^{1/2} 2^{-n/2}$. Now if $s \in I_{m,n}$, $s < t$ and $|s - t| < 2^{-n}$, then $t \in I_{m,n}$ or $I_{m+1,n}$. I claim that in either case the triangle inequality implies
$$|B_t - B_s| \le 3 (bn)^{1/2} 2^{-n/2}$$
To see this, note that the worst case is $t \in I_{m+1,n}$, but even in this case
$$|B_t - B_s| \le |B_t - B((m + 1) 2^{-n})| + |B((m + 1) 2^{-n}) - B(m 2^{-n})| + |B(m 2^{-n}) - B_s|$$
It follows from the last estimate that for $2^{-(n+1)} \le \delta < 2^{-n}$
$$\mathrm{osc}(\delta) \le 3 (bn)^{1/2} 2^{-n/2} \le 3 (b \log_2(1/\delta))^{1/2} (2\delta)^{1/2} = 6 ((1 + \epsilon) \delta \log(1/\delta))^{1/2}$$
Recall $b = 2(1 + \epsilon) \log 2$ and observe $\exp((\log 2)(\log_2 1/\delta)) = 1/\delta$.
7.5 Martingales
At the end of Section 4.7 we used martingales to study the hitting times of random walks. The same methods can be used on Brownian motion once we prove:
Theorem 7.5.1. Let $X_t$ be a right continuous martingale adapted to a right continuous filtration. If $T$ is a bounded stopping time, then $EX_T = EX_0$.
Proof. Let $n$ be an integer so that $P(T \le n - 1) = 1$. As in the proof of the strong Markov property, let $T_m = ([2^m T] + 1)/2^m$. $Y_k^m = X(k 2^{-m})$ is a martingale with respect to $\mathcal{F}_k^m = \mathcal{F}(k 2^{-m})$ and $S_m = 2^m T_m$ is a stopping time for $(Y_k^m, \mathcal{F}_k^m)$, so by Exercise 4.4.3
$$X(T_m) = Y^m_{S_m} = E(Y^m_{n 2^m} \mid \mathcal{F}^m_{S_m}) = E(X_n \mid \mathcal{F}(T_m))$$
As $m \uparrow \infty$, $X(T_m) \to X(T)$ by right continuity and $\mathcal{F}(T_m) \downarrow \mathcal{F}(T)$ by Theorem 7.3.6, so it follows from Theorem 4.6.3 that
$$X(T) = E(X_n \mid \mathcal{F}(T))$$
Taking expected values gives $EX(T) = EX_n = EX_0$, since $X_n$ is a martingale.
Theorem 7.5.2. $B_t$ is a martingale w.r.t. the σ-fields $\mathcal{F}_t$ defined in Section 7.2.
Note: We will use these σ-fields in all of the martingale results but will not mention them explicitly in the statements.
Proof. The Markov property implies that
$$E_x(B_t \mid \mathcal{F}_s) = E_{B_s}(B_{t-s}) = B_s$$
since symmetry implies $E_y B_u = y$ for all $u \ge 0$.
From Theorem 7.5.2, it follows immediately that we have:
Theorem 7.5.3. If $a < x < b$ then $P_x(T_a < T_b) = (b - x)/(b - a)$.
Proof. Let $T = T_a \wedge T_b$. Theorem 7.2.8 implies that $T < \infty$ a.s. Using Theorems 7.5.1 and 7.5.2, it follows that $x = E_x B(T \wedge t)$. Letting $t \to \infty$ and using the bounded convergence theorem, it follows that
$$x = a P_x(T_a < T_b) + b (1 - P_x(T_a < T_b))$$
Solving for $P_x(T_a < T_b)$ now gives the desired result.
Example 7.5.1. Optimal doubling in Backgammon (Keeler and Spencer (1975)).
In our idealization, backgammon is a Brownian motion starting at 1/2 run until it
hits 1 or 0, and Bt is the probability you will win given the events up to time t.
Initially, the “doubling cube” sits in the middle of the board and either player can
“double,” that is, tell the other player to play on for twice the stakes or give up and
pay the current wager. If a player accepts the double (i.e., decides to play on), she
gets possession of the doubling cube and is the only one who can oﬀer the next double.
A doubling strategy is given by two numbers $b < 1/2 < a$, i.e., offer a double when $B_t \ge a$ and give up if the other player doubles and $B_t < b$. It is not hard to see that for the optimal strategy $b^* = 1 - a^*$ and that when $B_t = b^*$ accepting and giving up must have the same payoff. If you accept when your probability of winning is $b^*$, then you lose 2 dollars when your probability hits 0 but you win 2 dollars when your probability of winning hits $a^*$, since at that moment you can double and the other player gets the same payoff if they give up or play on. If giving up or playing on at $b^*$ is to have the same payoff, we must have
$$-1 = \frac{b^*}{a^*} \cdot 2 + \frac{a^* - b^*}{a^*} \cdot (-2)$$
Writing $b^* = c$ and $a^* = 1 - c$ and solving, we have $-(1 - c) = 2c - 2(1 - 2c)$ or $1 = 5c$. Thus $b^* = 1/5$ and $a^* = 4/5$. In words, you should offer a double if your odds of winning are 80% and accept if they are ≥ 20%.
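The indifference equation in Example 7.5.1 can also be solved numerically. The sketch below is illustrative only: `accept_value` and `indifference_point` are hypothetical names, the exit probability $c/(1-c)$ comes from Theorem 7.5.3 applied to the interval $(0, a^*)$ with $a^* = 1 - c$, and bisection is just one convenient way to locate the root.

```python
from fractions import Fraction

def accept_value(c: Fraction) -> Fraction:
    """Expected payoff of accepting a double at win probability b* = c.

    With a* = 1 - c, the process started at c hits a* before 0 with
    probability c / (1 - c) (Theorem 7.5.3); the stakes are 2 either way.
    """
    a_star = 1 - c
    p_up = c / a_star
    return p_up * 2 + (1 - p_up) * (-2)

def indifference_point(lo=Fraction(0), hi=Fraction(1, 2), iters=60):
    """Bisection for the c in (0, 1/2) with accept_value(c) = -1."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if accept_value(mid) < -1:
            lo = mid  # accepting is worse than giving up: c is too small
        else:
            hi = mid
    return (lo + hi) / 2
```

Here `accept_value(Fraction(1, 5))` is exactly $-1$, the cost of giving up, and `indifference_point()` converges to $1/5$, matching $b^* = 1/5$ and $a^* = 4/5$.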
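Theorem 7.5.4. $B_t^2 - t$ is a martingale.
Proof. Writing $B_t^2 = (B_s + B_t - B_s)^2$ we have
$$\begin{aligned}
E_x(B_t^2 \mid \mathcal{F}_s) &= E_x(B_s^2 + 2 B_s (B_t - B_s) + (B_t - B_s)^2 \mid \mathcal{F}_s) \\
&= B_s^2 + 2 B_s E_x(B_t - B_s \mid \mathcal{F}_s) + E_x((B_t - B_s)^2 \mid \mathcal{F}_s) \\
&= B_s^2 + 0 + (t - s)
\end{aligned}$$
since $B_t - B_s$ is independent of $\mathcal{F}_s$ and has mean 0 and variance $t - s$.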
Theorem 7.5.5. Let $T = \inf\{t : B_t \notin (a, b)\}$, where $a < 0 < b$.
$$E_0 T = -ab$$
Proof. Theorems 7.5.1 and 7.5.4 imply $E_0(B^2(T \wedge t)) = E_0(T \wedge t)$. Letting $t \to \infty$ and using the monotone convergence theorem gives $E_0(T \wedge t) \uparrow E_0 T$. Using the bounded convergence theorem and Theorem 7.5.3, we have
$$E_0 B^2(T \wedge t) \to E_0 B_T^2 = a^2\, \frac{b}{b - a} + b^2\, \frac{-a}{b - a} = ab\, \frac{a - b}{b - a} = -ab$$
Theorem 7.5.6. $\exp(\theta B_t - (\theta^2 t/2))$ is a martingale.
Proof. Bringing $\exp(\theta B_s)$ outside
$$E_x(\exp(\theta B_t) \mid \mathcal{F}_s) = \exp(\theta B_s) E(\exp(\theta (B_t - B_s)) \mid \mathcal{F}_s) = \exp(\theta B_s) \exp(\theta^2 (t - s)/2)$$
since $B_t - B_s$ is independent of $\mathcal{F}_s$ and has a normal distribution with mean 0 and variance $t - s$.
Theorem 7.5.7. If $T_a = \inf\{t : B_t = a\}$ then $E_0 \exp(-\lambda T_a) = \exp(-a \sqrt{2\lambda})$.
Proof. Theorems 7.5.1 and 7.5.6 imply that $1 = E_0 \exp(\theta B(T_a \wedge t) - \theta^2 (T_a \wedge t)/2)$. Taking $\theta = \sqrt{2\lambda}$, letting $t \to \infty$ and using the bounded convergence theorem gives $1 = E_0 \exp(a \sqrt{2\lambda} - \lambda T_a)$.
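Theorem 7.5.7 can be cross-checked against the hitting-time density in (7.4.8): integrating $e^{-\lambda t}$ against that density should reproduce $\exp(-a\sqrt{2\lambda})$. The following Python sketch does this with a plain midpoint rule; the function names, the truncation point `t_max`, and the step count are assumptions chosen for illustration, not part of the text.

```python
import math

def hitting_density(t: float, a: float) -> float:
    """Density of T_a from (7.4.8): a (2 pi t^3)^(-1/2) exp(-a^2 / 2t)."""
    return a / math.sqrt(2.0 * math.pi * t ** 3) * math.exp(-a * a / (2.0 * t))

def laplace_numeric(lam: float, a: float, t_max: float = 60.0,
                    steps: int = 200_000) -> float:
    """Midpoint rule for int_0^t_max exp(-lam t) * hitting_density(t, a) dt."""
    h = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(-lam * t) * hitting_density(t, a)
    return total * h

def laplace_closed_form(lam: float, a: float) -> float:
    """Theorem 7.5.7: E_0 exp(-lam T_a) = exp(-a sqrt(2 lam))."""
    return math.exp(-a * math.sqrt(2.0 * lam))
```

For moderate values such as $\lambda = 1$, $a = 1$ the two functions should agree closely; the neglected tail beyond `t_max` is of order $e^{-\lambda\, t_{\max}}$.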
Exercise 7.5.1. Let $T = \inf\{t : B_t \notin (-a, a)\}$. Show that
$$E \exp(-\lambda T) = 1/\cosh(a \sqrt{2\lambda}).$$
Exercise 7.5.2. The point of this exercise is to get information about the amount of time it takes Brownian motion with drift $-b$, $X_t \equiv B_t - bt$, to hit level $a$. Let $\tau = \inf\{t : B_t = a + bt\}$, where $a > 0$. (i) Use the martingale $\exp(\theta B_t - \theta^2 t/2)$ with $\theta = b + (b^2 + 2\lambda)^{1/2}$ to show
$$E_0 \exp(-\lambda \tau) = \exp(-a \{b + (b^2 + 2\lambda)^{1/2}\})$$
Letting $\lambda \to 0$ gives $P_0(\tau < \infty) = \exp(-2ab)$.
Exercise 7.5.3. Let $\sigma = \inf\{t : B_t \notin (a, b)\}$ and let $\lambda > 0$. (i) Use the strong Markov property to show
$$E_x \exp(-\lambda T_a) = E_x(e^{-\lambda \sigma}; T_a < T_b) + E_x(e^{-\lambda \sigma}; T_b < T_a) E_b \exp(-\lambda T_a)$$
(ii) Interchange the roles of $a$ and $b$ to get a second equation, use Theorem 7.5.7, and solve to get
$$E_x(e^{-\lambda \sigma}; T_a < T_b) = \sinh(\sqrt{2\lambda}(b - x))/\sinh(\sqrt{2\lambda}(b - a))$$
$$E_x(e^{-\lambda \sigma}; T_b < T_a) = \sinh(\sqrt{2\lambda}(x - a))/\sinh(\sqrt{2\lambda}(b - a))$$
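Theorem 7.5.8. If $u(t, x)$ is a polynomial in $t$ and $x$ with
$$\frac{\partial u}{\partial t} + \frac{1}{2} \frac{\partial^2 u}{\partial x^2} = 0 \tag{$*$}$$
then $u(t, B_t)$ is a martingale.
Proof. Let $p_t(x, y) = (2\pi t)^{-1/2} \exp(-(y - x)^2/2t)$. The first step is to check that $p_t$ satisfies the heat equation: $\partial p_t/\partial t = (1/2)\, \partial^2 p_t/\partial y^2$.
$$\frac{\partial p}{\partial t} = -\frac{1}{2t} (2\pi t)^{-1/2} \exp(-(y - x)^2/2t) + (2\pi t)^{-1/2}\, \frac{(y - x)^2}{2t^2} \exp(-(y - x)^2/2t)$$
$$\frac{\partial p}{\partial y} = -(2\pi t)^{-1/2} \cdot \frac{y - x}{t} \exp(-(y - x)^2/2t)$$
$$\frac{\partial^2 p}{\partial y^2} = -\frac{1}{t} (2\pi t)^{-1/2} \exp(-(y - x)^2/2t) + (2\pi t)^{-1/2}\, \frac{(y - x)^2}{t^2} \exp(-(y - x)^2/2t)$$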
Interchanging $\partial/\partial t$ and $\int$, and using the heat equation
$$\frac{\partial}{\partial t} E_x u(t, B_t) = \int \frac{\partial}{\partial t} (p_t(x, y) u(t, y))\, dy = \int \frac{1}{2} \frac{\partial^2}{\partial y^2} p_t(x, y)\, u(t, y) + p_t(x, y) \frac{\partial}{\partial t} u(t, y)\, dy$$
Integrating by parts twice, the above
$$= \int p_t(x, y) \left(\frac{\partial}{\partial t} + \frac{1}{2} \frac{\partial^2}{\partial y^2}\right) u(t, y)\, dy = 0$$
Since $u(t, y)$ is a polynomial there is no question about the convergence of integrals and there is no contribution from the boundary terms when we integrate by parts.
Examples of functions that satisfy $(*)$ are $\exp(\theta x - \theta^2 t/2)$, $x$, $x^2 - t$, $x^3 - 3tx$, $x^4 - 6x^2 t + 3t^2$, ...
Theorem 7.5.9. If $T = \inf\{t : B_t \notin (-a, a)\}$ then $ET^2 = 5a^4/3$.
Proof. Theorem 7.5.1 implies
$$E(B(T \wedge t)^4 - 6 (T \wedge t) B(T \wedge t)^2) = -3 E(T \wedge t)^2.$$
From Theorem 7.5.5, we know that $ET = a^2 < \infty$. Letting $t \to \infty$, using the dominated convergence theorem on the left-hand side, and the monotone convergence theorem on the right gives
$$a^4 - 6a^2 ET = -3 E(T^2)$$
Plugging in $ET = a^2$ gives the desired result.
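The polynomial examples listed after Theorem 7.5.8 can be checked mechanically against $(*)$. The sketch below is a minimal illustration under an assumed representation: a polynomial in $(t, x)$ is stored as a dictionary mapping exponent pairs $(i, j)$ to the coefficient of $t^i x^j$. The name `heat_operator` is hypothetical.

```python
from fractions import Fraction

# A polynomial in (t, x) is a dict mapping (i, j) to the coefficient of t^i x^j.

def heat_operator(u):
    """Return du/dt + (1/2) d^2u/dx^2 as a polynomial dict, dropping zeros."""
    out = {}
    for (i, j), coef in u.items():
        if i >= 1:  # d/dt of t^i x^j
            key = (i - 1, j)
            out[key] = out.get(key, 0) + i * coef
        if j >= 2:  # (1/2) d^2/dx^2 of t^i x^j
            key = (i, j - 2)
            out[key] = out.get(key, 0) + Fraction(j * (j - 1), 2) * coef
    return {k: c for k, c in out.items() if c != 0}

examples = {
    "x": {(0, 1): 1},
    "x^2 - t": {(0, 2): 1, (1, 0): -1},
    "x^3 - 3tx": {(0, 3): 1, (1, 1): -3},
    "x^4 - 6x^2 t + 3t^2": {(0, 4): 1, (1, 2): -6, (2, 0): 3},
}
residuals = {name: heat_operator(u) for name, u in examples.items()}
```

Every entry of `residuals` is the empty dictionary, i.e. each listed polynomial satisfies $(*)$, whereas a polynomial such as $x^2$ alone leaves the nonzero residual $1$. The last example is the one used in the proof of Theorem 7.5.9.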
Exercise 7.5.4. If $T = \inf\{t : B_t \notin (a, b)\}$, where $a < 0 < b$ and $a \ne -b$, then $T$ and $B_T^2$ are not independent, so we cannot calculate $ET^2$ as we did in the proof of Theorem 7.5.9. Use the Cauchy-Schwarz inequality to estimate $E(T B_T^2)$ and conclude $ET^2 \le C E(B_T^4)$, where $C$ is independent of $a$ and $b$.
Exercise 7.5.5. Find a martingale of the form $B_t^6 - c_1 t B_t^4 + c_2 t^2 B_t^2 - c_3 t^3$ and use it to compute the third moment of $T = \inf\{t : B_t \notin (-a, a)\}$.
Exercise 7.5.6. Show that $(1 + t)^{-1/2} \exp(B_t^2/2(1 + t))$ is a martingale and use this to conclude that $\limsup_{t \to \infty} B_t/((1 + t) \log(1 + t))^{1/2} \le 1/\sqrt{2}$ a.s.
7.5.1 Multidimensional Brownian motion
Let $\Delta f = \sum_{i=1}^d \partial^2 f/\partial x_i^2$ be the Laplacian of $f$. The starting point for our investigation is to note that repeating the calculation from the proof of Theorem 7.5.8 shows that in $d > 1$ dimensions
$$p_t(x, y) = (2\pi t)^{-d/2} \exp(-|y - x|^2/2t)$$
satisfies the heat equation $\partial p_t/\partial t = (1/2) \Delta_y p_t$, where the subscript $y$ on $\Delta$ indicates that the Laplacian acts in the $y$ variable.
Theorem 7.5.10. Suppose $v \in C^2$, i.e., all first and second order partial derivatives exist and are continuous, and $v$ has compact support. Then
$$v(B_t) - \int_0^t \frac{1}{2} \Delta v(B_s)\, ds$$
is a martingale.
Proof. Repeating the proof of Theorem 7.5.8
$$\frac{\partial}{\partial t} E_x v(B_t) = \int v(y) \frac{\partial}{\partial t} p_t(x, y)\, dy = \int \frac{1}{2} v(y) (\Delta_y p_t(x, y))\, dy = \int \frac{1}{2} p_t(x, y) \Delta_y v(y)\, dy$$
the calculus steps being justified by our assumptions.
We will use this result for two special cases:
$$\varphi(x) = \begin{cases} \log|x| & d = 2 \\ |x|^{2-d} & d \ge 3 \end{cases}$$
We leave it to the reader to check that in each case $\Delta \varphi = 0$. Let $S_r = \inf\{t : |B_t| = r\}$, $r < R$, and $\tau = S_r \wedge S_R$. The first detail is to note that Theorem 7.2.8 implies that if $|x| < R$ then $P_x(S_R < \infty) = 1$. Once we know this we can conclude
Theorem 7.5.11. If $|x| < R$ then $E_x S_R = (R^2 - |x|^2)/d$.
Proof. It follows from Theorem 7.5.4 that $|B_t|^2 - dt = \sum_{i=1}^d \{(B_t^i)^2 - t\}$ is a martingale. Theorem 7.5.1 implies $|x|^2 = E|B_{S_R \wedge t}|^2 - d E(S_R \wedge t)$. Letting $t \to \infty$ gives the desired result.
Lemma 7.5.12. $\varphi(x) = E_x \varphi(B_\tau)$
Proof. Define $\psi(x) = g(|x|)$ to be $C^2$ and have compact support, and have $\psi(x) = \varphi(x)$ when $r < |x| < R$. Theorem 7.5.10 implies that $\psi(x) = E_x \psi(B_{t \wedge \tau})$. Letting $t \to \infty$ now gives the desired result.
Lemma 7.5.12 implies that
$$\varphi(x) = \varphi(r) P_x(S_r < S_R) + \varphi(R) (1 - P_x(S_r < S_R))$$
where $\varphi(r)$ is short for the value of $\varphi(x)$ on $\{x : |x| = r\}$. Solving now gives
$$P_x(S_r < S_R) = \frac{\varphi(R) - \varphi(x)}{\varphi(R) - \varphi(r)} \tag{7.5.1}$$
In $d = 2$, the last formula says
$$P_x(S_r < S_R) = \frac{\log R - \log|x|}{\log R - \log r} \tag{7.5.2}$$
If we fix $r$ and let $R \to \infty$ in (7.5.2), the right-hand side goes to 1. So
$$P_x(S_r < \infty) = 1 \quad \text{for any } x \text{ and any } r > 0$$
It follows that two-dimensional Brownian motion is recurrent in the sense that if $G$ is any open set, then $P_x(B_t \in G \text{ i.o.}) \equiv 1$.
If we fix $R$, let $r \to 0$ in (7.5.2), and let $S_0 = \inf\{t > 0 : B_t = 0\}$, then for $x \ne 0$
$$P_x(S_0 < S_R) \le \lim_{r \to 0} P_x(S_r < S_R) = 0$$
Since this holds for all $R$ and since the continuity of Brownian paths implies $S_R \uparrow \infty$ as $R \uparrow \infty$, we have $P_x(S_0 < \infty) = 0$ for all $x \ne 0$. To extend the last result to $x = 0$ we note that the Markov property implies
$$P_0(B_t = 0 \text{ for some } t \ge \epsilon) = E_0[P_{B_\epsilon}(T_0 < \infty)] = 0$$
for all $\epsilon > 0$, so $P_0(B_t = 0 \text{ for some } t > 0) = 0$, and thanks to our definition of $S_0 = \inf\{t > 0 : B_t = 0\}$, we have
$$P_x(S_0 < \infty) = 0 \quad \text{for all } x \tag{7.5.3}$$
Thus, in $d \ge 2$ Brownian motion will not hit 0 at a positive time even if it starts there.
For $d \ge 3$, formula (7.5.1) says
$$P_x(S_r < S_R) = \frac{R^{2-d} - |x|^{2-d}}{R^{2-d} - r^{2-d}} \tag{7.5.4}$$
There is no point in fixing $R$ and letting $r \to 0$ here. The fact that two-dimensional Brownian motion does not hit 0 implies that three-dimensional Brownian motion does not hit 0 and indeed will not hit the line $\{x : x_1 = x_2 = 0\}$. If we fix $r$ and let $R \to \infty$ in (7.5.4) we get
$$P_x(S_r < \infty) = (r/|x|)^{d-2} < 1 \quad \text{if } |x| > r \tag{7.5.5}$$
From the last result it follows easily that for $d \ge 3$, Brownian motion is transient, i.e., it does not return infinitely often to any bounded set.
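The contrast between $d = 2$ and $d \ge 3$ is easy to see numerically from (7.5.1). The following Python sketch evaluates the exit probability and its $R \to \infty$ limit; the function names are hypothetical, and the formulas assume $r < |x| < R$.

```python
import math

def phi(radius: float, d: int) -> float:
    """Radial harmonic function: log r in d = 2, r^(2-d) in d >= 3."""
    if d == 2:
        return math.log(radius)
    if d >= 3:
        return radius ** (2 - d)
    raise ValueError("d must be at least 2")

def prob_inner_first(x_norm: float, r: float, R: float, d: int) -> float:
    """P_x(S_r < S_R) from (7.5.1), for r < |x| < R."""
    return (phi(R, d) - phi(x_norm, d)) / (phi(R, d) - phi(r, d))

def prob_ever_hit(x_norm: float, r: float, d: int) -> float:
    """Limit R -> infinity: 1 in d = 2, (r/|x|)^(d-2) in d >= 3 (7.5.5)."""
    return 1.0 if d == 2 else (r / x_norm) ** (d - 2)
```

With $|x| = 2$ and $r = 1$, `prob_inner_first` in $d = 3$ approaches $1/2$ as $R$ grows, in agreement with (7.5.5), while in $d = 2$ it creeps toward 1 only at the logarithmic rate visible in (7.5.2).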
Theorem 7.5.13. As $t \to \infty$, $|B_t| \to \infty$ a.s.
Proof. Let $A_n = \{|B_t| > n^{1-\epsilon} \text{ for all } t \ge S_n\}$. The strong Markov property implies
$$P_x(A_n^c) = E_x(P_{B(S_n)}(S_{n^{1-\epsilon}} < \infty)) = (n^{1-\epsilon}/n)^{d-2} \to 0$$
as $n \to \infty$. Now $\limsup A_n = \cap_{N=1}^\infty \cup_{n=N}^\infty A_n$ has
$$P(\limsup A_n) \ge \limsup P(A_n) = 1$$
So infinitely often the Brownian path never returns to $\{x : |x| \le n^{1-\epsilon}\}$ after time $S_n$ and this implies the desired result.
The scaling relation (7.1.1) implies that $S_{\sqrt{t}} \overset{d}{=} t S_1$, so the proof of Theorem 7.5.13 suggests that
$$|B_t|/t^{(1-\epsilon)/2} \to \infty$$
Dvoretzky and Erdős (1951) have proved the following result about how fast Brownian motion goes to $\infty$ in $d \ge 3$.
Theorem 7.5.14. Suppose $g(t)$ is positive and decreasing. Then
$$P_0(|B_t| \le g(t) \sqrt{t} \text{ i.o. as } t \uparrow \infty) = 1 \text{ or } 0$$
according as $\int^\infty g(t)^{d-2}/t\, dt = \infty$ or $< \infty$.
Here the absence of the lower limit implies that we are only concerned with the behavior of the integral "near $\infty$." A little calculus shows that
$$\int^\infty t^{-1} \log^{-\alpha} t\, dt = \infty \text{ or } < \infty$$
according as $\alpha \le 1$ or $\alpha > 1$, so $|B_t|$ goes to $\infty$ faster than $\sqrt{t}/(\log t)^{\alpha/(d-2)}$ for any $\alpha > 1$. Note that in view of the Brownian scaling relationship $B_t \overset{d}{=} t^{1/2} B_1$ we could not sensibly expect escape at a faster rate than $\sqrt{t}$. The last result shows that the escape rate is not much slower.
7.6 Donsker's Theorem
Let $X_1, X_2, \ldots$ be i.i.d. with $EX = 0$ and $EX^2 = 1$, and let $S_n = X_1 + \cdots + X_n$. In this section, we will show that as $n \to \infty$, $S(nt)/n^{1/2}$, $0 \le t \le 1$, converges in distribution to $B_t$, $0 \le t \le 1$, a Brownian motion starting from $B_0 = 0$. We will say precisely what the last sentence means below. The key to its proof is:
Theorem 7.6.1. Skorokhod's representation theorem. If $EX = 0$ and $EX^2 < \infty$ then there is a stopping time $T$ for Brownian motion so that $B_T \overset{d}{=} X$ and $ET = EX^2$.
Remark. The Brownian motion in the statement and all the Brownian motions in this section have $B_0 = 0$.
Proof. Suppose first that $X$ is supported on $\{a, b\}$, where $a < 0 < b$. Since $EX = 0$, we must have
$$P(X = a) = \frac{b}{b - a} \qquad P(X = b) = \frac{-a}{b - a}$$
If we let $T = T_{a,b} = \inf\{t : B_t \notin (a, b)\}$ then Theorem 7.5.3 implies $B_T \overset{d}{=} X$ and Theorem 7.5.5 tells us that
$$ET = -ab = EB_T^2$$
To treat the general case, we will write $F(x) = P(X \le x)$ as a mixture of two point distributions with mean 0. Let
$$c = \int_{-\infty}^0 (-u)\, dF(u) = \int_0^\infty v\, dF(v)$$
If $\varphi$ is bounded and $\varphi(0) = 0$, then using the two formulas for $c$
$$\begin{aligned}
c \int \varphi(x)\, dF(x) &= \int_0^\infty \varphi(v)\, dF(v) \int_{-\infty}^0 (-u)\, dF(u) + \int_{-\infty}^0 \varphi(u)\, dF(u) \int_0^\infty v\, dF(v) \\
&= \int_0^\infty dF(v) \int_{-\infty}^0 dF(u)\, (v \varphi(u) - u \varphi(v))
\end{aligned}$$
So we have
$$\int \varphi(x)\, dF(x) = c^{-1} \int_0^\infty dF(v) \int_{-\infty}^0 dF(u)\, (v - u) \left\{ \frac{v}{v - u} \varphi(u) + \frac{-u}{v - u} \varphi(v) \right\}$$
The last equation gives the desired mixture. If we let $(U, V) \in \mathbf{R}^2$ have
$$P\{(U, V) = (0, 0)\} = F(\{0\})$$
$$P((U, V) \in A) = c^{-1} \iint_{(u,v) \in A} dF(u)\, dF(v)\, (v - u) \tag{7.6.1}$$
for $A \subset (-\infty, 0) \times (0, \infty)$ and define probability measures by $\mu_{0,0}(\{0\}) = 1$ and
$$\mu_{u,v}(\{u\}) = \frac{v}{v - u} \qquad \mu_{u,v}(\{v\}) = \frac{-u}{v - u} \qquad \text{for } u < 0 < v$$
then
$$\int \varphi(x)\, dF(x) = E \int \varphi(x)\, \mu_{U,V}(dx)$$
We proved the last formula when $\varphi(0) = 0$, but it is easy to see that it is true in general. Letting $\varphi \equiv 1$ in the last equation shows that the measure defined in (7.6.1) has total mass 1.
From the calculations above it follows that if we have $(U, V)$ with distribution given in (7.6.1) and an independent Brownian motion defined on the same space then $B(T_{U,V}) \overset{d}{=} X$. Sticklers for detail will notice that $T_{U,V}$ is not a stopping time for $B_t$ since $(U, V)$ is independent of the Brownian motion. This is not a serious problem since if we condition on $U = u$ and $V = v$, then $T_{u,v}$ is a stopping time, and this is good enough for all the calculations below. For instance, to compute $E(T_{U,V})$ we observe
$$E(T_{U,V}) = E\{E(T_{U,V} \mid (U, V))\} = E(-UV)$$
by Theorem 7.5.5. (7.6.1) implies
$$\begin{aligned}
E(-UV) &= \int_{-\infty}^0 dF(u)\, (-u) \int_0^\infty dF(v)\, v (v - u) c^{-1} \\
&= \int_{-\infty}^0 dF(u)\, (-u) \left\{ -u + \int_0^\infty dF(v)\, c^{-1} v^2 \right\}
\end{aligned}$$
since
$$c = \int_0^\infty v\, dF(v) = \int_{-\infty}^0 (-u)\, dF(u)$$
Using the second expression for $c$ now gives
$$E(T_{U,V}) = E(-UV) = \int_{-\infty}^0 u^2\, dF(u) + \int_0^\infty v^2\, dF(v) = EX^2$$
Exercise 7.6.1. Use Exercise 7.5.4 to conclude that $E(T_{U,V}^2) \le C EX^4$.
Remark. One can embed distributions in Brownian motion without adding random variables to the probability space: See Dubins (1968), Root (1969), or Sheu (1986).
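The mixture (7.6.1) and the identity $E(-UV) = EX^2$ can be checked by hand on a finitely supported distribution. The Python sketch below does this in exact arithmetic; it assumes integer atoms and a mean-zero probability mass function, and the names `embedding_law` and `mixture_pmf` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

def embedding_law(pmf):
    """Law of (U, V) from (7.6.1) for a finitely supported mean-zero pmf.

    pmf maps integer atoms of X to probabilities. Returns a dict mapping
    (u, v) to P((U, V) = (u, v)), with (0, 0) carrying the mass F({0}).
    """
    assert sum(x * p for x, p in pmf.items()) == 0, "X must have mean 0"
    c = sum(v * p for v, p in pmf.items() if v > 0)
    law = {}
    if pmf.get(0, 0) != 0:
        law[(0, 0)] = pmf[0]
    for u, pu in pmf.items():
        for v, pv in pmf.items():
            if u < 0 < v:
                law[(u, v)] = pu * pv * (v - u) / c
    return law

def mixture_pmf(law):
    """Distribution of B(T_{U,V}): mix the two-point laws mu_{u,v}."""
    out = {}
    for (u, v), w in law.items():
        if (u, v) == (0, 0):
            out[0] = out.get(0, 0) + w
        else:
            out[u] = out.get(u, 0) + w * Fraction(v, v - u)
            out[v] = out.get(v, 0) + w * Fraction(-u, v - u)
    return out

pmf = {-2: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}
law = embedding_law(pmf)
expected_time = sum(-u * v * w for (u, v), w in law.items())
```

For this example $c = 1/2$, the pair $(U, V)$ equals $(-2, 1)$ with probability $3/4$ and $(0, 0)$ with probability $1/4$, `mixture_pmf(law)` recovers `pmf`, and `expected_time` equals $EX^2 = 3/2$.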
From Theorem 7.6.1, it is only a small step to:
Theorem 7.6.2. Let $X_1, X_2, \ldots$ be i.i.d. with a distribution $F$, which has mean 0 and variance 1, and let $S_n = X_1 + \cdots + X_n$. There is a sequence of stopping times $T_0 = 0, T_1, T_2, \ldots$ such that $S_n \overset{d}{=} B(T_n)$ and $T_n - T_{n-1}$ are independent and identically distributed.
Proof. Let $(U_1, V_1), (U_2, V_2), \ldots$ be i.i.d. and have distribution given in (7.6.1) and let $B_t$ be an independent Brownian motion. Let $T_0 = 0$, and for $n \ge 1$, let
$$T_n = \inf\{t \ge T_{n-1} : B_t - B(T_{n-1}) \notin (U_n, V_n)\}$$
As a corollary of Theorem 7.6.2, we get:
Theorem 7.6.3. Central limit theorem. Under the hypotheses of Theorem 7.6.2, $S_n/\sqrt{n} \Rightarrow \chi$, where $\chi$ has the standard normal distribution.
Proof. If we let $W_n(t) = B(nt)/\sqrt{n} \overset{d}{=} B_t$ by Brownian scaling, then
$$S_n/\sqrt{n} \overset{d}{=} B(T_n)/\sqrt{n} = W_n(T_n/n)$$
The weak law of large numbers implies that $T_n/n \to 1$ in probability. It should be clear from this that $S_n/\sqrt{n} \Rightarrow B_1$. To fill in the details, let $\epsilon > 0$, pick $\delta$ so that
$$P(|B_t - B_1| > \epsilon \text{ for some } t \in (1 - \delta, 1 + \delta)) < \epsilon/2$$
then pick $N$ large enough so that for $n \ge N$, $P(|T_n/n - 1| > \delta) < \epsilon/2$. The last two estimates imply that for $n \ge N$
$$P(|W_n(T_n/n) - W_n(1)| > \epsilon) < \epsilon$$
Since $\epsilon$ is arbitrary, it follows that $W_n(T_n/n) - W_n(1) \to 0$ in probability. Applying the converging together lemma, Exercise 2.2.13, with $X_n = W_n(1)$ and $Z_n = W_n(T_n/n)$, the desired result follows.
Our next goal is to prove a strengthening of the central limit theorem that allows us to obtain limit theorems for functionals of $\{S_m : 0 \le m \le n\}$, e.g., $\max_{0 \le m \le n} S_m$ or $|\{m \le n : S_m > 0\}|$. Let $C[0, 1] = \{\text{continuous } \omega : [0, 1] \to \mathbf{R}\}$. When equipped with the norm $\|\omega\| = \sup\{|\omega(s)| : s \in [0, 1]\}$, $C[0, 1]$ becomes a complete separable metric space. To fit $C[0, 1]$ into the framework of Section 2.9, we want our measures defined on $\mathcal{B}$ = the σ-field generated by the open sets. Fortunately,
Lemma 7.6.4. $\mathcal{B}$ is the same as $\mathcal{C}$, the σ-field generated by the finite dimensional sets $\{\omega : \omega(t_i) \in A_i\}$.
Proof. Observe that if $\xi$ is a given continuous function
$$\{\omega : \|\omega - \xi\| \le r - 1/n\} = \cap_q \{\omega : |\omega(q) - \xi(q)| \le r - 1/n\}$$
where the intersection is over all rationals in $[0, 1]$. Letting $n \to \infty$ shows $\{\omega : \|\omega - \xi\| < r\} \in \mathcal{C}$ and $\mathcal{B} \subset \mathcal{C}$. To prove the reverse inclusion, observe that if the $A_i$ are open the finite dimensional set $\{\omega : \omega(t_i) \in A_i\}$ is open, so the $\pi$-$\lambda$ theorem implies $\mathcal{B} \supset \mathcal{C}$.
A sequence of probability measures $\mu_n$ on $C[0, 1]$ is said to converge weakly to a limit $\mu$ if for all bounded continuous functions $\varphi : C[0, 1] \to \mathbf{R}$, $\int \varphi\, d\mu_n \to \int \varphi\, d\mu$. Let $\mathbf{N}$ be the nonnegative integers and let
$$S(u) = \begin{cases} S_k & \text{if } u = k \in \mathbf{N} \\ \text{linear on } [k, k + 1] & \text{for } k \in \mathbf{N} \end{cases}$$
We will prove:
Theorem 7.6.5. Donsker's theorem. Under the hypotheses of Theorem 7.6.3,
$$S(n\,\cdot)/\sqrt{n} \Rightarrow B(\cdot),$$
i.e., the associated measures on $C[0, 1]$ converge weakly.
To motivate ourselves for the proof we will begin by extracting several corollaries. The key to each one is a consequence of the following result which follows from Theorem 2.9.1.
Theorem 7.6.6. If $\psi : C[0, 1] \to \mathbf{R}$ has the property that it is continuous $P_0$-a.s. then
$$\psi(S(n\,\cdot)/\sqrt{n}) \Rightarrow \psi(B(\cdot)) \tag{$*$}$$
Example 7.6.1. Let $\psi(\omega) = \omega(1)$. In this case, $\psi : C[0, 1] \to \mathbf{R}$ is continuous and $(*)$ is the central limit theorem.
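The rescaled path $t \mapsto S(nt)/\sqrt{n}$ appearing in Theorem 7.6.5 is straightforward to build from the steps of the walk. The following Python sketch constructs it by linear interpolation; `rescaled_path` is a hypothetical helper, and the step sequence is assumed to be nonempty.

```python
import math

def rescaled_path(steps):
    """Return t -> S(nt)/sqrt(n) on [0, 1] for the walk with the given steps.

    S(u) equals the partial sum S_k at integers k and is linear in between.
    """
    n = len(steps)
    partial = [0.0]
    for x in steps:
        partial.append(partial[-1] + x)

    def path(t: float) -> float:
        u = n * t
        k = min(int(math.floor(u)), n - 1)  # left endpoint of the segment
        frac = u - k
        return ((1 - frac) * partial[k] + frac * partial[k + 1]) / math.sqrt(n)

    return path

w = rescaled_path([1, -1, 1, 1])
```

Here `w(1.0)` returns $S_4/\sqrt{4} = 1$, which is the functional $\psi(\omega) = \omega(1)$ of Example 7.6.1 applied to the rescaled path, and `w(0.125)` returns the interpolated value $0.25$ halfway along the first step.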
Example 7.6.2. Let $\psi(\omega) = \max\{\omega(t) : 0 \le t \le 1\}$. Again, $\psi : C[0, 1] \to \mathbf{R}$ is continuous. This time $(*)$ says
$$\max_{0 \le m \le n} S_m/\sqrt{n} \Rightarrow M_1 \equiv \max_{0 \le t \le 1} B_t$$
To complete the picture, we observe that by (7.4.4) the distribution of the right-hand side is
$$P_0(M_1 \ge a) = P_0(T_a \le 1) = 2 P_0(B_1 \ge a)$$
Exercise 7.6.2. Suppose $S_n$ is one-dimensional simple random walk and let
$$R_n = 1 + \max_{m \le n} S_m - \min_{m \le n} S_m$$
be the number of points visited by time $n$. Show that $R_n/\sqrt{n} \Rightarrow$ a limit.
Example 7.6.3. Let $\psi(\omega) = \sup\{t \le 1 : \omega(t) = 0\}$. This time, $\psi$ is not continuous, for if $\omega_\epsilon$ has $\omega_\epsilon(0) = 0$, $\omega_\epsilon(1/3) = 1$, $\omega_\epsilon(2/3) = \epsilon$, $\omega_\epsilon(1) = 2$, and is linear on each interval $[j/3, (j + 1)/3]$, then $\psi(\omega_0) = 2/3$ but $\psi(\omega_\epsilon) = 0$ for $\epsilon > 0$. It is easy to see that if $\psi(\omega) < 1$ and $\omega(t)$ has positive and negative values in each interval $(\psi(\omega) - \delta, \psi(\omega))$, then $\psi$ is continuous at $\omega$. By arguments in Section 7.4.1, the last set has $P_0$ measure 1. (If the zero at $\psi(\omega)$ was isolated on the left, it would not be isolated on the right.) It follows that
$$\sup\{m \le n : S_{m-1} \cdot S_m \le 0\}/n \Rightarrow L = \sup\{t \le 1 : B_t = 0\}$$
The distribution of $L$ is given in (7.4.9). The last result shows that the arcsine law, Theorem 3.3.5, proved for simple random walks holds when the mean is 0 and variance is finite.
Example 7.6.4. Let $\psi(\omega) = |\{t \in [0, 1] : \omega(t) > a\}|$. The point $\omega \equiv a$ shows that $\psi$ is not continuous, but it is easy to see that $\psi$ is continuous at paths $\omega$ with $|\{t \in [0, 1] : \omega(t) = a\}| = 0$. Fubini's theorem implies that
$$E_0 |\{t \in [0, 1] : B_t = a\}| = \int_0^1 P_0(B_t = a)\, dt = 0$$
so $\psi$ is continuous $P_0$-a.s. With a little work, $(*)$ implies
$$|\{m \le n : S_m > a \sqrt{n}\}|/n \Rightarrow |\{t \in [0, 1] : B_t > a\}|$$
Proof. Application of $(*)$ gives that for any $a$,
$$|\{t \in [0, 1] : S(nt) > a \sqrt{n}\}| \Rightarrow |\{t \in [0, 1] : B_t > a\}|$$
To convert this into a result about $|\{m \le n : S_m > a \sqrt{n}\}|$, we note that on $\{\max_{m \le n} |X_m| \le \epsilon \sqrt{n}\}$, which by Chebyshev's inequality has a probability $\to 1$, we have
$$|\{t \in [0, 1] : S(nt) > (a + \epsilon) \sqrt{n}\}| \le \frac{1}{n} |\{m \le n : S_m > a \sqrt{n}\}| \le |\{t \in [0, 1] : S(nt) > (a - \epsilon) \sqrt{n}\}|$$
Combining this with the first conclusion of the proof and using the fact that $b \to |\{t \in [0, 1] : B_t > b\}|$ is continuous at $b = a$ with probability one, one arrives easily at the
desired conclusion.
To compute the distribution of $|\{t \in [0, 1] : B_t > 0\}|$, observe that we proved in Theorem 3.3.7 that if $S_n \overset{d}{=} -S_n$ and $P(S_m = 0) = 0$ for all $m \ge 1$, e.g., the $X_i$ have a symmetric continuous distribution, then the left-hand side converges to the arcsine law, so the right-hand side has that distribution and is the limit for any random walk with mean 0 and finite variance. The last argument uses an idea called the "invariance principle" that originated with Erdős and Kac (1946, 1947): The asymptotic behavior of functionals of $S_n$ should be the same as long as the central limit theorem applies.
Our final application is from the original paper of Donsker (1951). Erdős and Kac (1946) give the limit distribution for the case $k = 2$.
Example 7.6.5. Let $\psi(\omega) = \int_{[0,1]} \omega(t)^k\, dt$ where $k > 0$ is an integer. $\psi$ is continuous, so applying Theorem 7.6.6 gives
$$\int_0^1 (S(nt)/\sqrt{n})^k\, dt \Rightarrow \int_0^1 B_t^k\, dt$$
To convert this into a result about the original sequence, we begin by observing that if $x < y$ with $|x - y| \le \epsilon$ and $|x|, |y| \le M$, then
$$|x^k - y^k| \le \int_x^y \frac{|z|^{k+1}}{k + 1}\, dz \le \frac{\epsilon M^{k+1}}{k + 1}$$
From this, it follows that on
$$G_n(M) = \left\{ \max_{m \le n} |X_m| \le M^{-(k+2)} \sqrt{n},\ \max_{m \le n} |S_m| \le M \sqrt{n} \right\}$$
we have
$$\left| \int_0^1 (S(nt)/\sqrt{n})^k\, dt - n^{-1-(k/2)} \sum_{m=1}^n S_m^k \right| \le \frac{1}{(k + 1) M}$$
For fixed $M$, it follows from Chebyshev's inequality, Example 7.6.2, and Theorem 2.2.5 that
$$\liminf_{n \to \infty} P(G_n(M)) \ge P\left( \max_{0 \le t \le 1} |B_t| < M \right)$$
The right-hand side is close to 1 if $M$ is large, so
$$\int_0^1 (S(nt)/\sqrt{n})^k\, dt - n^{-1-(k/2)} \sum_{m=1}^n S_m^k \to 0$$
in probability, and it follows from the converging together lemma (Exercise 2.2.13) that
$$n^{-1-(k/2)} \sum_{m=1}^n S_m^k \Rightarrow \int_0^1 B_t^k\, dt$$
It is remarkable that the last result holds under the assumption that $EX_i = 0$ and $EX_i^2 = 1$, i.e., we do not need to assume that $E|X_i^k| < \infty$.
Exercise 7.6.3. When $k = 1$, the last result says that if $X_1, X_2, \ldots$ are i.i.d. with $EX_i = 0$ and $EX_i^2 = 1$, then
$$n^{-3/2} \sum_{m=1}^n (n + 1 - m) X_m \Rightarrow \int_0^1 B_t\, dt$$
(i) Show that the right-hand side has a normal distribution with mean 0 and variance
1/3. (ii) Deduce this result from the Lindeberg-Feller theorem.
Proof of Theorem 7.6.5. To simplify the proof and prepare for generalizations in the next section, let $X_{n,m}$, $1 \le m \le n$, be a triangular array of random variables, $S_{n,m} = X_{n,1} + \cdots + X_{n,m}$ and suppose $S_{n,m} = B(\tau_m^n)$. Let
$$S_{n,(u)} = \begin{cases} S_{n,m} & \text{if } u = m \in \{0, 1, \ldots, n\} \\ \text{linear for } u \in [m - 1, m] & \text{when } m \in \{1, \ldots, n\} \end{cases}$$
Lemma 7.6.7. If $\tau^n_{[ns]} \to s$ in probability for each $s \in [0, 1]$ then
$$\|S_{n,(n\,\cdot)} - B(\cdot)\| \to 0 \quad \text{in probability}$$
To make the connection with the original problem, let $X_{n,m} = X_m/\sqrt{n}$ and define $\tau_1^n, \ldots, \tau_n^n$ so that $(S_{n,1}, \ldots, S_{n,n}) \overset{d}{=} (B(\tau_1^n), \ldots, B(\tau_n^n))$. If $T_1, T_2, \ldots$ are the stopping times defined in the proof of Theorem 7.6.2, Brownian scaling implies $\tau_m^n \overset{d}{=} T_m/n$, so the hypothesis of Lemma 7.6.7 is satisfied.
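Proof. The fact that $B$ has continuous paths (and hence uniformly continuous on $[0, 1]$) implies that if $\epsilon > 0$ then there is a $\delta > 0$ so that $1/\delta$ is an integer and
$$\text{(a)} \quad P(|B_t - B_s| < \epsilon \text{ for all } 0 \le s \le 1,\ |t - s| < 2\delta) > 1 - \epsilon$$
The hypothesis of Lemma 7.6.7 implies that if $n \ge N_\delta$ then
$$P(|\tau^n_{[nk\delta]} - k\delta| < \delta \text{ for } k = 1, 2, \ldots, 1/\delta) \ge 1 - \epsilon$$
Since $m \to \tau_m^n$ is increasing, it follows that if $s \in ((k - 1)\delta, k\delta)$
$$\tau^n_{[ns]} - s \ge \tau^n_{[n(k-1)\delta]} - k\delta$$
$$\tau^n_{[ns]} - s \le \tau^n_{[nk\delta]} - (k - 1)\delta$$
so if $n \ge N_\delta$,
$$\text{(b)} \quad P\left( \sup_{0 \le s \le 1} |\tau^n_{[ns]} - s| < 2\delta \right) \ge 1 - \epsilon$$
When the events in (a) and (b) occur
$$\text{(c)} \quad |S_{n,m} - B_{m/n}| < \epsilon \text{ for all } m \le n$$
To deal with $t = (m + \theta)/n$ with $0 < \theta < 1$, we observe that
$$|S_{n,(nt)} - B_t| \le (1 - \theta) |S_{n,m} - B_{m/n}| + \theta |S_{n,m+1} - B_{(m+1)/n}| + (1 - \theta) |B_{m/n} - B_t| + \theta |B_{(m+1)/n} - B_t|$$
Using (c) on the first two terms and (a) on the last two, we see that if $n \ge N_\delta$ and $1/n < 2\delta$, then $\|S_{n,(n\,\cdot)} - B(\cdot)\| < 2\epsilon$ with probability $\ge 1 - 2\epsilon$. Since $\epsilon$ is arbitrary, the proof of Lemma 7.6.7 is complete.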
To get Theorem 7.6.5 now, we have to show:
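Lemma 7.6.8. If $\varphi$ is bounded and continuous then $E\varphi(S_{n,(n\,\cdot)}) \to E\varphi(B(\cdot))$.
Proof. For fixed $\epsilon > 0$, let $G_\delta = \{\omega : \text{if } \|\omega - \omega'\| < \delta \text{ then } |\varphi(\omega) - \varphi(\omega')| < \epsilon\}$. Since $\varphi$ is continuous, $G_\delta \uparrow C[0, 1]$ as $\delta \downarrow 0$. Let $\Delta = \|S_{n,(n\,\cdot)} - B(\cdot)\|$. The desired result now follows from Lemma 7.6.7 and the trivial inequality
$$|E\varphi(S_{n,(n\,\cdot)}) - E\varphi(B(\cdot))| \le \epsilon + (2 \sup|\varphi(\omega)|) \{P(G_\delta^c) + P(\Delta \ge \delta)\}$$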
To accommodate our final example, we need a trivial generalization of Theorem 7.6.5. Let $C[0, \infty) = \{\text{continuous } \omega : [0, \infty) \to \mathbf{R}\}$ and let $\mathcal{C}[0, \infty)$ be the σ-field generated by the finite dimensional sets. Given a probability measure $\mu$ on $C[0, \infty)$, there is a corresponding measure $\pi_M \mu$ on $C[0, M] = \{\text{continuous } \omega : [0, M] \to \mathbf{R}\}$ (with $\mathcal{C}[0, M]$ the σ-field generated by the finite dimensional sets) obtained by "cutting off the paths at time $M$." Let $(\psi_M \omega)(t) = \omega(t)$ for $t \in [0, M]$ and let $\pi_M \mu = \mu \circ \psi_M^{-1}$. We say that a sequence of probability measures $\mu_n$ on $C[0, \infty)$ converges weakly to $\mu$ if for all $M$, $\pi_M \mu_n$ converges weakly to $\pi_M \mu$ on $C[0, M]$, the last concept being defined by a trivial extension of the definitions for $M = 1$. With these definitions, it is easy to conclude:
Theorem 7.6.9. $S(n\,\cdot)/\sqrt{n} \Rightarrow B(\cdot)$, i.e., the associated measures on $C[0, \infty)$ converge weakly.
Proof. By definition, all we have to show is that weak convergence occurs on $C[0, M]$ for all $M < \infty$. The proof of Theorem 7.6.5 works in the same way when 1 is replaced by $M$.
Example 7.6.6. Let $N_n = \inf\{m : S_m \ge \sqrt{n}\}$ and $T_1 = \inf\{t : B_t \ge 1\}$. Since $\psi(\omega) = T_1(\omega) \wedge 1$ is continuous $P_0$-a.s. on $C[0, 1]$ and the distribution of $T_1$ is continuous, it follows from Theorem 7.6.6 that for $0 < t < 1$
$$P(N_n \le nt) \to P(T_1 \le t)$$
Repeating the last argument with 1 replaced by $M$ and using Theorem 7.6.9 shows that the last conclusion holds for all $t$.
7.7 Empirical Distributions, Brownian Bridge
Let $X_1, X_2, \ldots$ be i.i.d. with distribution $F$. Theorem 1.7.7 shows that with probability one, the empirical distribution
$$\hat{F}_n(x) = \frac{1}{n} |\{m \le n : X_m \le x\}|$$
converges uniformly to F (x). In this section, we will investigate the rate of convergence when F is continuous. We impose this restriction so we can reduce to the case
of a uniform distribution on (0,1) by setting Yn = F (Xn ). (See Exercise 1.1.9.) Since
x → F (x) is nondecreasing and continuous and no observations land in intervals of
constancy of $F$, it is easy to see that if we let
$$\hat{G}_n(y) = \frac{1}{n} |\{m \le n : Y_m \le y\}|$$
then
$$\sup_x |\hat{F}_n(x) - F(x)| = \sup_{0 < y < 1} |\hat{G}_n(y) - y|$$
For the rest of the section then, we will assume $Y_1, Y_2, \ldots$ is i.i.d. uniform on $(0, 1)$. To be able to apply Donsker's theorem, we will transform the problem. Put the observations $Y_1, \ldots, Y_n$ in increasing order: $U_1^n < U_2^n < \ldots < U_n^n$. I claim that
$$\begin{aligned}
\sup_{0 < y < 1} \left( \hat{G}_n(y) - y \right) &= \sup_{1 \le m \le n} \left( \frac{m}{n} - U_m^n \right) \\
\inf_{0 < y < 1} \left( \hat{G}_n(y) - y \right) &= \inf_{1 \le m \le n} \left( \frac{m - 1}{n} - U_m^n \right)
\end{aligned} \tag{7.7.1}$$
since the sup occurs at a jump of $\hat{G}_n$ and the inf right before a jump. We will show that
$$D_n \equiv n^{1/2} \sup_{0 < y < 1} |\hat{G}_n(y) - y|$$
has a limit, so the extra $-1/n$ in the inf does not make any difference.
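Formula (7.7.1) reduces the computation of $D_n$ to a pass over the order statistics. The Python sketch below implements that reduction for a sample assumed to lie in $(0, 1)$; the function name `ks_statistic` is a hypothetical label for $D_n$ chosen for this illustration.

```python
import math

def ks_statistic(sample):
    """D_n = sqrt(n) * sup_y |G_n(y) - y| for a sample from (0, 1).

    Uses (7.7.1): the sup of G_n(y) - y is attained at a jump, the inf
    just before one, so only the order statistics need to be examined.
    """
    n = len(sample)
    ordered = sorted(sample)
    # enumerate is 0-based, so (m + 1)/n and m/n play the roles of m/n and (m - 1)/n
    d_plus = max((m + 1) / n - u for m, u in enumerate(ordered))
    d_minus = min(m / n - u for m, u in enumerate(ordered))
    return math.sqrt(n) * max(d_plus, -d_minus)
```

For the sample $(0.7, 0.1, 0.4)$ the sup in (7.7.1) is $1 - 0.7 = 0.3$ and the inf is $0 - 0.1 = -0.1$, so the function returns $0.3\sqrt{3}$.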
Our third and final maneuver is to give a special construction of the order statistics $U_1^n < U_2^n < \ldots < U_n^n$. Let $W_1, W_2, \ldots$ be i.i.d. with $P(W_i > t) = e^{-t}$ and let $Z_n = W_1 + \cdots + W_n$.
Lemma 7.7.1. $\{U_k^n : 1 \le k \le n\} \overset{d}{=} \{Z_k/Z_{n+1} : 1 \le k \le n\}$
Proof. We change variables $v = r(t)$, where $v_i = t_i/t_{n+1}$ for $i \le n$, $v_{n+1} = t_{n+1}$. The inverse function is
$$s(v) = (v_1 v_{n+1}, \ldots, v_n v_{n+1}, v_{n+1})$$
which has matrix of partial derivatives $\partial s_i/\partial v_j$ given by
$$\begin{pmatrix}
v_{n+1} & 0 & \ldots & 0 & v_1 \\
0 & v_{n+1} & \ldots & 0 & v_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \ldots & v_{n+1} & v_n \\
0 & 0 & \ldots & 0 & 1
\end{pmatrix}$$
The determinant of this matrix is $v_{n+1}^n$, so if we let $W = (V_1, \ldots, V_{n+1}) = r(Z_1, \ldots, Z_{n+1})$, the change of variables formula implies $W$ has joint density
$$f_W(v_1, \ldots, v_n, v_{n+1}) = \left( \prod_{m=1}^n \lambda e^{-\lambda v_{n+1} (v_m - v_{m-1})} \right) \lambda e^{-\lambda v_{n+1} (1 - v_n)}\, v_{n+1}^n$$
To find the joint density of $V = (V_1, \ldots, V_n)$, we simplify the preceding formula and integrate out the last coordinate to get
$$f_V(v_1, \ldots, v_n) = \int_0^\infty \lambda^{n+1} v_{n+1}^n e^{-\lambda v_{n+1}}\, dv_{n+1} = n!$$
for $0 < v_1 < v_2 < \ldots < v_n < 1$, which is the desired joint density.
We turn now to the limit law for $D_n$. As argued above, it suffices to consider
$$\begin{aligned}
D_n' &= n^{1/2} \max_{1 \le m \le n} \left| \frac{Z_m}{Z_{n+1}} - \frac{m}{n} \right| \\
&= \frac{n}{Z_{n+1}} \max_{1 \le m \le n} \left| \frac{Z_m}{n^{1/2}} - \frac{m}{n} \cdot \frac{Z_{n+1}}{n^{1/2}} \right| \\
&= \frac{n}{Z_{n+1}} \max_{1 \le m \le n} \left| \frac{Z_m - m}{n^{1/2}} - \frac{m}{n} \cdot \frac{Z_{n+1} - n}{n^{1/2}} \right|
\end{aligned} \tag{7.7.2}$$
If we let
$$B_n(t) = \begin{cases} (Z_m - m)/n^{1/2} & \text{if } t = m/n \text{ with } m \in \{0, 1, \ldots, n\} \\ \text{linear} & \text{on } [(m - 1)/n, m/n] \end{cases}$$
then
$$D_n' = \frac{n}{Z_{n+1}} \max_{0 \le t \le 1} \left| B_n(t) - t \left\{ B_n(1) + \frac{Z_{n+1} - Z_n}{n^{1/2}} \right\} \right|$$
The strong law of large numbers implies $Z_{n+1}/n \to 1$ a.s., so the first factor will disappear in the limit. To find the limit of the second, we observe that Donsker's theorem, Theorem 7.6.5, implies $B_n(\cdot) \Rightarrow B(\cdot)$, a Brownian motion, and computing second moments shows
$$(Z_{n+1} - Z_n)/n^{1/2} \to 0 \quad \text{in probability}$$
$\psi(\omega) = \max_{0 \le t \le 1} |\omega(t) - t\omega(1)|$ is a continuous function from $C[0, 1]$ to $\mathbf{R}$, so it follows from Donsker's theorem that:
Theorem 7.7.2. $D_n \Rightarrow \max_{0 \le t \le 1} |B_t - tB_1|$, where $B_t$ is a Brownian motion starting at 0.
Remark. Doob (1949) suggested this approach to deriving results of Kolmogorov and Smirnov, which was later justified by Donsker (1952). Our proof follows Breiman (1968).
To identify the distribution of the limit in Theorem 7.7.2, we will first prove
$$\{B_t - tB_1, 0 \le t \le 1\} \overset{d}{=} \{B_t, 0 \le t \le 1 \mid B_1 = 0\} \tag{7.7.3}$$
a process we will denote by $B_t^0$ and call the Brownian bridge. The event $B_1 = 0$ has probability 0, but it is easy to see what the conditional probability should mean. If $0 = t_0 < t_1 < \ldots < t_n < t_{n+1} = 1$, $x_0 = 0$, $x_{n+1} = 0$, and $x_1, \ldots, x_n \in \mathbf{R}$, then
$$P(B(t_1) = x_1, \ldots, B(t_n) = x_n \mid B(1) = 0) = \frac{1}{p_1(0, 0)} \prod_{m=1}^{n+1} p_{t_m - t_{m-1}}(x_{m-1}, x_m) \tag{7.7.4}$$
where $p_t(x, y) = (2\pi t)^{-1/2} \exp(-(y - x)^2/2t)$.
Proof of (7.7.3). Formula (7.7.4) shows that the f.d.d.'s of $B_t^0$ are multivariate normal and have mean 0. Since $B_t - tB_1$ also has this property, it suffices to show that the covariances are equal. We begin with the easier computation. If $s < t$ then
$$E((B_s - sB_1)(B_t - tB_1)) = s - st - st + st = s(1 - t) \tag{7.7.5}$$
For the other process, $P(B_s^0 = x, B_t^0 = y)$ is
$$\frac{\exp(-x^2/2s)}{(2\pi s)^{1/2}} \cdot \frac{\exp(-(y - x)^2/2(t - s))}{(2\pi (t - s))^{1/2}} \cdot \frac{\exp(-y^2/2(1 - t))}{(2\pi (1 - t))^{1/2}} \cdot (2\pi)^{1/2}$$
$$= (2\pi)^{-1} (s(t - s)(1 - t))^{-1/2} \exp(-(ax^2 + 2bxy + cy^2)/2)$$
where
$$a = \frac{1}{s} + \frac{1}{t - s} = \frac{t}{s(t - s)} \qquad b = -\frac{1}{t - s} \qquad c = \frac{1}{t - s} + \frac{1}{1 - t} = \frac{1 - s}{(t - s)(1 - t)}$$
Recalling the discussion at the end of Section 2.9 and noticing
$$\begin{pmatrix} \dfrac{t}{s(t - s)} & \dfrac{-1}{t - s} \\ \dfrac{-1}{t - s} & \dfrac{1 - s}{(t - s)(1 - t)} \end{pmatrix}^{-1} = \begin{pmatrix} s(1 - s) & s(1 - t) \\ s(1 - t) & t(1 - t) \end{pmatrix}$$
(multiply the matrices!) shows (7.7.3) holds.
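The instruction to "multiply the matrices" can be carried out in exact arithmetic for any particular $0 < s < t < 1$. The following Python sketch is one way to do so; the helper names and the sample values $s = 1/4$, $t = 2/3$ are arbitrary choices made for this illustration.

```python
from fractions import Fraction

def bridge_matrices(s, t):
    """Quadratic-form matrix [[a, b], [b, c]] and the claimed covariance
    matrix of (B^0_s, B^0_t) for 0 < s < t < 1."""
    a = Fraction(1) / s + Fraction(1) / (t - s)
    b = -Fraction(1) / (t - s)
    c = Fraction(1) / (t - s) + Fraction(1) / (1 - t)
    precision = [[a, b], [b, c]]
    covariance = [[s * (1 - s), s * (1 - t)], [s * (1 - t), t * (1 - t)]]
    return precision, covariance

def matmul2(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

precision, covariance = bridge_matrices(Fraction(1, 4), Fraction(2, 3))
product = matmul2(precision, covariance)
```

Here `product` is the $2 \times 2$ identity matrix, and the off-diagonal covariance entry $s(1 - t)$ agrees with (7.7.5).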
Figure 7.2: Picture of the infinite series in (7.7.6). Note that the array of + and − is antisymmetric when seen from a or b.

Our final step in investigating the limit distribution of $D_n$ is to compute the distribution of $\max_{0 \le t \le 1} |B_t^0|$. To do this, we first prove the following result.
Theorem 7.7.3. The density function of $B_t$ on $\{T_a \wedge T_b > t\}$ is
$$P_x(T_a \wedge T_b > t, B_t = y) = \sum_{n=-\infty}^{\infty} \big[ P_x(B_t = y + 2n(b-a)) - P_x(B_t = 2b - y + 2n(b-a)) \big] \tag{7.7.6}$$
Proof. We begin by observing that if $A \subset (a,b)$
$$P_x(T_a \wedge T_b > t, B_t \in A) = P_x(B_t \in A) - P_x(T_a < T_b, T_a < t, B_t \in A) - P_x(T_b < T_a, T_b < t, B_t \in A) \tag{7.7.7}$$
If we let $\rho_a(y) = 2a - y$ be reflection through $a$ and observe that $\{T_a < T_b\}$ is $\mathcal{F}(T_a)$ measurable, then it follows from the proof of (7.4.5) that
Px (Ta < Tb , Ta < t, Bt ∈ A) = Px (Ta < Tb , Bt ∈ ρa A)
where ρa A = {ρa (y ) : y ∈ A}. To get rid of the Ta < Tb , we observe that
Px (Ta < Tb , Bt ∈ ρa A) = Px (Bt ∈ ρa A) − Px (Tb < Ta , Bt ∈ ρa A)
Noticing that Bt ∈ ρa A and Tb < Ta imply Tb < t and using the reﬂection principle
again gives
Px (Tb < Ta , Bt ∈ ρa A) = Px (Tb < Ta , Bt ∈ ρb ρa A)
= Px (Bt ∈ ρb ρa A) − Px (Ta < Tb , Bt ∈ ρb ρa A)
Repeating the last two calculations $n$ more times gives
$$P_x(T_a < T_b, B_t \in \rho_a A) = \sum_{m=0}^{n} \big[ P_x(B_t \in \rho_a(\rho_b\rho_a)^m A) - P_x(B_t \in (\rho_b\rho_a)^{m+1} A) \big] + P_x(T_a < T_b, B_t \in (\rho_b\rho_a)^{n+1} A)$$
Each pair of reflections pushes $A$ further away from 0, so letting $n \to \infty$ shows
$$P_x(T_a < T_b, B_t \in \rho_a A) = \sum_{m=0}^{\infty} \big[ P_x(B_t \in \rho_a(\rho_b\rho_a)^m A) - P_x(B_t \in (\rho_b\rho_a)^{m+1} A) \big]$$
Interchanging the roles of $a$ and $b$ gives
$$P_x(T_b < T_a, B_t \in \rho_b A) = \sum_{m=0}^{\infty} \big[ P_x(B_t \in \rho_b(\rho_a\rho_b)^m A) - P_x(B_t \in (\rho_a\rho_b)^{m+1} A) \big]$$
Combining the last two expressions with (7.7.7) and using $\rho_c^{-1} = \rho_c$, $(\rho_a\rho_b)^{-1} = \rho_b^{-1}\rho_a^{-1}$ gives
$$P_x(T_a \wedge T_b > t, B_t \in A) = \sum_{n=-\infty}^{\infty} \big[ P_x(B_t \in (\rho_b\rho_a)^n A) - P_x(B_t \in \rho_a(\rho_b\rho_a)^n A) \big]$$
To prepare for applications, let $A = (u,v)$ where $a < u < v < b$, notice that $\rho_b\rho_a(y) = y + 2(b-a)$, and change variables in the second sum to get
$$P_x(T_a \wedge T_b > t, u < B_t < v) = \sum_{n=-\infty}^{\infty} \big\{ P_x(u + 2n(b-a) < B_t < v + 2n(b-a)) - P_x(2b - v + 2n(b-a) < B_t < 2b - u + 2n(b-a)) \big\} \tag{7.7.8}$$
Letting $u = y - \epsilon$, $v = y + \epsilon$, dividing both sides by $2\epsilon$, and letting $\epsilon \to 0$ (leaving it to the reader to check that the dominated convergence theorem applies) gives the
desired result.

Setting $x = y = 0$, $t = 1$, and dividing by $(2\pi)^{-1/2} = P_0(B_1 = 0)$, we get a result for the Brownian bridge $B_t^0$:
$$P_0\Big(a < \min_{0 \le t \le 1} B_t^0 < \max_{0 \le t \le 1} B_t^0 < b\Big) = \sum_{n=-\infty}^{\infty} \Big[ e^{-(2n(b-a))^2/2} - e^{-(2b + 2n(b-a))^2/2} \Big] \tag{7.7.9}$$
Taking $a = -b$, we have
$$P_0\Big(\max_{0 \le t \le 1} |B_t^0| < b\Big) = \sum_{m=-\infty}^{\infty} (-1)^m e^{-2m^2b^2} \tag{7.7.10}$$
This formula gives the distribution of the Kolmogorov-Smirnov statistic, which can be used to test if an i.i.d. sequence $X_1, \ldots, X_n$ has distribution $F$. To do this, we transform the data to $F(X_n)$ and look at the maximum discrepancy between the empirical distribution and the uniform. (7.7.10) tells us the distribution of the error when the $X_i$ have distribution $F$.
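The alternating series for the maximum of the absolute value of the Brownian bridge is easy to evaluate numerically. The following is a minimal Python sketch under stated assumptions: the function name `bridge_sup_cdf` and the truncation parameter `terms` are hypothetical, and the series is simply cut off after a fixed number of terms, which is adequate for moderate b but converges slowly as b approaches 0.

```python
import math


def bridge_sup_cdf(b, terms=100):
    """Approximate P(max_{0<=t<=1} |B^0_t| < b) by truncating the series
    sum over all integers m of (-1)^m exp(-2 m^2 b^2)."""
    if b <= 0:
        return 0.0
    total = 1.0  # the m = 0 term
    for m in range(1, terms + 1):
        # the terms for m and -m are equal, hence the factor 2
        total += 2 * (-1) ** m * math.exp(-2 * m * m * b * b)
    return total


for b in (0.5, 1.0, 1.5, 2.0):
    print(b, bridge_sup_cdf(b))
```

The printed values increase with b, as a distribution function should.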
(7.7.9) gives the joint distribution of the maximum and minimum of Brownian
bridge. In theory, one can let a → −∞ in this formula to ﬁnd the distribution of the
maximum, but in practice it is easier to start over again.
Exercise 7.7.1. Use Exercise 7.4.6 and the reasoning that led to (7.7.9) to conclude
$$P\Big(\max_{0 \le t \le 1} B_t^0 > b\Big) = \exp(-2b^2)$$

7.8 Laws of the Iterated Logarithm*

Our first goal is to show:

Theorem 7.8.1. LIL for Brownian motion.
$$\limsup_{t \to \infty} B_t/(2t \log\log t)^{1/2} = 1 \quad \text{a.s.}$$
Here LIL is short for “law of the iterated logarithm,” a name that refers to the log log t
in the denominator. Once Theorem 7.8.1 is established, we can use the Skorokhod
representation to prove the analogous result for random walks with mean 0 and ﬁnite
variance.
Proof. The key to the proof is (7.4.4).
$$P_0\Big(\max_{0 \le s \le 1} B_s > a\Big) = P_0(T_a \le 1) = 2P_0(B_1 \ge a) \tag{7.8.1}$$
To bound the right-hand side, we use Theorem 1.1.4.
$$\int_x^{\infty} \exp(-y^2/2)\,dy \le \frac{1}{x} \exp(-x^2/2) \tag{7.8.2}$$
$$\int_x^{\infty} \exp(-y^2/2)\,dy \sim \frac{1}{x} \exp(-x^2/2) \quad \text{as } x \to \infty \tag{7.8.3}$$
where $f(x) \sim g(x)$ means $f(x)/g(x) \to 1$ as $x \to \infty$. The last result and Brownian
scaling imply that
$$P_0(B_t > (tf(t))^{1/2}) \sim \kappa f(t)^{-1/2} \exp(-f(t)/2)$$
where $\kappa = (2\pi)^{-1/2}$ is a constant that we will try to ignore below. The last result implies that if $\epsilon > 0$, then
$$\sum_{n=1}^{\infty} P_0(B_n > (nf(n))^{1/2}) \begin{cases} < \infty & \text{when } f(n) = (2+\epsilon)\log n \\ = \infty & \text{when } f(n) = (2-\epsilon)\log n \end{cases}$$
and hence by the Borel-Cantelli lemma that
$$\limsup_{n \to \infty} B_n/(2n\log n)^{1/2} \le 1 \quad \text{a.s.}$$
To replace log n by log log n, we have to look along exponentially growing sequences.
Let $t_n = \alpha^n$, where $\alpha > 1$.
$$P_0\Big(\max_{t_n \le s \le t_{n+1}} B_s > (t_n f(t_n))^{1/2}\Big) \le P_0\Big(\max_{0 \le s \le t_{n+1}} B_s/t_{n+1}^{1/2} > (f(t_n)/\alpha)^{1/2}\Big) \le 2\kappa (f(t_n)/\alpha)^{-1/2} \exp(-f(t_n)/2\alpha)$$
by (7.8.1) and (7.8.2). If $f(t) = 2\alpha^2 \log\log t$, then
$$\log\log t_n = \log(n \log\alpha) = \log n + \log\log\alpha$$
so $\exp(-f(t_n)/2\alpha) \le C_\alpha n^{-\alpha}$, where $C_\alpha$ is a constant that depends only on $\alpha$, and hence
$$\sum_{n=1}^{\infty} P_0\Big(\max_{t_n \le s \le t_{n+1}} B_s > (t_n f(t_n))^{1/2}\Big) < \infty$$
Since $t \to (tf(t))^{1/2}$ is increasing and $\alpha > 1$ is arbitrary, it follows that
$$\limsup B_t/(2t \log\log t)^{1/2} \le 1 \tag{7.8.4}$$
To prove the other half of Theorem 7.8.1, again let $t_n = \alpha^n$, but this time $\alpha$ will be large, since to get independent events, we will look at
$$P_0\big(B(t_{n+1}) - B(t_n) > (t_{n+1} f(t_{n+1}))^{1/2}\big) = P_0\big(B_1 > (\beta f(t_{n+1}))^{1/2}\big)$$
where $\beta = t_{n+1}/(t_{n+1} - t_n) = \alpha/(\alpha - 1) > 1$. The last quantity is
$$\ge \frac{\kappa}{2} (\beta f(t_{n+1}))^{-1/2} \exp(-\beta f(t_{n+1})/2)$$
if $n$ is large by (7.8.3). If $f(t) = (2/\beta^2) \log\log t$, then $\log\log t_n = \log n + \log\log\alpha$ so
$$\exp(-\beta f(t_{n+1})/2) \ge C_\alpha n^{-1/\beta}$$
where $C_\alpha$ is a constant that depends only on $\alpha$, and hence
$$\sum_{n=1}^{\infty} P_0\big(B(t_{n+1}) - B(t_n) > (t_{n+1} f(t_{n+1}))^{1/2}\big) = \infty$$
Since the events in question are independent, it follows from the second Borel-Cantelli lemma that
$$B(t_{n+1}) - B(t_n) > ((2/\beta^2)\, t_{n+1} \log\log t_{n+1})^{1/2} \quad \text{i.o.} \tag{7.8.5}$$
From (7.8.4), we get
$$\limsup_{n \to \infty} B(t_n)/(2t_n \log\log t_n)^{1/2} \le 1 \tag{7.8.6}$$
Since $t_n = t_{n+1}/\alpha$ and $t \to \log\log t$ is increasing, combining (7.8.5) and (7.8.6), and recalling $\beta = \alpha/(\alpha - 1)$ gives
$$\limsup_{n \to \infty} B(t_{n+1})/(2t_{n+1} \log\log t_{n+1})^{1/2} \ge \frac{\alpha - 1}{\alpha} - \frac{1}{\alpha^{1/2}}$$
Letting $\alpha \to \infty$ now gives the desired lower bound, and the proof of Theorem 7.8.1 is complete.
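The Gaussian tail estimates (7.8.2) and (7.8.3) that drive the proof above can be checked numerically. The following Python sketch is an illustration only: the function names are assumptions, and the tail integral is expressed through the standard identity that the integral of exp(−y²/2) from x to ∞ equals sqrt(π/2)·erfc(x/√2).

```python
import math


def gaussian_tail_integral(x):
    """Integral of exp(-y^2/2) over [x, infinity), via the complementary error function."""
    return math.sqrt(math.pi / 2) * math.erfc(x / math.sqrt(2))


def tail_upper_bound(x):
    """Right-hand side of (7.8.2): exp(-x^2/2) / x, for x > 0."""
    return math.exp(-x * x / 2) / x


for x in (0.5, 1.0, 2.0, 5.0, 10.0):
    ratio = gaussian_tail_integral(x) / tail_upper_bound(x)
    # The ratio stays below 1 (the bound) and approaches 1 (the asymptotic).
    print(x, ratio)
```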
Exercise 7.8.1. Let $t_k = \exp(e^k)$. Show that
$$\limsup_{k \to \infty} B(t_k)/(2t_k \log\log\log t_k)^{1/2} = 1 \quad \text{a.s.}$$
Theorem 7.2.6 implies that $X_t = tB(1/t)$ is a Brownian motion. Changing variables and using Theorem 7.8.1, we conclude
$$\limsup_{t \to 0} B_t/(2t \log\log(1/t))^{1/2} = 1 \quad \text{a.s.} \tag{7.8.7}$$
To take a closer look at the local behavior of Brownian paths, we note that Blumenthal's 0-1 law, Theorem 7.2.3, implies $P_0(B_t < h(t) \text{ for all } t \text{ sufficiently small}) \in \{0, 1\}$.
h is said to belong to the upper class if the probability is 1, the lower class if it is 0.

Theorem 7.8.2. Kolmogorov's test. If $h(t) \uparrow$ and $t^{-1/2}h(t) \downarrow$ then $h$ is upper or lower class according as
$$\int_0^1 t^{-3/2} h(t) \exp(-h^2(t)/2t)\,dt \quad \text{converges or diverges}$$
The first proof of this was given by Petrovsky (1935). Recalling (7.4.8), we see that the integrand is the probability of hitting $h(t)$ at time $t$. To see what Theorem 7.8.2 says, define $\lg_k(t) = \log(\lg_{k-1}(t))$ for $k \ge 2$ and $t > a_k = \exp(a_{k-1})$, where $\lg_1(t) = \log(t)$ and $a_1 = 0$. A little calculus shows that when $n \ge 4$,
$$h(t) = \Big\{ 2t \Big[ \lg_2(1/t) + \frac{3}{2} \lg_3(1/t) + \sum_{m=4}^{n-1} \lg_m(1/t) + (1+\epsilon) \lg_n(1/t) \Big] \Big\}^{1/2}$$
is upper or lower class according as $\epsilon > 0$ or $\epsilon \le 0$.
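The iterated logarithms and the boundary function h can be written out directly. The following Python sketch is illustrative: the names `lg` and `h` mirror the notation above, while the default n = 4 and the sample value t = 1e-10 are assumptions chosen so that every iterated logarithm is defined.

```python
import math


def lg(k, t):
    """Iterated logarithm: lg_1 = log and lg_k = log(lg_{k-1}).
    math.log raises ValueError when t is too small for lg_k to be defined."""
    value = t
    for _ in range(k):
        value = math.log(value)
    return value


def h(t, eps, n=4):
    """Boundary function from the discussion of Kolmogorov's test, for n >= 4."""
    u = 1.0 / t
    bracket = lg(2, u) + 1.5 * lg(3, u)
    bracket += sum(lg(m, u) for m in range(4, n))  # empty sum when n = 4
    bracket += (1 + eps) * lg(n, u)
    return math.sqrt(2 * t * bracket)


t = 1e-10  # hypothetical small time
print(h(t, 0.0), h(t, 0.1), math.sqrt(2 * t * lg(2, 1 / t)))
```

The last printed number is the law of the iterated logarithm boundary from (7.8.7); the other two show how the lower-order terms enlarge it slightly.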
Approximating $h$ from above by piecewise constant functions, it is easy to show that if the integral in Theorem 7.8.2 converges, $h(t)$ is an upper class function. The proof of the other direction is much more difficult; see Motoo (1959) or Section 4.12 of Itô and McKean (1965).
Turning to random walk, we will prove a result due to Hartman and Wintner (1941):

Theorem 7.8.3. If $X_1, X_2, \ldots$ are i.i.d. with $EX_i = 0$ and $EX_i^2 = 1$ then
$$\limsup_{n \to \infty} S_n/(2n \log\log n)^{1/2} = 1$$
Proof. By Theorem 7.6.2, we can write $S_n = B(T_n)$ with $T_n/n \to 1$ a.s. As in the proof of Donsker's theorem, this is all we will use in the argument below. Theorem 7.8.3 will follow from Theorem 7.8.1 once we show
$$(S_{[t]} - B_t)/(t \log\log t)^{1/2} \to 0 \quad \text{a.s.} \tag{7.8.8}$$
To do this, we begin by observing that if $\epsilon > 0$ and $t \ge t_o(\omega)$
$$T_{[t]} \in [t/(1+\epsilon), t(1+\epsilon)] \tag{7.8.9}$$
To estimate $S_{[t]} - B_t$, we let $M(t) = \sup\{|B(s) - B(t)| : t/(1+\epsilon) \le s \le t(1+\epsilon)\}$. To control the last quantity, we let $t_k = (1+\epsilon)^k$ and notice that if $t_k \le t \le t_{k+1}$
$$M(t) \le \sup\{|B(s) - B(t)| : t_{k-1} \le s, t \le t_{k+2}\} \le 2 \sup\{|B(s) - B(t_{k-1})| : t_{k-1} \le s \le t_{k+2}\}$$
Noticing $t_{k+2} - t_{k-1} = \delta t_{k-1}$, where $\delta = (1+\epsilon)^3 - 1$, scaling implies
$$P\Big(\max_{t_{k-1} \le s \le t_{k+2}} |B(s) - B(t_{k-1})| > (3\delta t_{k-1} \log\log t_{k-1})^{1/2}\Big) = P\Big(\max_{0 \le r \le 1} |B(r)| > (3 \log\log t_{k-1})^{1/2}\Big) \le 4\kappa (3 \log\log t_{k-1})^{-1/2} \exp(-3 \log\log t_{k-1}/2)$$
by a now familiar application of (7.8.1) and (7.8.2), applied to both $B$ and $-B$. Summing over $k$ and using (7.8.9) gives
$$\limsup_{t \to \infty} (S_{[t]} - B_t)/(t \log\log t)^{1/2} \le (3\delta)^{1/2}$$
If we recall $\delta = (1+\epsilon)^3 - 1$ and let $\epsilon \downarrow 0$, (7.8.8) follows and the proof is complete.

Exercise 7.8.2. Show that if $E|X_i|^\alpha = \infty$ for some $\alpha < 2$ then
$$\limsup_{n \to \infty} |X_n|/n^{1/\alpha} = \infty \quad \text{a.s.}$$
so the law of the iterated logarithm fails.
Strassen (1965) has shown an exact converse. If Theorem 7.8.3 holds then $EX_i = 0$ and $EX_i^2 = 1$. Another one of his contributions to this subject is

Theorem 7.8.4. Strassen's (1964) invariance principle. Let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = 0$ and $EX_i^2 = 1$, let $S_n = X_1 + \cdots + X_n$, and let $S(n\cdot)$ be the usual linear interpolation. The limit set (i.e., the collection of limits of convergent subsequences) of
$$Z_n(\cdot) = (2n \log\log n)^{-1/2} S(n\cdot) \quad \text{for } n \ge 3$$
is $K = \{f : f(x) = \int_0^x g(y)\,dy \text{ with } \int_0^1 g(y)^2\,dy \le 1\}$.

Jensen's inequality implies $f(1)^2 \le \int_0^1 g(y)^2\,dy \le 1$ with equality if and only if $f(t) = t$, so Theorem 7.8.4 contains Theorem 7.8.3 as a special case and provides some information about how the large value of $S_n$ came about.
Exercise 7.8.3. Give a direct proof that, under the hypotheses of Theorem 7.8.4, the limit set of $\{S_n/(2n \log\log n)^{1/2}\}$ is $[-1, 1]$.

Appendix A Measure Theory

This Appendix gives a complete treatment of the results from measure theory that we will need.

A.1 Lebesgue-Stieltjes Measures

To prove the existence of Lebesgue measure (and some related more general measures), we will use the Carathéodory extension theorem, Theorem A.1.1. To state that result, we need several definitions in addition to the ones given in Section 1 of Chapter 1.
A collection A of subsets of Ω is called an algebra (or ﬁeld) if A, B ∈ A implies Ac
and A ∪ B are in A. Since A ∩ B = (Ac ∪ B c )c , it follows that A ∩ B ∈ A. Obviously
a σ algebra is an algebra. Two cases in which the converse is false are:
Example A.1.1. Ω = Z = the integers, A = the collection of A ⊂ Z so that A or
Ac is ﬁnite.
Example A.1.2. $\Omega = \mathbb{R}$, $\mathcal{A}$ = the collection of sets of the form
$$\cup_{i=1}^{k} (a_i, b_i] \quad \text{where } -\infty \le a_i < b_i \le \infty$$
Exercise A.1.1. (i) Show that if $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \ldots$ are σ-algebras, then $\cup_i \mathcal{F}_i$ is an algebra. (ii) Give an example to show that $\cup_i \mathcal{F}_i$ need not be a σ-algebra.

Exercise A.1.2. A set $A \subset \{1, 2, \ldots\}$ is said to have asymptotic density $\theta$ if
$$\lim_{n \to \infty} |A \cap \{1, 2, \ldots, n\}|/n = \theta$$
Let $\mathcal{A}$ be the collection of sets for which the asymptotic density exists. Is $\mathcal{A}$ a σ-algebra? an algebra?
By a measure on an algebra $\mathcal{A}$, we mean a set function $\mu$ with
(i) $\mu(A) \ge \mu(\emptyset) = 0$ for all $A \in \mathcal{A}$, and
(ii) if $A_i \in \mathcal{A}$ are disjoint and their union is in $\mathcal{A}$, then
$$\mu(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mu(A_i)$$
The italicized clause ("and their union is in $\mathcal{A}$") is unnecessary if $\mathcal{A}$ is a σ-algebra, so in that case the new definition coincides with the old one. The next exercise generalizes Exercise 1.1 in Chapter 1.

Exercise A.1.3. We assume that all the sets mentioned are in $\mathcal{A}$.
(i) monotonicity. If $A \subset B$ then $\mu(A) \le \mu(B)$.
(ii) subadditivity. If $A \subset \cup_i A_i$ then $\mu(A) \le \sum_i \mu(A_i)$.
(iii) continuity from below. If $A_i \uparrow A$ (i.e., $A_1 \subset A_2 \subset \ldots$ and $\cup_i A_i = A$) then $\mu(A_i) \uparrow \mu(A)$.
(iv) continuity from above. If $A_i \downarrow A$ (i.e., $A_1 \supset A_2 \supset \ldots$ and $\cap_i A_i = A$) and $\mu(A_1) < \infty$ then $\mu(A_i) \downarrow \mu(A)$ as $i \uparrow \infty$.

$\mu$ is said to be σ-finite if there is a sequence of sets $A_n \in \mathcal{A}$ so that $\mu(A_n) < \infty$ and $\cup_n A_n = \Omega$. Letting $A'_1 = A_1$ and for $n \ge 2$,
$$A'_n = \cup_{m=1}^{n} A_m \quad \text{or} \quad A'_n = A_n \cap (\cap_{m=1}^{n-1} A_m^c) \in \mathcal{A}$$
we can without loss of generality assume that $A_n \uparrow \Omega$ or the $A_n$ are disjoint.
Theorem A.1.1. Carathéodory's Extension Theorem. Let $\mu$ be a σ-finite measure on an algebra $\mathcal{A}$. Then $\mu$ has a unique extension to $\sigma(\mathcal{A})$ = the smallest σ-algebra containing $\mathcal{A}$.
Exercise A.1.4. Let Z = the integers and A = the collection of subsets so that A
or Ac is ﬁnite. Let µ(A) = 0 in the ﬁrst case and µ(A) = 1 in the second. Show that
µ has no extension to σ (A).
The next section is devoted to the proof of Theorem A.1.1. To check the hypotheses
of Theorem A.1.1 for Lebesgue measure, we will prove a theorem, Theorem A.1.3, that
will be useful for other examples. To state that result, we will need several deﬁnitions.
A collection $\mathcal{S}$ of sets is said to be a semialgebra if (i) it is closed under intersection, i.e., $S, T \in \mathcal{S}$ implies $S \cap T \in \mathcal{S}$, and (ii) if $S \in \mathcal{S}$ then $S^c$ is a finite disjoint union of sets in $\mathcal{S}$. An important example of a semialgebra is $\mathcal{R}^d_o$ = the collection of sets of the form
$$(a_1, b_1] \times \cdots \times (a_d, b_d] \subset \mathbb{R}^d \quad \text{where } -\infty \le a_i < b_i \le \infty$$
Exercise A.1.5. Show that $\sigma(\mathcal{R}^d_o) = \mathcal{R}^d$, the Borel subsets of $\mathbb{R}^d$.

Lemma A.1.2. If $\mathcal{S}$ is a semialgebra then $\bar{\mathcal{S}}$ = {finite disjoint unions of sets in $\mathcal{S}$} is an algebra, called the algebra generated by $\mathcal{S}$.
Proof. Suppose $A = +_i S_i$ and $B = +_j T_j$, where + denotes disjoint union and we assume the index sets are finite. Then $A \cap B = +_{i,j}\, S_i \cap T_j \in \bar{\mathcal{S}}$. As for complements, if $A = +_i S_i$ then $A^c = \cap_i S_i^c$. The definition of $\mathcal{S}$ implies $S_i^c \in \bar{\mathcal{S}}$. We have shown that $\bar{\mathcal{S}}$ is closed under intersection, so it follows by induction that $A^c \in \bar{\mathcal{S}}$.
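For the semialgebra of half-open intervals on the line, the two closure properties used in the proof can be made concrete. The following Python sketch is an illustration under stated assumptions: a finite disjoint union of intervals (a, b] is represented as a list of pairs `(a, b)` with a < b, infinite endpoints are `math.inf`, and all function names are hypothetical.

```python
import math

INF = math.inf


def normalize(intervals):
    """Sort the pieces and drop empty ones; the pieces are assumed pairwise disjoint."""
    return sorted((a, b) for a, b in intervals if a < b)


def intersect(A, B):
    """Intersection of two finite disjoint unions of half-open intervals (a, b]."""
    # (a, b] ∩ (c, d] = (max(a, c), min(b, d)], which is empty unless max < min.
    return normalize([(max(a, c), min(b, d)) for a, b in A for c, d in B])


def complement(A):
    """Complement in R of a finite disjoint union of half-open intervals (a, b]."""
    result, left = [], -INF
    for a, b in normalize(A):
        result.append((left, a))  # the gap before the current piece
        left = b
    result.append((left, INF))
    return normalize(result)


A = [(0, 2), (3, 5)]
B = [(1, 4)]
print(intersect(A, B))
print(complement(A))
```

Both operations return another finite disjoint union of half-open intervals, which is the content of Lemma A.1.2 in this special case.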
Let $\lambda$ denote Lebesgue measure. The definition gives the values of $\lambda$ on a semialgebra $\mathcal{S}$ (the half-open intervals). It is easy to see how to extend the definition to the algebra $\bar{\mathcal{S}}$ defined in Lemma A.1.2. We let $\lambda(+_i (a_i, b_i]) = \sum_i (b_i - a_i)$. To assert that $\lambda$ has an extension to $\sigma(\bar{\mathcal{S}}) = \mathcal{R}$, we have to check that $\lambda$ is a measure on $\bar{\mathcal{S}}$, i.e., if $A \in \bar{\mathcal{S}}$ is a countable disjoint union of sets $A_i \in \bar{\mathcal{S}}$, then $\lambda(A) = \sum_i \lambda(A_i)$. The next result simplifies that task somewhat.
Theorem A.1.3. Let $\mathcal{S}$ be a semialgebra and let $\mu$ defined on $\mathcal{S}$ have $\mu(\emptyset) = 0$. Suppose (i) if $S \in \mathcal{S}$ is a finite disjoint union of sets $S_i \in \mathcal{S}$ then $\mu(S) = \sum_i \mu(S_i)$, and (ii) if $S_i, S \in \mathcal{S}$ with $S = +_{i \ge 1} S_i$ then $\mu(S) \le \sum_{i \ge 1} \mu(S_i)$. Then $\mu$ has a unique extension $\bar\mu$ that is a measure on $\bar{\mathcal{S}}$, the algebra generated by $\mathcal{S}$. If the extension is σ-finite then by Theorem A.1.1 there is a unique extension $\nu$ that is a measure on $\sigma(\mathcal{S})$.

Remark. In (ii) above, and in what follows, $i \ge 1$ indicates a countable union, while a plain subscript $i$ or $j$ indicates a finite union.
Proof. We define $\bar\mu$ on $\bar{\mathcal{S}}$ by $\bar\mu(A) = \sum_i \mu(S_i)$ whenever $A = +_i S_i$. To check that $\bar\mu$ is well defined, suppose that $A = +_j T_j$ and observe $S_i = +_j (S_i \cap T_j)$ and $T_j = +_i (S_i \cap T_j)$, so (i) implies
$$\sum_i \mu(S_i) = \sum_{i,j} \mu(S_i \cap T_j) = \sum_j \mu(T_j)$$
Our next result takes the first step toward proving that $\bar\mu$ is a measure on $\bar{\mathcal{S}}$. It includes an extra statement, (b), that will be useful in checking (ii).
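Lemma A.1.4. Suppose only that (i) holds.
(a) If $A, B_i \in \bar{\mathcal{S}}$ with $A = +_{i=1}^{n} B_i$ then $\bar\mu(A) = \sum_i \bar\mu(B_i)$.
(b) If $A, B_i \in \bar{\mathcal{S}}$ with $A \subset \cup_{i=1}^{n} B_i$ then $\bar\mu(A) \le \sum_i \bar\mu(B_i)$.

Proof. Observe that it follows from the definition that if $A = +_i B_i$ is a finite disjoint union of sets in $\bar{\mathcal{S}}$ and $B_i = +_j S_{i,j}$, then
$$\bar\mu(A) = \sum_{i,j} \mu(S_{i,j}) = \sum_i \bar\mu(B_i)$$
To prove (b), we begin with the case $n = 1$, $B_1 = B$. $B = A + (B \cap A^c)$ and $B \cap A^c \in \bar{\mathcal{S}}$, so
$$\bar\mu(A) \le \bar\mu(A) + \bar\mu(B \cap A^c) = \bar\mu(B)$$
To handle $n > 1$ now, let $F_k = B_1^c \cap \ldots \cap B_{k-1}^c \cap B_k$ and note
$$\cup_i B_i = F_1 + \cdots + F_n$$
$$A = A \cap (\cup_i B_i) = (A \cap F_1) + \cdots + (A \cap F_n)$$
so using (a), (b) with $n = 1$, and (a) again
$$\bar\mu(A) = \sum_{k=1}^{n} \bar\mu(A \cap F_k) \le \sum_{k=1}^{n} \bar\mu(F_k) = \bar\mu(\cup_i B_i)$$
Since $F_k \subset B_k$, (b) with $n = 1$ gives $\bar\mu(F_k) \le \bar\mu(B_k)$, so the middle sum is at most $\sum_k \bar\mu(B_k)$, which proves (b).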
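To extend the additivity property to $A \in \bar{\mathcal{S}}$ that are countable disjoint unions $A = +_{i \ge 1} B_i$, where $B_i \in \bar{\mathcal{S}}$, we observe that each $B_i = +_j S_{i,j}$ with $S_{i,j} \in \mathcal{S}$ and $\sum_{i \ge 1} \bar\mu(B_i) = \sum_{i \ge 1, j} \mu(S_{i,j})$, so replacing the $B_i$'s by $S_{i,j}$'s we can without loss of generality suppose that the $B_i \in \mathcal{S}$. Now $A \in \bar{\mathcal{S}}$ implies $A = +_j T_j$ (a finite disjoint union) and $T_j = +_{i \ge 1}\, T_j \cap B_i$, so (ii) implies
$$\mu(T_j) \le \sum_{i \ge 1} \mu(T_j \cap B_i)$$
Summing over $j$ and observing that nonnegative numbers can be summed in any order,
$$\bar\mu(A) = \sum_j \mu(T_j) \le \sum_{i \ge 1} \sum_j \mu(T_j \cap B_i) = \sum_{i \ge 1} \mu(B_i)$$
the last equality following from (i). To prove the opposite inequality, let $A_n = B_1 + \cdots + B_n$, and $C_n = A \cap A_n^c$. $C_n \in \bar{\mathcal{S}}$, since $\bar{\mathcal{S}}$ is an algebra, so finite additivity of $\bar\mu$ implies
$$\bar\mu(A) = \bar\mu(B_1) + \cdots + \bar\mu(B_n) + \bar\mu(C_n) \ge \bar\mu(B_1) + \cdots + \bar\mu(B_n)$$
and letting $n \to \infty$, $\bar\mu(A) \ge \sum_{i \ge 1} \bar\mu(B_i)$.

With Theorem A.1.3 established, we are ready to prove the existence of Lebesgue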
¯ 342 APPENDIX A. MEASURE THEORY With Theorem A.1.3 established, we are ready to prove the existence of Lebesgue
measure and a number of other measures.
Theorem A.1.5. Suppose F is (i) nondecreasing and (ii) right continuous, i.e.,
F (y ) ↓ F (x) when y ↓ x. There is a unique measure µ on (R, R) with µ((a, b]) =
F (b) − F (a) for all a, b.
Remark. A function F that has properties (i) and (ii) is called a Stieltjes measure
function. To see the reasons for the two conditions, observe that (a) if µ is a measure
then F (b) − F (a) ≥ 0 and (b) part (iv) of Exercise A.1.3 implies that if µ((a, y ]) < ∞
and y ↓ x > a
F (y ) − F (a) = µ((a, y ]) ↓ µ((a, x]) = F (x) − F (a)
Conversely, if $\mu$ is a measure on $\mathbb{R}$ with $\mu((a,b]) < \infty$ when $-\infty < a < b < \infty$ then
$$F(x) = \begin{cases} c + \mu((0,x]) & \text{for } x \ge 0 \\ c - \mu((x,0]) & \text{for } x < 0 \end{cases}$$
is a function $F$ with $F(b) - F(a) = \mu((a,b])$, and any such function has this form with $c = F(0)$.
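The correspondence between a Stieltjes measure function and the measure of half-open intervals can be illustrated with two concrete choices of F. The following Python sketch is an example only: the exponential distribution function, the unit step at 0 and all function names are assumptions made for illustration.

```python
import math


def exponential_F(x):
    """Exponential(1) distribution function: nondecreasing and right continuous."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def step_F(x):
    """Unit step at 0, right continuous: corresponds to a point mass at 0."""
    return 1.0 if x >= 0 else 0.0


def mu(a, b, F=exponential_F):
    """Measure assigned to the half-open interval (a, b], namely F(b) - F(a)."""
    return F(b) - F(a)


# Finite additivity over adjacent intervals: (0, 3] = (0, 1] + (1, 3]
print(mu(0, 1) + mu(1, 3), mu(0, 3))
# Right continuity puts the point mass in (-1, 0], not in (0, 1]
print(mu(-1, 0, step_F), mu(0, 1, step_F))
```

The second line of output shows why half-open intervals (a, b] pair naturally with right continuous F: the jump of F at 0 is counted in the interval that contains its right endpoint 0.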
Proof. Let $\mathcal{S}$ be the semialgebra of half-open intervals $(a,b]$ with $-\infty \le a < b \le \infty$. To define $\mu$ on $\mathcal{S}$, we begin by observing that
$$F(\infty) = \lim_{x \uparrow \infty} F(x) \quad \text{and} \quad F(-\infty) = \lim_{x \downarrow -\infty} F(x) \quad \text{exist}$$
and $\mu((a,b]) = F(b) - F(a)$ makes sense for all $-\infty \le a < b \le \infty$ since $F(\infty) > -\infty$ and $F(-\infty) < \infty$.

If $(a,b] = +_{i=1}^{n} (a_i, b_i]$ then after relabeling the intervals we must have $a_1 = a$, $b_n = b$, and $a_i = b_{i-1}$ for $2 \le i \le n$, so condition (i) in Theorem A.1.3 holds. To check (ii), suppose first that $-\infty < a < b < \infty$, and $(a,b] \subset \cup_{i \ge 1} (a_i, b_i]$ where (without loss of generality) $-\infty < a_i < b_i < \infty$. Pick $\delta > 0$ so that $F(a + \delta) < F(a) + \epsilon$ and pick $\eta_i$ so that
$$F(b_i + \eta_i) < F(b_i) + \epsilon 2^{-i}$$
The open intervals $(a_i, b_i + \eta_i)$ cover $[a + \delta, b]$, so there is a finite subcover $(\alpha_j, \beta_j)$, $1 \le j \le J$. Since $(a + \delta, b] \subset \cup_{j=1}^{J} (\alpha_j, \beta_j]$, (b) in Lemma A.1.4 implies
$$F(b) - F(a + \delta) \le \sum_{j=1}^{J} (F(\beta_j) - F(\alpha_j)) \le \sum_{i=1}^{\infty} (F(b_i + \eta_i) - F(a_i))$$
So, by the choice of $\delta$ and $\eta_i$,
$$F(b) - F(a) \le 2\epsilon + \sum_{i=1}^{\infty} (F(b_i) - F(a_i))$$
and since $\epsilon$ is arbitrary, we have proved the result in the case $-\infty < a < b < \infty$. To remove the last restriction, observe that if $(a,b] \subset \cup_i (a_i, b_i]$ and $(A, B] \subset (a,b]$ has $-\infty < A < B < \infty$, then we have
$$F(B) - F(A) \le \sum_{i=1}^{\infty} (F(b_i) - F(a_i))$$
Since the last result holds for any finite $(A, B] \subset (a,b]$, the desired result follows.

Our next goal is to prove a version of Theorem A.1.5 for $\mathbb{R}^d$. The first step is to
introduce the assumptions on the deﬁning function F .
(i) It is nondecreasing, i.e., if x ≤ y (meaning xi ≤ yi for all i) then F (x) ≤ F (y ).
(ii) F is right continuous, i.e., limy↓x F (y ) = F (x) (here y ↓ x means each coordinate
yi ↓ xi ).
To formulate the third and ﬁnal condition, let
A = (a1 , b1 ] × · · · × (ad , bd ]
V = {a1 , b1 } × · · · × {ad , bd }
where $-\infty < a_i < b_i < \infty$. To emphasize that ∞'s are not allowed, we will call $A$ a finite rectangle. Then $V$ = the vertices of the rectangle $A$. If $v \in V$, let
$$\text{sgn}(v) = (-1)^{\#\text{ of } a\text{'s in } v}$$
$$\Delta_A F = \sum_{v \in V} \text{sgn}(v) F(v)$$
We will let $\mu(A) = \Delta_A F$, so we must assume
(iii) $\Delta_A F \ge 0$ for all rectangles $A$.
To see the reason for this definition, consider the special case $d = 2$ and then divide one large rectangle into four small ones. For more on this assumption, see Section 2.9.

Example A.1.3. Suppose $F(x) = \prod_{i=1}^{d} F_i(x_i)$, where the $F_i$ satisfy (i) and (ii) of Theorem A.1.5. In this case,
$$\Delta_A F = \prod_{i=1}^{d} (F_i(b_i) - F_i(a_i))$$
When $F_i(x) = x$ for all $i$, the resulting measure is Lebesgue measure on $\mathbb{R}^d$.
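The signed sum over vertices can be computed directly and compared with the product formula of Example A.1.3. The following Python sketch is illustrative: the function names and the sample rectangle are assumptions, and F is taken to be the product of the coordinates, which corresponds to the Lebesgue case $F_i(x) = x$.

```python
import math
from itertools import product


def delta(F, a, b):
    """Sum of sgn(v) F(v) over the 2^d vertices v of (a_1, b_1] x ... x (a_d, b_d]."""
    d = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=d):  # 0 picks a_i, 1 picks b_i
        v = [b[i] if choice[i] else a[i] for i in range(d)]
        number_of_a = d - sum(choice)
        total += (-1) ** number_of_a * F(v)
    return total


def lebesgue_F(x):
    """F(x) = x_1 * ... * x_d, i.e. F_i(x) = x in every coordinate."""
    return math.prod(x)


a, b = (0, 1, 2), (1, 3, 5)  # hypothetical rectangle in R^3
volume = math.prod(bi - ai for ai, bi in zip(a, b))
print(delta(lebesgue_F, a, b), volume)
```

For d = 2 the same function returns F(b1, b2) − F(a1, b2) − F(b1, a2) + F(a1, a2), the familiar inclusion-exclusion formula for a rectangle.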
Theorem A.1.6. Suppose F : Rd → [0, 1] satisﬁes (i)–(iii) in Section 2.9. Then
there is a unique probability measure µ on (Rd , Rd ) so that µ(A) = ∆A F for all ﬁnite
rectangles.
Proof. Let S be the semialgebra of rectangles A = (a, b], where −∞ ≤ ai < bi ≤ ∞.
We let µ(A) = ∆A F for all ﬁnite rectangles and then use monotonicity to extend the
deﬁnition to all rectangles.
To check (i), call $A = +_k B_k$ a regular subdivision of $A$ if there are sequences $a_i = \alpha_{i,0} < \alpha_{i,1} < \ldots < \alpha_{i,n_i} = b_i$ so that each rectangle $B_k$ has the form
$$(\alpha_{1,j_1 - 1}, \alpha_{1,j_1}] \times \cdots \times (\alpha_{d,j_d - 1}, \alpha_{d,j_d}] \quad \text{where } 1 \le j_i \le n_i$$
It is easy to see that for regular subdivisions $\lambda(A) = \sum_k \lambda(B_k)$. (First consider the case in which all the endpoints are finite and then take limits to get the general case.) To extend this result to a general finite subdivision $A = +_j A_j$, subdivide further to get a regular one.
The proof of (ii) is almost identical to that in Theorem A.1.5. To make things
easier to write and to bring out the analogies with Theorem A.1.5, we let
(x, y ) = (x1 , y1 ) × · · · × (xd , yd )
(x, y ] = (x1 , y1 ] × · · · × (xd , yd ]
[x, y] = [x1, y1] × · · · × [xd, yd]
for $x, y \in \mathbb{R}^d$. Suppose first that $-\infty < a < b < \infty$, where the inequalities mean that each component is finite, and suppose $(a,b] \subset \cup_{i \ge 1} (a^i, b^i]$, where (without loss of generality) $-\infty < a^i < b^i < \infty$. Let $\bar{1} = (1, \ldots, 1)$, pick $\delta > 0$ so that
$$\mu((a,b]) < \mu((a + \delta\bar{1}, b]) + \epsilon$$
and pick $\eta_i$ so that
$$\mu((a^i, b^i + \eta_i \bar{1}]) < \mu((a^i, b^i]) + \epsilon 2^{-i}$$
The open rectangles $(a^i, b^i + \eta_i \bar{1})$ cover $[a + \delta\bar{1}, b]$, so there is a finite subcover $(\alpha^j, \beta^j)$, $1 \le j \le J$. Since $(a + \delta\bar{1}, b] \subset \cup_{j=1}^{J} (\alpha^j, \beta^j]$, (b) in Lemma A.1.4 implies
$$\mu((a + \delta\bar{1}, b]) \le \sum_{j=1}^{J} \mu((\alpha^j, \beta^j]) \le \sum_{i=1}^{\infty} \mu((a^i, b^i + \eta_i \bar{1}])$$
So, by the choice of $\delta$ and $\eta_i$,
$$\mu((a,b]) \le 2\epsilon + \sum_{i=1}^{\infty} \mu((a^i, b^i])$$
and since $\epsilon$ is arbitrary, we have proved the result in the case $-\infty < a < b < \infty$. The proof can now be completed exactly as before.

A.2 Carathéodory's Extension Theorem

This section is devoted to the proof of Theorem A.1.1. The proof is slick but rather
mysterious. The reader should not worry too much about the details but concentrate
on the structure of the proof and the deﬁnitions introduced.
Uniqueness. We will prove that the extension is unique before tackling the more difficult problem of proving its existence. The key to our uniqueness proof is Dynkin's π-λ theorem, a result that we will use many times in the book. As usual, we need a few definitions before we can state the result. $\mathcal{P}$ is said to be a π-system if it is closed under intersection, i.e., if $A, B \in \mathcal{P}$ then $A \cap B \in \mathcal{P}$. For example, the collection of rectangles $(a_1, b_1] \times \cdots \times (a_d, b_d]$ is a π-system. $\mathcal{L}$ is said to be a λ-system if it satisfies: (i) $\Omega \in \mathcal{L}$. (ii) If $A, B \in \mathcal{L}$ and $A \subset B$ then $B - A \in \mathcal{L}$. (iii) If $A_n \in \mathcal{L}$ and $A_n \uparrow A$ then $A \in \mathcal{L}$. The reader will see in a moment that the next result is just what we need to prove uniqueness of the extension.

Theorem A.2.1. π-λ Theorem. If $\mathcal{P}$ is a π-system and $\mathcal{L}$ is a λ-system that contains $\mathcal{P}$ then $\sigma(\mathcal{P}) \subset \mathcal{L}$.
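Proof. We will show that
(a) if $\ell(\mathcal{P})$ is the smallest λ-system containing $\mathcal{P}$ then $\ell(\mathcal{P})$ is a σ-field.
The desired result follows from (a). To see this, note that since $\sigma(\mathcal{P})$ is the smallest σ-field and $\ell(\mathcal{P})$ is the smallest λ-system containing $\mathcal{P}$ we have
$$\sigma(\mathcal{P}) \subset \ell(\mathcal{P}) \subset \mathcal{L}$$
To prove (a) we begin by noting that a λ-system that is closed under intersection is a σ-field since
if $A \in \mathcal{L}$ then $A^c = \Omega - A \in \mathcal{L}$
$A \cup B = (A^c \cap B^c)^c$
$\cup_{i=1}^{n} A_i \uparrow \cup_{i=1}^{\infty} A_i$ as $n \uparrow \infty$
Thus, it is enough to show
(b) $\ell(\mathcal{P})$ is closed under intersection.
To prove (b), we let $\mathcal{G}_A = \{B : A \cap B \in \ell(\mathcal{P})\}$ and prove
(c) if $A \in \ell(\mathcal{P})$ then $\mathcal{G}_A$ is a λ-system.
To check this, we note: (i) $\Omega \in \mathcal{G}_A$ since $A \in \ell(\mathcal{P})$.
(ii) if $B, C \in \mathcal{G}_A$ and $B \supset C$ then $A \cap (B - C) = (A \cap B) - (A \cap C) \in \ell(\mathcal{P})$ since $A \cap B, A \cap C \in \ell(\mathcal{P})$ and $\ell(\mathcal{P})$ is a λ-system.
(iii) if $B_n \in \mathcal{G}_A$ and $B_n \uparrow B$ then $A \cap B_n \uparrow A \cap B \in \ell(\mathcal{P})$ since $A \cap B_n \in \ell(\mathcal{P})$ and $\ell(\mathcal{P})$ is a λ-system.
To get from (c) to (b), we note that since $\mathcal{P}$ is a π-system,
if $A \in \mathcal{P}$ then $\mathcal{G}_A \supset \mathcal{P}$ and so (c) implies $\mathcal{G}_A \supset \ell(\mathcal{P})$
i.e., if $A \in \mathcal{P}$ and $B \in \ell(\mathcal{P})$ then $A \cap B \in \ell(\mathcal{P})$. Interchanging $A$ and $B$ in the last sentence: if $A \in \ell(\mathcal{P})$ and $B \in \mathcal{P}$ then $A \cap B \in \ell(\mathcal{P})$ but this implies
if $A \in \ell(\mathcal{P})$ then $\mathcal{G}_A \supset \mathcal{P}$ and so (c) implies $\mathcal{G}_A \supset \ell(\mathcal{P})$.
This conclusion implies that if $A, B \in \ell(\mathcal{P})$ then $A \cap B \in \ell(\mathcal{P})$, which proves (b) and completes the proof.

To prove that the extension in Theorem A.1.1 is unique, we will show: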
Theorem A.2.2. Let P be a π system. If ν1 and ν2 are measures (on σ ﬁelds F1 and
F2 ) that agree on P and there is a sequence An ∈ P with An ↑ Ω and νi (An ) < ∞,
then ν1 and ν2 agree on σ (P ).
Proof. Let A ∈ P have ν1 (A) = ν2 (A) < ∞. Let
L = {B ∈ σ (P ) : ν1 (A ∩ B ) = ν2 (A ∩ B )}
We will now show that L is a λsystem. Since A ∈ P , ν1 (A) = ν2 (A) and Ω ∈ L. If
B, C ∈ L with C ⊂ B then
ν1 (A ∩ (B − C )) = ν1 (A ∩ B ) − ν1 (A ∩ C )
= ν2 (A ∩ B ) − ν2 (A ∩ C ) = ν2 (A ∩ (B − C ))
Here we use the fact that νi (A) < ∞ to justify the subtraction. Finally, if Bn ∈ L
and Bn ↑ B , then part (iii) of Exercise A.1.3 implies
ν1 (A ∩ B ) = lim ν1 (A ∩ Bn ) = lim ν2 (A ∩ Bn ) = ν2 (A ∩ B )
n→∞ n→∞ Since P is closed under intersection by assumption, the π − λ theorem implies L ⊃
σ (P ), i.e., if A ∈ P with ν1 (A) = ν2 (A) < ∞ and B ∈ σ (P ) then ν1 (A ∩ B ) =
ν2 (A ∩ B ). Letting An ∈ P with An ↑ Ω, ν1 (An ) = ν2 (An ) < ∞, and using the last
result and part (iii) of Exercise A.1.3, we have the desired conclusion.
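On a finite set the π-λ theorem used above can be checked by brute force, since condition (iii) of a λ-system is automatic there (an increasing sequence of subsets is eventually constant). The following Python sketch is an illustration only: the ground set {1, 2, 3, 4}, the particular π-system and the function names are assumptions.

```python
OMEGA = frozenset({1, 2, 3, 4})


def lambda_closure(sets):
    """Smallest lambda-system on the finite set OMEGA containing `sets`:
    add OMEGA and close under proper differences B - A with A a subset of B."""
    family = set(sets) | {OMEGA}
    changed = True
    while changed:
        changed = False
        for A in list(family):
            for B in list(family):
                if A <= B and (B - A) not in family:
                    family.add(B - A)
                    changed = True
    return family


def sigma_closure(sets):
    """Smallest sigma-field on OMEGA containing `sets`:
    close under complements and (finite) unions."""
    family = set(sets) | {OMEGA}
    changed = True
    while changed:
        changed = False
        for A in list(family):
            for new in [OMEGA - A] + [A | B for B in list(family)]:
                if new not in family:
                    family.add(new)
                    changed = True
    return family


# A pi-system: {1,2} ∩ {2,3} = {2} is included, so it is closed under intersection.
P = {frozenset({1, 2}), frozenset({2, 3}), frozenset({2})}
print(len(lambda_closure(P)), len(sigma_closure(P)))
```

For this π-system the two closures coincide, as the theorem predicts.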
Exercise A.2.1. Give an example of two probability measures $\mu \ne \nu$ on $\mathcal{F}$ = all subsets of {1, 2, 3, 4} that agree on a collection of sets $\mathcal{C}$ with $\sigma(\mathcal{C}) = \mathcal{F}$, i.e., the smallest σ-algebra containing $\mathcal{C}$ is $\mathcal{F}$.
Existence. Our next step is to show that a measure (not necessarily σ-finite) defined on an algebra $\mathcal{A}$ has an extension to the σ-algebra generated by $\mathcal{A}$. If $E \subset \Omega$, we let $\mu^*(E) = \inf \sum_i \mu(A_i)$ where the infimum is taken over all sequences from $\mathcal{A}$ so that $E \subset \cup_i A_i$. Intuitively, if $\nu$ is a measure that agrees with $\mu$ on $\mathcal{A}$, then it follows from part (ii) of Exercise A.1.3 that
$$\nu(E) \le \nu(\cup_i A_i) \le \sum_i \nu(A_i) = \sum_i \mu(A_i)$$
so $\mu^*(E)$ is an upper bound on the measure of $E$. Intuitively, the measurable sets are the ones for which the upper bound is tight. Formally, we say that $E$ is measurable if
$$\mu^*(F) = \mu^*(F \cap E) + \mu^*(F \cap E^c) \quad \text{for all sets } F \subset \Omega \tag{A.2.1}$$
The last definition is not very intuitive, but we will see in the proofs below that it works very well.
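A tiny finite example may make the definition of measurability more concrete. The following Python sketch is an illustration under stated assumptions: the three-point space, the algebra {∅, {0}, {1,2}, Ω} and the weights 1/3 and 2/3 are hypothetical, and the outer measure is computed as a minimum over single covering sets, which is valid here because the algebra is finite (any cover has its union in the algebra, and μ is subadditive).

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset({0, 1, 2})

# Hypothetical measure on the algebra {∅, {0}, {1, 2}, Ω}
MU = {
    frozenset(): Fraction(0),
    frozenset({0}): Fraction(1, 3),
    frozenset({1, 2}): Fraction(2, 3),
    OMEGA: Fraction(1),
}


def subsets(s):
    """All subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def outer(E):
    """mu*(E): smallest measure of a set in the algebra that contains E."""
    return min(m for A, m in MU.items() if E <= A)


def is_measurable(E):
    """Caratheodory condition (A.2.1), checked against every test set F."""
    return all(outer(F) == outer(F & E) + outer(F - E) for F in subsets(OMEGA))


measurable = {E for E in subsets(OMEGA) if is_measurable(E)}
print(sorted(sorted(E) for E in measurable))
```

In this example the set {1} fails the test with F = {1, 2}, since μ*({1}) + μ*({2}) = 4/3 exceeds μ*({1, 2}) = 2/3, and the measurable sets turn out to be exactly the four sets of the algebra.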
It is immediate from the definition that $\mu^*$ has the following properties:
(i) monotonicity. If $E \subset F$ then $\mu^*(E) \le \mu^*(F)$.
(ii) subadditivity. If $F \subset \cup_i F_i$, a countable union, then $\mu^*(F) \le \sum_i \mu^*(F_i)$.
Any set function with $\mu^*(\emptyset) = 0$ that satisfies (i) and (ii) is called an outer measure. Using (ii) with $F_1 = F \cap E$ and $F_2 = F \cap E^c$ (and $F_i = \emptyset$ otherwise), we see that to prove a set is measurable, it is enough to show
$$\mu^*(F) \ge \mu^*(F \cap E) + \mu^*(F \cap E^c) \tag{A.2.2}$$
We begin by showing that our new definition extends the old one.

Lemma A.2.3. If $A \in \mathcal{A}$ then $\mu^*(A) = \mu(A)$ and $A$ is measurable.
Proof. Part (ii) of Exercise A.1.3 implies that if $A \subset \cup_i A_i$ then
$$\mu(A) \le \sum_i \mu(A_i)$$
so $\mu(A) \le \mu^*(A)$. Of course, we can always take $A_1 = A$ and the other $A_i = \emptyset$ so $\mu^*(A) \le \mu(A)$.

To prove that any $A \in \mathcal{A}$ is measurable, we begin by noting that the inequality (A.2.2) is trivial when $\mu^*(F) = \infty$, so we can without loss of generality assume $\mu^*(F) < \infty$. To prove that (A.2.2) holds when $E = A$, we observe that since $\mu^*(F) < \infty$ there is a sequence $B_i \in \mathcal{A}$ so that $\cup_i B_i \supset F$ and
$$\sum_i \mu(B_i) \le \mu^*(F) + \epsilon$$
Since $\mu$ is additive on $\mathcal{A}$, and $\mu = \mu^*$ on $\mathcal{A}$ we have
$$\mu(B_i) = \mu^*(B_i \cap A) + \mu^*(B_i \cap A^c)$$
Summing over $i$ and using the subadditivity of $\mu^*$ gives
$$\mu^*(F) + \epsilon \ge \sum_i \mu^*(B_i \cap A) + \sum_i \mu^*(B_i \cap A^c) \ge \mu^*(F \cap A) + \mu^*(F \cap A^c)$$
which proves the desired result since $\epsilon$ is arbitrary.

Lemma A.2.4. The class $\mathcal{A}^*$ of measurable sets is a σ-field, and the restriction of $\mu^*$ to $\mathcal{A}^*$ is a measure.
Remark. This result is true for any outer measure.
Proof. It is clear from the deﬁnition that:
(a) If E is measurable then E c is.
Our ﬁrst nontrivial task is to prove:
(b) If $E_1$ and $E_2$ are measurable then $E_1 \cup E_2$ and $E_1 \cap E_2$ are.
Proof of (b). To prove the first conclusion, let $G$ be any subset of $\Omega$. Using subadditivity, the measurability of $E_2$ (let $F = G \cap E_1^c$ in (A.2.1)), and the measurability of $E_1$, we get
$$\mu^*(G \cap (E_1 \cup E_2)) + \mu^*(G \cap (E_1^c \cap E_2^c))$$
$$\le \mu^*(G \cap E_1) + \mu^*(G \cap E_1^c \cap E_2) + \mu^*(G \cap E_1^c \cap E_2^c)$$
$$= \mu^*(G \cap E_1) + \mu^*(G \cap E_1^c) = \mu^*(G)$$
To prove that $E_1 \cap E_2$ is measurable, we observe $E_1 \cap E_2 = (E_1^c \cup E_2^c)^c$ and use (a).

(c) Let $G \subset \Omega$ and $E_1, \ldots, E_n$ be disjoint measurable sets. Then
$$\mu^*(G \cap \cup_{i=1}^{n} E_i) = \sum_{i=1}^{n} \mu^*(G \cap E_i)$$
Proof of (c). Let $F_m = \cup_{i \le m} E_i$. $E_n$ is measurable, $F_n \supset E_n$, and $F_{n-1} \cap E_n = \emptyset$, so
$$\mu^*(G \cap F_n) = \mu^*(G \cap F_n \cap E_n) + \mu^*(G \cap F_n \cap E_n^c) = \mu^*(G \cap E_n) + \mu^*(G \cap F_{n-1})$$
The desired result follows from this by induction.

(d) If the sets $E_i$ are measurable then $E = \cup_{i=1}^{\infty} E_i$ is measurable.
Proof of (d). Let $E'_i = E_i \cap (\cap_{j<i} E_j^c)$. (a) and (b) imply $E'_i$ is measurable, so we can suppose without loss of generality that the $E_i$ are pairwise disjoint. Let $F_n = E_1 \cup \ldots \cup E_n$. $F_n$ is measurable by (b), so using monotonicity and (c) we have
$$\mu^*(G) = \mu^*(G \cap F_n) + \mu^*(G \cap F_n^c) \ge \mu^*(G \cap F_n) + \mu^*(G \cap E^c) = \sum_{i=1}^{n} \mu^*(G \cap E_i) + \mu^*(G \cap E^c)$$
Letting $n \to \infty$ and using subadditivity
$$\mu^*(G) \ge \sum_{i=1}^{\infty} \mu^*(G \cap E_i) + \mu^*(G \cap E^c) \ge \mu^*(G \cap E) + \mu^*(G \cap E^c)$$
which is (A.2.2).
The last step in the proof of Lemma A.2.4 is
(e) If $E = \cup_i E_i$ where $E_1, E_2, \ldots$ are disjoint and measurable, then
$$\mu^*(E) = \sum_{i=1}^{\infty} \mu^*(E_i)$$
Proof of (e). Let $F_n = E_1 \cup \ldots \cup E_n$. By monotonicity and (c)
$$\mu^*(E) \ge \mu^*(F_n) = \sum_{i=1}^{n} \mu^*(E_i)$$
Letting $n \to \infty$ now and using subadditivity gives the desired conclusion.

A.3 Completion, Etc.

The proof of Theorem A.1.1 given in the last section defines an extension to $\mathcal{A}^* \supset$
σ (A). Our next goal is to describe the relationship between these two σ algebras.
Let Aσ denote the collection of countable unions of sets in A, and let Bδ denote the
collection of countable intersections of sets in B . Taking B = Aσ , we see that Aσδ
denotes the collection of countable intersections of sets in Aσ .
Lemma A.3.1. Let $E$ be any set with $\mu^*(E) < \infty$. (i) For any $\epsilon > 0$, there is an $A \in \mathcal{A}_\sigma$ with $A \supset E$ and $\mu^*(A) \le \mu^*(E) + \epsilon$. (ii) There is a $B \in \mathcal{A}_{\sigma\delta}$ with $B \supset E$ and $\mu^*(B) = \mu^*(E)$.
Proof. By the definition of $\mu^*$, there is a sequence $A_i$ so that $A \equiv \cup_i A_i \supset E$ and $\sum_i \mu(A_i) \le \mu^*(E) + \epsilon$. The definition of $\mu^*$ implies $\mu^*(A) \le \sum_i \mu(A_i)$, establishing (i). For (ii), let $A_n \in \mathcal{A}_\sigma$ with $A_n \supset E$ and $\mu^*(A_n) \le \mu^*(E) + 1/n$, and let $B = \cap_n A_n$. Clearly, $B \in \mathcal{A}_{\sigma\delta}$, $B \supset E$, and hence by monotonicity, $\mu^*(B) \ge \mu^*(E)$. To prove the other inequality, notice that $B \subset A_n$ and hence $\mu^*(B) \le \mu^*(A_n) \le \mu^*(E) + 1/n$ for any $n$.
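Exercise A.3.1. Let $\mathcal{A}$ be an algebra, $\mu$ a measure on $\sigma(\mathcal{A})$, and $B \in \sigma(\mathcal{A})$ with $\mu(B) < \infty$. For any $\epsilon > 0$, there is an $A \in \mathcal{A}$ with $\mu(A \Delta B) < \epsilon$, where $A \Delta B = (A - B) \cup (B - A)$.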
Theorem A.3.2. Suppose $\mu$ is σ-finite on $\mathcal{A}$. $B \in \mathcal{A}^*$ if and only if there is an $A \in \mathcal{A}_{\sigma\delta}$ and a set $N$ with $\mu^*(N) = 0$ so that $B = A - N$ ($= A \cap N^c$).
Proof. It follows from Lemmas A.2.3 and A.2.4 that if $A \in \mathcal{A}_{\sigma\delta}$ then $A \in \mathcal{A}^*$. (A.2.2) in Section A.2 and monotonicity imply sets with $\mu^*(N) = 0$ are measurable, so using Lemma A.2.4 again it follows that $A \cap N^c \in \mathcal{A}^*$. To prove the other direction, let $\Omega_i$ be a disjoint collection of sets with $\mu(\Omega_i) < \infty$ and $\Omega = \cup_i \Omega_i$. Let $B_i = B \cap \Omega_i$ and use Lemma A.3.1 to find $A_i^n \in \mathcal{A}_\sigma$ so that $A_i^n \supset B_i$ and $\mu^*(A_i^n) \le \mu^*(B_i) + 1/(n2^i)$. Let $A^n = \cup_i A_i^n$. $B \subset A^n$ and
$$A^n - B \subset \cup_{i=1}^{\infty} (A_i^n - B_i)$$
so, by subadditivity,
$$\mu^*(A^n - B) \le \sum_{i=1}^{\infty} \mu^*(A_i^n - B_i) \le 1/n$$
Since $A^n \in \mathcal{A}_\sigma$, the set $A = \cap_n A^n \in \mathcal{A}_{\sigma\delta}$. Clearly, $A \supset B$. Since $N \equiv A - B \subset A^n - B$ for all $n$, monotonicity implies $\mu^*(N) = 0$, and the proof is complete.
A measure space (Ω, F , µ) is said to be complete if F contains all subsets of sets
of measure 0. In the proof of Theorem A.3.2, we showed that (Ω, A∗ , µ∗ ) is complete.
Our next result shows that (Ω, A∗ , µ∗ ) is the completion of (Ω, σ (A), µ).
Theorem A.3.3. If (Ω, F, µ) is a measure space, then there is a complete measure
space $(\Omega, \bar{F}, \bar{\mu})$, called the completion of (Ω, F, µ), so that: (i) $E \in \bar{F}$ if and only if
E = A ∪ B, where A ∈ F and B ⊂ N ∈ F with µ(N) = 0, (ii) $\bar{\mu}$ agrees with µ on F.
Proof. The first step is to check that $\bar{F}$ is a σ-algebra. If Ei = Ai ∪ Bi where
Ai ∈ F and Bi ⊂ Ni where µ(Ni) = 0, then ∪i Ai ∈ F and subadditivity implies
µ(∪i Ni) ≤ Σi µ(Ni) = 0, so $\cup_i E_i \in \bar{F}$. As for complements, if E = A ∪ B and
B ⊂ N, then $B^c \supset N^c$ so
$$E^c = A^c \cap B^c = (A^c \cap N^c) \cup (A^c \cap B^c \cap N)$$
$A^c \cap N^c$ is in F and $A^c \cap B^c \cap N \subset N$, so $E^c \in \bar{F}$.
We define $\bar{\mu}$ in the obvious way: If E = A ∪ B where A ∈ F and B ⊂ N where
µ(N) = 0, then we let $\bar{\mu}(E) = \mu(A)$. The first thing to show is that $\bar{\mu}$ is well defined,
i.e., if E = Ai ∪ Bi, i = 1, 2, are two decompositions, then µ(A1) = µ(A2). Let
A0 = A1 ∩ A2 and B0 = B1 ∪ B2. E = A0 ∪ B0 is a third decomposition with A0 ∈ F
and B0 ⊂ N1 ∪ N2, and has the pleasant property that if i = 1 or 2
$$\mu(A_0) \le \mu(A_i) \le \mu(A_0) + \mu(N_1 \cup N_2) = \mu(A_0)$$
The last detail is to check that $\bar{\mu}$ is a measure, but that is easy. If Ei = Ai ∪ Bi are
disjoint, then ∪i Ei can be decomposed as ∪i Ai ∪ (∪i Bi), and the Ai ⊂ Ei are disjoint,
so
$$\bar{\mu}(\cup_i E_i) = \mu(\cup_i A_i) = \sum_i \mu(A_i) = \sum_i \bar{\mu}(E_i)$$
Theorem A.1.6 allows us to construct Lebesgue measure λ on (Rd, Rd). Using
Theorem A.3.3, we can extend λ to be a measure on $(R^d, \bar{R}^d)$ where $\bar{R}^d$ is the completion
of Rd. Having done this, it is natural (if somewhat optimistic) to ask: Are there
any sets that are not in $\bar{R}^d$? The answer is “Yes” and we will now give an example of
a nonmeasurable B in R.
A nonmeasurable subset of [0,1)
The key to our construction is the observation that λ is translation invariant: i.e.,
if $A \in \bar{R}$ and x + A = {x + y : y ∈ A}, then $x + A \in \bar{R}$ and λ(A) = λ(x + A). We
say that x, y ∈ [0, 1) are equivalent and write x ∼ y if x − y is a rational number.
By the axiom of choice, there is a set B that contains exactly one element from each
equivalence class. B is our nonmeasurable set, that is,
Theorem A.3.4. $B \notin \bar{R}$.
Proof. The key is the following:
Lemma A.3.5. If E ⊂ [0, 1) is in $\bar{R}$, x ∈ (0, 1), and x +′ E = {(x + y) mod 1 :
y ∈ E}, then λ(E) = λ(x +′ E).
Proof. Let A = E ∩ [0, 1 − x) and B = E ∩ [1 − x, 1). Let A′ = x + A = {x + y :
y ∈ A} and B′ = x − 1 + B. $A, B \in \bar{R}$, so by translation invariance $A', B' \in \bar{R}$ and
λ(A) = λ(A′), λ(B) = λ(B′). Since A′ ⊂ [x, 1) and B′ ⊂ [0, x) are disjoint,
$$\lambda(E) = \lambda(A) + \lambda(B) = \lambda(A') + \lambda(B') = \lambda(x +' E)$$
From Lemma A.3.5, it follows easily that B is not measurable; if it were, then q +′ B,
q ∈ Q ∩ [0, 1) would be a countable disjoint collection of measurable subsets of [0,1),
all with the same measure α and having
$$\bigcup_{q \in Q \cap [0,1)} (q +' B) = [0, 1)$$
If α > 0 then λ([0, 1)) = ∞, and if α = 0 then λ([0, 1)) = 0. Neither conclusion is
compatible with the fact that λ([0, 1)) = 1 so $B \notin \bar{R}$.
Exercise A.3.2. Let B be the nonmeasurable set constructed in Theorem A.3.4. (i)
Let Bq = q +′ B and show that if Dq ⊂ Bq is measurable, then λ(Dq) = 0. (ii) Use
(i) to conclude that if A ⊂ R has λ(A) > 0, there is a nonmeasurable S ⊂ A.
Letting $B' = B \times [0, 1]^{d-1}$ where B is our nonmeasurable subset of (0,1), we get
a nonmeasurable set in d > 1. In d = 3, there is a much more interesting example,
but we need the reader to do some preliminary work. In Euclidean geometry, two
subsets of Rd are said to be congruent if one set can be mapped onto the other by
translations and rotations.
Claim. Two congruent measurable sets must have the same Lebesgue measure.
Exercise A.3.3. Prove the claim in d = 2 by showing (i) if B is a rotation of a
rectangle A then λ∗ (B ) = λ(A). (ii) If C is congruent to D then λ∗ (C ) = λ∗ (D).
Banach-Tarski Theorem
Banach and Tarski (1924) used the axiom of choice to show that it is possible to
partition the sphere {x : |x| ≤ 1} in R3 into a finite number of sets A1, . . . , An and
find congruent sets B1, . . . , Bn whose union is two disjoint spheres of radius 1! Since
congruent sets have the same Lebesgue measure, at least one of the sets Ai must
be nonmeasurable. The construction relies on the fact that the group generated by
rotations in R3 is not Abelian. Lindenbaum (1926) showed that this cannot be done
with any bounded set in R2. For a popular account of the Banach-Tarski theorem,
see French (1988).
Solovay’s Theorem
The axiom of choice played an important role in the last two constructions of
nonmeasurable sets. Solovay (1970) proved that its use is unavoidable. In his own
words, “We show that the existence of a non-Lebesgue measurable set cannot be
proved in Zermelo-Frankel set theory if the use of the axiom of choice is disallowed.”
This should convince the reader that all subsets of Rd that arise “in practice” are in
$\bar{R}^d$.

A.4 Integration

Let µ be a σ-finite measure on (Ω, F). In this section we will define ∫ f dµ for a class
of measurable functions. This is a four-step procedure:
Step 1. ϕ is said to be a simple function if $\varphi(\omega) = \sum_{i=1}^n a_i 1_{A_i}(\omega)$ and the Ai are disjoint
sets with µ(Ai) < ∞. If ϕ is a simple function, we let
$$\int \varphi \, d\mu = \sum_{i=1}^n a_i \mu(A_i)$$
The representation of ϕ is not unique since we have not supposed that the ai are
distinct. However, it is easy to see that the last deﬁnition does not contradict itself.
We will prove the next three conclusions four times, but before we can state them
for the first time, we need a definition. ϕ ≥ ψ µ-almost everywhere (or ϕ ≥ ψ µ-a.e.) means µ({ω : ϕ(ω) < ψ(ω)}) = 0. When there is no doubt about what measure
we are referring to, we drop the µ.
Lemma A.4.1. Let ϕ and ψ be simple functions.
(i) If ϕ ≥ 0 a.e. then ∫ ϕ dµ ≥ 0.
(ii) For any a ∈ R, ∫ aϕ dµ = a ∫ ϕ dµ.
(iii) ∫ ϕ + ψ dµ = ∫ ϕ dµ + ∫ ψ dµ.
Proof. (i) and (ii) are immediate consequences of the definition. To prove (iii), suppose
$$\varphi = \sum_{i=1}^m a_i 1_{A_i} \quad\text{and}\quad \psi = \sum_{j=1}^n b_j 1_{B_j}$$
To make the supports of the two functions the same, we let A0 = ∪i Bi − ∪i Ai, let
B0 = ∪i Ai − ∪i Bi, and let a0 = b0 = 0. Now
$$\varphi + \psi = \sum_{i=0}^m \sum_{j=0}^n (a_i + b_j) 1_{(A_i \cap B_j)}$$
and the Ai ∩ Bj are pairwise disjoint, so
$$\int (\varphi + \psi) \, d\mu = \sum_{i=0}^m \sum_{j=0}^n (a_i + b_j) \mu(A_i \cap B_j)$$
$$= \sum_{i=0}^m \sum_{j=0}^n a_i \mu(A_i \cap B_j) + \sum_{j=0}^n \sum_{i=0}^m b_j \mu(A_i \cap B_j)$$
$$= \sum_{i=0}^m a_i \mu(A_i) + \sum_{j=0}^n b_j \mu(B_j) = \int \varphi \, d\mu + \int \psi \, d\mu$$
In the next-to-last step, we used Ai = +j (Ai ∩ Bj) and Bj = +i (Ai ∩ Bj), where +
denotes a disjoint union.
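The common-refinement argument in the proof of (iii) can be checked by hand on a small example. The following is a minimal sketch, assuming a hypothetical six-point space with point masses `weights`; the names `mu`, `integral`, and `add_simple` are illustrative and do not come from the text. A simple function is stored as a list of pairs (a_i, A_i) with the A_i pairwise disjoint.

```python
from fractions import Fraction

# Hypothetical finite measure space: Omega = {0, ..., 5} with point masses.
weights = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1),
           3: Fraction(2), 4: Fraction(1, 4), 5: Fraction(3)}

def mu(A):
    """Measure of a subset A of Omega."""
    return sum((weights[w] for w in A), Fraction(0))

def integral(simple):
    """Integral of a simple function given as [(a_i, A_i)], A_i pairwise disjoint."""
    return sum((a * mu(A) for a, A in simple), Fraction(0))

def add_simple(phi, psi):
    """Represent phi + psi on the sets A_i & B_j, as in the proof of (iii)."""
    supp_phi = frozenset().union(*(A for _, A in phi))
    supp_psi = frozenset().union(*(B for _, B in psi))
    # Add A_0, B_0 with a_0 = b_0 = 0 so both functions share the same support.
    phi0 = [(Fraction(0), supp_psi - supp_phi)] + phi
    psi0 = [(Fraction(0), supp_phi - supp_psi)] + psi
    return [(a + b, A & B) for a, A in phi0 for b, B in psi0 if A & B]

phi = [(Fraction(2), frozenset({0, 1})), (Fraction(-1), frozenset({2}))]
psi = [(Fraction(3), frozenset({1, 2, 3})), (Fraction(5), frozenset({4}))]
total = add_simple(phi, psi)

print(integral(phi), integral(psi), integral(total))
```

Exact rational arithmetic is used so that the two sides of (iii) can be compared with `==` rather than up to rounding.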
We will prove (i)–(iii) three more times as we generalize our integral. As a consequence of (i)–(iii), we get three more useful properties. To keep from repeating their
proofs, which do not change, we will prove
Lemma A.4.2. If (i) and (iii) hold then we have:
(iv) If ϕ ≤ ψ a.e. then ∫ ϕ dµ ≤ ∫ ψ dµ.
(v) If ϕ = ψ a.e. then ∫ ϕ dµ = ∫ ψ dµ.
If, in addition, (ii) holds when a = −1 we have
(vi) |∫ ϕ dµ| ≤ ∫ |ϕ| dµ
Proof. By (iii), ∫ ψ dµ = ∫ ϕ dµ + ∫ (ψ − ϕ) dµ and the second integral is ≥ 0 by
(i), so (iv) holds. ϕ = ψ a.e. implies ϕ ≤ ψ a.e. and ψ ≤ ϕ a.e. so (v) follows
from two applications of (iv). To prove (vi) now, notice that ϕ ≤ |ϕ| so (iv) implies
∫ ϕ dµ ≤ ∫ |ϕ| dµ. −ϕ ≤ |ϕ|, so (iv) and (ii) imply −∫ ϕ dµ ≤ ∫ |ϕ| dµ. Since
|y| = max(y, −y), the result follows.
Step 2. Let E be a set with µ(E) < ∞ and let f be a bounded function that vanishes
on $E^c$. To define the integral of f, we observe that if ϕ, ψ are simple functions that
have ϕ ≤ f ≤ ψ, then we want to have
$$\int \varphi \, d\mu \le \int f \, d\mu \le \int \psi \, d\mu$$
so we let
$$\int f \, d\mu = \sup_{\varphi \le f} \int \varphi \, d\mu = \inf_{\psi \ge f} \int \psi \, d\mu \qquad \text{(A.4.1)}$$
Here and for the rest of Step 2, we assume that ϕ and ψ vanish on $E^c$. To justify
the definition, we have to prove that the sup and inf are equal. It follows from (iv) in
Lemma A.4.2 that
$$\sup_{\varphi \le f} \int \varphi \, d\mu \le \inf_{\psi \ge f} \int \psi \, d\mu$$
To prove the other inequality, suppose |f| ≤ M and let
$$E_k = \left\{ x \in E : \frac{kM}{n} \ge f(x) > \frac{(k-1)M}{n} \right\} \quad\text{for } -n \le k \le n$$
$$\psi_n(x) = \sum_{k=-n}^{n} \frac{kM}{n} 1_{E_k} \qquad \varphi_n(x) = \sum_{k=-n}^{n} \frac{(k-1)M}{n} 1_{E_k}$$
By definition, $\psi_n(x) - \varphi_n(x) = (M/n) 1_E$, so
$$\int \psi_n(x) - \varphi_n(x) \, d\mu = \frac{M}{n} \mu(E)$$
Since ϕn(x) ≤ f(x) ≤ ψn(x), it follows from (iii) in Lemma A.4.1 that
$$\sup_{\varphi \le f} \int \varphi \, d\mu \ge \int \varphi_n \, d\mu = -\frac{M}{n} \mu(E) + \int \psi_n \, d\mu \ge -\frac{M}{n} \mu(E) + \inf_{\psi \ge f} \int \psi \, d\mu$$
The last inequality holds for all n, so the proof is complete.
Lemma A.4.3. Let E be a set with µ(E) < ∞. If f and g are bounded functions
that vanish on $E^c$ then:
(i) If f ≥ 0 a.e. then ∫ f dµ ≥ 0.
(ii) For any a ∈ R, ∫ af dµ = a ∫ f dµ.
(iii) ∫ f + g dµ = ∫ f dµ + ∫ g dµ.
(iv) If g ≤ f a.e. then ∫ g dµ ≤ ∫ f dµ.
(v) If g = f a.e. then ∫ g dµ = ∫ f dµ.
(vi) |∫ f dµ| ≤ ∫ |f| dµ.
Proof. Since we can take ϕ ≡ 0, (i) is clear from the definition. To prove (ii), we
observe that if a > 0, then aϕ ≤ af if and only if ϕ ≤ f, so
$$\int af \, d\mu = \sup_{\varphi \le f} \int a\varphi \, d\mu = \sup_{\varphi \le f} a \int \varphi \, d\mu = a \sup_{\varphi \le f} \int \varphi \, d\mu = a \int f \, d\mu$$
For a < 0, we observe that aϕ ≤ af if and only if ϕ ≥ f, so
$$\int af \, d\mu = \sup_{\varphi \ge f} \int a\varphi \, d\mu = \sup_{\varphi \ge f} a \int \varphi \, d\mu = a \inf_{\varphi \ge f} \int \varphi \, d\mu = a \int f \, d\mu$$
To prove (iii), we observe that if ψ1 ≥ f and ψ2 ≥ g, then ψ1 + ψ2 ≥ f + g so
$$\inf_{\psi \ge f+g} \int \psi \, d\mu \le \inf_{\psi_1 \ge f,\, \psi_2 \ge g} \int \psi_1 + \psi_2 \, d\mu$$
Using linearity for simple functions, it follows that
$$\int f + g \, d\mu = \inf_{\psi \ge f+g} \int \psi \, d\mu \le \inf_{\psi_1 \ge f,\, \psi_2 \ge g} \left( \int \psi_1 \, d\mu + \int \psi_2 \, d\mu \right) = \int f \, d\mu + \int g \, d\mu$$
To prove the other inequality, observe that the last conclusion applied to −f and −g
and (ii) imply
$$-\int f + g \, d\mu \le -\int f \, d\mu - \int g \, d\mu$$
(iv)–(vi) follow from (i)–(iii) by Lemma A.4.2.
Notation. We define the integral of f over the set E:
$$\int_E f \, d\mu \equiv \int f \cdot 1_E \, d\mu$$
Step 3. If f ≥ 0 then we let
$$\int f \, d\mu = \sup\left\{ \int h \, d\mu : 0 \le h \le f,\ h \text{ is bounded and } \mu(\{x : h(x) > 0\}) < \infty \right\}$$
The last definition is nice since it is clear that this is well defined. The next result
will help us compute the value of the integral.
Lemma A.4.4. Let En ↑ Ω have µ(En) < ∞ and let a ∧ b = min(a, b). Then
$$\int_{E_n} f \wedge n \, d\mu \uparrow \int f \, d\mu \quad\text{as } n \uparrow \infty$$
Proof. It is clear from (iv) in Lemma A.4.3 that the left-hand side increases as
n does. Since $h = (f \wedge n) 1_{E_n}$ is a possibility in the sup, each term is smaller than
the integral on the right. To prove that the limit is ∫ f dµ, observe that if 0 ≤ h ≤ f,
h ≤ M, and µ({x : h(x) > 0}) < ∞, then for n ≥ M using h ≤ M, (iv), and (iii),
$$\int_{E_n} f \wedge n \, d\mu \ge \int_{E_n} h \, d\mu = \int h \, d\mu - \int_{E_n^c} h \, d\mu$$
Now $0 \le \int_{E_n^c} h \, d\mu \le M \mu(E_n^c \cap \{x : h(x) > 0\}) \to 0$ as n → ∞, so
$$\liminf_{n \to \infty} \int_{E_n} f \wedge n \, d\mu \ge \int h \, d\mu$$
which proves the desired result since h is an arbitrary member of the class that defines
the integral of f.
Lemma A.4.5. Suppose f, g ≥ 0.
(i) ∫ f dµ ≥ 0
(ii) If a > 0 then ∫ af dµ = a ∫ f dµ.
(iii) ∫ f + g dµ = ∫ f dµ + ∫ g dµ
(iv) If 0 ≤ g ≤ f a.e. then ∫ g dµ ≤ ∫ f dµ.
(v) If 0 ≤ g = f a.e. then ∫ g dµ = ∫ f dµ.
Proof. (i) is trivial from the definition. (ii) is clear, since when a > 0, ah ≤ af if and
only if h ≤ f and we have ∫ ah dµ = a ∫ h dµ for h in the defining class. For (iii), we
observe that if f ≥ h and g ≥ k, then f + g ≥ h + k so taking the sup over h and k
in the defining classes for f and g gives
$$\int f + g \, d\mu \ge \int f \, d\mu + \int g \, d\mu$$
To prove the other direction, we observe (a + b) ∧ n ≤ (a ∧ n) + (b ∧ n) so (iv) and
(iii) from Lemma A.4.3 imply
$$\int_{E_n} (f + g) \wedge n \, d\mu \le \int_{E_n} f \wedge n \, d\mu + \int_{E_n} g \wedge n \, d\mu$$
Letting n → ∞ and using Lemma A.4.4 gives (iii). As before, (iv) and (v) follow from
(i), (iii), and Lemma A.4.2.
Exercise A.4.1. Show that if f ≥ 0 and ∫ f dµ = 0 then f = 0 a.e.
Exercise A.4.2. Let f ≥ 0 and $E_{n,m} = \{x : m/2^n \le f(x) < (m+1)/2^n\}$. As n ↑ ∞,
$$\sum_{m=1}^{\infty} \frac{m}{2^n} \mu(E_{n,m}) \uparrow \int f \, d\mu$$
Step 4. We say f is integrable if ∫ |f| dµ < ∞. Let $f^+(x) = f(x) \vee 0$ and $f^-(x) = (-f(x)) \vee 0$ where a ∨ b = max(a, b). Clearly,
$$f(x) = f^+(x) - f^-(x) \quad\text{and}\quad |f(x)| = f^+(x) + f^-(x)$$
We define the integral of f by
$$\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu$$
The right-hand side is well defined since $f^+, f^- \le |f|$ and we have (iv) in Lemma
A.4.5. For the final time, we will prove our six properties. To do this, it is useful to
know:
Lemma A.4.6. If f = f1 − f2 where f1, f2 ≥ 0 and ∫ fi dµ < ∞ then
$$\int f \, d\mu = \int f_1 \, d\mu - \int f_2 \, d\mu$$
Proof. $f_1 + f^- = f_2 + f^+$ and all four functions are ≥ 0, so by (iii) of Lemma A.4.5,
$$\int f_1 \, d\mu + \int f^- \, d\mu = \int f_1 + f^- \, d\mu = \int f_2 + f^+ \, d\mu = \int f_2 \, d\mu + \int f^+ \, d\mu$$
Rearranging gives the desired conclusion.
Theorem A.4.7. Suppose f and g are integrable.
(i) If f ≥ 0 a.e. then ∫ f dµ ≥ 0.
(ii) For all a ∈ R, ∫ af dµ = a ∫ f dµ.
(iii) ∫ f + g dµ = ∫ f dµ + ∫ g dµ
(iv) If g ≤ f a.e. then ∫ g dµ ≤ ∫ f dµ.
(v) If g = f a.e. then ∫ g dµ = ∫ f dµ.
(vi) |∫ f dµ| ≤ ∫ |f| dµ.
Proof. (i) is trivial. (ii) is clear since if a > 0, then $(af)^+ = a(f^+)$, and so on. To
prove (iii), observe that $f + g = (f^+ + g^+) - (f^- + g^-)$, so using Lemma A.4.6 and
Lemma A.4.5
$$\int f + g \, d\mu = \int f^+ + g^+ \, d\mu - \int f^- + g^- \, d\mu = \int f^+ \, d\mu + \int g^+ \, d\mu - \int f^- \, d\mu - \int g^- \, d\mu$$
As usual, (iv)–(vi) follow from (i)–(iii) and Lemma A.4.2.
Notation for special cases:
(a) When (Ω, F, µ) = (Rd, Rd, λ), we write ∫ f(x) dx for ∫ f dλ.
(b) When (Ω, F, µ) = (R, R, λ) and E = [a, b], we write $\int_a^b f(x) \, dx$ for $\int_E f \, d\lambda$.
(c) When (Ω, F, µ) = (R, R, µ) with µ((a, b]) = G(b) − G(a) for a < b, we write
∫ f(x) dG(x) for ∫ f dµ.
(d) When Ω is a countable set, F = all subsets of Ω, and µ is counting measure, we
write $\sum_{i \in \Omega} f(i)$ for ∫ f dµ.
We mention example (d) primarily to indicate that results for sums follow from those
for integrals.
For the rest of this section, we will consider the case (Ω, F , µ) = (R, R, λ).
Littlewood’s principles
Speaking of the theory of functions of a real variable, Littlewood (1944) said
“The extent of knowledge required is nothing like so great as is sometimes supposed.
There are three principles, roughly expressible in the following terms:
1. Every measurable set is roughly a ﬁnite union of intervals.
2. Every measurable function is almost continuous.
3. Every convergent sequence of measurable functions is almost uniformly convergent.
Most of the results of the theory are fairly intuitive applications of these ideas and
the student armed with them should be equal to most occasions when real variable
theory is called for.”
Exercise A.3.1 above gives a version of the first principle. The next two exercises
develop a version of the second.
Exercise A.4.3. Let g be an integrable function on R and ε > 0. (i) Use the
definition of the integral to conclude there is a simple function $\varphi = \sum_k b_k 1_{A_k}$ with
∫ |g − ϕ| dx < ε. (ii) Use Exercise A.3.1 to approximate the Ak by finite unions of
intervals to get a step function
$$q = \sum_{j=1}^{k} c_j 1_{(a_{j-1}, a_j)}$$
with a0 < a1 < . . . < ak, so that ∫ |ϕ − q| < ε. (iii) Round the corners of q to get a
continuous function r so that ∫ |q − r| dx < ε.
Exercise A.4.4. Prove the Riemann-Lebesgue lemma. If g is integrable then
$$\lim_{n \to \infty} \int g(x) \cos nx \, dx = 0$$
Hint: If g is a step function, this is easy. Now use Exercise A.4.3.
* Riemann Integration
Our treatment of the Lebesgue integral would not be complete if we did not prove
the classic theorem of Lebesgue that identiﬁes the functions for which the Riemann
integral exists. Let −∞ < a < b < ∞. A subdivision σ of [a, b] is a ﬁnite sequence
a = x0 < x1 < . . . < xn = b. Given a subdivision σ, we define the
$$\text{upper Riemann sum}\quad U(\sigma) = \sum_{i=1}^{n} (x_i - x_{i-1}) \sup\{f(y) : y \in [x_{i-1}, x_i]\}$$
$$\text{lower Riemann sum}\quad L(\sigma) = \sum_{i=1}^{n} (x_i - x_{i-1}) \inf\{f(y) : y \in [x_{i-1}, x_i]\}$$
We say that f is Riemann integrable on [a, b] in the liberal sense if
$$\infty > \inf_{\sigma} U(\sigma) = \sup_{\sigma} L(\sigma) > -\infty$$
The function q(x) that is 1 if x is irrational and 0 if x is rational is the classic
example of a function that is not Riemann integrable on [0, 1] in the liberal sense but
is Lebesgue integrable on [0, 1]. ($q 1_{[0,1]}$ is a simple function!) The next result gives a
necessary condition for Riemann integrability.
Theorem A.4.8. If f is Riemann integrable on [a, b] in the liberal sense, then f is
bounded and continuous a.e. on [a, b].
Proof. If f is unbounded above, then U (σ ) = ∞ for all subdivisions. Likewise if f
is unbounded below L(σ ) = −∞ for all subdivisions. Thus, f must be bounded. To
prove that it must be continuous a.e., we begin by letting
$$u_n(x) = \sup\{f(y) : |x - y| < 2^{-n} \text{ and } y \in [a, b]\}$$
$$v_n(x) = \inf\{f(y) : |x - y| < 2^{-n} \text{ and } y \in [a, b]\}$$
Exercise 1.2.6 implies un and vn are measurable. Let
$$f^0 = \lim_{n \to \infty} u_n \quad\text{and}\quad f_0 = \lim_{n \to \infty} v_n$$
$f^0(x) \ge f_0(x)$ with equality if and only if f is continuous at x. Given a subdivision
σ,
$$\sup\{f(y) : y \in [x_{i-1}, x_i]\} \ge f^0(x) \quad\text{for } x \in (x_{i-1}, x_i)$$
so $U(\sigma) \ge \int_{[a,b]} f^0 \, dx$, the (Lebesgue) integral existing since $f^0$ is bounded and measurable. Similar reasoning shows that any lower Riemann sum has $\int_{[a,b]} f_0 \, dx \ge L(\sigma)$,
so if f is Riemann integrable in the liberal sense $\int_{[a,b]} f^0 - f_0 \, dx = 0$, and it follows
from Exercise A.4.1 that $f^0 = f_0$ a.e.
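To make the upper and lower Riemann sums concrete, here is a minimal sketch for uniform subdivisions. It assumes f is nondecreasing on [a, b], so that the sup and inf over each piece are attained at the right and left endpoints; the function name `upper_lower_sums` is illustrative and not from the text.

```python
from fractions import Fraction

def upper_lower_sums(f, a, b, n):
    """Return (U(sigma), L(sigma)) for the uniform subdivision of [a, b] into n pieces.

    ASSUMPTION: f is nondecreasing, so on [x_{i-1}, x_i] the sup is f(x_i)
    and the inf is f(x_{i-1}).
    """
    xs = [a + (b - a) * Fraction(i, n) for i in range(n + 1)]
    upper = sum((xs[i] - xs[i - 1]) * f(xs[i]) for i in range(1, n + 1))
    lower = sum((xs[i] - xs[i - 1]) * f(xs[i - 1]) for i in range(1, n + 1))
    return upper, lower

def square(x):
    return x * x

for n in (4, 16, 64):
    u, l = upper_lower_sums(square, 0, 1, n)
    print(n, u, l, u - l)
```

For a nondecreasing f on a uniform subdivision, U(σ) − L(σ) telescopes to (b − a)(f(b) − f(a))/n, so the gap shrinks as the mesh goes to 0. The Lebesgue integral of x² over [0, 1], which is 1/3, lies between the two sums.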
To state a converse to Theorem A.4.8, we need two deﬁnitions. The mesh of a
subdivision = sup(xi − xi−1 ). f is said to be Riemann integrable on [a, b] in the
strict sense if for any sequence of subdivisions with mesh → 0, U (σn ) − L(σn ) → 0.
Theorem A.4.9. If f is bounded and continuous a.e. on [a, b] then f is Riemann
integrable on [a, b] in the strict sense.
Proof. We need a little more theory before you can give a simple proof of this. See
Exercise A.5.4.
Exercise A.4.5. Give examples to show that for a function f deﬁned on R, neither
statement implies the other. (a) f is continuous a.e. (b) There is a continuous function
g so that f = g a.e.
Exercise A.4.6. Let (Ω, F, µ) be a finite measure space and let f be a function with
|f| < M. Given a sequence of subdivisions $-M = x_0^n < x_1^n < \ldots < x_n^n = M$, define
the
$$\text{upper Lebesgue sum}\quad \bar{U}(\sigma_n) = \sum_{m=1}^{n} x_m^n \, \mu(\{\omega : f(\omega) \in [x_{m-1}^n, x_m^n)\})$$
$$\text{lower Lebesgue sum}\quad \bar{L}(\sigma_n) = \sum_{m=1}^{n} x_{m-1}^n \, \mu(\{\omega : f(\omega) \in [x_{m-1}^n, x_m^n)\})$$
Show that if mesh(σn) → 0, then $\bar{U}(\sigma_n), \bar{L}(\sigma_n) \to \int f \, d\mu$. In short, in Riemann
integration we subdivide the domain, and in Lebesgue integration we subdivide the
range.

A.5 Properties of the Integral

In this section, we will develop properties of the integral defined in the last section.
Our ﬁrst result generalizes (vi) from Theorem A.4.7.
Theorem A.5.1. Jensen’s inequality. Suppose ϕ is convex, that is,
$$\lambda \varphi(x) + (1 - \lambda) \varphi(y) \ge \varphi(\lambda x + (1 - \lambda) y)$$
for all λ ∈ (0, 1) and x, y ∈ R. If µ is a probability measure, i.e., µ(R) = 1 and f
and ϕ(f) are integrable then
$$\varphi\left( \int f \, d\mu \right) \le \int \varphi(f) \, d\mu$$
Proof. Let c = ∫ f dµ and let ℓ(x) = ax + b be a linear function that has ℓ(c) = ϕ(c)
and ϕ(x) ≥ ℓ(x). To see that such a function exists, recall that convexity implies
$$\lim_{h \downarrow 0} \frac{\varphi(c) - \varphi(c - h)}{h} \le \lim_{h \downarrow 0} \frac{\varphi(c + h) - \varphi(c)}{h}$$
(The limits exist since the sequences are monotone.) If we let a be any number between
the two limits and let ℓ(x) = a(x − c) + ϕ(c), then ℓ has the desired properties. With
the existence of ℓ established, the rest is easy. (iv) in Theorem A.4.7 implies
$$\int \varphi(f) \, d\mu \ge \int (af + b) \, d\mu = a \int f \, d\mu + b = \ell\left( \int f \, d\mu \right) = \varphi\left( \int f \, d\mu \right)$$
since c = ∫ f dµ and ℓ(c) = ϕ(c).
Let $\|f\|_p = (\int |f|^p \, d\mu)^{1/p}$ for 1 ≤ p < ∞, and notice $\|cf\|_p = |c| \cdot \|f\|_p$ for any real
number c.
Theorem A.5.2. Hölder’s inequality. If p, q ∈ (1, ∞) with 1/p + 1/q = 1 then
$$\int |fg| \, d\mu \le \|f\|_p \|g\|_q$$
Proof. If $\|f\|_p$ or $\|g\|_q = 0$ then |fg| = 0 a.e., so it suffices to prove the result when
$\|f\|_p$ and $\|g\|_q > 0$ or by dividing both sides by $\|f\|_p \|g\|_q$, when $\|f\|_p = \|g\|_q = 1$.
Fix y ≥ 0 and let
$$\varphi(x) = x^p/p + y^q/q - xy \quad\text{for } x \ge 0$$
$$\varphi'(x) = x^{p-1} - y \quad\text{and}\quad \varphi''(x) = (p - 1) x^{p-2}$$
so ϕ has a minimum at $x_o = y^{1/(p-1)}$. q = p/(p − 1) and $x_o^p = y^{p/(p-1)} = y^q$ so
$$\varphi(x_o) = y^q (1/p + 1/q) - y^{1/(p-1)} y = 0$$
Since $x_o$ is the minimum, it follows that $xy \le x^p/p + y^q/q$. Letting x = |f|, y = |g|,
and integrating
$$\int |fg| \, d\mu \le \frac{1}{p} + \frac{1}{q} = 1 = \|f\|_p \|g\|_q$$
Remark. The special case p = q = 2 is called the Cauchy-Schwarz inequality.
One can give a direct proof of the result in this case by observing that for any θ,
$$0 \le \int (f + \theta g)^2 \, d\mu = \int f^2 \, d\mu + \theta \left( 2 \int fg \, d\mu \right) + \theta^2 \left( \int g^2 \, d\mu \right)$$
so the quadratic $a\theta^2 + b\theta + c$ on the right-hand side has at most one real root. Recalling
the formula for the roots of a quadratic
$$\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
we see $b^2 - 4ac \le 0$, which is the desired result.
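By notation (d) of Section A.4, Hölder's inequality for counting measure on a finite set is a statement about finite sums, which can be spot-checked numerically. The following is a minimal sketch under that assumption; the names `p_norm` and `holder_sides` are illustrative and not from the text, and floating-point comparisons are made with a small tolerance.

```python
def p_norm(v, p):
    """(sum |v_i|^p)^(1/p), i.e. the p-norm for counting measure on a finite set."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def holder_sides(f, g, p):
    """Return (integral of |fg|, ||f||_p * ||g||_q) with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    lhs = sum(abs(x * y) for x, y in zip(f, g))
    rhs = p_norm(f, p) * p_norm(g, q)
    return lhs, rhs

f = [1.0, -2.0, 3.0]
g = [4.0, 0.5, -1.0]

for p in (1.5, 2.0, 3.0):
    lhs, rhs = holder_sides(f, g, p)
    print(p, lhs, rhs, lhs <= rhs + 1e-12)

# With p = q = 2 and g = f, both sides equal the sum of squares of f.
eq_lhs, eq_rhs = holder_sides(f, f, 2.0)
```

The last two lines illustrate that the Cauchy-Schwarz case can hold with equality, here when g is a multiple of f.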
Exercise A.5.1. Let $\|f\|_\infty = \inf\{M : \mu(\{x : |f(x)| > M\}) = 0\}$. Prove that
$$\int |fg| \, d\mu \le \|f\|_1 \|g\|_\infty$$
Exercise A.5.2. Show that if µ is a probability measure then
$$\|f\|_\infty = \lim_{p \to \infty} \|f\|_p$$
Exercise A.5.3. Minkowski’s inequality. (i) Suppose p ∈ (1, ∞). The inequality
$|f + g|^p \le 2^p (|f|^p + |g|^p)$ shows that if $\|f\|_p$ and $\|g\|_p$ are < ∞ then $\|f + g\|_p < \infty$.
Apply Hölder’s inequality to $|f| |f + g|^{p-1}$ and $|g| |f + g|^{p-1}$ to show $\|f + g\|_p \le
\|f\|_p + \|g\|_p$. (ii) Show that the last result remains true when p = 1 or p = ∞.
Our next goal is to give conditions that guarantee
$$\lim_{n \to \infty} \int f_n \, d\mu = \int \left( \lim_{n \to \infty} f_n \right) d\mu$$
First, we need a definition. We say that fn → f in measure if, for any ε > 0,
µ({x : |fn(x) − f(x)| > ε}) → 0 as n → ∞. This is a weaker assumption than fn → f
a.e., but the next result is easier to prove in the greater generality.
Theorem A.5.3. Bounded convergence theorem. Let E be a set with µ(E) < ∞.
Suppose fn vanishes on $E^c$, |fn(x)| ≤ M, and fn → f in measure. Then
$$\int f \, d\mu = \lim_{n \to \infty} \int f_n \, d\mu$$
Example A.5.1. The functions $f_n(x) = 1_{[n,n+1)}(x)$, on R equipped with the Borel
sets R and Lebesgue measure λ, show that the conclusion of Theorem A.5.3 does not
hold when µ(E) = ∞.
Proof. Let ε > 0, Gn = {x : |fn(x) − f(x)| < ε} and Bn = E − Gn. Using (iii), (iv), and
(vi) from Theorem A.4.7,
$$\left| \int f \, d\mu - \int f_n \, d\mu \right| = \left| \int (f - f_n) \, d\mu \right| \le \int |f - f_n| \, d\mu = \int_{G_n} |f - f_n| \, d\mu + \int_{B_n} |f - f_n| \, d\mu \le \varepsilon \mu(E) + 2M \mu(B_n)$$
fn → f in measure implies µ(Bn) → 0. ε > 0 is arbitrary and µ(E) < ∞, so the proof
is complete.
Exercise A.5.4. Use Theorem A.5.3 to prove Theorem A.4.9. Hint: Given a subdivision σ let
$$f^\sigma(x) = \sup\{f(y) : y \in [x_{i-1}, x_i]\} \quad\text{for } x \in (x_{i-1}, x_i)$$
so that $\int_{[a,b]} f^\sigma(x) \, dx = U(\sigma)$.
Theorem A.5.4. Fatou’s lemma. If fn ≥ 0 then
$$\liminf_{n \to \infty} \int f_n \, d\mu \ge \int \left( \liminf_{n \to \infty} f_n \right) d\mu$$
Example A.5.2. Example A.5.1 shows that we may have strict inequality in Theorem
A.5.4. The functions $f_n(x) = n 1_{(0,1/n]}(x)$ on (0,1) equipped with the Borel sets and
Lebesgue measure show that this can happen on a space of finite measure.
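The strict inequality in Example A.5.2 can be seen by direct computation. Here is a minimal sketch with exact rational arithmetic; the names `f` and `integral_f` are illustrative, and the integral is computed from the closed form n · λ((0, 1/n]) rather than by numerical quadrature.

```python
from fractions import Fraction

def f(n, x):
    """f_n(x) = n * 1_{(0, 1/n]}(x) for x in (0, 1)."""
    return n if 0 < x <= Fraction(1, n) else 0

def integral_f(n):
    """Integral of f_n over (0, 1): the height n times the length 1/n."""
    return n * Fraction(1, n)

x = Fraction(1, 10)
values_at_x = [f(n, x) for n in range(1, 30)]   # equals 0 once n > 1/x = 10
integrals = [integral_f(n) for n in range(1, 30)]
print(values_at_x)
print(integrals)
```

For each fixed x in (0, 1) the values f_n(x) are 0 as soon as n > 1/x, so lim inf f_n = 0 pointwise, while every integral equals 1. Fatou's lemma therefore reads 1 ≥ 0 here, with strict inequality.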
Proof. Let $g_n(x) = \inf_{m \ge n} f_m(x)$. fn(x) ≥ gn(x) and as n ↑ ∞,
$$g_n(x) \uparrow g(x) = \liminf_{n \to \infty} f_n(x)$$
Since ∫ fn dµ ≥ ∫ gn dµ, it suffices then to show that
$$\liminf_{n \to \infty} \int g_n \, d\mu \ge \int g \, d\mu$$
Let Em ↑ Ω be sets of finite measure. Since gn ≥ 0 and for fixed m
$$(g_n \wedge m) \cdot 1_{E_m} \to (g \wedge m) \cdot 1_{E_m} \quad\text{a.e.}$$
the bounded convergence theorem, A.5.3, implies
$$\liminf_{n \to \infty} \int g_n \, d\mu \ge \int_{E_m} g_n \wedge m \, d\mu \to \int_{E_m} g \wedge m \, d\mu$$
Taking the sup over m and using Lemma A.4.4 gives the desired result.
Theorem A.5.5. Monotone convergence theorem. If fn ≥ 0 and fn ↑ f then
$$\int f_n \, d\mu \uparrow \int f \, d\mu$$
Proof. Fatou’s lemma, A.5.4, implies lim inf ∫ fn dµ ≥ ∫ f dµ. On the other hand,
fn ≤ f implies lim sup ∫ fn dµ ≤ ∫ f dµ.
Exercise A.5.5. If gn ↑ g and $\int g_1^- \, d\mu < \infty$ then ∫ gn dµ ↑ ∫ g dµ.
Exercise A.5.6. If gm ≥ 0 then $\int \sum_{m=0}^{\infty} g_m \, d\mu = \sum_{m=0}^{\infty} \int g_m \, d\mu$.
Exercise A.5.7. Let f ≥ 0. (i) Show that ∫ f ∧ n dµ ↑ ∫ f dµ as n → ∞. (ii) Use (i)
to conclude that if g is integrable and ε > 0 then we can pick δ > 0 so that µ(A) < δ
implies $\int_A |g| \, d\mu < \varepsilon$.
Theorem A.5.6. Dominated convergence theorem. If fn → f a.e., |fn| ≤ g
for all n, and g is integrable, then ∫ fn dµ → ∫ f dµ.
Proof. fn + g ≥ 0 so Fatou’s lemma implies
$$\liminf_{n \to \infty} \int f_n + g \, d\mu \ge \int f + g \, d\mu$$
Subtracting ∫ g dµ from both sides gives
$$\liminf_{n \to \infty} \int f_n \, d\mu \ge \int f \, d\mu$$
Applying the last result to −fn, we get
$$\limsup_{n \to \infty} \int f_n \, d\mu \le \int f \, d\mu$$
and the proof is complete.
Exercise A.5.8. If f is integrable and Em are disjoint sets with union E then
$$\sum_{m=0}^{\infty} \int_{E_m} f \, d\mu = \int_E f \, d\mu$$
So if f ≥ 0, then $\nu(E) = \int_E f \, d\mu$ defines a measure.
Exercise A.5.9. Show that if f is integrable on [a, b], $g(x) = \int_{[a,x]} f(y) \, dy$ is continuous on (a, b).
Exercise A.5.10. Show that if f has $\|f\|_p = (\int |f|^p \, d\mu)^{1/p} < \infty$, then there are
simple functions ϕn so that $\|\varphi_n - f\|_p \to 0$.
Exercise A.5.11. Show that if $\sum_n \int |f_n| \, d\mu < \infty$ then $\sum_n \int f_n \, d\mu = \int \sum_n f_n \, d\mu$.

A.6 Product Measures, Fubini’s Theorem

Let (X, A, µ1) and (Y, B, µ2) be two σ-finite measure spaces. Let
Ω = X × Y = {(x, y ) : x ∈ X, y ∈ Y }
S = {A × B : A ∈ A, B ∈ B}
Sets in S are called rectangles. It is easy to see that S is a semialgebra:
(A × B ) ∩ (C × D) = (A ∩ C ) × (B ∩ D)
(A × B )c = (Ac × B ) ∪ (A × B c ) ∪ (Ac × B c )
Let F = A × B be the σ algebra generated by S .
Theorem A.6.1. There is a unique measure µ on F with
µ(A × B ) = µ1 (A)µ2 (B )
Notation. µ is often denoted by µ1 × µ2 .
Proof. By Theorem A.1.3 it is enough to show that if A × B = +i (Ai × Bi) then
$$\mu(A \times B) = \sum_i \mu(A_i \times B_i)$$
For each x ∈ A, let I(x) = {i : x ∈ Ai}. $B = +_{i \in I(x)} B_i$, so
$$1_A(x) \mu_2(B) = \sum_i 1_{A_i}(x) \mu_2(B_i)$$
Integrating with respect to µ1 and using Exercise A.5.6 gives
$$\mu_1(A) \mu_2(B) = \sum_i \mu_1(A_i) \mu_2(B_i)$$
which proves the result.
Exercise A.6.1. Let Ao ⊂ A and Bo ⊂ B be semialgebras with σ (Ao ) = A and
σ (Bo ) = B . Given a measure µ1 on A and a measure µ2 on B , there is a unique
measure µ on A × B that has µ(A × B ) = µ1 (A)µ2 (B ) for A ∈ Ao and B ∈ Bo . The
point of this exercise is that we can deﬁne Lebesgue measure on R2 by the requirement
that λ((a, b] × (c, d]) = (b − a)(d − c).
Using Theorem A.6.1 and induction, it follows that if (Ωi , Fi , µi ), i = 1, . . . , n, are
σ ﬁnite measure spaces and Ω = Ω1 × · · · × Ωn , there is a unique measure µ on the
σ-algebra F generated by sets of the form A1 × · · · × An, Ai ∈ Fi, that has
$$\mu(A_1 \times \cdots \times A_n) = \prod_{m=1}^{n} \mu_m(A_m)$$
When (Ωi, Fi, µi) = (R, R, λ) for all i, the result is Lebesgue measure on the Borel
subsets of n-dimensional Euclidean space Rn.
Returning to the case in which (Ω, F, µ) is the product of two measure spaces,
(X, A, µ1) and (Y, B, µ2), our next goal is to prove:
Theorem A.6.2. Fubini’s theorem. If f ≥ 0 or ∫ |f| dµ < ∞ then
$$(*) \qquad \int_X \int_Y f(x, y) \, \mu_2(dy) \, \mu_1(dx) = \int_{X \times Y} f \, d\mu = \int_Y \int_X f(x, y) \, \mu_1(dx) \, \mu_2(dy)$$
Proof. We will prove only the first equality, since the second one is similar. Two
technical things that need to be proved before we can assert that the first integral
makes sense are:
When x is fixed, y → f(x, y) is B measurable.
$x \to \int_Y f(x, y) \, \mu_2(dy)$ is A measurable.
We begin with the case $f = 1_E$. Let Ex = {y : (x, y) ∈ E} be the cross-section at
x.
Lemma A.6.3. If E ∈ F then Ex ∈ B .
Proof. (E c )x = (Ex )c and (∪i Ei )x = ∪i (Ei )x , so if E is the collection of sets E for
which Ex ∈ B, then E is a σ algebra. Since E contains the rectangles, the result
follows.
Lemma A.6.4. If E ∈ F then g(x) ≡ µ2(Ex) is A measurable and
$$\int_X g \, d\mu_1 = \mu(E)$$
Notice that it is not obvious that the collection of sets for which the conclusion is
true is a σ-algebra since µ(E1 ∪ E2) = µ(E1) + µ(E2) − µ(E1 ∩ E2). Dynkin’s π − λ
theorem, A.2.1, was tailor-made for situations like this.
Proof. If conclusions hold for En and En ↑ E , then Theorem 1.2.5 and the monotone
convergence theorem imply that they hold for E . Since µ1 and µ2 are σ ﬁnite, it is
enough then to prove the result for E ⊂ A × B with µ1 (A) < ∞ and µ2 (B ) < ∞, or
taking Ω = A × B we can suppose without loss of generality that µ(Ω) < ∞. Let L
be the collection of sets E for which the conclusions hold. We will now check that L
is a λsystem. Property (i) of a λsystem is trivial. (iii) follows from the ﬁrst sentence
in the proof. To check (ii) we observe that
µ2 ((A − B )x ) = µ2 (Ax − Bx ) = µ2 (Ax ) − µ2 (Bx )
and integrating over x gives the second conclusion. Since L contains the rectangles,
a π system that generates F , the desired result follows from the π − λ theorem.
We are now ready to prove Theorem A.6.2 by verifying it in four increasingly more
general special cases.
Case 1. If E ∈ F and f = 1E then (∗) follows from Lemma A.6.4
Case 2. Since each integral is linear in f , it follows that (∗) holds for simple functions.
Case 3. Now if f ≥ 0 and we let $f_n(x) = ([2^n f(x)]/2^n) \wedge n$, where [x] = the largest
integer ≤ x, then the fn are simple and fn ↑ f, so it follows from the monotone
convergence theorem that (∗) holds for all f ≥ 0.
Case 4. The general case now follows by writing $f(x) = f(x)^+ - f(x)^-$ and applying
Case 3 to $f^+$, $f^-$, and |f|.
To illustrate why the various hypotheses of Theorem A.6.2 are needed, we will
now give some examples where the conclusion fails.
Example A.6.1. Let X = Y = {1, 2, . . .} with A = B = all subsets and µ1 = µ2 =
counting measure. For m ≥ 1, let f (m, m) = 1 and f (m + 1, m) = −1, and let
f(m, n) = 0 otherwise. We claim that
$$\sum_m \sum_n f(m, n) = 1 \quad\text{but}\quad \sum_n \sum_m f(m, n) = 0$$
A picture is worth several dozen words:

    ↑    .    .    .    .
    n    0    0    0    1   ...
         0    0    1   −1   ...
         0    1   −1    0   ...
         1   −1    0    0   ...
              m →

In words, if we sum the columns first, the first one gives us a 1 and the others 0, while
if we sum the rows each one gives us a 0.
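The two iterated sums in Example A.6.1 can be computed exactly, because for each fixed index the inner sum has at most two nonzero terms. A minimal sketch follows; the helper names `sum_over_n` and `sum_over_m` and the truncation level `N` are illustrative choices, not part of the text.

```python
def f(m, n):
    """f(m, m) = 1, f(m + 1, m) = -1, and 0 otherwise (m, n >= 1)."""
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

def sum_over_n(m):
    """Inner sum over n for fixed m; the support is contained in {m - 1, m}."""
    return sum(f(m, n) for n in range(1, m + 2))

def sum_over_m(n):
    """Inner sum over m for fixed n; the support is {n, n + 1}."""
    return sum(f(m, n) for m in range(1, n + 3))

N = 50  # illustrative cutoff for the outer sums
column_sums = [sum_over_n(m) for m in range(1, N + 1)]
row_sums = [sum_over_m(n) for n in range(1, N + 1)]

print(sum(column_sums), sum(row_sums))
```

The inner sums are exact, and the outer partial sums do not depend on the cutoff N, so the two iterated sums are 1 and 0. Fubini's theorem does not apply because the sum of |f(m, n)| over all pairs is infinite.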
Example A.6.2. Let X = (0, 1), Y = (1, ∞), both equipped with the Borel sets and
Lebesgue measure. Let $f(x, y) = e^{-xy} - 2e^{-2xy}$.
$$\int_0^1 \int_1^\infty f(x, y) \, dy \, dx = \int_0^1 x^{-1} (e^{-x} - e^{-2x}) \, dx > 0$$
$$\int_1^\infty \int_0^1 f(x, y) \, dx \, dy = \int_1^\infty y^{-1} (e^{-2y} - e^{-y}) \, dy < 0$$
The next example indicates why µ1 and µ2 must be σ-finite.
Example A.6.3. Let X = (0, 1) with A = the Borel sets and µ1 = Lebesgue measure.
Let Y = (0, 1) with B = all subsets and µ2 = counting measure. Let f (x, y ) = 1 if
x = y and 0 otherwise
$$\int_Y f(x, y) \, \mu_2(dy) = 1 \text{ for all } x \quad\text{so}\quad \int_X \int_Y f(x, y) \, \mu_2(dy) \, \mu_1(dx) = 1$$
$$\int_X f(x, y) \, \mu_1(dx) = 0 \text{ for all } y \quad\text{so}\quad \int_Y \int_X f(x, y) \, \mu_1(dx) \, \mu_2(dy) = 0$$
Our last example shows that measurability is important or maybe that some of
the axioms of set theory are not as innocent as they seem.
Example A.6.4. By the axiom of choice and the continuum hypothesis one can
define an order relation <′ on (0,1) so that {x : x <′ y} is countable for each y. Let
X = Y = (0, 1), let A = B = the Borel sets and µ1 = µ2 = Lebesgue measure.
Let f(x, y) = 1 if x <′ y, = 0 otherwise. Since {x : x <′ y} and $\{y : x <' y\}^c$ are
countable,
$$\int_X f(x, y) \, \mu_1(dx) = 0 \text{ for all } y \qquad \int_Y f(x, y) \, \mu_2(dy) = 1 \text{ for all } x$$
We turn now to applications of Theorem A.6.2.
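Exercise A.6.2. If $\int_X \int_Y |f(x, y)| \, \mu_2(dy) \, \mu_1(dx) < \infty$ then
$$\int_X \int_Y f(x, y) \, \mu_2(dy) \, \mu_1(dx) = \int_{X \times Y} f \, d(\mu_1 \times \mu_2) = \int_Y \int_X f(x, y) \, \mu_1(dx) \, \mu_2(dy)$$
Example A.6.5. Let X = {1, 2, . . .}, A = all subsets of X, and µ1 = counting
measure. If $\sum_n \int |f_n| \, d\mu < \infty$ then $\sum_n \int f_n \, d\mu = \int \sum_n f_n \, d\mu$.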
Exercise A.6.3. Let g ≥ 0 be a measurable function on (X, A, µ). Use Theorem
A.6.2 to conclude that
$$\int_X g \, d\mu = (\mu \times \lambda)(\{(x, y) : 0 \le y < g(x)\}) = \int_0^\infty \mu(\{x : g(x) > y\}) \, dy$$
Exercise A.6.4. Let F, G be Stieltjes measure functions and let µ, ν be the corresponding measures on (R, R). Show that
(i) $\int_{(a,b]} \{F(y) - F(a)\} \, dG(y) = (\mu \times \nu)(\{(x, y) : a < x \le y \le b\})$
(ii) $\int_{(a,b]} F(y) \, dG(y) + \int_{(a,b]} G(y) \, dF(y) = F(b)G(b) - F(a)G(a) + \sum_{x \in (a,b]} \mu(\{x\}) \nu(\{x\})$
(iii) If F = G is continuous then $\int_{(a,b]} 2F(y) \, dF(y) = F^2(b) - F^2(a)$.
To see the second term in (ii) is needed, let $F(x) = G(x) = 1_{[0,\infty)}(x)$ and a < 0 < b.
Exercise A.6.5. Let µ be a finite measure on R and F(x) = µ((−∞, x]). Show that
$$\int (F(x + c) - F(x)) \, dx = c \mu(R)$$
Exercise A.6.6. Show that $e^{-xy} \sin x$ is integrable in the strip 0 < x < a, 0 < y.
Perform the double integral in the two orders to get:
$$\int_0^a \frac{\sin x}{x} \, dx = \frac{\pi}{2} - (\cos a) \int_0^\infty \frac{e^{-ay}}{1 + y^2} \, dy - (\sin a) \int_0^\infty \frac{y e^{-ay}}{1 + y^2} \, dy$$
and replace $1 + y^2$ by 1 to conclude $\left| \int_0^a (\sin x)/x \, dx - (\pi/2) \right| \le 2/a$ for a ≥ 1.

A.7 Kolmogorov’s Extension Theorem

To construct some of the basic objects of study in probability theory, we will need an
existence theorem for measures on inﬁnite product spaces. Let N = {1, 2, . . .} and
RN = {(ω1 , ω2 , . . .) : ωi ∈ R}
We equip RN with the product σ algebra RN , which is generated by the ﬁnite
dimensional rectangles = sets of the form {ω : ωi ∈ (ai , bi ] for i = 1, . . . , n}, where
−∞ ≤ ai < bi ≤ ∞.
Theorem A.7.1. Kolmogorov’s extension theorem. Suppose we are given probability measures µn on (Rn , Rn ) that are consistent, that is,
µn+1 ((a1 , b1 ] × . . . × (an , bn ] × R) = µn ((a1 , b1 ] × . . . × (an , bn ])
Then there is a unique probability measure P on (RN , RN ) with
(∗) P (ω : ωi ∈ (ai , bi ], 1 ≤ i ≤ n) = µn ((a1 , b1 ] × . . . × (an , bn ])
An important example of a consistent sequence of measures is
Example A.7.1. Let F1, F2, . . . be distribution functions and let µn be the measure
on Rn with
$$\mu_n((a_1, b_1] \times \ldots \times (a_n, b_n]) = \prod_{m=1}^{n} (F_m(b_m) - F_m(a_m))$$
In this case, if we let Xn(ω) = ωn, then the Xn are independent and Xn has distribution Fn.
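The consistency condition of Theorem A.7.1 can be checked directly for the product measures of Example A.7.1. Below is a minimal sketch, assuming three particular distribution functions (standard normal, exponential with rate 1, uniform on (0,1)) chosen purely for illustration; the names `mu_n`, `normal_cdf`, `exp_cdf`, and `uniform_cdf` are not from the text.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function, with the limits at +/- infinity."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exp_cdf(x):
    """Exponential(1) distribution function."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def uniform_cdf(x):
    """Uniform(0,1) distribution function."""
    return min(max(x, 0.0), 1.0)

def mu_n(rect, cdfs):
    """mu_n of (a_1, b_1] x ... x (a_n, b_n]: the product of F_m(b_m) - F_m(a_m)."""
    prob = 1.0
    for (a, b), F in zip(rect, cdfs):
        prob *= F(b) - F(a)
    return prob

cdfs = [normal_cdf, exp_cdf, uniform_cdf]
rect = [(-1.0, 0.5), (0.2, 2.0)]

# Consistency: mu_3(rect x R) should equal mu_2(rect).
lhs = mu_n(rect + [(-math.inf, math.inf)], cdfs)
rhs = mu_n(rect, cdfs[:2])
print(lhs, rhs)
```

Appending the factor R multiplies the product by F3(∞) − F3(−∞) = 1, which is the reason the sequence µn is consistent.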
Proof of Theorem A.7.1. Let $\mathcal{S}$ be the sets of the form $\{\omega : \omega_i \in (a_i, b_i],\ 1 \le i \le n\}$, and use $(*)$ to define $P$ on $\mathcal{S}$. $\mathcal{S}$ is a semialgebra, so by Theorem A.1.3 it is enough to show that if $A \in \mathcal{S}$ is a disjoint union of $A_i \in \mathcal{S}$, then $P(A) \le \sum_i P(A_i)$. If the union is finite, then all the $A_i$ are determined by the values of a finite number of coordinates and the conclusion follows from results in Section A.6.
Suppose now that the union is infinite. Let $\mathcal{A}$ = { finite disjoint unions of sets in $\mathcal{S}$ } be the algebra generated by $\mathcal{S}$. Since $\mathcal{A}$ is an algebra (by Lemma A.1.2)
\[ B_n \equiv A - \cup_{i=1}^n A_i \]
is a finite disjoint union of rectangles, and by the result for finite unions,
\[ P(A) = \sum_{i=1}^n P(A_i) + P(B_n) \]
It suffices then to show
Lemma A.7.2. If Bn ∈ A and Bn ↓ ∅ then P (Bn ) ↓ 0.
Proof. Suppose $P(B_n) \downarrow \delta > 0$. By repeating sets in the sequence, we can suppose
\[ B_n = \cup_{k=1}^{K_n} \{\omega : \omega_i \in (a_i^k, b_i^k],\ 1 \le i \le n\} \quad \text{where } -\infty \le a_i^k < b_i^k \le \infty \]
The strategy of the proof is to approximate the $B_n$ from within by compact rectangles with almost the same probability and then use a diagonal argument to show that $\cap_n B_n \ne \emptyset$. There is a set $C_n \subset B_n$ of the form
\[ C_n = \cup_{k=1}^{K_n} \{\omega : \omega_i \in [\bar a_i^k, \bar b_i^k],\ 1 \le i \le n\} \quad \text{with } -\infty < \bar a_i^k < \bar b_i^k < \infty \]
that has $P(B_n - C_n) \le \delta/2^{n+1}$. Let $D_n = \cap_{m=1}^n C_m$.
\[ P(B_n - D_n) \le \sum_{m=1}^n P(B_m - C_m) \le \delta/2 \]
so $P(D_n) \downarrow$ a limit $\ge \delta/2$. Now there are sets $C_n^*, D_n^* \subset \mathbb{R}^n$ so that
\[ C_n = \{\omega : (\omega_1, \ldots, \omega_n) \in C_n^*\} \quad \text{and} \quad D_n = \{\omega : (\omega_1, \ldots, \omega_n) \in D_n^*\} \]
Note that
\[ C_n = C_n^* \times \mathbb{R} \times \mathbb{R} \times \cdots \quad \text{and} \quad D_n = D_n^* \times \mathbb{R} \times \mathbb{R} \times \cdots \]
so $C_n$ and $C_n^*$ (and $D_n$ and $D_n^*$) are closely related but $C_n \subset \Omega$ and $C_n^* \subset \mathbb{R}^n$.
$C_n^*$ is a finite union of closed rectangles, so
\[ D_n^* = C_n^* \cap \cap_{m=1}^{n-1} (C_m^* \times \mathbb{R}^{n-m}) \]
is a compact set. For each $m$, let $\omega_m \in D_m$. $D_m \subset D_1$ so $\omega_{m,1}$ (i.e., the first coordinate of $\omega_m$) is in $D_1^*$. Since $D_1^*$ is compact, we can pick a subsequence $m(1,j) \ge j$ so that as $j \to \infty$,
\[ \omega_{m(1,j),1} \to \text{a limit } \theta_1 \]
For $m \ge 2$, $D_m \subset D_2$ and hence $(\omega_{m,1}, \omega_{m,2}) \in D_2^*$. Since $D_2^*$ is compact, we can pick a subsequence of the previous subsequence (i.e., $m(2,j) = m(1, i_j)$ with $i_j \ge j$) so that as $j \to \infty$
\[ \omega_{m(2,j),2} \to \text{a limit } \theta_2 \]
Continuing in this way, we define $m(k,j)$ a subsequence of $m(k-1,j)$ so that as $j \to \infty$,
\[ \omega_{m(k,j),k} \to \text{a limit } \theta_k \]
Let $\omega'_i = \omega_{m(i,i)}$. $\omega'_i$ is a subsequence of all the subsequences so $\omega'_{i,k} \to \theta_k$ for all $k$. Now $\omega'_{i,1} \in D_1^*$ for all $i \ge 1$ and $D_1^*$ is closed so $\theta_1 \in D_1^*$. Turning to the second set, $(\omega'_{i,1}, \omega'_{i,2}) \in D_2^*$ for $i \ge 2$ and $D_2^*$ is closed, so $(\theta_1, \theta_2) \in D_2^*$. Repeating the last argument, we conclude that $(\theta_1, \ldots, \theta_k) \in D_k^*$ for all $k$, so $\omega = (\theta_1, \theta_2, \ldots) \in D_k$ (no star here since we are now talking about subsets of $\Omega$) for all $k$ and
\[ \emptyset \ne \cap_k D_k \subset \cap_k B_k \]
a contradiction that proves the desired result.

A.8 Radon-Nikodym Theorem

In this section, we prove the Radon-Nikodym theorem. To develop that result, we
begin with a topic that at ﬁrst may appear to be unrelated. Let (Ω, F ) be a measurable
space. $\alpha$ is said to be a signed measure on $(\Omega, \mathcal{F})$ if (i) $\alpha$ takes values in $(-\infty, \infty]$, (ii) $\alpha(\emptyset) = 0$, and (iii) if $E = +_i E_i$ is a disjoint union then $\alpha(E) = \sum_i \alpha(E_i)$, in the following sense:
If $\alpha(E) < \infty$, the sum converges absolutely and $= \alpha(E)$.
If $\alpha(E) = \infty$, then $\sum_i \alpha(E_i)^- < \infty$ and $\sum_i \alpha(E_i)^+ = \infty$.
Clearly, a signed measure cannot be allowed to take both the values $\infty$ and $-\infty$,
since α(A) + α(B ) might not make sense. In most formulations, a signed measure
is allowed to take values in either (−∞, ∞] or [−∞, ∞). We will ignore the second
possibility to simplify statements later. As usual, we turn to examples to help explain
the deﬁnition.
Example A.8.1. Let $\mu$ be a measure, $f$ be a function with $\int f^- \, d\mu < \infty$, and let $\alpha(A) = \int_A f \, d\mu$. Exercise 5.8 implies that $\alpha$ is a signed measure.
Example A.8.2. Let µ1 and µ2 be measures with µ2 (Ω) < ∞, and let α(A) =
µ1 (A) − µ2 (A).
The Jordan decomposition, Theorem A.8.4 below, will show that Example A.8.2 is the general case. To derive that result, we begin with two definitions. A set A is positive
if every measurable B ⊂ A has α(B ) ≥ 0. A set A is negative if every measurable
B ⊂ A has α(B ) ≤ 0.
Exercise A.8.1. In Example A.8.1, A is positive if and only if µ(A ∩ {x : f (x) <
0}) = 0.
Lemma A.8.1. (i) Every measurable subset of a positive set is positive. (ii) If the
sets An are positive then A = ∪n An is also positive.
Proof. (i) is trivial. To prove (ii), observe that
\[ B_n = A_n \cap \left( \cap_{m=1}^{n-1} A_m^c \right) \subset A_n \]
are positive, disjoint, and $\cup_n B_n = \cup_n A_n$. Let $E \subset A$ be measurable, and let $E_n = E \cap B_n$. $\alpha(E_n) \ge 0$ since $B_n$ is positive, so $\alpha(E) = \sum_n \alpha(E_n) \ge 0$.
The conclusions in Lemma A.8.1 remain valid if positive is replaced by negative.
The next result is the key to the proof of Theorem A.8.3.
Lemma A.8.2. Let E be a measurable set with α(E ) < 0. Then there is a negative
set F ⊂ E with α(F ) < 0.
Proof. If $E$ is negative, this is true. If not, let $n_1$ be the smallest positive integer so that there is an $E_1 \subset E$ with $\alpha(E_1) \ge 1/n_1$. Let $k \ge 2$. If $F_k = E - (E_1 \cup \ldots \cup E_{k-1})$ is negative, we are done. If not, we continue the construction letting $n_k$ be the smallest positive integer so that there is an $E_k \subset F_k$ with $\alpha(E_k) \ge 1/n_k$. If the construction does not stop for any $k < \infty$, let
\[ F = \cap_k F_k = E - (\cup_k E_k) \]
Since $0 > \alpha(E) > -\infty$ and $\alpha(E_k) \ge 0$, it follows from the definition of signed measure that
\[ \alpha(E) = \alpha(F) + \sum_{k=1}^\infty \alpha(E_k) \]
$\alpha(F) \le \alpha(E) < 0$, and the sum is finite. From the last observation and the construction, it follows that $F$ can have no subset $G$ with $\alpha(G) > 0$, for then $\alpha(G) \ge 1/N$ for some $N$, while finiteness of the sum forces $n_k \to \infty$, and since $G \subset F_k$ for every $k$ this contradicts the minimal choice of $n_k$ once $n_k > N$.
Theorem A.8.3. Hahn decomposition. Let α be a signed measure. Then there is
a positive set A and a negative set B so that Ω = A ∪ B and A ∩ B = ∅.
Proof. Let c = inf {α(B ) : B is negative} ≤ 0. Let Bi be negative sets with α(Bi ) ↓ c.
Let B = ∪i Bi . By Lemma A.8.1, B is negative, so by the deﬁnition of c, α(B ) ≥ c.
To prove α(B ) ≤ c, we observe that α(B ) = α(Bi ) + α(B − Bi ) ≤ α(Bi ), since B is
negative, and let i → ∞. The last two inequalities show that α(B ) = c, and it follows
from our deﬁnition of a signed measure that c > −∞. Let A = B c . To show A is
positive, observe that if A contains a set with α(E ) < 0, then by Lemma A.8.2, it
contains a negative set F with α(F ) < 0, but then B ∪ F would be a negative set that
has α(B ∪ F ) = α(B ) + α(F ) < c, a contradiction.
The Hahn decomposition is not unique. In Example A.8.1, $A$ can be any set with
\[ \{x : f(x) > 0\} \subset A \subset \{x : f(x) \ge 0\} \quad \text{a.e.} \]
where $B \subset C$ a.e. means $\mu(B \cap C^c) = 0$. The last example is typical of the general
situation. Suppose Ω = A1 ∪ B1 = A2 ∪ B2 are two Hahn decompositions. A2 ∩ B1 is
positive and negative, so it is a null set: All its subsets have measure 0. Similarly,
A1 ∩ B2 is a null set.
Two measures µ1 and µ2 are said to be mutually singular if there is a set A with
µ1 (A) = 0 and µ2 (Ac ) = 0. In this case, we also say µ1 is singular with respect to
µ2 and write µ1 ⊥ µ2 .
Exercise A.8.2. Show that the uniform distribution on the Cantor set (Example
2.3.13) is singular with respect to Lebesgue measure.
Theorem A.8.4. Jordan decomposition. Let α be a signed measure. There are
mutually singular measures α+ and α− so that α = α+ − α− . Moreover, there is only
one such pair.
Proof. Let $\Omega = A \cup B$ be a Hahn decomposition. Let
\[ \alpha_+(E) = \alpha(E \cap A) \quad \text{and} \quad \alpha_-(E) = -\alpha(E \cap B) \]
Since $A$ is positive and $B$ is negative, $\alpha_+$ and $\alpha_-$ are measures. $\alpha_+(A^c) = 0$ and $\alpha_-(A) = 0$, so they are mutually singular. To prove uniqueness, suppose $\alpha = \nu_1 - \nu_2$ and $D$ is a set with $\nu_1(D) = 0$ and $\nu_2(D^c) = 0$. If we set $C = D^c$, then $\Omega = C \cup D$ is a Hahn decomposition, and it follows from the choice of $D$ that
\[ \nu_1(E) = \alpha(C \cap E) \quad \text{and} \quad \nu_2(E) = -\alpha(D \cap E) \]
Our uniqueness result for the Hahn decomposition shows that $A \cap D = A \cap C^c$ and $B \cap C = A^c \cap C$ are null sets, so $\alpha(E \cap C) = \alpha(E \cap (A \cup C)) = \alpha(E \cap A)$ and $\nu_1 = \alpha_+$.
Exercise A.8.3. Show that $\alpha_+(E) = \sup\{\alpha(F) : F \subset E\}$.
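On a finite set the Hahn and Jordan decompositions can be written down explicitly, which makes the definitions easy to test. The Python sketch below assumes a hypothetical signed measure given by point masses on $\Omega = \{0, \ldots, 4\}$; the masses and all function names are illustrative and not from the text. It builds one Hahn decomposition, forms $\alpha_+$ and $\alpha_-$, and checks the formula of Exercise A.8.3 by brute force over subsets.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical point masses of a signed measure on Omega = {0, 1, 2, 3, 4}.
mass = [Fraction(2), Fraction(-1), Fraction(0), Fraction(3, 2), Fraction(-5, 2)]
omega = set(range(len(mass)))

def alpha(E):
    """alpha(E) = sum of the point masses in E."""
    return sum((mass[i] for i in E), Fraction(0))

# One Hahn decomposition. Point 2 has mass 0 and could be moved to B,
# which illustrates that the decomposition is unique only up to null sets.
A = {i for i in omega if mass[i] >= 0}   # positive set
B = omega - A                            # negative set

def alpha_plus(E):
    return alpha(set(E) & A)

def alpha_minus(E):
    return -alpha(set(E) & B)

def subsets(E):
    """All subsets of the finite set E."""
    items = list(E)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)

E = {0, 1, 2, 4}
sup_over_subsets = max(alpha(G) for G in subsets(E))
print(alpha_plus(E), alpha_minus(E), sup_over_subsets)
```

Exact rational arithmetic via `Fraction` is used so that the identities hold with equality rather than up to rounding.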
Remark. Let $\alpha$ be a finite signed measure (i.e., one that does not take the value $\infty$ or $-\infty$) on $(\mathbb{R}, \mathcal{R})$. Let $\alpha = \alpha_+ - \alpha_-$ be its Jordan decomposition. Let $A(x) = \alpha((-\infty, x])$, $F(x) = \alpha_+((-\infty, x])$, and $G(x) = \alpha_-((-\infty, x])$. $A(x) = F(x) - G(x)$, so the distribution function for a finite signed measure can be written as a difference of two bounded increasing functions. It follows from Example A.8.2 that the converse is also true. Let $|\alpha| = \alpha_+ + \alpha_-$. $|\alpha|$ is called the total variation of $\alpha$, since in this example $|\alpha|((a, b])$ is the total variation of $A$ over $(a, b]$ as defined in analysis textbooks. See, for example, Royden (1988), p. 103. We exclude the left endpoint of the interval since a jump there makes no contribution to the total variation on $[a, b]$, but it does appear in $|\alpha|$.
Our third and ﬁnal decomposition is:
Theorem A.8.5. Lebesgue decomposition. Let $\mu$, $\nu$ be $\sigma$-finite measures. $\nu$ can be written as $\nu_r + \nu_s$, where $\nu_s$ is singular with respect to $\mu$ and
\[ \nu_r(E) = \int_E g \, d\mu \]
Proof. By decomposing $\Omega = +_i \Omega_i$, we can suppose without loss of generality that $\mu$ and $\nu$ are finite measures. Let $\mathcal{G}$ be the set of $g \ge 0$ so that $\int_E g \, d\mu \le \nu(E)$ for all $E$.
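(a) If $g, h \in \mathcal{G}$ then $g \vee h \in \mathcal{G}$.
Proof of (a). Let $A = \{g > h\}$, $B = \{g \le h\}$.
\[ \int_E g \vee h \, d\mu = \int_{E \cap A} g \, d\mu + \int_{E \cap B} h \, d\mu \le \nu(E \cap A) + \nu(E \cap B) = \nu(E) \]
Let $\kappa = \sup\{\int g \, d\mu : g \in \mathcal{G}\} \le \nu(\Omega) < \infty$. Pick $g_n$ so that $\int g_n \, d\mu > \kappa - 1/n$ and let $h_n = g_1 \vee \ldots \vee g_n$. By (a), $h_n \in \mathcal{G}$. As $n \uparrow \infty$, $h_n \uparrow h$. The definition of $\kappa$, the monotone convergence theorem, and the choice of $g_n$ imply that
\[ \kappa \ge \int h \, d\mu = \lim_{n \to \infty} \int h_n \, d\mu \ge \lim_{n \to \infty} \int g_n \, d\mu = \kappa \]
Let $\nu_r(E) = \int_E h \, d\mu$ and $\nu_s(E) = \nu(E) - \nu_r(E)$. The last detail is to show:
(b) $\nu_s$ is singular with respect to $\mu$.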
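Proof of (b). Let $\epsilon > 0$ and let $\Omega = A_\epsilon \cup B_\epsilon$ be a Hahn decomposition for $\nu_s - \epsilon\mu$. Using the definition of $\nu_r$ and then the fact that $A_\epsilon$ is positive for $\nu_s - \epsilon\mu$ (so $\epsilon\mu(A_\epsilon \cap E) \le \nu_s(A_\epsilon \cap E)$),
\[ \int_E (h + \epsilon 1_{A_\epsilon}) \, d\mu = \nu_r(E) + \epsilon\mu(A_\epsilon \cap E) \le \nu(E) \]
This holds for all $E$, so $k = h + \epsilon 1_{A_\epsilon} \in \mathcal{G}$. It follows that $\mu(A_\epsilon) = 0$, for if not, then $\int k \, d\mu > \kappa$, a contradiction. Letting $A = \cup_n A_{1/n}$, we have $\mu(A) = 0$. To see that $\nu_s(A^c) = 0$, observe that if $\nu_s(A^c) > 0$, then $(\nu_s - \epsilon\mu)(A^c) > 0$ for small $\epsilon$, a contradiction since $A^c \subset B_\epsilon$, a negative set.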
Exercise A.8.4. Prove that the Lebesgue decomposition is unique. Note that you
can suppose without loss of generality that µ and ν are ﬁnite.
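For measures on a finite set the Lebesgue decomposition is explicit: $\nu_s$ is the part of $\nu$ carried by the points where $\mu$ vanishes, and $\nu_r$ has density $\nu(\{x\})/\mu(\{x\})$ elsewhere. The Python sketch below illustrates this under assumed point masses; the dictionaries `mu` and `nu` and the helper names are hypothetical and not from the text.

```python
from fractions import Fraction

# Hypothetical point masses of two finite measures on {"a", "b", "c", "d"}.
mu = {"a": Fraction(1, 2), "b": Fraction(0), "c": Fraction(2), "d": Fraction(0)}
nu = {"a": Fraction(3), "b": Fraction(1), "c": Fraction(0), "d": Fraction(4)}

# Density of the regular part; its value on {mu = 0} is irrelevant.
g = {x: (nu[x] / mu[x] if mu[x] > 0 else Fraction(0)) for x in mu}

def measure(m, E):
    return sum((m[x] for x in E), Fraction(0))

def nu_r(E):
    """nu_r(E) = integral of g over E with respect to mu."""
    return sum((g[x] * mu[x] for x in E), Fraction(0))

def nu_s(E):
    """nu_s(E) = nu(E) - nu_r(E), carried by the mu-null points."""
    return measure(nu, E) - nu_r(E)

# A witnesses singularity: mu(A) = 0 and nu_s(A^c) = 0.
A = {x for x in mu if mu[x] == 0}
A_complement = set(mu) - A

E = {"a", "b", "d"}
print(nu_r(E), nu_s(E), measure(nu, E))
```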
We are finally ready for the main business of the section. We say a measure $\nu$ is absolutely continuous with respect to $\mu$ (and write $\nu \ll \mu$) if $\mu(A) = 0$ implies that $\nu(A) = 0$.
Exercise A.8.5. If $\mu_1 \ll \mu_2$ and $\mu_2 \perp \nu$ then $\mu_1 \perp \nu$.
Theorem A.8.6. Radon-Nikodym theorem. If $\mu$, $\nu$ are $\sigma$-finite measures and $\nu$ is absolutely continuous with respect to $\mu$, then there is a $g \ge 0$ so that $\nu(E) = \int_E g \, d\mu$. If $h$ is another such function then $g = h$ $\mu$ a.e.
Proof. Let $\nu = \nu_r + \nu_s$ be any Lebesgue decomposition. Let $A$ be chosen so that $\nu_s(A^c) = 0$ and $\mu(A) = 0$. Since $\nu \ll \mu$, $0 = \nu(A) \ge \nu_s(A)$ and $\nu_s \equiv 0$. To prove uniqueness, observe that if $\int_E g \, d\mu = \int_E h \, d\mu$ for all $E$, then letting $E \subset \{g > h, g \le n\}$ be any subset of finite measure, we conclude $\mu(g > h, g \le n) = 0$ for all $n$, so $\mu(g > h) = 0$, and, similarly, $\mu(g < h) = 0$.
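A minimal discrete illustration of Theorem A.8.6, assuming $\mu$ is the uniform distribution on the faces of a die and $\nu$ a loaded die (both choices, and the names below, are hypothetical): since $\mu$ charges every point, $\nu \ll \mu$ and the density is the ratio of point masses.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
mu = {x: Fraction(1, 6) for x in faces}                   # fair die
nu = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 12),
      4: Fraction(1, 4), 5: Fraction(1, 4), 6: Fraction(1, 4)}  # loaded die

# Radon-Nikodym derivative g = d(nu)/d(mu) on a finite set.
g = {x: nu[x] / mu[x] for x in faces}

def nu_of(E):
    return sum((nu[x] for x in E), Fraction(0))

def integral_g_dmu(E):
    """Integral of g over E with respect to mu."""
    return sum((g[x] * mu[x] for x in E), Fraction(0))

evens = {2, 4, 6}
print(nu_of(evens), integral_g_dmu(evens))
```

Here $g$ is $1/2$ on the faces 1, 2, 3 and $3/2$ on the faces 4, 5, 6, and both printed values equal $7/12$.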
Example A.8.3. Theorem A.8.6 may fail if µ is not σ ﬁnite. Let (Ω, F ) = (R, R),
µ = counting measure and ν = Lebesgue measure.
The function g whose existence is proved in Theorem A.8.6 is often denoted dν/dµ.
This notation suggests the following properties, whose proofs are left to the reader.
Exercise A.8.6. If $\nu_1, \nu_2 \ll \mu$ then $\nu_1 + \nu_2 \ll \mu$ and
\[ d(\nu_1 + \nu_2)/d\mu = d\nu_1/d\mu + d\nu_2/d\mu \]
Exercise A.8.7. If $\nu \ll \mu$ and $f \ge 0$ then
\[ \int f \, d\nu = \int f \, \frac{d\nu}{d\mu} \, d\mu \]
Exercise A.8.8. If $\pi \ll \nu \ll \mu$ then $d\pi/d\mu = (d\pi/d\nu) \cdot (d\nu/d\mu)$.
Exercise A.8.9. If $\nu \ll \mu$ and $\mu \ll \nu$ then $d\mu/d\nu = (d\nu/d\mu)^{-1}$.

A.9 Differentiating Under the Integral

At several places in the text, we need to differentiate inside a sum or
an integral. This section is devoted to results that can be used to justify those
computations.
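Theorem A.9.1. Let $(S, \mathcal{S}, \mu)$ be a measure space. Let $f$ be a complex valued function defined on $\mathbb{R} \times S$. Let $\delta > 0$, and suppose that for $x \in (y - \delta, y + \delta)$ we have
(i) $u(x) = \int_S f(x, s) \, \mu(ds)$ with $\int_S |f(x, s)| \, \mu(ds) < \infty$
(ii) for fixed $s$, $\partial f/\partial x(x, s)$ exists and is a continuous function of $x$,
(iii) $v(x) = \int_S \frac{\partial f}{\partial x}(x, s) \, \mu(ds)$ is continuous at $x = y$,
and (iv) $\int_S \int_{-\delta}^{\delta} \left| \frac{\partial f}{\partial x}(y + \theta, s) \right| d\theta \, \mu(ds) < \infty$
then $u'(y) = v(y)$.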
Proof. Letting $|h| \le \delta$ and using (i), (ii), (iv), and Fubini's theorem in the form given in Exercise A.6.4, we have
\[ u(y + h) - u(y) = \int_S f(y + h, s) - f(y, s) \, \mu(ds) = \int_S \int_0^h \frac{\partial f}{\partial x}(y + \theta, s) \, d\theta \, \mu(ds) = \int_0^h \int_S \frac{\partial f}{\partial x}(y + \theta, s) \, \mu(ds) \, d\theta \]
The last equation implies
\[ \frac{u(y + h) - u(y)}{h} = \frac{1}{h} \int_0^h v(y + \theta) \, d\theta \]
Since $v$ is continuous at $y$ by (iii), letting $h \to 0$ gives the desired result.
Example A.9.1. For a result in Section 2.3, we need to know that we can differentiate under the integral sign in
\[ u(x) = \int \cos(xs) e^{-s^2/2} \, ds \]
For convenience, we have dropped a factor $(2\pi)^{-1/2}$ and changed variables to match Theorem A.9.1. Clearly, (i) and (ii) hold. The dominated convergence theorem implies (iii)
\[ x \to \int -s \sin(sx) e^{-s^2/2} \, ds \]
is a continuous function. For (iv), we note
\[ \int \left| \frac{\partial f}{\partial x}(x, s) \right| ds \le \int |s| e^{-s^2/2} \, ds < \infty \]
and the bound does not depend on $x$, so (iv) holds.
For some examples the following form is more convenient:
Theorem A.9.2. Let $(S, \mathcal{S}, \mu)$ be a measure space. Let $f$ be a complex valued function defined on $\mathbb{R} \times S$. Let $\delta > 0$, and suppose that for $x \in (y - \delta, y + \delta)$ we have
(i) $u(x) = \int_S f(x, s) \, \mu(ds)$ with $\int_S |f(x, s)| \, \mu(ds) < \infty$
(ii) for fixed $s$, $\partial f/\partial x(x, s)$ exists and is a continuous function of $x$,
(iii') $\int_S \sup_{\theta \in [-\delta, \delta]} \left| \frac{\partial f}{\partial x}(y + \theta, s) \right| \mu(ds) < \infty$
then $u'(y) = v(y)$.
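Proof. In view of Theorem A.9.1 it is enough to show that (iii) and (iv) of that result hold. Since
\[ \int_{-\delta}^{\delta} \left| \frac{\partial f}{\partial x}(y + \theta, s) \right| d\theta \le 2\delta \sup_{\theta \in [-\delta, \delta]} \left| \frac{\partial f}{\partial x}(y + \theta, s) \right| \]
it is clear that (iv) holds. To check (iii), we note that
\[ |v(x) - v(y)| \le \int_S \left| \frac{\partial f}{\partial x}(x, s) - \frac{\partial f}{\partial x}(y, s) \right| \mu(ds) \]
(ii) implies that the integrand $\to 0$ as $x \to y$. The desired result follows from (iii') and the dominated convergence theorem.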
To indicate the usefulness of the new result, we prove:
Theorem A.9.3. If $\varphi(\theta) = Ee^{\theta Z} < \infty$ for $\theta \in [-\epsilon, \epsilon]$ then $\varphi'(0) = EZ$.
Proof. Here $\theta$ plays the role of $x$, and we take $\mu$ to be the distribution of $Z$. Let $\delta = \epsilon/2$. $f(x, s) = e^{xs} \ge 0$, so (i) holds by assumption. $\partial f/\partial x = s e^{xs}$ is clearly a continuous function, so (ii) holds. To check (iii'), we note that there is a constant $C$ so that if $x \in (-\delta, \delta)$, then $|s e^{xs}| \le C(e^{-\epsilon s} + e^{\epsilon s})$.
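Theorem A.9.3 is easy to illustrate for a random variable with finitely many values, where the moment generating function is a finite sum. The Python sketch below assumes a hypothetical three-point distribution for $Z$ and approximates $\varphi'(0)$ by a central difference; the values, probabilities, and step size are illustrative choices.

```python
import math

# Hypothetical distribution of Z: P(Z = values[i]) = probs[i].
values = [-1.0, 0.0, 2.0]
probs = [0.3, 0.5, 0.2]

def phi(theta):
    """Moment generating function phi(theta) = E exp(theta * Z)."""
    return sum(p * math.exp(theta * z) for z, p in zip(values, probs))

mean = sum(p * z for z, p in zip(values, probs))       # EZ

step = 1e-5
derivative_at_zero = (phi(step) - phi(-step)) / (2.0 * step)
print(derivative_at_zero, mean)
```

For a bounded $Z$ the hypothesis $\varphi(\theta) < \infty$ holds for every $\theta$, so this example sits well inside the scope of the theorem.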
Taking $S = \mathbb{Z}$ with $\mathcal{S}$ = all subsets of $S$ and $\mu$ = counting measure in Theorem A.9.1 and using Lemma ?? gives the following:
Theorem A.9.4. Let $\delta > 0$. Suppose that for $x \in (y - \delta, y + \delta)$ we have
(i) $u(x) = \sum_{n=1}^\infty f_n(x)$ with $\sum_{n=1}^\infty |f_n(x)| < \infty$
(ii) for each $n$, $f_n'(x)$ exists and is a continuous function of $x$,
and (iii) $\sum_{n=1}^\infty \sup_{\theta \in (-\delta, \delta)} |f_n'(y + \theta)| < \infty$
then $u'(x) = v(x)$.
Proof. We want to show that if $p \in (0, 1)$ then
\[ \left( \sum_{n=1}^\infty (1 - p)^n \right)' = - \sum_{n=1}^\infty n(1 - p)^{n-1} \]
Let $f_n(x) = (1 - x)^n$, $y = p$, and pick $\delta$ so that $[y - \delta, y + \delta] \subset (0, 1)$. Clearly (i) $\sum_{n=1}^\infty |(1 - x)^n| < \infty$ and (ii) $|f_n'(x)| = n(1 - x)^{n-1}$ is continuous for $x$ in $[y - \delta, y + \delta]$. To check (iii), we note that if we let $2\eta = y - \delta$ then there is a constant $C$ so that if $x \in [y - \delta, y + \delta]$ and $n \ge 1$, then
\[ n(1 - x)^{n-1} = \frac{n(1 - x)^{n-1}}{(1 - \eta)^{n-1}} \cdot (1 - \eta)^{n-1} \le C(1 - \eta)^{n-1} \]
...
This note was uploaded on 12/06/2009 for the course MATH 671 taught by Professor Dynkin during the Fall '08 term at Cornell University (Engineering School).