under which optimal stationary
policies and solutions to Bellman’s equation exist. However, we call a stationary policy
optimal if it is the optimal dynamic policy for one special ﬁnal reward vector. In the next
subsection, we will show that if an optimal stationary policy is unique and is an ergodic
unichain, then that policy is optimal except for a ﬁnal transient no matter what the ﬁnal
reward vector is.

4.6.5 Stationary policies with arbitrary final rewards

We start out this subsection with the main theorem, then build up some notation and
preliminary ideas for the proof, then prove a couple of lemmas, and ﬁnally prove the theorem.
Theorem 4.13. Assume that $k^0$ is a unique optimal stationary policy and is an ergodic unichain with the ergodic class $R = \{1, 2, \ldots, m\}$. Let $w^0$ and $g^0$ be the relative-gain vector and gain per stage for $k^0$. Then, for any final reward vector $u$, the following limit exists and is independent of $i$:

$$\lim_{n\to\infty} \bigl[ v_i^*(n, u) - n g^0 - w_i^0 \bigr] = \beta(u), \qquad (4.84)$$

where $\beta(u)$ satisfies

$$\beta(u) = \lim_{n\to\infty} \pi^0 \bigl[ v^*(n, u) - n g^0 e - w^0 \bigr] \qquad (4.85)$$

and $\pi^0$ is the steady-state vector for $k^0$.
Discussion: The theorem says that, asymptotically, the relative advantage of starting in
one state rather than another is independent of the ﬁnal gain vector, i.e., that for any states
$i$, $j$, $\lim_{n\to\infty} \bigl[ v_i^*(n, u) - v_j^*(n, u) \bigr]$ is independent of $u$. For the shortest-path problem, for example, this says that $v^*(n, u)$ converges to the shortest-path vector for any choice of $u$ for which $u_i = 0$. This means that if the arc lengths change, we can start the algorithm at the shortest paths for the previous arc lengths, and the algorithm is guaranteed to converge to the correct new shortest paths.
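As an illustration of this warm-start property, the sketch below (the four-node graph and arc lengths are invented for illustration, not taken from the text) runs the Bellman shortest-path iteration $v_i \leftarrow \min_j (c_{ij} + v_j)$, then changes one arc length and restarts the iteration from the previous shortest-path vector rather than from scratch:

```python
# Minimal sketch (hypothetical graph): Bellman iteration for shortest paths
# to a destination node, warm-started from the previous solution.
INF = float("inf")

def bellman(cost, v, dest, max_iters=100):
    """Iterate v_i = min_j (cost[i][j] + v_j) until the vector stops changing."""
    n = len(cost)
    for _ in range(max_iters):
        new_v = [0.0 if i == dest else
                 min(cost[i][j] + v[j] for j in range(n) if cost[i][j] < INF)
                 for i in range(n)]
        if new_v == v:
            return new_v
        v = new_v
    return v

# Arc lengths for a 4-node graph; node 3 is the destination (u_3 = 0).
cost = [[INF, 1.0, 4.0, INF],
        [INF, INF, 2.0, 6.0],
        [INF, INF, INF, 1.0],
        [INF, INF, INF, INF]]
v = bellman(cost, [0.0, 0.0, 0.0, 0.0], dest=3)  # cold start from u = 0
print(v)   # shortest path lengths to node 3

cost[1][3] = 2.5              # one arc length changes ...
v = bellman(cost, v, dest=3)  # ... so warm-start from the old solution
print(v)
```

The second call starts at the old shortest paths and, as the discussion above asserts, still converges to the correct new shortest paths.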
To see why the theorem can be false without the ergodic assumption, consider Example
4.5.4 where, even without any choice of decisions, (4.84) is false. Exercise 4.34 shows why
the theorem can be false without the uniqueness assumption.
It can also be shown (see Exercise 4.35) that for any Markov decision problem satisfying the
hypotheses of Theorem 4.13, there is some n0 such that the optimal dynamic policy uses
the optimal stationary policy for all stages n ≥ n0 . Thus, the dynamic part of the optimal
dynamic policy is strictly a transient.
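This transient behavior is easy to observe numerically. The sketch below is a hypothetical two-state, two-decision problem (the rewards and transition rows are invented): the final reward vector is skewed so that the first stages of the dynamic-programming recursion use one decision, after which the maximizing decisions settle down and never change again.

```python
# Sketch (hypothetical 2-state problem): the optimal dynamic policy
# coincides with a fixed stationary policy for all stages n >= n0.
# For each state, a list of (reward, transition-row) pairs, one per decision k.
decisions = [
    [(0.0, [0.9, 0.1]), (5.0, [0.1, 0.9])],  # state 0: two decisions
    [(1.0, [0.5, 0.5])],                      # state 1: a single decision
]

def backup(v):
    """One value-iteration stage: returns (new values, maximizing decisions)."""
    new_v, ks = [], []
    for choices in decisions:
        vals = [r + sum(p * vj for p, vj in zip(row, v)) for r, row in choices]
        k = max(range(len(vals)), key=vals.__getitem__)
        new_v.append(vals[k])
        ks.append(k)
    return new_v, ks

v = [40.0, 0.0]   # final reward vector skewed to make early stages differ
policies = []
for n in range(30):
    v, ks = backup(v)
    policies.append(ks)
# Early stages choose decision 0 in state 0; all later stages choose decision 1.
print(policies[:5], "->", policies[-1])
```

The printed history shows the dynamic part of the policy dying out after a few stages, as Exercise 4.35 asserts.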
The proof of the theorem is quite lengthy. Under the restricted condition that $k^0$ is an ergodic Markov chain, the proof is simpler and involves only Lemma 4.5.

We now develop some notation required for the proof of the theorem. Given a final reward
vector $u$, define $k_i(n)$ for each $i$ and $n$ as the $k$ that maximizes $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n, u)$. Then

$$v_i^*(n+1, u) = r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n, u). \qquad (4.86)$$

Similarly, since $k_i^0$ maximizes $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n, w^0)$,

$$v_i^*(n+1, w^0) = r_i^{(k_i^0)} + \sum_j P_{ij}^{(k_i^0)} v_j^*(n, w^0) \;\geq\; r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n, w^0). \qquad (4.87)$$

Subtracting (4.87) from (4.86), we get the following two inequalities:
X (k0 )
∗
∗
∗
∗
vi (n + 1, u ) − vi (n + 1, w 0 ) ≥
Pij i [vj (n, u ) − vj (n, w 0 )]. (4.88) j ∗
∗
vi (n + 1, u ) − vi (n + 1, w 0 ) ≤ X (k (n)) Pij i j ∗
∗
[vj (n, u ) − vj (n, w 0 )]. (4.89) Deﬁne
∗
∗
δi (n) = vi (n, u ) − vi (n, w 0 ). Then (4.88) and (4.89) become
δi (n + 1) ≥ δi (n + 1) ≤ X (k0 ) Pij i δj (n). (4.90) j X (k (n)) Pij i δj (n). (4.91) j ∗
0
Since vi (n, w 0 ) = ng 0 + wi for all i, n,
∗
0
δi (n) = vi (n, u ) − ng 0 − wi . Thus the theorem can be restated as asserting that limn→1 δi (n) = β (u ) for each state i.
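This restated limit can be observed numerically. The sketch below (a hypothetical two-state problem, invented for illustration) runs the optimal value-iteration recursion from two different final vectors; their per-state difference plays the role of $\delta_i(n)$, since the two runs differ only through the choice of final vector. The difference vector flattens to a common constant, and along the way its minimum never decreases and its maximum never increases, mirroring the inequalities derived below from (4.90) and (4.91).

```python
# Numeric sketch (hypothetical 2-state MDP): the per-state difference
# delta_i(n) between value iterations started from two final vectors
# converges to a constant independent of i; its min is nondecreasing
# and its max is nonincreasing along the way.
decisions = [
    [(0.0, [0.9, 0.1]), (5.0, [0.1, 0.9])],  # state 0: two decisions
    [(1.0, [0.5, 0.5])],                      # state 1: one decision
]

def backup(v):
    """One optimal value-iteration stage: v_i <- max_k [r + sum_j P_ij v_j]."""
    return [max(r + sum(p * vj for p, vj in zip(row, v)) for r, row in ch)
            for ch in decisions]

va, vb = [10.0, -3.0], [0.0, 0.0]   # two arbitrary final reward vectors
for n in range(60):
    delta = [a - b for a, b in zip(va, vb)]
    va, vb = backup(va), backup(vb)
    new_delta = [a - b for a, b in zip(va, vb)]
    assert min(new_delta) >= min(delta) - 1e-9   # min nondecreasing
    assert max(new_delta) <= max(delta) + 1e-9   # max nonincreasing
print(new_delta)   # the two components agree: delta_i(n) -> a common limit
```

The monotonicity assertions hold for any pair of final vectors by the same argument used for (4.90) and (4.91): each side's maximizing decision lower- or upper-bounds the difference by a convex combination of the previous differences.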
Define

$$\delta_{\max}(n) = \max_i \delta_i(n); \qquad \delta_{\min}(n) = \min_i \delta_i(n).$$

Then, from (4.90), $\delta_i(n+1) \geq \sum_j P_{ij}^{(k_i^0)} \delta_{\min}(n) = \delta_{\min}(n)$. Since this is true for each $i$,

$$\delta_{\min}(n+1) \;\geq\; \delta_{\min}(n). \qquad (4.92)$$

In the same way, from (4.91),

$$\delta_{\max}(n+1) \;\leq\; \delta_{\max}(n). \qquad (4.93)$$

The following lemma shows that (4.84) is valid for each of the recurrent states.

Lemma 4.5. Under the hypotheses of Theorem 4.13, the limiting expression for $\beta(u)$ in (4.85) exists and

$$\lim_{n\to\infty} \delta_i(n) = \beta(u) \qquad \text{for } 1 \leq i \leq m. \qquad (4.94)$$
Proof* of Lemma 4.5: Multiplying each side of (4.90) by $\pi_i^0$ and summing over $i$,

$$\pi^0 \delta(n+1) \;\geq\; \pi^0 [P^{k^0}] \delta(n) = \pi^0 \delta(n).$$

Thus $\pi^0 \delta(n)$ is nondecreasing in $n$. Also, from (4.93), $\pi^0 \delta(n) \leq \delta_{\max}(n) \leq \delta_{\max}(1)$. Since $\pi^0 \delta(n)$ is nondecreasing and bounded, it has a limit $\beta(u)$ as defined by (4.85), and

$$\pi^0 \delta(n) \;\leq\; \beta(u); \qquad \lim_{n\to\infty} \pi^0 \delta(n) = \beta(u). \qquad (4.95)$$

Next, iterating (4.90) $m$ times, we get
$$\delta(n+m) \;\geq\; [P^{k^0}]^m \delta(n).$$

Since the recurrent class of $k^0$ is ergodic, (4.28) shows that $\lim_{m\to\infty} [P^{k^0}]^m = e\,\pi^0$. Thus,

$$[P^{k^0}]^m = e\,\pi^0 + [\chi(m)],$$

where $[\chi(m)]$ is a sequence of matrices for which $\lim_{m\to\infty} [\chi(m)] = 0$. Thus,

$$\delta(n+m) \;\geq\; e\,\pi^0 \delta(n) + [\chi(m)]\,\delta(n).$$

For any $\epsilon > 0$, (4.95) shows that for all sufficient...
Spring '09, R. Srikant