Discrete-time stochastic processes

… under which optimal stationary policies and solutions to Bellman's equation exist. Recall, however, that we call a stationary policy optimal if it is the optimal dynamic policy for one special final reward vector. In the next subsection, we will show that if an optimal stationary policy is unique and is an ergodic unichain, then that policy is optimal except for a final transient no matter what the final reward vector is.

4.6.5 Stationary policies with arbitrary final rewards

We start this subsection with the main theorem, then build up some notation and preliminary ideas for the proof, then prove a couple of lemmas, and finally prove the theorem.

Theorem 4.13. Assume that $k^0$ is a unique optimal stationary policy and is an ergodic unichain with the ergodic class $R = \{1, 2, \ldots, m\}$. Let $w^0$ and $g^0$ be the relative-gain vector and gain per stage for $k^0$. Then, for any final reward vector $u$, the following limit exists and is independent of $i$:

$$\lim_{n\to\infty} \left[ v_i^*(n, u) - n g^0 - w_i^0 \right] = \beta(u), \tag{4.84}$$

where $\beta(u)$ satisfies

$$\beta(u) = \lim_{n\to\infty} \pi^0 \left[ v^*(n, u) - n g^0 e - w^0 \right] \tag{4.85}$$

and $\pi^0$ is the steady-state vector for $k^0$.

Discussion: The theorem says that, asymptotically, the relative advantage of starting in one state rather than another is independent of the final reward vector, i.e., that for any states $i, j$, $\lim_{n\to\infty} \left[ v_i^*(n, u) - v_j^*(n, u) \right]$ is independent of $u$. For the shortest-path problem, for example, this says that $v^*(n, u)$ converges to the shortest-path vector for any choice of $u$ for which $u_i = 0$. This means that if the arc lengths change, we can start the algorithm at the shortest paths for the previous arc lengths, and the algorithm is guaranteed to converge to the correct new shortest paths. To see why the theorem can be false without the ergodic assumption, consider Example 4.5.4 where, even without any choice of decisions, (4.84) is false. Exercise 4.34 shows why the theorem can be false without the uniqueness assumption.
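To make the theorem concrete, the following sketch runs value iteration on a small hypothetical two-state Markov decision problem (the rewards, transition probabilities, $\pi^0$, $g^0$, and $w^0$ below are invented for illustration, not taken from the text). It checks that $\delta_i(n) = v_i^*(n, u) - n g^0 - w_i^0$ converges to the same constant for every state $i$, for two different final reward vectors $u$:

```python
# Numerical illustration of Theorem 4.13 on a hypothetical 2-state MDP.
# In each state, decision 0 has the same transitions as decision 1 but a
# strictly larger reward, so the unique optimal stationary policy k0 always
# picks decision 0; the resulting chain P = [[.5,.5],[.5,.5]] is ergodic.

# rewards[i][k] and P[i][k][j]: reward and transitions for decision k in state i
rewards = [[1.0, 0.0], [2.0, 1.0]]
P = [
    [[0.5, 0.5], [0.5, 0.5]],   # state 0, decisions 0 and 1
    [[0.5, 0.5], [0.5, 0.5]],   # state 1, decisions 0 and 1
]

pi0 = [0.5, 0.5]     # steady-state vector of the optimal chain
g0 = 1.5             # gain per stage: pi0 . r = 0.5*1 + 0.5*2
w0 = [-0.5, 0.5]     # relative-gain vector, normalized so pi0 . w0 = 0

def bellman_step(v):
    """One stage of dynamic programming:
    v_i(n+1) = max_k [ r_i^(k) + sum_j P_ij^(k) v_j(n) ]."""
    return [max(rewards[i][k] + sum(P[i][k][j] * v[j] for j in range(2))
                for k in range(2))
            for i in range(2)]

def delta(u, n):
    """delta_i(n) = v_i*(n, u) - n*g0 - w0_i, which Theorem 4.13 says
    converges to beta(u), independent of the starting state i."""
    v = list(u)
    for _ in range(n):
        v = bellman_step(v)
    return [v[i] - n * g0 - w0[i] for i in range(2)]

for u in ([0.0, 0.0], [3.0, -1.0]):
    d = delta(u, 30)
    beta = pi0[0] * u[0] + pi0[1] * u[1]   # for this chain, beta(u) = pi0 . u
    print(u, [round(x, 6) for x in d], round(beta, 6))
```

For each $u$, both components of `delta` agree with each other and with $\pi^0 u$, matching (4.84) and (4.85); the limit depends on $u$, but the *difference* between starting states does not.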
It can also be shown (see Exercise 4.35) that for any Markov decision problem satisfying the hypotheses of Theorem 4.13, there is some $n_0$ such that the optimal dynamic policy uses the optimal stationary policy for all stages $n \ge n_0$. Thus, the dynamic part of the optimal dynamic policy is strictly a transient. The proof of the theorem is quite lengthy. Under the restricted condition that $k^0$ is an ergodic Markov chain, the proof is simpler and involves only Lemma 4.5.

We now develop some notation required for the proof of the theorem. Given a final reward vector $u$, define $k_i(n)$ for each $i$ and $n$ as the $k$ that maximizes $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n, u)$. Then

$$v_i^*(n+1, u) = r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n, u) \;\ge\; r_i^{(k_i^0)} + \sum_j P_{ij}^{(k_i^0)} v_j^*(n, u). \tag{4.86}$$

Similarly, since $k_i^0$ maximizes $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n, w^0)$,

$$v_i^*(n+1, w^0) = r_i^{(k_i^0)} + \sum_j P_{ij}^{(k_i^0)} v_j^*(n, w^0) \;\ge\; r_i^{(k_i(n))} + \sum_j P_{ij}^{(k_i(n))} v_j^*(n, w^0). \tag{4.87}$$

Subtracting (4.87) from (4.86), we get the following two inequalities:

$$v_i^*(n+1, u) - v_i^*(n+1, w^0) \;\ge\; \sum_j P_{ij}^{(k_i^0)} \left[ v_j^*(n, u) - v_j^*(n, w^0) \right], \tag{4.88}$$

$$v_i^*(n+1, u) - v_i^*(n+1, w^0) \;\le\; \sum_j P_{ij}^{(k_i(n))} \left[ v_j^*(n, u) - v_j^*(n, w^0) \right]. \tag{4.89}$$

Define $\delta_i(n) = v_i^*(n, u) - v_i^*(n, w^0)$. Then (4.88) and (4.89) become

$$\delta_i(n+1) \;\ge\; \sum_j P_{ij}^{(k_i^0)} \delta_j(n), \tag{4.90}$$

$$\delta_i(n+1) \;\le\; \sum_j P_{ij}^{(k_i(n))} \delta_j(n). \tag{4.91}$$

Since $v_i^*(n, w^0) = n g^0 + w_i^0$ for all $i$ and $n$,

$$\delta_i(n) = v_i^*(n, u) - n g^0 - w_i^0.$$

Thus the theorem can be restated as asserting that $\lim_{n\to\infty} \delta_i(n) = \beta(u)$ for each state $i$. Define

$$\delta_{\max}(n) = \max_i \delta_i(n); \qquad \delta_{\min}(n) = \min_i \delta_i(n).$$

Then, from (4.90), $\delta_i(n+1) \ge \sum_j P_{ij}^{(k_i^0)} \delta_{\min}(n) = \delta_{\min}(n)$. Since this is true for all $i$,

$$\delta_{\min}(n+1) \ge \delta_{\min}(n). \tag{4.92}$$

In the same way, from (4.91),

$$\delta_{\max}(n+1) \le \delta_{\max}(n). \tag{4.93}$$

The following lemma shows that (4.84) is valid for each of the recurrent states.
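The sandwich bounds (4.92) and (4.93) are easy to check numerically. The sketch below runs value iteration on a hypothetical two-state MDP (all numbers invented for illustration; in each state decision 0 strictly dominates, so the optimal stationary policy is an ergodic unichain) and verifies that $\delta_{\min}(n)$ never decreases while $\delta_{\max}(n)$ never increases:

```python
# Numerical check of (4.92)-(4.93): under value iteration, delta_min(n) is
# non-decreasing and delta_max(n) is non-increasing, so the two squeeze
# toward each other. Hypothetical 2-state MDP; decision 0 dominates in both
# states, giving the ergodic optimal chain P = [[.5,.5],[.5,.5]].

rewards = [[1.0, 0.0], [2.0, 1.0]]
P = [[[0.5, 0.5], [0.5, 0.5]],
     [[0.5, 0.5], [0.5, 0.5]]]
g0, w0 = 1.5, [-0.5, 0.5]    # gain and relative gain of the optimal policy

def deltas(u, nmax):
    """Yield (delta_min(n), delta_max(n)) for n = 0..nmax, where
    delta_i(n) = v_i*(n, u) - n*g0 - w0_i."""
    v = list(u)
    for n in range(nmax + 1):
        d = [v[i] - n * g0 - w0[i] for i in range(2)]
        yield min(d), max(d)
        # one Bellman stage: v_i(n+1) = max_k [r_i^(k) + sum_j P_ij^(k) v_j(n)]
        v = [max(rewards[i][k] + sum(P[i][k][j] * v[j] for j in range(2))
                 for k in range(2)) for i in range(2)]

seq = list(deltas([4.0, -2.0], 10))
mins = [lo for lo, hi in seq]
maxs = [hi for lo, hi in seq]
assert all(mins[n] <= mins[n + 1] + 1e-12 for n in range(10))  # (4.92)
assert all(maxs[n] >= maxs[n + 1] - 1e-12 for n in range(10))  # (4.93)
print(mins[0], mins[-1], maxs[0], maxs[-1])
```

Here the gap $\delta_{\max}(n) - \delta_{\min}(n)$ closes completely, consistent with the claim that $\delta_i(n)$ has a common limit $\beta(u)$ for all states.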
Lemma 4.5. Under the hypotheses of Theorem 4.13, the limiting expression for $\beta(u)$ in (4.85) exists, and

$$\lim_{n\to\infty} \delta_i(n) = \beta(u) \qquad \text{for } 1 \le i \le m. \tag{4.94}$$

Proof* of Lemma 4.5: Multiplying each side of (4.90) by $\pi_i^0$ and summing over $i$,

$$\pi^0 \delta(n+1) \;\ge\; \pi^0 [P^{k^0}] \delta(n) = \pi^0 \delta(n).$$

Thus $\pi^0 \delta(n)$ is non-decreasing in $n$. Also, from (4.93), $\pi^0 \delta(n) \le \delta_{\max}(n) \le \delta_{\max}(1)$. Since $\pi^0 \delta(n)$ is non-decreasing and bounded, it has a limit $\beta(u)$ as defined by (4.85), and

$$\pi^0 \delta(n) \le \beta(u); \qquad \lim_{n\to\infty} \pi^0 \delta(n) = \beta(u). \tag{4.95}$$

Next, iterating (4.90) $m$ times, we get

$$\delta(n+m) \;\ge\; [P^{k^0}]^m \delta(n).$$

Since the recurrent class of $k^0$ is ergodic, (4.28) shows that $\lim_{m\to\infty} [P^{k^0}]^m = e\,\pi^0$. Thus,

$$[P^{k^0}]^m = e\,\pi^0 + [\chi(m)],$$

where $[\chi(m)]$ is a sequence of matrices for which $\lim_{m\to\infty} [\chi(m)] = 0$. Hence,

$$\delta(n+m) \;\ge\; e\,\pi^0 \delta(n) + [\chi(m)] \delta(n).$$

For any $\epsilon > 0$, (4.95) shows that for all sufficient…
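The ergodicity fact the proof leans on, $[P^{k^0}]^m \to e\,\pi^0$ with residual $[\chi(m)] \to 0$, can be seen directly on a small example. The transition matrix below is a hypothetical ergodic two-state chain (the 0.9/0.1/0.2/0.8 entries and $\pi^0 = (2/3, 1/3)$ are invented for illustration):

```python
# Check of the ergodic-convergence fact cited as (4.28): for an ergodic
# chain, [P]^m -> e pi0, i.e. every row of P^m converges to the stationary
# vector pi0, so chi(m) = P^m - e pi0 vanishes. Hypothetical 2-state chain.

P = [[0.9, 0.1],
     [0.2, 0.8]]
pi0 = [2.0 / 3.0, 1.0 / 3.0]   # stationary vector: pi0 P = pi0

def mat_mul(A, B):
    """2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(A, m):
    """m-th power of a 2x2 matrix by repeated multiplication."""
    R = [[1.0, 0.0], [0.0, 1.0]]       # identity
    for _ in range(m):
        R = mat_mul(R, A)
    return R

Pm = mat_pow(P, 100)
# largest entry of chi(100) = P^100 - e pi0
chi = max(abs(Pm[i][j] - pi0[j]) for i in range(2) for j in range(2))
print(chi)   # essentially zero: both rows of P^100 have collapsed onto pi0
```

The residual decays geometrically (here at the rate of the second eigenvalue of $P$), which is what lets the proof treat $[\chi(m)]\delta(n)$ as negligible for large $m$.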