Discrete-time stochastic processes

Thus [P] can be partitioned as

    [P] = [ [P_r]    0
            [P_tr]  [P_tt] ],

where [P_r] holds the transitions within the recurrent states and [P_tr], [P_tt] hold the transitions from the transient states into the recurrent and transient states respectively.
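As a concrete illustration of this block structure (a minimal numeric sketch; the 4-state chain below is invented for illustration, not taken from the text), ordering the states with the recurrent class first makes the upper-right block zero and leaves a substochastic block [P_tt] for the transient states:

```python
import numpy as np

# States ordered with the recurrent class {0, 1} first and the transient
# states {2, 3} last, so the transition matrix shows the block form above.
P = np.array([
    [0.7, 0.3, 0.0, 0.0],   # recurrent block [P_r]: rows 0-1, cols 0-1
    [0.4, 0.6, 0.0, 0.0],   # upper-right block is 0: no exits from the recurrent class
    [0.2, 0.1, 0.5, 0.2],   # [P_tr] (cols 0-1) and [P_tt] (cols 2-3)
    [0.0, 0.3, 0.3, 0.4],
])

P_r  = P[:2, :2]    # transitions within the recurrent class
P_tr = P[2:, :2]    # transient -> recurrent
P_tt = P[2:, 2:]    # transient -> transient

# Because [P_tt] is substochastic, its powers decay to 0: the chain leaves
# the transient states with probability 1.
print(np.linalg.matrix_power(P_tt, 50))
```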



so again g' < π r^k. Since k is a unichain with the recurrent class R, we have g' < g again. For the third possibility in Lemma 4.3, i is transient and R' = R. Thus π_i = 0, so π' = π, and g' = g. Thus, to complete the proof, we must demonstrate the validity of (4.71) for this case.

We first show that, for each n ≥ 1,

    v^k(n, w') − n g' e  ≤  v^k(n+1, w') − (n+1) g' e.                          (4.74)

For n = 1,

    v^k(1, w') = r^k + [P^k] w'.                                                (4.75)

Using this, (4.70) can be rewritten as

    w' ≤ v^k(1, w') − g' e,   with w' ≠ v^k(1, w') − g' e.                      (4.76)

Using (4.75) and then (4.76),

    v^k(1, w') − g' e = r^k + [P^k] w' − g' e
                      ≤ r^k + [P^k]{v^k(1, w') − g' e} − g' e
                      = r^k + [P^k] v^k(1, w') − 2 g' e = v^k(2, w') − 2 g' e.  (4.77)

We now use induction on n, using n = 1 as the basis, to demonstrate (4.74) in general. For any n > 1, assume (4.74) for n − 1 as the inductive hypothesis. Then

    v^k(n, w') − n g' e = r^k + [P^k] v^k(n−1, w') − n g' e
                        = r^k + [P^k]{v^k(n−1, w') − (n−1) g' e} − g' e
                        ≤ r^k + [P^k]{v^k(n, w') − n g' e} − g' e
                        = v^k(n+1, w') − (n+1) g' e.

This completes the induction, verifying (4.74) and showing that v^k(n, w') − n g' e is nondecreasing in n. Since k is a unichain, Lemma 4.1 asserts that k has a shifted relative gain vector w, i.e., a solution to (4.42). From (4.46),

    v^k(n, w') = w + n g' e + [P^k]^n {w' − w}.                                 (4.78)

Since [P^k]^n is a stochastic matrix, its elements are each between 0 and 1, so the sequence of vectors v^k(n, w') − n g' e must be bounded independent of n. Since this sequence is also nondecreasing, it must have a limit, say w̃:

    lim_{n→∞} v^k(n, w') − n g' e = w̃.                                         (4.79)

We next show that w̃ satisfies (4.42) for k:

    w̃ = lim_{n→∞} {v^k(n+1, w') − (n+1) g' e}
       = lim_{n→∞} {r^k + [P^k] v^k(n, w') − (n+1) g' e}                        (4.80)
       = r^k − g' e + [P^k] lim_{n→∞} {v^k(n, w') − n g' e}
       = r^k − g' e + [P^k] w̃.                                                  (4.81)

Thus w̃ is a shifted relative gain vector for k. Finally we must show that w̃ satisfies the conditions on w in (4.71). Using (4.76) and iterating with (4.74),

    w' ≤ v^k(n, w') − n g' e ≤ w̃   for all n ≥ 1,  with w' ≠ v^k(n, w') − n g' e.   (4.82)

Premultiplying each term in (4.82) by the steady-state probability vector π for k,

    π w' ≤ π v^k(n, w') − n g' ≤ π w̃.                                           (4.83)

Now, k is the same as k' over the recurrent class, and π = π' since π is non-zero only over the recurrent class. This means that the first inequality above is actually an equality. Also, going to the limit, we see that π w' = π w̃. Since π_i ≥ 0 and w'_i ≤ w̃_i, this implies that w'_i = w̃_i for all recurrent i, completing the proof.

We now see that each iteration of the algorithm either increases the gain per stage or holds the gain constant and increases the shifted relative gain vector w. Thus the sequence of policies found by the algorithm can never repeat. Since there are a finite number of stationary policies, the algorithm must eventually terminate at step 3. Thus we have proved the following important theorem.

Theorem 4.11. For any inherently recurrent Markov decision problem, there is a solution to Bellman's equation and a maximizing stationary policy that is a unichain.

There are also many interesting Markov decision problems, such as shortest-path problems, that contain not only an inherently recurrent class but also some inherently transient states. The following theorem then applies.

Theorem 4.12.
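The convergence argument in (4.78)–(4.81) can be checked numerically. The sketch below (the chain, rewards, and starting vector w' are invented for illustration, not taken from the text) iterates v(n) = r^k + [P^k] v(n−1) for a small unichain and verifies that v(n) − n g' e approaches a fixed point of w = r^k − g' e + [P^k] w:

```python
import numpy as np

# Minimal numeric sketch: iterate v(n) = r + P v(n-1) with v(0) = w0 and watch
# v(n) - n*g*e converge to a vector w_tilde satisfying w_tilde = r - g*e + P w_tilde,
# i.e. the limit in (4.79) solves the shifted relative-gain relation (4.81).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.5, 0.5]])
r = np.array([1.0, 3.0, 0.0])
w0 = np.zeros(3)

# Steady-state vector pi (left eigenvector of P for eigenvalue 1) and gain g = pi r.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()
g = pi @ r

v = w0.copy()
for n in range(1, 201):
    v = r + P @ v                 # v(n) = r + P v(n-1)
w_tilde = v - 200 * g             # v(n) - n*g*e at n = 200

# Residual of the fixed-point relation w_tilde = r - g*e + P w_tilde (should be ~0).
print(np.max(np.abs(w_tilde - (r - g + P @ w_tilde))))
```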
Consider a Markov decision problem with a single inherently recurrent class of states and one or more inherently transient states. Let g* be the maximum gain per stage over all recurrent classes of all stationary policies, and assume that each recurrent class with gain per stage equal to g* is contained in the inherently recurrent class. Then there is a solution to Bellman's equation and a maximizing stationary policy that is a unichain.

Proof*: Let k̃ be a stationary policy which has a recurrent class, R, with gain per stage g*. Let j be any state in R. Since j is inherently recurrent, there is a decision vector k' with k'_i = k̃_i for all i ∈ R under which j is accessible from all other states; that is, the decisions k'_i for i ∉ R are chosen so that j is accessible from every state. Then k' is a unichain policy with gain per stage g*. Suppose the policy improvement algorithm is started with this unichain policy. If the algorithm stops at step 3, then k' satisfies Bellman's equation and we are done. Otherwise, from Lemma 4.4, the unichain policy in step 6 of the algorithm either has a larger gain per stage (which is impossible) or has the same recurrent class R and has a relative gain vector w satisfying (4.74). Iterating the algorithm, we find successively larger relative gain vectors. Since the policies cannot repeat, the algorithm must eventually stop with a solution to Bellman's equation.

The above theorems give us a good idea of the situations...
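For intuition about how the policy improvement algorithm terminates at a solution of Bellman's equation, here is a compact sketch of policy iteration in the average-gain setting. The 3-state, 2-action MDP, the evaluate helper, and the tie-breaking are invented for illustration; the sketch does not reproduce the text's numbered steps (steps 3 and 6) literally.

```python
import numpy as np

# P[a] is the transition matrix under decision a; r[a] the reward vector.
P = {0: np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]]),
     1: np.array([[0.9, 0.0, 0.1],
                  [0.4, 0.5, 0.1],
                  [0.3, 0.0, 0.7]])}
r = {0: np.array([1.0, 2.0, 0.0]),
     1: np.array([0.5, 3.0, 1.0])}
n = 3

def evaluate(policy):
    """Solve w + g e = r^k + [P^k] w for a unichain policy, normalizing w[0] = 0."""
    Pk = np.array([P[policy[i]][i] for i in range(n)])
    rk = np.array([r[policy[i]][i] for i in range(n)])
    A = np.zeros((n, n))
    A[:, 0] = 1.0                                 # coefficient of g
    A[:, 1:] = np.eye(n)[:, 1:] - Pk[:, 1:]       # coefficients of w[1:]
    x = np.linalg.solve(A, rk)
    return x[0], np.concatenate(([0.0], x[1:]))   # g, w

policy = np.zeros(n, dtype=int)                   # arbitrary starting decision vector
while True:
    g, w = evaluate(policy)
    # Improvement step: maximize r_i(a) + sum_j P_ij(a) w_j in each state i.
    new_policy = np.array([max(P, key=lambda a: r[a][i] + P[a][i] @ w)
                           for i in range(n)])
    if np.array_equal(new_policy, policy):
        break                                     # Bellman's equation is satisfied
    policy = new_policy

print("optimal policy:", policy, "gain per stage:", g)
```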

