Discrete-time stochastic processes

# 4.6 Markov decision theory and dynamic programming



…unichain stationary policies, until eventually one of them satisfies Bellman's equation and is thus optimal.

Lemma 4.2. Let $k = (k_1, \ldots, k_M)$ be an arbitrary stationary policy in an inherently recurrent Markov decision problem. Let $R$ be a recurrent class of states in $k$. Then a unichain stationary policy $\tilde{k} = (\tilde{k}_1, \ldots, \tilde{k}_M)$ exists with the recurrent class $R$ and with $\tilde{k}_j = k_j$ for $j \in R$.

Proof: Let $j$ be any state in $R$. By the inherently recurrent assumption, there is a decision vector, say $k'$, under which $j$ is accessible from all other states (see Exercise 4.38). Choosing $\tilde{k}_i = k_i$ for $i \in R$ and $\tilde{k}_i = k'_i$ for $i \notin R$ completes the proof.

Now that we are assured that unichain stationary policies exist and can be found, we can state the policy improvement algorithm for inherently recurrent Markov decision problems. This algorithm is a generalization of Howard's policy iteration algorithm [How60].

Policy Improvement Algorithm

1. Choose an arbitrary unichain policy $k'$.
2. For policy $k'$, calculate $w'$ and $g'$ from $w' + g'e = r^{k'} + [P^{k'}]w'$.
3. If $w' + g'e = \max_k \{ r^k + [P^k]w' \}$, then stop; $k'$ is optimal.
4. Otherwise, choose $i$ and $k_i$ so that $w'_i + g' < r_i^{(k_i)} + \sum_j P_{ij}^{(k_i)} w'_j$. For $j \neq i$, let $k_j = k'_j$.
5. If the policy $k = (k_1, \ldots, k_M)$ is not a unichain, then let $R$ be the recurrent class in policy $k$ that contains state $i$, and let $\tilde{k}$ be the unichain policy of Lemma 4.2. Update $k$ to the value of $\tilde{k}$.
6. Update $k'$ to the value of $k$ and return to step 2.

If the stopping test in step 3 fails, then there is some $i$ for which $w'_i + g' < \max_{k_i} \{ r_i^{(k_i)} + \sum_j P_{ij}^{(k_i)} w'_j \}$, so step 4 can always be executed if the algorithm does not stop in step 3. The resulting policy $k$ then satisfies

$$w' + g'e \;\lneq\; r^k + [P^k]w', \tag{4.70}$$

where $\lneq$ means that the inequality holds componentwise and is strict for at least one component (namely $i$) of the vectors.
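The six steps above can be sketched in code. The following is a minimal Python sketch for a toy average-reward MDP; the transition probabilities `P[i][k][j]` and rewards `r[i][k]` are invented purely for illustration. Every policy in this toy instance has all-positive transition rows and so is a unichain, which means step 5 never triggers here.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda rr: abs(M[rr][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for rr in range(n):
            if rr != c and M[rr][c] != 0.0:
                f = M[rr][c] / M[c][c]
                for col in range(c, n + 1):
                    M[rr][col] -= f * M[c][col]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate(policy, P, r):
    """Step 2: solve w + g e = r^k + [P^k] w, normalized so that w[0] = 0.
    The unknowns are (g, w[1], ..., w[M-1])."""
    M_states = len(policy)
    A = [[0.0] * M_states for _ in range(M_states)]
    b = [0.0] * M_states
    for i in range(M_states):
        k = policy[i]
        A[i][0] = 1.0                                # coefficient of g
        for j in range(1, M_states):                 # the w[0] term drops out
            A[i][j] = (1.0 if j == i else 0.0) - P[i][k][j]
        b[i] = r[i][k]
    x = solve(A, b)
    return x[0], [0.0] + x[1:]                       # g, w

def improve(policy, P, r, g, w):
    """Steps 3-4: return a policy improved in one component, or None if
    Bellman's equation already holds (current policy optimal)."""
    for i in range(len(policy)):
        for k in range(len(r[i])):
            q = r[i][k] + sum(P[i][k][j] * w[j] for j in range(len(w)))
            if q > w[i] + g + 1e-9:                  # strict improvement
                new = policy[:]
                new[i] = k
                return new
    return None

# Toy 2-state, 2-decision MDP (all numbers invented for illustration).
# P[i][k][j]: probability of next state j given state i, decision k.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.5, 0.5], [0.2, 0.8]]]
r = [[1.0, 2.0],
     [0.0, 3.0]]

policy = [0, 0]                                      # step 1
while True:
    g, w = evaluate(policy, P, r)                    # step 2
    better = improve(policy, P, r, g, w)             # steps 3-4
    if better is None:
        break                                        # step 3: optimal
    policy = better                                  # step 6 (step 5 not needed here)
```

On this toy instance the loop terminates at policy `[1, 1]` with gain per stage $g = 31/11 \approx 2.82$; termination is guaranteed because there are finitely many stationary policies and, by Lemma 4.4, each iteration improves the previous one.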
Note that at the end of step 4, $[P^k]$ differs from $[P^{k'}]$ only in the transitions out of state $i$. Thus the set of states from which $i$ is accessible is the same in $k'$ as in $k$. If $i$ is recurrent in the unichain $k'$, then it is accessible from all states in $k'$ and thus also accessible from all states in $k$. It follows that $i$ is also recurrent in $k$ and that $k$ is a unichain (see Exercise 4.2). On the other hand, if $i$ is transient in $k'$, and if $R'$ is the recurrent class of $k'$, then $R'$ must also be a recurrent class of $k$, since the transitions from states in $R'$ are unchanged. There are then two possibilities when $i$ is transient in $k'$. First, if the changes in $P_{ij}^{(k_i)}$ eliminate all the paths from $i$ to $R'$, then a new recurrent class $R$ will be formed with $i$ as a member. This is the case in which step 5 is used to change $k$ back to a unichain. Alternatively, if a path still exists to $R'$, then $i$ is transient in $k$, and $k$ is a unichain with the same recurrent class $R'$ as $k'$. These results are summarized in the following lemma:

Lemma 4.3. There are only three possibilities for $k$ at the end of step 4 of the policy improvement algorithm for inherently recurrent Markov decision problems. First, $k$ is a unichain and $i$ is recurrent in both $k'$ and $k$. Second, $k$ is not a unichain, and $i$ is transient in $k'$ and recurrent in $k$. Third, $k$ is a unichain with the same recurrent class as $k'$, and $i$ is transient in both $k'$ and $k$.

The following lemma now asserts that the new policy on returning to step 2 of the algorithm is an improvement over the previous policy $k'$.

Lemma 4.4. Let $k'$ be the unichain policy of step 2 in an iteration of the policy improvement algorithm for an inherently recurrent Markov decision problem. Let $g'$, $w'$, and $R'$ be the gain per stage, relative-gain vector, and recurrent class, respectively, of $k'$. Assume the algorithm does not stop at step 3, and let $k$ be the unichain policy of step 6.
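The case analysis behind Lemma 4.3 turns only on which states are accessible from which under a policy's fixed transition matrix, so recurrent classes can be computed by graph reachability alone, ignoring the exact probabilities. The sketch below (the matrices are made up for illustration) classifies a state $i$ as recurrent iff every state reachable from $i$ can reach $i$ back, and calls a policy a unichain iff it has exactly one recurrent class.

```python
def reachable(P, i):
    """States reachable from i (including i itself) along
    positive-probability transitions of the matrix P."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def recurrent_classes(P):
    """Return the set of recurrent classes of the chain with matrix P.
    State i is recurrent iff every state reachable from i can reach i back;
    its class is then its communicating class."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes = set()
    for i in range(n):
        if all(i in reach[j] for j in reach[i]):
            classes.add(frozenset(j for j in reach[i] if i in reach[j]))
    return classes

def is_unichain(P):
    """A policy is a unichain iff it has exactly one recurrent class."""
    return len(recurrent_classes(P)) == 1

# Illustration: states 0, 1 form one recurrent class; state 2 is transient.
P_uni = [[0.5, 0.5, 0.0],
         [0.5, 0.5, 0.0],
         [0.3, 0.3, 0.4]]

# Two absorbing states: two recurrent classes, so not a unichain.
P_two = [[1.0, 0.0],
         [0.0, 1.0]]
```

In the policy improvement algorithm this check would run at step 5: if the updated policy is not a unichain, the recurrent class containing state $i$ is the class $R$ handed to Lemma 4.2.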
Then either the gain per stage $g$ of $k$ satisfies $g > g'$, or else the recurrent class of $k$ is $R'$, the gain per stage satisfies $g = g'$, and there is a shifted relative-gain vector $w$ of $k$ satisfying

$$w' \;\lneq\; w \qquad \text{and} \qquad w_j = w'_j \ \text{for each } j \in R'. \tag{4.71}$$

Proof*: The policy $k$ of step 4 satisfies (4.70) with strict inequality for the component $i$ in which $k'$ and $k$ differ. Let $R$ be any recurrent class of $k$ and let $\pi$ be the steady-state probability vector for $R$. Premultiplying both sides of (4.70) by $\pi$, we get

$$\pi w' + g' \;\leq\; \pi r^k + \pi [P^k] w'. \tag{4.72}$$

Recognizing that $\pi [P^k] = \pi$ and cancelling terms, this shows that $g' \leq \pi r^k$. Now (4.70) is satisfied with strict inequality for component $i$, and thus, if $\pi_i > 0$, (4.72) is satisfied with strict inequality. Thus,

$$g' \leq \pi r^k, \quad \text{with equality iff } \pi_i = 0. \tag{4.73}$$

For the first possibility of Lemma 4.3, $k$ is a unichain and $i \in R$. Thus $g' < \pi r^k = g$. Similarly, for the second possibility in Lemma 4.3, $i \in R$ for the new recurrent class that is formed in $k$, …
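The algebraic move that carries the proof from (4.72) to (4.73), namely $\pi [P^k] = \pi$, is easy to check numerically. Below is a small Python sketch (the chain and rewards are invented for illustration) that estimates the steady-state vector $\pi$ by power iteration, verifies that premultiplying $[P]w'$ by $\pi$ gives the same value as $\pi w'$ (so the $w'$ terms cancel), and computes the gain per stage $g = \pi r$.

```python
def steady_state(P, iters=2000):
    """Estimate pi with pi = pi [P] by power iteration.
    Converges for an aperiodic unichain such as the example below."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# A 2-state chain and per-state rewards, invented for illustration.
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [2.0, 3.0]

pi = steady_state(P)          # here pi = (2/3, 1/3) analytically
g = dot(pi, r)                # gain per stage, g = pi r = 7/3

# The cancellation between (4.72) and (4.73): pi [P] w = pi w for any w,
# because pi [P] = pi.
w = [0.0, 5.0]
Pw = [dot(P[i], w) for i in range(len(P))]
```

Since $\pi [P] w = \pi w$ for every vector $w$, subtracting $\pi w'$ from both sides of (4.72) leaves exactly $g' \leq \pi r^k$, which is the content of (4.73).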

## This note was uploaded on 09/27/2010 for the course EE 229, taught by Professor R. Srikant during the Spring '09 term at the University of Illinois, Urbana-Champaign.
