... unichain stationary policies, until eventually one of them satisfies Bellman's equation and is thus optimal.
Lemma 4.2. Let $k = (k_1, \ldots, k_M)$ be an arbitrary stationary policy in an inherently recurrent Markov decision problem. Let $R$ be a recurrent class of states in $k$. Then a unichain stationary policy $\tilde{k} = (\tilde{k}_1, \ldots, \tilde{k}_M)$ exists with the recurrent class $R$ and with $\tilde{k}_j = k_j$ for $j \in R$.
Proof: Let $j$ be any state in $R$. By the inherently recurrent assumption, there is a decision vector, say $k'$, under which $j$ is accessible from all other states (see Exercise 4.38). Choosing $\tilde{k}_i = k'_i$ for $i \notin R$ and $\tilde{k}_i = k_i$ for $i \in R$ completes the proof.
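The construction in this proof is mechanical, and a small illustrative sketch (not from the text) may help fix ideas. It assumes policies are represented as lists of decision indices and $R$ as a set of state indices; the function name `unichain_policy_from` is hypothetical.

```python
def unichain_policy_from(k, k_prime, R):
    """Construction in the proof of Lemma 4.2 (illustrative sketch only).

    k       : the given stationary policy, a list of decision indices
    k_prime : a decision vector under which a state of R is accessible
              from all other states (exists by inherent recurrence)
    R       : the chosen recurrent class of k, a set of state indices

    Keep k's decisions on R and use k_prime's decisions off R; the result
    is a unichain policy whose recurrent class is R.
    """
    return [k[i] if i in R else k_prime[i] for i in range(len(k))]
```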
Now that we are assured that unichain stationary policies exist and can be found, we can
state the policy improvement algorithm for inherently recurrent Markov decision problems.
This algorithm is a generalization of Howard's policy iteration algorithm [How60].
Policy Improvement Algorithm
1. Choose an arbitrary unichain policy $k'$.

2. For policy $k'$, calculate $w'$ and $g'$ from $w' + g'e = r^{k'} + [P^{k'}]w'$.

3. If $w' + g'e = \max_k \{ r^k + [P^k]w' \}$, then stop; $k'$ is optimal.

4. Otherwise, choose $i$ and $k_i$ so that $w'_i + g' < r_i^{(k_i)} + \sum_j P_{ij}^{(k_i)} w'_j$. For $j \neq i$, let $k_j = k'_j$.

5. If the policy $k = (k_1, \ldots, k_M)$ is not a unichain, then let $R$ be the recurrent class in policy $k$ that contains state $i$, and let $\tilde{k}$ be the unichain policy of Lemma 4.2. Update $k$ to the value of $\tilde{k}$.

6. Update $k'$ to the value of $k$ and return to step 2.
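The following is a minimal computational sketch of these steps for a finite-state, finite-decision model. The names `P_all[s][d]` (transition row for decision $d$ in state $s$), `r_all[s][d]` (the corresponding reward), and both function names are hypothetical; the normalization $w_1 = 0$ (index 0 below) is an arbitrary choice used to pin down the relative-gain vector, a small tolerance replaces exact comparison, and the unichain repair of step 5 is omitted (a sketch of the recurrent-class computation it needs appears after Lemma 4.3).

```python
import numpy as np

def evaluate_policy(P, r):
    """Step 2: solve w + g*e = r + P w for a unichain policy.

    w is determined only up to an added constant, so the normalization
    w[0] = 0 is imposed to make the linear system nonsingular.
    """
    M = len(r)
    A = np.zeros((M + 1, M + 1))
    b = np.zeros(M + 1)
    A[:M, :M] = np.eye(M) - P   # w_i - sum_j P_ij w_j ...
    A[:M, M] = 1.0              # ... + g = r_i
    b[:M] = r
    A[M, 0] = 1.0               # normalization w[0] = 0
    x = np.linalg.solve(A, b)
    return x[:M], x[M]          # w, g

def policy_improvement(P_all, r_all, k, max_iter=10_000, tol=1e-12):
    """Sketch of the policy improvement algorithm (step 5 omitted).

    P_all[s][d] : transition row for decision d in state s
    r_all[s][d] : reward for decision d in state s
    k           : initial unichain policy, a list of decision indices
    """
    M = len(k)
    for _ in range(max_iter):
        P = np.array([P_all[s][k[s]] for s in range(M)])
        r = np.array([r_all[s][k[s]] for s in range(M)])
        w, g = evaluate_policy(P, r)                      # step 2
        improved = False
        for i in range(M):                                # steps 3 and 4
            for d in range(len(r_all[i])):
                if r_all[i][d] + np.dot(P_all[i][d], w) > w[i] + g + tol:
                    k = list(k)
                    k[i] = d                              # change state i only
                    improved = True
                    break
            if improved:
                break
        if not improved:                                  # stopping test of step 3
            return k, g, w
    raise RuntimeError("no convergence within max_iter iterations")
```

Changing the decision of only one state per pass follows the statement of step 4; Howard's original algorithm typically re-optimizes every state's decision in each pass, which tends to need fewer iterations but uses the same evaluation step.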
If the stopping test in step 3 fails, then there is some $i$ for which $w'_i + g' < \max_{k_i}\{ r_i^{(k_i)} + \sum_j P_{ij}^{(k_i)} w'_j \}$, so step 4 can always be executed if the algorithm does not stop in step 3. The resulting policy $k$ then satisfies

$$w' + g'e \;\underset{\neq}{\leq}\; r^k + [P^k]w', \qquad (4.70)$$

where $\underset{\neq}{\leq}$ means that the inequality is strict for at least one component (namely $i$) of the vectors.
Note that at the end of step 4, $[P^k]$ differs from $[P^{k'}]$ only in the transitions out of state $i$. Thus the set of states from which $i$ is accessible is the same in $k'$ as in $k$. If $i$ is recurrent in the unichain $k'$, then it is accessible from all states in $k'$ and thus also accessible from all states in $k$. It follows that $i$ is also recurrent in $k$ and that $k$ is a unichain (see Exercise 4.2). On the other hand, if $i$ is transient in $k'$, and if $R'$ is the recurrent class of $k'$, then $R'$ must also be a recurrent class of $k$, since the transitions from states in $R'$ are unchanged. There are then two possibilities when $i$ is transient in $k'$. First, if the changes in $P_{ij}^{(k_i)}$ eliminate all the paths from $i$ to $R'$, then a new recurrent class $R$ will be formed with $i$ a member. This is the case in which step 5 is used to change $k$ back to a unichain. Alternatively, if a path still exists to $R'$, then $i$ is transient in $k$ and $k$ is a unichain with the same recurrent class $R'$ as $k'$. These results are summarized in the following lemma:
Lemma 4.3. There are only three possibilities for $k$ at the end of step 4 of the policy improvement algorithm for inherently recurrent Markov decision problems. First, $k$ is a unichain and $i$ is recurrent in both $k'$ and $k$. Second, $k$ is not a unichain and $i$ is transient in $k'$ and recurrent in $k$. Third, $k$ is a unichain with the same recurrent class as $k'$ and $i$ is transient in both $k'$ and $k$.
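The unichain check required in step 5, and the case analysis of Lemma 4.3, both come down to finding the recurrent classes of the chain $[P^k]$. One elementary way to compute them (an illustrative sketch, not from the text, assuming a dense transition matrix; the function names are hypothetical) uses the fact that a state of a finite chain is recurrent iff every state reachable from it can reach it back:

```python
import numpy as np

def reachable_from(P, s):
    """All states reachable from s (including s) via positive-probability steps."""
    M = P.shape[0]
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in range(M):
            if P[u, v] > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def recurrent_classes(P):
    """Closed communicating classes of a finite chain with transition matrix P."""
    M = P.shape[0]
    reach = [reachable_from(P, s) for s in range(M)]
    # s is recurrent iff every state it reaches can reach s back
    recurrent = [s for s in range(M) if all(s in reach[t] for t in reach[s])]
    classes = []
    for s in recurrent:
        if not any(s in c for c in classes):
            classes.append({t for t in recurrent if t in reach[s]})
    return classes

def is_unichain(P):
    """A stationary policy is a unichain iff its chain has one recurrent class."""
    return len(recurrent_classes(P)) == 1
```

In step 5 one would apply `recurrent_classes` to the chain of the improved policy, pick the class containing state $i$, and then use a construction like the Lemma 4.2 sketch above to restore the unichain property.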
The following lemma now asserts that the new policy on returning to step 2 of the algorithm is an improvement over the previous policy $k'$.

Lemma 4.4. Let $k'$ be the unichain policy of step 2 in an iteration of the policy improvement algorithm for an inherently recurrent Markov decision problem. Let $g'$, $w'$, $R'$ be the gain per stage, relative gain vector, and recurrent class, respectively, of $k'$. Assume the algorithm doesn't stop at step 3 and let $k$ be the unichain policy of step 6. Then either the gain per stage $g$ of $k$ satisfies $g > g'$, or else the recurrent class of $k$ is $R'$, the gain per stage satisfies $g = g'$, and there is a shifted relative gain vector, $w$, of $k$ satisfying

$$w' \;\underset{\neq}{\leq}\; w \quad \text{and} \quad w'_j = w_j \text{ for each } j \in R'. \qquad (4.71)$$

Proof*: The policy $k$ of step 4 satisfies (4.70) with strict inequality for the component $i$
in which $k'$ and $k$ differ. Let $R$ be any recurrent class of $k$ and let $\pi$ be the steady-state probability vector for $R$. Premultiplying both sides of (4.70) by $\pi$, we get

$$\pi w' + g' \le \pi r^k + \pi [P^k] w'. \qquad (4.72)$$

Recognizing that $\pi [P^k] = \pi$ and cancelling terms, this shows that $g' \le \pi r^k$. Now (4.70) is satisfied with strict inequality for component $i$, and thus, if $\pi_i > 0$, (4.72) is satisfied with strict inequality. Thus,

$$g' \le \pi r^k \quad \text{with equality iff } \pi_i = 0. \qquad (4.73)$$

For the first possibility of Lemma 4.3, $k$ is a unichain and $i \in R$. Thus $g' < \pi r^k = g$. Similarly, for the second possibility in Lemma 4.3, $i \in R$ for the new recurrent class that is formed in $\tilde{k}$, ...