so again g < π′r^k′. Since k′ is a unichain with the recurrent class R′, we have g < g′ again. For the third possibility in Lemma 4.3, i is transient in R′ = R. Thus π_i = 0, so π′ = π, and g′ = g. Thus, to complete the proof, we must demonstrate the validity of (4.71) for this case.
We first show that, for each n ≥ 1,

    v^k(n, w′) − ng′e ≤ v^k(n+1, w′) − (n+1)g′e.    (4.74)

For n = 1,

    v^k(1, w′) = r^k + [P^k]w′.    (4.75)

Using this, (4.70) can be rewritten as

    w′ ≤ v^k(1, w′) − g′e,  with w′ ≠ v^k(1, w′) − g′e.    (4.76)

Using (4.75) and then (4.76),

    v^k(1, w′) − g′e = r^k + [P^k]w′ − g′e
                     ≤ r^k + [P^k]{v^k(1, w′) − g′e} − g′e
                     = r^k + [P^k]v^k(1, w′) − 2g′e    (4.77)
                     = v^k(2, w′) − 2g′e.

We now use induction on n, using n = 1 as the basis, to demonstrate (4.74) in general. For any n > 1, assume (4.74) for n − 1 as the inductive hypothesis. Then
    v^k(n, w′) − ng′e = r^k + [P^k]v^k(n−1, w′) − ng′e
                      = r^k + [P^k]{v^k(n−1, w′) − (n−1)g′e} − g′e
                      ≤ r^k + [P^k]{v^k(n, w′) − ng′e} − g′e
                      = v^k(n+1, w′) − (n+1)g′e.

This completes the induction, verifying (4.74) and showing that v^k(n, w′) − ng′e is non-decreasing in n. Since k is a unichain, Lemma 4.1 asserts that k has a shifted relative gain vector w, i.e., a solution to (4.42). From (4.46),

    v^k(n, w′) = w + ng′e + [P^k]^n {w′ − w}.    (4.78)

Since [P^k]^n is a stochastic matrix, its elements are each between 0 and 1, so the sequence of vectors v^k(n, w′) − ng′e must be bounded independent of n. Since this sequence is also non-decreasing, it must have a limit, say w̃:

    lim_{n→∞} {v^k(n, w′) − ng′e} = w̃.    (4.79)

176 CHAPTER 4. FINITE-STATE MARKOV CHAINS
We next show that w̃ satisfies (4.42) for k:

    w̃ = lim_{n→∞} {v^k(n+1, w′) − (n+1)g′e}
       = lim_{n→∞} {r^k + [P^k]v^k(n, w′) − (n+1)g′e}    (4.80)
       = r^k − g′e + [P^k] lim_{n→∞} {v^k(n, w′) − ng′e}
       = r^k − g′e + [P^k]w̃.    (4.81)
(4.81) ˜
˜
Thus w is a shifted relative gain vector for k . Finally we must show that w satisﬁes the
conditions on w in (4.71). Using (4.76) and iterating with (4.74),
w0 ≤
6= ˜
v k (n, w 0 ) − ng 0 e ≤ w for all n ≥ 1. (4.82) Premultiplying each term in (4.82) by the steadystate probability vector π for k ,
˜
π w 0 ≤ π v k (n, w 0 ) − ng 0 ≤ π w . (4.83) Now, k is the same as k 0 over the recurrent class, and π = π 0 since π is nonzero only over
the recurrent class. This means that the ﬁrst inequality above is actually an equality. Also,
0
˜
going to the limit, we see that π w 0 = π w . Since πi ≥ 0 and wi ≤ wi , this implies that
˜
0 = w for all recurrent i, completing the proof.
wi
˜i
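The monotone convergence just proved can be checked numerically. The sketch below (a hypothetical example; all transition probabilities and rewards are invented, and NumPy is assumed) builds a small unichain with states 0, 1 recurrent and state 2 transient, picks a w′ satisfying (4.76), and verifies that v^k(n, w′) − ng′e is non-decreasing in n, converges to a shifted relative gain vector w̃, and satisfies πw′ = πw̃.

```python
import numpy as np

# Hypothetical 3-state unichain: states 0, 1 recurrent, state 2 transient.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.0],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 2.0, 5.0])
n_states = 3

# Steady-state vector pi and gain g = pi r; the "+ 1.0" adds the all-ones
# matrix, making (I - P + E) invertible for a unichain so pi(I - P + E) = 1.
pi = np.linalg.solve((np.eye(n_states) - P + 1.0).T, np.ones(n_states))
g = pi @ r

# Shifted relative gain vector w: solve w + g e = r + P w, pinning w[1] = 0.
A = np.vstack([np.eye(n_states) - P, [0.0, 1.0, 0.0]])
w, *_ = np.linalg.lstsq(A, np.append(r - g, 0.0), rcond=None)

# Choose w' = w - delta with P delta <= delta, strict at the transient state,
# so that (4.76) holds: w' <= r + P w' - g e, with strict inequality somewhere.
w_prime = w - np.array([0.0, 0.0, 1.0])
assert np.all(w_prime <= r + P @ w_prime - g + 1e-12)

# Iterate v(n) = r + P v(n-1) from v(0) = w' and track v(n) - n g e.
seq = [w_prime]
v = w_prime.copy()
for n in range(1, 200):
    v = r + P @ v
    seq.append(v - n * g)

# (4.74): the sequence is non-decreasing componentwise ...
monotone = all(np.all(seq[i] <= seq[i + 1] + 1e-12) for i in range(len(seq) - 1))
# ... and (4.79): it converges to a shifted relative gain vector w-tilde.
w_tilde = seq[-1]
```

Note that the strict improvement happens only at the transient state: on the recurrent states π forces Pδ = δ, so w′ and w̃ agree there, exactly as the end of the proof asserts.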
We now see that each iteration of the algorithm either increases the gain per stage or holds
the gain constant and increases the shifted relative gain vector w . Thus the sequence
of policies found by the algorithm can never repeat. Since there are a finite number of
stationary policies, the algorithm must eventually terminate at step 3. Thus we have proved
the following important theorem.
Theorem 4.11. For any inherently recurrent Markov decision problem, there is a solution to Bellman's equation and a maximizing stationary policy that is a unichain.
There are also many interesting Markov decision problems, such as shortest path problems,
that contain not only an inherently recurrent class but also some inherently transient states.
The following theorem then applies.
Theorem 4.12. Consider a Markov decision problem with a single inherently recurrent class of states and one or more inherently transient states. Let g∗ be the maximum gain per stage over all recurrent classes of all stationary policies and assume that each recurrent class with gain per stage equal to g∗ is contained in the inherently recurrent class. Then there is a solution to Bellman's equation and a maximizing stationary policy that is a unichain.
Proof*: Let k be a stationary policy which has a recurrent class, R, with gain per stage g∗. Let j be any state in R. Since j is inherently recurrent, there is a decision vector k̃ under which j is accessible from all other states. Choose k′ such that k′_i = k_i for all i ∈ R and k′_i = k̃_i for all i ∉ R. Then k′ is a unichain policy with gain per stage g∗. Suppose the policy improvement algorithm is started with this unichain policy. If the algorithm stops at step 3, then k′ satisfies Bellman's equation and we are done. Otherwise, from Lemma 4.4, the unichain policy in step 6 of the algorithm either has a larger gain per stage (which is impossible, since g∗ is the maximum) or has the same recurrent class R and has a relative gain vector w satisfying (4.74). Iterating the algorithm, we find successively larger relative gain vectors. Since the policies cannot repeat, the algorithm must eventually stop with a solution to Bellman's equation.
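For concreteness, the policy improvement algorithm used in these proofs can be sketched in a few lines. The following toy example is hypothetical: the two-state, two-action model and all its numbers are invented, and the code assumes every stationary policy is a unichain (here every transition probability is positive, so each policy is in fact irreducible). The loop stops only when no action is strictly better, i.e., at step 3 of the algorithm.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and r[s, a] are invented.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 1.5],
              [0.0, 2.0]])
n_states, n_actions = r.shape

def evaluate(policy):
    """Gain g and relative gain vector w (normalized so w[-1] = 0) of a unichain policy."""
    Pk = P[np.arange(n_states), policy]    # transition matrix under the policy
    rk = r[np.arange(n_states), policy]    # reward vector under the policy
    # steady-state vector: pi (I - Pk + E) = 1, with E the all-ones matrix
    pi = np.linalg.solve((np.eye(n_states) - Pk + 1.0).T, np.ones(n_states))
    g = pi @ rk
    # relative gain: (I - Pk) w = rk - g e, with w[-1] pinned to 0
    A = np.vstack([np.eye(n_states) - Pk, np.eye(n_states)[-1]])
    w, *_ = np.linalg.lstsq(A, np.append(rk - g, 0.0), rcond=None)
    return g, w

policy = np.zeros(n_states, dtype=int)     # start from an arbitrary policy
for _ in range(100):                       # finitely many policies: must stop
    g, w = evaluate(policy)
    q = r + P @ w                          # q[s, a] = r[s, a] + sum_s' P[s, a, s'] w[s']
    new_policy = policy.copy()
    for s in range(n_states):
        if q[s].max() > q[s, policy[s]] + 1e-10:   # change only on a strict gain
            new_policy[s] = q[s].argmax()
    if np.array_equal(new_policy, policy): # step 3: no improvement, stop
        break
    policy = new_policy

# At termination, Bellman's equation holds: max_a q[s, a] = w[s] + g for all s.
g, w = evaluate(policy)
bellman_ok = np.allclose((r + P @ w).max(axis=1), w + g, atol=1e-8)
```

The tie-break (keep the current action unless another is strictly better) mirrors the argument above: each iteration either raises the gain per stage or raises the relative gain vector, so the policy sequence cannot cycle.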
The above theorems give us a good idea of the situations...
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at University of Illinois, Urbana-Champaign.