al decision at stage
n + 1. This completes the inductive step and thus the proof.
Since our major interest in stationary policies is to help understand the relationship between
the optimal dynamic policy and stationary policies, we define an optimal stationary policy as follows:
Definition 4.13. A stationary policy $k'$ is optimal if there is some final reward vector $w'$
for which $k'$ is the optimal dynamic policy.
172 CHAPTER 4. FINITE-STATE MARKOV CHAINS
From Theorem 4.10, we see that if there is a solution to (4.60), then the stationary policy
$k'$ that maximizes $r^k + [P^k]w'$ is an optimal stationary policy. Eqn. (4.60) is known as
Bellman’s equation, and we now explore the situations in which it has a solution (since these
solutions give rise to optimal stationary policies).
Theorem 4.10 made no assumptions beyond Bellman’s equation about $w'$, $g'$, or the stationary
policy $k'$ that maximizes $r^k + [P^k]w'$. However, if $k'$ corresponds to a unichain, then,
from Lemma 4.1 and its following discussion, $w'$ and $g'$ are uniquely determined (aside from
an additive factor of $\alpha e$ in $w'$) as the relative gain vector and gain per stage for $k'$.
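As a rough illustration of the unichain case, the sketch below solves the linear system $w + ge = r + [P]w$ for a single fixed policy. It is only a sketch under stated assumptions: the three-state matrix `P`, the reward vector `r`, and the helper names `solve` and `evaluate_unichain` are hypothetical, and the free additive term $\alpha e$ is removed by pinning the first component of $w$ to zero, which is one convenient normalization and not necessarily the one used in Lemma 4.1.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions; assumes A is nonsingular."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(i for i in range(c, n) if M[i][c] != 0)  # first usable pivot row
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [x - factor * y for x, y in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]

def evaluate_unichain(P, r):
    """Return (g, w) with w[0] = 0 solving w + g e = r + [P] w (unichain assumed)."""
    n = len(P)
    # Unknowns are w[1..n-1] followed by g; w[0] is fixed at 0 (assumed normalization).
    A = [[(F(1) if i == j else F(0)) - P[i][j] for j in range(1, n)] + [F(1)]
         for i in range(n)]
    x = solve(A, r)
    return x[-1], [F(0)] + x[:-1]

# Hypothetical three-state unichain; the numbers are illustrative only.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(0), F(1, 2), F(1, 2)],
     [F(1, 2), F(0), F(1, 2)]]
r = [F(0), F(1), F(2)]

g, w = evaluate_unichain(P, r)
print(g, w)
```

Adding any multiple of $e$ to the returned `w` gives another solution with the same `g`, which is the non-uniqueness noted above.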
If Bellman’s equation has a solution, $w'$, $g'$, then, for every decision $k$, we have
$$w' + g'e \ge r^k + [P^k]w' \quad \text{with equality for some } k'. \tag{4.67}$$
The Markov chains with transition matrices $[P^k]$ might have multiple recurrent classes, so
we let $\pi^{k,R}$ denote the steady-state probability vector for a given recurrent class $R$ of $k$.
Premultiplying both sides of (4.67) by $\pi^{k,R}$,
$$\pi^{k,R}w' + g'\pi^{k,R}e \ge \pi^{k,R}r^k + \pi^{k,R}[P^k]w' \quad \text{with equality for some } k'. \tag{4.68}$$
Recognizing that $\pi^{k,R}e = 1$ and $\pi^{k,R}[P^k] = \pi^{k,R}$, this simplifies to
$$g' \ge \pi^{k,R}r^k \quad \text{with equality for some } k'. \tag{4.69}$$
This says that if Bellman’s equation has a solution $w'$, $g'$, then the gain per stage $g'$ in
that solution is greater than or equal to the gain per stage in each recurrent class of each
stationary policy, and is equal to the gain per stage in each recurrent class of the maximizing
stationary policy, $k'$. Thus, the maximizing stationary policy is either a unichain or consists
of several recurrent classes all with the same gain per stage.
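The inequalities (4.67) and (4.69) can be checked numerically on a small case. The sketch below uses a hypothetical two-state problem in which state 0 has one decision and state 1 has two; the numbers, the candidate pair `w`, `g`, and the helper names `rhs` and `gain` are all assumptions made for illustration, and the candidate is verified by the code rather than derived.

```python
from fractions import Fraction as F

# Hypothetical two-state problem; all numbers are illustrative assumptions.
# decisions[i] lists (reward, transition row) pairs available in state i.
decisions = {
    0: [(F(0), [F(99, 100), F(1, 100)])],
    1: [(F(1), [F(1, 100), F(99, 100)]),
        (F(50), [F(1), F(0)])],
}
policies = [(0, 0), (0, 1)]  # decision index used in state 0 and in state 1

# Candidate solution of Bellman's equation for this example (assumed, then checked).
w = [F(0), F(50)]
g = F(1, 2)

def rhs(policy, i):
    """Component i of r^k + [P^k] w for the stationary policy k = policy."""
    reward, row = decisions[i][policy[i]]
    return reward + sum(p * wj for p, wj in zip(row, w))

def gain(policy):
    """Gain per stage pi^k r^k; both policies here have one recurrent class."""
    a = decisions[0][policy[0]][1][1]  # P_01
    b = decisions[1][policy[1]][1][0]  # P_10
    pi = [b / (a + b), a / (a + b)]    # steady-state vector of a two-state chain
    return sum(pi[i] * decisions[i][policy[i]][0] for i in (0, 1))

# (4.67): w + g e >= r^k + [P^k] w for every policy, componentwise.
ineq_467 = all(w[i] + g >= rhs(k, i) for k in policies for i in (0, 1))
# Policies that attain equality in every component.
maximizing = [k for k in policies if all(w[i] + g == rhs(k, i) for i in (0, 1))]
# (4.69): g >= pi^k r^k for every policy.
gains = {k: gain(k) for k in policies}
ineq_469 = all(g >= gk for gk in gains.values())

print(ineq_467, maximizing, gains, ineq_469)
```

In this example the policy that takes the large immediate reward of 50 in state 1 has a slightly smaller gain per stage (50/101) than the policy that attains equality (1/2), which is the ordering that (4.69) requires.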
We have been discussing the properties that any solution of Bellman’s equation must have,
but still have no guarantee that any such solution exists. The following subsection
describes a fairly general algorithm (policy iteration) to find a solution of Bellman’s
equation, and also shows why, in some cases, no solution exists. Before doing this, however,
we look briefly at the overall relations between the states in a Markov decision problem.
For any Markov decision problem, consider a directed graph for which the nodes of the
graph are the states in the Markov decision problem, and, for each pair of states $(i, j)$,
there is a directed arc from $i$ to $j$ if $P_{ij}^{(k_i)} > 0$ for some decision $k_i$.
Definition 4.14. A state $i$ in a Markov decision problem is reachable from state $j$ if there
is a path from $j$ to $i$ in the above directed graph.
Note that if $i$ is reachable from $j$, then there is a stationary policy in which $i$ is accessible
from $j$ (i.e., for each arc $(m, l)$ on the path, a decision $k_m$ in state $m$ is used for which
$P_{ml}^{(k_m)} > 0$).
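The graph construction and the reachability test can be sketched in a few lines. Everything named here is hypothetical: the table `trans` (mapping each state to the transition rows of its available decisions), the three-state example, and the helpers `arcs` and `reachable_from`. The sketch also assumes that a path means one or more arcs, so a state counts as reachable from itself only if some cycle returns to it.

```python
from collections import deque

# Hypothetical problem: trans[i] lists one transition row per decision in state i.
trans = {
    0: [[0.5, 0.5, 0.0]],
    1: [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    2: [[0.0, 0.5, 0.5]],
}

def arcs(trans):
    """Arc i -> j exists if some decision in state i has P_ij > 0."""
    return {i: {j for row in rows for j, p in enumerate(row) if p > 0}
            for i, rows in trans.items()}

def reachable_from(graph, j):
    """Set of states reachable from j along a path of one or more arcs."""
    seen, queue = set(), deque([j])
    while queue:
        m = queue.popleft()
        for l in graph[m]:
            if l not in seen:
                seen.add(l)
                queue.append(l)
    return seen

graph = arcs(trans)
reach = {j: reachable_from(graph, j) for j in graph}
print(graph)
print(reach)
```

Here state 2 is reachable from state 1 only because the second decision in state 1 has a positive transition probability to it; a stationary policy that uses the first decision in state 1 never leaves state 1.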
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING 173
Definition 4.15. A state $i$ in a Markov decision problem is inherently transient if it is
not reachable from some state $j$ that is reachable from $i$. A state $i$ is inherently recurrent
if it is not inherently transient. A class $I$ of states is inherently recurrent if each $i \in I$
is inherently recurrent, each is reachable from each other, and no state $j \notin I$ is reachable
from any $i \in I$. A Markov decision problem is inherently recurrent if all states form an
inherently recurrent class.
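The classification in Definition 4.15 can be computed directly from the reachability relation. The following is a minimal sketch, assuming the directed graph is already available as a hypothetical dictionary `graph` mapping each state to the set of states it has arcs to; the four-state example and the function names are illustrative only.

```python
def reachable_from(graph, j):
    """Set of states reachable from j along a path of one or more arcs."""
    seen, stack = set(), [j]
    while stack:
        m = stack.pop()
        for l in graph[m]:
            if l not in seen:
                seen.add(l)
                stack.append(l)
    return seen

def classify(graph):
    """Return (inherently transient states, set of inherently recurrent classes)."""
    reach = {i: reachable_from(graph, i) for i in graph}
    # i is inherently transient if some j reachable from i cannot reach i.
    transient = {i for i in graph if any(i not in reach[j] for j in reach[i])}
    recurrent = set(graph) - transient
    # Group inherently recurrent states that are reachable from each other.
    classes = {frozenset(j for j in recurrent if j in reach[i] and i in reach[j])
               for i in recurrent}
    return transient, classes

# Hypothetical graph: state 0 can move to {1, 2} but cannot be reached from it.
graph = {0: {0, 1}, 1: {1, 2}, 2: {1, 2}, 3: {3}}
transient, classes = classify(graph)
print(transient, classes)
```

In this example state 0 is inherently transient even though it has an arc to itself, since some available decision eventually leads to states 1 and 2, from which it cannot be reached. States 1 and 2 form one inherently recurrent class and state 3 forms another.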
An inherently recurrent class of states is a class that, once entered, can never be left, but
which has no subclass with that property. An inherently transient state is transient in at
least one stationary policy, but might be recurrent in other policies (but all the states in any
such recurrent class must be inherently transient). In the following subsection, we analyze
inherently recurrent Markov decision problems. Multiple inherently recurrent classes can
be analyzed one by one using the same approach, and we later give a short discussion of
inherently transient states.
4.6.4 Policy iteration and the solution of Bellman’s equation
The general idea of policy iteration is to start with an arbitrary unichain stationary policy
$k'$ and to find its gain per stage $g'$ and its relative gain vector $w'$. We then check whether
Bellman’s equation, (4.60), is satisfied, and if not, we find another stationary policy $k$
that is ‘better’ than $k'$ in a sense to be described later. Unfortunately, the ‘better’ policy
that we find might not be a unichain, so the following lemma shows that any such policy
can be converted into an equally ‘good’ unichain policy. The algorithm then iteratively
finds better and better un...
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R.srikant during the Spring '09 term at University of Illinois, Urbana Champaign.