Discrete-time stochastic processes




...the optimal decision at stage n + 1. This completes the inductive step and thus the proof.

Since our major interest in stationary policies is to help understand the relationship between the optimal dynamic policy and stationary policies, we define an optimal stationary policy as follows:

Definition 4.13. A stationary policy k′ is optimal if there is some final reward vector w′ for which k′ is the optimal dynamic policy.

From Theorem 4.10, we see that if there is a solution to (4.60), then the stationary policy k′ that maximizes r_k + [P_k]w′ is an optimal stationary policy. Eqn. (4.60) is known as Bellman's equation, and we now explore the situations in which it has a solution (since these solutions give rise to optimal stationary policies).

Theorem 4.10 made no assumptions beyond Bellman's equation about w′, g′, or the stationary policy k′ that maximizes r_k + [P_k]w′. However, if k′ corresponds to a unichain, then, from Lemma 4.1 and its following discussion, w′ and g′ are uniquely determined (aside from an additive factor of αe in w′) as the relative-gain vector and gain per stage for k′.

If Bellman's equation has a solution w′, g′, then, for every decision k, we have

    w′ + g′e ≥ r_k + [P_k]w′,   with equality for some k′.    (4.67)

The Markov chains with transition matrices [P_k] might have multiple recurrent classes, so we let π_{k,R} denote the steady-state probability vector for a given recurrent class R of k. Premultiplying both sides of (4.67) by π_{k,R},

    π_{k,R} w′ + g′ π_{k,R} e ≥ π_{k,R} r_k + π_{k,R} [P_k] w′,   with equality for some k′.    (4.68)

Recognizing that π_{k,R} e = 1 and π_{k,R} [P_k] = π_{k,R}, this simplifies to

    g′ ≥ π_{k,R} r_k,   with equality for some k′.    (4.69)
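The relation g = π_k r_k, and the domination in (4.69), can be checked numerically. The sketch below uses a small hypothetical two-state, two-decision problem (all transition probabilities and rewards are invented for illustration): it evaluates every stationary policy by solving w + g·e = r_k + [P_k]w with the normalization w[0] = 0, computes each policy's steady-state vector, and confirms that each unichain policy's gain per stage equals π_k r_k and that the largest gain dominates all of them.

```python
import itertools
import numpy as np

# Hypothetical 2-state, 2-decision problem (all numbers invented).
# P[s][a] is the transition row out of state s under decision a;
# r[s][a] is the expected reward in state s under decision a.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
r = {0: {0: 1.0, 1: 2.0},
     1: {0: 0.0, 1: 1.0}}

def evaluate(policy):
    """Solve w + g*e = r_k + [P_k] w with w[0] = 0 (unichain assumed).
    Unknowns are (w[1], ..., w[n-1], g)."""
    n = len(policy)
    Pk = np.array([P[s][policy[s]] for s in range(n)])
    rk = np.array([r[s][policy[s]] for s in range(n)])
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(1, n):
            A[i, j - 1] = (i == j) - Pk[i, j]
        A[i, -1] = 1.0                       # coefficient of g
    x = np.linalg.solve(A, rk)
    w = np.concatenate(([0.0], x[:-1]))
    return x[-1], w, Pk, rk

def steady_state(Pk):
    """pi [P_k] = pi, pi e = 1 (single recurrent class assumed)."""
    n = Pk.shape[0]
    A = np.vstack([Pk.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

gains = {}
for policy in itertools.product([0, 1], repeat=2):
    g, w, Pk, rk = evaluate(policy)
    pi = steady_state(Pk)
    assert abs(g - pi @ rk) < 1e-9           # gain per stage equals pi_k r_k
    gains[policy] = g

g_best = max(gains.values())
# (4.69): the best gain dominates the gain of every stationary policy
assert all(g_best >= g - 1e-9 for g in gains.values())
print(g_best)
```

Every stationary policy in this invented example is irreducible, so each evaluation system is nonsingular; a policy with transient states would need the normalization discussed around Lemma 4.1.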
Equation (4.69) says that if Bellman's equation has a solution w′, g′, then the gain per stage g′ in that solution is greater than or equal to the gain per stage in each recurrent class of each stationary policy, and is equal to the gain per stage in each recurrent class of the maximizing stationary policy k′. Thus, the maximizing stationary policy is either a unichain or consists of several recurrent classes, all with the same gain per stage.

We have been discussing the properties that any solution of Bellman's equation must have, but still have no guarantee that any such solution exists. The following subsection describes a fairly general algorithm (policy iteration) to find a solution of Bellman's equation, and also shows why, in some cases, no solution exists. Before doing this, however, we look briefly at the overall relations between the states in a Markov decision problem.

For any Markov decision problem, consider a directed graph whose nodes are the states of the Markov decision problem and in which, for each pair of states (i, j), there is a directed arc from i to j if P_{ij}^{(k_i)} > 0 for some decision k_i.

Definition 4.14. A state i in a Markov decision problem is reachable from state j if there is a path from j to i in the above directed graph.

Note that if i is reachable from j, then there is a stationary policy in which i is accessible from j (i.e., for each arc (m, l) on the path, a decision k_m is used in state m for which P_{ml}^{(k_m)} > 0).

Definition 4.15. A state i in a Markov decision problem is inherently transient if it is not reachable from some state j that is reachable from i. A state i is inherently recurrent if it is not inherently transient. A class I of states is inherently recurrent if each i ∈ I is inherently recurrent, each is reachable from each other, and no state j ∉ I is reachable from any i ∈ I.
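Definitions 4.14 and 4.15 are easy to operationalize: take the union of the arcs over all decisions, compute reachable sets by graph search, and mark a state inherently transient exactly when some state reachable from it cannot reach it back. A minimal sketch, on an invented three-state problem (the successor lists are purely illustrative):

```python
# Hypothetical 3-state decision problem (arcs invented for illustration).
# succ[s][a] lists the states j with P_{sj}^{(a)} > 0 under decision a.
succ = {0: {0: [0], 1: [1]},
        1: {0: [0], 1: [2]},
        2: {0: [2], 1: [2]}}

# Union over decisions: arc i -> j if P_{ij}^{(k_i)} > 0 for some k_i.
adj = {s: set().union(*d.values()) for s, d in succ.items()}

def reachable_from(s):
    """All states on some directed path from s (s reaches itself trivially)."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return seen

reach = {s: reachable_from(s) for s in adj}

# Definition 4.15: i is inherently transient iff some j reachable from i
# cannot reach i back.
inherently_transient = {i for i in adj
                        if any(i not in reach[j] for j in reach[i])}
inherently_recurrent = set(adj) - inherently_transient
print(sorted(inherently_transient), sorted(inherently_recurrent))
```

In this invented example, state 2 can be reached from states 0 and 1 but cannot reach them back, so 0 and 1 are inherently transient and {2} is the inherently recurrent class.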
A Markov decision problem is inherently recurrent if all states form an inherently recurrent class. An inherently recurrent class of states is a class that, once entered, can never be left, but which has no subclass with that property. An inherently transient state is transient in at least one stationary policy, but might be recurrent in other policies (though all the states in any such recurrent class must be inherently transient). In the following subsection, we analyze inherently recurrent Markov decision problems. Multiple inherently recurrent classes can be analyzed one by one using the same approach, and we later give a short discussion of inherently transient states.

4.6.4 Policy iteration and the solution of Bellman's equation

The general idea of policy iteration is to start with an arbitrary unichain stationary policy k′ and to find its gain per stage g′ and its relative-gain vector w′. We then check whether Bellman's equation, (4.60), is satisfied, and if not, we find another stationary policy k that is 'better' than k′ in a sense to be described later. Unfortunately, the 'better' policy that we find might not be a unichain, so the following lemma shows that any such policy can be converted into an equally 'good' unichain policy. The algorithm then iteratively finds better and better unichain policies...
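The iteration just described can be sketched directly: evaluate the current policy (gain g′ and relative-gain vector w′, normalized so w′[0] = 0), then in each state pick the decision maximizing r_k + [P_k]w′, and stop when the policy no longer changes, i.e. when Bellman's equation holds. The sketch below uses a small invented two-state, two-decision problem in which every stationary policy happens to be a unichain, so the unichain repair step handled by the lemma in the text is not needed here.

```python
import numpy as np

# Hypothetical 2-state, 2-decision problem (numbers invented; every
# stationary policy here is a unichain and irreducible).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s][a] = row out of s under a
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 2.0],                   # r[s][a]
              [0.0, 1.0]])

def evaluate(policy):
    """Relative-gain equations w + g*e = r_k + [P_k] w, normalized w[0] = 0."""
    n = len(policy)
    Pk = np.array([P[s, policy[s]] for s in range(n)])
    rk = np.array([r[s, policy[s]] for s in range(n)])
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(1, n):
            A[i, j - 1] = (i == j) - Pk[i, j]
        A[i, -1] = 1.0                      # coefficient of g
    x = np.linalg.solve(A, rk)
    return x[-1], np.concatenate(([0.0], x[:-1]))   # g, w

policy = [0, 0]                             # arbitrary starting policy
while True:
    g, w = evaluate(policy)
    # policy-improvement step: maximize r_k + [P_k] w in each state
    improved = [int(np.argmax(r[s] + P[s] @ w)) for s in range(len(policy))]
    if improved == policy:
        break                               # Bellman's equation is satisfied
    policy = improved

print(policy, g)
```

At termination, w + g·e = max_k (r_k + [P_k]w) holds componentwise, so (g, w) solves Bellman's equation and the final policy is an optimal stationary policy for this toy problem.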

## This note was uploaded on 09/27/2010 for the course EE 229, taught by Professor R. Srikant during the Spring '09 term at the University of Illinois at Urbana-Champaign.
