Discrete-time stochastic processes


…um (or, if one wishes, all costs can be replaced with negative rewards).

[Figure 4.11: A shortest path problem. The arcs are marked with their lengths; any unmarked link has length 1.]

We start the dynamic programming algorithm with a final cost vector that is 0 for node 1 and infinite for all other nodes. In stage 1, we choose the arc from node 2 to 1 and that from 4 to 1; the choice at node 3 is immaterial. The stage 1 costs are then

$$v_1(1, u) = 0, \qquad v_2(1, u) = 4, \qquad v_3(1, u) = \infty, \qquad v_4(1, u) = 1.$$

In stage 2, the cost $v_3(2, u)$, for example, is

$$v_3(2, u) = \min\bigl[\, 2 + v_2(1, u),\; 4 + v_4(1, u) \,\bigr] = 5.$$

The set of costs at stage 2 are

$$v_1(2, u) = 0, \qquad v_2(2, u) = 2, \qquad v_3(2, u) = 5, \qquad v_4(2, u) = 1,$$

and the policy is for node 2 to go to 4, node 3 to 4, and 4 to 1. At stage 3, node 3 switches to node 2, reducing its path length to 4, and nodes 2 and 4 are unchanged. Further iterations yield no change, and the resulting policy is also the optimal stationary policy.

It can be seen without too much difficulty, for the example of Figure 4.11, that these final aggregate costs and shortest paths also result no matter what final cost vector $u$ (with $u_1 = 0$) is used. We shall see later that this always happens so long as all the cycles in the directed graph (other than the self loop from node 1 to node 1) have positive cost.

4.6.3 Optimal stationary policies

In Example 4.6.1, we saw that there was a final transient (for stage 1) in which decision 1 was taken, and in all other stages, decision 2 was taken. Thus, the optimal dynamic policy used a stationary policy (using decision 2) except for a final transient. It seems reasonable to expect this same type of behavior for typical but more complex Markov decision problems.
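The stage-by-stage recursion in the example, $v_i(n+1, u) = \min_j [\, l_{ij} + v_j(n, u) \,]$, is easy to check numerically. Below is a minimal sketch in Python; since Figure 4.11 itself is not reproduced here, the arc list is an assumption reconstructed from the worked costs (arcs $2\to1$ of length 4, $4\to1$ of length 1, $3\to2$ of length 2, $3\to4$ of length 4, the unmarked arc $2\to4$ of length 1, and the zero-length self loop at node 1).

```python
import math

# Arc lengths reconstructed from the worked example (an assumption,
# since Figure 4.11 is not reproduced here): arcs[i][j] is the length
# of the arc from node i to node j.
arcs = {
    1: {1: 0},          # zero-cost self loop at the destination
    2: {1: 4, 4: 1},    # the unmarked arc 2->4 has length 1
    3: {2: 2, 4: 4},
    4: {1: 1},
}

def dp_stage(v):
    """One stage of dynamic programming: v_i <- min_j (l_ij + v_j)."""
    return {i: min(l + v[j] for j, l in out.items()) for i, out in arcs.items()}

# Final cost vector u: 0 for node 1, infinite for all other nodes.
v = {1: 0, 2: math.inf, 3: math.inf, 4: math.inf}
for stage in range(1, 5):
    v = dp_stage(v)
    print(stage, v)
```

Stages 1 and 2 reproduce the costs computed above, stage 3 shows node 3 switching to the arc toward node 2 (aggregate cost 4), and stage 4 yields no further change, matching the claim that the algorithm has converged to the shortest paths.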
We can get a clue about how to demonstrate this by first looking at a situation in which the aggregate expected gain of a stationary policy is equal to that of the optimal dynamic policy. Denote some given stationary policy by the vector $k' = (k'_1, \ldots, k'_M)$ of decisions in each state. Assume that the Markov chain with transition matrix $[P^{k'}]$ is a unichain, i.e., recurrent with perhaps additional transient states. The expected aggregate reward for this stationary policy is then given by (4.46), using the Markov chain with transition matrix $[P^{k'}]$ and reward vector $r^{k'}$. Let $w'$ be the relative gain vector for the stationary policy $k'$. Recall from (4.44) that if $w'$ is used as the final reward vector, then the expected aggregate gain simplifies to

$$v^{k'}(n, w') - n g' e = w', \tag{4.57}$$

where $g' = \sum_i \pi_i^{k'} r_i^{k'}$ is the steady-state gain, $\pi^{k'}$ is the steady-state probability vector, and the relative gain vector $w'$ satisfies

$$w' + g' e = r^{k'} + [P^{k'}] w'; \qquad \pi^{k'} w' = 0. \tag{4.58}$$

The fact that the right hand side of (4.57) is independent of the stage $n$ leads us to hypothesize that if the stationary policy $k'$ is the same as the dynamic policy except for a final transient, then that final transient might disappear if we use $w'$ as a final reward vector. To pursue this hypothesis, assume a final reward equal to $w'$. Then, if $k'$ maximizes $r^k + [P^k] w'$ over $k$, we have

$$v^*(1, w') = r^{k'} + [P^{k'}] w' = \max_k \{ r^k + [P^k] w' \}. \tag{4.59}$$

Substituting (4.58) into (4.59), we see that the vector decision $k'$ is optimal at stage 1 if

$$w' + g' e = r^{k'} + [P^{k'}] w' = \max_k \{ r^k + [P^k] w' \}. \tag{4.60}$$

If (4.60) is satisfied, then the optimal gain is given by

$$v^*(1, w') = w' + g' e. \tag{4.61}$$
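Relations (4.57) and (4.58) can be verified numerically for a small unichain. The sketch below uses a hypothetical two-state chain (the matrix $P$ and reward vector $r$ are made up for illustration, not taken from the text): it solves (4.58) exactly for $g'$ and $w'$ in rational arithmetic, then iterates the reward recursion $v(n) = r + [P]v(n-1)$ starting from the final vector $w'$ and checks that $v(n, w') - n g' e = w'$ at every stage.

```python
from fractions import Fraction as F

def solve2(A, b):
    # Solve a 2x2 linear system A x = b exactly by Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

# Hypothetical two-state unichain (made-up numbers for illustration).
P = [[F(9, 10), F(1, 10)],
     [F(2, 10), F(8, 10)]]
r = [F(1), F(5)]

# Steady-state vector: pi (I - P) = 0 together with pi . e = 1.
pi = solve2([[P[0][0] - 1, P[1][0]],
             [1,           1      ]], [F(0), F(1)])
g = pi[0] * r[0] + pi[1] * r[1]        # steady-state gain g' = pi . r

# Relative gain vector from (4.58): first component of w + g e = r + P w,
# plus the normalization pi . w = 0, pins w down uniquely.
w = solve2([[1 - P[0][0], -P[0][1]],
            [pi[0],        pi[1]  ]], [r[0] - g, F(0)])

# Check the second component of (4.58) as well: w + g e = r + [P] w.
assert w[1] + g == r[1] + P[1][0] * w[0] + P[1][1] * w[1]

# Iterate v(n) = r + [P] v(n-1) from the final vector w and check (4.57):
# the relative gains never change from stage to stage.
v = w[:]
for n in range(1, 6):
    v = [r[i] + P[i][0] * v[0] + P[i][1] * v[1] for i in range(2)]
    assert all(v[i] - n * g == w[i] for i in range(2))

print("g =", g, " w =", w)
```

With these made-up numbers the chain has $g' = 7/3$ and $w' = (-40/9,\, 80/9)$, and every assertion holds exactly, illustrating why the right hand side of (4.57) is independent of $n$.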
The following theorem now shows that if (4.60) is satisfied, then not only is the decision $k'$ that maximizes $r^k + [P^k] w'$ an optimal dynamic policy for stage 1, but it is also optimal at all stages (i.e., the stationary policy $k'$ is also an optimal dynamic policy).

Theorem 4.10. Assume that (4.60) is satisfied for some $w'$, $g'$, and $k'$. Then, if the final reward vector is equal to $w'$, the stationary policy $k'$ is an optimal dynamic policy and the optimal expected aggregate gain satisfies

$$v^*(n, w') = w' + n g' e. \tag{4.62}$$

Proof: Since $k'$ maximizes $r^k + [P^k] w'$, it is an optimal decision at stage 1 for the final vector $w'$. From (4.60), $w' + g' e = r^{k'} + [P^{k'}] w'$, so $v^*(1, w') = w' + g' e$. Thus (4.62) is satisfied for $n = 1$, and we use induction on $n$, with $n = 1$ as a basis, to verify (4.62) in general. Thus, assume that (4.62) is satisfied for $n$. Then, from (4.55),

$$v^*(n+1, w') = \max_k \{ r^k + [P^k] v^*(n, w') \} \tag{4.63}$$
$$\phantom{v^*(n+1, w')} = \max_k \{ r^k + [P^k](w' + n g' e) \} \tag{4.64}$$
$$\phantom{v^*(n+1, w')} = n g' e + \max_k \{ r^k + [P^k] w' \} \tag{4.65}$$
$$\phantom{v^*(n+1, w')} = (n+1) g' e + w'. \tag{4.66}$$

Eqn (4.64) follows from the inductive hypothesis of (4.62), (4.65) follows because $[P^k] e = e$ for all $k$, and (4.66) follows from (4.60). This verifies (4.62) for $n + 1$. Also, since $k'$ maximizes (4.65), it also maximizes (4.63), showing that $k'$ is the optimal decision at stage $n+1$ as well.
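Theorem 4.10 can likewise be exercised on a toy Markov decision problem. The two-state, two-decision rewards and transition probabilities below are hypothetical, chosen only so that every stationary policy is a unichain: for each stationary policy the code solves for $g'$ and $w'$, tests condition (4.60), and then runs the dynamic programming recursion (4.55) from the final vector $w'$ to confirm (4.62), $v^*(n, w') = w' + n g' e$.

```python
from itertools import product
from math import isclose

# Hypothetical 2-state MDP: decisions[i][k] = (reward, transition row)
# for decision k in state i.  All rows are strictly positive, so every
# stationary policy is a unichain, as the section assumes.
decisions = [
    [(2.0, (0.5, 0.5)), (1.0, (0.9, 0.1))],   # state 0
    [(0.0, (0.5, 0.5)), (3.0, (0.1, 0.9))],   # state 1
]

def gain_and_w(policy):
    """Steady-state gain g' and relative-gain vector w' of (4.57)-(4.58)
    for a stationary policy on a 2-state unichain (closed form)."""
    (r0, p0), (r1, p1) = decisions[0][policy[0]], decisions[1][policy[1]]
    a, b = p0[0], p1[0]                 # P[0][0] and P[1][0]
    pi0 = b / (1.0 - a + b)             # stationary probabilities
    pi1 = 1.0 - pi0
    g = pi0 * r0 + pi1 * r1
    # First row of (4.58), (1-a) w0 - (1-a) w1 = r0 - g, together with
    # the normalization pi . w = 0 (so w1 = -pi0 w0 / pi1):
    w0 = (r0 - g) / ((1.0 - a) * (1.0 + pi0 / pi1))
    return g, (w0, -pi0 * w0 / pi1)

def bellman(v):
    """One stage of (4.55): maximize over the decisions in each state."""
    return tuple(max(r + p[0] * v[0] + p[1] * v[1] for r, p in decisions[i])
                 for i in range(2))

def satisfies_460(policy):
    """Check (4.60): w' + g'e = max_k { r^k + [P^k] w' }, component-wise."""
    g, w = gain_and_w(policy)
    return all(isclose(w[i] + g, bellman(w)[i]) for i in range(2))

# Find a stationary policy satisfying (4.60) ...
opt = next(p for p in product(range(2), repeat=2) if satisfies_460(p))
g, w = gain_and_w(opt)

# ... and confirm (4.62): value iteration from the final vector w'
# stays exactly n g' e above w' at every stage n.
v = w
for n in range(1, 7):
    v = bellman(v)
    assert all(isclose(v[i], w[i] + n * g) for i in range(2))
print("stationary policy satisfying (4.60):", opt, " gain:", g)
```

With these made-up numbers the search settles on the policy using decision 0 in state 0 and decision 1 in state 1, with gain $g' = 17/6$; the induction step of the proof is exactly what makes every assertion in the final loop succeed.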

## This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at University of Illinois, Urbana-Champaign.
