Discrete-time stochastic processes

…minimum (or, if one wishes, all costs can be replaced with negative rewards).

[Figure 4.11: A shortest path problem. The arcs are marked with their lengths. Any unmarked link has length 1.]

We start the dynamic programming algorithm with a final cost vector that is 0 for node 1 and infinite for all other nodes. In stage 1, we choose the arc from node 2 to node 1 and that from node 4 to node 1; the choice at node 3 is immaterial. The stage 1 costs are then
$$v_1(1,u) = 0, \qquad v_2(1,u) = 4, \qquad v_3(1,u) = \infty, \qquad v_4(1,u) = 1.$$
In stage 2, the cost $v_3(2,u)$, for example, is
$$v_3(2,u) = \min\bigl[\,2 + v_2(1,u),\; 4 + v_4(1,u)\,\bigr] = 5.$$
The set of costs at stage 2 are
$$v_1(2,u) = 0, \qquad v_2(2,u) = 2, \qquad v_3(2,u) = 5, \qquad v_4(2,u) = 1,$$
and the policy is for node 2 to go to 4, node 3 to 4, and 4 to 1. At stage 3, node 3 switches to node 2, reducing its path length to 4, and nodes 2 and 4 are unchanged. Further iterations yield no change, and the resulting policy is also the optimal stationary policy.

It can be seen without too much difficulty, for the example of Figure 4.11, that these final aggregate costs and shortest paths also result no matter what final cost vector $u$ (with $u_1 = 0$) is used. We shall see later that this always happens so long as all the cycles in the directed graph (other than the self-loop from node 1 to node 1) have positive cost.

4.6.3 Optimal stationary policies

In Example 4.6.1, we saw that there was a final transient (for stage 1) in which decision 1 was taken, and in all other stages decision 2 was taken. Thus the optimal dynamic policy used a stationary policy (decision 2) except for a final transient. It seems reasonable to expect this same type of behavior for typical but more complex Markov decision problems. We can get a clue about how to demonstrate this by first looking at a situation in which the expected aggregate gain of a stationary policy is equal to that of the optimal dynamic policy.

Denote some given stationary policy by the vector $k' = (k'_1, \ldots, k'_M)$ of decisions in each state. Assume that the Markov chain with transition matrix $[P^{k'}]$ is a unichain, i.e., recurrent with perhaps additional transient states. The expected aggregate reward for this stationary policy is then given by (4.46), using the Markov chain with transition matrix $[P^{k'}]$ and reward vector $r^{k'}$. Let $w'$ be the relative gain vector for the stationary policy $k'$. Recall from (4.44) that if $w'$ is used as the final reward vector, then the expected aggregate gain simplifies to
$$v^{k'}(n, w') - n g' e = w', \qquad (4.57)$$
where $g' = \sum_i \pi_i^{k'} r_i^{k'}$ is the steady-state gain, $\pi^{k'}$ is the steady-state probability vector, and the relative gain vector $w'$ satisfies
$$w' + g' e = r^{k'} + [P^{k'}]\, w'; \qquad \pi^{k'} w' = 0. \qquad (4.58)$$

The fact that the right-hand side of (4.57) is independent of the stage $n$ leads us to hypothesize that if the stationary policy $k'$ is the same as the dynamic policy except for a final transient, then that final transient might disappear if we use $w'$ as a final reward vector. To pursue this hypothesis, assume a final reward equal to $w'$. Then, if $k'$ maximizes $r^k + [P^k]w'$ over $k$, we have
$$v^*(1, w') = r^{k'} + [P^{k'}]\, w' = \max_k \{ r^k + [P^k]\, w' \}. \qquad (4.59)$$
Substituting (4.58) into (4.59), we see that the vector decision $k'$ is optimal at stage 1 if
$$w' + g' e = r^{k'} + [P^{k'}]\, w' = \max_k \{ r^k + [P^k]\, w' \}. \qquad (4.60)$$
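To make the objects in (4.58) and (4.60) concrete, here is a minimal numerical sketch in Python with numpy (both assumed available). The two-state data are hypothetical illustrative numbers, not the values of Example 4.6.1, and the names `relative_gain`, `r_dec`, `P_dec`, and `satisfies_4_60` are introduced only for this sketch: it computes $g'$ and $w'$ for a stationary policy by solving (4.58) and then tests whether that policy meets condition (4.60).

```python
import numpy as np

def relative_gain(P, r):
    """Gain g' and relative-gain vector w' of a unichain policy: the solution
    of  w' + g'e = r + [P]w'  with the normalization  pi' w' = 0  (eq. 4.58)."""
    M = P.shape[0]
    evals, evecs = np.linalg.eig(P.T)            # stationary distribution pi'
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    g = pi @ r                                   # steady-state gain g'
    A = np.vstack([np.eye(M) - P, pi])           # singular system plus pi'w' = 0
    b = np.append(r - g, 0.0)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return g, w

# Hypothetical two-state data (not the numbers of Example 4.6.1): decision
# index 0 ("decision 1") collects a large one-shot reward in state 2 but jumps
# to state 1; index 1 ("decision 2") earns a small reward and tends to stay.
# State 1 has only one real decision, so its row is the same under both.
r_dec = {0: np.array([0.0, 50.0]), 1: np.array([0.0, 1.0])}
P_dec = {0: np.array([[0.99, 0.01], [1.00, 0.00]]),
         1: np.array([[0.99, 0.01], [0.01, 0.99]])}

def satisfies_4_60(policy):
    """Check condition (4.60) for a stationary policy (one decision per state)."""
    P = np.array([P_dec[k][i] for i, k in enumerate(policy)])
    r = np.array([r_dec[k][i] for i, k in enumerate(policy)])
    g, w = relative_gain(P, r)
    lhs = r + P @ w                              # r^{k'} + [P^{k'}]w' = w' + g'e
    rhs = np.array([max(r_dec[k][i] + P_dec[k][i] @ w for k in (0, 1))
                    for i in range(len(policy))])
    return np.allclose(lhs, rhs)

print(satisfies_4_60((0, 1)))   # decision 2 in state 2: (4.60) holds -> True
print(satisfies_4_60((0, 0)))   # decision 1 in state 2: (4.60) fails -> False
```

With these illustrative numbers, the policy that uses decision 2 in state 2 passes the test while the policy that grabs the large one-shot reward fails it, which is the role condition (4.60) plays in the discussion that follows.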
If (4.60) is satisfied, then the optimal gain is given by
$$v^*(1, w') = w' + g' e. \qquad (4.61)$$
The following theorem now shows that if (4.60) is satisfied, then the decision $k'$ that maximizes $r^k + [P^k]w'$ is not only an optimal dynamic policy for stage 1 but is also optimal at all stages (i.e., the stationary policy $k'$ is also an optimal dynamic policy).

Theorem 4.10. Assume that (4.60) is satisfied for some $w'$, $g'$, and $k'$. Then, if the final reward vector is equal to $w'$, the stationary policy $k'$ is an optimal dynamic policy and the optimal expected aggregate gain satisfies
$$v^*(n, w') = w' + n g' e. \qquad (4.62)$$

Proof: Since $k'$ maximizes $r^k + [P^k]w'$, it is an optimal decision at stage 1 for the final vector $w'$. From (4.60), $w' + g' e = r^{k'} + [P^{k'}]w'$, so $v^*(1, w') = w' + g' e$. Thus (4.62) is satisfied for $n = 1$, and we use induction on $n$, with $n = 1$ as a basis, to verify (4.62) in general. Thus, assume that (4.62) is satisfied for $n$. Then, from (4.55),
$$\begin{aligned}
v^*(n+1, w') &= \max_k \{ r^k + [P^k]\, v^*(n, w') \} \qquad (4.63)\\
 &= \max_k \{ r^k + [P^k](w' + n g' e) \} \qquad (4.64)\\
 &= n g' e + \max_k \{ r^k + [P^k]\, w' \} \qquad (4.65)\\
 &= (n+1)\, g' e + w'. \qquad (4.66)
\end{aligned}$$
Eqn. (4.64) follows from the inductive hypothesis (4.62), (4.65) follows because $[P^k]e = e$ for all $k$, and (4.66) follows from (4.60). This verifies (4.62) for $n + 1$. Also, since $k'$ maximizes (4.65), it also maximizes (4.63), showing that $k'$ is the optimal decision at stage $n+1$ as well.
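As a numerical sanity check of Theorem 4.10, the sketch below reuses the same hypothetical two-state data as the previous sketch (nothing here comes from the text's own examples). It runs backward value iteration per (4.55), starting from the relative gain vector $w'$ of a policy that satisfies (4.60), and confirms that $v^*(n, w') = w' + n g' e$ at each stage tested.

```python
import numpy as np

# Illustrative two-state data (hypothetical, not the numbers of Example 4.6.1).
r_dec = {0: np.array([0.0, 50.0]), 1: np.array([0.0, 1.0])}
P_dec = {0: np.array([[0.99, 0.01], [1.00, 0.00]]),
         1: np.array([[0.99, 0.01], [0.01, 0.99]])}

# Stationary policy k' = (decision 1 in state 1, decision 2 in state 2).
P = np.array([P_dec[0][0], P_dec[1][1]])
r = np.array([r_dec[0][0], r_dec[1][1]])

# Gain g' and relative-gain vector w' from (4.58).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
g = pi @ r
A = np.vstack([np.eye(2) - P, pi])
b = np.append(r - g, 0.0)
w, *_ = np.linalg.lstsq(A, b, rcond=None)

def value_iterate(v, n):
    """n stages of backward dynamic programming, eq. (4.55):
    maximize r_i^(k) + sum_j P_ij^(k) v_j over decisions k in each state."""
    for _ in range(n):
        v = np.array([max(r_dec[k][i] + P_dec[k][i] @ v for k in (0, 1))
                      for i in range(2)])
    return v

for n in (1, 5, 20):
    v = value_iterate(w.copy(), n)
    print(n, np.allclose(v, w + n * g))   # True: v*(n, w') = w' + n g' e
```

The check holds for every $n$ because, exactly as in the proof, each stage simply adds $g' e$ to the previous value vector when the final reward is $w'$.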