um (or, if one wishes, all costs can be replaced with negative rewards).
[Figure 4.11: A shortest path problem. The arcs are marked with their lengths; any unmarked arc has length 1.]

We start the dynamic programming algorithm with a final cost vector that is 0 for node 1 and infinite for all other nodes. In stage 1, we choose the arc from node 2 to 1 and that from 4 to 1; the choice at node 3 is immaterial. The stage 1 costs are then
    v1(1, u) = 0,    v2(1, u) = 4,    v3(1, u) = ∞,    v4(1, u) = 1.

In stage 2, the cost v3(2, u), for example, is

    v3(2, u) = min[ 2 + v2(1, u),  4 + v4(1, u) ] = 5.

The set of costs at stage 2 is

    v1(2, u) = 0,    v2(2, u) = 2,    v3(2, u) = 5,    v4(2, u) = 1,

and the policy is for node 2 to go to 4, node 3 to 4, and node 4 to 1. At stage 3, node 3 switches to node 2, reducing its path length to 4, and nodes 2 and 4 are unchanged. Further iterations yield no change, and the resulting policy is also the optimal stationary policy.
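The backward recursion above can be sketched in a few lines of code. The arc lengths below are a hypothetical reconstruction inferred from the stage costs quoted in the text (the figure itself is not reproduced here), so treat the graph data as an assumption:

```python
import math

# Hypothetical reconstruction of the graph of Figure 4.11, inferred from the
# stage costs in the text: arc lengths c[i][j], unmarked arcs of length 1,
# and a zero-length self loop at the destination node 1.
c = {
    1: {1: 0},          # node 1: self loop of length 0
    2: {1: 4, 4: 1},    # node 2: arc to 1 (length 4), unmarked arc to 4
    3: {2: 2, 4: 4},    # node 3: arcs to 2 (length 2) and 4 (length 4)
    4: {1: 1},          # node 4: arc to 1 (length 1)
}

def dp_stage(v):
    """One backward dynamic-programming stage: v_i(n) = min_j (c_ij + v_j(n-1))."""
    return {i: min(length + v[j] for j, length in arcs.items())
            for i, arcs in c.items()}

# Final cost vector u: 0 at node 1, infinite for all other nodes.
v = {1: 0.0, 2: math.inf, 3: math.inf, 4: math.inf}
for stage in range(1, 5):
    v = dp_stage(v)
    print(stage, v)
```

Under this reconstruction the sketch reproduces the stage 1 and stage 2 costs above, and after stage 3 the recursion is stationary at the shortest-path lengths (0, 2, 4, 1).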
It can be seen without too much difficulty, for the example of Figure 4.11, that these final aggregate costs and shortest paths also result no matter what final cost vector u (with u1 = 0) is used. We shall see later that this always happens so long as all the cycles in the directed graph (other than the self loop from node 1 to node 1) have positive cost.

4.6.3 Optimal stationary policies

In Example 4.6.1, we saw that there was a final transient (for stage 1) in which decision 1
was taken, and in all other stages, decision 2 was taken. Thus, the optimal dynamic policy
used a stationary policy (using decision 2) except for a ﬁnal transient. It seems reasonable to
expect this same type of behavior for typical but more complex Markov decision problems.
We can get a clue about how to demonstrate this by ﬁrst looking at a situation in which
the aggregate expected gain of a stationary policy is equal to that of the optimal dynamic
policy. Denote some given stationary policy by the vector k′ = (k′1, . . . , k′M) of decisions in each state. Assume that the Markov chain with transition matrix [P^k′] is a unichain, i.e., recurrent with perhaps additional transient states. The expected aggregate reward for this stationary policy is then given by (4.46), using the Markov chain with transition matrix [P^k′] and reward vector r^k′. Let w′ be the relative gain vector for the stationary policy k′. Recall from (4.44) that if w′ is used as the final reward vector, then the expected aggregate gain simplifies to

    v^k′(n, w′) − n g′ e = w′,        (4.57)

where g′ = Σ_i π_i^k′ r_i^k′ is the steady-state gain, π^k′ is the steady-state probability vector, and the relative gain vector w′ satisfies

    w′ + g′ e = r^k′ + [P^k′] w′;        π^k′ w′ = 0.        (4.58)

The fact that the right hand side of (4.57) is independent of the stage, n, leads us to
hypothesize that if the stationary policy k′ is the same as the dynamic policy except for a final transient, then that final transient might disappear if we use w′ as a final reward vector. To pursue this hypothesis, assume a final reward equal to w′. Then, if k′ maximizes r^k + [P^k]w′ over k, we have

    v*(1, w′) = r^k′ + [P^k′] w′ = max_k { r^k + [P^k] w′ }.        (4.59)

Substituting (4.58) into (4.59), we see that the vector decision k′ is optimal at stage 1 if

    w′ + g′ e = r^k′ + [P^k′] w′ = max_k { r^k + [P^k] w′ }.        (4.60)

If (4.60) is also satisfied, then the optimal gain is given by
    v*(1, w′) = w′ + g′ e.        (4.61)

The following theorem now shows that if (4.60) is satisfied, then not only is the decision k′ that maximizes r^k + [P^k]w′ an optimal dynamic policy for stage 1, but it is also optimal at all stages (i.e., the stationary policy k′ is also an optimal dynamic policy).

Theorem 4.10. Assume that (4.60) is satisfied for some w′, g′, and k′. Then, if the final reward vector is equal to w′, the stationary policy k′ is an optimal dynamic policy and the optimal expected aggregate gain satisfies
    v*(n, w′) = w′ + n g′ e.        (4.62)

Proof: Since k′ maximizes r^k + [P^k]w′, it is an optimal decision at stage 1 for the final vector w′. From (4.60), w′ + g′ e = r^k′ + [P^k′]w′, so v*(1, w′) = w′ + g′ e. Thus (4.62) is satisfied for n = 1, and we use induction on n, with n = 1 as a basis, to verify (4.62) in general. Thus, assume that (4.62) is satisfied for n. Then, from (4.55),

    v*(n+1, w′) = max_k { r^k + [P^k] v*(n, w′) }        (4.63)
                = max_k { r^k + [P^k](w′ + n g′ e) }     (4.64)
                = n g′ e + max_k { r^k + [P^k] w′ }      (4.65)
                = (n+1) g′ e + w′.                       (4.66)

Eqn (4.64) follows from the inductive hypothesis of (4.62), (4.65) follows because [P^k]e = e for all k, and (4.66) follows from (4.60). This verifies (4.62) for n + 1. Also, since k′ maximizes (4.65), it also maximizes (4.63), showing that k′ is the optimal dynamic decision at stage n + 1, completing the induction.
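Theorem 4.10 can be checked numerically on a toy unichain example. The two-state MDP below is hypothetical (it is not from the text): for the stationary policy k′ = ('a', 'a'), the transition matrix has steady-state vector π = (0.5, 0.5), gain g′ = 1.5, and relative gain vector w′ = (−0.5, 0.5) satisfying (4.58) and (4.60), so value iteration started from the final reward w′ should advance by exactly g′e per stage:

```python
# A minimal numeric check of Theorem 4.10 on a hypothetical 2-state MDP
# (not from the text). State 0 has decisions 'a' and 'b'; state 1 has one.
P = {(0, 'a'): [0.5, 0.5], (0, 'b'): [0.9, 0.1], (1, 'a'): [0.5, 0.5]}
r = {(0, 'a'): 1.0, (0, 'b'): 0.0, (1, 'a'): 2.0}
decisions = {0: ['a', 'b'], 1: ['a']}

# For the stationary policy k' = ('a', 'a'): pi = (0.5, 0.5), the gain is
# g' = pi . r^k' = 1.5, and w' = (-0.5, 0.5) solves (4.58):
#   w' + g'e = r^k' + [P^k']w'  and  pi w' = 0.
g = 1.5
w = [-0.5, 0.5]

def bellman(v):
    """One stage of (4.55): v_i(n+1) = max_k { r_i^k + sum_j P_ij^k v_j(n) }."""
    return [max(r[(i, k)] + sum(p * vj for p, vj in zip(P[(i, k)], v))
                for k in decisions[i])
            for i in range(2)]

# With final reward w', each stage should add exactly g'e, i.e.
# v*(n, w') = w' + n g' e, as (4.62) asserts.
v = w
for n in range(1, 6):
    v = bellman(v)
    assert all(abs(v[i] - (w[i] + n * g)) < 1e-9 for i in range(2))
print(v)  # [7.0, 8.0] = w' + 5 g' e
```

At every stage the maximizing decision in state 0 is 'a', i.e., the stationary policy k′ is the optimal dynamic policy, which is exactly the conclusion of the theorem for this example.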
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at the University of Illinois, Urbana-Champaign.