$r^k$ in (4.55) is really M separate and independent maximizations, one for each state,
i.e., (4.55) is simply a vector form of (4.54). Another frequently useful way to rewrite (4.54)
or (4.55) is as follows:
$$v^*(n, u) = r^{k'} + [P^{k'}]\,v^*(n-1) \quad \text{for } k' \text{ such that}$$
$$r^{k'} + [P^{k'}]\,v^*(n-1) = \max_k \left\{ r^k + [P^k]\,v^*(n-1) \right\}. \tag{4.56}$$

If $k'$ satisfies (4.56), it is called an optimal decision for stage n. Note that (4.54), (4.55), and
(4.56) are valid with no restrictions (such as recurrent or aperiodic states) on the possible
transition probabilities [P k ].
The dynamic programming algorithm is just the calculation of (4.54), (4.55), or (4.56), performed successively for n = 1, 2, 3, . . . . The development of this algorithm, as a systematic
tool for solving this class of problems, is due to Bellman [Bel57]. This algorithm yields the
optimal dynamic policy for any given final reward vector, $u$. Along with the calculation
of $v^*(n, u)$ for each n, the algorithm also yields the optimal decision at each stage. The
surprising simplicity of the algorithm is due to the Markov property. That is, $v_i^*(n, u)$ is the
aggregate present and future reward conditional on the present state. Since it is conditioned
on the present state, it is independent of the past (i.e., how the process arrived at state i
from previous transitions and choices).
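As a concrete illustration, the backward recursion of (4.55) can be sketched in Python. This is not from the text: the function and variable names are my own, and for simplicity each decision $k$ is assumed to specify a full reward vector $r^k$ and transition matrix $[P^k]$, so that the componentwise max carries out the M separate per-state maximizations.

```python
import numpy as np

def dp_backward(rewards, transitions, u, n_stages):
    """Iterate v*(n) = max_k { r^k + [P^k] v*(n-1) } for n = 1, ..., n_stages.

    rewards[k] is the reward vector r^k and transitions[k] the matrix [P^k]
    for decision k. The componentwise max performs one independent
    maximization per state, as the text notes for (4.55)."""
    v = np.asarray(u, dtype=float)                 # v*(0, u) = u, the final reward
    stage_decisions = []
    for _ in range(n_stages):
        # candidates[k, i] = aggregate reward in state i under decision k
        candidates = np.array([r + P @ v for r, P in zip(rewards, transitions)])
        stage_decisions.append(candidates.argmax(axis=0))  # optimal decision per state
        v = candidates.max(axis=0)                         # v*(n, u)
    return v, stage_decisions
```

Feeding in the chain of Example 4.6.1 below (two decisions, two states) reproduces the values computed there by hand.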
Although dynamic programming is computationally straightforward and convenient⁷, the
asymptotic behavior of $v^*(n, u)$ as $n \to \infty$ is not evident from the algorithm. After working
out some simple examples, we look at the general question of asymptotic behavior.
Example 4.6.1. Consider Fig. 4.10, repeated below, with the final rewards $u_2 = u_1 = 0$.

[Figure 4.10, repeated: a two-state chain in which state 1 has $r_1 = 0$ with $P_{11} = 0.99$ and $P_{12} = 0.01$. In state 2, decision 1 gives reward $r_2^{(1)} = 1$ with $P_{22} = 0.99$, $P_{21} = 0.01$; decision 2 gives reward $r_2^{(2)} = 50$ with $P_{21} = 1$.]

Since there is no reward in stage 0, $u_j = 0$. Also $r_1 = 0$, so, from (4.52), the aggregate gain
in state 1 at stage 1 is

$$v_1^*(1, u) = r_1 + \sum_j P_{1j} u_j = 0.$$
Similarly, since decision 1 has an immediate reward $r_2^{(1)} = 1$ in state 2, and decision 2 has an
immediate reward $r_2^{(2)} = 50$,

$$v_2^*(1, u) = \max\left\{ r_2^{(1)} + \sum_j P_{2j}^{(1)} u_j,\;\, r_2^{(2)} + \sum_j P_{2j}^{(2)} u_j \right\} = \max\{1, 50\} = 50.$$
⁷Unfortunately, many dynamic programming problems of interest have enormous numbers of states and
possible choices of decision (the so-called curse of dimensionality), and thus, even though the equations are
simple, the computational requirements might be beyond the range of practical feasibility.
We can now go on to stage 2, using the results above for $v_j^*(1, u)$. From (4.53),

$$v_1^*(2) = r_1 + P_{11} v_1^*(1, u) + P_{12} v_2^*(1, u) = P_{12} v_2^*(1, u) = 0.5$$

$$v_2^*(2) = \max\left\{ r_2^{(1)} + \sum_j P_{2j}^{(1)} v_j^*(1, u),\;\, r_2^{(2)} + P_{21}^{(2)} v_1^*(1, u) \right\}
= \max\left\{ 1 + P_{22}^{(1)} v_2^*(1, u),\, 50 \right\} = \max\{50.5, 50\} = 50.5.$$

Thus, we have seen that, in state 2, decision 1 is preferable at stage 2, while decision 2 is
preferable at stage 1. What is happening is that the choice of decision 2 at stage 1 has made
it very proﬁtable to be in state 2 at stage 1. Thus if the chain is in state 2 at stage 2, it is
preferable to choose decision 1 (i.e., the small unit gain) at stage 2 with the corresponding
high probability of remaining in state 2 at stage 1. Continuing this computation for larger
n, one finds that $v_1^*(n, u) = (n-1)/2$ and $v_2^*(n, u) = 50 + (n-1)/2$, consistent with the
values $v_1^*(1, u) = 0$, $v_1^*(2) = 0.5$, $v_2^*(1, u) = 50$, and $v_2^*(2) = 50.5$ found above. The optimum dynamic policy is
decision 2 for stage 1 and decision 1 for all stages n > 1.
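The stage computation for this particular chain is small enough to check numerically. The following is a sketch (the function name and layout are mine, not from the text); at each stage, state 1 has its single choice, while state 2 takes the better of its two decisions.

```python
# Stage recursion for the chain of Example 4.6.1: state 1 has the single
# choice (r1 = 0, P11 = 0.99, P12 = 0.01); state 2 picks the better of
# decision 1 (reward 1, stay in state 2 with probability 0.99) and
# decision 2 (reward 50, move to state 1 with probability 1).
def next_stage(v1, v2):
    new_v1 = 0.0 + 0.99 * v1 + 0.01 * v2
    d1 = 1.0 + 0.01 * v1 + 0.99 * v2   # small unit gain, likely remain in state 2
    d2 = 50.0 + 1.0 * v1               # large reward, forced move to state 1
    return new_v1, max(d1, d2)

v1, v2 = 0.0, 0.0                      # final reward vector u = 0
for n in range(1, 7):
    v1, v2 = next_stage(v1, v2)
    print(n, round(v1, 6), round(v2, 6))   # each value grows by 1/2 per stage
```

The printed values show the aggregate gains increasing by 1/2 per stage after the initial stage, with $v_2^* - v_1^*$ fixed at 50, which is why decision 1 keeps winning the comparison in state 2 for all stages after the first.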
This example also illustrates that the maximization of expected gain is not necessarily what
is most desirable in all applications. For example, people who want to avoid risk might well
prefer decision 2 at stage 2. This guarantees a reward of 50, rather than taking a small
chance of losing that reward.
Example 4.6.2 (Shortest Path Problems). The problem of ﬁnding the shortest paths
between nodes in a directed graph arises in many situations, from routing in communication
networks to calculating the time to complete complex tasks. The problem is quite similar
to the expected first passage time of Example 4.5.1. In that problem, arcs in a directed
graph were selected according to a probability distribution, whereas here, we must make
a decision about which arc to take. Although there are no probabilities here, the problem
can be posed as dynamic programming. We suppose that we want to find the shortest path
from each node in a directed graph to some particular node, say node 1 (see Figure 4.11).
The link lengths are arbitrary numbers that might reﬂect physical distance, or might reﬂect
an arbitrary type of cost. The length of a path is the sum of the lengths of the arcs on that
path. In terms of dynamic programming, a policy is a choice of arc out of each node. Here
we want to minimize cost (i.e., path length) rather than maximize reward, so we simply
replace the maximum in the dynamic programming algorithm with a minimum.
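This min-form of the recursion can be sketched as follows. The graph, function name, and data layout here are illustrative assumptions of mine, not from the text; after n stages, v[i] holds the length of a shortest path from node i to the target using at most n arcs.

```python
import math

def shortest_to_target(nodes, arcs, target, n_stages):
    """Dynamic programming with min in place of max.

    arcs maps (i, j) -> length of the directed arc from i to j.
    v[i] starts at 0 for the target and infinity elsewhere; each stage
    either keeps the current estimate or takes one arc out of i."""
    v = {i: (0.0 if i == target else math.inf) for i in nodes}
    for _ in range(n_stages):
        new_v = {}
        for i in nodes:
            best = v[i]   # option: keep the current at-most-(n-1)-arc path
            for (a, j), length in arcs.items():
                if a == i:
                    best = min(best, length + v[j])   # option: take arc (i, j)
            new_v[i] = best
        v = new_v
    return v
```

For a graph with M nodes and nonnegative arc lengths, M − 1 stages suffice, since a shortest path never needs to repeat a node.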