Discrete-time stochastic processes



The maximization over k' in (4.55) is really M separate and independent maximizations, one for each state; i.e., (4.55) is simply a vector form of (4.54). Another frequently useful way to rewrite (4.54) or (4.55) is as follows:

    v^*(n, \mathbf{u}) = \mathbf{r}^{k'} + [P^{k'}]\, v^*(n-1) \quad \text{for } k' \text{ such that}
    \mathbf{r}^{k'} + [P^{k'}]\, v^*(n-1) = \max_k \bigl\{ \mathbf{r}^{k} + [P^{k}]\, v^*(n-1) \bigr\}.        (4.56)

If k' satisfies (4.56), it is called an optimal decision for stage n. Note that (4.54), (4.55), and (4.56) are valid with no restrictions (such as recurrent or aperiodic states) on the possible transition probabilities [P^k]. The dynamic programming algorithm is just the calculation of (4.54), (4.55), or (4.56), performed successively for n = 1, 2, 3, ... . The development of this algorithm, as a systematic tool for solving this class of problems, is due to Bellman [Bel57]. This algorithm yields the optimal dynamic policy for any given final reward vector u. Along with the calculation of v*(n, u) for each n, the algorithm also yields the optimal decision at each stage. The surprising simplicity of the algorithm is due to the Markov property: v_i^*(n, u) is the aggregate present and future reward conditional on the present state, and since it is conditioned on the present state, it is independent of the past (i.e., of how the process arrived at state i through previous transitions and choices).

Although dynamic programming is computationally straightforward and convenient,(7) the asymptotic behavior of v*(n, u) as n → ∞ is not evident from the algorithm. After working out some simple examples, we look at the general question of asymptotic behavior.

(7) Unfortunately, many dynamic programming problems of interest have enormous numbers of states and possible choices of decision (the so-called curse of dimensionality), and thus, even though the equations are simple, the computational requirements might be beyond the range of practical feasibility.

Example 4.6.1. Consider Figure 4.10, repeated below, with the final rewards u_2 = u_1 = 0.

[Figure 4.10, repeated: a two-state chain with r_1 = 0, P_{11} = 0.99, P_{12} = 0.01; in state 2, decision 1 has reward r_2^{(1)} = 1 with P_{21}^{(1)} = 0.01, P_{22}^{(1)} = 0.99, and decision 2 has reward r_2^{(2)} = 50 with P_{21}^{(2)} = 1.]

Since there is no reward at stage 0, u_j = 0. Also r_1 = 0, so, from (4.52), the aggregate gain in state 1 at stage 1 is

    v_1^*(1, \mathbf{u}) = r_1 + \sum_j P_{1j} u_j = 0.

Similarly, since policy 1 has an immediate reward r_2^{(1)} = 1 in state 2, and policy 2 has an immediate reward r_2^{(2)} = 50,

    v_2^*(1, \mathbf{u}) = \max\Bigl\{ \bigl[ r_2^{(1)} + \sum_j P_{2j}^{(1)} u_j \bigr],\ \bigl[ r_2^{(2)} + \sum_j P_{2j}^{(2)} u_j \bigr] \Bigr\} = \max\{1, 50\} = 50.

We can now go on to stage 2, using the results above for v_j^*(1, u). From (4.53),

    v_1^*(2) = r_1 + P_{11} v_1^*(1, \mathbf{u}) + P_{12} v_2^*(1, \mathbf{u}) = P_{12} v_2^*(1, \mathbf{u}) = 0.5

    v_2^*(2) = \max\Bigl\{ \bigl[ r_2^{(1)} + \sum_j P_{2j}^{(1)} v_j^*(1, \mathbf{u}) \bigr],\ \bigl[ r_2^{(2)} + P_{21}^{(2)} v_1^*(1, \mathbf{u}) \bigr] \Bigr\}
             = \max\bigl\{ 1 + P_{22}^{(1)} v_2^*(1, \mathbf{u}),\ 50 \bigr\} = \max\{50.5, 50\} = 50.5.

Thus we have seen that, in state 2, decision 1 is preferable at stage 2, while decision 2 is preferable at stage 1. What is happening is that the choice of decision 2 at stage 1 has made it very profitable to be in state 2 at stage 1. Thus if the chain is in state 2 at stage 2, it is preferable to take the small unit gain of decision 1 at stage 2, since this keeps a high probability of still being in state 2 at stage 1, where the reward of 50 is available. Continuing this computation for larger n, one finds that v_1^*(n, u) = (n−1)/2 and v_2^*(n, u) = 50 + (n−1)/2, so the aggregate gain grows by 1/2 per stage. The optimum dynamic policy is decision 2 for stage 1 and decision 1 for all stages n > 1.
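To make the recursion concrete, here is a minimal Python sketch (not from the text) of the backward step (4.54)-(4.56) applied to the two-state chain of Example 4.6.1. The data layout and the names DECISIONS and dp_stage are illustrative choices, not notation from the book; states are indexed 0 and 1 in the code for the book's states 1 and 2.

```python
# Minimal sketch (not from the text) of the backward recursion (4.54)-(4.56)
# for the two-state chain of Example 4.6.1.

import numpy as np

# For each state: a list of (immediate reward, transition-probability row),
# one entry per available decision.
DECISIONS = {
    0: [(0.0, np.array([0.99, 0.01]))],               # state 1: single decision
    1: [(1.0, np.array([0.01, 0.99])),                # state 2, decision 1
        (50.0, np.array([1.00, 0.00]))],              # state 2, decision 2
}

def dp_stage(v_prev):
    """One backward step: v_i*(n) = max_k [ r_i^k + sum_j P_ij^k v_j*(n-1) ]."""
    v_new = np.empty(len(DECISIONS))
    best = []
    for i, options in DECISIONS.items():
        values = [r + p @ v_prev for r, p in options]
        best.append(1 + int(np.argmax(values)))       # decisions numbered as in the book
        v_new[i] = max(values)
    return v_new, best

v = np.zeros(2)                                       # final reward vector u = (0, 0)
for n in range(1, 6):
    v, decisions = dp_stage(v)
    print(f"n={n}: v* = {v}, optimal decisions = {decisions}")
# Prints v*(1) = [0, 50] and v*(2) = [0.5, 50.5]; thereafter both components
# grow by 1/2 per stage while v2* - v1* stays at 50.
```

Running the sketch reproduces the stage-1 and stage-2 values computed above and shows decision 1 becoming optimal in state 2 at every stage after the first.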
This example also illustrates that the maximization of expected gain is not necessarily what is most desirable in all applications. For example, people who want to avoid risk might well prefer decision 2 at stage 2. This guarantees a reward of 50, rather than taking a small chance of losing that reward.

Example 4.6.2 (Shortest Path Problems). The problem of finding the shortest paths between nodes in a directed graph arises in many situations, from routing in communication networks to calculating the time needed to complete complex tasks. The problem is quite similar to the expected first-passage-time problem of Example 4.5.1. In that problem, arcs in a directed graph were selected according to a probability distribution, whereas here we must make a decision about which arc to take. Although there are no probabilities here, the problem can still be posed as dynamic programming.

We suppose that we want to find the shortest path from each node in a directed graph to some particular node, say node 1 (see Figure 4.11). The link lengths are arbitrary numbers that might reflect physical distance, or might reflect an arbitrary type of cost. The length of a path is the sum of the lengths of the arcs on that path. In terms of dynamic programming, a policy is a choice of arc out of each node. Here we want to minimize cost (i.e., path length) rather than maximize reward, so we simply replace the maximum in the dynamic programming algorithm with a minimum; a short sketch of this minimization recursion is given below.
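As a hedged illustration of that minimization (the small graph below is made up; it is not Figure 4.11), the following Python sketch iterates d_i(n) = min over outgoing arcs (i, j) of [length(i, j) + d_j(n−1)], with node 1 held at 0. Repeating the step until the values stop changing, which takes at most one pass per node beyond the first, yields the shortest path length from every node to node 1; this is essentially the Bellman-Ford recursion.

```python
# Illustrative sketch only: the graph below is made up, not Figure 4.11.
# Backward recursion with min in place of max:
#     d_i(n) = min over arcs (i, j) of [ length(i, j) + d_j(n-1) ],   d_1 = 0.

import math

# arcs[i] = list of (j, length) pairs for the arcs leaving node i
arcs = {
    1: [],                        # node 1 is the destination
    2: [(1, 4.0), (3, 1.0)],
    3: [(1, 2.0), (4, 1.5)],
    4: [(1, 0.5)],
}

# Stage 0: only the destination has a finite "final reward" (distance 0).
d = {i: (0.0 if i == 1 else math.inf) for i in arcs}

# At most |nodes| - 1 stages are needed for the values to converge.
for _ in range(len(arcs) - 1):
    d = {i: 0.0 if i == 1
         else min((length + d[j] for j, length in arcs[i]), default=math.inf)
         for i in arcs}

print(d)   # {1: 0.0, 2: 3.0, 3: 2.0, 4: 0.5} for this made-up graph
```

The optimal policy read off from the final values (the minimizing arc out of each node) is exactly the shortest-path tree rooted at node 1.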