Discrete-time stochastic processes

... long term gratification (alternative 1).

[Figure 4.10: A Markov decision problem with two alternatives in state 2.]

It is also possible to consider the situation in which the rewards for each decision are associated with transitions; that is, for decision k in state i, the reward $r_{ij}^{(k)}$ is associated with a transition from i to j. This means that the expected reward for a transition from i with decision k is given by $r_i^{(k)} = \sum_j P_{ij}^{(k)} r_{ij}^{(k)}$. Thus, as in the previous section, there is no essential loss in generality in restricting attention to the case in which rewards are associated with the states.

The set of rules used by the decision maker in selecting different alternatives at each stage of the chain is called a policy. We want to consider the expected aggregate reward over n trials of the "Markov chain" as a function of the policy used by the decision maker. If the policy uses the same decision, say $k_i$, at each occurrence of state i, for each i, then that policy corresponds to a homogeneous Markov chain with transition probabilities $P_{ij}^{(k_i)}$. We denote the matrix of these transition probabilities as $[P^{\mathbf{k}}]$, where $\mathbf{k} = (k_1, \ldots, k_M)$. Such a policy, i.e., making the decision for each state i independent of time, is called a stationary policy. The aggregate reward for any such stationary policy was found in the previous section. Since both rewards and transition probabilities depend only on the state and the corresponding decision, and not on time, one feels intuitively that stationary policies make a certain amount of sense over a long period of time. On the other hand, assuming some final reward $u_i$ for being in state i at the end of the nth trial, one might expect the best policy to depend on time, at least close to the end of the n trials.

In what follows, we first derive the optimal policy for maximizing expected aggregate reward over an arbitrary number n of trials. We shall see that the decision at time m, 0 ≤ m < n, for the optimal policy does in fact depend both on m and on the final rewards $\{u_i;\ 1 \le i \le M\}$. We call this optimal policy the optimal dynamic policy. This policy is found from the dynamic programming algorithm, which, as we shall see, is conceptually very simple. We then go on to find the relationship between the optimal dynamic policy and the optimal stationary policy and show that each has the same long-term gain per trial.

4.6.2 Dynamic programming algorithm

As in our development of Markov chains with rewards, we consider expected aggregate reward over n time periods and we use stages, counting backwards from the final trial. First consider the optimum decision with just one trial (i.e., with just one stage). We start in a given state i at stage 1, make a decision k, obtain the reward $r_i^{(k)}$, then go to some state j with probability $P_{ij}^{(k)}$ and obtain the final reward $u_j$. This expected aggregate reward is maximized over the choice of k, i.e.,

$$ v_i^*(1, \mathbf{u}) = \max_k \Big\{ r_i^{(k)} + \sum_j P_{ij}^{(k)} u_j \Big\}. \qquad (4.52) $$

We use the notation $v_i^*(n, \mathbf{u})$ to represent the maximum expected aggregate reward for n stages starting in state i. Note that $v_i^*(1, \mathbf{u})$ depends on the final reward vector $\mathbf{u} = (u_1, u_2, \ldots, u_M)^T$.
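As a concrete illustration of the one-stage maximization (4.52), the sketch below evaluates $r_i^{(k)} + \sum_j P_{ij}^{(k)} u_j$ for each decision k and takes the per-state maximum. The two-state transition matrices, rewards, and final-reward vector are illustrative assumptions (they are not the values of Figure 4.10), and every state is assumed to offer the same decision set.

```python
import numpy as np

# Illustrative (assumed) two-state example: P[k] is the transition matrix and
# r[k] the reward vector when decision k is used; u is the final reward vector.
P = {1: np.array([[0.5, 0.5],
                  [0.9, 0.1]]),
     2: np.array([[0.5, 0.5],
                  [0.1, 0.9]])}
r = {1: np.array([0.0, 1.0]),
     2: np.array([0.0, 50.0])}
u = np.array([0.0, 0.0])

decisions = sorted(P)
# candidates[a, i] = r_i^(k) + sum_j P_ij^(k) u_j  for decision k = decisions[a]
candidates = np.stack([r[k] + P[k] @ u for k in decisions])
v1 = candidates.max(axis=0)                             # v_i^*(1, u): per-state maximum over k
k1 = [decisions[a] for a in candidates.argmax(axis=0)]  # a maximizing decision in each state

print("v*(1, u) =", v1, "   maximizing decisions:", k1)
```

Since the final rewards are zero in this toy example, the one-stage values reduce to the largest immediate reward in each state; a nonzero $\mathbf{u}$ would shift the comparison toward decisions whose transitions lead to well-rewarded final states.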
Next consider the maximum expected aggregate reward starting in state i at stage 2. For each state j, 1 ≤ j ≤ M, let $v_j(1, \mathbf{u})$ be the expected aggregate reward, over stages 1 and 0, for some arbitrary policy, conditional on the chain being in state j at stage 1. Then if decision k is made in state i at stage 2, the expected aggregate reward for stage 2 is $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j(1, \mathbf{u})$. Note that no matter what policy is chosen at stage 2, this expression is maximized at stage 1 by choosing the stage 1 policy that maximizes $v_j(1, \mathbf{u})$. Thus, independent of what we choose at stage 2 (or at earlier times), we must use $v_j^*(1, \mathbf{u})$ for the aggregate gain from stage 1 onward in order to maximize the overall aggregate gain from stage 2. Thus, at stage 2, we achieve the maximum expected aggregate gain, $v_i^*(2, \mathbf{u})$, by choosing the k that achieves the following maximum:

$$ v_i^*(2, \mathbf{u}) = \max_k \Big\{ r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(1, \mathbf{u}) \Big\}. \qquad (4.53) $$

Repeating this argument for successively larger n, we obtain the general expression

$$ v_i^*(n, \mathbf{u}) = \max_k \Big\{ r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n-1, \mathbf{u}) \Big\}. \qquad (4.54) $$

Note that this is almost the same as (4.33), differing only by the maximization over k. We can also write this in vector form, for n ≥ 1, as

$$ \mathbf{v}^*(n, \mathbf{u}) = \max_{\mathbf{k}} \big\{ \mathbf{r}^{\mathbf{k}} + [P^{\mathbf{k}}]\, \mathbf{v}^*(n-1, \mathbf{u}) \big\}, \qquad (4.55) $$

where for n = 1, we take $\mathbf{v}^*(0, \mathbf{u}) = \mathbf{u}$. Here $\mathbf{k}$ is a set (or vector) of decisions, $\mathbf{k} = (k_1, k_2, \ldots, k_M)$, where $k_i$ is the decision to be used in state i. $[P^{\mathbf{k}}]$ denotes a matrix whose (i, j) element is $P_{ij}^{(k_i)}$, and $\mathbf{r}^{\mathbf{k}}$ denotes a vector whose ith element is $r_i^{(k_i)}$. The maximization over ...
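To show how the recursion (4.54)–(4.55) runs backwards from the final reward vector, here is a minimal sketch of the dynamic programming algorithm under the same simplifying assumption as before (every state has the same decision set). The function name and the numerical inputs are illustrative, not part of the text.

```python
import numpy as np

def optimal_dynamic_policy(P, r, u, n):
    """Backward recursion (4.54)/(4.55): v*(0, u) = u and, at each stage m,
    v_i*(m, u) = max_k { r_i^(k) + sum_j P_ij^(k) v_j*(m-1, u) }.
    P[k] and r[k] are the transition matrix and reward vector under decision k
    (the same decision set is assumed in every state); u is the final reward
    vector.  Returns v*(n, u) and the maximizing decision vector at each stage,
    with stages counted backwards from the final trial."""
    decisions = sorted(P)
    v = np.asarray(u, dtype=float)                  # v*(0, u) = u
    policy = []
    for stage in range(1, n + 1):
        # candidates[a, i] = r_i^(k) + sum_j P_ij^(k) v_j  for k = decisions[a]
        candidates = np.stack([r[k] + P[k] @ v for k in decisions])
        policy.append([decisions[a] for a in candidates.argmax(axis=0)])
        v = candidates.max(axis=0)                  # maximize over k separately in each state
    return v, policy

# Same illustrative two-state example as in the earlier sketch.
P = {1: np.array([[0.5, 0.5], [0.9, 0.1]]),
     2: np.array([[0.5, 0.5], [0.1, 0.9]])}
r = {1: np.array([0.0, 1.0]), 2: np.array([0.0, 50.0])}
u = np.array([0.0, 0.0])

v3, stages = optimal_dynamic_policy(P, r, u, n=3)
print("v*(3, u) =", v3)
for m, k_vec in enumerate(stages, start=1):
    print(f"stage {m}: decisions {k_vec}")
```

Because each component of (4.55) depends only on row i of $[P^{\mathbf{k}}]$, the maximization over the decision vector separates into independent per-state maximizations, which is exactly what the per-column `argmax` above computes.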