Discrete-time stochastic processes


From our discussion of Little's theorem in Section 3.6, we know that this sum of times is equal to the sum of the number of customers in the system, summed over each integer time from the initial time with i customers to the final time when the system becomes empty. As in the previous example, we modify the Markov chain to make state 0 a trapping state. We take $r_i = i$ as the "reward" in state i, and $v_i$ as the expected aggregate reward until the trapping state is entered. Using the same reasoning as in the previous example, $v_i$ is equal to the immediate "reward" $r_i = i$ plus the expected reward from whatever state is entered next. Thus

$$v_i = r_i + \sum_{j \ge 1} P_{ij} v_j.$$

With $v_0 = 0$, this is $\mathbf{v} = \mathbf{r} + [P]\mathbf{v}$. This has a unique solution for $\mathbf{v}$, as will be shown later in Lemma 4.1. This same analysis is valid for any choice of reward $r_i$ for each transient state i; the reward in the trapping state must be 0 so as to keep the expected aggregate reward finite.
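As a concrete illustration of solving $\mathbf{v} = \mathbf{r} + [P]\mathbf{v}$ with a trapping state, the minimal sketch below restricts the equation to the transient states and solves the resulting linear system. The specific four-state chain is a made-up stand-in (the text's own example is cut off in this preview); only the structure follows the discussion above: a trapping state 0 with zero reward and reward $r_i = i$ in each transient state i.

```python
import numpy as np

# Hypothetical 4-state chain (states 0..3); state 0 is the trapping state.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # trapping state: stays put, reward 0
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.4, 0.3, 0.3],
    [0.0, 0.0, 0.6, 0.4],
])
r = np.array([0.0, 1.0, 2.0, 3.0])   # r_i = i for transient states, r_0 = 0

# v_i = r_i + sum_{j>=1} P_ij v_j for i >= 1, with v_0 = 0.
# Over the transient states this reads v = r_T + [P_T] v, i.e. (I - P_T) v = r_T.
P_T = P[1:, 1:]
r_T = r[1:]
v_T = np.linalg.solve(np.eye(len(r_T)) - P_T, r_T)

v = np.concatenate(([0.0], v_T))
print(v)   # expected aggregate reward accumulated before trapping, per start state
```

The matrix $I - P_T$ is invertible here because every transient state eventually reaches the trapping state, which is what guarantees the unique solution referred to above.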
In the above examples, the Markov chain has a trapping state with zero gain, so the expected gain is essentially a transient phenomenon until the trapping state is entered. We now look at the more general case of a unichain, i.e., a chain with a single recurrent class, possibly along with some transient states. In this more general case, there can be some average gain per unit time, along with some transient gain that depends on the initial state. We first look at the aggregate gain over a finite number of time units, thus providing a clean way of going to the limit.

Example 4.5.3. The example in Figure 4.6 provides some intuitive appreciation for the general problem. Note that the chain tends to persist in whatever state it is in for a relatively long time. Thus if the chain starts in state 2, not only is an immediate reward of 1 achieved, but there is a high probability of an additional gain of 1 on many successive transitions. Thus the aggregate value of starting in state 2 is considerably more than the immediate reward of 1. On the other hand, we see from symmetry that the expected gain per unit time, over a long time period, must be one half.

Figure 4.6: Markov chain with rewards. Two states with self-transition probability 0.99 and cross-transition probability 0.01; the rewards are $r_1 = 0$ and $r_2 = 1$.

Returning to the general case, it is convenient to work backward from a final time rather than forward from the initial time. This will be quite helpful later when we consider dynamic programming and Markov decision theory. For any final time m, define stage n as n time units before the final time, i.e., as time m − n in Figure 4.7. Equivalently, we often view the final time as time 0, and then stage n corresponds to time −n.

Figure 4.7: Alternate views of stages. The times m − n, m − n + 1, ..., m − 1, m (equivalently −n, −n + 1, ..., −1, 0) correspond to stages n, n − 1, ..., 1, 0.

As a final generalization of the problem (which will be helpful in the solution), we allow the reward at the final time (i.e., in stage 0) to be different from that at other times. The final reward in state i is denoted $u_i$, and $\mathbf{u} = (u_1, \ldots, u_M)^T$. We denote the expected aggregate reward from stage n up to and including the final stage (stage zero), given state i at stage n, as $v_i(n, \mathbf{u})$. Note that this notation takes advantage of the Markov property. That is, given that the chain is in state i at time −n (i.e., stage n), the expected aggregate reward up to and including time 0 is independent of the states before time −n and is independent of when the Markov chain started prior to time −n.

The expected aggregate reward can be found by starting at stage 1. Given that the chain is in state i at time −1, the immediate reward is $r_i$. The chain then makes a transition (with probability $P_{ij}$) to some state j at time 0 with a final reward of $u_j$. Thus

$$v_i(1, \mathbf{u}) = r_i + \sum_j P_{ij} u_j. \tag{4.31}$$

For the example of Figure 4.6 (assuming the final reward is the same as that at the other stages, i.e., $u_i = r_i$ for i = 1, 2), we have $v_1(1, \mathbf{u}) = 0.01$ and $v_2(1, \mathbf{u}) = 1.99$.

The expected aggregate reward for stage 2 can be calculated in the same way. Given state i at time −2 (i.e., stage 2), there is an immediate reward of $r_i$ and, with probability $P_{ij}$, the chain goes to state j at time −1 (i.e., stage 1) with an expected additional gain of $v_j(1, \mathbf{u})$. Thus

$$v_i(2, \mathbf{u}) = r_i + \sum_j P_{ij} v_j(1, \mathbf{u}). \tag{4.32}$$

Note that $v_j(1, \mathbf{u})$, as calculated in (4.31), includes the gain in stages 1 and 0 and does not depend on how state j was entered. Iterating the above argument to stages 3, 4, ..., n,

$$v_i(n, \mathbf{u}) = r_i + \sum_j P_{ij} v_j(n-1, \mathbf{u}). \tag{4.33}$$

This can be written in vector form as

$$\mathbf{v}(n, \mathbf{u}) = \mathbf{r} + [P]\,\mathbf{v}(n-1, \mathbf{u}); \qquad n \ge 1, \tag{4.34}$$

where $\mathbf{r}$ is a column vector with components $r_1, r_2, \ldots, r_M$ and $\mathbf{v}(n, \mathbf{u})$ is a column vector with components $v_1(n, \mathbf{u}), \ldots, v_M(n, \mathbf{u})$. By substituting (4.34), with n replaced by n − 1, into the last term of (4.34),

$$\mathbf{v}(n, \mathbf{u}) = \mathbf{r} + [P]\,\mathbf{r} + [P]^2\,\mathbf{v}(n-2, \mathbf{u}); \qquad n \ge 2. \tag{4.35}$$

Applying the same substitution recursively ...
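The backward recursion (4.31)–(4.34) is easy to check numerically for the chain of Figure 4.6. The minimal sketch below (NumPy assumed; $u_i = r_i$ as in the text) reproduces $v_1(1, \mathbf{u}) = 0.01$ and $v_2(1, \mathbf{u}) = 1.99$, and shows the per-stage gain approaching the symmetric value of one half over a long horizon.

```python
import numpy as np

# Two-state chain of Figure 4.6: P11 = P22 = 0.99, P12 = P21 = 0.01,
# rewards r1 = 0, r2 = 1, and final reward u_i = r_i as assumed in the text.
P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
u = r.copy()

def v_stage(n, u):
    """Expected aggregate reward v(n, u) via the recursion (4.34)."""
    v = u.copy()
    for _ in range(n):
        v = r + P @ v            # v(k, u) = r + [P] v(k-1, u)
    return v

print(v_stage(1, u))             # [0.01, 1.99], matching (4.31)
print(v_stage(1000, u) / 1000)   # per-stage gain; tends (slowly, since the chain
                                 # persists) toward 1/2 from either starting state
```

Note how the aggregate value of starting in state 2 stays well above that of state 1 even for large n, while the gain per stage from either state approaches the same long-run average.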