Discrete-time stochastic processes



Applying (4.34) iteratively, we eventually get an explicit expression for v(n, u),

    v(n, u) = r + [P]r + [P]^2 r + ··· + [P]^{n−1} r + [P]^n u.        (4.36)

Eq. (4.34), applied iteratively, is more convenient for calculating v(n, u) than (4.36), but neither gives us much insight into the behavior of the expected aggregate reward, especially for large n. We can get a little insight by averaging the components of (4.36) over the steady-state probability vector π. Since π[P]^m = π for all m and π r is, by definition, the steady-state gain per stage g, this gives us

    π v(n, u) = ng + π u.        (4.37)

This result is not surprising, since when the chain starts in steady state at stage n, it remains in steady state, yielding a gain per stage of g until the final reward at stage 0. For the example of Figure 4.6 (again assuming u = r), Figure 4.8 tabulates this steady-state expected aggregate gain and compares it with the expected aggregate gain v_i(n, u) for initial states 1 and 2. Note that v_1(n, u) is always less than the steady-state average by an amount approaching 25 with increasing n. Similarly, v_2(n, u) is greater than the average by the corresponding amount. In other words, for this example, v_i(n, u) − π v(n, u), for each state i, approaches a limit as n → ∞. This limit is called the asymptotic relative gain for starting in state i, relative to starting in steady state. In what follows, we shall see that this type of asymptotic behavior is quite general.

        n     v_1(n, r)   v_2(n, r)   π v(n, r)
        1       0.01        1.99         1
        2       0.0298      2.9702       1.5
        4       0.098       4.902        2.5
       10       0.518      10.482        5.5
       40       6.420      34.580       20.5
      100      28.749      72.250       50.5
      400     175.507     225.492      200.5
     1000     475.500     525.500      500.5

    Figure 4.8: The expected aggregate reward, as a function of starting state and stage, for the example of Figure 4.6.

Initially we consider only ergodic Markov chains and first try to understand the asymptotic behavior above at an intuitive level. For large n, the probability of being in state j at time 0, conditional on starting in state i at time −n, is P^n_{ij} ≈ π_j. Thus the expected final reward at time 0 is approximately π u for each possible starting state at time −n. For (4.36), this says that the final term [P]^n u is approximately (π u)e for large n. Similarly, in (4.36), [P]^{n−m} r ≈ g e if n − m is large. This means that for very large n, each unit increase or decrease in n simply adds or subtracts g e to or from the vector gain. Thus we might conjecture that, for large n, v(n, u) is the sum of an initial transient term w, an intermediate term ng e, and a final term (π u)e, i.e.,

    v(n, u) ≈ w + ng e + (π u)e,        (4.38)

where we also conjecture that the approximation becomes exact as n → ∞. Substituting (4.37) into (4.38), the conjecture (which we shall soon validate) is

    v(n, u) ≈ w + (π v(n, u))e.        (4.39)

That is, the component w_i of w tells us how profitable it is, in the long term, to start in a particular state i rather than to start in steady state. Thus w is called the asymptotic relative gain vector or, for brevity, the relative gain vector. In the example of the table above, w = (−25, +25).

There are two reasonable approaches to validate the conjecture above and to evaluate the relative gain vector w. The first is explored in Exercise 4.22 and expands on the intuitive argument leading to (4.38) to show that w is given by

    w = Σ_{i=0}^{∞} ([P]^i − e π) r.        (4.40)
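As a quick numerical illustration of (4.40), the sketch below truncates the series for a small example. The NumPy usage and the two-state chain (transition probabilities 0.99/0.01 between the states, rewards r = (0, 1)) are assumptions made here for concreteness; these parameters are not given in this excerpt, but they reproduce the values tabulated in Figure 4.8, so the truncated sum should come out close to w = (−25, +25).

```python
import numpy as np

# Illustrative two-state chain: these parameters are NOT given in the excerpt;
# they are assumed here because they reproduce the values in Figure 4.8
# (g = 0.5, w = (-25, +25)).
P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])

# Steady-state vector pi: solve pi [P] = pi with the components summing to 1.
A = np.vstack([np.eye(2) - P.T, np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
g = pi @ r                          # steady-state gain per stage (here 0.5)

# Truncate the series (4.40): w ~ sum_{i=0}^{N-1} ([P]^i - e pi) r.
e_pi = np.outer(np.ones(2), pi)     # the rank-one matrix e pi
w = np.zeros(2)
P_power = np.eye(2)                 # [P]^0
for _ in range(2000):               # this chain mixes slowly, so many terms are needed
    w += (P_power - e_pi) @ r
    P_power = P_power @ P

print(g)                            # ~ 0.5
print(w)                            # ~ [-25.  25.]
```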
The series expression (4.40) is not a very useful way to calculate w, and thus we follow the second approach here, which provides both a convenient expression for w and a proof that the approximation in (4.38) becomes exact in the limit. Rearranging (4.38) and going to the limit,

    w = lim_{n→∞} { v(n, u) − ng e − (π u)e }.        (4.41)

The conjecture, which is still to be proven, is that the limit in (4.41) actually exists. We now show that if this limit exists, w must have a particular form. In particular, substituting (4.34) into (4.41),

    w = lim_{n→∞} { r + [P]v(n−1, u) − ng e − (π u)e }
      = r − g e + [P] lim_{n→∞} { v(n−1, u) − (n−1)g e − (π u)e }
      = r − g e + [P]w.

Thus, if the limit in (4.41) exists, that limiting vector w must satisfy w + g e = r + [P]w. The following lemma shows that this equation has a solution. The lemma does not depend on the conjecture in (4.41); we are simply using this conjecture to motivate why the equation (4.42) is important.

Lemma 4.1. Let [P] be the transition matrix of an M-state unichain. Let r = (r_1, ..., r_M)^T be a reward vector, let π = (π_1, ..., π_M) be the steady-state probabilities of the chain, and let g = Σ_i π_i r_i. Then the equation

    w + g e = r + [P]w        (4.42)

has a solution for w. With the additional condition π w = 0, that solution is unique.

Discussion: Note that v = r + [P]v in Example 4.5.1 is a special case of (4.42) in which π = (1, 0, ..., 0), r = (0, 1, ..., 1)^T, and thus g = 0. With the added condition v_1 = π v …
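As a sanity check on Lemma 4.1 and on the conjecture (4.38), the sketch below solves (4.42) together with the normalization π w = 0 as an ordinary linear system, and then compares the result with the limit in (4.41) obtained by iterating (4.34). As before, NumPy and the two-state example chain are assumptions, not part of the excerpt.

```python
import numpy as np

# Same illustrative two-state chain as in the previous sketch (an assumption,
# chosen to be consistent with Figure 4.8).
P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi = np.array([0.5, 0.5])            # steady-state vector of this symmetric chain
e = np.ones(2)
g = pi @ r                           # g = 0.5

# (4.42) rearranges to (I - [P]) w = r - g e.  The matrix I - [P] is singular,
# so append the normalization pi w = 0 and solve in the least-squares sense.
A = np.vstack([np.eye(2) - P, pi])
b = np.concatenate([r - g * e, [0.0]])
w, *_ = np.linalg.lstsq(A, b, rcond=None)
print(w)                             # ~ [-25.  25.]

# Cross-check against the limit in (4.41): iterate (4.34),
# v(n, u) = r + [P] v(n-1, u) with u = r, and look at v(n,u) - n g e - (pi u) e.
n, v = 1000, r.copy()                # v(0, u) = u = r
for _ in range(n):
    v = r + P @ v
print(v - n * g * e - (pi @ r) * e)  # ~ [-25.  25.]
```

Solving (4.42) with the side condition π w = 0 is the route suggested by the lemma's uniqueness statement; the iterative check merely confirms that v(n, u) − ng e − (π u)e settles at the same vector for this example.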