Discrete-time stochastic processes

…recurrent class is ergodic, then, as seen in (4.47), this final term is asymptotically independent of the starting state and $w$, but depends on $\pi u$.

Example 4.5.4. In order to understand better why (4.47) can be false without the assumption of an ergodic unichain, consider a two-state periodic chain with $P_{12} = P_{21} = 1$, $r_1 = r_2 = 0$, and an arbitrary final reward with $u_1 \ne u_2$. Then it is easy to see that for $n$ even, $v_1(n) = u_1$, $v_2(n) = u_2$, and for $n$ odd, $v_1(n) = u_2$, $v_2(n) = u_1$. Thus, the effect of the final reward on the initial state never dies out.

For a unichain with a periodic recurrent class of period $d$, as in the example above, it is a little hard to interpret $w$ as an asymptotic relative gain vector, since the last term of (4.46) involves $w$ also (i.e., the relative gain of starting in different states depends on both $n$ and $u$). The trouble is that the final reward happens at a particular phase of the periodic variation, and the starting state determines the set of states at which the final reward is assigned. If we view the final reward as being randomized over a period, with equal probability of occurring at each phase, then, from (4.46),

$$\frac{1}{d}\sum_{m=0}^{d-1}\Big( v(n+m,\,u) - (n+m)\,g\,e \Big) = w + \frac{1}{d}\,[P]^n \big( I + [P] + \cdots + [P]^{d-1} \big)\{u - w\}.$$

Going to the limit $n \to \infty$, and using the result of Exercise 4.18, this becomes almost the same as the result for an ergodic unichain, i.e.,

$$\lim_{n\to\infty}\ \frac{1}{d}\sum_{m=0}^{d-1}\big( v(n+m,\,u) - (n+m)\,g\,e \big) = w + (e\,\pi)\,u. \qquad (4.48)$$

There is an interesting analogy between the steady-state vector $\pi$ and the relative gain vector $w$. If the recurrent class of states is ergodic, then any initial distribution on the states approaches the steady state with increasing time, and similarly the effect of any final gain vector becomes negligible (except for the choice of $\pi u$) with an increasing number of stages. On the other hand, if the recurrent class is periodic, then starting the Markov chain in steady state maintains the steady state, and similarly, choosing the final gain to be the relative gain vector maintains the same relative gain at each stage.

Theorem 4.9 treated only unichains, and it is sometimes useful to look at asymptotic expressions for chains with $m > 1$ recurrent classes. In this case, the quantity analogous to a relative gain vector can be expressed as a solution to

$$w + \sum_{i=1}^{m} g^{(i)} \nu^{(i)} = r + [P]\,w, \qquad (4.49)$$

where $g^{(i)}$ is the gain of the $i$th recurrent class and $\nu^{(i)}$ is the corresponding right eigenvector of $[P]$ (see Exercise 4.14). Using a solution to (4.49) as a final gain vector, we can repeat the argument in (4.44) to get

$$v(n,\,w) = w + n\sum_{i=1}^{m} g^{(i)} \nu^{(i)} \qquad \text{for all } n \ge 1. \qquad (4.50)$$

As expected, the average reward per stage depends on the recurrent class of the initial state. If the initial state $j$ is transient, the average reward per stage is averaged over the recurrent classes, using the probability $\nu_j^{(i)}$ that state $j$ eventually reaches class $i$. For an arbitrary final reward vector $u$, (4.50) can be combined with (4.45) to get

$$v(n,\,u) = w + n\sum_{i=1}^{m} g^{(i)} \nu^{(i)} + [P]^n\{u - w\} \qquad \text{for all } n \ge 1. \qquad (4.51)$$

Eqn. (4.49) always has a solution (see Exercise 4.27), and in fact has an $m$-dimensional set of solutions given by $w = \tilde{w} + \sum_i \alpha_i \nu^{(i)}$, where $\alpha_1, \ldots, \alpha_m$ can be chosen arbitrarily and $\tilde{w}$ is any given solution.
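Before moving on to decision problems, the two-state chain of Example 4.5.4 and the period-averaged limit (4.48) are easy to check numerically. The sketch below is not from the text; it is a minimal illustration assuming the backward recursion $v(n,u) = r + [P]\,v(n-1,u)$ with $v(0,u) = u$, and the particular final reward values are arbitrary. It shows the final reward never dying out, while the average over one period settles at $(\pi u)e$.

```python
import numpy as np

# Two-state periodic chain of Example 4.5.4: P12 = P21 = 1, r1 = r2 = 0,
# with an arbitrary final reward u having u1 != u2.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([0.0, 0.0])
u = np.array([5.0, -3.0])          # any values with u1 != u2 will do

# Backward recursion v(n, u) = r + [P] v(n-1, u), starting from v(0, u) = u.
v = u.copy()
history = [v]
for n in range(1, 7):
    v = r + P @ v
    history.append(v)
    print(f"n = {n}: v(n) = {v}")  # alternates between (u1, u2) and (u2, u1)

# Period-averaged quantity as in (4.48): here g = 0, w = 0, pi = (1/2, 1/2),
# so (1/d)[v(n) + v(n+1)] should equal (pi u) e in every component.
pi = np.array([0.5, 0.5])
print(0.5 * (history[-1] + history[-2]), "vs", pi @ u)
```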
4.6 Markov decision theory and dynamic programming

4.6.1 Introduction

In the previous section, we analyzed the behavior of a Markov chain with rewards. In this section, we consider a much more elaborate structure in which a decision maker can select between various possible decisions for rewards and transition probabilities. In place of the reward $r_i$ and the transition probabilities $\{P_{ij};\ 1 \le j \le M\}$ associated with a given state $i$, there is a choice between some number $K_i$ of different rewards, say $r_i^{(1)}, r_i^{(2)}, \ldots, r_i^{(K_i)}$, and a corresponding choice between $K_i$ different sets of transition probabilities, say $\{P_{ij}^{(1)};\ 1 \le j \le M\}, \{P_{ij}^{(2)};\ 1 \le j \le M\}, \ldots, \{P_{ij}^{(K_i)};\ 1 \le j \le M\}$. A decision maker then decides between these $K_i$ possible decisions each time the chain is in state $i$. Note that if the decision maker chooses decision $k$ for state $i$, then the reward is $r_i^{(k)}$ and the transition probabilities from state $i$ are $\{P_{ij}^{(k)};\ 1 \le j \le M\}$; it is not possible to choose $r_i^{(k)}$ for one $k$ and $\{P_{ij}^{(k)};\ 1 \le j \le M\}$ for another $k$. We assume that, given $X_n = i$ and given decision $k$ at time $n$, the probability of entering state $j$ at time $n+1$ is $P_{ij}^{(k)}$, independent of earlier states and decisions. Figure 4.10 shows an example of this situation in which the decision maker can choose between two possible decisions in state 2 ($K_2 = 2$) and has no freedom of choice in state 1 ($K_1 = 1$). This figure illustrates the familiar tradeoff between instant gratification (alternative 2) and...
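As a concrete rendering of this structure, the sketch below stores, for each state $i$, its list of decisions $(r_i^{(k)}, \{P_{ij}^{(k)};\ 1 \le j \le M\})$ and evaluates the expected aggregate reward when one fixed decision is used in each state, via the same backward recursion used for Markov chains with rewards. The rewards and transition probabilities here are invented for illustration; they are not the values of Figure 4.10, which is not reproduced in this excerpt.

```python
import numpy as np

# Hypothetical two-state decision problem: state 1 has K1 = 1 decision,
# state 2 has K2 = 2 decisions.  decisions[i] lists pairs
# (r_i^(k), [P_i1^(k), ..., P_iM^(k)]) for k = 1, ..., K_i.
decisions = {
    1: [(0.0, [0.99, 0.01])],                # no freedom of choice in state 1
    2: [(1.0, [0.01, 0.99]),                 # decision 1: small reward, tend to stay
        (50.0, [1.00, 0.00])],               # decision 2: large reward, then leave
}

def induced_chain(rule):
    """Given one fixed decision per state ({state: decision index}), return the
    reward vector r and transition matrix [P] of the resulting chain with rewards."""
    M = len(decisions)
    r = np.zeros(M)
    P = np.zeros((M, M))
    for i in range(1, M + 1):
        reward, row = decisions[i][rule[i]]
        r[i - 1] = reward
        P[i - 1, :] = row
    return r, P

def v(n, u, rule):
    """Expected aggregate reward over n stages with final reward vector u,
    computed by the backward recursion v(n, u) = r + [P] v(n-1, u)."""
    r, P = induced_chain(rule)
    val = np.array(u, dtype=float)
    for _ in range(n):
        val = r + P @ val
    return val

print(v(100, [0.0, 0.0], {1: 0, 2: 0}))      # always take decision 1 in state 2
print(v(100, [0.0, 0.0], {1: 0, 2: 1}))      # "instant gratification" decision 2
```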