Discrete-time stochastic processes

# Definition 4.14: a state i in a Markov decision problem


…recurrent class is ergodic, then, as seen in (4.47), this final term is asymptotically independent of the starting state and $w$, but depends on $\pi u$.

**Example 4.5.4.** In order to understand better why (4.47) can be false without the assumption of an ergodic unichain, consider a two-state periodic chain with $P_{12} = P_{21} = 1$, $r_1 = r_2 = 0$, and an arbitrary final reward with $u_1 \ne u_2$. Then it is easy to see that for $n$ even, $v_1(n) = u_1$, $v_2(n) = u_2$, and for $n$ odd, $v_1(n) = u_2$, $v_2(n) = u_1$. Thus, the effect of the final reward on the initial state never dies out.

For a unichain with a periodic recurrent class of period $d$, as in the example above, it is a little hard to interpret $w$ as an asymptotic relative-gain vector, since the last term of (4.46) involves $w$ also (i.e., the relative gain of starting in different states depends on both $n$ and $u$). The trouble is that the final reward happens at a particular phase of the periodic variation, and the starting state determines the set of states at which the final reward is assigned. If we view the final reward as being randomized over a period, with equal probability of occurring at each phase, then, from (4.46),

$$
\sum_{m=0}^{d-1} \Big( v(n+m,\,u) - (n+m)\,g\,e \Big) \;=\; w + [P]^n \Big( I + [P] + \cdots + [P]^{d-1} \Big)\,\{u - w\}.
$$

Going to the limit $n \to \infty$, and using the result of Exercise 4.18, this becomes almost the same as the result for an ergodic unichain, i.e.,

$$
\lim_{n \to \infty} \sum_{m=0}^{d-1} \Big( v(n+m,\,u) - (n+m)\,g\,e \Big) \;=\; w + (e\,\pi)\,u. \tag{4.48}
$$

There is an interesting analogy between the steady-state vector $\pi$ and the relative-gain vector $w$. If the recurrent class of states is ergodic, then any initial distribution on the states approaches the steady state with increasing time, and similarly the effect of any final gain vector becomes negligible (except for the choice of $\pi u$) with an increasing number of stages.
On the other hand, if the recurrent class is periodic, then starting the Markov chain in steady state maintains the steady state, and similarly, choosing the final gain to be the relative-gain vector maintains the same relative gain at each stage.

Theorem 4.9 treated only unichains, and it is sometimes useful to look at asymptotic expressions for chains with $m > 1$ recurrent classes. In this case, the quantity analogous to a relative-gain vector can be expressed as a solution to

$$
w + \sum_{i=1}^{m} g^{(i)} \nu^{(i)} \;=\; r + [P]\,w, \tag{4.49}
$$

where $g^{(i)}$ is the gain of the $i$th recurrent class and $\nu^{(i)}$ is the corresponding right eigenvector of $[P]$ (see Exercise 4.14). Using a solution to (4.49) as a final gain vector, we can repeat the argument in (4.44) to get

$$
v(n,\,w) \;=\; w + n \sum_{i=1}^{m} g^{(i)} \nu^{(i)} \qquad \text{for all } n \ge 1. \tag{4.50}
$$

As expected, the average reward per stage depends on the recurrent class of the initial state. If the initial state, $j$, is transient, the average reward per stage is averaged over the recurrent classes, using the probability $\nu_j^{(i)}$ that state $j$ eventually reaches class $i$. For an arbitrary final reward vector $u$, (4.50) can be combined with (4.45) to get

$$
v(n,\,u) \;=\; w + n \sum_{i=1}^{m} g^{(i)} \nu^{(i)} + [P]^n \{u - w\} \qquad \text{for all } n \ge 1. \tag{4.51}
$$

Eqn. (4.49) always has a solution (see Exercise 4.27), and in fact has an $m$-dimensional set of solutions given by $\tilde{w} = w + \sum_i \alpha_i \nu^{(i)}$, where $\alpha_1, \ldots, \alpha_m$ can be chosen arbitrarily and $w$ is any given solution.

## 4.6 Markov decision theory and dynamic programming

### 4.6.1 Introduction

In the previous section, we analyzed the behavior of a Markov chain with rewards. In this section, we consider a much more elaborate structure in which a decision maker can select between various possible decisions for rewards and transition probabilities.
In place of the reward $r_i$ and the transition probabilities $\{P_{ij};\, 1 \le j \le M\}$ associated with a given state $i$, there is a choice between some number $K_i$ of different rewards, say $r_i^{(1)}, r_i^{(2)}, \ldots, r_i^{(K_i)}$, and a corresponding choice between $K_i$ different sets of transition probabilities, say $\{P_{ij}^{(1)};\, 1 \le j \le M\}$, $\{P_{ij}^{(2)};\, 1 \le j \le M\}, \ldots, \{P_{ij}^{(K_i)};\, 1 \le j \le M\}$. A decision maker then decides between these $K_i$ possible decisions each time the chain is in state $i$. Note that if the decision maker chooses decision $k$ for state $i$, then the reward is $r_i^{(k)}$ and the transition probabilities from state $i$ are $\{P_{ij}^{(k)};\, 1 \le j \le M\}$; it is not possible to choose $r_i^{(k)}$ for one $k$ and $\{P_{ij}^{(k)};\, 1 \le j \le M\}$ for another $k$. We assume that, given $X_n = i$ and given decision $k$ at time $n$, the probability of entering state $j$ at time $n+1$ is $P_{ij}^{(k)}$, independent of earlier states and decisions.

Figure 4.10 shows an example of this situation in which the decision maker can choose between two possible decisions in state 2 ($K_2 = 2$) and has no freedom of choice in state 1 ($K_1 = 1$). This figure illustrates the familiar tradeoff between instant gratification (alternative 2) and…
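One concrete way to render this decision structure in code is to store, for each state $i$, its $K_i$ alternatives $(r_i^{(k)}, \{P_{ij}^{(k)}\})$ and, given a next-stage value vector, pick the alternative maximizing $r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j$. The sketch below is hypothetical: the numbers are invented (they are not those of Figure 4.10), and the names `alts` and `one_stage` are this sketch's own, not notation from the text.

```python
import numpy as np

# Hypothetical decision structure: state 1 has K1 = 1 alternative, state 2
# has K2 = 2 alternatives (numbers invented, not those of Figure 4.10).
# alts[i][k] = (r_i^(k), transition row {P_ij^(k); 1 <= j <= M}), with M = 2.
alts = {
    1: [(0.0,  np.array([0.99, 0.01]))],
    2: [(1.0,  np.array([0.0, 1.0])),    # alternative 1: modest reward, stay put
        (50.0, np.array([1.0, 0.0]))],   # alternative 2: instant gratification
}

def one_stage(v_next):
    """For each state i, choose the k maximizing r_i^(k) + sum_j P_ij^(k) v_next_j."""
    vn = np.array([v_next[1], v_next[2]])
    v, decision = {}, {}
    for i, choices in alts.items():
        vals = [r_k + P_k @ vn for r_k, P_k in choices]
        decision[i] = int(np.argmax(vals)) + 1   # 1-based alternative index
        v[i] = float(max(vals))
    return v, decision

v, decision = one_stage({1: 0.0, 2: 0.0})
print(decision, v)   # with zero terminal value, state 2 picks alternative 2
```

Note that a decision selects the reward and the transition row together, matching the constraint in the text that $r_i^{(k)}$ and $\{P_{ij}^{(k)}\}$ cannot be mixed across different $k$.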

## This note was uploaded on 09/27/2010 for the course EE 229, taught by Professor R. Srikant during the Spring '09 term at the University of Illinois, Urbana-Champaign.
