…recurrent class is ergodic, then, as seen in (4.47), this final term is asymptotically independent of the starting state and w, but depends on π u.
Example 4.5.4. In order to understand better why (4.47) can be false without the assumption of an ergodic unichain, consider a two-state periodic chain with P12 = P21 = 1, r1 = r2 = 0, and an arbitrary final reward with u1 ≠ u2. Then it is easy to see that for n even, v1(n) = u1, v2(n) = u2, and for n odd, v1(n) = u2, v2(n) = u1. Thus, the effect of the final reward on the initial state never dies out.
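This oscillation is easy to check numerically. The sketch below runs the backward recursion v(n) = r + [P]v(n−1) for the chain above, with an illustrative final reward u = (5, −3) (any u1 ≠ u2 would do):

```python
import numpy as np

# Two-state periodic chain of Example 4.5.4: P12 = P21 = 1, r1 = r2 = 0.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.zeros(2)
u = np.array([5.0, -3.0])   # illustrative final reward with u1 != u2

# Backward recursion: v(n) = r + [P] v(n-1), with v(0) = u.
v = u.copy()
for n in range(1, 7):
    v = r + P @ v
    print(n, v)   # odd n gives (u2, u1); even n gives (u1, u2)
```

Since r = 0, v(n) = [P]^n u, which simply swaps the two components at each stage; the influence of u on the initial state never fades.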
For a unichain with a periodic recurrent class of period d, as in the example above, it is a little hard to interpret w as an asymptotic relative gain vector, since the last term of (4.46) involves w also (i.e., the relative gain of starting in different states depends on both n and u). The trouble is that the final reward happens at a particular phase of the periodic variation, and the starting state determines the set of states at which the final reward is assigned. If we view the final reward as being randomized over a period, with equal probability of occurring at each phase, then, from (4.46),

(1/d) Σ_{m=0}^{d−1} [v(n + m, u) − (n + m)g e] = w + [P]^n (1/d)(I + [P] + · · · + [P]^{d−1}) {u − w}.

Going to the limit n → ∞, and using the result of Exercise 4.18, this becomes almost the same as the result for an ergodic unichain, i.e.,
lim_{n→∞} (1/d) Σ_{m=0}^{d−1} (v(n + m, u) − (n + m)g e) = w + (e π)u.    (4.48)

There is an interesting analogy between the steady-state vector π and the relative gain
vector w. If the recurrent class of states is ergodic, then any initial distribution on the states approaches the steady state with increasing time, and similarly the effect of any final gain vector becomes negligible (except for the choice of π u) with an increasing number of stages. On the other hand, if the recurrent class is periodic, then starting the Markov chain in steady state maintains the steady state, and similarly, choosing the final gain to be the relative gain vector maintains the same relative gain at each stage.
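A small numerical check of this symmetry, using an illustrative periodic two-state chain with rewards (the chain and numbers are my own choices, not from the text): taking the final gain equal to the relative gain vector w makes v(n) = n g e + w hold exactly at every stage, even though the chain is periodic.

```python
import numpy as np

# Illustrative periodic two-state chain with rewards (assumed numbers).
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([2.0, 0.0])
pi = np.array([0.5, 0.5])   # steady-state vector: pi = pi [P]
g = pi @ r                   # gain per stage; here g = 1
e = np.ones(2)

# Relative gain vector: w + g e = r + [P] w with pi @ w = 0 gives w = (0.5, -0.5).
w = np.array([0.5, -0.5])
assert np.allclose(w + g * e, r + P @ w)

# Starting the reward recursion from final gain w maintains the relative gain:
v = w.copy()
for n in range(1, 6):
    v = r + P @ v
    assert np.allclose(v, n * g * e + w)   # v(n, w) = n g e + w exactly
print(v)   # v(5, w) = [5.5 4.5]
```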
Theorem 4.9 treated only unichains, and it is sometimes useful to look at asymptotic expressions for chains with m > 1 recurrent classes. In this case, the quantity analogous to a relative gain vector can be expressed as a solution to

w + Σ_{i=1}^{m} g^{(i)} ν^{(i)} = r + [P]w,    (4.49)

where g^{(i)} is the gain of the ith recurrent class and ν^{(i)} is the corresponding right eigenvector of [P] (see Exercise 4.14). Using a solution to (4.49) as a final gain vector, we can repeat the argument in (4.44) to get

v(n, w) = w + n Σ_{i=1}^{m} g^{(i)} ν^{(i)}    for all n ≥ 1.    (4.50)

As expected, the average reward per stage depends on the recurrent class of the initial
state. If the initial state, j, is transient, the average reward per stage is averaged over the recurrent classes, using the probability ν_j^{(i)} that state j eventually reaches class i. For an arbitrary final reward vector u, (4.50) can be combined with (4.45) to get
v(n, u) = w + n Σ_{i=1}^{m} g^{(i)} ν^{(i)} + [P]^n {u − w}    for all n ≥ 1.    (4.51)

Eqn. (4.49) always has a solution (see Exercise 4.27), and in fact has an m-dimensional set of solutions given by w = w̃ + Σ_i α_i ν^{(i)}, where α_1, . . . , α_m can be chosen arbitrarily and w̃ is any given solution.
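To make (4.49) and (4.50) concrete, here is a sketch with m = 2 recurrent classes: two absorbing states fed by one transient state. The chain, the rewards, and the particular solution w below are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Illustrative chain: state 0 transient, states 1 and 2 absorbing (m = 2 classes).
P = np.array([[0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 1.0, 3.0])

g = np.array([1.0, 3.0])          # gains of class {1} and class {2}
# Right eigenvectors of [P] for eigenvalue 1: nu_j^(i) is the probability
# that state j eventually reaches class i.
nu = np.array([[0.6, 1.0, 0.0],   # nu^(1)
               [0.4, 0.0, 1.0]])  # nu^(2)
assert np.allclose(P @ nu[0], nu[0]) and np.allclose(P @ nu[1], nu[1])

gain = g @ nu                      # sum_i g^(i) nu^(i) = [1.8, 1, 3]
w = np.array([-3.6, 0.0, 0.0])     # one solution of w + sum_i g^(i) nu^(i) = r + [P]w
assert np.allclose(w + gain, r + P @ w)

# Verify (4.50): v(n, w) = w + n * sum_i g^(i) nu^(i), via the backward recursion.
v = w.copy()
for n in range(1, 6):
    v = r + P @ v
    assert np.allclose(v, w + n * gain)
print(v)   # v(5, w) = w + 5*gain = [5.4 5. 15.]
```

The transient state's reward per stage, 1.8, is exactly the mixture 0.6·g^(1) + 0.4·g^(2) of the class gains, as the paragraph above describes.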
4.6 Markov decision theory and dynamic programming

4.6.1 Introduction

In the previous section, we analyzed the behavior of a Markov chain with rewards. In this section, we consider a much more elaborate structure in which a decision maker can select between various possible decisions for rewards and transition probabilities. In place of the
reward r_i and the transition probabilities {P_ij ; 1 ≤ j ≤ M} associated with a given state i, there is a choice between some number K_i of different rewards, say r_i^{(1)}, r_i^{(2)}, . . . , r_i^{(K_i)}, and a corresponding choice between K_i different sets of transition probabilities, say {P_ij^{(1)} ; 1 ≤ j ≤ M}, {P_ij^{(2)} ; 1 ≤ j ≤ M}, . . . , {P_ij^{(K_i)} ; 1 ≤ j ≤ M}. A decision maker then decides between these K_i possible decisions each time the chain is in state i. Note that if the decision maker chooses decision k for state i, then the reward is r_i^{(k)} and the transition probabilities from state i are {P_ij^{(k)} ; 1 ≤ j ≤ M}; it is not possible to choose r_i^{(k)} for one k and {P_ij^{(k′)} ; 1 ≤ j ≤ M} for another k′. We assume that, given X_n = i and given decision k at time n, the probability of entering state j at time n + 1 is P_ij^{(k)}, independent of earlier states and decisions.
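As a sketch of this decision structure in code (the rewards and probabilities are invented for illustration, with K1 = 1 and K2 = 2 as in Figure 4.10): each state carries a list of (reward, transition-row) alternatives, and fixing one decision per state reduces the problem to an ordinary Markov chain with rewards.

```python
import numpy as np

# Each state i has K_i alternatives, each pairing a reward r_i^(k)
# with a transition row {P_ij^(k)}. All numbers here are illustrative.
alternatives = [
    [(0.0, [0.99, 0.01])],        # state 1: K_1 = 1, no freedom of choice
    [(1.0, [0.01, 0.99]),         # state 2, decision 1: modest reward, likely stay
     (50.0, [1.00, 0.00])],       # state 2, decision 2: large reward, then leave
]

def policy_matrices(policy):
    """Build the reward vector r and matrix [P] for a stationary policy,
    i.e., one fixed decision k per state."""
    r = np.array([alternatives[i][k][0] for i, k in enumerate(policy)])
    P = np.array([alternatives[i][k][1] for i, k in enumerate(policy)])
    return r, P

# Once a decision rule fixes k for every state, the analysis of the
# previous section (gain, relative gain vector, etc.) applies directly.
r, P = policy_matrices([0, 0])
print(r)                 # [0. 1.]
print(P.sum(axis=1))     # rows sum to 1: [1. 1.]
```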
Figure 4.10 shows an example of this situation in which the decision maker can choose between two possible decisions in state 2 (K2 = 2) and has no freedom of choice in state 1 (K1 = 1). This figure illustrates the familiar tradeoff between instant gratification (alternative 2) and...
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R.srikant during the Spring '09 term at University of Illinois, Urbana Champaign.