Continuing in this way, we eventually get an explicit expression for v(n, u):

v(n, u) = r + [P]r + [P]^2 r + · · · + [P]^{n−1} r + [P]^n u.    (4.36)

Eq. (4.34), applied iteratively, is more convenient for calculating v(n, u) than (4.36), but neither gives us much insight into the behavior of the expected aggregate reward, especially for large n. We can get a little insight by averaging the components of (4.36) over the steady-state probability vector π. Since π[P]^m = π for all m and πr is, by definition, the steady-state gain per stage g, this gives us

π v(n, u) = ng + π u.    (4.37)

This result is not surprising, since when the chain starts in steady state at stage n, it remains in steady state, yielding a gain per stage of g until the final reward at stage 0.
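The identity (4.37) can be checked numerically by running the recursion v(n, u) = r + [P]v(n − 1, u) of (4.34). A minimal sketch, using an arbitrary ergodic three-state chain and reward vectors chosen purely for illustration (none of these numbers come from the text):

```python
# Check (4.37): pi v(n, u) = n g + pi u, for an arbitrary ergodic chain.
# The matrix and reward vectors below are assumptions for illustration only.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])    # per-stage reward vector (arbitrary)
u = np.array([3.0, -1.0, 0.5])   # final reward vector (arbitrary)

# Steady-state vector pi: left eigenvector of [P] for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
g = pi @ r                        # steady-state gain per stage

# Backward recursion (4.34): v(n, u) = r + [P] v(n-1, u), with v(0, u) = u.
v = u.copy()
for n in range(1, 51):
    v = r + P @ v
    assert abs(pi @ v - (n * g + pi @ u)) < 1e-9  # (4.37) at every stage
print("pi v(50, u) =", pi @ v, "  50 g + pi u =", 50 * g + pi @ u)
```

The assertion holds exactly in the induction π v(n, u) = g + π v(n − 1, u); the tolerance only absorbs floating-point rounding.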
For the example of Figure 4.6 (again assuming u = r), Figure 4.8 tabulates this steady-state expected aggregate gain and compares it with the expected aggregate gain v_i(n, u) for initial states 1 and 2. Note that v1(n, u) is always less than the steady-state average by an amount approaching 25 with increasing n. Similarly, v2(n, u) is greater than the average by the corresponding amount. In other words, for this example, v_i(n, u) − π v(n, u), for each state i, approaches a limit as n → ∞. This limit is called the asymptotic relative gain for starting in state i, relative to starting in steady state. In what follows, we shall see that this type of asymptotic behavior is quite general.
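The Figure 4.8 values can be regenerated from the same recursion. The sketch below assumes (inferred from the tabulated numbers, not stated in this excerpt) that the chain of Figure 4.6 has P11 = P22 = 0.99, P12 = P21 = 0.01 with r = (0, 1), so that π = (0.5, 0.5) and g = 0.5:

```python
# Reproduce Figure 4.8 and the limits v_i(n, r) - pi v(n, r) -> -25, +25.
# Assumption: the two-state chain of Figure 4.6 has P12 = P21 = 0.01, r = (0, 1);
# these values are inferred from the table, not stated in this excerpt.
import numpy as np

P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi = np.array([0.5, 0.5])        # uniform by symmetry
g = pi @ r                        # steady-state gain per stage, 0.5

v = r.copy()                      # v(0, r) = r, i.e. u = r as in the text
vals = {}
for n in range(1, 1001):
    v = r + P @ v                 # recursion (4.34)
    if n in (1, 100, 1000):
        vals[n] = (pi @ v, v[0], v[1])

# Compare with the Figure 4.8 rows for n = 1, 100, 1000.
assert abs(vals[1][1] - 0.01) < 1e-12 and abs(vals[1][2] - 1.99) < 1e-12
assert abs(vals[100][1] - 28.749) < 1e-3 and abs(vals[100][2] - 72.250) < 1e-3
assert abs(vals[1000][0] - 500.5) < 1e-6
# v_i(n, r) - pi v(n, r) is within a milli-unit of -25 and +25 by n = 1000:
assert abs(vals[1000][1] - 475.5) < 1e-3 and abs(vals[1000][2] - 525.5) < 1e-3
print(vals[1000])
```

For this chain the transient decays as 0.98^n, which is why the n = 100 row is still about 3.2 away from the ±25 limits while the n = 1000 row has essentially converged.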
n        π v(n, r)    v1(n, r)    v2(n, r)
1            1           0.01        1.99
2            1.5         0.0298      2.9702
4            2.5         0.098       4.902
10           5.5         0.518      10.482
40          20.5         6.420      34.580
100         50.5        28.749      72.250
400        200.5       175.507     225.492
1000       500.5       475.500     525.500

Figure 4.8: The expected aggregate reward, as a function of starting state and stage, for the example of Figure 4.6.

Initially we consider only ergodic Markov chains and first try to understand the asymptotic
behavior above at an intuitive level. For large n, the probability of being in state j at time 0, conditional on starting in i at time −n, is P^n_ij ≈ π_j. Thus, the expected final reward at time 0 is approximately π u for each possible starting state at time −n. For (4.36), this says that the final term [P]^n u is approximately (π u)e for large n. Similarly, in (4.36),
[P]^{n−m} r ≈ g e if n − m is large. This means that for very large n, each unit increase or decrease in n simply adds or subtracts g e from the vector gain. Thus, we might conjecture that, for large n, v(n, u) is the sum of an initial transient term w, an intermediate term ng e, and a final term, (π u)e, i.e.,

v(n, u) ≈ w + ng e + (π u)e,    (4.38)

where we also conjecture that the approximation becomes exact as n → ∞. Substituting (4.37) into (4.38), the conjecture (which we shall soon validate) is
v(n, u) ≈ w + (π v(n, u))e.    (4.39)

That is, the component w_i of w tells us how profitable it is, in the long term, to start in a particular state i rather than start in steady state. Thus w is called the asymptotic relative gain vector or, for brevity, the relative gain vector. In the example of the table above, w = (−25, +25).
There are two reasonable approaches to validate the conjecture above and to evaluate the relative gain vector w. The first is explored in Exercise 4.22 and expands on the intuitive argument leading to (4.38) to show that w is given by

w = ∑_{n=0}^∞ ([P]^n − e π) r.    (4.40)

This expression is not a very useful way to calculate w, and thus we follow the second approach here, which provides both a convenient expression for w and a proof that the approximation in (4.38) becomes exact in the limit.
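Although (4.40) is not a practical formula, its partial sums are easy to evaluate for the two-state example above (chain parameters inferred from the Figure 4.8 values; an assumption of this sketch), and they do converge to w = (−25, +25):

```python
# Partial sums of (4.40): w = sum_{n>=0} ([P]^n - e pi) r.
# Chain and rewards inferred from Figure 4.8 (an assumption, not from this excerpt).
import numpy as np

P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi = np.array([0.5, 0.5])
e = np.ones(2)

w = np.zeros(2)
Pn = np.eye(2)                    # [P]^0
for n in range(2000):
    w += (Pn - np.outer(e, pi)) @ r   # n-th term of (4.40)
    Pn = Pn @ P
print(w)                          # approaches [-25., 25.]
assert np.allclose(w, [-25.0, 25.0], atol=1e-6)
```

Each term here equals 0.98^n (−0.5, +0.5), so the sum is a geometric series with limit (−25, +25), matching the table.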
Rearranging (4.38) and going to the limit,

w = lim_{n→∞} {v(n, u) − ng e − (π u)e}.    (4.41)

The conjecture, which is still to be proven, is that the limit in (4.41) actually exists. We
now show that if this limit exists, w must have a particular form. In particular, substituting
(4.34) into (4.41),

w = lim_{n→∞} {r + [P]v(n − 1, u) − ng e − (π u)e}
  = r − g e + [P] lim_{n→∞} {v(n − 1, u) − (n − 1)g e − (π u)e}
  = r − g e + [P]w.

Thus, if the limit in (4.41) exists, that limiting vector w must satisfy w + g e = r + [P]w.
The following lemma shows that this equation has a solution. The lemma does not depend
on the conjecture in (4.41); we are simply using this conjecture to motivate why the equation
(4.42) is important.
Lemma 4.1. Let [P] be the transition matrix of an M-state unichain. Let r = (r1, . . . , rM)^T be a reward vector, let π = (π1, . . . , πM) be the steady-state probabilities of the chain, and let g = ∑_i π_i r_i. Then the equation

w + g e = r + [P]w    (4.42)

has a solution for w. With the additional condition π w = 0, that solution is unique.
Discussion: Note that v = r + [P]v in Example 4.5.1 is a special case of (4.42) in which π = (1, 0, . . . , 0) and r = (0, 1, . . . , 1)^T and thus g = 0. With the added condition v1 = π v...
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at University of Illinois, Urbana-Champaign.