= 0, the solution is unique. Example 4.5.2 is the same, except that r is
different, and thus also has a unique solution.
Proof: Rewrite (4.42) as

$$\{[P] - [I]\}\,w = g e - r. \tag{4.43}$$

Let $\tilde{w}$ be a particular solution to (4.43) (if one exists). Then any solution to (4.43)
can be expressed as $\tilde{w} + x$ for some $x$ that satisfies the homogeneous equation
$\{[P] - [I]\}x = 0$. For $x$ to satisfy $\{[P] - [I]\}x = 0$, however, $x$ must be a right
eigenvector of $[P]$ with eigenvalue 1. From Theorem 4.8, $x$ must have the form $\alpha e$ for
some number $\alpha$. This means that if a particular solution $\tilde{w}$ to (4.43) exists,
then all solutions have the form $w = \tilde{w} + \alpha e$. For a particular solution to (4.43)
to exist, $g e - r$ must lie in the column space of the matrix $[P] - [I]$. This column space is
the space orthogonal to the left null space of $[P] - [I]$. This left null space, however, is
simply the set of left eigenvectors of $[P]$ of eigenvalue 1, i.e., the scalar multiples of
$\pi$. Thus, a particular solution exists iff $\pi(g e - r) = 0$. Since $\pi g e = g$ and
$\pi r = g$, this equality is satisfied and a particular solution exists. Since all solutions
have the form $w = \tilde{w} + \alpha e$, setting $\pi w = 0$ determines the value of $\alpha$
to be $-\pi\tilde{w}$, thus yielding a unique solution with $\pi w = 0$ and completing the proof.

4.5. MARKOV CHAINS WITH REWARDS 163

It is not necessary to assume that $g = \pi r$ in the lemma. If $g$ is treated as a variable in
(4.42), then, by premultiplying any solution $w, g$ of (4.42) by $\pi$, we find that $g = \pi r$
must be satisfied. This means that (4.42) can be viewed as M linear equations in the M + 1
variables $w, g$, and the set of solutions can be found without first calculating $\pi$.
Naturally, $\pi$ must be found to find the particular solution with $\pi w = 0$.
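
The remark above can be sketched numerically. The block below is only an illustration under stated assumptions: it takes the chain of Figure 4.6 to be the symmetric two-state chain with P12 = P21 = 0.01 and r = (0, 1), which is an assumption (the figure is not reproduced in this section) chosen because it is consistent with the value w = (−25, 25) quoted later. The helper names `solve`, `shifted_relative_gain` and `steady_state` are hypothetical. It pins w_1 = 0 to pick one solution of (4.42) without using π, and only afterwards computes π to shift that solution so that π w = 0.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Use any row at or below the diagonal with a nonzero entry as pivot.
        piv = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for i in range(n):
            if i != col and M[i][col] != 0:
                f = M[i][col]
                M[i] = [a - f * c for a, c in zip(M[i], M[col])]
    return [M[i][n] for i in range(n)]

def shifted_relative_gain(P, r):
    """Solve w + g e = r + [P] w with w[0] pinned to 0; returns (w, g)."""
    m = len(P)
    # Unknowns are w[1], ..., w[m-1] and g. Row i encodes
    #   sum_j (delta_ij - P_ij) w_j + g = r_i   with w_0 = 0.
    A = [[(1 if i == j else 0) - P[i][j] for j in range(1, m)] + [1]
         for i in range(m)]
    x = solve(A, r)
    return [F(0)] + x[:-1], x[-1]

def steady_state(P):
    """Solve pi [P] = pi together with sum(pi) = 1."""
    m = len(P)
    # Rows of the transpose of [P] - [I]; one balance equation is redundant,
    # so the last one is replaced by the normalization row.
    A = [[P[j][i] - (1 if i == j else 0) for j in range(m)]
         for i in range(m - 1)]
    A.append([1] * m)
    return solve(A, [0] * (m - 1) + [1])

# ASSUMPTION: transition matrix taken to match Figure 4.6 (not shown here).
P = [[F(99, 100), F(1, 100)],
     [F(1, 100), F(99, 100)]]
r = [F(0), F(1)]

w_shift, g = shifted_relative_gain(P, r)   # found without pi
pi = steady_state(P)                       # needed only for the normalization
alpha = -sum(p * x for p, x in zip(pi, w_shift))
w = [x + alpha for x in w_shift]           # the solution with pi w = 0

print("g =", g, " shifted w =", w_shift, " w =", w)
```

Pinning w_1 = 0 is one arbitrary way to select a member of the solution set; any other component, or any other constant, would serve equally well, since all solutions differ only by a multiple of e.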
If the final reward vector is chosen to be any solution $w$ of (4.42) (not necessarily the one
with $\pi w = 0$), then

$$\begin{aligned}
v(1, w) &= r + [P]w = w + g e \\
v(2, w) &= r + [P]\{w + g e\} = w + 2 g e \\
&\;\;\vdots \\
v(n, w) &= r + [P]\{w + (n-1) g e\} = w + n g e.
\end{aligned} \tag{4.44}$$

This is a simple explicit expression for expected aggregate gain for this special final reward
vector. We now show how to use this to get a simple expression for $v(n, u)$ for arbitrary
$u$. From (4.36),

$$v(n, u) - v(n, w) = [P]^n \{u - w\}. \tag{4.45}$$

Note that this is valid for any Markov unichain and any reward vector. Substituting (4.44)
into (4.45),

$$v(n, u) = n g e + w + [P]^n \{u - w\}. \tag{4.46}$$

It should now be clear why we wanted to allow the final reward vector to differ from the
reward vector at other stages. The result is summarized in the following theorem:
Theorem 4.9. Let $[P]$ be the transition matrix of a unichain. Let $r$ be a reward vector and
$w$ a solution to (4.42). Then the expected aggregate reward vector over $n$ stages is given by
(4.46). If the unichain is ergodic and $w$ satisfies $\pi w = 0$, then

$$\lim_{n\to\infty} \{v(n, u) - n g e\} = w + (\pi u) e. \tag{4.47}$$

Proof: The argument above established (4.46). If the recurrent class is ergodic, then $[P]^n$
approaches a matrix whose rows each equal $\pi$, so that $[P]^n\{u - w\}$ approaches
$(\pi u - \pi w) e = (\pi u) e$ when $\pi w = 0$, and (4.47) follows.
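
As a rough numerical check of (4.46) and (4.47), the sketch below iterates v(n, u) = r + [P] v(n − 1, u) from v(0, u) = u and compares the result with the closed form. It rests on assumptions that should be read as such: the chain is taken to be the symmetric two-state chain with P12 = P21 = 0.01 and r = (0, 1) (assumed to match Figure 4.6, with g = 1/2 and the quoted w = (−25, 25)), the final reward vector u = (3, −1) is an arbitrary illustrative choice, and the function names are hypothetical.

```python
from fractions import Fraction as F

def mat_vec(P, x):
    """Return the product [P] x for a matrix given as a list of rows."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]

def aggregate_reward(P, r, u, n):
    """v(n, u), computed by iterating v(k, u) = r + [P] v(k-1, u), v(0, u) = u."""
    v = list(u)
    for _ in range(n):
        v = [ri + y for ri, y in zip(r, mat_vec(P, v))]
    return v

def closed_form(P, g, w, u, n):
    """Right side of (4.46): n g e + w + [P]^n (u - w)."""
    d = [ui - wi for ui, wi in zip(u, w)]
    for _ in range(n):
        d = mat_vec(P, d)
    return [n * g + wi + di for wi, di in zip(w, d)]

# ASSUMPTION: chain taken to match Figure 4.6; g and w as quoted for it.
P = [[F(99, 100), F(1, 100)],
     [F(1, 100), F(99, 100)]]
r = [F(0), F(1)]
pi = [F(1, 2), F(1, 2)]
g = F(1, 2)
w = [F(-25), F(25)]
u = [F(3), F(-1)]          # hypothetical final reward vector

# (4.46) holds exactly for every n, checked here in rational arithmetic.
exact_match = all(
    aggregate_reward(P, r, u, n) == closed_form(P, g, w, u, n)
    for n in range(41)
)

# (4.47): v(n, u) - n g e approaches w + (pi u) e as n grows.
n = 1000
pi_u = sum(p * ui for p, ui in zip(pi, u))
limit = [wi + pi_u for wi in w]
v = aggregate_reward(P, r, u, n)
gap = max(abs(vi - n * g - li) for vi, li in zip(v, limit))

print("(4.46) exact for n = 0..40:", exact_match)
print("distance from the limit in (4.47) at n = 1000:", float(gap))
```

For this assumed chain the second eigenvalue of [P] is 0.98, so the gap shrinks only by a factor of 0.98 per stage; this is why a fairly large n is used before comparing against the limit.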
The set of solutions to (4.42) has the form $w + \alpha e$, where $w$ satisfies $\pi w = 0$ and
$\alpha$ is any real number. The term $\alpha e$ cancels out in (4.46), since $[P]^n e = e$, so
any solution can be used. In (4.47), however, the restriction to $\pi w = 0$ is necessary. We
have defined the (asymptotic) relative gain vector $w$ to satisfy $\pi w = 0$ so that, in the
ergodic case, the expected aggregate gain $v(n, u)$ can be cleanly split into an initial
transient $w$, the intermediate gain $n g e$ (a gain of $g$ per stage), and the final gain
$(\pi u) e$, as in (4.47). We shall call other solutions to (4.42) shifted relative gain vectors.

164 CHAPTER 4. FINITE-STATE MARKOV CHAINS

Recall that Examples 4.5.1 and 4.5.2 showed that the aggregate reward $v_i$ from state $i$ to
enter a trapping state, state 1, is given by the solution to $v = r + [P]v$, $v_1 = 0$. This
aggregate reward, in the general setup of Theorem 4.9, is $\lim_{n\to\infty} v(n, u)$. Since
$g = 0$ and $u = 0$ in these examples, (4.47) simplifies to $\lim_{n\to\infty} v(n, u) = w$,
where $w = r + [P]w$ and $\pi w = w_1 = 0$. Thus, we see that (4.47) gives the same answer as we
got in these examples.
For the example in Figure 4.6, we have seen that $w = (-25, 25)$ (see Exercise 4.21 also).
The large relative gain for state 2 accounts for both the immediate reward and the high
probability of multiple additional rewards through remaining in state 2. Note that $w_2$
cannot be interpreted as the expected reward up to the first transition from state 2 to 1. The
reason for this is that the gain starting from state 1 cannot be ignored; this can be seen
from Figure 4.9, which modifies Figure 4.6 by changing $P_{12}$ to 1. In this case (see Exercise
4.21), $w_2 - w_1 = 1/1.01 \approx 0.99$, reflecting the fact that state 1 is always left
immediately, thus reducing the advantage of starting in state 2.
[Figure 4.9: A variation of Figure 4.6. Two-state chain with $r_1 = 0$, $r_2 = 1$ and transition
probabilities $P_{12} = 1$, $P_{21} = 0.01$, $P_{22} = 0.99$.]
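
The value $w_2 - w_1 = 1/1.01$ can be checked by hand. The following is a sketch that assumes the transition probabilities of Figure 4.9 are $P_{12} = 1$, $P_{21} = 0.01$, $P_{22} = 0.99$, with $r = (0, 1)$; it uses only the steady-state equations and the first row of $w + g e = r + [P]w$:

$$\begin{aligned}
\pi_1 &= 0.01\,\pi_2, \quad \pi_1 + \pi_2 = 1 &&\Rightarrow\quad \pi = \left(\tfrac{0.01}{1.01},\ \tfrac{1}{1.01}\right), \\
g &= \pi r = \pi_2 = \tfrac{1}{1.01}, \\
w_1 + g &= r_1 + P_{11} w_1 + P_{12} w_2 = w_2 &&\Rightarrow\quad w_2 - w_1 = g = \tfrac{1}{1.01} \approx 0.99.
\end{aligned}$$

By comparison, the vector $w = (-25, 25)$ quoted for Figure 4.6 gives $w_2 - w_1 = 50$.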
We can now interpret the general solution in (4.46) by viewing $g e$ as the steady-state gain
per stage, viewing $w$ as the dependence on the initial state, and viewing $[P]^n\{u - w\}$ as
the dependence on the final reward vector $u$. If the r...
 Spring '09
 R.Srikant
