Discrete-time stochastic processes

where g = Σi πi ri is the steady-state gain


… = 0, the solution is unique. Example 4.5.2 is the same, except that r is different, and thus also has a unique solution.

Proof: Rewrite (4.42) as

{[P] − [I]}w = g e − r.    (4.43)

Let w̃ be a particular solution to (4.43) (if one exists). Then any solution to (4.43) can be expressed as w̃ + x for some x that satisfies the homogeneous equation {[P] − [I]}x = 0. For x to satisfy {[P] − [I]}x = 0, however, x must be a right eigenvector of [P] with eigenvalue 1. From Theorem 4.8, x must have the form αe for some number α. This means that if a particular solution w̃ to (4.43) exists, then all solutions have the form w = w̃ + αe.

For a particular solution to (4.43) to exist, g e − r must lie in the column space of the matrix [P] − [I]. This column space is the space orthogonal to the left null space of [P] − [I]. This left null space, however, is simply the set of left eigenvectors of [P] of eigenvalue 1, i.e., the scalar multiples of π. Thus, a particular solution exists iff π(g e − r) = 0. Since π g e = g and π r = g, this equality is satisfied and a particular solution exists.

4.5. MARKOV CHAINS WITH REWARDS 163

Since all solutions have the form w = w̃ + αe, setting π w = 0 determines the value of α to be −π w̃, thus yielding a unique solution with π w = 0 and completing the proof.

It is not necessary to assume that g = π r in the lemma. If g is treated as a variable in (4.42), then, by pre-multiplying any solution w, g of (4.42) by π, we find that g = π r must be satisfied. This means that (4.42) can be viewed as M linear equations in the M + 1 variables w, g, and the set of solutions can be found without first calculating π. Naturally, π must be found to find the particular solution with π w = 0.
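The construction in the proof can be checked numerically: replace one (redundant) row of {[P] − [I]}w = g e − r by the normalization π w = 0 and solve. A minimal sketch follows; the two-state chain used here is an assumption chosen to match Figure 4.6 as discussed later in the text (its relative-gain vector is stated to be w = (−25, 25)), namely P12 = P21 = 0.01 with rewards r = (0, 1).

```python
def solve_linear(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting (square A)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Assumed two-state chain (matching the stated w = (-25, 25) of Figure 4.6).
P = [[0.99, 0.01],
     [0.01, 0.99]]
r = [0.0, 1.0]

# Steady-state vector pi from pi = pi [P] together with pi1 + pi2 = 1,
# then the gain per stage g = pi . r.
pi = solve_linear([[P[0][0] - 1, P[1][0]],
                   [1.0, 1.0]],
                  [0.0, 1.0])
g = sum(p * ri for p, ri in zip(pi, r))

# (4.43) with the redundant second row replaced by pi . w = 0.
w = solve_linear([[P[0][0] - 1, P[0][1]],
                  [pi[0], pi[1]]],
                 [g - r[0], 0.0])

print(pi, g, w)   # expect pi = (0.5, 0.5), g = 0.5, w = (-25, 25)
```

Replacing a row by π w = 0 is exactly the step in the proof where the free parameter α is pinned down: the homogeneous solutions αe are removed by the normalization.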
If the final reward vector is chosen to be any solution w of (4.42) (not necessarily the one with π w = 0), then

v(1, w) = r + [P]w = w + g e
v(2, w) = r + [P]{w + g e} = w + 2g e
···
v(n, w) = r + [P]{w + (n − 1)g e} = w + ng e.    (4.44)

This is a simple explicit expression for the expected aggregate gain for this special final reward vector. We now show how to use this to get a simple expression for v(n, u) for arbitrary u. From (4.36),

v(n, u) − v(n, w) = [P]^n {u − w}.    (4.45)

Note that this is valid for any Markov unichain and any reward vector. Substituting (4.44) into (4.45),

v(n, u) = ng e + w + [P]^n {u − w}.    (4.46)

It should now be clear why we wanted to allow the final reward vector to differ from the reward vector at other stages. The result is summarized in the following theorem:

Theorem 4.9. Let [P] be the transition matrix of a unichain. Let r be a reward vector and w a solution to (4.42). Then the expected aggregate reward vector over n stages is given by (4.46). If the unichain is ergodic and w satisfies π w = 0, then

lim_{n→∞} {v(n, u) − ng e} = w + (π u)e.    (4.47)

Proof: The argument above established (4.46). If the recurrent class is ergodic, then [P]^n approaches a matrix whose rows each equal π, and (4.47) follows.

The set of solutions to (4.42) has the form w + αe, where w satisfies π w = 0 and α is any real number. The factor α cancels out in (4.46), so any solution can be used. In (4.47), however, the restriction to π w = 0 is necessary. We have defined the (asymptotic) relative gain vector w to satisfy π w = 0 so that, in the ergodic case, the expected aggregate gain v(n, u) can be cleanly split into an initial transient w, an intermediate gain of g e per stage, and the final gain (π u)e, as in (4.47). We shall call other solutions to (4.42) shifted relative gain vectors.

164 CHAPTER 4.
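Both (4.44) and the limit (4.47) can be verified by simply iterating the backward recursion v(n) = r + [P]v(n − 1). The sketch below assumes the same illustrative two-state chain with w = (−25, 25), π = (1/2, 1/2), g = 1/2; the final reward vector u = (10, 0) is an arbitrary choice for illustration.

```python
# Assumed chain matching Figure 4.6 as used in the text.
P = [[0.99, 0.01],
     [0.01, 0.99]]
r = [0.0, 1.0]
pi = [0.5, 0.5]
g = 0.5
w = [-25.0, 25.0]          # relative-gain vector, pi . w = 0
u = [10.0, 0.0]            # arbitrary illustrative final reward vector

def backward_step(v):
    """One stage of v <- r + [P] v."""
    return [ri + sum(Pij * vj for Pij, vj in zip(Pi, v))
            for ri, Pi in zip(r, P)]

# (4.44): with final reward w, each stage adds exactly g to every component.
v = list(w)
for n in range(1, 6):
    v = backward_step(v)
    print([vi - (wi + n * g) for vi, wi in zip(v, w)])  # ~ [0.0, 0.0]

# (4.47): v(n, u) - n g e approaches w + (pi . u) e as n grows.
v = list(u)
N = 2000
for n in range(1, N + 1):
    v = backward_step(v)
limit = [wi + sum(p * ui for p, ui in zip(pi, u)) for wi in w]
print([vi - N * g for vi in v], limit)   # the two should nearly agree
```

The convergence rate in the second loop is governed by the second eigenvalue of [P] (here 0.98), which is why a few thousand stages suffice for near machine-precision agreement.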
FINITE-STATE MARKOV CHAINS

Recall that Examples 4.5.1 and 4.5.2 showed that the aggregate reward vi from state i until entering a trapping state, state 1, is given by the solution to v = r + [P]v, v1 = 0. This aggregate reward, in the general setup of Theorem 4.9, is lim_{n→∞} v(n, u). Since g = 0 and u = 0 in these examples, (4.47) simplifies to lim_{n→∞} v(n, u) = w, where w = r + [P]w and π w = w1 = 0. Thus, we see that (4.47) gives the same answer as we got in these examples.

For the example in Figure 4.6, we have seen that w = (−25, 25) (see Exercise 4.21 also). The large relative gain for state 2 accounts for both the immediate reward and the high probability of multiple additional rewards through remaining in state 2. Note that w2 cannot be interpreted as the expected reward up to the first transition from state 2 to 1. The reason for this is that the gain starting from state 1 cannot be ignored; this can be seen from Figure 4.9, which modifies Figure 4.6 by changing P12 to 1. In this case (see Exercise 4.21), w2 − w1 = 1/1.01 ≈ 0.99, reflecting the fact that state 1 is always left immediately, thus reducing the advantage of starting in state 2.

Figure 4.9: A variation of Figure 4.6. State 1 has reward r1 = 0 and moves to state 2 with probability 1; state 2 has reward r2 = 1, stays in state 2 with probability 0.99, and moves to state 1 with probability 0.01.

We can now interpret the general solution in (4.46) by viewing g e as the steady-state gain per stage, viewing w as the dependence on the initial state, and viewing [P]^n {u − w} as the dependence on the final reward vector u. If the r...
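The claim w2 − w1 = 1/1.01 for the Figure 4.9 chain follows directly from the first component of (4.42): since state 1 moves to state 2 with probability 1 and r1 = 0, that component reads w1 + g = w2, so the relative-gain difference equals g itself. A short exact-arithmetic check, using the transition probabilities and rewards described for Figure 4.9:

```python
from fractions import Fraction as F

# Figure 4.9 chain: state 1 (r1 = 0) jumps to state 2 with probability 1;
# state 2 (r2 = 1) stays with probability 0.99, returns with 0.01.
P = [[F(0), F(1)],
     [F(1, 100), F(99, 100)]]
r = [F(0), F(1)]

# Steady-state probabilities from pi = pi [P] and pi1 + pi2 = 1;
# the first balance equation gives pi1 = 0.01 * pi2.
pi2 = F(1) / (F(1) + F(1, 100))      # = 100/101
pi1 = F(1) - pi2                     # = 1/101
g = pi1 * r[0] + pi2 * r[1]          # gain per stage = 100/101

# First component of (4.42), w + g e = r + [P] w, reads w1 + g = w2,
# so the relative-gain difference w2 - w1 is exactly g = 1/1.01.
w_diff = g
print(w_diff, float(w_diff))   # 100/101, about 0.9901
```

Using exact fractions makes the identity w2 − w1 = g = 100/101 visible without any floating-point tolerance.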
