Discrete-time stochastic processes

where g = ∑_i π_i r_i is the steady-state gain


= 0, the solution is unique. Example 4.5.2 is the same, except that r is different, and thus also has a unique solution.

Proof: Rewrite (4.42) as

    {[P] − [I]}w = ge − r.    (4.43)

Let w̃ be a particular solution to (4.43) (if one exists). Then any solution to (4.43) can be expressed as w̃ + x for some x that satisfies the homogeneous equation {[P] − [I]}x = 0. For x to satisfy {[P] − [I]}x = 0, however, x must be a right eigenvector of [P] with eigenvalue 1. From Theorem 4.8, x must have the form αe for some number α. This means that if a particular solution w̃ to (4.43) exists, then all solutions have the form w = w̃ + αe.

For a particular solution to (4.43) to exist, ge − r must lie in the column space of the matrix [P] − [I]. This column space is the space orthogonal to the left null space of [P] − [I]. This left null space, however, is simply the set of left eigenvectors of [P] of eigenvalue 1, i.e., the scalar multiples of π. Thus, a particular solution exists iff π(ge − r) = 0. Since πge = g and πr = g, this equality is satisfied and a particular solution exists. Since all solutions have the form w = w̃ + αe, setting πw = 0 determines the value of α to be −πw̃, thus yielding a unique solution with πw = 0 and completing the proof.

It is not necessary to assume that g = πr in the lemma. If g is treated as a variable in (4.42), then, by pre-multiplying any solution w, g of (4.42) by π, we find that g = πr must be satisfied. This means that (4.42) can be viewed as M linear equations in the M + 1 variables w, g, and the set of solutions can be found without first calculating π. Naturally, π must be found to find the particular solution with πw = 0.
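The uniqueness argument can be checked numerically: treat (4.42) as M equations in the M + 1 unknowns (w, g), append the normalization row πw = 0, and solve the resulting square system. A minimal sketch, assuming NumPy; the two-state chain of Figure 4.9 (P12 = 1, P21 = 0.01) is used purely as an illustration:

```python
import numpy as np

# Illustrative two-state unichain (the chain of Figure 4.9):
# P12 = 1, P21 = 0.01, P22 = 0.99, with rewards r = (0, 1).
P = np.array([[0.0, 1.0],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
M = len(r)

# Steady-state vector pi: normalized left eigenvector of [P] for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# (4.42) rearranged as ([I] - [P]) w + g e = r gives M equations in the
# M + 1 unknowns w_1, ..., w_M, g; appending the row pi w = 0 makes the
# system square and nonsingular (by the proof's argument: premultiplying
# by pi forces g = pi r, and then the normalization forces alpha = 0).
A = np.zeros((M + 1, M + 1))
A[:M, :M] = np.eye(M) - P
A[:M, M] = 1.0          # coefficients of g (the vector e)
A[M, :M] = pi           # normalization row: pi w = 0
b = np.concatenate([r, [0.0]])

sol = np.linalg.solve(A, b)
w, g = sol[:M], sol[M]
print(g - pi @ r)       # ~0: g = pi r is forced, as the lemma shows
print(pi @ w)           # ~0: the unique solution with pi w = 0
```

Note that π is only needed for the normalization row; any other single normalization (e.g., w1 = 0) also makes the system nonsingular, but yields a shifted solution.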
If the final reward vector is chosen to be any solution w of (4.42) (not necessarily the one with πw = 0), then

    v(1, w) = r + [P]w = w + ge
    v(2, w) = r + [P]{w + ge} = w + 2ge
    ···
    v(n, w) = r + [P]{w + (n − 1)ge} = w + nge.    (4.44)

This is a simple explicit expression for the expected aggregate gain for this special final reward vector. We now show how to use this to get a simple expression for v(n, u) for arbitrary u. From (4.36),

    v(n, u) − v(n, w) = [P]^n {u − w}.    (4.45)

Note that this is valid for any Markov unichain and any reward vector. Substituting (4.44) into (4.45),

    v(n, u) = nge + w + [P]^n {u − w}.    (4.46)

It should now be clear why we wanted to allow the final reward vector to differ from the reward vector at other stages. The result is summarized in the following theorem:

Theorem 4.9. Let [P] be the transition matrix of a unichain. Let r be a reward vector and w a solution to (4.42). Then the expected aggregate reward vector over n stages is given by (4.46). If the unichain is ergodic and w satisfies πw = 0, then

    lim_{n→∞} {v(n, u) − nge} = w + (πu)e.    (4.47)

Proof: The argument above established (4.46). If the recurrent class is ergodic, then [P]^n approaches a matrix whose rows each equal π, and (4.47) follows.

The set of solutions to (4.42) has the form w + αe, where w satisfies πw = 0 and α is any real number. The factor α cancels out in (4.46), so any solution can be used. In (4.47), however, the restriction to πw = 0 is necessary. We have defined the (asymptotic) relative-gain vector w to satisfy πw = 0 so that, in the ergodic case, the expected aggregate gain v(n, u) can be cleanly split into an initial transient w, the intermediate gain nge (accumulated at g per stage), and the final gain (πu)e, as in (4.47). We shall call other solutions to (4.42) shifted relative-gain vectors.
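Both (4.46) and the limit (4.47) can be verified against the backward recursion v(n, u) = r + [P]v(n − 1, u). A minimal sketch, assuming NumPy; the two-state chain of Figure 4.9 and the particular final reward vector u are chosen purely for illustration:

```python
import numpy as np

P = np.array([[0.0, 1.0], [0.01, 0.99]])   # chain of Figure 4.9 (illustration)
r = np.array([0.0, 1.0])
e = np.ones(2)

# Steady-state vector pi, gain per stage g = pi r, and the relative-gain
# vector w solving (4.42) with pi w = 0.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
g = pi @ r
A = np.vstack([np.hstack([np.eye(2) - P, np.ones((2, 1))]),
               np.append(pi, 0.0)])
w = np.linalg.solve(A, np.append(r, 0.0))[:2]

n = 50
u = np.array([3.0, -2.0])                  # arbitrary final reward vector
v = u.copy()
for _ in range(n):                         # v(k, u) = r + [P] v(k-1, u)
    v = r + P @ v

# Closed form (4.46): v(n, u) = n g e + w + [P]^n (u - w).
closed = n * g * e + w + np.linalg.matrix_power(P, n) @ (u - w)
print(np.max(np.abs(v - closed)))          # ~0: recursion matches (4.46)

# (4.47): for an ergodic unichain, v(n, u) - n g e -> w + (pi u) e.
print(np.max(np.abs(v - n * g * e - (w + (pi @ u) * e))))
```

The second eigenvalue of this [P] is −0.01, so [P]^n converges to the matrix with rows π extremely fast; already at n = 50 the limit in (4.47) holds to machine precision.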
Recall that Examples 4.5.1 and 4.5.2 showed that the aggregate reward v_i from state i to enter a trapping state, state 1, is given by the solution to v = r + [P]v, v1 = 0. This aggregate reward, in the general setup of Theorem 4.9, is lim_{n→∞} v(n, u). Since g = 0 and u = 0 in these examples, (4.47) simplifies to lim_{n→∞} v(n, u) = w, where w = r + [P]w and πw = w1 = 0. Thus, we see that (4.47) gives the same answer as we got in these examples.

For the example in Figure 4.6, we have seen that w = (−25, 25) (see also Exercise 4.21). The large relative gain for state 2 accounts for both the immediate reward and the high probability of multiple additional rewards through remaining in state 2. Note that w2 cannot be interpreted as the expected reward up to the first transition from state 2 to 1. The reason is that the gain starting from state 1 cannot be ignored; this can be seen from Figure 4.9, which modifies Figure 4.6 by changing P12 to 1. In this case (see Exercise 4.21), w2 − w1 = 1/1.01 ≈ 0.99, reflecting the fact that state 1 is always left immediately, thus reducing the advantage of starting in state 2.

Figure 4.9: A variation of Figure 4.6 in which P12 = 1. State 1 has reward r1 = 0; state 2 has reward r2 = 1, with P21 = 0.01 and P22 = 0.99.

We can now interpret the general solution in (4.46) by viewing ge as the steady-state gain per stage, viewing w as the dependence on the initial state, and viewing [P]^n {u − w} as the dependence on the final reward vector u. If the r...
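The two relative-gain vectors quoted above can be reproduced numerically. A minimal sketch, assuming NumPy, and assuming Figure 4.6 is the symmetric two-state chain with P11 = P22 = 0.99, P12 = P21 = 0.01, and r = (0, 1), which is consistent with the quoted w = (−25, 25):

```python
import numpy as np

def relative_gain(P, r):
    """Solve (4.42), w + g e = r + [P] w, with the normalization pi w = 0.

    Assumes [P] is the transition matrix of a unichain, so that the
    augmented (M+1)x(M+1) system is nonsingular.
    """
    M = len(r)
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi /= pi.sum()
    A = np.zeros((M + 1, M + 1))
    A[:M, :M] = np.eye(M) - P
    A[:M, M] = 1.0              # coefficients of g
    A[M, :M] = pi               # normalization pi w = 0
    sol = np.linalg.solve(A, np.concatenate([r, [0.0]]))
    return sol[:M], sol[M]

r = np.array([0.0, 1.0])

# Figure 4.6 (assumed P12 = P21 = 0.01): large relative gain for state 2.
w46, g46 = relative_gain(np.array([[0.99, 0.01], [0.01, 0.99]]), r)
print(w46)                      # approximately (-25, 25), as quoted above

# Figure 4.9 (P12 changed to 1): the advantage of starting in state 2 shrinks.
w49, g49 = relative_gain(np.array([[0.0, 1.0], [0.01, 0.99]]), r)
print(w49[1] - w49[0])          # 1/1.01, about 0.99
```

The contrast makes the interpretation above concrete: when state 1 holds the process for a long time (Figure 4.6), starting in state 2 is worth 50 extra units of aggregate reward; when state 1 is left immediately (Figure 4.9), the advantage drops to about one stage's reward.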