... From our discussion of Little’s theorem in Section 3.6, we
know that this sum of times is equal to the sum of the number of customers in the system,
summed over each integer time from the initial time with $i$ customers to the final time when
the system becomes empty. As in the previous example, we modify the Markov chain to
make state 0 a trapping state. We take $r_i = i$ as the “reward” in state $i$, and $v_i$ as the
expected aggregate reward until the trapping state is entered. Using the same reasoning as
in the previous example, $v_i$ is equal to the immediate “reward” $r_i = i$ plus the expected
reward from whatever state is entered next. Thus $v_i = r_i + \sum_{j \ge 1} P_{ij} v_j$. With $v_0 = 0$, this
is $\boldsymbol{v} = \boldsymbol{r} + [P]\boldsymbol{v}$. This has a unique solution for $\boldsymbol{v}$, as will be shown later in Lemma 4.1.
This same analysis is valid for any choice of reward $r_i$ for each transient state $i$; the reward
in the trapping state must be 0 so as to keep the expected aggregate reward finite.

4.5. MARKOV CHAINS WITH REWARDS

In the above examples, the Markov chain has a trapping state with zero gain, so the expected
gain is essentially a transient phenomenon until entering the trapping state. We now look at
the more general case of a unichain, i.e., a chain with a single recurrent class, possibly along
with some transient states. In this more general case, there can be some average gain per
unit time, along with some transient gain depending on the initial state. We first look at
the aggregate gain over a finite number of time units, thus providing a clean way of going
to the limit.
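
As an illustration of the trapping-state equation $v_i = r_i + \sum_{j \ge 1} P_{ij} v_j$ with $v_0 = 0$ from the examples above, the following Python sketch solves it numerically by repeatedly applying the right-hand side. The three-state chain, its transition probabilities, and the function name are hypothetical, chosen only so that the answer can be checked by hand. The fixed-point iteration is one simple numerical approach and is not claimed to be the argument behind Lemma 4.1.

```python
def aggregate_reward_until_trapped(Q, r, iterations=1000):
    """Iterate v <- r + Q v.

    Q holds the transition probabilities among the transient states only
    (the trapping state 0 is dropped, which is the same as setting v_0 = 0).
    """
    n = len(r)
    v = [0.0] * n
    for _ in range(iterations):
        v = [r[i] + sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


# Hypothetical chain: state 0 is trapping; states 1 and 2 are transient.
# From state 1: to state 0 w.p. 0.5, to state 2 w.p. 0.5.
# From state 2: to state 1 w.p. 0.5, stay in state 2 w.p. 0.5.
Q = [[0.0, 0.5],
     [0.5, 0.5]]
r = [1.0, 2.0]  # reward r_i = i in transient state i

v = aggregate_reward_until_trapped(Q, r)
print(v)  # close to v_1 = 6 and v_2 = 10
```

Solving the two equations $v_1 = 1 + 0.5 v_2$ and $v_2 = 2 + 0.5 v_1 + 0.5 v_2$ by hand gives $v_1 = 6$ and $v_2 = 10$, which the iteration approaches.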
Example 4.5.3. The example in Figure 4.6 provides some intuitive appreciation for the
general problem. Note that the chain tends to persist in whatever state it is in for a
relatively long time. Thus if the chain starts in state 2, not only is an immediate reward
of 1 achieved, but there is a high probability of an additional gain of 1 on many successive
transitions. Thus the aggregate value of starting in state 2 is considerably more than the
immediate reward of 1. On the other hand, we see from symmetry that the expected gain
per unit time, over a long time period, must be one half.
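[Figure 4.6: Markov chain with rewards. Two states: state 1 with $r_1 = 0$ and state 2 with $r_2 = 1$. Each state returns to itself with probability 0.99 and moves to the other state with probability 0.01.]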
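
The symmetry claim for Figure 4.6 can be checked numerically. The sketch below is a minimal illustration, assuming the transition probabilities read off the figure (0.99 to stay, 0.01 to switch); it pushes an initial distribution through many transitions and then averages the rewards. The helper name `step` and the iteration count are arbitrary choices made here.

```python
# Transition matrix and rewards for the two-state chain of Figure 4.6
# (row i gives the probabilities of moving from state i+1).
P = [[0.99, 0.01],
     [0.01, 0.99]]
r = [0.0, 1.0]


def step(dist, P):
    """Advance a row-vector state distribution by one transition."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


dist = [1.0, 0.0]  # start in state 1 with certainty
for _ in range(5000):
    dist = step(dist, P)

# Long-run expected reward per unit time under the limiting distribution.
gain = sum(d * ri for d, ri in zip(dist, r))
print(dist, gain)  # distribution near (0.5, 0.5), gain near 0.5
```

Because the chain leaves its current state with probability only 0.01, the distribution converges slowly, which is the persistence effect described in the example.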
Returning to the general case, it is convenient to work backward from a final time rather
than forward from the initial time. This will be quite helpful later when we consider dynamic
programming and Markov decision theory. For any final time $m$, define stage $n$ as $n$ time
units before the final time, i.e., as time $m - n$ in Figure 4.7. Equivalently, we often view
the final time as time 0, and then stage $n$ corresponds to time $-n$.
[Figure 4.7: Alternate views of stages. First view: times $m-n, \ldots, m-3, m-2, m-1, m$ correspond to stages $n, \ldots, 3, 2, 1, 0$. Second view: times $-n, -n+1, -n+2, -n+3, \ldots, -2, -1, 0$ correspond to stages $n, n-1, n-2, n-3, \ldots, 2, 1, 0$.]
As a final generalization of the problem (which will be helpful in the solution), we allow the
reward at the final time (i.e., in stage 0) to be different from that at other times. The final
reward in state $i$ is denoted $u_i$, and $\boldsymbol{u} = (u_1, \ldots, u_M)^T$. We denote the expected aggregate
reward from stage $n$ up to and including the final stage (stage zero), given state $i$ at stage $n$,
as $v_i(n, \boldsymbol{u})$. Note that the notation here takes advantage of the Markov property. That
is, given that the chain is in state $i$ at time $-n$ (i.e., stage $n$), the expected aggregate reward
up to and including time 0 is independent of the states before time $-n$ and is independent
of when the Markov chain started prior to time $-n$.

CHAPTER 4. FINITE-STATE MARKOV CHAINS

The expected aggregate reward can be found by starting at stage 1. Given that the chain is
in state $i$ at time $-1$, the immediate reward is $r_i$. The chain then makes a transition (with
probability $P_{ij}$) to some state $j$ at time 0 with a final reward of $u_j$. Thus

$$v_i(1, \boldsymbol{u}) = r_i + \sum_j P_{ij} u_j. \tag{4.31}$$

For the example of Figure 4.6 (assuming the final reward is the same as that at the other
stages, i.e., $u_i = r_i$ for $i = 1, 2$), we have $v_1(1, \boldsymbol{u}) = 0.01$ and $v_2(1, \boldsymbol{u}) = 1.99$.
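
The two stage-1 values quoted for Figure 4.6 follow directly from (4.31). The short Python sketch below reproduces them, assuming the transition probabilities shown in the figure (0.99 to stay, 0.01 to switch); the function name `one_stage` is introduced here purely for illustration.

```python
def one_stage(P, r, u):
    """Evaluate (4.31): v_i(1, u) = r_i + sum_j P_ij * u_j for every state i."""
    n = len(r)
    return [r[i] + sum(P[i][j] * u[j] for j in range(n)) for i in range(n)]


P = [[0.99, 0.01],
     [0.01, 0.99]]
r = [0.0, 1.0]  # r_1 = 0, r_2 = 1
u = r           # final reward taken equal to the per-stage reward

v_stage1 = one_stage(P, r, u)
print(v_stage1)  # approximately [0.01, 1.99]
```

Written out, $v_1(1, \boldsymbol{u}) = 0 + 0.99 \cdot 0 + 0.01 \cdot 1 = 0.01$ and $v_2(1, \boldsymbol{u}) = 1 + 0.01 \cdot 0 + 0.99 \cdot 1 = 1.99$.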
The expected aggregate reward for stage 2 can be calculated in the same way. Given state $i$
at time $-2$ (i.e., stage 2), there is an immediate reward of $r_i$ and, with probability $P_{ij}$, the
chain goes to state $j$ at time $-1$ (i.e., stage 1) with an expected additional gain of $v_j(1, \boldsymbol{u})$.
Thus

$$v_i(2, \boldsymbol{u}) = r_i + \sum_j P_{ij} v_j(1, \boldsymbol{u}). \tag{4.32}$$

Note that $v_j(1, \boldsymbol{u})$, as calculated in (4.31), includes the gain in stages 1 and 0, and does not
depend on how state $j$ was entered. Iterating the above argument to stages $3, 4, \ldots, n$,
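
$$v_i(n, \boldsymbol{u}) = r_i + \sum_j P_{ij} v_j(n-1, \boldsymbol{u}). \tag{4.33}$$

This can be written in vector form as

$$\boldsymbol{v}(n, \boldsymbol{u}) = \boldsymbol{r} + [P]\boldsymbol{v}(n-1, \boldsymbol{u}); \qquad n \ge 1, \tag{4.34}$$

where $\boldsymbol{r}$ is a column vector with components $r_1, r_2, \ldots, r_M$ and $\boldsymbol{v}(n, \boldsymbol{u})$ is a column vector
with components $v_1(n, \boldsymbol{u}), \ldots, v_M(n, \boldsymbol{u})$. By substituting (4.34), with $n$ replaced by $n - 1$,
into the last term of (4.34),

$$\boldsymbol{v}(n, \boldsymbol{u}) = \boldsymbol{r} + [P]\boldsymbol{r} + [P]^2 \boldsymbol{v}(n-2, \boldsymbol{u}); \qquad n \ge 2. \tag{4.35}$$

Applying the same substitution recursiv...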
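
The backward recursion (4.34) is straightforward to run numerically. The sketch below is a minimal illustration for the chain of Figure 4.6, assuming the convention $\boldsymbol{v}(0, \boldsymbol{u}) = \boldsymbol{u}$ (the stage-0 reward is the final reward) and $\boldsymbol{u} = \boldsymbol{r}$. The helper names `matvec` and `aggregate_reward` are introduced here for illustration only. It also checks the two-step form (4.35) against the one-step recursion.

```python
def matvec(P, x):
    """Multiply the square matrix P by the column vector x."""
    return [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(P))]


def aggregate_reward(P, r, u, n):
    """Apply v(k, u) = r + [P] v(k-1, u) for k = 1..n, starting from v(0, u) = u."""
    v = list(u)
    for _ in range(n):
        v = [ri + pv for ri, pv in zip(r, matvec(P, v))]
    return v


P = [[0.99, 0.01],
     [0.01, 0.99]]
r = [0.0, 1.0]
u = r  # assumption: final reward equals the per-stage reward

v_stage2 = aggregate_reward(P, r, u, 2)
print(v_stage2)  # approximately [0.0298, 2.9702]

# Check (4.35): v(n, u) = r + [P] r + [P]^2 v(n-2, u), here with n = 10.
n = 10
lhs = aggregate_reward(P, r, u, n)
Pr = matvec(P, r)
PPv = matvec(P, matvec(P, aggregate_reward(P, r, u, n - 2)))
rhs = [r[i] + Pr[i] + PPv[i] for i in range(len(r))]
print(lhs, rhs)  # the two vectors agree up to rounding
```

For this chain the components of $\boldsymbol{v}(n, \boldsymbol{u})$ sum to $n + 1$ for every $n$, consistent with an average gain of one half per unit time, while the difference between them reflects the dependence on the starting state.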
 Spring '09
 R.Srikant
