ce, but also for studying residual life,
queueing delay, and many other phenomena.
In Section 4.6, we shall study Markov decision theory, or dynamic programming. This can
be viewed as a generalization of Markov chains with rewards in the sense that there is a
“decision maker” or “policy maker” who in each state can choose between several different
policies; for each policy, there is a given set of transition probabilities to the next state
and a given expected reward for the current state. Thus the decision maker must make a
compromise between the expected reward of a given policy in the current state (i.e., the
immediate reward) and the long-term benefit from the next state to be entered. This is a
much more challenging problem than the current study of Markov chains with rewards, but
a thorough understanding of the current problem provides the machinery to understand
Markov decision theory also.
Frequently it is more natural to associate rewards with transitions rather than states. If $r_{ij}$
denotes the reward associated with a transition from $i$ to $j$ and $P_{ij}$ denotes the corresponding
transition probability, then $r_i = \sum_j P_{ij} r_{ij}$ is the expected reward associated with a
transition from state $i$. Since we analyze only expected rewards here, and since the effect of the
transition rewards $r_{ij}$ is summarized in the state rewards $r_i = \sum_j P_{ij} r_{ij}$, we henceforth
ignore transition rewards and consider only state rewards.
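As a minimal sketch of the reduction from transition rewards to state rewards, the following computes $r_i = \sum_j P_{ij} r_{ij}$ row by row. The three-state matrices `P` and `R` and the helper name `state_rewards` are hypothetical and chosen only for illustration; they do not come from the text.

```python
# Hypothetical 3-state chain (illustrative values, not from the text).
# P[i][j] is the transition probability from state i to state j.
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
# R[i][j] is the reward r_ij earned on a transition from i to j.
R = [
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 4.0],
    [0.0, 2.0, 6.0],
]


def state_rewards(P, R):
    """Return the list of r_i = sum_j P_ij * r_ij, one entry per state i."""
    return [
        sum(p_ij * r_ij for p_ij, r_ij in zip(p_row, r_row))
        for p_row, r_row in zip(P, R)
    ]


r = state_rewards(P, R)
print(r)
```

Once `r` has been computed, the matrix `R` is no longer needed for any calculation that involves only expected rewards.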
The steady-state expected reward per unit time, assuming a single recurrent class of states,
is easily seen to be $g = \sum_i \pi_i r_i$, where $\pi_i$ is the steady-state probability of being in state $i$.
The following examples demonstrate that it is also important to understand the transient
behavior of rewards. This transient behavior will turn out to be even more important when
we study Markov decision theory and dynamic programming.
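The steady-state gain $g = \sum_i \pi_i r_i$ can be sketched numerically as follows. This is an illustration under stated assumptions, not a general method: the two-state chain and its rewards are hypothetical, and the steady-state vector is approximated by repeatedly applying $\pi \leftarrow \pi [P]$, which assumes a single recurrent class that is also aperiodic.

```python
def steady_state(P, iterations=500):
    """Approximate the steady-state vector by iterating pi <- pi P.

    Assumes a single aperiodic recurrent class, so the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n  # any initial probability vector works under the assumption
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def gain(P, r):
    """Return g = sum_i pi_i * r_i, the steady-state reward per unit time."""
    pi = steady_state(P)
    return sum(pi_i * r_i for pi_i, r_i in zip(pi, r))


# Hypothetical two-state chain and state rewards (not from the text).
P = [
    [0.9, 0.1],
    [0.3, 0.7],
]
r = [1.0, 5.0]

pi = steady_state(P)
g = gain(P, r)
print(pi, g)
```

For this chain the balance equation $0.1\,\pi_0 = 0.3\,\pi_1$ gives $\pi = (0.75, 0.25)$, so $g = 0.75 \cdot 1 + 0.25 \cdot 5 = 2$. The indices here are 0-based, following Python convention.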
Example 4.5.1 (Expected first-passage time). A common problem when dealing with
Markov chains is that of finding the expected number of steps, starting in some initial state,
before some given final state is entered. Since the answer to this problem does not depend
on what happens after the given final state is entered, we can modify the chain to convert
the given final state, say state 1, into a trapping state (a trapping state $i$ is a state from
which there is no exit, i.e., for which $P_{ii} = 1$). That is, we set $P_{11} = 1$, $P_{1j} = 0$ for all
$j \neq 1$, and leave $P_{ij}$ unchanged for all $i \neq 1$ and all $j$ (see Figure 4.5).
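The conversion to a trapping state amounts to overwriting one row of the transition matrix. The sketch below assumes a hypothetical four-state matrix `P`; its values are not taken from Figure 4.5, and only the zero pattern of rows 2 to 4 is chosen to match the equations given for that figure. Python index 0 stands for state 1 of the text, and `make_trapping` is an illustrative helper name.

```python
def make_trapping(P, k):
    """Return a copy of P in which state k (0-based) is a trapping state.

    Row k becomes P_kk = 1 and P_kj = 0 for j != k; all other rows are unchanged.
    """
    Q = [row[:] for row in P]  # copy the rows so the original matrix is untouched
    Q[k] = [1.0 if j == k else 0.0 for j in range(len(P))]
    return Q


# Hypothetical four-state chain; index 0 corresponds to state 1 in the text.
P = [
    [0.0, 0.5, 0.5, 0.0],
    [0.25, 0.0, 0.5, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.5, 0.5, 0.0],
]

Q = make_trapping(P, 0)
for row in Q:
    print(row)
```

Only the outgoing arcs of the chosen state change; the arcs entering it are kept, which is what makes the first-passage question well posed on the modified chain.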
Figure 4.5: The conversion of a four-state Markov chain into a chain for which state 1
is a trapping state. Note that the outgoing arcs from node 1 have been removed.

158 CHAPTER 4. FINITE-STATE MARKOV CHAINS

Let $v_i$ be the expected number of steps to reach state 1 starting in state $i \neq 1$. This number
of steps includes the first step plus the expected number of steps from whatever state is
entered next (which is 0 if state 1 is entered next). Thus, for the chain in Figure 4.5, we
have the equations
\begin{align*}
v_2 &= 1 + P_{23} v_3 + P_{24} v_4 \\
v_3 &= 1 + P_{32} v_2 + P_{33} v_3 + P_{34} v_4 \\
v_4 &= 1 + P_{42} v_2 + P_{43} v_3 .
\end{align*}
For an arbitrary chain of $M$ states where 1 is a trapping state and all other states are
transient, this set of equations becomes
$$v_i = 1 + \sum_{j \neq 1} P_{ij} v_j ; \qquad i \neq 1. \tag{4.29}$$
If we define $r_i = 1$ for $i \neq 1$ and $r_i = 0$ for $i = 1$, then $r_i$ is a unit reward for not yet entering
the trapping state, and $v_i$ is the expected aggregate reward before entering the trapping
state. Thus by taking $r_1 = 0$, the reward ceases upon entering the trapping state, and $v_i$
is the expected transient reward, i.e., the expected first-passage time from state $i$ to state
1. Note that in this example, rewards occur only in transient states. Since transient states
have zero steady-state probabilities, the steady-state gain per unit time, $g = \sum_i \pi_i r_i$, is 0.
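The reading of $v_i$ as an expected aggregate reward can be illustrated by accumulating the unit rewards step by step. This is a sketch under assumptions: the four-state trapping chain below is hypothetical (index 0 stands for state 1 of the text), and `aggregate_reward` is an illustrative helper that iterates $v \leftarrow r + [P]v$ from $v = 0$, so that after $n$ passes `v[i]` is the expected reward collected in the first $n$ steps starting from state $i$.

```python
def aggregate_reward(P, r, steps):
    """Expected reward accumulated over `steps` transitions from each start state.

    Starts from v = 0 and applies v <- r + P v once per step.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(steps):
        v = [r[i] + sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


# Hypothetical chain in which index 0 (state 1 in the text) is trapping.
P = [
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.0, 0.5, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.5, 0.5, 0.0],
]
# Unit reward in every state except the trapping state.
r = [0.0, 1.0, 1.0, 1.0]

for steps in (1, 5, 20, 500):
    print(steps, aggregate_reward(P, r, steps))

v = aggregate_reward(P, r, 500)
```

Because the reward is 0 in the trapping state, the entry for that state stays at 0, and the other entries increase toward the expected first-passage times, which for this hypothetical chain are 5, 5, and 6.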
If we define $v_1 = 0$, then (4.29), along with $v_1 = 0$, has the vector form
$$\boldsymbol{v} = \boldsymbol{r} + [P]\boldsymbol{v} ; \qquad v_1 = 0. \tag{4.30}$$
For a Markov chain with $M$ states, (4.29) is a set of $M - 1$ equations in the $M - 1$ variables
$v_2$ to $v_M$. The equation $\boldsymbol{v} = \boldsymbol{r} + [P]\boldsymbol{v}$ is a set of $M$ linear equations, of which the first is the
vacuous equation $v_1 = 0 + v_1$, and, with $v_1 = 0$, the last $M - 1$ correspond to (4.29). It is
not hard to show that (4.30) has a unique solution for $\boldsymbol{v}$ under the condition that states 2
to $M$ are all transient states and 1 is a trapping state, but we prove this later, in Lemma
4.1, under more general circumstances.
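One way to solve (4.29) directly is to write it as $(I - T)\,v = \mathbf{1}$, where $T$ is $[P]$ restricted to the non-trapping states, and apply Gauss-Jordan elimination. The sketch below does this in exact rational arithmetic. The four-state matrix is hypothetical, index 0 stands for state 1 of the text, and `first_passage_times` is an illustrative name. The code assumes that all non-target states are transient, so that a nonzero pivot always exists.

```python
from fractions import Fraction


def first_passage_times(P, target=0):
    """Solve v_i = 1 + sum_{j != target} P_ij v_j for i != target, with v_target = 0."""
    n = len(P)
    others = [i for i in range(n) if i != target]
    m = len(others)
    # Augmented matrix [I - T | 1], where T is P restricted to non-target states.
    A = []
    for i in others:
        row = [Fraction(int(i == j)) - Fraction(P[i][j]) for j in others]
        A.append(row + [Fraction(1)])
    # Gauss-Jordan elimination with exact fractions.
    for col in range(m):
        pivot = next(k for k in range(col, m) if A[k][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [x / scale for x in A[col]]
        for k in range(m):
            if k != col:
                factor = A[k][col]
                A[k] = [x - factor * y for x, y in zip(A[k], A[col])]
    v = [Fraction(0)] * n
    for k, i in enumerate(others):
        v[i] = A[k][m]
    return v


q, h = Fraction(1, 4), Fraction(1, 2)
# Hypothetical chain; index 0 (state 1 in the text) is the trapping state.
P = [
    [1, 0, 0, 0],
    [q, 0, h, q],
    [q, q, q, q],
    [0, h, h, 0],
]
r = [0, 1, 1, 1]

v = first_passage_times(P, target=0)
# Check the vector form v = r + [P] v of (4.30), including the vacuous first row.
check = [r[i] + sum(P[i][j] * v[j] for j in range(len(P))) for i in range(len(P))]
print([str(x) for x in v], check == v)
```

For this hypothetical chain the solution is $v = (0, 5, 5, 6)$. The row of the target state is never read by the solver, which mirrors the observation that the first equation of (4.30) is vacuous.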
Example 4.5.2. Assume that a Markov chain has $M$ states, $\{0, 1, \ldots, M - 1\}$, and that
the state represents the number of customers in an integer-time queueing system. Suppose
we wish to find the expected sum of the times all customers spend in the system, starting
at an integer time where $i$ customers are in the system, and ending at the first instant
when the system becomes idle. Fro...
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R.srikant during the Spring '09 term at University of Illinois, Urbana Champaign.