Discrete-time stochastic processes

… but also for studying residual life, queueing delay, and many other phenomena. In Section 4.6, we shall study Markov decision theory, or dynamic programming. This can be viewed as a generalization of Markov chains with rewards in the sense that there is a "decision maker" or "policy maker" who in each state can choose between several different policies; for each policy, there is a given set of transition probabilities to the next state and a given expected reward for the current state. Thus the decision maker must trade off the expected reward of a given policy in the current state (i.e., the immediate reward) against the long-term benefit from the next state to be entered. This is a much more challenging problem than the current study of Markov chains with rewards, but a thorough understanding of the current problem provides the machinery to understand Markov decision theory also.

Frequently it is more natural to associate rewards with transitions rather than with states. If $r_{ij}$ denotes the reward associated with a transition from $i$ to $j$, and $P_{ij}$ denotes the corresponding transition probability, then $r_i = \sum_j P_{ij} r_{ij}$ is the expected reward associated with a transition from state $i$. Since we analyze only expected rewards here, and since the effect of the transition rewards $r_{ij}$ is summarized in the state rewards $r_i = \sum_j P_{ij} r_{ij}$, we henceforth ignore transition rewards and consider only state rewards.

The steady-state expected reward per unit time, assuming a single recurrent class of states, is easily seen to be $g = \sum_i \pi_i r_i$, where $\pi_i$ is the steady-state probability of being in state $i$. The following examples demonstrate that it is also important to understand the transient behavior of rewards. This transient behavior will turn out to be even more important when we study Markov decision theory and dynamic programming.

Example 4.5.1 (Expected first-passage time). A common problem when dealing with Markov chains is that of finding the expected number of steps, starting in some initial state, before some given final state is entered. Since the answer to this problem does not depend on what happens after the given final state is entered, we can modify the chain to convert the given final state, say state 1, into a trapping state (a trapping state $i$ is a state from which there is no exit, i.e., for which $P_{ii} = 1$). That is, we set $P_{11} = 1$, $P_{1j} = 0$ for all $j \neq 1$, and leave $P_{ij}$ unchanged for all $i \neq 1$ and all $j$ (see Figure 4.5).

[Figure 4.5: The conversion of a four-state Markov chain into a chain for which state 1 is a trapping state. Note that the outgoing arcs from node 1 have been removed.]

Let $v_i$ be the expected number of steps to reach state 1 starting in state $i \neq 1$. This number of steps includes the first step plus the expected number of steps from whatever state is entered next (which is 0 if state 1 is entered next). Thus, for the chain in Figure 4.5, we have the equations

    $v_2 = 1 + P_{23} v_3 + P_{24} v_4$
    $v_3 = 1 + P_{32} v_2 + P_{33} v_3 + P_{34} v_4$
    $v_4 = 1 + P_{42} v_2 + P_{43} v_3$.

For an arbitrary chain of $M$ states in which state 1 is a trapping state and all other states are transient, this set of equations becomes

    $v_i = 1 + \sum_{j \neq 1} P_{ij} v_j$,    $i \neq 1$.        (4.29)

If we define $r_i = 1$ for $i \neq 1$ and $r_i = 0$ for $i = 1$, then $r_i$ is a unit reward for not yet having entered the trapping state, and $v_i$ is the expected aggregate reward before entering the trapping state.
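As a numerical illustration of (4.29), here is a minimal sketch that solves the first-passage equations for a hypothetical four-state chain of the type shown in Figure 4.5. The transition probabilities below are invented for the illustration and are not taken from the text; the only structural assumptions are that state 1 is trapping and the remaining states are transient.

```python
import numpy as np

# Hypothetical 4-state chain (states 1..4 mapped to indices 0..3) in which
# state 1 has already been converted into a trapping state (P_11 = 1).
# These probabilities are made up for the illustration.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # state 1: trapping
    [0.4, 0.0, 0.3, 0.3],   # state 2
    [0.0, 0.2, 0.3, 0.5],   # state 3
    [0.1, 0.5, 0.4, 0.0],   # state 4
])

# Equations (4.29): v_i = 1 + sum_{j != 1} P_ij v_j for i != 1.
# Restricting P to the transient states gives (I - Q) v = 1 (a vector of ones).
Q = P[1:, 1:]                        # transitions among the transient states only
v = np.linalg.solve(np.eye(3) - Q, np.ones(3))

print("expected number of steps to reach state 1 from states 2, 3, 4:", v)
```

The same computation works for any number of states; the assumption that states 2 through $M$ are all transient is what guarantees that $I - Q$ is nonsingular.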
Thus, by taking $r_1 = 0$, the reward ceases upon entering the trapping state, and $v_i$ is the expected transient reward, i.e., the expected first-passage time from state $i$ to state 1. Note that in this example, rewards occur only in transient states. Since transient states have zero steady-state probabilities, the steady-state gain per unit time, $g = \sum_i \pi_i r_i$, is 0.

If we define $v_1 = 0$, then (4.29), along with $v_1 = 0$, has the vector form

    $v = r + [P]v$;    $v_1 = 0$.        (4.30)

For a Markov chain with $M$ states, (4.29) is a set of $M - 1$ equations in the $M - 1$ variables $v_2$ to $v_M$. The equation $v = r + [P]v$ is a set of $M$ linear equations, of which the first is the vacuous equation $v_1 = 0 + v_1$, and, with $v_1 = 0$, the last $M - 1$ correspond to (4.29). It is not hard to show that (4.30) has a unique solution for $v$ under the condition that states 2 to $M$ are all transient and 1 is a trapping state, but we prove this later, in Lemma 4.1, under more general circumstances.

Example 4.5.2. Assume that a Markov chain has $M$ states, $\{0, 1, \ldots, M-1\}$, and that the state represents the number of customers in an integer-time queueing system. Suppose we wish to find the expected sum of the times all customers spend in the system, starting at an integer time where $i$ customers are in the system, and ending at the first instant when the system becomes idle. Fro...
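Referring back to the vector form (4.30), the following sketch (a continuation of the earlier made-up example, not code from the text) imposes $v_1 = 0$, discards the vacuous first equation, and solves the remaining $M - 1$ equations for an arbitrary reward vector $r$. With $r_i = 1$ for $i \neq 1$ it reproduces the first-passage times computed in the previous sketch.

```python
import numpy as np

def transient_reward(P, r):
    """Solve v = r + [P]v with v_1 = 0, assuming state 1 (index 0) is trapping.

    The first equation of v = r + [P]v is vacuous once v_1 = 0, so only the
    last M-1 equations are solved: (I - Q) v_rest = r_rest, where Q is P
    restricted to states 2..M.
    """
    M = P.shape[0]
    Q = P[1:, 1:]
    v_rest = np.linalg.solve(np.eye(M - 1) - Q, r[1:])
    return np.concatenate(([0.0], v_rest))   # prepend v_1 = 0

# Same hypothetical transition matrix as in the previous sketch.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.4, 0.0, 0.3, 0.3],
              [0.0, 0.2, 0.3, 0.5],
              [0.1, 0.5, 0.4, 0.0]])

# Unit reward in every non-trapping state, zero in the trapping state:
# v is then the expected first-passage time to state 1, as in Example 4.5.1.
r = np.array([0.0, 1.0, 1.0, 1.0])
print(transient_reward(P, r))
```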