Discrete-time stochastic processes

## Section 4.8 Exercises (Chapter 4: Finite-State Markov Chains)


[Exercise 4.30, continued] ...gain per stage, $g$ and $g'$, for stationary policies $k$ and $k'$. Show that $g = g'$.

b) Find the relative-gain vectors, $w$ and $w'$, for stationary policies $k$ and $k'$.

c) Suppose the final reward, at stage 0, is $u_1 = 0$, $u_2 = u$. For what range of $u$ does the dynamic programming algorithm use decision $k$ in state 2 at stage 1?

d) For what range of $u$ does the dynamic programming algorithm use decision $k$ in state 2 at stage 2? At stage $n$? You should find that (for this example) the dynamic programming algorithm uses the same decision at each stage $n$ as it uses in stage 1.

e) Find the optimal gains $v_2^*(n, u)$ and $v_1^*(n, u)$ as a function of stage $n$, assuming $u = 10$.

f) Find $\lim_{n \to \infty} v^*(n, u)$ and show how it depends on $u$.

Exercise 4.31. Consider a Markov decision problem in which the stationary policies $k$ and $k'$ each satisfy Bellman's equation, (4.60), and each correspond to ergodic Markov chains.

a) Show that if $r^{k'} + [P^{k'}]w' \geq r^{k} + [P^{k}]w'$ is not satisfied with equality, then $g' > g$.

b) Show that $r^{k'} + [P^{k'}]w' = r^{k} + [P^{k}]w'$. (Hint: use part a.)

c) Find the relationship between the relative-gain vector $w^{k}$ for policy $k$ and the relative-gain vector $w'$ for policy $k'$. (Hint: Show that $r^{k} + [P^{k}]w' = g e + w'$; what does this say about $w$ and $w'$?)

e) Suppose that policy $k$ uses decision 1 in state 1 and policy $k'$ uses decision 2 in state 1 (i.e., $k_1 = 1$ for policy $k$ and $k_1 = 2$ for policy $k'$). What is the relationship between $r_1^{(k)}, P_{11}^{(k)}, P_{12}^{(k)}, \ldots, P_{1J}^{(k)}$ for $k$ equal to 1 and 2?

f) Now suppose that policy $k$ uses decision 1 in each state and policy $k'$ uses decision 2 in each state. Is it possible that $r_i^{(1)} > r_i^{(2)}$ for all $i$? Explain carefully.

g) Now assume that $r_i^{(1)}$ is the same for all $i$. Does this change your answer to part f)? Explain.

Exercise 4.32. Consider a Markov decision problem with three states. Assume that each stationary policy corresponds to an ergodic Markov chain.
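Several of these exercises (4.30 b, 4.31) ask for the gain $g$ and relative-gain vector $w$ of a stationary policy. As a numerical sketch, the pair $(g, w)$ solves the linear system $w + g e = r + [P]w$ together with a normalization such as $w_1 = 0$. The two-state chain below is a made-up illustration, not data from any exercise here:

```python
import numpy as np

def policy_evaluation(P, r):
    """Solve w + g*e = r + P @ w for the scalar gain g and the
    relative-gain vector w (normalized so w[0] = 0), for a unichain
    stationary policy with transition matrix P and reward vector r."""
    n = len(r)
    # Unknowns: x = [g, w_1, ..., w_{n-1}], with w_0 fixed at 0.
    # Row i encodes: g + w_i - sum_j P[i, j] * w_j = r_i
    A = np.zeros((n, n))
    A[:, 0] = 1.0            # coefficient of g in every equation
    M = np.eye(n) - P        # coefficients of w
    A[:, 1:] = M[:, 1:]      # drop the column for w_0 (held at 0)
    x = np.linalg.solve(A, r)
    g = x[0]
    w = np.concatenate(([0.0], x[1:]))
    return g, w

# Hypothetical ergodic two-state example (illustration only):
P = np.array([[0.5, 0.5],
              [0.25, 0.75]])
r = np.array([1.0, 2.0])
g, w = policy_evaluation(P, r)
# Sanity check of the defining identity: r + P w = g e + w
assert np.allclose(r + P @ w, g + w)
```

For an ergodic chain this system is nonsingular, and $w$ is unique up to the chosen normalization; a different normalization shifts every component of $w$ by the same constant.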
It is known that a particular policy $k' = (k_1, k_2, k_3) = (2, 4, 1)$ is the unique optimal stationary policy (i.e., the gain per stage in steady state is maximized by always using decision 2 in state 1, decision 4 in state 2, and decision 1 in state 3). As usual, $r_i^{(k)}$ denotes the reward in state $i$ under decision $k$, and $P_{ij}^{(k)}$ denotes the probability of a transition to state $j$ given state $i$ and given the use of decision $k$ in state $i$. Consider the effect of changing the Markov decision problem in each of the following ways (the changes in each part are to be considered in the absence of the changes in the other parts):

a) $r_1^{(1)}$ is replaced by $r_1^{(1)} - 1$.

b) $r_1^{(2)}$ is replaced by $r_1^{(2)} + 1$.

c) $r_1^{(k)}$ is replaced by $r_1^{(k)} + 1$ for all state-1 decisions $k$.

d) For all $i$, $r_i^{(k_i)}$ is replaced by $r_i^{(k_i)} + 1$ for the decision $k_i$ of policy $k'$.

For each of the above changes, answer the following questions, giving explanations:

1) Is the gain per stage, $g'$, increased, decreased, or unchanged by the given change?

2) Is it possible that another policy, $k \neq k'$, is optimal after the given change?

Exercise 4.33. (The Odoni Bound) Let $k'$ be the optimal stationary policy for a Markov decision problem and let $g'$ and $\pi'$ be the corresponding gain and steady-state probability vector, respectively. Let $v_i^*(n, u)$ be the optimal dynamic expected reward for starting in state $i$ at stage $n$.

a) Show that $\min_i [v_i^*(n, u) - v_i^*(n-1, u)] \leq g' \leq \max_i [v_i^*(n, u) - v_i^*(n-1, u)]$ for $n \geq 1$. Hint: Consider premultiplying $v^*(n, u) - v^*(n-1, u)$ by $\pi'$ or by $\pi^{k}$, where $k$ is the optimal dynamic policy at stage $n$.

b) Show that the lower bound is non-decreasing in $n$, that the upper bound is non-increasing in $n$, and that both converge to $g'$ with increasing $n$.

Exercise 4.34. Consider a Markov decision problem with three states, $\{1, 2, 3\}$. For state 3, there are two decisions, with $r_3^{(1)} = r_3^{(2)} = 0$ and $P_{3,1}^{(1)} = P_{3,2}^{(2)} = 1$.
For state 1, there are two decisions, with $r_1^{(1)} = 0$, $r_1^{(2)} = -100$, and $P_{1,1}^{(1)} = P_{1,3}^{(2)} = 1$. For state 2, there are two decisions, with $r_2^{(1)} = 0$, $r_2^{(2)} = -100$, and $P_{2,2}^{(1)} = P_{2,3}^{(2)} = 1$.

a) Show that there are two ergodic unichain optimal stationary policies, one using decision 1 in states 1 and 3 and decision 2 in state 2. The other uses the opposite decision in each state.

b) Find the relative-gain vector for each of the above stationary policies.

c) Let $u$ be the final reward vector. Show that the first stationary policy above is the optimal dynamic policy in all stages if $u_1 \geq u_2 + 100$ and $u_3 \geq u_2 + 100$. Show that a non-unichain stationary policy is the optimal dynamic policy if $u_1 = u_2 = u_3$.

d) Theorem 4.13 implies that, under the conditions of the theorem, $\lim_{n \to \infty} [v_i^*(n, u) - v_j^*(n, u)]$ is independent of $u$. Show that this is not true under the conditions of this exercise.

Exercise 4.35. Assume that $k'$ is a unique optimal stationary policy and corresponds to an ergodic unichain (as in...
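A short dynamic-programming sketch illustrates the equal-final-reward claim in Exercise 4.34 c). It assumes the following reading of the exercise's data (a reconstruction, stated here explicitly): decision 1 self-loops in states 1 and 2 with reward 0; decision 2 moves states 1 and 2 to state 3 at reward −100; from state 3, decision 1 moves to state 1 and decision 2 moves to state 2, both at reward 0.

```python
# Deterministic transitions for the assumed 3-state MDP.
# succ[s][k] = next state, rew[s][k] = reward; states 0,1,2 stand for
# the book's states 1,2,3, and decisions 0,1 for the book's 1,2.
succ = [[0, 2], [1, 2], [0, 1]]
rew  = [[0, -100], [0, -100], [0, 0]]

def dp_step(v):
    """One backward dynamic-programming stage: returns the new value
    vector and the maximizing decision (book-style, 1 or 2) in each
    state; ties are broken in favor of decision 1."""
    new_v, decisions = [], []
    for s in range(3):
        cand = [rew[s][k] + v[succ[s][k]] for k in range(2)]
        k_best = 0 if cand[0] >= cand[1] else 1
        decisions.append(k_best + 1)
        new_v.append(cand[k_best])
    return new_v, decisions

u = [0, 0, 0]           # equal final rewards, u1 = u2 = u3
v = u
for n in range(5):
    v, dec = dp_step(v)
# With equal final rewards, every stage selects decision 1 in states
# 1 and 2 -- i.e. the non-unichain stationary choice of part c),
# since each of those states then self-loops as its own recurrent class.
```

Under this reading the values never change from the final reward vector, so the same decisions recur at every stage, which is the behavior part c) asks you to demonstrate.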

This note was uploaded on 09/27/2010 for the course EE 229, taught by Professor R. Srikant during the Spring '09 term at the University of Illinois, Urbana-Champaign.
