a) Find the gain per stage, g and g′, for stationary policies k and k′. Show that g = g′.
b) Find the relative gain vectors, w and w′, for stationary policies k and k′.
c) Suppose the final reward, at stage 0, is u_1 = 0, u_2 = u. For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 1?
d) For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 2? At stage n? You should find that (for this example) the dynamic programming algorithm uses the same decision at each stage n as it uses at stage 1.
e) Find the optimal gain v_2^*(n, u) and v_1^*(n, u) as a function of stage n assuming u = 10.

f) Find lim_{n→∞} v^*(n, u) and show how it depends on u.
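Parts c)–f) can be explored numerically with the backward dynamic-programming recursion v^*(n, u) = max over decisions of { r + [P] v^*(n−1, u) }, starting from v^*(0, u) = u. The sketch below uses a hypothetical two-state chain with two stationary policies; every reward and transition probability in it is an illustrative assumption, since the exercise's actual data is not reproduced above.

```python
import numpy as np

# Hypothetical two-policy, two-state Markov decision problem.
# All rewards and transition probabilities are made-up placeholders,
# NOT the data of this exercise.
P = {
    "k":  np.array([[0.9, 0.1], [0.1, 0.9]]),   # transitions under policy k
    "kp": np.array([[0.9, 0.1], [0.5, 0.5]]),   # transitions under policy k'
}
r = {
    "k":  np.array([1.0, 2.0]),    # expected reward per stage under k
    "kp": np.array([1.0, 1.5]),    # expected reward per stage under k'
}

def value_iteration(u, n_stages):
    """Backward DP: v*(n, u) = max over policies of r + P v*(n-1, u), v*(0, u) = u."""
    v = np.array(u, dtype=float)
    decisions_state2 = []          # decision chosen in state 2 at each stage
    for _ in range(n_stages):
        cand = {name: r[name] + P[name] @ v for name in P}
        decisions_state2.append("k" if cand["k"][1] >= cand["kp"][1] else "kp")
        v = np.maximum(cand["k"], cand["kp"])   # component-wise max over decisions
    return v, decisions_state2

# Final reward u1 = 0, u2 = u with u = 10, in the spirit of part e).
v, dec = value_iteration(u=[0.0, 10.0], n_stages=20)
```

With the exercise's real data, one would check (as part d asks) whether the recorded decision is the same at every stage.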
Exercise 4.31. Consider a Markov decision problem in which the stationary policies k and k′ each satisfy Bellman's equation, (4.60), and each correspond to ergodic Markov chains.
a) Show that if r^{k′} + [P^{k′}]w′ ≥ r^{k} + [P^{k}]w′ is not satisfied with equality, then g′ > g.
b) Show that r^{k′} + [P^{k′}]w′ = r^{k} + [P^{k}]w′ (Hint: use part a).

c) Find the relationship between the relative gain vector w^k for policy k and the relative gain vector w′ for policy k′. (Hint: Show that r^{k} + [P^{k}]w′ = g e + w′; what does this say about w and w′?)
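The relation in the hint is the defining equation of the relative gain vector, and it can be checked numerically for a single stationary policy: compute the steady-state vector π, the gain g = πr, and w from (I − [P])w = r − g e with one component pinned to 0, then verify r + [P]w = g e + w. The two-state chain below is an illustrative assumption, not this exercise's data.

```python
import numpy as np

# Illustrative ergodic chain for one stationary policy (made-up numbers).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([2.0, 5.0])

# Steady-state vector: solve pi P = pi with components summing to 1.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
g = pi @ r                      # gain per stage

# Relative gain vector w: solve (I - P) w = r - g*e with the normalization w[0] = 0.
M = np.eye(2) - P
M[0] = [1.0, 0.0]               # replace the (redundant) first equation by w[0] = 0
b = r - g
b[0] = 0.0
w = np.linalg.solve(M, b)

# The identity from the hint, specialized to one policy: r + P w = g*e + w.
lhs = r + P @ w
rhs = g + w
```

Because (I − [P]) is singular with e in its null space, w is determined only up to adding a multiple of e, which is exactly the ambiguity part c) asks about.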
e) Suppose that policy k uses decision 1 in state 1 and policy k′ uses decision 2 in state 1 (i.e., k_1 = 1 for policy k and k_1 = 2 for policy k′). What is the relationship between r_1^(k), P_{1,1}^(k), P_{1,2}^(k), ..., P_{1,J}^(k) for k equal to 1 and 2?
f) Now suppose that policy k uses decision 1 in each state and policy k′ uses decision 2 in each state. Is it possible that r_i^(1) > r_i^(2) for all i? Explain carefully.
g) Now assume that r_i^(1) is the same for all i. Does this change your answer to part f)? Explain.

Exercise 4.32. Consider a Markov decision problem with three states. Assume that each
stationary policy corresponds to an ergodic Markov chain. It is known that a particular
policy k′ = (k_1, k_2, k_3) = (2, 4, 1) is the unique optimal stationary policy (i.e., the gain per stage in steady state is maximized by always using decision 2 in state 1, decision 4 in state
2, and decision 1 in state 3). As usual, r_i^(k) denotes the reward in state i under decision k, and P_{ij}^(k) denotes the probability of a transition to state j given state i and given the use of decision k in state i. Consider the effect of changing the Markov decision problem in each
of the following ways (the changes in each part are to be considered in the absence of the
changes in the other parts):
a) r_1^(1) is replaced by r_1^(1) − 1.

b) r_1^(2) is replaced by r_1^(2) + 1.

c) r_1^(k) is replaced by r_1^(k) + 1 for all state-1 decisions k.

d) For all i, r_i^(k_i) is replaced by r_i^(k_i) + 1 for the decision k_i of policy k′.

For each of the above changes, answer the following questions; give explanations:
1) Is the gain per stage, g′, increased, decreased, or unchanged by the given change?
2) Is it possible that another policy, k ≠ k′, is optimal after the given change?
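For intuition about the changes above: the gain of a stationary policy is g = πr, where π is its steady-state vector, so a reward perturbation moves the gain through π. Adding 1 to the reward in every state raises the gain by exactly 1, while changing a single state's reward moves it by that state's steady-state probability. A quick sketch on a made-up three-state ergodic chain (illustrative numbers, not this exercise's data):

```python
import numpy as np

def steady_state(P):
    """Steady-state probabilities of an ergodic chain: solve pi P = pi, sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Made-up three-state ergodic chain standing in for the chain of some policy.
P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.2, 0.3]])
r = np.array([1.0, 4.0, 2.0])

pi = steady_state(P)
g = pi @ r                                               # gain per stage

g_all_plus_1 = pi @ (r + 1.0)                            # every reward raised by 1
g_state1_plus_1 = pi @ (r + np.array([1.0, 0.0, 0.0]))   # only state 1's reward raised
```

This only describes how the gain of a fixed policy moves; whether a different policy becomes optimal (question 2) additionally depends on how the change affects the competing policies.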
Exercise 4.33. (The Odoni Bound) Let k′ be the optimal stationary policy for a Markov decision problem and let g′ and π′ be the corresponding gain and steady-state probability, respectively. Let v_i^*(n, u) be the optimal dynamic expected reward for starting in state i at stage n.
a) Show that min_i [v_i^*(n, u) − v_i^*(n − 1, u)] ≤ g′ ≤ max_i [v_i^*(n, u) − v_i^*(n − 1, u)] for n ≥ 1. Hint: Consider premultiplying v^*(n, u) − v^*(n − 1, u) by π′ or by π^k, where k is the optimal dynamic policy at stage n.

b) Show that the lower bound is nondecreasing in n, that the upper bound is nonincreasing in n, and that both converge to g′ with increasing n.
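The behavior claimed in parts a) and b) is easy to observe numerically: run the backward recursion and track min_i and max_i of v^*(n, u) − v^*(n−1, u). The sketch below does this for a single stationary policy (so the max over decisions is trivial) on a made-up ergodic chain; the numbers are illustrative assumptions, not this exercise's data.

```python
import numpy as np

# Illustrative two-state ergodic chain; a full MDP would also take a
# max over decisions at each stage.
P = np.array([[0.6, 0.4],
              [0.2, 0.8]])
r = np.array([3.0, 1.0])

# Gain per stage g = pi . r, from the steady-state vector pi.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
g = pi @ r

v_prev = np.array([0.0, 0.0])      # final reward u = 0
lowers, uppers = [], []
for _ in range(30):
    v = r + P @ v_prev             # one backward DP stage
    diff = v - v_prev
    lowers.append(diff.min())      # Odoni lower bound at this stage
    uppers.append(diff.max())      # Odoni upper bound at this stage
    v_prev = v
```

Since v^*(n) − v^*(n−1) = [P](v^*(n−1) − v^*(n−2)) here, premultiplication by a stochastic matrix can only pull the components of the difference together, which is why the bounds are monotone and squeeze onto g.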
Exercise 4.34. Consider a Markov decision problem with three states, {1, 2, 3}. For state 3, there are two decisions, r_3^(1) = r_3^(2) = 0 and P_{3,1}^(1) = P_{3,2}^(2) = 1. For state 1, there are two decisions, r_1^(1) = 0, r_1^(2) = −100, and P_{1,1}^(1) = P_{1,3}^(2) = 1. For state 2, there are two decisions, r_2^(1) = 0, r_2^(2) = −100, and P_{2,1}^(1) = P_{2,3}^(2) = 1.

a) Show that there are two ergodic unichain optimal stationary policies, one using decision
1 in states 1 and 3 and decision 2 in state 2. The other uses the opposite decision in each
state.
b) Find the relative gain vector for each of the above stationary policies.
c) Let u be the final reward vector. Show that the first stationary policy above is the optimal dynamic policy in all stages if u_1 ≥ u_2 + 100 and u_3 ≥ u_2 + 100. Show that a non-unichain stationary policy is the optimal dynamic policy if u_1 = u_2 = u_3.
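One possible encoding of this exercise's data is sketched below; treat it as an assumption (decision 1 in state 1 self-loops at reward 0, decision 2 in state 1 moves to state 3 at reward −100, and so on). Part c) can then be explored by running the backward recursion from a final reward vector with u_1 ≥ u_2 + 100 and u_3 ≥ u_2 + 100 and watching which decision is selected in each state at each stage.

```python
import numpy as np

# One possible encoding of the exercise's data (an assumption; indices 0, 1, 2
# stand for states 1, 2, 3). decisions[i] lists (reward, next_state) pairs
# for decisions 1 and 2 in state i+1.
decisions = [
    [(0.0, 0), (-100.0, 2)],   # state 1: dec. 1 stays in 1; dec. 2 -> state 3
    [(0.0, 0), (-100.0, 2)],   # state 2: dec. 1 -> state 1; dec. 2 -> state 3
    [(0.0, 0), (0.0, 1)],      # state 3: dec. 1 -> state 1; dec. 2 -> state 2
]

def dp_stage(v):
    """One backward DP stage: in each state pick the decision maximizing r + v[next]."""
    new_v = np.empty(3)
    choice = []
    for i, opts in enumerate(decisions):
        vals = [rew + v[nxt] for rew, nxt in opts]
        best = int(np.argmax(vals))
        new_v[i] = vals[best]
        choice.append(best + 1)        # decisions are numbered 1 and 2
    return new_v, choice

u = np.array([200.0, 0.0, 200.0])      # satisfies u1 >= u2 + 100 and u3 >= u2 + 100
v = u.copy()
for _ in range(10):
    v, choice = dp_stage(v)
```

Under this encoding and this u, the selected decisions stabilize after the first stage; trying other final reward vectors (e.g., u_3 much larger than u_1) changes the selection, which is the dependence on u that this exercise probes.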
d) Theorem 4.13 implies that, under the conditions of the theorem, lim_{n→∞} [v_i^*(n, u) − v_j^*(n, u)] is independent of u. Show that this is not true for the conditions of this exercise.

Exercise 4.35. Assume that k′ is a unique optimal stationary policy and corresponds to
an ergodic unichain (as in...
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at the University of Illinois, Urbana-Champaign.