Markov Decision Processes
CS 221 Section 6
October 30, 2009

Today we will discuss several sample MDP problems. The solutions are included here, so you can work through them on your own if you like.

1. MDPs with Random Stopping Times

Suppose we have a Markov Decision Process (MDP) $M = (S, A, P_{sa}, \gamma, R)$, where $S$ is a discrete state space with $n$ states, and the rewards are discounted by a factor $\gamma$. (Recall that $P_{sa}(s')$ is the "transition model" and $R(s)$ is the reward function.)

We can view this process as a game where we begin in some state $s \in S$ and take turns selecting actions and transitioning to new states, accumulating rewards along the way. At the $n$th turn, we first receive some (discounted) reward, $\gamma^n R(s)$, for the current state $s$. Then we select an action $a \in A$ and transition, randomly, to a new state $s'$ according to the probabilities $P_{sa}(s')$. Since the discount factor is $\gamma < 1$, our rewards become smaller and smaller as the game goes on. (Hence, the optimal strategy will try to accumulate big rewards early.)

Now consider a slight modification of this game. At the start of each turn we receive an undiscounted reward, $R(s)$, and then flip a biased coin that lands heads with probability $\epsilon$, $0 < \epsilon \le 1$. If the coin lands heads, then the game is stopped and we are left with whatever reward we have accumulated thus far. Otherwise, we choose our action and transition to the next state according to $P_{sa}$, as usual.

We will now show that this new game can be expressed as an MDP. In addition, we'll show that the value of this game (i.e., the largest reward we expect to gain from playing it) is equivalent to the discounted reward in the original MDP, $M$.

Define a new MDP, $\tilde{M} = (\tilde{S}, A, \tilde{P}_{sa}, 1, \tilde{R})$. This MDP has the same action space as $M$, but the discount factor is 1, and we have a different state space, transition model, and reward function.
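A sketch of why the two games can agree, under the assumption (not stated explicitly above) that the coin bias and the discount factor are related by $\gamma = 1 - \epsilon$: fix any policy and let $s_0, s_1, \dots$ be the states it would visit if the game were never stopped. In the stopping game, the reward $R(s_n)$ is collected only if the first $n$ coin flips all land tails, which happens with probability $(1-\epsilon)^n$, independently of the states visited. Hence

\[
\mathbb{E}\bigl[\text{total reward in the stopping game}\bigr]
= \sum_{n=0}^{\infty} (1-\epsilon)^n \, \mathbb{E}\bigl[R(s_n)\bigr]
= \sum_{n=0}^{\infty} \gamma^n \, \mathbb{E}\bigl[R(s_n)\bigr],
\]

which is the expected discounted reward of the same policy in $M$.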
We'll construct the MDP $\tilde{M}$ so that it is just like the MDP $M$, but with some modifications to include the coin-flipping rules defined above. In particular, we're going to add a new state called the "sink" state, which we'll denote $e$. If the coin toss comes up heads, then we'll transition, always, to this state and remain there forever (accumulating 0 reward each turn). If the coin toss is tails, then we'll just transition according to $P_{sa}$ as before, with no chance of entering the sink state $e$.

(a) Complete the construction by specifying explicitly $\tilde{S}$, $\tilde{P}_{sa}$, and $\tilde{R}$ for the new MDP, $\tilde{M}$.

Answer: Let $\tilde{S} = S \cup \{e\}$, where $e$ is the new sink state.

Now, let's assume we're in a state $s \in S$ (i.e., $s \neq e$). Then we have:

\[
\tilde{P}_{sa}(s' \mid \text{heads}) =
\begin{cases}
1 & \text{if } s' = e, \\
0 & \text{if } s' \in S,
\end{cases}
\]

and

\[
\tilde{P}_{sa}(s' \mid \text{tails}) =
\begin{cases}
0 & \text{if } s' = e, \\
P_{sa}(s') & \text{if } s' \in S.
\end{cases}
\]

...
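The construction can be checked numerically. The following is a minimal sketch, not the handout's own solution: it assumes the identification $\gamma = 1 - \epsilon$, combines the heads and tails cases into a single transition model (probability $\epsilon$ of entering $e$, probability $(1-\epsilon) P_{sa}(s')$ of reaching $s' \in S$), and sets the sink reward to 0. The helper names `value_iteration`, `add_sink`, and `random_mdp` are hypothetical, and the test MDP is randomly generated purely for illustration.

```python
import random


def value_iteration(P, R, gamma, iters=2000):
    """Run `iters` sweeps of V(s) = R(s) + gamma * max_a sum_s' P[a][s][s'] V(s').

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    and R[s] is the reward for state s. V starts at all zeros.
    """
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [
            R[s] + gamma * max(
                sum(P[a][s][s2] * V[s2] for s2 in range(n))
                for a in range(len(P))
            )
            for s in range(n)
        ]
    return V


def add_sink(P, R, eps):
    """Build the modified MDP on S plus a sink state e (stored as the last index).

    Assumption: from any s in S, the coin sends us to e with probability eps;
    otherwise we follow the original transition model. The sink is absorbing
    and has reward 0.
    """
    n = len(R)
    P_t = []
    for Pa in P:
        rows = [[(1.0 - eps) * p for p in row] + [eps] for row in Pa]
        rows.append([0.0] * n + [1.0])  # e always transitions to itself
        P_t.append(rows)
    R_t = list(R) + [0.0]
    return P_t, R_t


def random_mdp(n_states, n_actions, seed=0):
    """Generate an arbitrary small MDP for testing (illustrative only)."""
    rng = random.Random(seed)
    P = []
    for _ in range(n_actions):
        Pa = []
        for _ in range(n_states):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            Pa.append([x / total for x in w])
        P.append(Pa)
    R = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]
    return P, R


eps = 0.1
P, R = random_mdp(n_states=4, n_actions=2)

# Original MDP, discounted with gamma = 1 - eps (assumed identification).
V = value_iteration(P, R, gamma=1.0 - eps)

# Modified MDP with the sink state, undiscounted.
P_t, R_t = add_sink(P, R, eps)
V_t = value_iteration(P_t, R_t, gamma=1.0)

# Largest disagreement between the two value functions on the states of S.
max_gap = max(abs(a - b) for a, b in zip(V, V_t))
print("max gap on S:", max_gap)
print("value of sink:", V_t[-1])
```

Under these assumptions the two value functions should agree on every state of $S$ up to floating-point error, and the sink state should have value 0. Even though the modified MDP uses a discount factor of 1, its value iteration stays bounded because every non-sink state leaks probability $\epsilon$ into the zero-reward sink on each step.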
This note was uploaded on 12/15/2009 for the course CS 221 (Artificial Intelligence, '09, Koller, Ng) at Stanford.
