Markov Decision Processes
CS 221 Section 6
October 30, 2009

Today we will discuss several sample MDP problems. The solutions are included here, so you can work through them on your own if you like.

1. MDPs with Random Stopping Times

Suppose we have a Markov Decision Process (MDP) M = (S, A, P_sa, γ, R), where S is a discrete state space with n states and the rewards are discounted by a factor γ. (Recall that P_sa(s') is the "transition model", the probability of landing in state s' after taking action a from state s, and R(s) is the reward function.) We can view this process as a game where we begin in some state s ∈ S and take turns selecting actions and transitioning to new states, accumulating rewards along the way. At the n-th turn, we first receive a (discounted) reward, γ^n R(s), for the current state s. Then we select an action a ∈ A and transition randomly to a new state s' according to the probabilities P_sa(s'). Since the discount factor is γ < 1, our rewards become smaller and smaller as the game goes on. (Hence, the optimal strategy will try to accumulate big rewards early.)

Now consider a slight modification of this game. At the start of each turn we receive an undiscounted reward, R(s), and then flip a biased coin that lands heads with probability ε, 0 < ε ≤ 1. If the coin lands heads, the game stops and we are left with whatever reward we have accumulated so far. Otherwise, we choose our action and transition to the next state according to P_sa, as usual.

We will now show that this new game can be expressed as an MDP. In addition, we'll show that the value of this game (i.e., the largest reward we expect to gain from playing it) is equal to the discounted reward in the original MDP, M.

Define a new MDP, M̃ = (S̃, A, P̃_sa, 1, R̃). This MDP has the same action space as M, but the discount factor is 1, and it has a different state space, transition model, and reward function. We'll construct the MDP M̃ so that it is just like the MDP M, but with some modifications to include the coin-flipping rules defined above. In particular, we're going to add a new state called the "sink" state, which we'll denote e. If the coin toss comes up heads, then we always transition to this state and remain there forever (accumulating 0 reward each turn). If the coin toss is tails, then we simply transition according to P_sa as before, with no chance of entering the sink state e.

(a) Complete the construction by specifying explicitly S̃, P̃_sa, and R̃ for the new MDP, M̃.

Answer: Let S̃ = S ∪ {e}, where e is the new sink state. Now assume we are in a state s ∈ S (i.e., s ≠ e). Then we have

    P̃_sa(s' | heads) = 1 if s' = e,         0 if s' ∈ S,

and

    P̃_sa(s' | tails) = 0 if s' = e,         P_sa(s') if s' ∈ S.

...
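To make the equivalence concrete, here is a small numerical check. This is a sketch that is not part of the original handout: the two-state MDP, its transition probabilities and rewards, and the function names are all illustrative assumptions. It computes the optimal discounted value of a toy MDP M by value iteration, then Monte-Carlo simulates the coin-flip game, assuming (as the construction suggests) a stopping probability ε = 1 - γ, which is the choice under which the two values should coincide.

import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[a, s, s2] = probability of moving from state s to state s2 under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
R = np.array([1.0, 2.0])      # R(s): reward collected at the start of a turn in state s
gamma = 0.9                   # discount factor of the original MDP M
eps = 1.0 - gamma             # stopping probability; the equivalence assumes eps = 1 - gamma

def value_iteration(P, R, gamma, iters=2000):
    # Optimal values of M: V(s) = R(s) + gamma * max_a sum_{s'} P_sa(s') V(s').
    V = np.zeros(len(R))
    for _ in range(iters):
        V = R + gamma * np.max(P @ V, axis=0)
    return V

def stopping_game_value(P, R, eps, policy, start, episodes=50000, seed=0):
    # Monte-Carlo estimate of the coin-flip game: collect the undiscounted
    # reward R(s) each turn, then stop (i.e., fall into the sink e) with probability eps.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        s = start
        while True:
            total += R[s]
            if rng.random() < eps:                        # heads: game over
                break
            s = rng.choice(len(R), p=P[policy[s], s])     # tails: transition as in M
    return total / episodes

V = value_iteration(P, R, gamma)
policy = np.argmax(P @ V, axis=0)                         # greedy policy w.r.t. V
print("discounted values in M :", V)
print("coin-flip game values  :", [round(stopping_game_value(P, R, eps, policy, s), 3)
                                   for s in range(len(R))])

The two printed vectors should agree up to Monte-Carlo noise: surviving t turns of the coin-flip game has probability (1 - ε)^t = γ^t, so the undiscounted rewards collected before stopping have the same expectation as the γ-discounted rewards in M.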