CS 221 Exercise Set #5

1. MDPs with Random Stopping Times

Suppose we have a Markov Decision Process (MDP) $M = (S, A, P_{sa}, \gamma, R)$, where $S$ is a discrete state space with $n$ states, and the rewards are discounted by a factor $\gamma$. (Recall that $P_{sa}(s')$ is the transition model and $R(s)$ is the reward function.) We can view this process as a game where we begin in some state $s \in S$ and take turns selecting actions and transitioning to new states, accumulating rewards along the way. At the $n$th turn, we first receive some (discounted) reward, $\gamma^n R(s)$, for the current state $s$. Then we select an action $a \in A$ and transition, randomly, to a new state $s'$ according to the probabilities $P_{sa}(s')$. Since the discount factor is $\gamma < 1$, our rewards become smaller and smaller as the game goes on. (Hence, the optimal strategy will try to accumulate big rewards early.)

Now consider a slight modification of this game. At the start of each turn we receive an undiscounted reward, $R(s)$, and then flip a biased coin that lands heads with probability $\epsilon$, $\epsilon < 1$. If the coin lands heads, then the game is stopped and we are left with whatever reward we have accumulated thus far. Otherwise, we choose our action and transition to the next state according to $P_{sa}$, as usual.

We will now show that this new game can be expressed as an MDP. In addition, we'll also show that the value of this game (i.e., the largest total reward we can expect to gain from playing it) is equivalent to the discounted reward in the original MDP, $M$ ....
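As a numerical sanity check of the claimed equivalence, the sketch below runs value iteration on a small made-up MDP in two ways: once with a discount factor, and once with no discounting but with an extra absorbing, zero-reward state that is entered with the coin's stopping probability. It is only an illustration, not the exercise's solution. The two-state MDP, the state name `END`, the helper names `value_iteration` and `add_stopping`, and the choice of stopping probability equal to one minus the discount factor are all assumptions made here.

```python
def value_iteration(P, R, gamma, iters=2000):
    """Approximate optimal values by repeated Bellman backups.

    P[s][a] is a dict {next_state: probability}; R[s] is the reward in s.
    """
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[s][a].items()) for a in P[s]
            )
            for s in P
        }
    return V


def add_stopping(P, R, stop_prob, end="END"):
    """Build the stopping game as an MDP (hypothetical construction).

    Every transition is scaled by (1 - stop_prob), and the remaining
    probability mass goes to an absorbing state `end` with zero reward.
    """
    P2 = {}
    for s, actions in P.items():
        P2[s] = {}
        for a, dist in actions.items():
            scaled = {s2: (1.0 - stop_prob) * p for s2, p in dist.items()}
            scaled[end] = stop_prob
            P2[s][a] = scaled
    P2[end] = {"stay": {end: 1.0}}  # once stopped, the game stays stopped
    R2 = dict(R)
    R2[end] = 0.0
    return P2, R2


# A made-up two-state, two-action MDP, used only for illustration.
P = {
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s0": 0.2, "s1": 0.8}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s1": 1.0}},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9

# Original game: discounted rewards.
V_discounted = value_iteration(P, R, gamma)

# Modified game: undiscounted rewards, stopping with probability 1 - gamma
# (assumed relationship between the coin bias and the discount factor).
P_stop, R_stop = add_stopping(P, R, stop_prob=1.0 - gamma)
V_stopping = value_iteration(P_stop, R_stop, gamma=1.0)

for s in P:
    print(s, V_discounted[s], V_stopping[s])
```

Under these assumptions the two value functions agree on the original states up to floating-point error, because scaling every transition by the probability of continuing plays the same role in the Bellman backup as multiplying by the discount factor.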