SP10 cs188 lecture 10 -- MDPs II (2PP)

CS 188: Artificial Intelligence
Spring 2010
Lecture 10: MDPs
2/18/2010
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore

Announcements
▪ P2: Due tonight
▪ W3: Expectimax, utilities and MDPs: out tonight, due next Thursday.
▪ Online book: Sutton and Barto http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

Recap: MDPs
▪ Markov decision processes:
  ▪ States S
  ▪ Actions A
  ▪ Transitions P(s'|s,a) (or T(s,a,s'))
  ▪ Rewards R(s,a,s') (and discount γ)
  ▪ Start state s_0
▪ Quantities:
  ▪ Policy = map of states to actions
  ▪ Utility = sum of discounted rewards
  ▪ Values = expected future utility from a state
  ▪ Q-Values = expected future utility from a q-state
[Figure: lookahead tree with nodes labeled s, a, (s, a), (s, a, s'), s']

Recap MDP Example: Grid World
▪ The agent lives in a grid
▪ Walls block the agent’s path
▪ The agent’s actions do not always go as planned:
  ▪ 80% of the time, the action North takes the agent North (if there is no wall there)
  ▪ 10% of the time, North takes the agent West; 10% East
  ▪ If there is a wall in the direction the agent would have been taken, the agent stays put
▪ Small “living” reward each step
▪ Big rewards come at the end
▪ Goal: maximize sum of rewards
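
The noisy movement rule and the "sum of discounted rewards" utility from the recap can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the course: the 3x4 layout, the names `GRID`, `MOVES`, `SLIPS`, `is_free`, `move`, `transitions`, and `discounted_utility` are all hypothetical. It also assumes that the edge of the grid behaves like a wall and that South, East, and West slip sideways the same way the slide describes for North.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]  # (row, col); row 0 is the top row

# Hypothetical layout for illustration: '#' is a wall, '.' is a free cell.
GRID: List[str] = [
    "....",
    ".#..",
    "....",
]

# Unit moves as (d_row, d_col). North decreases the row index.
MOVES: Dict[str, Tuple[int, int]] = {
    "N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1),
}

# The two sideways directions each action can slip into (assumed symmetric).
SLIPS: Dict[str, Tuple[str, str]] = {
    "N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N"),
}


def is_free(state: State) -> bool:
    """True if the cell is inside the grid and not a wall (assumption: edges block)."""
    r, c = state
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "#"


def move(state: State, direction: str) -> State:
    """Deterministic move; the agent stays put if the target cell is blocked."""
    dr, dc = MOVES[direction]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if is_free(nxt) else state


def transitions(state: State, action: str) -> Dict[State, float]:
    """Return P(s' | s, a): 80% intended direction, 10% for each sideways slip."""
    left, right = SLIPS[action]
    probs: Dict[State, float] = {}
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = move(state, direction)
        # Several outcomes can land on the same cell (e.g. bumping into walls).
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def discounted_utility(rewards: List[float], gamma: float) -> float:
    """Utility of one reward sequence: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For instance, `transitions((2, 0), "N")` puts probability 0.8 on the cell above, 0.1 on staying put (the West slip hits the grid edge), and 0.1 on the cell to the East. Returning a dictionary keyed by successor state keeps the probabilities summed correctly when several outcomes leave the agent in the same cell.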

Why Not Search Trees?
▪ Why not solve with expectimax?


