SP11 cs188 lecture 9 -- MDPs II 6PP

CS 188: Artificial Intelligence, Spring 2011
Lecture 9: MDPs
2/16/2011
Pieter Abbeel, UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell, or Andrew Moore

Announcements
- Midterm: Tuesday, March 15, 5-8pm
- P2: due Friday, 4:59pm
- W3 (minimax, expectimax, and MDPs): out tonight, due Monday, February 28
- Online book: Sutton and Barto, http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

Outline
- Markov Decision Processes (MDPs)
  - Formalism
  - Value iteration
- Expectimax search vs. value iteration
  - Value iteration: no exponential blow-up with depth [cf. graph search vs. tree search]
  - Value iteration can handle infinite-duration games
- Policy evaluation and policy iteration

Reinforcement Learning
- Basic idea:
  - Receive feedback in the form of rewards
  - The agent's utility is defined by the reward function
  - The agent must learn to act so as to maximize expected rewards

Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Small "living" reward each step
- Big rewards come at the end
- Goal: maximize the sum of rewards
(A code sketch of this motion model follows below.)

Grid Futures
[Figure: lookahead trees comparing a deterministic grid world to the stochastic grid world, with branches labeled by the actions N, S, E, W.]
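The stochastic motion model above maps directly to code. Here is a minimal Python sketch, assuming states are (x, y) cells and walls are given as a set of blocked cells; the function and variable names are illustrative, not taken from the course's project code:

```python
# Illustrative sketch of the Grid World motion model described above.
# The 80/10/10 split and the "stay put when blocked" rule follow the slide;
# the grid representation and names here are made up for the example.

DELTAS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Perpendicular "slip" directions for each intended action.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def transition_probs(state, action, walls, width, height):
    """Return {next_state: probability} for taking `action` in `state`."""
    def move(s, d):
        dx, dy = DELTAS[d]
        nxt = (s[0] + dx, s[1] + dy)
        # Blocked by a wall or the grid boundary: the agent stays put.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            return s
        return nxt

    probs = {}
    for direction, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        nxt = move(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# Example: with a wall at (1, 1), going North from (1, 0) often leaves the agent in place.
print(transition_probs((1, 0), "N", walls={(1, 1)}, width=4, height=3))
# -> {(1, 0): 0.8, (0, 0): 0.1, (2, 0): 0.1}
```

Note that the slip probabilities are folded into a single distribution over successor cells, which is exactly the T(s, a, s') form used in the MDP definition that follows.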
Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - The probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of non-deterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions
(A value iteration sketch built on this T/R formalism appears after the High-Low example below.)

What is Markov about MDPs?
- Andrey Markov (1856-1922)
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means:
  P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, ..., S_0 = s_0) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t)

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy is one that maximizes expected utility if followed
  - A policy defines a reflex agent
[Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminal states s.]

Example Optimal Policies
[Figure: optimal policies for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01.]

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" ...
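To connect the T/R formalism, the value iteration item from the outline, and the notion of an optimal policy π*, here is a minimal, self-contained Python sketch. It is not the course's implementation; the MDP interface (a list of states, an actions(s) function, T(s, a) returning a successor distribution, R(s, a, s'), and the discount gamma) is an assumption chosen for illustration:

```python
# Minimal value iteration sketch over a generic T/R interface (illustrative,
# not the course's code). T(s, a) -> {s': prob}; R(s, a, s') -> reward.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Return (V, policy): converged state values and a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:          # terminal state: value stays 0
                continue
            # Bellman backup: best one-step lookahead value over actions.
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
                for a in acts
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Extract a greedy policy by one-step lookahead on the converged values.
    policy = {}
    for s in states:
        acts = actions(s)
        if acts:
            policy[s] = max(acts, key=lambda a: sum(
                p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items()))
    return V, policy

# Usage (with an MDP defined elsewhere, e.g. the Grid World model above):
#   V, pi = value_iteration(states, actions, T, R, gamma=0.9)
```

The extracted policy gives one action per state, matching the "policy defines a reflex agent" description: acting optimally requires only the current state, not the history, which is exactly the Markov property stated above.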