SP10 cs188 lecture 12 -- reinforcement learning II (2PP)

CS 188: Artificial Intelligence
Spring 2010
Lecture 12: Reinforcement Learning II
2/25/2010
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore

Announcements
▪ W3 Utilities: due tonight
▪ P3 Reinforcement Learning (RL):
  ▪ Out tonight, due Thursday next week
  ▪ You will get to apply RL to:
    ▪ Gridworld agent
    ▪ Crawler
    ▪ Pac-man

Reinforcement Learning
▪ Still assume a Markov decision process (MDP):
  ▪ A set of states s ∈ S
  ▪ A set of actions (per state) A
  ▪ A model T(s,a,s')
  ▪ A reward function R(s,a,s')
▪ Still looking for a policy π(s)
▪ New twist: don't know T or R
  ▪ I.e. don't know which states are good or what the actions do
  ▪ Must actually try actions and states out to learn

The Story So Far: MDPs and RL

Things we know how to do:
▪ If we know the MDP
  ▪ Compute V*, Q*, π* exactly
  ▪ Evaluate a fixed policy π
▪ If we don't know the MDP
  ▪ We can estimate the MDP then solve
  ▪ We can estimate V for a fixed policy π
  ▪ We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:
▪ Model-based DPs
  ▪ Value and policy iteration
  ▪ Policy evaluation
▪ Model-based RL
▪ Model-free RL:
  ▪ Value learning
  ▪ Q-learning
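The "estimate the MDP then solve" route (model-based RL) can be illustrated with a minimal sketch. It assumes experience arrives as (s, a, s', r) tuples and estimates T by normalized counts and R by averaged rewards. The function name `estimate_model`, the tuple layout, and the toy samples are hypothetical choices for illustration and do not come from the slides.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from observed (s, a, s', r) samples."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> s' -> times observed
    reward_sums = defaultdict(float)                # (s, a, s') -> sum of rewards seen

    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r

    T_hat, R_hat = {}, {}
    for (s, a), successors in counts.items():
        total = sum(successors.values())
        for s_next, n in successors.items():
            # Empirical transition frequency and mean observed reward
            T_hat[(s, a, s_next)] = n / total
            R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n
    return T_hat, R_hat

# Hypothetical experience gathered by trying actions in the world
samples = [
    ("A", "right", "B", 0.0),
    ("A", "right", "B", 0.0),
    ("A", "right", "A", -1.0),
    ("B", "right", "exit", 10.0),
]
T_hat, R_hat = estimate_model(samples)
```

Once `T_hat` and `R_hat` are available, they can be handed to any of the known-MDP techniques (value iteration, policy iteration) in place of the true T and R. The estimates are only as good as the coverage of the samples.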

Problems with TD Value Learning
▪ TD value learning is a model-free way to do policy evaluation
▪ However, if we want to turn values into a (new) policy, we're sunk:
    π(s) = argmax_a Q(s,a)
    Q(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ V(s') ]
  (γ is the discount factor; this one-step look-ahead needs T and R, which we don't know)
▪ Idea: learn Q-values directly
▪ Makes action selection model-free too!
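To make "learn Q-values directly" concrete, here is a minimal sketch of a tabular Q-learning update, the standard form Q(s,a) ← (1 − α) Q(s,a) + α [r + γ max_a' Q(s',a')]. The learning rate `alpha`, discount `gamma`, the helper names, and the toy transitions are assumptions for illustration and are not taken from the slides. Note that neither function touches T or R.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update from a single observed transition."""
    # Value of the best action in the next state under the current estimates;
    # never-updated entries (e.g. terminal states) default to 0.0
    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    sample = r + gamma * best_next
    # Running average: blend the old estimate with the new sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

def greedy_action(Q, s, actions):
    """Model-free action selection: pick the action with the highest Q-value."""
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)
actions = ["left", "right"]

# Hypothetical transitions observed while acting in the world
q_update(Q, "B", "right", 10.0, "exit", actions)  # 0.5*0 + 0.5*(10 + 0.9*0)   = 5.0
q_update(Q, "A", "right", 0.0, "B", actions)      # 0.5*0 + 0.5*(0 + 0.9*5.0) = 2.25

best = greedy_action(Q, "A", actions)
```

Because the policy is read off as an argmax over stored Q-values, no one-step look-ahead through a model is needed. This sketch leaves out exploration; in practice the transitions would come from an exploration policy rather than always acting greedily.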

