Lecture 12 - CS 188: Artificial Intelligence, Spring 2010

CS 188: Artificial Intelligence, Spring 2010
Lecture 12: Reinforcement Learning II
2/25/2010
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from Dan Klein, Stuart Russell, or Andrew Moore

Announcements
- W3 Utilities: due tonight
- P3 Reinforcement Learning (RL):
  - Out tonight, due Thursday next week
  - You will get to apply RL to:
    - Gridworld agent
    - Crawler
    - Pac-Man

Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: we don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try out actions and states in order to learn

The Story So Far: MDPs and RL
Things we know how to do:
- If we know the MDP
  - Compute V*, Q*, π* exactly
  - Evaluate a fixed policy π
- If we don't know the MDP
  - We can estimate the MDP, then solve it
  - We can estimate V for a fixed policy π (a TD update sketch follows below)
  - We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:
- Model-based DPs
  - Value and policy iteration
  - Policy evaluation
- Model-based RL
- Model-free RL:
  - Value learning
  - Q-learning
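The "estimate V for a fixed policy" item above is what temporal-difference (TD) value learning does, which the next slide refers back to. Below is a minimal sketch of the TD(0) update; the function names and the environment interface in the usage comment are illustrative placeholders, not code from the course.

from collections import defaultdict

def td_value_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the observed sample r + gamma * V(s')."""
    sample = r + gamma * V[s_next]
    V[s] = (1 - alpha) * V[s] + alpha * sample

# Hypothetical usage: follow the fixed policy pi, feed each observed
# transition (s, a, r, s') into the update, and V approaches V^pi.
# V = defaultdict(float)
# for s, a, r, s_next in experience_following(pi):
#     td_value_update(V, s, r, s_next)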
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation
- However, if we want to turn values into a (new) policy, we're sunk: extracting a policy from V requires a one-step lookahead through T(s,a,s') and R(s,a,s'), which we don't know
- Idea: learn Q-values directly (sketched after this slide)
- Makes action selection model-free too!
[Figure: one-step lookahead tree over s, a, (s,a), s']

Active Learning
- Full reinforcement learning
  - You don't know the transitions T(s,a,s')
  - You don't know the rewards R(s,a,s')
  - You can choose any actions you like
  - Goal: learn the optimal policy
  - … what value iteration did!
- In this case:
  - Learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning! You actually take actions in the world and find out what happens…
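The "learn Q-values directly" idea above can be sketched as follows. This is a generic sketch rather than the project's code: the epsilon-greedy helper illustrates one simple way to trade off exploration and exploitation, and all names here are placeholders.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit current Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Hypothetical usage: Q = defaultdict(float); at each step pick an action with
# epsilon_greedy, observe (r, s'), then call q_learning_update.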

Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
- Start with V
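The detour this slide sets up is the Q-value analogue of value iteration. For reference, a hedged sketch of the standard update (placeholder names; the model T and R are assumed known here, since this is a planning computation rather than learning):

def q_value_iteration_sweep(states, actions, T, R, Q, gamma=0.9):
    """One sweep of Q-value iteration:
    Q_{k+1}(s,a) = sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * max_{a'} Q_k(s',a')]."""
    Q_next = {}
    for s in states:
        for a in actions(s):
            Q_next[(s, a)] = sum(
                T(s, a, s2) * (R(s, a, s2)
                               + gamma * max((Q.get((s2, a2), 0.0) for a2 in actions(s2)),
                                             default=0.0))
                for s2 in states
            )
    return Q_next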