SP11 cs188 lecture 10 -- reinforcement learning 6PP

CS 188: Artificial Intelligence, Spring 2011
Lecture 10: Reinforcement Learning
2/23/2011
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell, or Andrew Moore

Announcements
§ W2 due on Monday at 5:29pm – in lecture or in 283 Soda Dropbox
§ W2 Half Credit Recovery Resubmission due on Wednesday at 5:29pm
§ P3 released, due Monday March 7 at 4:59pm
§ Recall: readings for the current material
  § Online book: Sutton and Barto

MDPs and RL Outline
§ Markov Decision Processes (MDPs)
  § Formalism
  § Value iteration
  § Expectimax Search vs. Value Iteration
  § Policy Evaluation and Policy Iteration
§ Reinforcement Learning
  § Model-based Learning
  § Model-free Learning
    § Direct Evaluation
    § Temporal Difference Learning
    § Q-Learning

MDPs Recap
§ Markov decision processes:
  § States S
  § Actions A
  § Transitions P(s'|s,a) (or T(s,a,s'))
  § Rewards R(s,a,s') (and discount γ)
  § Start state s0

MDP Example: Grid World
§ The agent lives in a grid
§ Walls block the agent's path
§ The agent's actions do not always go as planned:
  § 80% of the time, the action North takes the agent North (if there is no wall there)
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have been taken, the agent stays put
§ Rewards come at the end
§ Goal: maximize the sum of rewards

Value Iteration: V*_1
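
The grid-world dynamics above (80% intended move, 10% slip to each side, stay put when blocked) can be sketched as a small value-iteration program. The 4×3 layout, the wall position, and the ±1 exit rewards are assumptions drawn from the classic textbook example — the preview does not spell them out — and treating a terminal state's value as its exit reward is a simplification.

```python
# Value-iteration sketch for a grid world like the one described above.
# ASSUMPTIONS: the classic 4x3 layout with one wall at (1, 1) and exit
# rewards +1 at (0, 3) and -1 at (1, 3); these are not given in the slides.

GAMMA = 0.9           # discount
NOISE = 0.2           # 10% slip to each perpendicular direction
LIVING_REWARD = 0.0

ROWS, COLS = 3, 4
WALLS = {(1, 1)}                             # assumed wall position
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}     # assumed exit rewards

ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
# perpendicular slips for each intended action (left, right)
SLIPS = {'N': ('W', 'E'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('S', 'N')}

def states():
    return [(r, c) for r in range(ROWS) for c in range(COLS)
            if (r, c) not in WALLS]

def move(s, a):
    """Deterministic move; if blocked by a wall or the edge, stay put."""
    r, c = s
    dr, dc = ACTIONS[a]
    nxt = (r + dr, c + dc)
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return s
    return nxt

def transitions(s, a):
    """(probability, next_state) pairs: 80% intended, 10% each slip."""
    left, right = SLIPS[a]
    return [(1 - NOISE, move(s, a)),
            (NOISE / 2, move(s, left)),
            (NOISE / 2, move(s, right))]

def value_iteration(iters=100):
    V = {s: 0.0 for s in states()}           # V_0*(s) = 0
    for _ in range(iters):
        newV = {}
        for s in V:
            if s in TERMINALS:               # simplification: fixed exit value
                newV[s] = TERMINALS[s]
                continue
            # Bellman update: max over actions of expected reward-to-go
            newV[s] = max(
                sum(p * (LIVING_REWARD + GAMMA * V[s2])
                    for p, s2 in transitions(s, a))
                for a in ACTIONS)
        V = newV
    return V

V = value_iteration()
```

Running it, values are highest next to the +1 exit and decay with distance, matching the pictures on the V*_i slides.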

Value Iteration: V*_2

Value Iteration: V*_{i+1}

Value Iteration
§ Idea:
  § V_i*(s): the expected discounted sum of rewards accumulated when starting from state s and acting optimally for a horizon of i time steps
§ Value iteration:
  § Start with V_0*(s) = 0, which we know is right (why?)
  § Given V_i*, calculate the values for all states for horizon i+1:
      V*_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i*(s') ]
  § This is called a value update or Bellman update
  § Repeat until convergence
§ Theorem: will converge to unique optimal values
  § Basic idea: approximations get refined towards optimal values
  § Policy may converge long before values do

Example: Bellman Updates
§ Example: γ = 0.9, living reward = 0, noise = 0.2
§ The max happens for a = right; other actions not shown

Convergence
§ Define the max-norm: ||U|| = max_s |U(s)|
§ Theorem: for any two approximations U and V,
      ||U_{i+1} - V_{i+1}|| ≤ γ ||U_i - V_i||
  § I.e., any two distinct approximations must get closer to each other with each update, so, in particular, any approximation must get closer to the true values — value iteration converges
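
The contraction theorem above can be checked numerically: one Bellman update shrinks the max-norm distance between any two value estimates by at least a factor of γ. A minimal sketch, using a made-up two-state MDP (all transition probabilities and rewards below are invented purely for illustration):

```python
# Empirical check of the contraction property of the Bellman update:
# ||U_{i+1} - V_{i+1}|| <= gamma * ||U_i - V_i|| in the max-norm.
# The two-state MDP here is a hypothetical example, not from the slides.
import random

GAMMA = 0.9
S = ['a', 'b']
A = ['stay', 'switch']
# T[s][a] = list of (prob, next_state); R[s][a] = reward (made-up numbers)
T = {'a': {'stay': [(1.0, 'a')], 'switch': [(0.8, 'b'), (0.2, 'a')]},
     'b': {'stay': [(1.0, 'b')], 'switch': [(0.8, 'a'), (0.2, 'b')]}}
R = {'a': {'stay': 0.0, 'switch': 1.0},
     'b': {'stay': 0.5, 'switch': 0.0}}

def bellman_update(V):
    """One step: V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a) + gamma V_i(s')]."""
    return {s: max(sum(p * (R[s][a] + GAMMA * V[s2]) for p, s2 in T[s][a])
                   for a in A)
            for s in S}

def max_norm(U, V):
    return max(abs(U[s] - V[s]) for s in S)

random.seed(0)
U = {s: random.uniform(-10, 10) for s in S}   # two arbitrary estimates
V = {s: random.uniform(-10, 10) for s in S}
for _ in range(20):
    d_before = max_norm(U, V)
    U, V = bellman_update(U), bellman_update(V)
    # contraction: each update shrinks the gap by at least a factor of gamma
    assert max_norm(U, V) <= GAMMA * d_before + 1e-9
```

Since each update shrinks the gap by γ, after i updates the gap is at most γ^i times the initial gap — which is why both estimates are pulled toward the same unique fixed point, the true optimal values.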