SP11 cs188 lecture 11 -- reinforcement learning II 6PP

CS 188: Artificial Intelligence, Spring 2011
Lecture 11: Reinforcement Learning II (2/28/2011)
Pieter Abbeel, UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore

Announcements
- W2: due right now
  - Submission of a self-corrected copy for partial credit due Wednesday 5:29pm
- P3 (Reinforcement Learning): out, due Monday 4:59pm
  - You get to apply RL to: a Gridworld agent, the Crawler, and Pac-Man
- Recall: readings for the current material
  - Online book: Sutton and Barto, http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

MDPs and RL Outline
- Markov Decision Processes (MDPs)
  - Formalism
  - Value iteration
  - Expectimax search vs. value iteration
  - Policy evaluation and policy iteration
- Reinforcement Learning
  - Model-based learning
  - Model-free learning
    - Direct evaluation [performs policy evaluation]
    - Temporal difference learning [performs policy evaluation]
    - Q-learning [learns the optimal state-action value function Q*]
  - Exploration vs. exploitation

Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions A (per state)
  - A model T(s, a, s')
  - A reward function R(s, a, s')
- Still looking for a policy π(s)
- New twist: we don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try out actions and states to learn (a minimal code sketch of this sampling-only view follows below)

Example: Learning to Walk
- Videos: before learning (hand-tuned), one of many learning runs, after learning [after 1000 field traversals]
- [Kohl and Stone, ICRA 2004]
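The "new twist" above is that the agent never gets T or R directly; it only observes sampled transitions by acting. Below is a minimal Python sketch of that interface, using a made-up two-state MDP; the names UnknownMDP and collect_samples are illustrative, not part of the course's project code.

```python
# Minimal sketch of the RL setting above: the agent never sees T or R;
# it can only act and observe sampled (s', r) outcomes.
# UnknownMDP, collect_samples, and the toy tables below are illustrative.
import random

class UnknownMDP:
    """Stand-in environment: T and R exist internally but stay hidden."""
    def __init__(self, T, R, start):
        self._T, self._R, self.state = T, R, start

    def step(self, a):
        s = self.state
        outcomes = self._T[(s, a)]                    # hidden from the agent
        s_next = random.choices(list(outcomes),
                                weights=list(outcomes.values()))[0]
        r = self._R[(s, a, s_next)]
        self.state = s_next
        return s_next, r                              # all the agent ever sees

def collect_samples(env, policy, n_steps):
    """The learner's view: just a stream of (s, a, r, s') samples."""
    samples = []
    for _ in range(n_steps):
        s = env.state
        a = policy(s)
        s_next, r = env.step(a)
        samples.append((s, a, r, s_next))
    return samples

# Tiny two-state example (purely illustrative transition/reward tables).
T = {('A', 'go'): {'A': 0.5, 'B': 0.5}, ('B', 'go'): {'A': 1.0}}
R = {('A', 'go', 'A'): 0, ('A', 'go', 'B'): 1, ('B', 'go', 'A'): 0}
env = UnknownMDP(T, R, start='A')
print(collect_samples(env, policy=lambda s: 'go', n_steps=5))
```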
Model-Based Learning
- Idea:
  - Learn the model empirically through experience
  - Solve for values as if the learned model were correct
- Simple empirical model learning
  - Count outcomes for each (s, a)
  - Normalize to give an estimate of T(s, a, s')
  - Discover R(s, a, s') when we experience (s, a, s')
- Solving the MDP with the learned model
  - Value iteration, or policy iteration
- [Diagram: tree for a fixed policy π: s → (s, π(s)) → (s, π(s), s') → s']

Example: Learn the Model in Model-Based Learning
- Gridworld with exits worth +100 (at (4,3)) and -100 (at (4,2)); γ = 1
- Episodes:

  Episode 1:
  (1,1) up    -1
  (1,2) up    -1
  (1,2) up    -1
  (1,3) right -1
  (2,3) right -1
  (3,3) right -1
  (3,2) up    -1
  (3,3) right -1
  (4,3) exit +100
  (done)

  Episode 2:
  (1,1) up    -1
  (1,2) up    -1
  (1,3) right -1
  (2,3) right -1
  (3,3) right -1
  (3,2) up    -1
  (4,2) exit -100
  (done)

- Estimated transitions:
  T(<3,3>, right, <4,3>) = 1/3
  T(<2,3>, right, <3,3>) = 2/2

Model-based vs. Model-free
- Model-based RL
  - First act in the MDP and learn T, R
  - Then run value iteration or policy iteration with the learned T, R
  - Advantage: efficient use of data
  - Disadvantage: requires building a model for T, R
- Model-free RL
  - Bypass the need to learn T, R
  -
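As a concrete illustration of "count outcomes for each (s, a), then normalize," here is a minimal Python sketch that rebuilds the slide's estimates from the two episodes above. The helper name estimate_model is illustrative and not from the course codebase.

```python
# Sketch of simple empirical model learning: count outcomes per (s, a),
# normalize to estimate T(s, a, s'), and record R(s, a, s') as observed.
from collections import defaultdict

def estimate_model(episodes):
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = n
    rewards = {}                                     # rewards[(s, a, s')] = r
    for episode in episodes:
        for (s, a, r, s_next) in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T[(s, a, s_next)] = n / total
    return T, rewards

# The two episodes from the slide, written as (s, a, r, s') transitions.
ep1 = [((1,1),'up',-1,(1,2)), ((1,2),'up',-1,(1,2)), ((1,2),'up',-1,(1,3)),
       ((1,3),'right',-1,(2,3)), ((2,3),'right',-1,(3,3)),
       ((3,3),'right',-1,(3,2)), ((3,2),'up',-1,(3,3)),
       ((3,3),'right',-1,(4,3)), ((4,3),'exit',+100,'done')]
ep2 = [((1,1),'up',-1,(1,2)), ((1,2),'up',-1,(1,3)),
       ((1,3),'right',-1,(2,3)), ((2,3),'right',-1,(3,3)),
       ((3,3),'right',-1,(3,2)), ((3,2),'up',-1,(4,2)),
       ((4,2),'exit',-100,'done')]

T, R = estimate_model([ep1, ep2])
print(T[((3,3), 'right', (4,3))])   # 1/3, as on the slide
print(T[((2,3), 'right', (3,3))])   # 2/2 = 1.0
```

With T and R estimated this way, the "model-based" step on the slide is simply to run value iteration or policy iteration on the learned model as if it were correct.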