CS 188: Artificial Intelligence
Spring 2010
Lecture 12: Reinforcement Learning II
2/25/2010
Pieter Abbeel – UC Berkeley
Many slides over the course are adapted from Dan Klein, Stuart Russell, or Andrew Moore
Announcements
- W3 Utilities: due tonight
- P3 Reinforcement Learning (RL):
  - Out tonight, due Thursday next week
  - You will get to apply RL to:
    - Gridworld agent
    - Crawler
    - Pacman
Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s, a, s')
  - A reward function R(s, a, s')
- Still looking for a policy π(s)
- New twist: don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try out actions and states to learn
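The MDP ingredients above can be written down concretely. A minimal sketch of a tiny two-state MDP in Python (the state names, probabilities, and rewards here are illustrative, not from the slides):

```python
# A tiny illustrative MDP: two states, two actions.
# T[(s, a)] is a list of (next_state, probability) pairs; R gives rewards.
STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]

T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("warm", 1.0)],
}

R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("warm", "slow"): 1.0,
    ("warm", "fast"): -10.0,
}

# A policy π maps each state to an action:
policy = {"cool": "fast", "warm": "slow"}
```

In full reinforcement learning, the agent never sees the `T` and `R` tables directly; it only observes samples from them by acting.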
The Story So Far: MDPs and RL
- If we know the MDP:
  - Compute V*, Q*, π* exactly
  - Evaluate a fixed policy π
- If we don't know the MDP:
  - We can estimate the MDP, then solve it
  - We can estimate V for a fixed policy π
  - We can estimate Q*(s, a) for the optimal policy while executing an exploration policy
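The "estimate the MDP, then solve it" idea (model-based RL) can be sketched as counting observed transitions and averaging observed rewards. A minimal sketch, where the experience data at the bottom is made up for illustration:

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from observed (s, a, s', r) samples."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = n
    reward_sums = defaultdict(float)
    reward_counts = defaultdict(int)
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a, s2)] += r
        reward_counts[(s, a, s2)] += 1
    # Normalize counts into empirical transition probabilities.
    T_hat = {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in nexts.items()}
    # Average observed rewards per (s, a, s') triple.
    R_hat = {key: reward_sums[key] / reward_counts[key] for key in reward_sums}
    return T_hat, R_hat

# Made-up experience: from state A, action 'go' led to B twice and C once.
data = [("A", "go", "B", 1.0), ("A", "go", "B", 1.0), ("A", "go", "C", 0.0)]
T_hat, R_hat = estimate_mdp(data)
```

Once `T_hat` and `R_hat` are in hand, ordinary value iteration or policy iteration can be run on the estimated MDP.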
Techniques (things we know how to do):
- Model-based DPs:
  - Value and policy iteration
  - Policy evaluation
- Model-based RL
- Model-free RL:
  - Value learning
  - Q-learning
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation
- However, if we want to turn values into a (new) policy, we're sunk:
  - Idea: learn Q-values directly
  - Makes action selection model-free too!

(Diagram: backup from state s, through action a to the Q-state (s, a), then through outcome (s, a, s') to successor state s'.)
Active Learning
- Full reinforcement learning:
  - You don't know the transitions T(s, a, s')
  - You don't know the rewards R(s, a, s')
  - You can choose any actions you like
  - Goal: learn the optimal policy
  - … what value iteration did!
- In this case:
  - Learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning! You actually take actions in the world and find out what happens…
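One standard way to handle the exploration-vs-exploitation tradeoff is ε-greedy action selection: explore with a small probability ε, otherwise exploit the best action known so far. A minimal sketch (the ε value and the sample Q-table are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon, explore (random action);
    otherwise exploit (action with the highest current Q-value)."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit

Q = {(0, "left"): 0.2, (0, "right"): 0.7}
a = epsilon_greedy(Q, 0, ["left", "right"], epsilon=0.0)  # epsilon=0: pure exploitation
```

With ε = 0 the agent never explores; in practice ε is kept small and often decayed over time so the agent explores early and exploits later.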
Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
- Start with V₀(s) = 0
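The Q-value analogue of value iteration can be sketched as follows, assuming T and R are known. It applies the Bellman-style update Q(s,a) ← Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q(s',a') ] repeatedly; the tiny one-state MDP at the bottom is made up for illustration:

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=50):
    """Iterate Q(s,a) = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * max_{a'} Q(s',a'))."""
    Q = {(s, a): 0.0 for s in states for a in actions}  # start with Q_0 = 0
    for _ in range(iters):
        new_Q = {}
        for s in states:
            for a in actions:
                new_Q[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[(s, a)]
                )
        Q = new_Q
    return Q

# Made-up one-state MDP: 'stay' pays 1 per step, 'quit' pays 0; both stay in state 0.
states, actions = [0], ["stay", "quit"]
T = {(0, "stay"): [(0, 1.0)], (0, "quit"): [(0, 1.0)]}
R = {(0, "stay", 0): 1.0, (0, "quit", 0): 0.0}
Q = q_value_iteration(states, actions, T, R)
```

In this toy MDP, Q(0, 'stay') converges toward 1/(1 − γ) = 10. Q-learning approximates the same update from samples, without ever knowing T and R.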