CS 188: Artificial Intelligence
Spring 2011
Lecture 11: Reinforcement Learning II (2/28/2010)
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell, or Andrew Moore.

Announcements
- W2: due right now. Submission of a self-corrected copy for partial credit is due Wednesday 5:29pm.
- P3 (Reinforcement Learning): out, due Monday 4:59pm. You get to apply RL to a Gridworld agent, the Crawler, and Pacman.
- Recall the readings for the current material: the online book by Sutton and Barto,
  http://www.cs.ualberta.ca/~sutton/book/ebook/thebook.html

MDPs and RL Outline
- Markov Decision Processes (MDPs)
  - Formalism
  - Value iteration
  - Expectimax Search vs. Value Iteration
  - Policy Evaluation and Policy Iteration
- Reinforcement Learning
  - Model-based Learning
  - Model-free Learning
    - Direct Evaluation [performs policy evaluation]
    - Temporal Difference Learning [performs policy evaluation]
    - Q-Learning [learns optimal state-action value function Q*]
  - Exploration vs. exploitation

Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: don't know T or R
  - I.e., don't know which states are good or what the actions do
  - Must actually try out actions and states to learn

Example: learning to walk
- Before learning (hand-tuned)
- One of many learning runs
- After learning [after 1000 field traversals]
[Kohl and Stone, ICRA 2004]

Model-Based Learning
- Idea:
  - Learn the model empirically through experience
  - Solve for values as if the learned model were correct
- Simple empirical model learning (see the sketch after the example below):
  - Count outcomes for each s, a
  - Normalize to give an estimate of T(s,a,s')
  - Discover R(s,a,s') when we experience (s,a,s')
- Solving the MDP with the learned model:
  - Value iteration, or policy iteration

Example: Learn Model in Model-Based Learning
- Gridworld with exit rewards +100 at (4,3) and -100 at (4,2); γ = 1; every other step earns reward -1.
- Episodes observed while following the fixed policy π:
    Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)
    Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)
- Learned transition estimates:
    T((3,3), right, (4,3)) = 1 / 3
    T((2,3), right, (3,3)) = 2 / 2
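To make the counting-and-normalizing idea concrete, here is a minimal sketch of estimating T and R from logged episodes. It is not lecture code; the (s, a, s', r) episode format and the function name are assumptions for illustration.

```python
from collections import defaultdict

def estimate_model(episodes):
    """Empirically estimate T(s,a,s') and R(s,a,s') from observed transitions.

    episodes: list of episodes, each a list of (s, a, s_next, r) tuples.
    Returns (T, R): T[(s, a)] maps s_next to an empirical probability,
    R[(s, a, s_next)] is the observed reward.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    R = {}
    for episode in episodes:
        for s, a, s_next, r in episode:
            counts[(s, a)][s_next] += 1              # count outcomes for each s, a
            R[(s, a, s_next)] = r                    # discover R when experienced
    T = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        T[sa] = {s_next: n / total for s_next, n in outcomes.items()}  # normalize
    return T, R

# The two gridworld episodes above, written as (s, a, s', r) tuples:
ep1 = [((1,1),'up',(1,2),-1), ((1,2),'up',(1,2),-1), ((1,2),'up',(1,3),-1),
       ((1,3),'right',(2,3),-1), ((2,3),'right',(3,3),-1), ((3,3),'right',(3,2),-1),
       ((3,2),'up',(3,3),-1), ((3,3),'right',(4,3),-1), ((4,3),'exit','done',100)]
ep2 = [((1,1),'up',(1,2),-1), ((1,2),'up',(1,3),-1), ((1,3),'right',(2,3),-1),
       ((2,3),'right',(3,3),-1), ((3,3),'right',(3,2),-1), ((3,2),'up',(4,2),-1),
       ((4,2),'exit','done',-100)]
T, R = estimate_model([ep1, ep2])
print(T[((3,3), 'right')])   # {(3,2): 2/3, (4,3): 1/3}
```

Once T and R are estimated, the learned MDP can be handed to value iteration or policy iteration exactly as in the planning lectures.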
Model-based vs. Model-free
- Model-based RL:
  - First act in the MDP and learn T, R
  - Then run value iteration or policy iteration with the learned T, R
  - Advantage: efficient use of data
  - Disadvantage: requires building a model for T, R
- Model-free RL:
  - Bypass the need to learn T, R
  - Methods to evaluate a fixed policy without knowing T, R:
    (i) Direct Evaluation
    (ii) Temporal Difference Learning
  - Method to learn π*, Q*, V* without knowing T, R:
    (iii) Q-Learning

Direct Evaluation
- Repeatedly execute the policy π
- Estimate the value of state s as the average, over all times s was visited, of the sum of discounted rewards accumulated from state s onwards (see the sketch after the example below)

Example: Direct Evaluation
- Same two gridworld episodes as above (γ = 1, step reward R = -1):
    V(2,3) ≈ (96 + -103) / 2 = -3.5
    V(3,3) ≈ (99 + 97 + -102) / 3 = 31.3
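A minimal sketch of direct evaluation on logged episodes (the function and variable names are assumptions, not lecture code): for every visit to a state, accumulate the discounted reward from that point to the end of the episode, then average the accumulated returns per state.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V^pi(s) as the average return observed from each visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Scan the episode backwards, accumulating the discounted return.
        for s, a, s_next, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Reusing ep1 and ep2 from the model-based learning sketch above:
V = direct_evaluation([ep1, ep2])
print(round(V[(2, 3)], 1), round(V[(3, 3)], 1))   # -3.5  31.3
```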
Limitations of Direct Evaluation
- Assume a random initial state.
- Assume the value of state (1,2) is known perfectly based on past runs.
- Now, for the first time, we encounter (1,1): can we do better than estimating V(1,1) as the reward outcome of that single run?
- Direct evaluation ignores the connections between states: the run passes through (1,2), whose value we already know, yet that knowledge is never used. Temporal-difference learning, next, exploits exactly those connections.
Sample-Based Policy Evaluation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!)
- Sample of V(s): sample_i = R(s, π(s), s_i') + γ V^π(s_i'), and estimate V^π(s) as the average of the samples.
- Almost! (i) We will only be in state s once and then land in some s', so we have only one sample; do we have to keep all samples around? (ii) Where do we get the value for s'?

Temporal-Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience (s, a, s', r)
  - Likely outcomes s' will contribute updates more often
- Temporal difference learning of values:
  - Policy still fixed!
  - Move values toward the value of whatever successor occurs: running average (see the sketch after the recap below)
      sample = R(s, π(s), s') + γ V^π(s')
      V^π(s) ← (1 − α) V^π(s) + α · sample

Exponential Moving Average
- Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample
- Same update, rewritten: V^π(s) ← V^π(s) + α · (sample − V^π(s))
- Exponential moving average:
  - Makes recent samples more important
  - Forgets about the past (distant past values were wrong anyway)
  - Easy to compute from the running average
- Decreasing learning rate can give converging averages

Policy evaluation when T (and R) unknown: recap
- Model-based:
  - Learn the model empirically through experience
  - Solve for values as if the learned model were correct
- Model-free:
  - Direct evaluation: V(s) = sample estimate of the sum of rewards accumulated from state s onwards
  - Temporal difference (TD) value learning: move values toward the value of whatever successor occurs (running average)
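A minimal sketch of the TD value update above (the dictionary representation of V and the default value of 0 for unseen states are assumptions for illustration):

```python
def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    """One temporal-difference update of V(s) after experiencing (s, a, s', r).

    Moves V(s) toward the sample r + gamma * V(s') by a step of size alpha,
    i.e. an exponential moving average of the observed samples.
    """
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V

# Usage: follow the fixed policy and feed every observed transition to the update.
V = {}
for s, s_next, r in [((1, 1), (1, 2), -1), ((1, 2), (1, 3), -1)]:
    td_update(V, s, s_next, r)
```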
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation.
- However, if we want to turn the learned values into a (new) policy, we're sunk: choosing the best action requires a one-step lookahead,
    π(s) = argmax_a Σ_s' T(s,a,s') [R(s,a,s') + γ V(s')],
  which needs T and R.
- Idea: learn Q-values directly.
- Makes action selection model-free too: π(s) = argmax_a Q(s,a).

Active Learning
- Full reinforcement learning:
  - You don't know the transitions T(s,a,s')
  - You don't know the rewards R(s,a,s')
  - You can choose any actions you like
  - Goal: learn the optimal policy (what value iteration did!)
- In this case:
  - Learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning! You actually take actions in the world and find out what happens…
Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
  - Start with V_0(s) = 0, which we know is right (why?)
  - Given V_i, calculate the values for all states for depth i+1:
      V_{i+1}(s) ← max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V_i(s')]
- But Q-values are more useful!
  - Start with Q_0(s,a) = 0, which we know is right (why?)
  - Given Q_i, calculate the Q-values for all Q-states for depth i+1:
      Q_{i+1}(s,a) ← Σ_s' T(s,a,s') [R(s,a,s') + γ max_{a'} Q_i(s',a')]

Q-Learning
- Q-Learning: sample-based Q-value iteration. Learn Q*(s,a) values.
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s,a)
  - Consider your new sample estimate: sample = R(s,a,s') + γ max_{a'} Q(s',a')
  - Incorporate the new estimate into a running average:
      Q(s,a) ← (1 − α) Q(s,a) + α · sample
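A minimal sketch of the tabular Q-learning update above (the dictionary representation, the actions_in helper, and the default of 0 are assumptions for illustration):

```python
def q_learning_update(Q, s, a, s_next, r, actions_in, alpha=0.5, gamma=0.9):
    """One Q-learning update after receiving the sample (s, a, s', r).

    Q: dict mapping (state, action) -> value.
    actions_in: function mapping a state to its legal actions.
    """
    # Value of the best action from s' under the current estimate (0 at terminals).
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_in(s_next)),
                    default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q

# Usage on a toy two-state chain: from state 0, 'go' reaches state 1 with reward 1.
Q = {}
q_learning_update(Q, 0, 'go', 1, 1.0, actions_in=lambda s: ['go', 'stay'])
```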
Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough
  - … but not decrease it too quickly!
  - Basically doesn't matter how you select actions (!)
- Neat property: off-policy learning, i.e. you learn the optimal policy without following it

Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy; see the sketch below)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
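A minimal sketch of ε-greedy action selection over a Q-table (the names and dictionary representation are illustrative assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))    # exploit

# Usage inside the learning loop:
# a = epsilon_greedy(Q, s, actions_in(s), epsilon=0.1)
```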
Exploration Functions
- When to explore?
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function: takes a value estimate and a visit count, and returns an optimistic utility, e.g. f(u, n) = u + k/n for some constant k (exact form not important)
- The regular Q-learning update
    Q_{i+1}(s, a) ← (1 − α) Q_i(s, a) + α [R(s, a, s') + γ max_{a'} Q_i(s', a')]
  now becomes
    Q_{i+1}(s, a) ← (1 − α) Q_i(s, a) + α [R(s, a, s') + γ max_{a'} f(Q_i(s', a'), N(s', a'))]
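A minimal sketch of the modified update above, plugging a count-based exploration function into the Q-learning target. The u + k/n bonus, the choice to count an unvisited q-state as n = 1, and all names are illustrative assumptions.

```python
def exploration_value(Q, N, s, a, k=1.0):
    """Optimistic utility f(u, n) = u + k / n for q-state (s, a)."""
    u = Q.get((s, a), 0.0)
    n = N.get((s, a), 0) + 1          # treat an unvisited q-state as n = 1
    return u + k / n

def q_update_with_exploration(Q, N, s, a, s_next, r, actions_in,
                              alpha=0.5, gamma=0.9, k=1.0):
    """Q-learning update whose target uses f(Q, N) instead of the raw Q."""
    N[(s, a)] = N.get((s, a), 0) + 1
    best_next = max((exploration_value(Q, N, s_next, a2, k)
                     for a2 in actions_in(s_next)), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q, N
```

Rarely tried q-states look better than their current estimates, so the agent is drawn toward them until their counts grow and the bonus fades.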
Q-Learning
- Q-learning produces tables of Q-values, one entry per (state, action) pair.

The Story So Far: MDPs and RL
- Things we know how to do:
  - We can solve small MDPs exactly, offline
  - We can estimate values V^π(s) directly for a fixed policy π
  - We can estimate Q*(s,a) for the optimal policy while executing an exploration policy
- Techniques:
  - Value and policy iteration
  - Temporal difference learning
  - Q-learning
  - Exploratory action selection

Q-Learning
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the Q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
- Let's say we discover through experience that this state is bad: [figure of a Pacman board]
- In naïve Q-learning, we know nothing about this state or its Q-states: [figure of a nearly identical board]
- Or even this one! [figure of a similar board with a different layout]

Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to closest ghost
    - Distance to closest dot
    - Number of ghosts
    - 1 / (distance to dot)²
    - Is Pacman in a tunnel? (0/1)
    - …… etc.
  - Can also describe a q-state (s, a) with features (e.g. action moves closer to food); a small feature-extractor sketch follows below
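A minimal sketch of a feature extractor for a Pacman-like q-state. The toy PacmanState class, the move-delta encoding of actions, and the feature names are assumptions for illustration, not the project's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pos = Tuple[int, int]

@dataclass
class PacmanState:                 # toy stand-in for a real game state
    pacman: Pos
    ghosts: List[Pos]
    dots: List[Pos]

def manhattan(p: Pos, q: Pos) -> int:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def extract_features(state: PacmanState, action: Pos) -> dict:
    """Map a q-state (s, a) to named real-valued features; action is a move delta."""
    nxt = (state.pacman[0] + action[0], state.pacman[1] + action[1])
    dist_ghost = min(manhattan(nxt, g) for g in state.ghosts)
    dist_dot = min(manhattan(nxt, d) for d in state.dots)
    return {
        'bias': 1.0,
        'dist-to-closest-ghost': float(dist_ghost),
        'inverse-dist-to-dot-squared': 1.0 / (dist_dot ** 2) if dist_dot else 1.0,
        'num-ghosts': float(len(state.ghosts)),
    }

# Usage: features for moving right from (1, 1)
s = PacmanState(pacman=(1, 1), ghosts=[(3, 1)], dots=[(1, 3), (4, 4)])
print(extract_features(s, (1, 0)))
```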
Linear Feature Functions
- Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
    Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Function Approximation
- Q-learning with linear Q-functions (see the sketch below):
  - On a transition (s, a, s', r), let difference = [R(s,a,s') + γ max_{a'} Q(s',a')] − Q(s,a)
  - Exact Q's: Q(s,a) ← Q(s,a) + α · difference
  - Approximate Q's: w_i ← w_i + α · difference · f_i(s,a)
- Intuitive interpretation:
  - Adjust weights of active features
  - E.g. if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares
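A minimal sketch of approximate Q-learning with a linear Q-function, following the weight update above (the dict-of-named-features representation and all names are assumptions; the extract_features sketch earlier shows one way to produce such features):

```python
def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a) for a dict of named features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approximate_q_update(weights, feats_sa, r, next_q_values,
                         alpha=0.05, gamma=0.9):
    """One linear Q-learning step for an observed transition (s, a, s', r).

    feats_sa: the features f(s, a) of the q-state just taken.
    next_q_values: Q(s', a') for each legal a' (empty at a terminal state).
    """
    target = r + gamma * (max(next_q_values) if next_q_values else 0.0)
    difference = target - q_value(weights, feats_sa)
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights

# Usage: the weights are shared across all states, so one bad surprise
# (a very negative "difference") lowers the value of every state with similar features.
w = {}
approximate_q_update(w, {'bias': 1.0, 'dist-to-closest-ghost': 2.0},
                     r=-10.0, next_q_values=[0.0, 0.0])
```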
Example: Q-Pacman
- [Worked example on the slide: a linear Q-function over a couple of Pacman features, one observed transition, and the resulting weight update; figures and numbers omitted.]

Linear regression
- Given examples (x_i, y_i), predict the value y for a new point x.
- [Figures: least-squares fits through scatter data in one and two input dimensions.]

Ordinary Least Squares (OLS)
- Each observation differs from its prediction by an error (residual); OLS chooses the fit that minimizes the total squared error.
- [Figure: observations, predictions, and residuals for a one-dimensional fit.]
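A minimal worked sketch of ordinary least squares in one input dimension (illustrative only, not lecture code): fit w_0, w_1 to minimize the summed squared residuals.

```python
def ols_fit(xs, ys):
    """Fit y ≈ w0 + w1 * x by minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution of the normal equations for one input dimension.
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

w0, w1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)   # 1.0 2.0  (exact fit of y = 1 + 2x)
```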
Minimizing Error
- Value update explained: the approximate Q-update above is one step of gradient descent on the squared error between the observed target and the current prediction.
- [Derivation figure omitted.]

Overfitting
- [Figure: a degree-15 polynomial fit oscillating wildly between its data points.]

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter

Policy Search
- Simplest policy search (see the sketch below):
  - Start with an initial linear value function or Q-function
  - Nudge each feature weight up and down and see if your policy is better than before
- Problems:
  - How do we tell the policy got better? Need to run many sample episodes!
  - If there are a lot of features, this can be impractical
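A minimal sketch of the "nudge each weight" search just described (the evaluate routine, step size, and stopping rule are assumptions; in practice each evaluation means running many sample episodes):

```python
def hill_climb_policy_search(weights, evaluate, step=0.1, max_rounds=100):
    """Greedy coordinate search over the feature weights that define a policy.

    weights:  dict of feature weights.
    evaluate: function(weights) -> estimated average return, e.g. obtained by
              running a batch of sample episodes (the expensive part).
    """
    best_score = evaluate(weights)
    for _ in range(max_rounds):
        improved = False
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta            # nudge one weight up or down
                score = evaluate(candidate)
                if score > best_score:              # keep the nudge if it helped
                    weights, best_score, improved = candidate, score, True
        if not improved:
            break
    return weights, best_score
```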
MDPs and RL Outline
- Markov Decision Processes (MDPs)
  - Formalism
  - Value iteration
  - Expectimax Search vs. Value Iteration
  - Policy Evaluation and Policy Iteration
- Reinforcement Learning
  - Model-based Learning
  - Model-free Learning
    - Direct Evaluation [performs policy evaluation]
    - Temporal Difference Learning [performs policy evaluation]
    - Q-Learning [learns optimal state-action value function Q*]
    - Policy Search [learns optimal policy from subset of all policies]

To Learn More About RL
- Online book: Sutton and Barto
  http://www.cs.ualberta.ca/~sutton/book/ebook/thebook.html
- Graduate-level course at Berkeley with reading material pointers online:
  http://www.cs.berkeley.edu/~russell/classes/cs294/s11/

Take a Deep Breath…
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities:
  - Diagnosis
  - Tracking objects
  - Speech recognition
  - Robot mapping
  - … lots more!
- Third part of course: machine learning
Helicopter dynamics
- State:
    (φ, θ, ψ, φ̇, θ̇, ψ̇, x, y, z, ẋ, ẏ, ż)
  [(roll, pitch, yaw, roll rate, pitch rate, yaw rate, x, y, z, x velocity, y velocity, z velocity)]
- Control inputs:
  - Roll cyclic pitch control (tilts rotor plane)
  - Pitch cyclic pitch control (tilts rotor plane)
  - Tail rotor collective pitch (affects tail rotor thrust)
  - Collective pitch (affects main rotor thrust)
- Dynamics:
    s_{t+1} = f(s_t, a_t) + w_t
  [f encodes the helicopter dynamics]

Helicopter policy class
- a_1 = w_0 + w_1 φ + w_2 ẋ + w_3 err_x
- a_2 = w_4 + w_5 θ + w_6 ẏ + w_7 err_y
- a_3 = w_8 + w_9 ψ̇
- a_4 = w_10 + w_11 ż + w_12 err_z
- Total of 12 parameters

Reward function
- R(s) = −(x − x*)² − (y − y*)² − (z − z*)² − ẋ² − ẏ² − ż² − (ψ − ψ*)²

Toddler (Tedrake et al.)
- Uses policy gradient from trials on the actual robot
- Leverages value function approximation to improve the gradient estimates
- Policy parameterization:
  - Dynamics analysis enables separating roll and pitch; roll turns out to be the hardest control problem
  - Ankle roll torque τ = wᵀ φ(q_roll, q̇_roll), where φ tiles (q_roll, q̇_roll) into a 5 × 7 grid, i.e. it encodes a lookup table
- On-board sensing: 3-axis gyro, 2-axis tilt sensor