Reinforcement Learning
Ron Parr
CPS 271

RL Highlights
• Everybody likes to learn from experience
• Use ML techniques to generalize from relatively small amounts of experience
• Some notable successes:
  – Backgammon
  – Flying a helicopter upside down (from Andrew Ng's home page)
• Sutton's seminal RL paper is the 88th most cited reference in computer science (CiteSeerX, 10/09); the Sutton & Barto RL book is the 14th most cited

Comparison w/Other Kinds of Learning
• Learning often viewed as:
  – Classification (supervised), or
  – Model learning (unsupervised)
• RL is between these (delayed signal)
• What is the last thing that happens before an accident?

Overview
• Review of value determination
• Motivation for RL
• Algorithms for RL
  – Overview
  – TD
  – Q-learning
  – Approximation

Recall Our Game Show
[Figure: game show diagram. From Start, the questions are worth $100, $1,000, $10,000, and $100,000; the remaining outcome labels are $0, $100, $1,100, and $11,100.]

Optimal Policy w/o Cheating
[Figure: the same diagram annotated with success probabilities 9/10, 3/4, 1/2, 1/10 and state values V=$3,750, V=$4,166, V=$5,555, V=$11.1K; top payoff $111,100; other outcome labels $0, $100, $1,100, $11,100.]

Cheat until you win policy
[Figure: the same diagram annotated with values V=$3,749, V=$4,166, V=$5,555, V=$11.11K, V=$32.47K, V=$32.58K, V=$32.95K, and V=$34.43K (with a "w/o cheat" label); success probabilities 9/10, 3/4, 1/2, 1/10; a $1000 label also appears.]

Solving for Values
\[ V_\pi = \gamma P_\pi V_\pi + R_\pi \]
For moderate numbers of states we can solve this system exactly:
\[ V_\pi = (I - \gamma P_\pi)^{-1} R_\pi \]
Guaranteed invertible because \( \gamma P_\pi \) has spectral radius < 1
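As a minimal sketch of the exact solution, the snippet below solves \( (I - \gamma P_\pi) V_\pi = R_\pi \) by Gauss-Jordan elimination in plain Python. The two-state chain, its rewards, and the helper name `solve_values` are made up for illustration and are not from the lecture.

```python
def solve_values(P, R, gamma):
    """Solve (I - gamma * P) V = R for V by Gauss-Jordan elimination."""
    n = len(R)
    # Augmented matrix [I - gamma * P | R].
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)] + [R[i]]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Scale the pivot row so the diagonal entry becomes 1.
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Hypothetical 2-state chain under a fixed policy; state 1 is absorbing with reward 0.
P_pi = [[0.5, 0.5],
        [0.0, 1.0]]
R_pi = [1.0, 0.0]
V = solve_values(P_pi, R_pi, gamma=0.9)
```

For this chain the answer can be checked by hand: state 1 has value 0, and state 0 satisfies V = 1 + 0.9 · 0.5 · V, so V = 1/0.55.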

Iteratively Solving for Values
\[ V_\pi = \gamma P_\pi V_\pi + R_\pi \]
For larger numbers of states we can solve this system indirectly:
\[ V_\pi^{i+1} = \gamma P_\pi V_\pi^{i} + R_\pi \]
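A rough sketch of the iteration, assuming a small hypothetical two-state chain (not from the lecture) and a starting guess of all zeros. The name `iterate_values` and the sweep count are illustrative choices.

```python
def iterate_values(P, R, gamma, sweeps=200):
    """Repeat V <- gamma * P V + R, starting from V = 0."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V


# Hypothetical 2-state chain under a fixed policy; state 1 is absorbing with reward 0.
P_pi = [[0.5, 0.5],
        [0.0, 1.0]]
R_pi = [1.0, 0.0]
V = iterate_values(P_pi, R_pi, gamma=0.9)
```

Each sweep costs one matrix-vector product, which is why this form scales to larger state spaces than a direct solve.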
Guaranteed convergent because \( \gamma P_\pi \) has spectral radius < 1 for \( \gamma < 1 \)
Convergence not guaranteed for \( \gamma = 1 \)

Why We Need RL
• Where do we get transition probabilities?
• How do we store them?
• Big problems have big models
• Model size is quadratic in state space size
• Where do we get the reward function?

RL Framework
• Learn by “trial and error”
• No assumptions about model
• No assumptions about reward function
• Assumes:
  – True state is known at all times
  – Immediate reward is known
  – Discount is known

RL Schema
• Act
• Perceive results
• Update something
• Repeat

RL for Our Game Show
• Problem: We don't know probability of answering correctly
• Solution:
  – Buy the home version of the game
  – Practice on the home game to refine our strategy
  – Deploy strategy when we play the real game

Model Learning Approach
• Learn model, solve
• How to learn a model:
  – Take action a in state s, observe s'
  – Take action a in state s, n times
  – Observe s' m times
  – \( P(s' \mid s, a) = m/n \)
  – Fill in transition matrix for each action
  – Compute avg. reward for each state
• Solve learned model as an MDP

Limitations of Model Learning
• Partitions the problem into two phases: learning, then solution
• Model may be large (hard to visit every state lots of times)
  – Note: Can't completely get around this problem…
• Model storage is expensive
• Model manipulation is expensive

Temporal Differences
• One of the first RL algorithms
• Learn the value of a fixed policy (no optimization; just prediction)
• Recall iterative value determination:
\[ V_\pi^{i+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V_\pi^{i}(s') \]
Problem: We don't know this: the transition probabilities \( P(s' \mid s, \pi(s)) \) are not given.
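One way around the unknown model is the count-based estimate \( P(s' \mid s, a) = m/n \) from the model learning approach. A minimal sketch, assuming experience arrives as (s, a, r, s') tuples; the state names, the sample data, and the helper `estimate_model` are hypothetical. Average rewards are kept per state-action pair here, which is a slight variation on the per-state average mentioned in the slides.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate P(s'|s,a) = m/n and the average reward from (s, a, r, s') samples."""
    counts = defaultdict(Counter)      # (s, a) -> counts of observed next states
    reward_sum = defaultdict(float)    # (s, a) -> total observed reward
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    P, R = {}, {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())  # times action a was taken in state s
        P[sa] = {s_next: m / n for s_next, m in next_counts.items()}
        R[sa] = reward_sum[sa] / n
    return P, R


# Hypothetical practice data: four attempts at the first question, three correct.
samples = [
    ("q1", "answer", 0.0, "q2"),
    ("q1", "answer", 0.0, "q2"),
    ("q1", "answer", 0.0, "q2"),
    ("q1", "answer", 0.0, "lost"),
]
P_hat, R_hat = estimate_model(samples)
```

Temporal difference learning, described next, avoids building this table at all.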

Temporal Difference Learning
• Remember value determination:
\[ V^{i+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{i}(s') \]
• Compute an update as if the observed s' and r were the only possible outcomes:
\[ V^{\mathrm{temp}}(s) = r + \gamma V^{i}(s') \]
• Make a small update in this direction:
\[ V^{i+1}(s) = (1 - \alpha) V^{i}(s) + \alpha V^{\mathrm{temp}}(s), \qquad 0 < \alpha \le 1 \]
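The two-line update above can be sketched as a short simulation. Everything specific here is an assumption for illustration: a two-question game with success probabilities 0.9 and 0.5, a $100 prize, a fixed "always answer" policy, γ = 0.9, and a learning rate of 1/(number of visits).

```python
import random


def step(state, rng):
    """Hypothetical two-question game under a fixed 'always answer' policy."""
    if state == "q1":
        return (0.0, "q2") if rng.random() < 0.9 else (0.0, "end")
    # state == "q2": a correct answer pays 100, then the game ends either way.
    return (100.0, "end") if rng.random() < 0.5 else (0.0, "end")


def td0(episodes, gamma=0.9, seed=0):
    rng = random.Random(seed)
    V = {"q1": 0.0, "q2": 0.0, "end": 0.0}
    visits = {"q1": 0, "q2": 0}
    for _ in range(episodes):
        s = "q1"
        while s != "end":
            r, s_next = step(s, rng)
            visits[s] += 1
            alpha = 1.0 / visits[s]              # decaying learning rate
            v_temp = r + gamma * V[s_next]       # treat (r, s') as the only outcome
            V[s] = (1 - alpha) * V[s] + alpha * v_temp
            s = s_next
    return V


V = td0(episodes=20000)
```

Under these assumptions the true values are V(q2) = 0.5 · 100 = 50 and V(q1) = 0.9 · 0.9 · 50 = 40.5, and the estimates should land close to them without the transition probabilities ever being stored.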

Example: Home Version of Game
[Figure: game show diagram with state s3 marked; outcome labels $0, $100, $1,100, $11,100, $111,100.]
Suppose we guess: \( V(s_3) = 15\text{K} \)
We play and get the question wrong: \( V^{\mathrm{temp}} = 0 \)
\[ V(s_3) = (1 - \alpha) \cdot 15\text{K} + \alpha \cdot 0 \]

Convergence?
• Why doesn't this oscillate?
  – e.g. consider some low probability s' with a very high (or low) reward value
  – This could still cause a big jump in V(s)

Convergence Intuitions
• Need heavy machinery from stochastic process theory to prove convergence
• Main ideas:
  – Iterative value determination converges
  – TD updates approximate value determination
  – Samples approximate expectation
\[ V^{i+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{i}(s') \]
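The "samples approximate expectation" idea can be made concrete with one line of algebra. This is a sketch rather than a proof: it holds \( V^{i} \) fixed and assumes the observed reward r has mean \( R(s, \pi(s)) \).

```latex
\begin{align*}
\mathbb{E}\left[ V^{\mathrm{temp}}(s) \right]
  &= \mathbb{E}[r] + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{i}(s')
   = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{i}(s') \\
\mathbb{E}\left[ V^{i+1}(s) \right]
  &= (1 - \alpha) V^{i}(s)
   + \alpha \left[ R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{i}(s') \right]
\end{align*}
```

So on average a TD update moves V(s) a fraction α of the way toward the value determination backup.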

Ensuring Convergence
• Rewards have bounded variance
• \( 0 \le \gamma < 1 \)
• Every state visited infinitely often
• Learning rate decays so that:
  – \( \sum_{i}^{\infty} \alpha_i(s) = \infty \)
  – \( \sum_{i}^{\infty} \alpha_i^2(s) < \infty \)
These conditions are jointly sufficient to ensure convergence in the limit with probability 1.

How Strong is This?
• Bounded variance of rewards: easy
• Discount: standard
• Visiting every state infinitely often: Hmmm…
• Learning rate: Often leads to slow learning
• Convergence in the limit: Weak
  – Hard to say anything stronger w/o knowing the mixing rate of the process
  – Mixing rate can be low; hard to know a priori
• Convergence w.p. 1: Not a problem.

Using TD for Control
• Recall value iteration:
\[ V^{i+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{i}(s') \right] \]
• Why not pick the maximizing a and then do:
\[ V^{\mathrm{temp}}(s) = r + \gamma V^{i}(s') \]
\[ V^{i+1}(s) = (1 - \alpha) V^{i}(s) + \alpha V^{\mathrm{temp}}(s) \]
  – s' is the observed next state after taking action a

Problems
• Pick the best action w/o model?
• Must visit every state infinitely often
  – What if a good policy doesn't do this?
• Learning is done “on policy”
  – Taking random actions to make sure that all states are visited will cause problems

Q-Learning Overview
• Want to maintain good properties of TD
• Learns good policies and optimal value function, not just the value of a fixed policy
• Simple modification to TD that learns the optimal policy regardless of how you act! (mostly)

Q-learning
• Recall value iteration:
\[ V^{i+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{i}(s') \right] \]
• Can split this into two functions:
\[ Q^{i+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{i}(s') \]
\[ V^{i+1}(s) = \max_a Q^{i+1}(s, a) \]
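The split into Q and V can be sketched as value iteration on a known model. The tiny two-question MDP below, with its state names, prizes, and probabilities, is hypothetical; `P[s][a]` maps next states to probabilities and `R[s][a]` is the expected immediate reward.

```python
def value_iteration_q(P, R, gamma, sweeps=200):
    """Alternate the Q backup and V(s) = max_a Q(s, a) on a known model."""
    V = {s: 0.0 for s in P}
    Q = {}
    for _ in range(sweeps):
        Q = {s: {a: R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
                 for a in P[s]}
             for s in P}
        V = {s: max(Q[s].values()) for s in P}
    return Q, V


# Hypothetical model: quit and take a sure prize, or answer and risk it.
P = {
    "q1": {"quit": {"end": 1.0}, "answer": {"q2": 0.9, "end": 0.1}},
    "q2": {"quit": {"end": 1.0}, "answer": {"end": 1.0}},
    "end": {"stay": {"end": 1.0}},
}
R = {
    "q1": {"quit": 10.0, "answer": 0.0},
    "q2": {"quit": 100.0, "answer": 500.0},   # 0.5 chance of 1000
    "end": {"stay": 0.0},
}
Q, V = value_iteration_q(P, R, gamma=0.9)
```

With Q stored, the best action in a state is simply the argmax over that state's entries, and no model lookup is needed at decision time.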

Q-learning
• Store Q values instead of a value function
• Makes selection of best action easy
• Update rule:
\[ Q^{\mathrm{temp}}(s, a) = r + \gamma \max_{a'} Q^{i}(s', a') \]
\[ Q^{i+1}(s, a) = (1 - \alpha) Q^{i}(s, a) + \alpha Q^{\mathrm{temp}}(s, a) \]

Q-learning Properties
• Converges under same conditions as TD
• Still must visit every state (more precisely, every state-action pair) infinitely often
• Separates policy you are currently following from value function learning:
\[ Q^{\mathrm{temp}}(s, a) = r + \gamma \max_{a'} Q^{i}(s', a') \]
\[ Q^{i+1}(s, a) = (1 - \alpha) Q^{i}(s, a) + \alpha Q^{\mathrm{temp}}(s, a) \]
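A minimal sketch of the update rule, run with a uniformly random behavior policy to illustrate the separation between how you act and what you learn. The two-question game, its prizes and probabilities, γ = 0.9, and the 1/(visit count) learning rate are all assumptions made for this example.

```python
import random

ACTIONS = ("quit", "answer")


def step(state, action, rng):
    """Hypothetical game: quit for a sure prize, or answer and risk it."""
    if action == "quit":
        return (10.0 if state == "q1" else 100.0), "end"
    if state == "q1":
        return 0.0, ("q2" if rng.random() < 0.9 else "end")
    return (1000.0 if rng.random() < 0.5 else 0.0), "end"


def q_learning(episodes, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in ("q1", "q2") for a in ACTIONS}
    visits = {key: 0 for key in Q}
    for _ in range(episodes):
        s = "q1"
        while s != "end":
            a = rng.choice(ACTIONS)              # behavior policy: act at random
            r, s_next = step(s, a, rng)
            best_next = 0.0 if s_next == "end" else max(Q[(s_next, b)] for b in ACTIONS)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]         # decaying learning rate
            q_temp = r + gamma * best_next
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * q_temp
            s = s_next
    return Q


Q = q_learning(episodes=50000)
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ("q1", "q2")}
```

Under these assumptions the optimal values are Q(q2, answer) = 500 and Q(q1, answer) = 0.9 · 0.9 · 500 = 405, so the greedy policy read off the table should be to answer in both states even though the agent never acted greedily while learning.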

Value Function Representation
• Fundamental problem remains unsolved:
  – TD/Q-learning solves the model-learning problem, but
  – Large models still have large value functions
  – Too expensive to store these functions
  – Impossible to visit every state in large models
• Function approximation
  – Use machine learning methods to generalize
  – Avoid the need to visit every state

Properties of approximate RL
• Table-updates are a special case
• Can be combined with Q-learning
• Convergence not guaranteed
  – Policy evaluation with linear function approximation converges if samples are drawn “on policy”
  – Ordinary neural nets converge to local opt
  – NN + RL convergence not guaranteed
    • Chasing a moving target
    • Errors can compound
• Success requires very well chosen features

How'd They Do That???
• Backgammon (Tesauro)
  – Neural network value function approximation
  – TD sufficient (known model)
  – Carefully selected inputs to neural network
  – About 1 million games played against self
• Helicopter (Ng et al.)
  – Approximate policy iteration
  – Constrained policy space
  – Trained on a simulator

Swept under the rug…
• Difficulty of finding good features
• Partial observability
• Exploration vs. Exploitation

Conclusions
• Reinforcement learning solves an MDP
• Converges for exact value function representation
• Can be combined with approximation methods
• Good results require good features
 Spring '11
 Parr