

Reinforcement Learning
Ron Parr
CPS 271

RL Highlights
•  Everybody likes to learn from experience
•  Use ML techniques to generalize from relatively small amounts of experience
•  Some notable successes:
   –  Backgammon
   –  Flying a helicopter upside down (picture from Andrew Ng's home page)
•  Sutton's seminal RL paper is the 88th most cited reference in computer science (CiteSeerX, 10/09); the Sutton & Barto RL book is the 14th most cited

Comparison w/ Other Kinds of Learning
•  Learning is often viewed as:
   –  Classification (supervised), or
   –  Model learning (unsupervised)
•  RL is between these (delayed signal)
•  What's the last thing that happens before an accident?

Overview
•  Review of value determination
•  Motivation for RL
•  Algorithms for RL
   –  Overview
   –  TD
   –  Q-learning
   –  Approximation

Recall Our Game Show
[Diagram: the game-show MDP from earlier lectures. Question values are $100, $1,000, $10,000, and $100,000; quitting after each correct answer pays $100, $1,100, $11,100, or $111,100; a wrong answer pays $0.]

Optimal Policy w/o Cheating
[Diagram: the optimal no-cheating policy. The probability of answering correctly at each stage is 9/10, 3/4, 1/2, and 1/10; the corresponding state values are V=$3,750, V=$4,166, V=$5,555, and V=$11.1K.]

Cheat Until You Win Policy
[Diagram: the "cheat until you win" policy, where cheating costs $1,000. State values rise to V=$32.47K, V=$32.58K, V=$32.95K, and V=$34.43K, compared with V=$3,749, V=$4,166, V=$5,555, and V=$11.11K w/o cheating.]

Solving for Values
   Vπ = γ Pπ Vπ + Rπ
For moderate numbers of states we can solve this system exactly:
   Vπ = (I − γ Pπ)^(−1) Rπ
Guaranteed invertible because γ Pπ has spectral radius < 1.

Iteratively Solving for Values
For larger numbers of states we can solve this system indirectly:
   Vπ^(i+1) = γ Pπ Vπ^(i) + Rπ
Guaranteed convergent because γ Pπ has spectral radius < 1 for γ < 1.
Convergence is not guaranteed for γ = 1.

Overview
•  Review of value determination
•  Motivation for RL
•  Algorithms for RL
   –  Overview
   –  TD
   –  Q-learning
   –  Approximation

Why We Need RL
•  Where do we get transition probabilities?
•  How do we store them?
•  Big problems have big models
•  Model size is quadratic in state space size
•  Where do we get the reward function?

RL Framework
•  Learn by "trial and error"
•  No assumptions about the model
•  No assumptions about the reward function
•  Assumes:
   –  True state is known at all times
   –  Immediate reward is known
   –  Discount is known

RL Schema
•  Act
•  Perceive results
•  Update something
•  Repeat

RL for Our Game Show
•  Problem: We don't know the probability of answering correctly
•  Solution:
   –  Buy the home version of the game
   –  Practice on the home game to refine our strategy
   –  Deploy the strategy when we play the real game

Model Learning Approach
•  Learn a model, then solve it
•  How to learn a model:
   –  Take action a in state s, observe s'
   –  Take action a in state s, n times
   –  Observe s' m times
   –  P(s'|s,a) = m/n
   –  Fill in the transition matrix for each action
   –  Compute the average reward for each state
•  Solve the learned model as an MDP (a code sketch of this approach follows the Temporal Differences slide below)

Limitations of Model Learning
•  Partitions learning and solution into two phases
•  Model may be large (hard to visit every state lots of times)
   –  Note: Can't completely get around this problem…
•  Model storage is expensive
•  Model manipulation is expensive

Overview
•  Review of value determination
•  Motivation for RL
•  Algorithms for RL
   –  TD
   –  Q-learning
   –  Approximation

Temporal Differences
•  One of the first RL algorithms
•  Learns the value of a fixed policy (no optimization; just prediction)
•  Recall iterative value determination:
   Vπ^(i+1)(s) = R(s, π(s)) + γ Σ_s' P(s'|s, π(s)) Vπ^(i)(s')
•  Problem: We don't know the transition probabilities.
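(For contrast with the model-free TD approach developed next, here is a minimal sketch, not from the slides, of the model-learning route described above: estimate Pπ and Rπ from counts for a fixed policy, then run exact or iterative value determination on the learned model. The function names, the (s, r, s') sample format, and the NumPy-based implementation are illustrative assumptions.)

    import numpy as np

    def estimate_model(samples, n_states):
        # Count-based model estimate for a fixed policy:
        # P_pi[s, s'] ~ m/n and R_pi[s] ~ average observed reward,
        # from (s, r, s') samples gathered while following the policy.
        counts = np.zeros((n_states, n_states))
        reward_sum = np.zeros(n_states)
        visits = np.zeros(n_states)
        for s, r, s_next in samples:
            counts[s, s_next] += 1
            reward_sum[s] += r
            visits[s] += 1
        safe = np.maximum(visits, 1)            # avoid dividing by zero for unvisited states
        P_pi = counts / safe[:, None]
        R_pi = reward_sum / safe
        return P_pi, R_pi

    def value_determination_exact(P_pi, R_pi, gamma):
        # V_pi = (I - gamma * P_pi)^(-1) R_pi, as on the "Solving for Values" slide.
        n = len(R_pi)
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

    def value_determination_iterative(P_pi, R_pi, gamma, iters=1000):
        # Repeatedly apply V <- gamma * P_pi V + R_pi; converges for gamma < 1.
        V = np.zeros(len(R_pi))
        for _ in range(iters):
            V = gamma * (P_pi @ V) + R_pi
        return V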
Temporal Difference Learning
•  Remember value determination:
   V^(i+1)(s) = R(s, π(s)) + γ Σ_s' P(s'|s, π(s)) V^(i)(s')
•  Compute an update as if the observed s' and r were the only possible outcomes:
   V^temp(s) = r + γ V^(i)(s')
•  Make a small update in this direction:
   V^(i+1)(s) = (1 − α) V^(i)(s) + α V^temp(s),   0 < α ≤ 1
   (a code sketch of this update follows the Problems slide below)

Example: Home Version of Game
[Diagram: the game-show MDP with state s3 highlighted; payoffs $0, $100, $1,100, $11,100, $111,100 as before.]
Suppose we guess: V(s3) = 15K
We play and get the question wrong, so V^temp = 0
   V(s3) = (1 − α)·15K + α·0

Convergence?
•  Why doesn't this oscillate?
   –  e.g., consider some low-probability s' with a very high (or low) reward value
   –  This could still cause a big jump in V(s)

Convergence Intuitions
•  Need heavy machinery from stochastic process theory to prove convergence
•  Main ideas:
   –  Iterative value determination converges
   –  TD updates approximate value determination
   –  Samples approximate the expectation in
      V^(i+1)(s) = R(s, π(s)) + γ Σ_s' P(s'|s, π(s)) V^(i)(s')

Ensuring Convergence
•  Rewards have bounded variance
•  0 ≤ γ < 1
•  Every state is visited infinitely often
•  Learning rate decays so that:
   –  Σ_i α_i(s) = ∞
   –  Σ_i α_i²(s) < ∞
These conditions are jointly sufficient to ensure convergence in the limit with probability 1.

How Strong is This?
•  Bounded variance of rewards: easy
•  Discount: standard
•  Visiting every state infinitely often: Hmmm…
•  Learning rate: Often leads to slow learning
•  Convergence in the limit: Weak
   –  Hard to say anything stronger w/o knowing the mixing rate of the process
   –  Mixing rate can be low; hard to know a priori
•  Convergence w.p. 1: Not a problem.

Using TD for Control
•  Recall value iteration:
   V^(i+1)(s) = max_a [ R(s, a) + γ Σ_s' P(s'|s, a) V^(i)(s') ]
•  Why not pick the maximizing a and then do:
   V^(i+1)(s) = (1 − α) V^(i)(s) + α V^temp(s)
   –  where s' (used in V^temp) is the observed next state after taking action a

Problems
•  Pick the best action w/o a model?
•  Must visit every state infinitely often
   –  What if a good policy doesn't do this?
•  Learning is done "on policy"
   –  Taking random actions to make sure that all states are visited will cause problems
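(Before the slides turn to Q-learning, here is a minimal sketch of the tabular TD(0) prediction algorithm described above, which learns the value of a fixed policy. The env.reset()/env.step() interface, the episode loop, and the 1/n step-size schedule, which satisfies the two learning-rate conditions listed earlier, are illustrative assumptions rather than material from the slides.)

    def td0_evaluate(env, policy, n_states, gamma, episodes=10000):
        # Tabular TD(0) for a fixed policy.
        # Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done).
        V = [0.0] * n_states
        visits = [0] * n_states
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                visits[s] += 1
                alpha = 1.0 / visits[s]       # decaying rate: sum alpha_i = inf, sum alpha_i^2 < inf
                v_temp = r + gamma * (0.0 if done else V[s_next])   # V_temp(s) = r + gamma V(s')
                V[s] = (1 - alpha) * V[s] + alpha * v_temp
                s = s_next
        return V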
Q-Learning Overview
•  Want to maintain the good properties of TD
•  Learns good policies and the optimal value function, not just the value of a fixed policy
•  Simple modification to TD that learns the optimal policy regardless of how you act! (mostly)

Q-learning
•  Recall value iteration:
   V^(i+1)(s) = max_a [ R(s, a) + γ Σ_s' P(s'|s, a) V^(i)(s') ]
•  Can split this into two functions:
   Q^(i+1)(s, a) = R(s, a) + γ Σ_s' P(s'|s, a) V^(i)(s')
   V^(i+1)(s) = max_a Q^(i+1)(s, a)

Q-learning
•  Store Q values instead of a value function
•  Makes selection of the best action easy
•  Update rule:
   Q^temp(s, a) = r + γ max_a' Q^(i)(s', a')
   Q^(i+1)(s, a) = (1 − α) Q^(i)(s, a) + α Q^temp(s, a)
   (a code sketch of this update follows the Conclusions slide below)

Q-learning Properties
•  Converges under the same conditions as TD
•  Still must visit every state infinitely often
•  Separates the policy you are currently following from value function learning:
   Q^temp(s, a) = r + γ max_a' Q^(i)(s', a')
   Q^(i+1)(s, a) = (1 − α) Q^(i)(s, a) + α Q^temp(s, a)

Value Function Representation
•  Fundamental problem remains unsolved:
   –  TD/Q-learning solves the model-learning problem, but
   –  Large models still have large value functions
   –  Too expensive to store these functions
   –  Impossible to visit every state in large models
•  Function approximation
   –  Use machine learning methods to generalize
   –  Avoid the need to visit every state

Properties of Approximate RL
•  Table updates are a special case
•  Can be combined with Q-learning
•  Convergence not guaranteed
   –  Policy evaluation with linear function approximation converges if samples are drawn "on policy"
   –  Ordinary neural nets converge to a local optimum
   –  NN + RL convergence not guaranteed
      •  Chasing a moving target
      •  Errors can compound
•  Success requires very well chosen features

How'd They Do That???
•  Backgammon (Tesauro)
   –  Neural network value function approximation
   –  TD sufficient (known model)
   –  Carefully selected inputs to the neural network
   –  About 1 million games played against itself
•  Helicopter (Ng et al.)
   –  Approximate policy iteration
   –  Constrained policy space
   –  Trained on a simulator

Swept Under the Rug…
•  Difficulty of finding good features
•  Partial observability
•  Exploration vs. exploitation

Conclusions
•  Reinforcement learning solves an MDP
•  Converges for exact value function representation
•  Can be combined with approximation methods
•  Good results require good features
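(To tie the update rules above together, here is a minimal tabular Q-learning sketch, again not from the slides. The behavior policy is uniformly random to emphasize the "regardless of how you act" point; the env.reset()/env.step() interface and the 1/n learning-rate schedule are assumptions for illustration.)

    import random

    def q_learning(env, n_states, n_actions, gamma, episodes=50000):
        # Tabular Q-learning with a uniformly random behavior policy.
        # Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done).
        Q = [[0.0] * n_actions for _ in range(n_states)]
        counts = [[0] * n_actions for _ in range(n_states)]
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = random.randrange(n_actions)            # act however you like (mostly)
                s_next, r, done = env.step(a)
                counts[s][a] += 1
                alpha = 1.0 / counts[s][a]                 # decaying learning rate
                q_temp = r + (0.0 if done else gamma * max(Q[s_next]))
                Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q_temp
                s = s_next
        greedy_policy = [row.index(max(row)) for row in Q]  # best action in each state
        V = [max(row) for row in Q]                         # estimate of the optimal value function
        return Q, greedy_policy, V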
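(Separately, a minimal sketch of the function-approximation idea in its simplest form: TD(0) policy evaluation with a linear value function V(s) ≈ w·φ(s), with samples drawn on-policy, the convergent case noted above. The feature map φ, the fixed step size, and the gradient-style form of the update are illustrative assumptions that go beyond what the slides spell out.)

    import numpy as np

    def td0_linear(env, policy, features, n_features, gamma, episodes=10000, alpha=0.01):
        # TD(0) with a linear value function V(s) ~ w . features(s).
        # features(s) is assumed to return a length-n_features NumPy vector.
        w = np.zeros(n_features)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                v = w @ features(s)
                v_next = 0.0 if done else w @ features(s_next)
                td_error = (r + gamma * v_next) - v
                w += alpha * td_error * features(s)        # update weights instead of a table entry
                s = s_next
        return w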