SP11 CS 188 Lecture 11 -- Reinforcement Learning II (6PP)


CS 188: Artificial Intelligence
Lecture 11: Reinforcement Learning II (2/28/2011)
Pieter Abbeel -- UC Berkeley, Spring 2011
Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore.

Announcements
- W2: due right now. Submission of a self-corrected copy for partial credit due Wednesday 5:29pm.
- P3 Reinforcement Learning (RL): out, due Monday 4:59pm. You get to apply RL to:
  - Gridworld agent
  - Crawler
  - Pac-Man
- Recall: readings for the current material. Online book: Sutton and Barto, http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

MDPs and RL Outline
- Markov Decision Processes (MDPs)
  - Formalism
  - Value iteration
  - Expectimax search vs. value iteration
  - Policy evaluation and policy iteration
- Reinforcement Learning
  - Model-based learning
  - Model-free learning
    - Direct evaluation [performs policy evaluation]
    - Temporal difference learning [performs policy evaluation]
    - Q-learning [learns the optimal state-action value function Q*]
  - Exploration vs. exploitation

Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: we don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try out actions and states to learn

Example: learning to walk [Kohl and Stone, ICRA 2004]
- Before learning (hand-tuned gait)
- One of many learning runs
- After learning [after 1000 field traversals]

Model-Based Learning
- Idea:
  - Learn the model empirically through experience
  - Solve for values as if the learned model were correct
- Simple empirical model learning:
  - Count outcomes for each (s,a)
  - Normalize to give an estimate of T(s,a,s')
  - Discover R(s,a,s') when we experience (s,a,s')
- Solving the MDP with the learned model:
  - Value iteration, or policy iteration

Example: Learn Model in Model-Based Learning
Gridworld with exits +100 at (4,3) and -100 at (4,2); γ = 1. Observed episodes under the fixed policy π:
- Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)
- Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)
Learned transition estimates:
- T((3,3), right, (4,3)) = 1 / 3
- T((2,3), right, (3,3)) = 2 / 2
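To make the counting procedure concrete, here is a minimal sketch (not from the slides) of empirical model learning in Python. The episode format, the `estimate_model` helper, and the deterministic-reward assumption are illustrative choices; the slides only specify "count outcomes for each (s,a) and normalize."

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') and R(s,a,s') from observed transitions.

    episodes: list of episodes, each a list of (s, a, s_next, r) tuples.
    Returns (T, R) where T[(s,a)][s'] is an empirical probability and
    R[(s,a,s')] is the observed reward (assumed deterministic here).
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[(s,a)][s']
    rewards = {}                                    # rewards[(s,a,s')]
    for episode in episodes:
        for s, a, s_next, r in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r
    T = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        T[sa] = {s_next: n / total for s_next, n in outcomes.items()}
    return T, rewards

# Usage with the two gridworld episodes above (only the (3,3)-right transitions shown):
episodes = [
    [((3, 3), 'right', (3, 2), -1), ((3, 3), 'right', (4, 3), -1)],
    [((3, 3), 'right', (3, 2), -1)],
]
T, R = estimate_model(episodes)
print(T[((3, 3), 'right')])   # {(3, 2): 0.67, (4, 3): 0.33}, i.e. T((3,3), right, (4,3)) = 1/3
```

With the learned T and R in hand, value iteration or policy iteration proceeds exactly as in the MDP lectures.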
Model-Based vs. Model-Free
- Model-based RL:
  - First act in the MDP and learn T, R
  - Then run value iteration or policy iteration with the learned T, R
  - Advantage: efficient use of data
  - Disadvantage: requires building a model for T, R
- Model-free RL:
  - Bypass the need to learn T, R
  - Methods to evaluate a fixed policy without knowing T, R:
    - (i) Direct evaluation
    - (ii) Temporal difference learning
  - Method to learn π*, Q*, V* without knowing T, R:
    - (iii) Q-learning

Direct Evaluation
- Repeatedly execute the policy π
- Estimate the value of a state s as the average, over all times s was visited, of the sum of discounted rewards accumulated from s onwards

Example: Direct Evaluation
Same gridworld and episodes as above; γ = 1, living reward -1.
- V(2,3) ≈ (96 + -103) / 2 = -3.5
- V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3

Limitations of Direct Evaluation
- Assume a random initial state
- Assume the value of state (1,2) is known perfectly based on past runs
- Now, for the first time, we encounter (1,1) --- can we do better than estimating V(1,1) as the reward outcome of that single run?

Sample-Based Policy Evaluation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!):
  - sample_k = R(s, π(s), s_k') + γ V_i^π(s_k'), and estimate V_{i+1}^π(s) as the average of the samples
- Almost! But:
  - (i) We will only be in state s once and then land in s', hence we have only one sample --- do we have to keep all samples around?
  - (ii) Where do we get the value for s'?

Temporal-Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience (s, a, s', r)
  - Likely successors s' will contribute updates more often
- Temporal difference learning:
  - Policy still fixed!
  - Move values toward the value of whatever successor occurs: running average!
  - Sample of V(s):  sample = R(s, π(s), s') + γ V^π(s')
  - Update to V(s):  V^π(s) ← (1 - α) V^π(s) + α · sample
  - Same update:     V^π(s) ← V^π(s) + α · (sample - V^π(s))

Exponential Moving Average
- Exponential moving average:  x̄_n = (1 - α) · x̄_{n-1} + α · x_n
- Makes recent samples more important
- Forgets about the past (distant past values were wrong anyway)
- Easy to compute from the running average
- A decreasing learning rate can give converging averages

Policy Evaluation When T (and R) Are Unknown --- Recap
- Model-based:
  - Learn the model empirically through experience
  - Solve for values as if the learned model were correct
- Model-free:
  - Direct evaluation: V(s) = sample estimate of the sum of rewards accumulated from state s onwards
  - Temporal difference (TD) value learning: move values toward the value of whatever successor occurs (running average)
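As a concrete illustration of the running-average update above, here is a minimal TD(0) policy-evaluation sketch. It is not from the slides: the `env` interface (`reset()`, `step(s, a)`) and all names are assumptions for illustration; the update itself is exactly V(s) ← V(s) + α · (sample - V(s)).

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """TD(0) evaluation of a fixed policy.

    Assumes a hypothetical `env` with reset() -> s and
    step(s, a) -> (s_next, reward, done); not a real library API.
    """
    V = defaultdict(float)  # V(s) starts at 0 for every state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(s, a)
            # Sample of V(s) based on the observed successor
            sample = r + gamma * (0.0 if done else V[s_next])
            # Running-average update: move V(s) toward the sample
            V[s] += alpha * (sample - V[s])
            s = s_next
    return V
```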
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation
- However, if we want to turn the values into a (new) policy, we're sunk:
  - π(s) = argmax_a Q(s,a), where Q(s,a) = Σ_s' T(s,a,s') [R(s,a,s') + γ V(s')]
- Idea: learn Q-values directly
- Makes action selection model-free too!

Active Learning
- Full reinforcement learning:
  - You don't know the transitions T(s,a,s')
  - You don't know the rewards R(s,a,s')
  - You can choose any actions you like
  - Goal: learn the optimal policy ... what value iteration did!
- In this case:
  - The learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
  - This is NOT offline planning! You actually take actions in the world and find out what happens...

Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
  - Start with V_0(s) = 0, which we know is right (why?)
  - Given V_i, calculate the values for all states for depth i+1:
    V_{i+1}(s) ← max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V_i(s')]
- But Q-values are more useful!
  - Start with Q_0(s,a) = 0, which we know is right (why?)
  - Given Q_i, calculate the q-values for all q-states for depth i+1:
    Q_{i+1}(s,a) ← Σ_s' T(s,a,s') [R(s,a,s') + γ max_a' Q_i(s',a')]

Q-Learning
- Q-learning: sample-based Q-value iteration
- Learn Q*(s,a) values:
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s,a)
  - Consider your new sample estimate: sample = R(s,a,s') + γ max_a' Q(s',a')
  - Incorporate the new estimate into a running average:
    Q(s,a) ← (1 - α) Q(s,a) + α · sample

Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough
  - ... but not decrease it too quickly!
  - Basically it doesn't matter how you select actions (!)
- Neat property: off-policy learning
  - Learn the optimal policy without following it

Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1-ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but you keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions

Exploration Functions
- When to explore:
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function:
  - Takes a value estimate and a count, and returns an optimistic utility (exact form not important)
  - The update
    Q_{i+1}(s,a) ← (1 - α) Q_i(s,a) + α [R(s,a,s') + γ max_a' Q_i(s',a')]
    now becomes
    Q_{i+1}(s,a) ← (1 - α) Q_i(s,a) + α [R(s,a,s') + γ max_a' f(Q_i(s',a'), N(s',a'))]

Q-Learning
- Q-learning produces tables of q-values
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we'll see it over and over again

The Story So Far: MDPs and RL
Things we know how to do, and the techniques that do them:
- We can solve small MDPs exactly, offline -- value iteration and policy iteration
- We can estimate values V^π(s) directly for a fixed policy π -- temporal difference learning
- We can estimate Q*(s,a) for the optimal policy while executing an exploration policy -- Q-learning with exploratory action selection

Example: Pacman
(The slide shows three similar Pacman screenshots.)
- Let's say we discover through experience that this state is bad
- In naïve Q-learning, we know nothing about this nearly identical state or its q-states
- Or even this one!

Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to closest ghost
    - Distance to closest dot
    - Number of ghosts
    - 1 / (distance to dot)^2
    - Is Pacman in a tunnel? (0/1)
    - ... etc.
  - Can also describe a q-state (s, a) with features (e.g., "action moves closer to food")

Linear Feature Functions
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  - V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
  - Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Function Approximation
- Q-learning with linear q-functions (sketches follow below):
  - Exact Q's:        Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
  - Approximate Q's:  w_i ← w_i + α [r + γ max_a' Q(s',a') - Q(s,a)] f_i(s,a)
- Intuitive interpretation:
  - Adjust the weights of active features
  - E.g., if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares
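First, a minimal sketch (not from the slides or the course projects) of tabular Q-learning with ε-greedy action selection, combining the running-average update and the exploration scheme described above. The `env` interface, the `actions(s)` helper, and all parameter names are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes a hypothetical `env` with reset() -> s and
    step(s, a) -> (s_next, reward, done); `actions(s)` lists legal actions.
    """
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0

    def greedy_action(s):
        return max(actions(s), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: flip a coin each step
            if random.random() < epsilon:
                a = random.choice(actions(s))   # explore
            else:
                a = greedy_action(s)            # exploit the current policy
            s_next, r, done = env.step(s, a)
            # Sample-based Q-value iteration update
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions(s_next))
            sample = r + gamma * best_next
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next
    return Q
```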
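And a matching sketch of approximate Q-learning with a linear q-function, Q(s,a) = Σ_i w_i f_i(s,a), using the weight update from the Function Approximation slide. Again this is an assumption-laden illustration, not project or library code: `feature_fn`, the transition tuple format, and the hyperparameters are hypothetical.

```python
def linear_q(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a) for a dict of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def approximate_q_update(weights, feature_fn, transition, actions, alpha=0.01, gamma=0.9):
    """One approximate Q-learning update on the weight vector.

    `feature_fn(s, a)` returns a dict of feature values; `transition`
    is (s, a, r, s_next, done); `actions(s)` lists legal actions.
    """
    s, a, r, s_next, done = transition
    feats = feature_fn(s, a)
    q_sa = linear_q(weights, feats)
    best_next = 0.0 if done else max(
        linear_q(weights, feature_fn(s_next, a2)) for a2 in actions(s_next))
    # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
    difference = (r + gamma * best_next) - q_sa
    # w_i <- w_i + alpha * difference * f_i(s,a): adjust the weights of active features
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```

Because the update only touches features that are active in (s,a), a single bad experience lowers the value of every state that shares those features, which is exactly the "disprefer all states with that state's features" behavior described above.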
Example: Q-Pacman
- The slide works the linear update on a single Pacman transition, recomputing the feature weights after one bad experience (the concrete numbers appear only as images in the original and are not reproduced here).

Linear Regression
- Given examples (x_i, y_i), predict y for a new point x by fitting a linear function.
- (The slide shows 1D and 2D scatter plots with fitted lines/planes; only axis ticks survive in this text dump.)

Ordinary Least Squares (OLS)
- Fit the prediction by minimizing the squared error (residual) between each observation and its prediction.
- (Figure: observations, predictions, and the error/residual between them.)

Minimizing Error
- The slide derives the least-squares weight update and notes that the approximate Q-learning weight update has exactly this form ("value update explained").

Overfitting
- (Figure: a degree-15 polynomial that passes through the training points but oscillates wildly between them.)

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter

Policy Search
- Simplest policy search (see the sketch at the end of these notes):
  - Start with an initial linear value function or Q-function
  - Nudge each feature weight up and down and see if your policy is better than before
- Problems:
  - How do we tell the policy got better?
  - Need to run many sample episodes!
  - If there are a lot of features, this can be impractical

MDPs and RL Outline
- Markov Decision Processes (MDPs)
  - Formalism
  - Value iteration
  - Expectimax search vs. value iteration
  - Policy evaluation and policy iteration
- Reinforcement Learning
  - Model-based learning
  - Model-free learning
    - Direct evaluation [performs policy evaluation]
    - Temporal difference learning [performs policy evaluation]
    - Q-learning [learns optimal state-action value function Q*]
    - Policy search [learns optimal policy from a subset of all policies]

To Learn More About RL
- Online book: Sutton and Barto, http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html
- A graduate-level course at Berkeley has reading material pointers online: http://www.cs.berkeley.edu/~russell/classes/cs294/s11/

Take a Deep Breath...
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities:
  - Diagnosis
  - Tracking objects
  - Speech recognition
  - Robot mapping
  - ... lots more!
- Third part of the course: machine learning

Helicopter Dynamics
- State: (φ, θ, ψ, φ̇, θ̇, ψ̇, x, y, z, ẋ, ẏ, ż), i.e., (roll, pitch, yaw, roll rate, pitch rate, yaw rate, x, y, z, x velocity, y velocity, z velocity)
- Control inputs:
  - Roll cyclic pitch control (tilts rotor plane)
  - Pitch cyclic pitch control (tilts rotor plane)
  - Tail rotor collective pitch (affects tail rotor thrust)
  - Collective pitch (affects main rotor thrust)
- Dynamics: s_{t+1} = f(s_t, a_t) + w_t   [f encodes the helicopter dynamics]

Helicopter Policy Class
- a_1 = w_0 + w_1 φ + w_2 ẋ + w_3 err_x
- a_2 = w_4 + w_5 θ + w_6 ẏ + w_7 err_y
- a_3 = w_8 + w_9 ψ
- a_4 = w_10 + w_11 ż + w_12 err_z
- Total of 12 parameters

Reward Function
- R(s) = -(x - x*)^2 - (y - y*)^2 - (z - z*)^2 - ẋ^2 - ẏ^2 - ż^2 - (ψ - ψ*)^2

Toddler (Tedrake et al.)
- Uses policy gradient from trials on the actual robot
- Leverages value function approximation to improve the gradient estimates
- Policy parameterization:
  - Ankle roll torque τ = wᵀ φ(q_roll, q̇_roll)
  - φ tiles (q_roll, q̇_roll) into 5 x 7 --- i.e., it encodes a lookup table
- Dynamics analysis enables separating roll and pitch; roll turns out to be the hardest control problem
- On-board sensing: 3-axis gyro, 2-axis tilt sensor
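Returning to the "simplest policy search" idea from the Policy Search slides above (nudge each weight and keep changes that help), here is a minimal hill-climbing sketch over a parameterized policy. It is an illustration under assumptions, not the method behind the helicopter or Toddler results: `run_episodes` and the weight dictionary are hypothetical, and the evaluation step is the noisy "run many sample episodes" cost the slides warn about.

```python
def evaluate(policy_weights, run_episodes, num_episodes=50):
    """Average return of the policy induced by `policy_weights`.

    `run_episodes(weights, n)` is a hypothetical helper that rolls out the
    parameterized policy n times and returns the list of episode returns.
    """
    returns = run_episodes(policy_weights, num_episodes)
    return sum(returns) / len(returns)

def simple_policy_search(weights, run_episodes, step=0.05, num_passes=20):
    """Hill-climb the policy parameters by nudging one weight at a time."""
    weights = dict(weights)
    best_score = evaluate(weights, run_episodes)
    for _ in range(num_passes):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate(candidate, run_episodes)  # needs many sample episodes
                if score > best_score:                     # keep the nudge only if it helps
                    weights, best_score = candidate, score
    return weights, best_score
```

With many features this loop quickly becomes impractical, which is why the slides point to gradient-based policy search (as in the helicopter and Toddler examples) instead.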