Reinforcement Learning
CSci 5512: Artificial Intelligence II
Outline
- Reinforcement Learning
- Passive Reinforcement Learning
- Active Reinforcement Learning
- Generalizations
- Policy Search
Reinforcement Learning (RL)
- Learning what to do in order to maximize reward
- The learner is not given training examples; the only feedback is the reward
- Try actions out and see what reward they yield
- Different from supervised learning, where a teacher gives training examples
Examples
- Robotics: quadruped gait control, ball acquisition (RoboCup)
- Control: helicopters
- Operations research: pricing, routing, scheduling
- Game playing: backgammon, solitaire, chess, checkers
- Human-computer interaction: spoken dialogue systems
- Economics/finance: trading
MDP vs. RL
- Markov decision process:
  - Set of states S, set of actions A
  - Transition probabilities to next states T(s, a, s')
  - Reward function R(s)
- RL is based on MDPs, but:
  - The transition model is not known
  - The reward model is not known
- An MDP solver computes an optimal policy; RL learns an optimal policy (a minimal sketch of these ingredients follows)
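To make the pieces concrete, here is a minimal sketch of the MDP ingredients in Python. The two-state toy problem, its names, and its numbers are illustrative assumptions, not the course's grid world.

```python
# Minimal sketch of the MDP ingredients: states S, actions A, T(s, a, s'), R(s).
# The toy two-state problem below is purely illustrative.

states = ["s0", "s1"]
actions = ["stay", "go"]

# T[(s, a)] maps a next state s' to its probability T(s, a, s')
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

# R[s] is the reward received in state s
R = {"s0": -0.04, "s1": 1.0}

# An MDP solver is handed T and R directly and computes an optimal policy;
# an RL agent only observes visited states and received rewards, and must
# learn (or sidestep) T and R on its way to an optimal policy.
```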
Types of RL
- Passive vs. active
  - Passive: the agent executes a fixed policy and evaluates it
  - Active: the agent updates its policy as it learns
- Model-based vs. model-free
  - Model-based: learn the transition and reward models, then use them to obtain an optimal policy
  - Model-free: derive an optimal policy without learning the model
Passive Learning
[Figure: the 4x3 grid world with terminal states +1 and -1, and the utilities U^π(s) learned for the non-terminal states (0.611, 0.812, 0.655, 0.762, 0.918, 0.705, 0.660, 0.868, 0.388)]
- Evaluate how good a policy π is
- Learn the utility U^π(s) of each state
- Same as policy evaluation when the transition and reward models are known
Passive Learning (Contd.)
[Figure: the 4x3 grid world under the fixed policy]
The agent executes a sequence of trials, e.g.:
- (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
- (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) +1
- (1,1) → (2,1) → (3,1) → (3,2) → (4,2) -1
The goal is to learn the expected utility U^π(s):
$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi, s_0 = s\right]$
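As a concrete illustration, the sketch below computes the return-to-go of every state visit in the first trial. The reward values are assumptions (R(s) = -0.04 for non-terminal states and γ = 1 are not stated on the slide), chosen because they reproduce the sample values quoted on the next slide.

```python
# Sketch: the sample return ("reward-to-go") each visit in a trial provides.
# Assumes R(s) = -0.04 for non-terminal states and gamma = 1; these reproduce
# the sample values 0.72, 0.76/0.84, and 0.80/0.88 used in the direct utility
# estimation example that follows.

GAMMA = 1.0

# First trial: each visited state paired with the reward received there.
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]

def sample_returns(trial, gamma=GAMMA):
    """Return a list of (state, return-to-go) pairs, one per visit."""
    samples = []
    ret = 0.0
    # Work backwards so each return-to-go is a single running sum.
    for state, reward in reversed(trial):
        ret = reward + gamma * ret
        samples.append((state, ret))
    samples.reverse()
    return samples

for state, ret in sample_returns(trial):
    print(state, round(ret, 2))
# (1, 1) 0.72, (1, 2) 0.76, (1, 3) 0.8, (1, 2) 0.84, (1, 3) 0.88, ...
```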
Direct Utility Estimation
- Reduction to inductive learning: compute the empirical value of each state
- Each trial gives a sample value for every state it visits
- Estimate the utility from these sample values
- Example: the first trial gives
  - State (1,1): one sample of reward 0.72
  - State (1,2): two samples of reward 0.76 and 0.84
  - State (1,3): two samples of reward 0.80 and 0.88
- The estimate can be a running average of the sample values
- Example: U(1,1) = 0.72, U(1,2) = 0.80, U(1,3) = 0.84, ... (a sketch follows)
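A minimal sketch of the running-average estimator, reusing the illustrative sample_returns helper and trial from the previous sketch:

```python
# Sketch of direct utility estimation: keep a running average of the sample
# returns observed for each state across trials. Reuses the illustrative
# sample_returns(trial) helper sketched above.
from collections import defaultdict

totals = defaultdict(float)   # sum of sample returns per state
counts = defaultdict(int)     # number of samples per state
U = {}                        # current utility estimates

def update_estimates(trial):
    """Fold one completed trial into the running-average estimates."""
    for state, ret in sample_returns(trial):
        totals[state] += ret
        counts[state] += 1
        U[state] = totals[state] / counts[state]

# After the first trial: U[(1,1)] = 0.72, U[(1,2)] = 0.80 (average of 0.76
# and 0.84), U[(1,3)] = 0.84 (average of 0.80 and 0.88), as on the slide.
update_estimates(trial)
```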
Direct Utility Estimation (Contd.)
- Ignores a very important source of information: the utilities of states satisfy the Bellman equations
  $U^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, U^\pi(s')$
- The search is over a hypothesis space for U that is much larger than necessary
- Convergence is therefore very slow
Adaptive Dynamic Programming (ADP)
- Make use of the Bellman equations to get U^π(s):
  $U^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, U^\pi(s')$
- Estimate T(s, π(s), s') and R(s) from the trials
- Plug the learned transition and reward models into the Bellman equations
- Solving for U^π is then a system of n linear equations (sketched below)
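A minimal sketch of passive ADP under these assumptions: each trial is a list of (state, reward) pairs generated by the fixed policy and ending in a terminal state; the function name and data layout are illustrative, not from the slides.

```python
# Sketch of passive ADP: estimate T and R from the trials, then plug them
# into the Bellman equations and solve the resulting linear system
# (I - gamma * T) U = R. Each trial is a list of (state, reward) pairs
# generated by the fixed policy pi and ending in a terminal state.
from collections import defaultdict
import numpy as np

def adp_policy_evaluation(trials, gamma=1.0):
    # 1. Estimate R(s) and the transition counts under the fixed policy.
    R = {}
    next_counts = defaultdict(lambda: defaultdict(int))  # s -> s' -> count
    visits = defaultdict(int)                            # s -> visit count
    for trial in trials:
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            R[s] = r
            visits[s] += 1
            next_counts[s][s_next] += 1
        s_last, r_last = trial[-1]        # record the terminal state's reward
        R[s_last] = r_last

    # 2. Build the system U = R + gamma * T U over the observed states
    #    and solve it directly.
    states = sorted(R)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = np.zeros((n, n))
    for s, nexts in next_counts.items():
        for s_next, c in nexts.items():
            T[idx[s], idx[s_next]] = c / visits[s]   # estimated T(s, pi(s), s')
    r_vec = np.array([R[s] for s in states])
    U = np.linalg.solve(np.eye(n) - gamma * T, r_vec)
    return dict(zip(states, U))
```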
ADP (Contd.)
- The estimates of T and R keep changing as more trials are observed
- Rather than re-solving the linear system from scratch after every change, make use of the modified policy iteration idea: run a few Bellman updates starting from the previous utility estimates
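A hedged sketch of that idea, reusing the estimated-model layout from the ADP sketch above (names are illustrative):

```python
# Sketch of the modified-policy-iteration idea: instead of re-solving the
# full linear system after every observation, run a small number of Bellman
# backups under the fixed policy, starting from the previous estimates U.
# next_counts and visits are the estimated model counts from the ADP sketch.

def bellman_sweeps(U, R, next_counts, visits, gamma=1.0, sweeps=5):
    """Run a few simplified Bellman updates and return the refreshed U."""
    for _ in range(sweeps):
        for s in R:
            if s not in next_counts:          # terminal state: U(s) = R(s)
                U[s] = R[s]
                continue
            expected = sum((c / visits[s]) * U.get(s_next, 0.0)
                           for s_next, c in next_counts[s].items())
            U[s] = R[s] + gamma * expected
    return U
```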