Reinforcement Learning CSci 5512: Artificial Intelligence II
Outline
- Reinforcement Learning
- Passive Reinforcement Learning
- Active Reinforcement Learning
- Generalizations
- Policy Search

Reinforcement Learning (RL)
- Learning what to do to maximize reward
- Learner is not given training examples; the only feedback is the reward
- Try things out and see what the reward is
- Different from supervised learning, where a teacher gives training examples

Examples
- Robotics: quadruped gait control, ball acquisition (RoboCup)
- Control: helicopters
- Operations research: pricing, routing, scheduling
- Game playing: backgammon, solitaire, chess, checkers
- Human-computer interaction: spoken dialogue systems
- Economics/finance: trading

MDP vs RL
- Markov decision process:
  - Set of states S, set of actions A
  - Transition probabilities to next states T(s, a, s')
  - Reward function R(s)
- RL is based on MDPs, but:
  - Transition model is not known
  - Reward model is not known
- An MDP solver computes an optimal policy; RL learns an optimal policy

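As a toy illustration (not from the slides), the contrast can be made concrete: an MDP planner is handed T and R as explicit tables, while an RL agent only ever sees sampled experience gathered by acting. The two-state example below is purely hypothetical.

```python
# Hypothetical two-state example contrasting what an MDP planner is given
# with what an RL agent actually observes.

# Planner's view: the full model is available as tables.
T = {                                   # T[s][a] -> list of (s_next, prob)
    "s0": {"go": [("s0", 0.5), ("s1", 0.5)]},
    "s1": {"go": [("s1", 1.0)]},
}
R = {"s0": -0.04, "s1": 1.0}            # reward for being in each state

# RL agent's view: only sampled transitions gathered by acting,
# recorded as (state, action, reward, next state) tuples.
experience = [("s0", "go", -0.04, "s1"), ("s1", "go", 1.0, "s1")]
```

The rest of the lecture is about recovering useful quantities (utilities of states, and eventually a policy) from such experience alone.
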
Types of RL
- Passive vs Active
  - Passive: agent executes a fixed policy and evaluates it
  - Active: agent updates its policy as it learns
- Model-based vs Model-free
  - Model-based: learn the transition and reward models, use them to get the optimal policy
  - Model-free: derive the optimal policy without learning a model

Passive Learning
[Figure: the 4x3 gridworld with terminal states +1 and -1, alongside a grid showing the learned utilities of the nonterminal states (0.611, 0.812, 0.655, 0.762, 0.918, 0.705, 0.660, 0.868, 0.388).]
- Evaluate how good a policy π is
- Learn the utility U^π(s) of each state
- Same as policy evaluation for known transition and reward models

Passive Learning (Contd.)
[Figure: the same 4x3 gridworld.]
- Agent executes a sequence of trials, e.g.:
  (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
  (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) +1
  (1,1) → (2,1) → (3,1) → (3,2) → (4,2) -1
- Goal is to learn the expected utility U^π(s):

      U^\pi(s) = E\Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi, s_0 = s \Big]

Direct Utility Estimation
- Reduction to inductive learning
- Compute the empirical value of each state
  - Each trial gives a sample value
  - Estimate the utility based on the sample values
- Example: the first trial gives
  - State (1,1): a sample of reward 0.72
  - State (1,2): two samples of reward 0.76 and 0.84
  - State (1,3): two samples of reward 0.80 and 0.88
- Estimate can be a running average of sample values
- Example: U(1,1) = 0.72, U(1,2) = 0.80, U(1,3) = 0.84, ...

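As a minimal sketch of direct utility estimation on the trials above, the code below computes each state's sample returns and averages them. It assumes γ = 1 and a reward of -0.04 in every nonterminal state; the slides do not state the step reward, but this assumption reproduces the quoted sample values (0.72, 0.76/0.84, 0.80/0.88).

```python
# Illustrative sketch of direct utility estimation (gamma = 1; the -0.04
# nonterminal step reward is an assumption, chosen because it reproduces
# the sample values quoted on the slide).

from collections import defaultdict

def sample_returns(trial, gamma=1.0):
    """trial: list of (state, reward) pairs; returns (state, return-to-go)
    for every visit in the trial."""
    G, samples = 0.0, []
    for state, reward in reversed(trial):
        G = reward + gamma * G          # reward-to-go from this visit
        samples.append((state, G))
    return samples

def direct_utility_estimate(trials, gamma=1.0):
    """Average the sample returns over every visit to each state."""
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:
        for state, G in sample_returns(trial, gamma):
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# First trial from the slide, with the assumed -0.04 nonterminal reward:
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]

print(direct_utility_estimate([trial1]))
# -> approximately U(1,1) = 0.72, U(1,2) = 0.80, U(1,3) = 0.84, ...
```

The remaining trials would simply be appended to the list passed to direct_utility_estimate, refining the running averages.
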
Direct Utility Estimation (Contd.)
- Ignores a very important source of information: the utilities of states satisfy the Bellman equations

      U^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') \, U^\pi(s')

- The search is over a hypothesis space for U that is much larger than needed
- Convergence is very slow

Adaptive Dynamic Programming (ADP)
- Make use of the Bellman equations to get U^π(s):

      U^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') \, U^\pi(s')

- Need to estimate T(s, π(s), s') and R(s) from the trials
- Plug the learnt transition and reward models into the Bellman equations
- Solving for U^π is then a system of n linear equations

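A minimal sketch of how ADP policy evaluation could look (the function and variable names are my own, not from the slides): build maximum-likelihood estimates of T and R from transitions logged under the fixed policy, then solve the resulting Bellman system directly.

```python
# Illustrative ADP policy evaluation. The agent logs (s, r, s_next)
# transitions while following its fixed policy; visit counts give
# maximum-likelihood estimates of T(s, pi(s), s') and R(s), and the
# Bellman equations become a linear system in U.

import numpy as np
from collections import defaultdict

def adp_evaluate(transitions, gamma=0.9):
    """transitions: list of (s, r, s_next) observed under the fixed policy;
    terminal steps are given as (s, r, None)."""
    states = sorted({s for s, _, _ in transitions})
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)

    succ_counts = defaultdict(lambda: defaultdict(int))
    reward_sum, visits = defaultdict(float), defaultdict(int)
    for s, r, s_next in transitions:
        reward_sum[s] += r
        visits[s] += 1
        if s_next is not None:
            succ_counts[s][s_next] += 1

    # Bellman: U = R_hat + gamma * T_hat @ U  <=>  (I - gamma * T_hat) U = R_hat
    T_hat = np.zeros((n, n))
    R_hat = np.zeros(n)
    for s in states:
        R_hat[idx[s]] = reward_sum[s] / visits[s]
        total = sum(succ_counts[s].values())
        for s_next, c in succ_counts[s].items():
            T_hat[idx[s], idx[s_next]] = c / total
    U = np.linalg.solve(np.eye(n) - gamma * T_hat, R_hat)
    return dict(zip(states, U))
```

Because the counts are updated after every new observation, the estimates T_hat and R_hat keep changing and U must be re-solved (or incrementally updated), which is the point picked up on the next slide.
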
ADP (Contd.)
- Estimates of T and R keep changing