rl-f10

Reinforcement Learning
(Slides for this part are adapted from those of Dan Klein @ UCB.)

Reinforcement learning is self-learning through a simulator. [Infants don't get to simulate the world, since they have neither T(.) nor R(.) of their world.]

Objective(s) of Reinforcement Learning
Given:
- your effectors and perceptors (assume full observability of state as well as of reward),
- the world (raw in tooth and claw),
- (sometimes) a simulator [so you get ergodicity and can repeat futures],
learn how to perform well. This may involve...

Dimensions of Variation of RL Algorithms

Model-based vs. Model-free
- Model-based: have/learn action models (i.e., transition probabilities); e.g., approximate DP.
- Model-free: learn values (or a policy) directly from experience, without learning a transition model.

Passive vs. Active
- Passive: assume the agent is already following a fixed policy, so there is no action choice to be made; you just need to learn the state values (and maybe action values).
- Active: the agent must also choose its actions, so exploration becomes an issue (see the exploration notes below).

Dimensions of Variation (contd.)
Extent of backup:
- Full DP: adjust a state's value based on the values of all its neighbors (as predicted by the transition model). This can only be done when a transition model is present.
Generalization:
- Learn tabular representations, or
- learn feature-based (factored) representations with online inductive learning methods.

When you were a kid, your policy was mostly dictated by your parents (if it is 6 AM, wake up and go to school). You did, however, learn to detest Mondays and look forward to Fridays.

Inductive Learning over Direct Estimation
- States are represented in terms of features.
- The long-term cumulative rewards experienced from the states become their labels.
- Do inductive learning (regression) to find the function that maps features to values (a sketch appears at the end of these notes).
- This generalizes the experience beyond the specific states we saw.
- We are basically doing EMPIRICAL policy evaluation! (But we know this will be wasteful, since it misses the correlation between the values of neighboring states.) So instead, do DP-based policy evaluation (also sketched below).

Passive: Robustness in the Face of Model Uncertainty
- Suppose you ran through a red light a couple of times and reached home faster. Should we learn that running through red lights is a good action?
- This is a general issue with maximum-likelihood learning: if you tossed a coin three times and it came up heads twice, can you say that the probability of heads is 2/3?
- General solution: Bayesian learning (see the coin example below).

Active Learning with Monte Carlo
- Model completeness issue: without exploration, the agent only learns about the parts of the world its current behavior visits.

GLIE (Greedy in the Limit of Infinite Exploration)
- Must try all state-action combinations infinitely often, but must become greedy in the limit (e.g., set the exploration probability to f(1/t)).
- Idea: keep track of the number of times each state/action pair has been explored; below a threshold, boost the value of that pair (optimism for exploration).
- U+(s, a) is set to R+ (the maximum optimistic reward) as long as N(s, a) is below a threshold (sketched below).
- Qn: what if a very unlikely negative (or positive) transition biases the estimate?

Temporal Difference won't directly work for Active Learning... (a TD sketch is included at the end of these notes.)
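To make the "inductive learning over direct estimation" step concrete, here is a minimal sketch that is not from the slides: every visited state contributes a (features, observed discounted return) training pair, and an ordinary least-squares fit generalizes those labels to unseen states. The feature vectors, rewards, and helper names (returns_from_episode, fit_linear_values) are invented for illustration.

```python
# Sketch (illustrative): direct estimation + regression over state features.
# Each visited state contributes (features(s), observed return from s) as a
# training example; a linear fit then generalizes to states never visited.
import numpy as np

GAMMA = 0.9

def returns_from_episode(rewards):
    """Discounted return G_t observed from every time step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        out.append(G)
    return list(reversed(out))

def fit_linear_values(feature_vectors, returns):
    """Least-squares fit of V(s) ~ w . phi(s)."""
    X = np.array(feature_vectors)   # one row of features per visited state
    y = np.array(returns)           # empirical return observed from that state
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical usage: three visited states described by two features each.
phis = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
G = returns_from_episode([0.0, 0.0, 1.0])          # returns along one episode
w = fit_linear_values(phis, G)
print("V(unseen state with features [0.8, 0.2]) ~", w @ np.array([0.8, 0.2]))
```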
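A matching sketch of the DP-based alternative for the passive case, under the same caveat: the agent follows a fixed policy, fits T and R by maximum likelihood from the transitions it observes, and then runs iterative policy evaluation on that learned model. The tiny two-state example and all function names are assumptions made for this illustration.

```python
# Sketch (illustrative): passive, model-based RL.
# Follow a fixed policy, fit T(s,a,s') and R(s) by maximum likelihood from the
# observed transitions, then evaluate the policy by DP on the learned model.
from collections import defaultdict

GAMMA = 0.9

def learn_model(trajectories):
    """Maximum-likelihood T and R from lists of (s, a, r, s2) transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for traj in trajectories:
        for s, a, r, s2 in traj:
            counts[(s, a)][s2] += 1
            rewards[s].append(r)
    T = {sa: {s2: n / sum(nxt.values()) for s2, n in nxt.items()}
         for sa, nxt in counts.items()}
    R = {s: sum(rs) / len(rs) for s, rs in rewards.items()}
    return T, R

def evaluate_policy(policy, T, R, n_sweeps=100):
    """Iterative evaluation: V(s) <- R(s) + gamma * sum_s' T(s,pi(s),s') V(s')."""
    V = defaultdict(float)
    for _ in range(n_sweeps):
        for s, a in policy.items():
            succ = T.get((s, a), {})
            V[s] = R.get(s, 0.0) + GAMMA * sum(p * V[s2] for s2, p in succ.items())
    return dict(V)

# Hypothetical two-state example: one observed trajectory under a fixed policy.
traj = [("A", "go", 0.0, "B"), ("B", "go", 1.0, "A"), ("A", "go", 0.0, "B")]
T, R = learn_model([traj])
print(evaluate_policy({"A": "go", "B": "go"}, T, R))
```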
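The coin-toss point can be shown in a few lines: maximum likelihood commits to 2/3 after three tosses, while a Bayesian estimate with a uniform Beta(1, 1) prior (Laplace smoothing, one common choice) pulls the estimate back toward 1/2.

```python
# The coin example: maximum likelihood vs. a simple Bayesian (Beta prior) estimate.
heads, tosses = 2, 3

# Maximum likelihood: just the empirical frequency.
p_ml = heads / tosses                     # 0.666...

# Bayesian posterior mean with a uniform Beta(1, 1) prior (Laplace smoothing):
# posterior is Beta(1 + heads, 1 + tails), whose mean is (heads + 1) / (tosses + 2).
p_bayes = (heads + 1) / (tosses + 2)      # 0.6, pulled toward 1/2

print(p_ml, p_bayes)
```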
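A small sketch of the two exploration devices mentioned above, with invented constants: a GLIE-style action choice whose exploration probability decays like 1/t, and an optimistic value U+ that reports R+ until N(s, a) reaches a threshold. R_PLUS, N_E, and the function names are assumptions made for the example.

```python
# Sketch (illustrative): GLIE-style exploration and optimism for exploration.
import random
from collections import defaultdict

R_PLUS = 10.0          # optimistic reward assumed for under-explored pairs
N_E = 5                # try each (s, a) at least this many times before trusting Q
N = defaultdict(int)   # visit counts N(s, a)
Q = defaultdict(float) # current action-value estimates

def exploration_value(s, a):
    """U+(s, a): optimistic value while (s, a) is still under-explored."""
    return R_PLUS if N[(s, a)] < N_E else Q[(s, a)]

def glie_action(s, actions, t):
    """Greedy in the limit of infinite exploration: explore with prob ~ 1/t."""
    epsilon = 1.0 / t
    if random.random() < epsilon:
        return random.choice(actions)     # keep trying all actions
    return max(actions, key=lambda a: exploration_value(s, a))  # become greedy
```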
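Finally, the temporal-difference rule that the closing remark refers to is the passive TD(0) backup: after observing a transition, nudge V(s) toward the one-step sample r + gamma * V(s'). The learning rate and state names below are placeholders; the remark's point is that this passive rule only evaluates the policy being followed.

```python
# Sketch (illustrative): the passive TD(0) update.
# After observing a transition s --(r)--> s2 while following the fixed policy:
#     V(s) <- V(s) + alpha * (r + gamma * V(s2) - V(s))
GAMMA, ALPHA = 0.9, 0.1

def td0_update(V, s, r, s2):
    """One temporal-difference backup toward the sampled successor value."""
    v = V.get(s, 0.0)
    V[s] = v + ALPHA * (r + GAMMA * V.get(s2, 0.0) - v)
    return V

V = td0_update({}, "home", 1.0, "work")
print(V)   # {'home': 0.1}
```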