Reinforcement Learning
Slides for this part are adapted from those of Dan Klein @ UCB

The agent does self-learning through interaction with the world or a simulator. [Infants don't get to simulate the world, since they have neither T(.) nor R(.) of their world.]

Objective(s) of Reinforcement Learning
Given:
- your effectors and perceptors (assume full observability of state as well as reward)
- the world (raw in tooth and claw)
- (sometimes) a simulator [so you get ergodicity and can repeat futures]
Learn how to perform well.

Dimensions of Variation of RL Algorithms

Model-based vs. Model-free
- Model-based: have/learn action models (i.e. transition probabilities), e.g. approximate DP.

Passive vs. Active
- Passive: assume the agent is already following a fixed policy (so there is no action choice to be made); you just need to learn the state values (and perhaps the action model).

Dimensions of Variation (contd.)

Extent of Backup
- Full DP: adjust a state's value based on the values of all its neighbors (as predicted by the transition model). Can only be done when a transition model is present.

Generalization
- Learn tabular representations.
- Learn feature-based (factored) representations: online inductive learning methods.

[When you were a kid, your policy was mostly dictated by your parents (if it is 6 AM, wake up and go to school). You did, however, learn to detest Mondays and look forward to Fridays.]

Inductive Learning over Direct Estimation
- States are represented in terms of features.
- The long-term cumulative rewards experienced from the states become their labels.
- Do inductive learning (regression) to find the function that maps features to values. This generalizes the experience beyond the specific states we saw.
- We are basically doing EMPIRICAL policy evaluation! (But we know this will be wasteful, since it misses the correlation between values of neighboring states!) Do DP-based policy evaluation!
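The DP-based policy evaluation mentioned above can be sketched as follows. This is a minimal illustration, not code from the course: the two-state chain, its transition probabilities under the fixed policy, the rewards, and the discount factor are all made-up values. It exploits exactly the neighbor correlation that pure empirical (Monte Carlo) estimation misses, by backing each state's value up from its successors' values.

```python
# Sketch: iterative policy evaluation for a fixed policy.
# V(s) = R(s) + gamma * sum_s' T(s, s') * V(s')
# T is the transition model under the policy; all numbers are illustrative.

def evaluate_policy(T, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman backup for a fixed policy until values converge."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            # Back up from the values of all neighbors, weighted by T(s, s').
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in T[s].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Hypothetical two-state chain: from A the policy usually moves to B;
# B loops back to itself and yields reward 1 each step.
T = {"A": {"A": 0.1, "B": 0.9}, "B": {"B": 1.0}}
R = {"A": 0.0, "B": 1.0}
V = evaluate_policy(T, R)
```

Note that this backup requires the transition model T; a passive learner without a model would instead have to estimate values from observed returns alone.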
Passive: Robustness in the Face of Model Uncertainty
- Suppose you ran through a red light a couple of times and reached home faster. Should we learn that running through red lights is a good action?
- This is a general issue with maximum-likelihood learning: if you tossed a coin thrice and it came up heads twice, can you say that the probability of heads is 2/3?
- General solution: Bayesian learning.

Active Learning with Monte Carlo
- Model completeness issue.
- GLIE: Greedy in the Limit of Infinite Exploration. Must try all state-action combinations infinitely often, but must become greedy in the limit (e.g., set the exploration rate to f(1/t)).
- Idea: keep track of the number of times a state-action pair has been explored; below a threshold, boost the value of that pair (optimism for exploration). U+ is set to R+ (the maximum optimistic reward) as long as N(s,a) is below a threshold.
- Qn: What if a very unlikely negative (or positive) transition biases the estimate?

Temporal Difference won't directly work for Active Learning...
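The optimism-for-exploration idea above can be sketched in a few lines. This is an illustrative toy, not the course's implementation: the one-step world, the action names, the bound R_PLUS standing in for R+, and the threshold N_E are all assumptions. Under-tried state-action pairs are reported as maximally valuable (U+ = R+) until N(s,a) crosses the threshold, which forces the learner to try every action several times before settling on the greedy one.

```python
from collections import defaultdict

R_PLUS = 10.0  # assumed optimistic reward bound R+
N_E = 5        # assumed visit-count threshold

def exploration_value(q, n):
    """U+(s,a): pretend rarely tried pairs are maximally rewarding."""
    return R_PLUS if n < N_E else q

def q_learn(step, start, actions, episodes=200, alpha=0.5, gamma=0.9):
    """Q-learning where action selection is greedy in U+ rather than Q."""
    Q = defaultdict(float)  # Q[(s, a)] estimates
    N = defaultdict(int)    # N[(s, a)] visit counts
    for _ in range(episodes):
        s = start
        while s is not None:  # None marks a terminal state
            a = max(actions, key=lambda act: exploration_value(Q[s, act], N[s, act]))
            N[s, a] += 1
            s2, r = step(s, a)
            best_next = 0.0 if s2 is None else max(Q[s2, b] for b in actions)
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
            s = s2
    return Q

# Hypothetical one-step world: "red" (run the light) pays less than "green"
# in the long run; both end the episode immediately.
def step(s, a):
    return (None, 1.0) if a == "green" else (None, 0.5)

Q = q_learn(step, "home", ["green", "red"])
```

Because the exploration function boosts both actions until each has been tried N_E times, the agent samples "red" as well as "green" before becoming greedy, rather than locking onto whichever action happened to pay off first.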
This note was uploaded on 03/11/2012 for the course CSE 571 taught by Professor Baral during the Fall '08 term at ASU.