CSE 6740 Lecture 16
How Do I Learn Actions From Data? (Reinforcement Learning)
Alexander Gray
agray@cc.gatech.edu
Georgia Institute of Technology

Quiz Answers
1. A stationary process is IID. F.
2. A purely random process is IID. T.
3. A Markov model is a special case of an autoregressive process. T.

Today
1. Reinforcement Learning
2. RL Methods: Known Environment
3. RL Methods: Unknown Environment

Reinforcement Learning
Learning optimal actions from data.

Reinforcement Learning
Reinforcement learning (RL) is about learning optimal actions. In the RL framework, you are an agent in an environment with discrete states $s$. In each state you can take some actions $a$, and taking certain actions in certain states earns you a scalar reward $r$. For example:

Environment: You're in state 65. You have four possible actions.
You: I'll take action 2.
Environment: You received a reward of 7 units. You are now in state 15. You have two possible actions.
You: I'll take action 1.
Environment: You received a reward of 4 units...

Optimal Policy
In general, our model is that the rewards and next-state transitions are nondeterministic, i.e. they occur with some probability. Your job is to find a policy $\pi$, saying which action should be taken in each state, that maximizes some long-run measure of reward. One example is the finite-horizon reward:

$$E\left(\sum_{t=0}^{h} r_t\right) \tag{1}$$

which says you should optimize your expected reward for the next $h$ steps, and need not worry about what will happen after that.

Optimal Policy
Since $h$ is arbitrary, an infinite-horizon reward is generally preferred, like the average reward:

$$\lim_{h \to \infty} E\left(\frac{1}{h}\sum_{t=0}^{h} r_t\right). \tag{2}$$

One possible problem with this is that we might prefer a policy which gains a large amount of reward earlier. The geometrically discounted reward does that:

$$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right), \tag{3}$$

where $\gamma < 1$ is a mathematical device for bounding the infinite sum.

Optimality and Convergence
Which objective function should we use? The finite-horizon model is appropriate when the agent's lifetime is known, or there is a hard deadline. Average-reward and discounted-reward lead to different behaviors, and which one is better is a matter of debate. More is known about the discounted-reward case because it is mathematically more convenient. For example, for this model, it can be shown that there exists an optimal deterministic (always choose the same action in a state), stationary (doesn't change over time) policy....
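To make the difference between the three reward criteria in equations (1) to (3) concrete, here is a minimal Python sketch that evaluates each one on a single, fixed reward sequence. The function names and the two example sequences (`early`, `late`) are illustrative assumptions, not part of the lecture. The expectation $E(\cdot)$ is omitted: in practice it would be estimated by averaging these quantities over many sampled trajectories, and the infinite sums are truncated to the rewards available.

```python
def finite_horizon_return(rewards, h):
    """Sum of r_0 .. r_h, the quantity inside the expectation in equation (1)."""
    return sum(rewards[: h + 1])


def average_return(rewards, h):
    """(1/h) * sum of r_0 .. r_h; equation (2) takes the limit h -> infinity."""
    return sum(rewards[: h + 1]) / h


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the given rewards (a truncation of equation (3))."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Two hypothetical reward sequences with the same total reward:
# one pays out immediately, the other only at the last step.
early = [10, 0, 0, 0, 0, 0]
late = [0, 0, 0, 0, 0, 10]

h = 5
gamma = 0.5

print(finite_horizon_return(early, h), finite_horizon_return(late, h))  # 10 10
print(average_return(early, h), average_return(late, h))                # 2.0 2.0
print(discounted_return(early, gamma), discounted_return(late, gamma))  # 10.0 0.3125
```

In this example the finite-horizon and average criteria cannot tell the two sequences apart, while the discounted criterion strongly favors the sequence that collects its reward early.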
 Fall '08
 Staff
