CSE 6740 Lecture 16: How Do I Learn Actions From Data? (Reinforcement Learning)


Alexander Gray
agray@cc.gatech.edu
Georgia Institute of Technology

Quiz Answers
1. A stationary process is IID. F.
2. A purely random process is IID. T.
3. A Markov model is a special case of an autoregressive process. T.

Today
1. Reinforcement Learning
2. RL Methods: Known Environment
3. RL Methods: Unknown Environment

Reinforcement Learning
Learning optimal actions from data.

Reinforcement Learning
Reinforcement learning (RL) is about learning optimal actions. The framework of RL is that you are an agent in an environment with discrete states s; in each state you can take some action a, and taking certain actions in certain states yields a scalar reward r. For example:

Environment: You're in state 65. You have four possible actions.
You: I'll take action 2.
Environment: You received a reward of 7 units. You are now in state 15. You have two possible actions.
You: I'll take action 1.
Environment: You received a reward of -4 units...

Optimal Policy
In general our model is that the rewards and next-state transitions are non-deterministic, i.e. they occur with some probability. Your job is to find a policy, saying which action should be taken in each state, that maximizes some long-run measure of reward. One example is the finite-horizon reward:

E\left( \sum_{t=0}^{h} r_t \right)    (1)

which says you should optimize your expected reward for the next h steps, and need not worry about what will happen after that.

Optimal Policy
Since h is arbitrary, an infinite-horizon reward is generally preferred, such as the average reward:

\lim_{h \to \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right).    (2)

One possible problem with this is that we might prefer a policy which gains a large amount of reward earlier. The geometrically discounted reward does that:

E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right),    (3)

where \gamma < 1 is a mathematical device for bounding the infinite sum.

Optimality and Convergence
Which objective function should we use? The finite-horizon model is appropriate when the agent's lifetime is known, or there is a hard deadline. Average reward and discounted reward lead to different behaviors, and which one is better is a matter of debate. More is known about the discounted-reward case because it is mathematically more convenient. For example, for this model, it can be shown that there exists an optimal deterministic (always choose the same action in a state), stationary (doesn't change over time) policy....
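The three reward criteria (1)-(3) differ only in how they aggregate the per-step rewards r_t. As a minimal sketch, not part of the lecture, the following Python computes the finite-horizon, average, and discounted returns for a single sample path generated by a random policy in a made-up environment; the environment, policy, and discount factor gamma = 0.9 are hypothetical illustrations.

# Minimal sketch (not from the lecture): the three long-run reward criteria,
# evaluated on one sampled trajectory, plus a toy agent-environment loop.
# The environment, policy, and numbers below are hypothetical.
import random

def finite_horizon_return(rewards, h):
    """Sum of rewards for t = 0..h, the sample analogue of eq. (1)."""
    return sum(rewards[: h + 1])

def average_return(rewards):
    """Empirical average reward (1/h) * sum of r_t, the sample analogue of eq. (2)."""
    return sum(rewards) / len(rewards)

def discounted_return(rewards, gamma=0.9):
    """Geometrically discounted reward sum of gamma^t * r_t, eq. (3); gamma < 1 bounds the sum."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A tiny made-up environment: states 0..4, two actions,
# stochastic rewards and stochastic next-state transitions.
def step(state, action, rng):
    reward = rng.choice([-4, 0, 7])      # scalar reward r, drawn at random
    next_state = rng.randrange(5)        # non-deterministic next state
    return reward, next_state

rng = random.Random(0)
state, rewards = 0, []
for t in range(50):                      # the agent follows a random policy here
    action = rng.randrange(2)
    r, state = step(state, action, rng)
    rewards.append(r)

print("finite-horizon (h=10): ", finite_horizon_return(rewards, 10))
print("average reward:        ", average_return(rewards))
print("discounted (gamma=0.9):", discounted_return(rewards, 0.9))

Under the random policy above these returns are only diagnostics; the RL methods in the rest of the lecture concern finding the policy that maximizes the chosen criterion.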