# CSE 6740 Lecture 16: How Do I Learn Actions From Data? (Reinforcement Learning)

CSE 6740 Lecture 16: How Do I Learn Actions From Data? (Reinforcement Learning)
Alexander Gray ([email protected])
Georgia Institute of Technology
CSE 6740 Lecture 16 – p.1/47

## Quiz Answers

1. A stationary process is IID. F.
2. A purely random process is IID. T.
3. A Markov model is a special case of an autoregressive process. T.
## Today

1. Reinforcement Learning
2. RL Methods: Known Environment
3. RL Methods: Unknown Environment

## Reinforcement Learning

Learning optimal actions from data.
## Reinforcement Learning

Reinforcement learning (RL) is about learning optimal actions. The framework of RL is that you are an agent in an environment with discrete states s; in each state you can take some actions a, and taking certain actions in certain states yields a scalar reward r. For example:

> Environment: You're in state 65. You have four possible actions.
> You: I'll take action 2.
> Environment: You received a reward of 7 units. You are now in state 15. You have two possible actions.
> You: I'll take action 1.
> Environment: You received a reward of -4 units...
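This dialogue can be sketched as an agent-environment loop. Everything concrete below (the `ToyEnvironment` class, its states, rewards, and transition behavior) is a hypothetical stand-in for illustration, not part of the lecture:

```python
import random

class ToyEnvironment:
    """Hypothetical environment with discrete states: each action
    yields a scalar reward and a (non-deterministic) next state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 65  # start in the state from the example dialogue

    def actions(self):
        # two actions available in every state in this toy version
        return [0, 1]

    def step(self, action):
        # rewards and next-state transitions occur with some probability
        reward = self.rng.choice([-4, 0, 7])
        self.state = self.rng.randrange(100)
        return reward, self.state

env = ToyEnvironment()
agent_rng = random.Random(1)
total = 0
for t in range(5):
    a = agent_rng.choice(env.actions())  # a random (non-optimal) policy
    r, s = env.step(a)
    total += r
    print(f"t={t}: took action {a}, reward {r}, now in state {s}")
```

A real RL agent would replace the random action choice with a policy learned from the observed rewards.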

This preview has intentionally blurred sections. Sign up to view the full version.

View Full Document
## Optimal Policy

In general our model is that the rewards and next-state transitions are non-deterministic, i.e. they occur with some probability. Your job is to find a policy π, saying which action should be taken in each state, that maximizes some long-run measure of reward. One example is the finite-horizon reward:

$$
E\left( \sum_{t=0}^{h} r_t \right) \qquad (1)
$$

which says you should optimize your expected reward for the next h steps, and need not worry about what will happen after that.
## Optimal Policy

Since h is arbitrary, an infinite-horizon reward is generally preferred, such as the average reward:

$$
\lim_{h \to \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right). \qquad (2)
$$

One possible problem with this is that we might prefer a policy which gains a large amount of reward earlier. The geometrically discounted reward captures that preference:

$$
E\left( \sum_{t=0}^{\infty} \gamma^t r_t \right), \qquad (3)
$$

where $0 \le \gamma < 1$ is a mathematical device for bounding the infinite sum.
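For a concrete reward sequence (the numbers below are made up for illustration), the three criteria can be computed directly; here the expectations in (1)-(3) are replaced by a single fixed trajectory, and the infinite discounted sum is truncated at the end of the sequence:

```python
rewards = [7, -4, 3, 0, 5, 2]   # hypothetical per-step rewards r_0, r_1, ...
h = len(rewards) - 1
gamma = 0.9                     # discount factor, 0 <= gamma < 1

# (1) finite-horizon reward: sum of rewards for steps 0..h
finite_horizon = sum(rewards[: h + 1])

# (2) average reward: the per-step average over the (here finite) run
average = sum(rewards) / len(rewards)

# (3) geometrically discounted reward: sum of gamma^t * r_t
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

print(finite_horizon, average, discounted)
```

Note the role of γ: if rewards are bounded by R, the discounted sum is bounded by R / (1 - γ), which is why γ < 1 keeps the infinite-horizon objective finite.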

## Optimality and Convergence

Which objective function should we use? The finite-horizon model is appropriate when the agent's lifetime is known, or when there is a hard deadline. The average-reward and discounted-reward criteria lead to different behaviors, and which is better is a matter of debate. More is known about the discounted-reward case because it is mathematically more convenient. For example, under this model it can be shown that there exists an optimal policy that is deterministic (it always chooses the same action in a given state) and stationary (it does not change over time). We can analyze whether a procedure converges to the optimal policy, and we can determine its rate of convergence.
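As a concrete illustration of these convergence claims, here is a minimal sketch of value iteration, a standard method for the known-environment, discounted-reward setting; the two-state MDP below is entirely made up. Because 0 ≤ γ < 1 makes the Bellman update a contraction, the values converge geometrically, and the greedy policy they induce is deterministic and stationary:

```python
gamma = 0.9
states = [0, 1]
actions = [0, 1]

# P[s][a] = list of (probability, next_state, reward) triples (hypothetical MDP)
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in state 0, no reward
        1: [(0.8, 1, 5.0), (0.2, 0, 0.0)]},  # usually jump to state 1, reward 5
    1: {0: [(1.0, 0, 1.0)],
        1: [(1.0, 1, 2.0)]},
}

def q(s, a, V):
    """Expected one-step return of action a in state s under values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

# Bellman update, iterated until (geometric) convergence
V = {s: 0.0 for s in states}
for _ in range(200):
    V = {s: max(q(s, a, V) for a in actions) for s in states}

# The greedy policy w.r.t. the converged values is deterministic and stationary.
policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
print(V, policy)
```

After convergence the values satisfy the fixed-point property V(s) = max_a q(s, a, V), which is what guarantees the greedy policy is optimal for this toy model.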