RL - Reinforcement Learning Variation on Supervised...


CS 478 - Reinforcement Learning

Reinforcement Learning
- A variation on supervised learning: exact target outputs are not given
- Some form of reward is given, either immediately or after some number of steps (e.g., chess, path discovery)
- RL systems learn a mapping from states to actions by trial-and-error interactions with a dynamic environment
- Example: TD-Gammon (successor to Neurogammon)

RL Basics
- The agent has sensors and actions; it can sense the state of the environment (position, etc.) and has a set of possible actions
- Actual rewards for actions taken from a state are usually delayed and do not give direct information about how best to arrive at the reward
- RL seeks to learn the optimal policy: which action the agent should take in a given state to achieve the agent's goals (e.g., maximize reward)
Learning a Policy
- Find the optimal policy π : S -> A, i.e., a = π(s), where a ∈ A and s ∈ S
- Temporal credit assignment problem: which actions in a sequence leading to a goal should be rewarded, punished, etc.?
- Exploration vs. exploitation: to what extent should we explore new, unknown states (hoping for better opportunities) vs. take the best possible action based on the knowledge already gained?
- Markovian? Do we base the action decision on the current state alone, or is there some memory of past states?
  - Basic RL assumes a Markovian process: the action outcome is a function of the current state only, and the state is fully observable
  - Basic RL does not directly handle partially observable states
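Exploration vs. exploitation is often handled with an ε-greedy rule. A minimal Python sketch, assuming the agent keeps a table of estimated action values for the current state (the names `q_values` and `epsilon` are illustrative, not from the slides):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 0 this reduces to pure exploitation; larger values trade short-term reward for more exploration of unknown states.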

Rewards
- Assume a reward function r(s, a)
- A common approach: a positive reward for entering a goal state (winning the game, gaining a resource, etc.), a negative reward for entering a bad state (losing the game, losing a resource, etc.), and 0 for all other transitions
- Alternatively, make every transition reward -1 except 0 for entering the goal state; this leads to finding a minimal-length path to a goal
- Discount factor γ, between 0 and 1: future rewards are discounted
- Value function V(s): the value of a state is the sum of the discounted rewards received when starting in that state and following a fixed policy until reaching a terminal state
- V(s) is also called the discounted cumulative reward:

  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^∞ γ^i r_{t+i}
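The discounted cumulative reward can be computed directly from a finite reward sequence. A small sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^i * r_{t+i} over the reward sequence:
    V(s_t) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```

For example, a single reward of 1 received two steps in the future, with γ = 0.9, is worth 0.9² = 0.81 now; with γ = 1 rewards of -1 per step simply count path length.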
[Figure: gridworld examples. One grid shows a reward function of -1 per transition and 0 for entering the goal; the corresponding value grids contain entries such as 0, -14, -20, -22 (a fixed policy) and 0, -1, -2 (minimal path lengths to the goal). A second reward function gives 1 for entering the goal, with discounted values 1, .90, .81 under γ = 0.9.]
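Value grids like those in the figure can be reproduced with iterative policy evaluation. A sketch under stated assumptions: a 4×4 grid with two terminal corner states, a reward of -1 per move, and the equiprobable random policy (this is the classic gridworld setup, chosen for illustration; it is not necessarily the exact grid from the slide):

```python
def policy_evaluation_gridworld(n=4, gamma=1.0, tol=1e-6):
    """Iterative policy evaluation for the equiprobable random policy on an
    n x n gridworld: reward -1 per move, terminal states at two corners,
    moves off the grid leave the agent in place."""
    V = [[0.0] * n for _ in range(n)]
    terminals = {(0, 0), (n - 1, n - 1)}
    while True:
        delta = 0.0
        new_V = [[0.0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if (r, c) in terminals:
                    continue  # terminal states keep value 0
                total = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < n and 0 <= nc < n):
                        nr, nc = r, c  # bump into a wall: stay put
                    total += 0.25 * (-1 + gamma * V[nr][nc])
                new_V[r][c] = total
                delta = max(delta, abs(new_V[r][c] - V[r][c]))
        V = new_V
        if delta < tol:
            return V
```

Run with γ = 1, this converges to values such as 0, -14, -20, -22 along the top row, matching the magnitudes seen in the figure.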


This document was uploaded on 10/24/2011.


