7.1 Reinforcement Learning

Machine Learning and Data Mining
COMP9417 Reinforcement Learning (RL)
Session 1, 2011
LIC: Mike Bain
Guest Lecturer: Bernhard Hengst
email: [email protected]

RL Introductory References

- Artificial Intelligence: A Modern Approach (Second Edition), Stuart Russell and Peter Norvig (Chapter 21)
- Machine Learning, T. Mitchell, 1997 (Chapters 1 and 13)
- Reinforcement Learning: An Introduction, R. Sutton and A. G. Barto, 1998 (HTML version linked from Sutton's home page)

Part 1: Introduction to Reinforcement Learning (RL)

- Intuition
- Agent view of RL
- Simple example introducing concepts
- Markov Decision Problems (MDPs)
  - Why Markov?
  - Stochastic problems
  - Infinite horizon problems
  - Exploration vs exploitation
  - Function approximation
  - Unknown model and Q-learning
- Pole balancing example

Reinforcement learning is about how a machine can learn the best way to act, given future rewards.

Checkers Example (Machine Learning, T. Mitchell, 1997, Chapter 1)

- Task T = playing checkers
- Performance measure P = games won
- Representation of board state: x = (x_1, x_2, ..., x_6)
- Target value function: V(board) = Σ_i w_i x_i
- Update rule: V(board) ← V(successor(board))
- In any board state, play the move with the best V(successor(board))
  (sketched in code below)

The Agent View of RL

[Slide figure: the agent senses the state of the environment through sensors and acts on it through effectors.]

One Room Problem

- Room states: 0 1 2 3 4 5 6 7 8 (a 3 x 3 grid, read row by row)
- Exit reward: $100
- Cost: $1 per time-step
- Actions: {N, S, E, W}
- Objective (goal): find a set of actions to maximise reward over time
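
The checkers example above can be put into code. What follows is a minimal sketch, not Mitchell's actual program: the six feature values, the learning rate, and the LMS-style weight update are illustrative assumptions.

    # Linear evaluation function: V(board) = Σ_i w_i * x_i, where x is a
    # vector of six board features (e.g. piece counts and threats).
    def value(weights, features):
        return sum(w * x for w, x in zip(weights, features))

    # Nudge the weights toward the training target, here taken to be the
    # estimated value of the successor board: V_train(b) = V(successor(b)).
    def lms_update(weights, features, target, lr=0.1):
        error = target - value(weights, features)
        return [w + lr * error * x for w, x in zip(weights, features)]

    weights = [0.0] * 6
    board = [12, 12, 0, 0, 1, 1]        # hypothetical feature vector x
    successor = [12, 11, 0, 0, 2, 0]    # features after the chosen move

    target = value(weights, successor)  # bootstrap from the successor
    weights = lms_update(weights, board, target)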
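
The one-room problem can likewise be sketched as a tiny environment plus the agent's sense-act loop. The 3 x 3 row-major layout and the exit from state 5 are assumptions read off the value grids that follow; treat the dynamics (walls, exit behaviour) as illustrative.

    import random

    MOVES = {"N": -3, "S": +3, "E": +1, "W": -1}
    EXIT_STATE = 5   # assumed: the state whose optimal value is $100 below

    def step(state, action):
        """Return (next_state, reward, done) for the one-room problem."""
        if state == EXIT_STATE:
            # taking the exit pays $100 (simplified: any action here exits)
            return state, 100, True
        nxt = state + MOVES[action]
        # moves off the 3 x 3 grid leave the agent where it is
        if nxt < 0 or nxt > 8 or (action == "E" and state % 3 == 2) \
                or (action == "W" and state % 3 == 0):
            nxt = state
        return nxt, -1, False            # every time-step costs $1

    # Sense-act cycle under a random policy π: S → A
    state, total, done = 0, 0, False
    while not done:
        action = random.choice(list(MOVES))   # π(state)
        state, reward, done = step(state, action)
        total += reward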


Sense-Act Cycle

[Slide figure: the agent-environment loop over the 3 x 3 room - the environment sends the reward and state to the agent; the agent sends an action back.]

A policy is a mapping from states to actions, π: S → A,
e.g. π(1) = E, π(2) = S, π(5) = S, ...

Value Function

The value function is the utility of the current state, in terms of future rewards, given a policy. For one policy π in the one-room problem (exit reward $100, cost $1 per time-step), the state values are:

    $93  $98  $99
    $94  $97  $100
    $95  $96  $95

Optimality Criteria

- Sum of rewards to termination (stochastic shortest path)
- Sum of rewards for the next N time-steps
- Discounted sum of rewards
- Average reward per step

(These are written out formally below.)

Optimal Value Function (* = optimal)

The optimal values V* for the one-room problem:

    $97  $98  $99
    $98  $99  $100
    $97  $98  $99

An Optimal Policy

[Slide figure: an optimal policy drawn over the same 3 x 3 grid of states; it achieves the optimal values above.]

Solution: Value Iteration (see the code sketch below)
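
Before the value-iteration sketch, here are the four optimality criteria above written formally, for a policy π started in state s (standard notation, not from the slides):

    V^{\pi}(s)          = E\left[\sum_{t=0}^{T} r_t\right]                  (sum of rewards to termination at time T)
    V^{\pi}_{N}(s)      = E\left[\sum_{t=0}^{N-1} r_t\right]                (sum of rewards for the next N time-steps)
    V^{\pi}_{\gamma}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],  0 \le \gamma < 1   (discounted sum of rewards)
    \rho^{\pi}(s)       = \lim_{N \to \infty} \tfrac{1}{N}\, E\left[\sum_{t=0}^{N-1} r_t\right]  (average reward per step)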
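
And a minimal value-iteration sketch for the one-room problem (same assumed grid layout and exit state as before; the iteration count is simply "large enough" for this small grid):

    EXIT_STATE = 5
    MOVES = {"N": -3, "S": +3, "E": +1, "W": -1}

    def next_state(s, a):
        """Deterministic move on the 3 x 3 grid; invalid moves stay put."""
        if (a == "E" and s % 3 == 2) or (a == "W" and s % 3 == 0):
            return s
        n = s + MOVES[a]
        return n if 0 <= n <= 8 else s

    V = [0.0] * 9
    for _ in range(100):
        # Bellman optimality backup: V(s) = max_a [ -1 + V(s') ],
        # with the exit state pinned at $100.
        V = [100.0 if s == EXIT_STATE
             else max(-1.0 + V[next_state(s, a)] for a in MOVES)
             for s in range(9)]

    # Greedy (optimal) policy extracted from V*
    policy = {s: max(MOVES, key=lambda a: V[next_state(s, a)])
              for s in range(9) if s != EXIT_STATE}

    print(V)       # -> [97.0, 98.0, 99.0, 98.0, 99.0, 100.0, 97.0, 98.0, 99.0]
    print(policy)  # e.g. π(2) = S, π(4) = E, ... (moves toward the exit)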
Be Careful Defining the Problem!

If the $100 exit reward is the only reward (immediate reward $0 everywhere else, no $1 per time-step cost), then every state has the same undiscounted optimal value:

    $100*  $100*  $100*
    $100*  $100*  $100*        (* = optimal)
    $100*  $100*  $100*

With no step cost, every policy that eventually reaches the exit looks equally good: the values no longer distinguish short paths from long ones.

Discounted Value Function

Discounting repairs this. With the $100 exit reward and $0 elsewhere, the optimal discounted values (here γ = 0.9) are:

    $72.9*  $81*   $90*
    $81*    $90*   $100*
    $72.9*  $81*   $90*

Example History of Reinforcement Learning

- Psychology/biology (e.g. B. F. Skinner's animal training)
- Operations research: dynamic programming (planning over time)
- MENACE (Machine Educable Noughts And Crosses Engine, D. Michie, 1961)
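
A quick check of the discounted grid (γ = 0.9 is inferred from the values shown; the distances assume the same row-major layout with the exit at state 5):

    # With a single $100 reward at the exit and no step cost, the optimal
    # discounted value of a state d steps from the exit is 100 * γ^d.
    GAMMA = 0.9
    dist = {0: 3, 1: 2, 2: 1, 3: 2, 4: 1, 5: 0, 6: 3, 7: 2, 8: 1}
    values = {s: round(100 * GAMMA ** d, 1) for s, d in dist.items()}
    print(values)
    # -> {0: 72.9, 1: 81.0, 2: 90.0, 3: 81.0, 4: 90.0,
    #     5: 100.0, 6: 72.9, 7: 81.0, 8: 90.0}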
Other RL Examples

- Checkers player [Arthur Samuel, 1959, 1967]
- Trial and error learning [Michie, 1961]
- TD-Gammon [Tesauro, 1995]