# 7.1. Reinforcement Learning


Machine Learning and Data Mining, COMP9417

Reinforcement Learning (RL), Session I 2011

LIC: Mike Bain
Guest Lecturer: Bernhard Hengst

## RL Introductory References

- *Artificial Intelligence: A Modern Approach* (Second Edition), Stuart Russell and Peter Norvig (Chapter 21)
- *Machine Learning*, T. Mitchell, 1997 (Chapters 1 and 13)
- *Reinforcement Learning: An Introduction*, R. Sutton and A. G. Barto, 1998 (HTML version linked from Sutton's home page)

## Part 1: Introduction to Reinforcement Learning (RL)

- Intuition
- Agent view of RL
- Simple example introducing concepts
- Markov Decision Problems (MDP)
  - Why Markov?
  - Stochastic problems
  - Infinite horizon problems
  - Exploration vs exploitation
  - Function approximation
  - Unknown model and Q-Learning
- Pole balancing example

Reinforcement learning is about how a machine can learn the best way to act, given future rewards.

## Chapter 1, Machine Learning (T. Mitchell, 1997)

- Task T = checkers
- Performance P = wins
- Representation of board state: $x = (x_1, x_2, \ldots, x_6)$
- Target value function: $V(\text{board}) = \sum_i w_i x_i$
- Update rule: $V(\text{board}) = V(\text{successor}(\text{board}))$
- In any board state, play the move with the best $V(\text{successor}(\text{board}))$

## The Agent View of RL

[Figure: the agent and the environment, connected through sensors and effectors.]

## One Room Problem

- Room states: 0, 1, 2, 3, 4, 5, 6, 7, 8
- Exit reward: \$100
- Cost: \$1 per time-step
- Actions: {N, S, E, W}
- Objective or goal: find a set of actions to maximise reward over time
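The One Room Problem above can be sketched as a small deterministic environment. This is only an illustration under stated assumptions: the slides do not give the room layout, so the code assumes a 3x3 room numbered row by row with the exit reached by moving East from state 5, and it assumes the exiting move earns the \$100 reward while every other move costs \$1. The names `step`, `run_episode`, and `EXIT` are hypothetical.

```python
# Assumed layout (not stated in the slides): a 3x3 room numbered row by row,
#   0 1 2
#   3 4 5
#   6 7 8
# with the exit reached by moving East from state 5.
EXIT = "exit"
MOVES = {"N": -3, "S": 3, "E": 1, "W": -1}


def step(state, action):
    """Return (next_state, reward) for one deterministic move."""
    row, col = divmod(state, 3)
    if state == 5 and action == "E":
        return EXIT, 100  # assumed: the exiting move earns the exit reward
    blocked = (
        (action == "N" and row == 0) or (action == "S" and row == 2)
        or (action == "E" and col == 2) or (action == "W" and col == 0)
    )
    # A move into a wall leaves the agent in place; every move costs $1.
    return (state if blocked else state + MOVES[action]), -1


def run_episode(policy, start, max_steps=50):
    """Follow a policy (dict: state -> action) and sum the rewards."""
    state, total = start, 0
    for _ in range(max_steps):
        state, reward = step(state, policy[state])
        total += reward
        if state == EXIT:
            break
    return total


# One hypothetical policy: head East, then move along the right wall to state 5.
policy = {0: "E", 1: "E", 2: "S", 3: "E", 4: "E", 5: "E", 6: "E", 7: "E", 8: "N"}
total_from_0 = run_episode(policy, start=0)
```

Under these assumptions, the agent starting in state 0 pays \$1 for each of three moves and then collects \$100 on exit.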

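## Sense – Act Cycle

[Figure: the agent sends an action to the environment; the environment returns a reward and a state. Room states 0 to 8.]

Policy $\pi : S \to A$, e.g. $\pi(1) = E$, $\pi(2) = S$, $\pi(5) = S$, ...

## Value Function

The value function is the utility of the current state in terms of future rewards, given a policy.

Values under the policy $\pi$ (exit reward \$100, cost \$1 per step):

| State | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Value | \$93 | \$98 | \$99 | \$94 | \$97 | \$100 | \$95 | \$96 | \$95 |

## Optimality Criteria

- Sum of rewards to termination (Stochastic Shortest Path)
- Sum of rewards for next N time steps
- Discounted sum of rewards
- Average reward per step

## Optimal Value Function

Optimal values (* = optimal), with exit reward \$100 and cost \$1 per step:

| State | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Optimal value | \$97 | \$98 | \$99 | \$98 | \$99 | \$100 | \$97 | \$98 | \$99 |

## An Optimal Policy

[Figure: an optimal policy drawn over room states 0 to 8, shown with the same optimal values as in the table above.]

## Solution: Value Iteration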
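Value iteration, named above as the solution method, can be sketched for the One Room Problem as follows. This is a minimal sketch, not the lecturer's implementation. It assumes a 3x3 room numbered row by row with the exit reached by moving East from state 5, a \$100 reward for the exiting move, and a \$1 cost for every other move. The function names (`step`, `backup`, `value_iteration`, `greedy_policy`) are hypothetical, and the undiscounted sum of rewards to termination is used as the optimality criterion.

```python
# Assumed layout (not stated in the slides): 3x3 room numbered row by row,
# with the exit reached by moving East from state 5.
MOVES = {"N": -3, "S": 3, "E": 1, "W": -1}
STATES = range(9)


def step(state, action):
    """Return (next_state, reward); next_state is None once the agent exits."""
    row, col = divmod(state, 3)
    if state == 5 and action == "E":
        return None, 100.0
    blocked = (
        (action == "N" and row == 0) or (action == "S" and row == 2)
        or (action == "E" and col == 2) or (action == "W" and col == 0)
    )
    return (state if blocked else state + MOVES[action]), -1.0


def backup(values, state, action, gamma):
    """One-step lookahead: immediate reward plus discounted successor value."""
    next_state, reward = step(state, action)
    future = 0.0 if next_state is None else values[next_state]
    return reward + gamma * future


def value_iteration(gamma=1.0, tol=1e-9):
    """Repeat synchronous Bellman backups until the values stop changing."""
    values = [0.0] * 9
    while True:
        new_values = [max(backup(values, s, a, gamma) for a in MOVES) for s in STATES]
        if max(abs(n - o) for n, o in zip(new_values, values)) < tol:
            return new_values
        values = new_values


def greedy_policy(values, gamma=1.0):
    """Pick, in each state, an action with the best one-step lookahead value."""
    return {s: max(MOVES, key=lambda a: backup(values, s, a, gamma)) for s in STATES}


optimal_values = value_iteration()
optimal_policy = greedy_policy(optimal_values)
```

Under these assumptions the computed values agree with the optimal values listed on the slide (\$97 for state 0 up to \$100 for state 5). Several states have more than one optimal action, so the greedy policy returned here is only one of the optimal policies.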
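## Be Careful Defining the Problem!

With \$100 as the only reward and a step cost of \$0, every state has the same optimal value (* = optimal):

| State | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Optimal value | \$100 | \$100 | \$100 | \$100 | \$100 | \$100 | \$100 | \$100 | \$100 |

## Discounted Value Function

Optimal values with exit reward \$100 and a step cost of \$0 (* = optimal):

| State | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Optimal value | \$72.9 | \$81 | \$90 | \$81 | \$90 | \$100 | \$72.9 | \$81 | \$90 |

## History of Reinforcement Learning

- Psychology/biology (e.g. B. F. Skinner, animal training)
- Operations Research: Dynamic Programming (planning over time)
- Example: MENACE (Machine Educable Noughts and Crosses Engine, D. Michie, 1961)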
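The two value functions from the "Be Careful Defining the Problem!" and "Discounted Value Function" slides can be reproduced with a short calculation. The slides do not state the discount factor; the sketch below assumes 0.9, which is the value consistent with the listed figures (\$100, \$90, \$81, \$72.9). It also assumes a 3x3 room numbered row by row with the exit reached from state 5, so that the optimal value of a state is the exit reward discounted once per move needed to reach state 5. The function name `discounted_values` is hypothetical.

```python
def discounted_values(gamma, exit_reward=100.0):
    """Optimal values when the exit reward is the only reward (step cost $0)."""
    values = []
    for state in range(9):
        row, col = divmod(state, 3)
        # Assumed layout: state 5 sits at row 1, column 2, next to the exit.
        moves_to_state_5 = abs(row - 1) + abs(col - 2)
        values.append(exit_reward * gamma ** moves_to_state_5)
    return values


# Without discounting, every state is worth the full exit reward,
# so the agent has no reason to hurry towards the exit.
undiscounted = discounted_values(gamma=1.0)

# With the assumed discount factor of 0.9, states nearer the exit are worth more.
discounted = [round(v, 1) for v in discounted_values(gamma=0.9)]
```

With no step cost and no discounting, all nine values are equal, which is why the problem definition needs either a cost per step or a discount factor before "reach the exit quickly" becomes the optimal behaviour.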

## Other RL Examples

- Checker Player [Arthur Samuel, 1959, 1967]
- Trial and Error [Michie, 1961]
- TD-Gammon [Tesauro, 1995]

## This note was uploaded on 06/20/2011 for the course COMP 9417 taught by Professor Some during the Three '11 term at University of New South Wales.
