# rl-4up: Reinforcement Learning (Machine Learning CS6375, Fall 2010)


Reading: Chapter 21 of R&N; Sections 13.1–13.2 and 13.5 of Mitchell.

## Learning Delayed Rewards

All you can see is a series of states and rewards:

S1 (r=0) → S2 (r=0) → S3 (r=4) → S2 (r=0) → S4 (r=0) → S5 (r=0)

Task: based on this sequence, estimate J\*(S1), J\*(S2), ..., J\*(S6).

## Idea 1: Supervised Learning

Assume γ = 0.5 and the trajectory above. At t=1 we were in state S1 and eventually got a long-term discounted reward (ltdr) of 0 + γ·0 + γ²·4 + γ³·0 + γ⁴·0 + ... = 1. Doing the same at every time step:

| t | state | ltdr |
|---|-------|------|
| 1 | S1    | 1    |
| 2 | S2    | 2    |
| 3 | S3    | 4    |
| 4 | S2    | 0    |
| 5 | S4    | 0    |
| 6 | S5    | 0    |

## Supervised Learning Algorithm

- Watch a trajectory S[0] r[0] S[1] r[1] ... S[T] r[T].
- For t = 0, 1, ..., T, compute the long-term discounted reward J[t] = Σ_{k=t..T} γ^(k−t) r[k].
- Compute J_est(S_i) as the mean of J[t] over all times t at which S_i was visited.
- You're done!

## Online Supervised Learning Algorithm

Initialize, for every state S_i:

- Count[S_i] = 0
- SumJ[S_i] = 0
- Elig[S_i] = 0

Observe: when we experience S_i with reward r, do this:

1. ∀j: Elig[S_j] ← γ · Elig[S_j]
2. Elig[S_i] ← Elig[S_i] + 1
3. ∀j: SumJ[S_j] ← SumJ[S_j] + r × Elig[S_j]
4. Count[S_i] ← Count[S_i] + 1

Then at any time, J_est(S_j) = SumJ[S_j] / Count[S_j].

## Online Supervised Learning: Economics

Given N states S1 ... SN, OSL needs O(N) memory, and each update needs O(N) work, since we must update every element of the Elig[] array.

Idea: be sparse, and only update/process the Elig[] elements whose value exceeds ε, for some tiny ε. Easy to prove: ...

## Online Supervised Learning, Interrogated

Let's grab OSL off the street, bundle it into a black van, take it to a bunker, and interrogate it under 600-watt lights, using the trajectory S1 (r=0) → S2 (r=0) → S3 (r=4) → S2 (r=0) → S4 (r=0) → S5 (r=0).

## Certainty-Equivalent (CE) Learning

Idea: do model-based learning (i.e., use your data to estimate the underlying Markov system) instead of trying to estimate J directly.
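The eligibility-based online update described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the variable names mirror the slide's Count/SumJ/Elig arrays, and the trajectory and γ = 0.5 come from the running example. Run over that trajectory, it reproduces the averaged long-term discounted rewards from the supervised-learning table.

```python
from collections import defaultdict

GAMMA = 0.5
# The example trajectory: (state, reward) pairs.
trajectory = [("S1", 0.0), ("S2", 0.0), ("S3", 4.0),
              ("S2", 0.0), ("S4", 0.0), ("S5", 0.0)]

count = defaultdict(int)    # Count[Si]: # times Si was visited
sum_j = defaultdict(float)  # SumJ[Si]: accumulated discounted reward credit
elig  = defaultdict(float)  # Elig[Si]: eligibility of Si

for state, reward in trajectory:
    # Decay every eligibility by gamma, then bump the current state's.
    for s in list(elig):
        elig[s] *= GAMMA
    elig[state] += 1.0
    # Credit this reward to every state in proportion to its eligibility.
    for s, e in elig.items():
        sum_j[s] += reward * e
    count[state] += 1

# At any time, J_est(Sj) = SumJ[Sj] / Count[Sj].
j_est = {s: sum_j[s] / count[s] for s in count}
# j_est → {"S1": 1.0, "S2": 1.0, "S3": 4.0, "S4": 0.0, "S5": 0.0}
```

Note that S2's estimate is the mean of its two visits, (2 + 0) / 2 = 1, matching the batch supervised-learning answer.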
Using the same trajectory S1 (r=0) → S2 (r=0) → S3 (r=4) → S2 (r=0) → S4 (r=0) → S5 (r=0), draw in the estimated Markov system: the transitions and their probabilities. What are the estimated J values?

## C.E. Method for Markov Systems

Initialize, for all states S_i and S_j:

- Count[S_i] = 0 (# times S_i visited)
- SumR[S_i] = 0 (sum of rewards received in S_i)
- Trans[S_i, S_j] = 0 (# times transitioned from S_i to S_j) ...

## This note was uploaded on 11/03/2010 for the course CS6375, taught by Professor Vincent Ng during the Fall '10 term at the University of Texas at Dallas, Richardson.

