# 383-Fall11-Lec22-addendum: Reinforcement Learning for HW 5


1 CMPSCI 383, Nov 29, 2011: Reinforcement Learning for HW 5

2 Today's lecture
- Active agents
- The exploration/exploitation dilemma
- Q-Learning
3 Active RL Agents
[Diagram: the agent's experience is used to build a utility function (U, Q) and a policy (π); the utility function produces predictions, and the policy selects actions.]

4 Interaction of policy and utility
[Diagram: a cycle between the policy π and the utility function U, Q. Policy evaluation (utility learning) maps π to U, Q; policy improvement ("greedification") maps U, Q back to π.]
5 What is Q?
Q(s, a) is the action-value function: the utility of doing action a in state s, i.e., the total amount of reward expected over the future if you do action a in state s and thereafter select optimal actions.
The utility of a state is the utility of doing the best action from that state:
    U(s) = max_a Q(s, a)
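The identity U(s) = max_a Q(s, a) is easy to see with a tiny tabular example. This is a sketch, not part of the lecture; the state and action names are hypothetical.

```python
# Hypothetical Q-table: maps (state, action) pairs to estimated utilities.
Q = {
    ("s0", "left"):  1.0,
    ("s0", "right"): 2.5,
    ("s0", "stay"):  0.5,
}

def utility(Q, s):
    """U(s) = max_a Q(s, a): a state is worth its best action's value."""
    return max(q for (state, _action), q in Q.items() if state == s)

print(utility(Q, "s0"))  # 2.5
```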

6 Learning an action-value function
Q-Learning directly assigns a Q-value Q(s, a) to each (state, action) pair. With Q in hand, you don't need to learn transition probabilities to decide on the best action:
    π*(s) = argmax_a Q(s, a)
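The argmax above can be sketched directly in code. A minimal version, assuming a tabular Q stored as a dict (names hypothetical):

```python
def greedy_policy(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a).
    Note that no transition model P(s'|s, a) is consulted: the Q-table
    alone determines the greedy action."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {("s0", "left"): 1.0, ("s0", "right"): 2.5}
print(greedy_policy(Q, "s0", ["left", "right"]))  # right
```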
7 Bellman Equation for Q functions
Recall the Bellman equation for U:
    U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U(s')
The analogous equation for Q takes the max over the next action a':
    Q(s, a) = R(s) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a')
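The standard tabular Q-learning update moves Q(s, a) toward the Bellman target using a single sampled transition instead of the full expectation over P(s'|s, a). The sketch below also includes an ε-greedy action chooser for the exploration/exploitation dilemma mentioned earlier; function names and the α, γ, ε values are illustrative, not from the lecture.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step toward the Bellman target:
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)].
    The sampled next state s_next stands in for the expectation over
    P(s'|s, a), so no transition model is needed."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Exploration/exploitation: with probability eps take a random
    action (explore), otherwise take the greedy action (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {}
q_update(Q, "s0", "a", 1.0, "s1", ["a", "b"])
print(Q[("s0", "a")])  # 0.1
```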
