Assume that for all s ∈ S and a ∈ A, ∑_{t=0}^∞ α_t(s, a) = ∞ and ∑_{t=0}^∞ α_t²(s, a) < ∞ (the standard conditions on the learning rates for the convergence of Q-learning).
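For instance, a learning rate that decays with the visit count meets both conditions, provided every pair (s, a) is visited infinitely often. The schedule below is an illustrative choice, not one stated in this excerpt; n_t(s, a) denotes the number of times (s, a) has been updated before time t, so summed over the update times the two series are the (divergent) harmonic series and a (convergent) p-series with p = 2:

\[
  \alpha_t(s, a) = \frac{1}{1 + n_t(s, a)} : \qquad
  \sum_{k \ge 0} \frac{1}{1 + k} = \infty, \qquad
  \sum_{k \ge 0} \frac{1}{(1 + k)^2} = \frac{\pi^2}{6} < \infty .
\]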

• sample the new state s′.
• update: V(s) ← (1 − α)V(s) + α[r(s, π(s)) + γ V(s′)]
          = V(s) + α[r(s, π(s)) + γ V(s′) − V(s)],
  where r(s, π(s)) + γ V(s′) − V(s) is the temporal difference of V values.

TD(0) Algorithm

TD(0)()
    V ← V0                              ▷ initialization.
    for t ← 0 to T do
        s ← SelectState()
        for each step of epoch t do
            r′ ← Reward(s, π(s))
            s′ ← NextState(π, s)
            V(s) ← (1 − α)V(s) + α[r′ + γ V(s′)]
            s ← s′
    return V

Q-Learning Algorithm

Idea: assume deterministic rewards.

    Q*(s, a) = E[r(s, a)] + γ ∑_{s′∈S} Pr[s′ | s, a] V*(s′)
             = E_{s′}[r(s, a) + γ max_{a′∈A} Q*(s′, a′)]

Algorithm: α ∈ [0, 1] depends on the number of visits.
• sample the new state s′.
• update: Q(s, a) ← (1 − α)Q(s, a) + α[r(s, a) + γ max_{a′∈A} Q(s′, a′)].

Q-Learning Algorithm (Watkins, 1989; Watkins and Dayan, 1992)

Q-Learning(π)
    Q ← Q0                              ▷ initialization, e.g., Q0 = 0.
    for t ← 0 to T do
        s ← SelectState()
        for each step of epoch t do
            a ← SelectAction(π, s)      ▷ policy π derived from Q, e.g., ε-greedy.
            r′ ← Reward(s, a)
            s′ ← NextState(s, a)
            Q(s, a) ← Q(s, a) + α[r′ + γ max_{a′} Q(s′, a′) − Q(s, a)]
            s ← s′
    return Q

Notes

Q-learning can be viewed as a stochastic formulation of the value iteration algorithm.
It converges for any policy, so long as all states and actions are visited infinitely often.
How should the action be chosen at each iteration? Maximize the reward? Explore other actions?
Q-learning is an off-policy method: no control over the policy is required.

Policies

Epsilon-greedy strategy: with probabilit...
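The two loops above translate almost directly into code. Below is a minimal Python sketch of TD(0) policy evaluation; the environment interface (reset/step) and the constant learning rate α are illustrative assumptions, not definitions from the slides.

from collections import defaultdict

def td0_policy_evaluation(env, policy, num_epochs=500, steps_per_epoch=100,
                          alpha=0.1, gamma=0.99):
    # Tabular TD(0) evaluation of the fixed policy `policy`.
    # Assumed (illustrative) environment interface, not from the slides:
    #   env.reset() -> s              (SelectState)
    #   env.step(s, a) -> (r, s_next) (Reward and NextState)
    V = defaultdict(float)  # V0 = 0 for every state
    for _ in range(num_epochs):
        s = env.reset()
        for _ in range(steps_per_epoch):
            a = policy(s)                        # a = pi(s)
            r, s_next = env.step(s, a)
            # V(s) <- (1 - alpha) V(s) + alpha [r + gamma V(s')]
            #       = V(s) + alpha * (temporal difference)
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return dict(V)

And a matching sketch of tabular Q-learning with an ε-greedy behavior policy, connecting the Q-Learning(π) pseudocode above to the exploration question raised in the Notes; again the environment interface and the constant α are assumptions (the slides let α depend on the number of visits to (s, a)).

import random
from collections import defaultdict

def q_learning(env, num_epochs=500, steps_per_epoch=100,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning with an epsilon-greedy behavior policy.
    # Assumed (illustrative) environment interface, not from the slides:
    #   env.actions                   (finite list of actions)
    #   env.reset() -> s
    #   env.step(s, a) -> (r, s_next)
    Q = defaultdict(float)  # Q0 = 0 for every (s, a) pair

    def select_action(s):
        # epsilon-greedy: explore with probability epsilon,
        # otherwise act greedily with respect to the current Q.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_epochs):
        s = env.reset()
        for _ in range(steps_per_epoch):
            a = select_action(s)
            r, s_next = env.step(s, a)
            # TD target: r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            # Q(s, a) <- Q(s, a) + alpha * [target - Q(s, a)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return dict(Q)

Replacing the constant α with a per-pair schedule such as 1/(1 + n_t(s, a)) would match the slides' remark that α depends on the number of visits, as well as the convergence conditions quoted at the top of this excerpt.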