• ε-greedy strategy: with probability 1 − ε, select the greedy action from s; with probability ε, select a random action.
• Epoch-dependent strategy (Boltzmann exploration):

    p_t(a | s, Q) = exp(Q(s, a)/τ_t) / Σ_{a′ ∈ A} exp(Q(s, a′)/τ_t)

  • τ_t → 0: greedy selection.
  • larger τ_t: selection closer to uniformly random.
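The two exploration rules above can be sketched in Python. This is a minimal illustration, not from the lecture: it assumes a tabular Q stored as a NumPy array indexed by [state, action], and the function names and `rng` argument are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability 1 - epsilon take the greedy action, else a random one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[s]))               # exploit: greedy action from s

def boltzmann(Q, s, tau, rng):
    """Sample an action with probability proportional to exp(Q(s, a)/tau)."""
    z = Q[s] / tau
    z = z - z.max()                           # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

As τ_t is annealed toward 0, the softmax concentrates on the argmax, matching the greedy limit listed above; a large τ_t flattens it toward the uniform distribution.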
Mehryar Mohri - Foundations of Machine Learning, page 47

Convergence of Q-Learning
Theorem: consider a finite MDP. Assume that for all s ∈ S and a ∈ A,

    Σ_{t=0}^∞ α_t(s, a) = ∞   and   Σ_{t=0}^∞ α_t²(s, a) < ∞,

with α_t(s, a) ∈ [0, 1]. Then, the Q-learning algorithm converges to the optimal value Q* (with probability one).

• note: these conditions on α_t(s, a) impose that each state-action pair is visited infinitely many times.

SARSA: On-Policy Algorithm
1   Q ← Q0                            ▷ initialization, e.g., Q0 = 0.
2   for t ← 0 to T do
3       s ← SelectState()
4       a ← SelectAction(π(Q), s)     ▷ policy π derived from Q, e.g., ε-greedy.
5       for each step of epoch t do
6           r ← Reward(s, a)
7           s′ ← NextState(s, a)
8           a′ ← SelectAction(π(Q), s′)   ▷ policy π derived from Q, e.g., ε-greedy.
9           Q(s, a) ← Q(s, a) + α_t(s, a) [r + γ Q(s′, a′) − Q(s, a)]
10          s ← s′
11          a ← a′
12  return Q
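The pseudocode above can be turned into a runnable sketch. This is not from the lecture: the `step` environment interface, the epoch counts, and the per-pair schedule α_t(s, a) = 1/n(s, a) (which satisfies the divergence and square-summability conditions of the preceding theorem) are all illustrative assumptions.

```python
import numpy as np

def sarsa(n_states, n_actions, step, start_state,
          n_epochs=500, epoch_len=50, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA with epsilon-greedy exploration.

    step(s, a) -> (r, s2) is an assumed environment interface.
    Learning rate alpha_t(s, a) = 1 / n(s, a), where n counts visits,
    so the sums of alpha diverge while the sums of alpha^2 converge.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # Q0 = 0
    visits = np.zeros((n_states, n_actions))  # n(s, a) visit counts

    def policy(s):                            # epsilon-greedy policy derived from Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_epochs):
        s = start_state
        a = policy(s)
        for _ in range(epoch_len):
            r, s2 = step(s, a)
            a2 = policy(s2)                   # on-policy: next action from the same policy
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
            s, a = s2, a2
    return Q
```

Note the on-policy distinction: the update bootstraps from Q(s′, a′) for the action a′ actually selected by the current policy, whereas Q-learning would instead use max over a′ of Q(s′, a′) regardless of the action taken.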