# Lecture 11: Reinforcement Learning

Backgammon: large number of positions; 30 pieces, 24-26 possible locations.


## Exploration Policies

- ε-greedy: with probability 1 − ε, take the greedy action from s; with probability ε, take a random action.
- Epoch-dependent strategy (Boltzmann exploration):

  $$p_t(a \mid s, Q) = \frac{e^{Q(s,a)/\tau_t}}{\sum_{a' \in A} e^{Q(s,a')/\tau_t}}$$

  - τ_t → 0: greedy selection.
  - larger τ_t: closer to a random action.

## Convergence of Q-Learning

Theorem: consider a finite MDP. Assume that for all s ∈ S and a ∈ A,

$$\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty,$$

with α_t(s, a) ∈ [0, 1]. Then, the Q-learning algorithm converges to the optimal value Q* (with probability one).

Note: the conditions on α_t(s, a) impose that each state-action pair is visited infinitely many times.

## SARSA: On-Policy Algorithm

```
SARSA(π)
    Q ← Q0                              ▷ initialization, e.g., Q0 = 0
    for t ← 0 to T do
        s ← SelectState()
        a ← SelectAction(π(Q), s)       ▷ policy π derived from Q, e.g., ε-greedy
        for each step of epoch t do
            r′ ← Reward(s, a)
            s′ ← NextState(s, a)
            a′ ← SelectAction(π(Q), s′) ▷ policy π derived from Q, e.g., ε-greedy
            Q(s, a) ← Q(s, a) + α_t(s, a) [r′ + γ Q(s′, a′) − Q(s, a)]
            s ← s′
            a ← a′
    return Q
```

(Mehryar Mohri, Foundations of Machine Learning, pages 47-49.)
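The two exploration strategies above can be sketched in Python. This is an illustrative sketch, not code from the slides: the function names and the dictionary representation of Q (keyed by state-action pairs) are assumptions made for the example.

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability 1 - epsilon pick the greedy action in state s,
    otherwise pick a uniformly random action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann_probs(Q, s, actions, tau):
    """Boltzmann exploration: p(a|s,Q) = exp(Q(s,a)/tau) / sum_a' exp(Q(s,a')/tau).
    As tau -> 0 this approaches greedy selection; large tau approaches uniform."""
    m = max(Q[(s, a)] for a in actions)  # subtract the max for numerical stability
    weights = [math.exp((Q[(s, a)] - m) / tau) for a in actions]
    total = sum(weights)
    return [w / total for w in weights]
```

Subtracting the maximum Q-value before exponentiating leaves the probabilities unchanged but avoids overflow when τ_t is small.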
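The SARSA pseudocode above can be made runnable on a small example. In this sketch the `sarsa` function, the dictionary Q-table, and the two-state toy MDP are all invented for illustration; the learning rate α_t(s, a) = 1 / (number of visits to (s, a)) is one schedule whose sums satisfy the conditions of the convergence theorem.

```python
import random

def sarsa(states, actions, reward, next_state,
          epsilon=0.1, gamma=0.9, episodes=1000, steps=20):
    """On-policy SARSA with an epsilon-greedy policy derived from Q."""
    Q = {(s, a): 0.0 for s in states for a in actions}       # Q0 = 0
    visits = {(s, a): 0 for s in states for a in actions}

    def select_action(s):  # policy pi derived from Q, here epsilon-greedy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = random.choice(states)               # SelectState()
        a = select_action(s)
        for _ in range(steps):                  # each step of epoch t
            r = reward(s, a)
            s2 = next_state(s, a)
            a2 = select_action(s2)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]        # alpha_t(s, a) = 1 / visit count
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```

Unlike Q-learning, the update bootstraps from Q(s′, a′) for the action a′ the behavior policy actually selects, which is what makes SARSA on-policy.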