• ε-greedy strategy: with probability 1 − ε, take the greedy action from s; with probability ε, take a random action.

• Epoch-dependent strategy (Boltzmann exploration):

      p_t(a | s, Q) = exp(Q(s, a)/τ_t) / Σ_{a'∈A} exp(Q(s, a')/τ_t),

  • τ_t → 0: greedy selection.
  • larger τ_t: closer to a uniformly random action.

Convergence of Q-Learning

Theorem: consider a finite MDP. Assume that for all s ∈ S and a ∈ A,
Σ_{t=0}^∞ α_t(s, a) = ∞ and Σ_{t=0}^∞ α_t(s, a)² < ∞, with α_t(s, a) ∈ [0, 1].
Then, the Q-learning algorithm converges to the optimal value Q* (with probability one).

• Note: the conditions on α_t(s, a) impose that each state-action pair is visited infinitely many times.

SARSA: On-Policy Algorithm

SARSA(π)
    Q ← Q_0                                 ▷ initialization, e.g., Q_0 = 0.
    for t ← 0 to T do
        s ← SelectState()
        a ← SelectAction(π(Q), s)           ▷ policy π derived from Q, e.g., ε-greedy.
        for each step of epoch t do
            r' ← Reward(s, a)
            s' ← NextState(s, a)
            a' ← SelectAction(π(Q), s')     ▷ policy π derived from Q, e.g., ε-greedy.
            Q(s, a) ← Q(s, a) + α_t(s, a) [r' + γ Q(s', a') − Q(s, a)]
            s ← s'
            a ← a'
    return Q
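The pseudocode above leaves the environment and the learning-rate schedule abstract. As a concrete illustration, here is a minimal runnable sketch of tabular SARSA in Python together with the two exploration strategies; the environment interface (reset() and step(s, a) returning (reward, next state, done)), the constant step size alpha, and the episode loop are illustrative assumptions, not part of the slides.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability 1 - epsilon pick the greedy action for s, else a random one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, tau, rng):
    """Boltzmann (softmax) exploration: p(a | s) proportional to exp(Q(s, a) / tau)."""
    prefs = Q[s] / tau
    prefs -= prefs.max()                         # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def sarsa(env, n_states, n_actions, n_epochs=500,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA with an epsilon-greedy behaviour policy.

    `env` is assumed to expose reset() -> s and step(s, a) -> (r, s_next, done);
    these names are hypothetical, chosen only to make the sketch self-contained.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # Q_0 = 0
    for _ in range(n_epochs):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:                          # each step of the epoch
            r, s_next, done = env.step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # on-policy update: bootstrap with the action a' actually selected next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
```

The on-policy character shows up in the update line: SARSA bootstraps with Q(s', a') for the action a' the behaviour policy actually chose, whereas Q-learning would instead use max over a' of Q(s', a') regardless of which action is taken next.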