Temporal Difference of V Values (Mehryar Mohri, Foundations of Machine Learning)

Supermartingale Convergence

Theorem: let $X_t$, $Y_t$, $Z_t$ be non-negative random variables such that $\sum_{t=0}^{\infty} Y_t < \infty$. If $\mathrm{E}[X_{t+1} \mid \mathcal{F}_t] \le X_t + Y_t - Z_t$, then:

• $X_t$ converges to a limit (with probability one);
• $\sum_{t=0}^{\infty} Z_t < \infty$.

Convergence Analysis

Convergence of the iteration $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$, with history $\mathcal{F}_t$ defined by $\mathcal{F}_t = \{(x_{t'})_{t' \le t}, (\alpha_{t'})_{t' \le t}, (w_{t'})_{t' < t}\}$.

Theorem: let $\Psi \colon x \mapsto \frac{1}{2}\|x - x^*\|_2^2$ for some $x^*$, and assume that:

• $\exists K_1, K_2 \colon \mathrm{E}[\|D(x_t, w_t)\|_2^2 \mid \mathcal{F}_t] \le K_1 + K_2 \Psi(x_t)$;
• $\exists c \colon \nabla\Psi(x_t)^\top \mathrm{E}[D(x_t, w_t) \mid \mathcal{F}_t] \le -c\,\Psi(x_t)$;
• $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$.

Then, $x_t \xrightarrow{\text{a.s.}} x^*$. (A numerical sketch of these conditions is given at the end of this section.)

Proof: since $\Psi$ is a quadratic function, its second-order Taylor expansion is exact:

$$\Psi(x_{t+1}) = \Psi(x_t) + \nabla\Psi(x_t)^\top (x_{t+1} - x_t) + \frac{1}{2}(x_{t+1} - x_t)^\top \nabla^2\Psi(x_t)(x_{t+1} - x_t).$$

Thus, using $x_{t+1} - x_t = \alpha_t D(x_t, w_t)$ and $\nabla^2\Psi = I$,

$$\mathrm{E}[\Psi(x_{t+1}) \mid \mathcal{F}_t] = \Psi(x_t) + \alpha_t \nabla\Psi(x_t)^\top \mathrm{E}[D(x_t, w_t) \mid \mathcal{F}_t] + \frac{\alpha_t^2}{2}\,\mathrm{E}[\|D(x_t, w_t)\|_2^2 \mid \mathcal{F}_t]$$
$$\le \Psi(x_t) - \alpha_t c\,\Psi(x_t) + \frac{\alpha_t^2}{2}(K_1 + K_2 \Psi(x_t))$$
$$= \Psi(x_t) - \Big(\alpha_t c - \frac{\alpha_t^2 K_2}{2}\Big)\Psi(x_t) + \frac{\alpha_t^2 K_1}{2},$$

where the coefficient $\alpha_t c - \frac{\alpha_t^2 K_2}{2}$ is non-negative for large $t$ (since $\alpha_t \to 0$). By the supermartingale convergence theorem, applied with $Y_t = \frac{\alpha_t^2 K_1}{2}$ (summable, since $\sum_t \alpha_t^2 < \infty$) and $Z_t = \big(\alpha_t c - \frac{\alpha_t^2 K_2}{2}\big)\Psi(x_t)$, the sequence $\Psi(x_t)$ converges and $\sum_{t=0}^{\infty} \big(\alpha_t c - \frac{\alpha_t^2 K_2}{2}\big)\Psi(x_t) < \infty$. Since $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, a non-zero limit of $\Psi(x_t)$ would make this sum diverge; hence $\Psi(x_t)$ must converge to 0.

TD(0) Algorithm

Idea: recall Bellman's linear equations giving $V_\pi$:

$$V_\pi(s) = \mathrm{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s') = \mathop{\mathrm{E}}_{s'}\big[r(s, \pi(s)) + \gamma V_\pi(s') \mid s\big].$$

Algorithm: temporal difference (TD).

• sample a new state $s'$;
• update: $V(s) \leftarrow (1 - \alpha)\,V(s) + \alpha\,\big[r(s, \pi(s)) + \gamma V(s')\big]$, where $\alpha$ depends on the number of visits of $s$.
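The TD(0) update translates directly into code. The sketch below is a minimal illustration, not the slides' own implementation: the environment interface `step(s, a)`, the `policy` function, and the per-state step size $\alpha = 1/n(s)$ are assumptions introduced here (the slides only say that $\alpha$ depends on the number of visits of $s$).

```python
from collections import defaultdict

def td0_evaluate(step, policy, start_state, gamma=0.9, num_steps=100_000):
    """TD(0) policy evaluation, as a sketch.

    `step(s, a)` returns a sampled (reward, next_state) pair and `policy(s)`
    returns the action pi(s); both interfaces are assumptions, not from the
    slides.
    """
    V = defaultdict(float)   # value estimates; V(s) = 0 initially
    n = defaultdict(int)     # visit counts, used to set the step size
    s = start_state
    for _ in range(num_steps):
        a = policy(s)
        r, s_next = step(s, a)   # sample reward and new state s'
        n[s] += 1
        alpha = 1.0 / n[s]       # alpha depends on the number of visits of s
        # TD(0) update: V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s'))
        V[s] = (1.0 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
        s = s_next
    return dict(V)
```

For example, on the deterministic two-state chain `step = lambda s, a: (1.0 if s == 0 else 0.0, 1 - s)` with any fixed policy and `gamma=0.9`, the estimates approach the Bellman solution $V(0) = 1/(1 - 0.81) \approx 5.26$, $V(1) \approx 4.74$.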
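Finally, here is the numerical sketch of the convergence theorem referenced above, with hypothetical choices not from the slides: the direction $D(x, w) = -(x - x^*) + w$ with zero-mean Gaussian noise $w$ of scale $\sigma$ in dimension $d$ satisfies the first two conditions with $K_1 = d\sigma^2$, $K_2 = 2$, and $c = 2$ (since $\mathrm{E}[D \mid \mathcal{F}_t] = -(x_t - x^*) = -\nabla\Psi(x_t)$), and $\alpha_t = 1/(t+1)$ satisfies the step-size conditions.

```python
import numpy as np

# Minimal sketch (assumed setup, not from the slides): the direction
# D(x, w) = -(x - x_star) + w drives x_t toward x_star in expectation,
# and alpha_t = 1/(t+1) gives sum alpha_t = inf, sum alpha_t^2 < inf.
rng = np.random.default_rng(0)

x_star = np.array([1.0, -2.0])   # the fixed point x* of the iteration
x = np.zeros(2)                  # x_0
sigma = 0.5                      # noise scale

for t in range(200_000):
    alpha = 1.0 / (t + 1)                 # Robbins-Monro step sizes
    w = rng.normal(0.0, sigma, size=2)    # zero-mean noise, independent of the past
    x = x + alpha * (-(x - x_star) + w)   # x_{t+1} = x_t + alpha_t D(x_t, w_t)

print(np.round(x, 3))            # approximately x_star, i.e. Psi(x_t) -> 0
```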