lecture_11

# PI Algorithm: Convergence and Best Known Upper Bound (Mehryar Mohri)


1. … ▷ $\pi_0$ arbitrary policy
2. $\pi' \leftarrow \text{nil}$
3. while ($\pi \neq \pi'$) do
4. $V \leftarrow V_\pi$ ▷ policy evaluation: solve $(I - \gamma P_\pi) V = R_\pi$.
5. $\pi' \leftarrow \pi$
6. $\pi \leftarrow \operatorname{argmax}_{\pi} \{R_\pi + \gamma P_\pi V\}$ ▷ greedy policy improvement.
7. return $\pi$

Mehryar Mohri - Foundations of Machine Learning page 24

## PI Algorithm - Convergence

Theorem: let $(V_n)_{n \in \mathbb{N}}$ be the sequence of policy values computed by the algorithm, then,

$$V_n \leq V_{n+1} \leq V^*.$$

Proof: let $\pi_{n+1}$ be the policy improvement at the $n$th iteration, then, by definition,

$$R_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} V_n \geq R_{\pi_n} + \gamma P_{\pi_n} V_n = V_n.$$

- therefore, $R_{\pi_{n+1}} \geq (I - \gamma P_{\pi_{n+1}}) V_n$.
- note that $(I - \gamma P_{\pi_{n+1}})^{-1}$ preserves ordering:

$$X \geq 0 \;\Rightarrow\; (I - \gamma P_{\pi_{n+1}})^{-1} X = \sum_{k=0}^{\infty} (\gamma P_{\pi_{n+1}})^k X \geq 0.$$

- thus, $V_{n+1} = (I - \gamma P_{\pi_{n+1}})^{-1} R_{\pi_{n+1}} \geq V_n$.

## Notes

- Two consecutive policy values can be equal only at last iteration.
- The total number of possible policies is $|A|^{|S|}$, thus, this is the maximal possible number of iterations.
  - best upper bound known $O\!\left(\frac{|A|^{|S|}}{|S|}\right)$.

## PI Algorithm - Example

[Figure: transition diagram over states 1 and 2 with edge labels a/[3/4, 2], a/[1/4, 2], b/[1, 2], c/[1, 2], d/[1, 3].]

Initial policy: $\pi_0(1) = b$, $\pi_0(2) = c$.

Evaluation:

$$V_{\pi_0}(1) = 1 + \gamma V_{\pi_0}(2), \qquad V_{\pi_0}(2) = 2 + \gamma V_{\pi_0}(2).$$

Thus, $V_{\pi_0}(1) = \dfrac{1 + \gamma}{1 - \gamma}$ and $V_{\pi_0}(2) = \dfrac{2}{1 - \gamma}$.

## VI and PI Algorithms - Comparison

Theorem: let $(U_n)_{n \in \mathbb{N}}$ be the sequence of policy values generated by the VI algorithm, and $(V_n)_{n \in \mathbb{N}}$ the one generated by the PI algorithm. If $U_0 = V_0$, then,

$$\forall n \in \mathbb{N}, \quad U_n \leq V_n \leq V^*.$$

Proof: we first show that $\Phi$ is monotonic. Let $U$ and $V$ be such that $U \leq V$ and let $\pi$ be the policy such that $\Phi(U) = R_\pi + \gamma P_\pi U$. Then,

$$\Phi(U) \leq R_\pi + \gamma P_\pi V \leq \max_{\pi'} \{R_{\pi'} + \gamma P_{\pi'} V\} = \Phi(V).$$

VI and PI A...
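The policy iteration steps and the two-state example above can be tried out with a short Python sketch. It is only an illustration under stated assumptions: the encoding of the MDP (`P` and `R` as per-state dictionaries keyed by action), the function names, the choice $\gamma = 0.5$, and the initialization of the policy to $\pi_0$ are not taken from the slides. The figure's edge labels are read here as action/[probability, reward], which is a guess. The rewards used for actions b and c follow the evaluation equations (1 and 2), not the figure labels.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(P, R, policy, gamma):
    """Policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    n = len(policy)
    A = [[(1.0 if s == t else 0.0) - gamma * P[s][policy[s]][t]
          for t in range(n)] for s in range(n)]
    b = [R[s][policy[s]] for s in range(n)]
    return solve(A, b)


def policy_iteration(P, R, policy0, gamma):
    """Alternate evaluation and greedy improvement until the policy repeats."""
    policy, previous = list(policy0), None  # assumed start: pi <- pi_0
    while policy != previous:
        V = evaluate(P, R, policy, gamma)
        previous = policy
        # Greedy improvement: per state, pick the action maximizing
        # R(s, a) + gamma * sum_t P(t | s, a) * V(t).
        policy = [
            max(P[s], key=lambda a: R[s][a]
                + gamma * sum(p * v for p, v in zip(P[s][a], V)))
            for s in range(len(P))
        ]
    return policy, V


# Hypothetical encoding of the two-state example (index 0 is state 1,
# index 1 is state 2). P[s][a] lists next-state probabilities.
P = [
    {"a": [0.75, 0.25], "b": [0.0, 1.0]},
    {"c": [0.0, 1.0], "d": [1.0, 0.0]},
]
R = [
    {"a": 2.0, "b": 1.0},  # reward 1 for b, as in the evaluation equation
    {"c": 2.0, "d": 3.0},
]
GAMMA = 0.5  # assumed discount factor

if __name__ == "__main__":
    V0 = evaluate(P, R, ["b", "c"], GAMMA)
    print("V_pi0 =", V0)  # compare with (1 + g) / (1 - g) and 2 / (1 - g)
    pi, V = policy_iteration(P, R, ["b", "c"], GAMMA)
    print("final policy =", pi, "values =", V)
```

With $\gamma = 0.5$ the closed forms give $V_{\pi_0}(1) = 3$ and $V_{\pi_0}(2) = 4$, which the evaluation step should match. Ties in the greedy step are broken by dictionary order, so the same value vector always yields the same policy and the loop's stopping test is well defined.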

## This note was uploaded on 07/12/2012 for the course CSCI GA.2566-00 taught by Professor Mohri during the Spring '12 term at NYU.
