PI Algorithm

    π ← π0                              ▷ π0 arbitrary policy
    π' ← nil
    while (π ≠ π') do
        V ← V_π                         ▷ policy evaluation: solve (I − γP_π)V = R_π.
        π' ← π
        π ← argmax_π {R_π + γP_π V}     ▷ greedy policy improvement.
    return π

Mehryar Mohri - Foundations of Machine Learning page 24

PI Algorithm - Convergence

Theorem: let $(V_n)_{n \in \mathbb{N}}$ be the sequence of policy values computed by the algorithm. Then,
$$V_n \le V_{n+1} \le V^*.$$

Proof: let $\pi_{n+1}$ be the policy improvement at the $n$th iteration. Then, by definition,
$$R_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} V_n \ge R_{\pi_n} + \gamma P_{\pi_n} V_n = V_n.$$

• Therefore, $R_{\pi_{n+1}} \ge (I - \gamma P_{\pi_{n+1}}) V_n$.
• Note that $(I - \gamma P_{\pi_{n+1}})^{-1}$ preserves ordering: since $\gamma < 1$ and $P_{\pi_{n+1}}$ is a stochastic matrix with nonnegative entries, the Neumann series converges and
$$X \ge 0 \Rightarrow (I - \gamma P_{\pi_{n+1}})^{-1} X = \sum_{k=0}^{\infty} (\gamma P_{\pi_{n+1}})^k X \ge 0.$$
• Thus, applying this to $X = R_{\pi_{n+1}} - (I - \gamma P_{\pi_{n+1}}) V_n \ge 0$ gives $V_{n+1} = (I - \gamma P_{\pi_{n+1}})^{-1} R_{\pi_{n+1}} \ge V_n$. The upper bound $V_{n+1} \le V^*$ holds because $V^*$ dominates the value of every policy.

Notes

• Two consecutive policy values can be equal only at the last iteration.
• The total number of possible policies is $|A|^{|S|}$. Since the values increase strictly until the last iteration, no policy is visited twice, so this is the maximal possible number of iterations.
• Best upper bound known: $O\left(\frac{|A|^{|S|}}{|S|}\right)$.

PI Algorithm - Example

Figure: two-state MDP with edges labeled action/[transition probability, reward]; state 1: a/[3/4, 2], a/[1/4, 2], b/[1, 2]; state 2: c/[1, 2], d/[1, 3].

Initial policy: $\pi_0(1) = b$, $\pi_0(2) = c$.

Evaluation:
$$V_{\pi_0}(1) = 2 + \gamma V_{\pi_0}(2), \qquad V_{\pi_0}(2) = 2 + \gamma V_{\pi_0}(2).$$
Thus, $V_{\pi_0}(2) = \frac{2}{1-\gamma}$ and $V_{\pi_0}(1) = 2 + \frac{2\gamma}{1-\gamma} = \frac{2}{1-\gamma}$.

VI and PI Algorithms - Comparison

Theorem: let $(U_n)_{n \in \mathbb{N}}$ be the sequence of policy values generated by the VI algorithm, and $(V_n)_{n \in \mathbb{N}}$ the one generated by the PI algorithm. If $U_0 = V_0$, then
$$\forall n \in \mathbb{N}, \quad U_n \le V_n \le V^*.$$

Proof: we first show that $\Phi$ is monotonic. Let $U$ and $V$ be such that $U \le V$, and let $\pi$ be the policy such that $\Phi(U) = R_\pi + \gamma P_\pi U$. Since $P_\pi$ has nonnegative entries, $P_\pi U \le P_\pi V$, and therefore
$$\Phi(U) \le R_\pi + \gamma P_\pi V \le \max_{\pi'} \{R_{\pi'} + \gamma P_{\pi'} V\} = \Phi(V).$$

VI and PI A...
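The PI pseudocode and the two-state example can be tied together in a short Python sketch. It is a minimal illustration under stated assumptions, not the author's implementation: the `MDP` dictionary encoding, the helper names (`solve`, `evaluate`, `greedy`, `policy_iteration`), the discount factor γ = 1/2, the alphabetical tie-breaking in the argmax, and the edge directions read off the figure (a's 3/4 branch loops on state 1 while its 1/4 branch and b lead to state 2, c loops on state 2, d leads back to state 1) are all assumptions. Policy evaluation solves $(I - \gamma P_\pi)V = R_\pi$ exactly with `fractions.Fraction`, so the values print as rationals.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical encoding of the two-state example (an assumption, not the author's
# code): MDP[s][a] lists (probability, next_state, reward) triples.
# Assumed edge directions: a's 3/4 branch loops on state 1 and its 1/4 branch
# goes to state 2; b goes to state 2; c loops on state 2; d goes to state 1.
MDP = {
    1: {"a": [(F(3, 4), 1, 2), (F(1, 4), 2, 2)], "b": [(F(1), 2, 2)]},
    2: {"c": [(F(1), 2, 2)], "d": [(F(1), 1, 3)]},
}
STATES = sorted(MDP)


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A assumed invertible)."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def evaluate(pi, gamma):
    """Policy evaluation: solve (I - gamma P_pi) V = R_pi for the policy pi."""
    n, idx = len(STATES), {s: i for i, s in enumerate(STATES)}
    A = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    R = [F(0)] * n
    for s in STATES:
        for p, t, r in MDP[s][pi[s]]:
            A[idx[s]][idx[t]] -= gamma * p  # builds I - gamma P_pi row by row
            R[idx[s]] += p * r              # expected one-step reward R_pi(s)
    V = solve(A, R)
    return {s: V[idx[s]] for s in STATES}


def greedy(V, gamma):
    """Greedy improvement: pi(s) = argmax_a sum_t p * (r + gamma V(t)).
    Ties go to the alphabetically first action (an arbitrary choice)."""
    def q(s, a):
        return sum(p * (r + gamma * V[t]) for p, t, r in MDP[s][a])
    return {s: max(sorted(MDP[s]), key=lambda a: q(s, a)) for s in STATES}


def policy_iteration(pi0, gamma):
    """Follows the pseudocode and also records each policy value V_n."""
    pi, pi_prev, values = dict(pi0), None, []
    while pi != pi_prev:                    # while (pi != pi')
        V = evaluate(pi, gamma)             # V <- V_pi
        values.append(V)
        pi_prev, pi = pi, greedy(V, gamma)  # pi' <- pi; pi <- argmax ...
    return pi, values


gamma = F(1, 2)  # illustrative discount factor
pi_star, values = policy_iteration({1: "b", 2: "c"}, gamma)
print(pi_star)  # {1: 'b', 2: 'd'}
for n, V in enumerate(values):
    print(n, {s: str(v) for s, v in V.items()})

# Brute-force sanity check over all |A|^|S| = 4 deterministic policies.
policies = [dict(zip(STATES, acts)) for acts in product(*(sorted(MDP[s]) for s in STATES))]
assert all(evaluate(pol, gamma)[s] <= values[-1][s] for pol in policies for s in STATES)
```

Under these assumptions the recorded values are $V_0 = (4, 4)$, which is $2/(1-\gamma)$ for both states given the figure's rewards, then $(38/9, 46/9)$ and $(14/3, 16/3)$. They increase componentwise, as the convergence theorem states, and the loop stops with π(1) = b, π(2) = d after visiting 3 of the 4 possible policies. Solving the linear system directly mirrors the "solve" step of the pseudocode; for large state spaces an iterative solver would typically be preferred.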
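The order-preservation step in the convergence proof rests on the Neumann series $(I - \gamma P)^{-1} = \sum_{k \ge 0} (\gamma P)^k$, which converges when $\gamma < 1$ and $P$ is row-stochastic. A quick numerical check in Python compares a direct solve of $(I - \gamma P)Y = X$ with a 500-term truncation of the series for a few nonnegative $X$, and confirms $Y \ge 0$. The matrix, γ = 0.9, the truncation length, and the helper names are illustrative assumptions, not taken from the slides.

```python
def mat_vec(M, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def neumann_partial_sum(P, gamma, X, terms=500):
    """Return sum_{k=0}^{terms-1} (gamma P)^k X, a truncation of the Neumann series."""
    term, total = list(X), list(X)
    for _ in range(terms - 1):
        term = [gamma * t for t in mat_vec(P, term)]  # (gamma P)^k X from (gamma P)^(k-1) X
        total = [s + t for s, t in zip(total, term)]
    return total


# Illustrative values (not from the slides): a row-stochastic P and gamma < 1.
P = [[0.75, 0.25], [1.0, 0.0]]
gamma = 0.9
n = len(P)
I_minus_gamma_P = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)] for i in range(n)]

checks = []
for X in ([1.0, 0.0], [0.0, 1.0], [0.5, 2.0]):  # nonnegative vectors X >= 0
    direct = solve(I_minus_gamma_P, X)
    series = neumann_partial_sum(P, gamma, X)
    checks.append((X, direct, series))
    print(X, [round(v, 6) for v in direct], [round(v, 6) for v in series])
```

Floating-point arithmetic is enough here: with a row-stochastic $P$, the truncation error is at most $\gamma^{500}/(1-\gamma) \cdot \max_i |X_i|$, far below the printed precision.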
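The comparison theorem can be observed on the two-state example. The Python sketch runs VI ($U_{n+1} = \Phi(U_n)$) and PI side by side from $U_0 = V_0 = V_{\pi_0}$ and asserts $U_n \le V_n \le V^*$ at every step. Because $\Phi$ is defined outside this excerpt, its form here (the maximum over actions of the one-step lookahead value) is an assumption, as are the `MDP` encoding and the edge directions read off the figure, γ = 1/2, the alphabetical tie-breaking, and all helper names. $V^*$ is taken to be the last PI value, which is safe here since 10 improvement steps exceed the $|A|^{|S|} = 4$ possible policies.

```python
from fractions import Fraction as F

# Hypothetical encoding of the two-state example (an assumption, not the author's
# code): MDP[s][a] lists (probability, next_state, reward) triples. Assumed edge
# directions: a's 3/4 branch loops on state 1, its 1/4 branch and b go to state 2,
# c loops on state 2, and d goes to state 1.
MDP = {
    1: {"a": [(F(3, 4), 1, 2), (F(1, 4), 2, 2)], "b": [(F(1), 2, 2)]},
    2: {"c": [(F(1), 2, 2)], "d": [(F(1), 1, 3)]},
}
STATES = [1, 2]
GAMMA = F(1, 2)  # illustrative discount factor


def q(s, a, V):
    """One-step lookahead value of action a in state s under value vector V."""
    return sum(p * (r + GAMMA * V[t]) for p, t, r in MDP[s][a])


def phi(V):
    """Assumed form of Phi (Bellman optimality operator): Phi(V)(s) = max_a q(s, a, V)."""
    return {s: max(q(s, a, V) for a in MDP[s]) for s in STATES}


def greedy(V):
    """PI improvement step; ties go to the alphabetically first action."""
    return {s: max(sorted(MDP[s]), key=lambda a: q(s, a, V)) for s in STATES}


def evaluate(pi):
    """Exact V_pi: Gauss-Jordan elimination on the augmented system [I - GAMMA P_pi | R_pi]."""
    n, idx = len(STATES), {s: i for i, s in enumerate(STATES)}
    M = [[F(int(i == j)) for j in range(n)] + [F(0)] for i in range(n)]
    for s in STATES:
        for p, t, r in MDP[s][pi[s]]:
            M[idx[s]][idx[t]] -= GAMMA * p
            M[idx[s]][n] += p * r
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return {s: M[idx[s]][n] for s in STATES}


def run_side_by_side(pi0, steps=10):
    """Return the VI sequence U_n and the PI sequence V_n, both started at V_{pi0}."""
    pi = dict(pi0)
    V = evaluate(pi)
    U = dict(V)  # U_0 = V_0
    us, vs = [U], [V]
    for _ in range(steps):
        U = phi(U)        # VI: U_{n+1} = Phi(U_n)
        pi = greedy(V)    # PI: greedy improvement ...
        V = evaluate(pi)  # ... followed by exact policy evaluation
        us.append(U)
        vs.append(V)
    return us, vs


us, vs = run_side_by_side({1: "b", 2: "c"})
V_star = vs[-1]  # PI has converged: 10 steps exceed the |A|^|S| = 4 policies
for n, (U, V) in enumerate(zip(us, vs)):
    assert all(U[s] <= V[s] <= V_star[s] for s in STATES), n
    print(n, {s: str(U[s]) for s in STATES}, {s: str(V[s]) for s in STATES})
```

With these choices VI starts at (4, 4) and reaches (4, 5) after one step, while PI already reaches (38/9, 46/9); from the second step on, PI sits at $V^* = (14/3, 16/3)$ and VI approaches it from below.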