

... Algorithms - Comparison

• The proof is by induction on $n$. Assume $U_n \le V_n$; then, by the monotonicity of $\Phi$,
$$U_{n+1} = \Phi(U_n) \le \Phi(V_n) = \max_{\pi} \{ R_\pi + \gamma P_\pi V_n \}.$$
• Let $\pi_{n+1}$ be the maximizing policy:
$$\pi_{n+1} = \operatorname*{argmax}_{\pi} \{ R_\pi + \gamma P_\pi V_n \}.$$
• Then,
$$\Phi(V_n) = R_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} V_n \le R_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} V_{n+1} = V_{n+1}.$$

Mehryar Mohri - Foundations of Machine Learning page 29

Notes

• The PI algorithm converges in a smaller number of iterations than the VI algorithm due to the optimal policy.
• But each iteration of the PI algorithm requires computing a policy value, i.e., solving a system of linear equations, which is more expensive to compute than an iteration of the VI algorithm.

Primal Linear Program

LP formulation: choose $\alpha(s) > 0$, with $\sum_{s} \alpha(s) = 1$.
$$\min_{V} \sum_{s \in S} \alpha(s) V(s)$$
$$\text{subject to } \forall s \in S, \forall a \in A, \quad V(s) \ge E[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a] \, V(s').$$

Parameters:
• number of rows: $|S||A|$.
• number of columns: $|S|$.

Dual Linear Program

LP formulation:
$$\max_{x} \sum_{s \in S, a \in A} E[r(s, a)] \, x(s, a)$$
$$\text{subject to } \forall s' \in S, \quad \sum_{a \in A} x(s', a) = \alpha(s') + \gamma \sum_{s \in S, a \in A} \Pr[s' \mid s, a] \, x(s, a),$$
$$\forall s \in S, \forall a \in A, \quad x(s, a) \ge 0.$$

Parameters: more favorable number of rows.
• number of rows: $|S|$.
• number of columns: $|S||A|$.

This Lecture

• Markov Decision Processes (MDPs)
• Planning
• Learning
• Multi-armed bandit problem

Unknown Model

Transition and reward probabilities unknown.
• In many practical problems, e.g., robot control, the model of the environment is not known.

Training information: sequence of immediate rewards based on actions taken.

Learning approaches:
• model-free: learn policy directly.
• model-based: learn model, use it to learn policy.
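The comparison between value iteration (VI) and policy iteration (PI) in the notes above can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the two-state, two-action MDP (the tables `P` and `R` and the discount `GAMMA`) is made up, all function names are hypothetical, and a plain Gaussian elimination stands in for whatever linear solver one would normally use. It starts both algorithms from the same vector (U_0 = V_0, the value of an arbitrary initial policy) and records the iterates so that U_n ≤ V_n can be inspected.

```python
# Toy MDP with hypothetical numbers: P[s][a][t] = Pr[t | s, a], R[s][a] = E[r(s, a)].
GAMMA = 0.9
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
N_STATES, N_ACTIONS = 2, 2


def q_value(V, s, a):
    """Expected reward of (s, a) plus discounted expected value of the next state."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_STATES))


def bellman_update(V):
    """One VI step: Phi(V)(s) = max_a q_value(V, s, a)."""
    return [max(q_value(V, s, a) for a in range(N_ACTIONS)) for s in range(N_STATES)]


def greedy_policy(V):
    """Policy that maximizes R_pi + gamma * P_pi * V in every state."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(V, s, a))
            for s in range(N_STATES)]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def policy_value(policy):
    """PI evaluation step: solve (I - gamma * P_pi) V = R_pi."""
    A = [[(1.0 if s == t else 0.0) - GAMMA * P[s][policy[s]][t]
          for t in range(N_STATES)] for s in range(N_STATES)]
    b = [R[s][policy[s]] for s in range(N_STATES)]
    return solve_linear(A, b)


def compare(n_iters=10):
    """Run VI and PI side by side from U_0 = V_0 and return all (U_n, V_n) pairs."""
    policy = [0, 0]              # arbitrary initial policy (an assumption)
    V = policy_value(policy)     # V_0: value of the initial policy
    U = V[:]                     # U_0 = V_0
    history = [(U, V)]
    for _ in range(n_iters):
        U = bellman_update(U)        # VI: cheap update, no linear system
        policy = greedy_policy(V)    # PI: improvement step
        V = policy_value(policy)     # PI: evaluation step (linear system)
        history.append((U, V))
    return history


if __name__ == "__main__":
    for n, (U, V) in enumerate(compare(5)):
        print(n, [round(u, 3) for u in U], [round(v, 3) for v in V])
```

Each PI step here pays for a linear solve while each VI step is a single sweep over states and actions, which is the cost trade-off the notes describe.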