P is a stochastic matrix, thus $\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{ss'}| = 1$



For a finite MDP, Bellman's equation admits a unique solution given by
$\mathbf{V}_0 = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}$.

Mehryar Mohri - Foundations of Machine Learning, page 14

Bellman Equation - Existence and Uniqueness

Proof: Bellman's equation can be rewritten as $\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$.
• $\mathbf{P}$ is a stochastic matrix, thus
$\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{ss'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1$.
• This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. The eigenvalues of $\gamma \mathbf{P}$ therefore all have modulus less than one, and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible.

Notes: general shortest distance problem (MM, 2002).

Optimal Policy

Definition: policy $\pi^*$ with maximal value for all states $s \in S$.
• value of $\pi^*$ (optimal value):
$\forall s \in S,\ V_{\pi^*}(s) = \max_{\pi} V_{\pi}(s)$.
• optimal state-action value function: expected return for taking action $a$ at state $s$ and then following the optimal policy.
$Q^*(s, a) = E[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s')$.

Optimal Values - Bellman Equations

Property: the following equalities hold:
$\forall s \in S,\ V^*(s) = \max_{a \in A} Q^*(s, a)$.

Proof: by definition, for all $s$, $V^*(s) \le \max_{a \in A} Q^*(s, a)$.
• If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then the maximizing action would define a better policy.
Thus,
$V^*(s) = \max_{a \in A} \Big\{ E[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \Big\}$.

This Lecture

Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem

Known Model

Setting: environment model known.
Problem: find optimal policy.
Algorithms:
• value iteration.
• policy iteration.
• linear programming.

Value Iteration Algorithm

$\Phi(\mathbf{V})(s) = \max_{a \in A} \ldots$
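The closed-form solution $(\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}$ and the optimality equations above can be checked numerically on a small example. The following is a minimal sketch, not the lecture's own code: the two-state, two-action MDP, its transition probabilities and rewards, and the helper names `evaluate_policy`, `q_values` and `value_iteration` are all assumptions made for illustration. Because the definition of $\Phi$ is truncated in the source, the iteration below simply applies the right-hand side of the optimality equation for $V^*$ repeatedly, which is the usual textbook form of value iteration.

```python
import numpy as np

# Hypothetical MDP: every number below is invented for illustration.
gamma = 0.9
# P[a, s, t] = Pr[t | s, a]; each row over t sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
])
# R[s, a] = E[r(s, a)]
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])


def evaluate_policy(policy):
    """Solve V = R_pi + gamma * P_pi V for a deterministic policy."""
    policy = np.asarray(policy)
    n = policy.shape[0]
    states = np.arange(n)
    P_pi = P[policy, states]   # row s is Pr[. | s, policy[s]]
    R_pi = R[states, policy]   # entry s is E[r(s, policy[s])]
    # ||gamma * P_pi||_inf = gamma < 1, so I - gamma * P_pi is invertible.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)


def q_values(V):
    """Q(s, a) = E[r(s, a)] + gamma * sum_t Pr[t | s, a] V(t)."""
    return R + gamma * np.einsum("ast,t->sa", P, V)


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate V <- max_a Q(., a) until successive iterates differ by < tol."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        V_new = q_values(V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_star = value_iteration()
greedy = q_values(V_star).argmax(axis=1)   # maximizing action in each state
V_greedy = evaluate_policy(greedy)         # closed-form value of that policy

print("V* (value iteration):", V_star)
print("greedy policy:", greedy)
print("V of greedy policy (linear solve):", V_greedy)
```

Under these assumptions, the value obtained by the linear solve for the greedy policy should agree with the fixed point of the iteration up to the chosen tolerance, and no other deterministic policy should have a larger value in any state.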
