lecture_11

# P is a stochastic matrix, thus $\|P\|_\infty = \max_s \sum_{s'} |P_{ss'}| = 1$


Source: Mehryar Mohri - Foundations of Machine Learning (lecture slides, pages 14-19).

## Bellman Equation - Existence and Uniqueness

Theorem: for a finite MDP, Bellman's equation admits a unique solution given by

$$V_0 = (I - \gamma P)^{-1} R.$$

Proof: Bellman's equation can be rewritten as $V = R + \gamma P V$.

- $P$ is a stochastic matrix, thus
  $$\|P\|_\infty = \max_s \sum_{s'} |P_{ss'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1.$$
- This implies that $\|\gamma P\|_\infty = \gamma < 1$. Every eigenvalue of $\gamma P$ therefore has modulus at most $\gamma < 1$, so $1$ is not an eigenvalue of $\gamma P$ and $(I - \gamma P)$ is invertible.

Notes: general shortest distance problem (MM, 2002).

## Optimal Policy

Definition: policy $\pi^*$ with maximal value for all states $s \in S$.

- Value of $\pi^*$ (optimal value):
  $$\forall s \in S, \quad V_{\pi^*}(s) = \max_{\pi} V_{\pi}(s).$$
- Optimal state-action value function: expected return for taking action $a$ at state $s$ and then following the optimal policy:
  $$Q^*(s, a) = \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a] \, V^*(s').$$

## Optimal Values - Bellman Equations

Property: the following equalities hold:

$$\forall s \in S, \quad V^*(s) = \max_{a \in A} Q^*(s, a).$$

Proof: by definition, for all $s$, $V^*(s) \le \max_{a \in A} Q^*(s, a)$.

- If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then the maximizing action would define a better policy.

Thus,

$$V^*(s) = \max_{a \in A} \Big\{ \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a] \, V^*(s') \Big\}.$$

## This Lecture

- Markov Decision Processes (MDPs)
- Planning
- Learning
- Multi-armed bandit problem

## Known Model

Setting: environment model known.

Problem: find optimal policy.

Algorithms:

- value iteration.
- policy iteration.
- linear programming.

## Value Iteration Algorithm

$\Phi(V)(s) = \max_{a \in A} \ldots$
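The existence and uniqueness result for a fixed policy (the Bellman equation $V = R + \gamma P V$ with solution $(I - \gamma P)^{-1} R$) can be checked numerically. The following is a minimal sketch, not code from the lecture: the 3-state transition matrix `P`, the reward vector `R`, the discount `GAMMA`, and all function names are hypothetical values chosen for illustration. It solves the linear system directly and compares the result with repeated application of $V \leftarrow R + \gamma P V$, which converges because $\|\gamma P\|_\infty = \gamma < 1$.

```python
# Hypothetical 3-state MDP under a fixed policy; all numbers are illustrative.
GAMMA = 0.9
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]  # row s holds Pr[s' | s, pi(s)]; each row sums to 1
R = [1.0, 0.0, 2.0]  # expected immediate reward in each state


def inf_norm(M):
    """Induced infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_closed_form(P, R, gamma):
    """Return V solving (I - gamma P) V = R."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, R)


def evaluate_iteratively(P, R, gamma, sweeps=2000):
    """Apply V <- R + gamma P V repeatedly, starting from V = 0."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
             for i in range(n)]
    return V


V_exact = evaluate_closed_form(P, R, GAMMA)
V_iter = evaluate_iteratively(P, R, GAMMA)
```

Both routes should agree up to floating-point error, since the fixed point is unique. Gaussian elimination is used here only to keep the sketch free of external dependencies; a linear algebra library would be the usual choice in practice.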

## This note was uploaded on 07/12/2012 for the course CSCI GA.2566-00 taught by Professor Mohri during the Spring '12 term at NYU.
