Mehryar Mohri - Foundations of Machine Learning

State reached: $s_{t+1} = \delta(s_t, a_t)$.
Reward received: $r_{t+1} = r(s_t, a_t)$.

[Figure: agent-environment loop; actions drive the transitions $s_t \to s_{t+1} \to s_{t+2}$ along edges labeled $a_t/r_{t+1}$ and $a_{t+1}/r_{t+2}$.]

MDPs - Properties

• Finite MDPs: $A$ and $S$ are finite sets.
• Finite horizon when $T < \infty$.
• Reward $r(s, a)$: often a deterministic function.

Example - Robot Picking up Balls

[Figure: two-state transition diagram over states "start" and "other", with edges labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, -1].]

Policy

Definition: a policy is a mapping $\pi : S \to A$.

Objective: find a policy $\pi$ maximizing the expected return:
• finite horizon: $\sum_{\tau=0}^{T-t} r(s_{t+\tau}, \pi(s_{t+\tau}))$;
• infinite horizon: $\sum_{\tau=0}^{+\infty} \gamma^\tau\, r(s_{t+\tau}, \pi(s_{t+\tau}))$, with discount factor $\gamma \in [0, 1)$.

Theorem: there exists an optimal policy from any start state.

Policy Value

Definition: the value of a policy $\pi$ at state $s$ is
• finite horizon: $V_\pi(s) = E\big[\sum_{\tau=0}^{T-t} r(s_{t+\tau}, \pi(s_{t+\tau})) \,\big|\, s_t = s\big]$;
• infinite horizon, with discount factor $\gamma \in [0, 1)$: $V_\pi(s) = E\big[\sum_{\tau=0}^{+\infty} \gamma^\tau\, r(s_{t+\tau}, \pi(s_{t+\tau})) \,\big|\, s_t = s\big]$.

Problem: find a policy $\pi$ with maximum value for all states.

Policy Evaluation

Analysis of the policy value (infinite horizon): splitting off the first reward and using the Markov property,

$V_\pi(s) = E\big[\sum_{\tau=0}^{+\infty} \gamma^\tau\, r(s_{t+\tau}, \pi(s_{t+\tau})) \,\big|\, s_t = s\big]$
$\quad = E[r(s, \pi(s))] + \gamma\, E\big[\sum_{\tau=0}^{+\infty} \gamma^\tau\, r(s_{t+1+\tau}, \pi(s_{t+1+\tau})) \,\big|\, s_t = s\big]$
$\quad = E[r(s, \pi(s))] + \gamma\, E[V_\pi(\delta(s, \pi(s)))]$.

Bellman equation (system of linear equations):

$V_\pi(s) = E[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s')$.

Bellman Equation - Existence and Uniqueness

Notation:
• transition probability matrix: $P_{s,s'} = \Pr[s' \mid s, \pi(s)]$;
• value column matrix: $V = V_\pi(s)$;
• expected reward column matrix: $R = E[r(s, \pi(s))]$.

Theorem ...
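The Bellman equation above characterizes $V_\pi$ as the fixed point of a linear map. The sketch below, which is not from the slides, evaluates a fixed policy by iterating that map; the transition matrix P, reward vector R, and discount gamma are made-up illustration values for a hypothetical 3-state MDP.

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi (illustration
# values only): P[s, s'] = Pr[s' | s, pi(s)], R[s] = E[r(s, pi(s))].
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, -1.0, 2.0])
gamma = 0.9

def evaluate_policy(P, R, gamma, tol=1e-10):
    """Iterate the Bellman update V <- R + gamma * P V until the
    change is below tol; it converges to the unique fixed point V_pi."""
    V = np.zeros(len(R))
    while True:
        V_next = R + gamma * P @ V
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

print(evaluate_policy(P, R, gamma))  # approximate V_pi(s) for each state
```

Because the update is a $\gamma$-contraction in the sup norm, each sweep shrinks the error by a factor of $\gamma$.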
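The final slide (truncated in this preview) sets up the matrix form of the Bellman equation, $V = R + \gamma P V$, i.e. $(I - \gamma P)V = R$. As a complement to the iterative sketch, and again with the same hypothetical values, the system can be solved directly; this makes the existence and uniqueness claim concrete, since $I - \gamma P$ is invertible whenever $P$ is row-stochastic and $\gamma < 1$.

```python
import numpy as np

# Same hypothetical MDP as in the sketch above.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
R = np.array([1.0, -1.0, 2.0])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P V, so
# (I - gamma * P) V = R. Since gamma * P has spectral radius < 1,
# I - gamma * P is invertible and the solution exists and is unique.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```

The direct solve is exact up to floating point but costs $O(|S|^3)$; the iterative version is preferable when the state space is large or $P$ is sparse.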