lecture_11

# MDPs, Policies, Policy Value, and Policy Evaluation


Mehryar Mohri - Foundations of Machine Learning

State reached: $s_{t+1} = \delta(s_t, a_t)$.

Reward received: $r_{t+1} = r(s_t, a_t)$.

[Figure: agent-environment interaction, with the trajectory $s_t \xrightarrow{a_t / r_{t+1}} s_{t+1} \xrightarrow{a_{t+1} / r_{t+2}} s_{t+2}$.]

## MDPs - Properties

- Finite MDPs: $A$ and $S$ are finite sets.
- Finite horizon when $T < \infty$.
- Reward $r(s, a)$: often a deterministic function.

## Example - Robot Picking up Balls

[Figure: transition diagram with states "start" and "other"; edges are labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], carry/[.5, R3], carry/[.5, -1], pickup/[1, R2].]

## Policy

Definition: a policy is a mapping $\pi : S \to A$.

Objective: find a policy $\pi$ maximizing the expected return.

- finite horizon: $\sum_{\tau=0}^{T-t} r(s_{t+\tau}, \pi(s_{t+\tau}))$.
- infinite horizon: $\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+\tau}, \pi(s_{t+\tau}))$, $\gamma \in [0, 1)$.

Theorem: there exists an optimal policy from any start state.

## Policy Value

Definition: the value of a policy $\pi$ at state $s$ is

- finite horizon:

$$V_\pi(s) = \mathrm{E}\left[\sum_{\tau=0}^{T-t} r(s_{t+\tau}, \pi(s_{t+\tau})) \,\middle|\, s_t = s\right].$$

- infinite horizon: discount factor $\gamma \in [0, 1)$,

$$V_\pi(s) = \mathrm{E}\left[\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+\tau}, \pi(s_{t+\tau})) \,\middle|\, s_t = s\right].$$

Problem: find a policy $\pi$ with maximum value for all states.

## Policy Evaluation

Analysis of policy value:

$$
\begin{aligned}
V_\pi(s) &= \mathrm{E}\left[\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+\tau}, \pi(s_{t+\tau})) \,\middle|\, s_t = s\right] \\
&= \mathrm{E}[r(s, \pi(s))] + \gamma\, \mathrm{E}\left[\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+1+\tau}, \pi(s_{t+1+\tau})) \,\middle|\, s_t = s\right] \\
&= \mathrm{E}[r(s, \pi(s))] + \gamma\, \mathrm{E}[V_\pi(\delta(s, \pi(s)))].
\end{aligned}
$$

Bellman equation (system of linear equations):

$$V_\pi(s) = \mathrm{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s').$$

## Bellman Equation - Existence and Uniqueness

Notation:

- transition probability matrix: $P_{s,s'} = \Pr[s' \mid s, \pi(s)]$.
- value column matrix: $V = V_\pi(s)$.
- expected reward column matrix: $R = \mathrm{E}[r(s, \pi(s))]$.

Theorem...
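To make the Bellman equation for a fixed policy concrete, the sketch below evaluates a policy numerically using the matrix notation introduced above (P, R, V). It is only an illustration under stated assumptions: it repeatedly applies the update V ← R + γPV, which is one standard way to approximate the solution when γ < 1 and is not necessarily the method the lecture goes on to present. The function name `evaluate_policy`, the tolerance, and the two-state example numbers are all hypothetical.

```python
def evaluate_policy(P, R, gamma, tol=1e-12, max_iter=100_000):
    """Approximate V_pi by iterating V <- R + gamma * P V.

    P[s][s2] = Pr[s2 | s, pi(s)]   (transition matrix under the policy)
    R[s]     = E[r(s, pi(s))]      (expected one-step reward)
    gamma    = discount factor, assumed to satisfy 0 <= gamma < 1
    """
    n = len(R)
    V = [0.0] * n  # arbitrary starting point
    for _ in range(max_iter):
        # One application of the right-hand side of the Bellman equation.
        V_new = [
            R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
            for s in range(n)
        ]
        # Stop once successive iterates agree to within tol.
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state MDP under a fixed policy (numbers are made up).
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [1.0, 0.0]
gamma = 0.9

V = evaluate_policy(P, R, gamma)
print(V)
```

In this made-up example, state 1 is absorbing with zero reward, so its value is 0, and the Bellman equation for state 0 reduces to V(0) = 1 + 0.45 V(0). A direct linear solve of (I - γP)V = R would be an equally reasonable design choice for small state spaces.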

## This note was uploaded on 07/12/2012 for the course CSCI GA.2566-00 taught by Professor Mohri during the Spring '12 term at NYU.
