Foundations of Machine Learning
Lecture 11
Mehryar Mohri
Courant Institute and Google Research
[email protected]
Reinforcement Learning
Agent exploring an environment. Interactions with the environment: the agent takes an action, and the environment returns a new state and a reward.

Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.

[Diagram: the agent sends an action to the environment; the environment returns a state and a reward to the agent.]
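A minimal sketch of this interaction loop, assuming hypothetical env.reset/env.step and agent.act interfaces (these names are illustrative, not from the lecture):

def run_episode(env, agent, horizon):
    # Minimal agent-environment interaction loop (illustrative sketch).
    # `env` and `agent` are hypothetical objects; many RL libraries
    # follow this general reset/step pattern.
    state = env.reset()                    # observe the start state s_0
    total_reward = 0.0
    for _ in range(horizon):
        action = agent.act(state)          # choose a_t given s_t
        state, reward = env.step(action)   # receive s_{t+1} and r_{t+1}
        total_reward += reward             # accumulate cumulative reward
    return total_reward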
Key Features

Contrast with supervised learning: no explicit labeled training data; the distribution is defined by the actions taken. Rewards or penalties are delayed.

RL trade-off: exploration (of unknown states and actions) to gain more reward information, versus exploitation (of known information) to optimize the reward.
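A standard heuristic for this trade-off is epsilon-greedy action selection, sketched below; the value estimates q and the choice of epsilon are illustrative assumptions, not part of the lecture:

import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    # With probability epsilon, try a random action (exploration);
    # otherwise pick the action with the highest estimated value
    # (exploitation). `q` maps (state, action) pairs to estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])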
Applications

Robot control, e.g., Robocup soccer teams (Stone et al., 1999).
Board games, e.g., TD-Gammon (Tesauro, 1995).
Elevator scheduling (Crites and Barto, 1996).
Telecommunications.
Inventory management.
Dynamic radio channel assignment.
This Lecture

Markov decision processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Markov Decision Process (MDP)

Definition: a Markov decision process is defined by:
- a set of decision epochs {0, ..., T};
- a set of states S, possibly infinite;
- a start state or initial state s_0 ∈ S;
- a set of actions A, possibly infinite;
- a transition probability Pr[s' | s, a]: distribution over destination states s' = δ(s, a);
- a reward probability Pr[r' | s, a]: distribution over rewards returned r' = r(s, a).
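As a concrete (assumed) encoding, a finite MDP can be stored as transition and reward tables; this container is an illustrative sketch, not notation from the lecture:

from dataclasses import dataclass

@dataclass
class FiniteMDP:
    # Illustrative container for a finite MDP with deterministic
    # rewards r(s, a) and stochastic transitions Pr[s' | s, a].
    states: list
    actions: list
    start_state: object
    trans: dict    # trans[(s, a)] = {s': Pr[s' | s, a]}
    reward: dict   # reward[(s, a)] = r(s, a)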
Model

State observed at time t: s_t ∈ S.
Action taken at time t: a_t ∈ A.
State reached: s_{t+1} = δ(s_t, a_t).
Reward received: r_{t+1} = r(s_t, a_t).

[Diagram: trajectory s_t → s_{t+1} → s_{t+2} with edge labels a_t/r_{t+1} and a_{t+1}/r_{t+2}; the agent sends actions to the environment, which returns states and rewards.]
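One environment step under this model can be sampled from the tables of the assumed FiniteMDP container above, drawing s_{t+1} from Pr[· | s_t, a_t]:

import random

def step(mdp, s, a):
    # Sample one MDP transition: draw s' from Pr[. | s, a] and
    # return it together with the (deterministic) reward r(s, a).
    dist = mdp.trans[(s, a)]  # {s': probability}
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, mdp.reward[(s, a)]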
MDPs - Properties

Finite MDPs: A and S are finite sets.
Finite horizon when T < ∞.
Reward r(s, a): often a deterministic function.
Example - Robot Picking Up Balls

[MDP diagram with states "start" and "other"; edges are labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], carry/[.5, R3], carry/[.5, -1], pickup/[1, R2].]
Policy

Definition: a policy is a mapping π : S → A.

Objective: find a policy π maximizing the expected return:
- finite horizon: Σ_{τ=0}^{T−t} r(s_{t+τ}, π(s_{t+τ}));
- infinite horizon: Σ_{τ=0}^{∞} γ^τ r(s_{t+τ}, π(s_{t+τ})), with discount factor γ ∈ [0, 1).

Theorem: there exists an optimal policy from any start state.
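As a numeric illustration of the discounted return, the sum Σ γ^τ r_τ can be computed directly from a reward sequence; this sketch assumes nothing beyond the formula above:

def discounted_return(rewards, gamma):
    # Sum of gamma^tau * r_tau over a finite reward sequence,
    # i.e., a truncated version of the infinite-horizon return.
    return sum(gamma ** tau * r for tau, r in enumerate(rewards))

# For example, rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
assert discounted_return([1, 1, 1], 0.5) == 1.75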
Policy Value

Definition: the value of a policy π at state s is the expected return when starting at s and following π:
- finite horizon: V_π(s) = E[ Σ_{τ=0}^{T−t} r(s_{t+τ}, π(s_{t+τ})) | s_t = s ];
- infinite horizon: V_π(s) = E[ Σ_{τ=0}^{∞} γ^τ r(s_{t+τ}, π(s_{t+τ})) | s_t = s ], with discount factor γ ∈ [0, 1).

Problem: find a policy with maximum value for all states.
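V_π(s) can be estimated by averaging sampled returns; the sketch below reuses the hypothetical FiniteMDP and step from earlier slides and truncates the infinite-horizon sum at a fixed horizon, so it is only an approximation:

def mc_policy_value(mdp, policy, s, gamma, n_episodes=1000, horizon=100):
    # Monte Carlo estimate of V_pi(s): average the truncated discounted
    # returns of n_episodes rollouts following `policy` (a dict s -> a).
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            action = policy[state]               # a_t = pi(s_t)
            state, r = step(mdp, state, action)  # s_{t+1}, r(s_t, a_t)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_episodes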