Foundations of Machine Learning
Lecture 11
Mehryar Mohri
Courant Institute and Google Research
[email protected]
Reinforcement Learning
Agent exploring an environment; interactions with the environment: the agent takes an action, and the environment returns a new state and a reward.
Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.
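This interaction loop can be sketched in a few lines of Python. The two-state `Environment` and the random agent below are hypothetical placeholders invented for illustration, not anything from the lecture:

```python
import random

random.seed(0)  # reproducible interaction trace

class Environment:
    """Toy deterministic environment (hypothetical; for illustration only)."""
    def __init__(self):
        self.state = 0  # start state

    def step(self, action):
        # Toy dynamics: reward 1 only for action 1 taken in state 0.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = (self.state + action) % 2
        return self.state, reward

def random_policy(state):
    # A deliberately naive agent: pick an action uniformly at random.
    return random.choice([0, 1])

env = Environment()
state, total = 0, 0.0
for t in range(100):
    action = random_policy(state)      # agent acts on the current state
    state, reward = env.step(action)   # environment returns state and reward
    total += reward                    # cumulative reward to be maximized
```

The loop is the whole picture on the slide: action out, state and reward back, cumulative reward accumulated.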
Key Features

Contrast with supervised learning: there is no explicit labeled training data, and the distribution over observations is defined by the actions taken. Rewards or penalties may be delayed.
RL trade-off: exploration (of unknown states and actions) to gain more reward information, versus exploitation (of known information) to optimize the reward.
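The exploration/exploitation trade-off is often handled in practice with an epsilon-greedy rule. The sketch below is a standard heuristic, not something prescribed by the lecture; it assumes estimated action values are kept in a list:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Setting epsilon = 0 gives pure exploitation; epsilon = 1 gives pure exploration.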
Applications

Robot control, e.g., RoboCup soccer teams (Stone et al., 1999). Board games, e.g., TD-Gammon (Tesauro, 1995). Elevator scheduling (Crites and Barto, 1996). Telecommunications. Inventory management. Dynamic radio channel assignment.
This Lecture

Markov decision processes (MDPs). Planning. Learning. Multi-armed bandit problem.
Markov Decision Process (MDP)

Definition: a Markov decision process is defined by:
- a set of decision epochs {0, ..., T};
- a set of states S, possibly infinite;
- a start state or initial state s0 ∈ S;
- a set of actions A, possibly infinite;
- a transition probability Pr[s' | s, a]: distribution over destination states s' = δ(s, a);
- a reward probability Pr[r' | s, a]: distribution over rewards returned r' = r(s, a).
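A finite instance of this definition can be written out directly as tables. The two-state MDP below is a made-up example, used only to show the shape of Pr[s' | s, a] and of a deterministic reward r(s, a):

```python
# A finite MDP as plain tables (hypothetical two-state instance; the names
# and numbers are illustrative, not from the lecture).
# transition[s][a] = list of (destination s', probability Pr[s' | s, a]);
# reward[(s, a)]   = reward r(s, a), deterministic here.
states = ["s0", "s1"]
actions = ["stay", "go"]
transition = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
reward = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
          ("s1", "stay"): 0.0, ("s1", "go"): 0.0}

# Sanity check: each Pr[. | s, a] must be a probability distribution.
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in transition[s][a]) - 1.0) < 1e-9
```

The sanity check at the end verifies that each row of the transition table sums to one, as the definition requires.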
Model

State observed at time t: s_t ∈ S. Action taken at time t: a_t ∈ A. State reached: s_{t+1} = δ(s_t, a_t). Reward received: r_{t+1} = r(s_t, a_t).
[Diagram: the agent sends actions to the environment and receives states and rewards, tracing s_t --(a_t / r_{t+1})--> s_{t+1} --(a_{t+1} / r_{t+2})--> s_{t+2}.]
MDPs - Properties

Finite MDPs: A and S are finite sets. Finite horizon when T < ∞. Reward r(s, a): often a deterministic function.
Example - Robot Picking Up Balls

[State diagram with states "start" and "other"; edges are labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], and carry/[.5, -1].]
Policy

Definition: a policy is a mapping π : S → A.
Objective: find a policy π maximizing the expected return:
- finite horizon (T): Σ_{τ=0..T-t} r(s_{t+τ}, π(s_{t+τ}));
- infinite horizon: Σ_{τ=0..∞} γ^τ r(s_{t+τ}, π(s_{t+τ})), with γ ∈ [0, 1).
Theorem: there exists an optimal policy from any start state.
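For a concrete reading of the infinite-horizon objective, a minimal sketch, assuming the per-step rewards have already been collected into a list:

```python
def discounted_return(rewards, gamma=0.9):
    """Infinite-horizon objective, truncated to an observed reward sequence:
    sum over tau of gamma**tau * rewards[tau], with gamma in [0, 1)."""
    return sum((gamma ** tau) * r for tau, r in enumerate(rewards))
```

With a reward of 1 at every step, the sum approaches 1 / (1 - γ) as the sequence grows, which is why γ < 1 keeps the infinite-horizon return finite.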
Policy Value

Definition: the value of a policy π at state s is
- finite horizon: V_π(s) = E[Σ_{τ=0..T-t} r(s_{t+τ}, π(s_{t+τ})) | s_t = s];
- infinite horizon: V_π(s) = E[Σ_{τ=0..∞} γ^τ r(s_{t+τ}, π(s_{t+τ})) | s_t = s], with discount factor γ ∈ [0, 1).
Problem: find a policy π with maximum value for all states.
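One standard way to compute the infinite-horizon value of a fixed policy is iterative policy evaluation. The sketch below assumes tabular transition and reward functions and is illustrative, not an algorithm given on this slide:

```python
def evaluate_policy(states, policy, transition, reward, gamma=0.9, iters=500):
    """Iterative policy evaluation for a fixed policy pi: repeatedly apply
    V(s) <- r(s, pi(s)) + gamma * sum over s' of Pr[s' | s, pi(s)] * V(s').
    Here transition[s][a] is a list of (destination, probability) pairs
    and reward[(s, a)] is an expected reward (hypothetical table formats)."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Synchronous update: the comprehension reads the old V throughout.
        V = {s: reward[(s, policy[s])]
                + gamma * sum(p * V[s2] for s2, p in transition[s][policy[s]])
             for s in states}
    return V
```

For example, a single state that loops to itself with reward 1 under γ = 0.5 converges to the geometric-series value 1 / (1 - 0.5) = 2.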