1. CMPSCI 383, Nov 29, 2011: Reinforcement Learning

2. Today's lecture
   - Review of Chapter 17: Making Complex Decisions
     - Sequential decision problems
   - The motivation and advantages of reinforcement learning
   - Passive learning
     - Policy evaluation
     - Direct utility estimation
     - Adaptive dynamic programming
     - Temporal Difference (TD) learning

3. Check this out
   http://www.technologyreview.com/computing/39156/
   MIT's Technology Review has an in-depth interview with Peter Norvig, Google's Director of Research, and Eric Horvitz, a Distinguished Scientist at Microsoft Research, about their optimism for the future of AI.

4. A Simple Example
   - Gridworld with 2 goal states
   - Actions: Up, Down, Left, Right
   - Fully observable: the agent knows where it is

5. Transition Model
   [figure: the stochastic transition model for the gridworld; not captured in this text extraction]

6. Markov Assumption
   - The probability of reaching the next state depends only on the current state and action, P(s' | s, a), not on the earlier history.

7. Agent's Utility Function
   - Performance depends on the entire sequence of states and actions: the environment history.
   - In each state, the agent receives a reward R(s).
   - The reward is real-valued; it may be positive or negative.
   - Utility of an environment history = sum of the rewards received.

8. Reward Function
   [gridworld figure: R(s) = -0.04 for every non-terminal state]

9. Decision Rules
   - A decision rule says what to do in each state.
   - Decision rules are often called policies, written π.
   - The action for state s is given by π(s).

10. Our Goal
   - Find the policy that maximizes the expected sum of rewards.
   - Such a policy is called an optimal policy.

11. Markov Decision Process (MDP)
   - M = (S, A, P, R)
   - S = set of possible states
   - A = set of possible actions
   - P(s' | s, a) gives the transition probabilities
   - R = reward function
   - Goal: find an optimal policy, π*.

12. Finite/Infinite Horizon
   - Finite horizon: the game ends after N steps.
   - Infinite horizon: the game never ends.
   - With a finite horizon, the optimal action in a given state can change over time; the optimal policy is nonstationary.
   - With an infinite horizon, the optimal policy is stationary.

13. Utilities over Time
   - Additive rewards: U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
   - Discounted rewards: U([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ^2 R(s2) + ...
   - Discount factor: γ, with 0 ≤ γ ≤ 1 (γ = 1 recovers additive rewards).

14. Discounted Rewards
   - Would you rather have a marshmallow now, or two in 20 minutes?
   - Discounting also tames infinite sums: with γ < 1 and rewards bounded by R_max, the discounted sum is at most R_max / (1 - γ).

15. Utility of States
   - Given a policy π, we can define the utility of a state:
     U^π(s) = E[ Σ_{t ≥ 0} γ^t R(s_t) ], where s_0 = s and actions are chosen by π.

16. Policy Evaluation
   - Finding the utility of every state for a given policy.
   - Solve a system of linear equations, one per state:
     U^π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^π(s')
   - This is an instance of a Bellman equation. (A worked policy-evaluation sketch follows these notes.)
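The following is a minimal Python sketch of the policy-evaluation step from slide 16, run on a small gridworld like the one on slides 4 and 8. It is an illustration under assumptions, not the lecture's own code: the 4x3 layout, the terminal rewards +1 and -1, the blocked square, the 0.8/0.1/0.1 slip model, the discount factor, and the "always Right" policy are all assumed for the example; only the step reward R(s) = -0.04 and the Bellman equation itself come from the slides.

```python
# A minimal policy-evaluation sketch on an assumed 4x3 gridworld.
# Solves the linear Bellman system of slide 16 directly:
#   U(s) = R(s) + gamma * sum_{s'} P(s' | s, pi(s)) U(s')

import numpy as np

ROWS, COLS = 3, 4
WALL = {(1, 1)}                              # blocked square (assumption)
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}     # goal states and their rewards (assumption)
STEP_REWARD = -0.04                          # from slide 8
GAMMA = 0.95                                 # discount factor (assumed value)

ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

STATES = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL]


def move(state, action):
    """Deterministic effect of one move; bumping into a wall or the edge stays put."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt in WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt


def transition_probs(state, action):
    """P(s' | s, a): intended move with prob 0.8, each sideways slip with prob 0.1 (assumed model)."""
    probs = {}
    for a, p in [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs


def evaluate_policy(policy):
    """Solve U = R + gamma * P_pi U as one linear system (the slide-16 equations)."""
    idx = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    P = np.zeros((n, n))
    R = np.zeros(n)
    for s in STATES:
        if s in TERMINALS:
            R[idx[s]] = TERMINALS[s]   # terminal state: utility is just its reward
            continue                   # no outgoing transitions from a terminal
        R[idx[s]] = STEP_REWARD
        for s2, p in transition_probs(s, policy[s]).items():
            P[idx[s], idx[s2]] += p
    U = np.linalg.solve(np.eye(n) - GAMMA * P, R)
    return {s: U[idx[s]] for s in STATES}


if __name__ == "__main__":
    # An arbitrary fixed policy (assumed for the example): always try to move Right.
    policy = {s: "Right" for s in STATES}
    utilities = evaluate_policy(policy)
    for r in range(ROWS):
        print("  ".join(f"{utilities.get((r, c), float('nan')):7.3f}" for c in range(COLS)))
```

Running it prints a 3x4 grid of state utilities under the fixed policy; swapping in a different policy dictionary re-evaluates it. This exact computation is what the passive-learning methods listed on slide 2 (direct utility estimation, adaptive dynamic programming, TD learning) approximate from experience when the transition model and rewards are not known in advance.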