1  CMPSCI 383, Nov 29, 2011
   Reinforcement Learning

2  Today's lecture
   - Review of Chapter 17: Making Complex Decisions
     - Sequential decision problems
   - The motivation and advantages of reinforcement learning
   - Passive learning
     - Policy evaluation
     - Direct utility estimation
     - Adaptive dynamic programming
     - Temporal Difference (TD) learning

3  Check this out
   http://www.technologyreview.com/computing/39156/
   MIT's Technology Review has an in-depth interview with Peter Norvig, Google's Director of Research, and Eric Horvitz, a Distinguished Scientist at Microsoft Research, about their optimism for the future of AI.

4  A Simple Example
   - Gridworld with 2 goal states
   - Actions: Up, Down, Left, Right
   - Fully observable: the agent knows where it is

5  Transition Model

6  Markov Assumption

7  Agent's Utility Function
   - Performance depends on the entire sequence of states and actions: the environment history.
   - In each state, the agent receives a reward R(s).
   - The reward is real-valued; it may be positive or negative.
   - Utility of an environment history = sum of the rewards received.

8  Reward Function
   [Figure: gridworld in which each nonterminal state has reward -0.04]

9  Decision Rules
   - Decision rules say what to do in each state.
   - Often called policies, π.
   - The action for state s is given by π(s).

10 Our Goal
   - Find the policy that maximizes the expected sum of rewards.
   - Such a policy is called an optimal policy.

11 Markov Decision Process (MDP)
   M = (S, A, P, R)
   - S = set of possible states
   - A = set of possible actions
   - P(s' | s, a) gives the transition probabilities
   - R = reward function
   Goal: find an optimal policy, π*.

12 Finite/Infinite Horizon
   - Finite horizon: the game ends after N steps.
   - Infinite horizon: the game never ends.
   - With a finite horizon, the optimal action in a given state can change over time: the optimal policy is nonstationary.
   - With an infinite horizon, the optimal policy is stationary.

13 Utilities over Time
   - Additive rewards:
     U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...
   - Discounted rewards:
     U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ...
   - Discount factor: γ, with 0 ≤ γ ≤ 1 (γ = 1 recovers additive rewards).

14 Discounted Rewards
   - Would you rather have a marshmallow now, or two in 20 minutes?
   - Infinite sums! With γ < 1 and rewards bounded by R_max, the discounted sum converges: it is at most R_max / (1 - γ).
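The discounted-utility computation above can be sketched in a few lines of Python (a minimal illustration, not part of the original slides; the function name `discounted_return` and the example reward sequences are our own):

```python
def discounted_return(rewards, gamma=0.9):
    """Utility of an environment history: the sum of gamma**t * R(s_t)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The marshmallow question: one reward now vs. two after a 20-step delay.
now = discounted_return([1])               # 1.0
later = discounted_return([0] * 20 + [2])  # 2 * 0.9**20, roughly 0.24
```

With γ = 0.9 the single immediate marshmallow wins; with γ close enough to 1, waiting for two would be preferred.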
15 Utility of States
   Given a policy π, we can define the utility of a state:
   U^π(s) = E[ Σ_t γ^t R(s_t) | s_0 = s, actions chosen by π ]

16 Policy Evaluation
   - Finding the utility of every state for a given policy.
   - Solve a system of linear equations:
     U^π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^π(s')
   - This is an instance of a Bellman equation....
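The linear system on slide 16 can also be solved by simply iterating the Bellman equations to a fixed point. A minimal sketch, assuming a dictionary-based MDP representation of our own design (the two-state example MDP at the bottom is invented for illustration and does not appear in the slides):

```python
def evaluate_policy(states, policy, R, P, gamma=0.9, iters=500):
    """Iteratively solve U(s) = R(s) + gamma * sum_s' P(s'|s, policy[s]) * U(s').

    P maps (state, action) pairs to {next_state: probability} dicts.
    """
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        # Each pass applies the Bellman backup to every state at once.
        U = {s: R[s] + gamma * sum(p * U[s2]
                                   for s2, p in P[(s, policy[s])].items())
             for s in states}
    return U

# Tiny example: A deterministically leads to B, and B loops on itself.
states = ['A', 'B']
policy = {'A': 'go', 'B': 'go'}
R = {'A': 0.0, 'B': 1.0}
P = {('A', 'go'): {'B': 1.0}, ('B', 'go'): {'B': 1.0}}
U = evaluate_policy(states, policy, R, P, gamma=0.9)
```

Here U('B') converges to 1 / (1 - γ) = 10 and U('A') to 0 + γ * 10 = 9, which matches solving the two linear equations directly. For small state spaces one could instead solve the |S| x |S| linear system exactly, as the slide suggests.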
Course: Artificial Intelligence (CMPSCI 383), Fall '11. Instructor: Andrew Barto.
