CS221 Lecture notes

Reinforcement learning I

In supervised learning, we assumed we had a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ in which we had the "right answer" $y^{(i)}$ for every instance $x^{(i)}$. For example, a supervised learning algorithm would learn to drive by predicting the actions of an expert human driver. Today, we're going to talk about a different setting called reinforcement learning, where we don't know the right answers ahead of time; we only know a "reward function" which tells us the goodness of particular states. (For instance, we might believe that a state where the helicopter is in the air is better than one where it is lying in bits and pieces on the ground.)

Essentially, we specify a reward function mapping states of the world to real numbers. The algorithm's goal is to find a sequence of actions which maximizes this reward function over time. This temporal component means the algorithm is not simply making a one-shot decision. For instance, the helicopter must choose actions which not only allow it to stay in the air at this exact moment, but which also keep it stable enough that it can remain in the air continually.

If the world were completely deterministic, it would be easy to maximize such a reward function using techniques from our lectures on search. However, in almost all robotics problems, the dynamics are stochastic, in that the same action doesn't lead to the exact same result every time. For instance, telling a robot to move one meter forward could typically result in it moving anywhere between 95 and 105 cm forward, due to factors such as slippage of the wheels. Even a small amount of randomness would cause big problems for the deterministic search algorithms we covered earlier in the course.

We will take an approach where we "reward" the robot for desired outcomes and "punish" it for undesired ones.
One challenge faced by reinforcement learning is the credit assignment problem. Upon reaching a position with negative reward, it may not be obvious which previous action caused that negative reward. For instance, suppose you are driving and you crash your car. Chances are, you slammed on your brakes shortly before the crash. You wouldn't want to conclude from this that it's a bad idea to ever step on the brakes again. Rather, the crash was probably due to an action you chose much earlier, such as your decision to go 90 MPH down the highway.

In this lecture, we present a standard reinforcement learning formalism called the Markov decision process (MDP). MDPs allow us to model the (stochastic) dynamics of the world as well as the desired outcomes. As we will see, within this formalism, we can tractably compute the optimal behaviors which maximize the reward function over time....
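
The earlier point about stochastic dynamics can be illustrated with a small simulation. The sketch below is not part of the original notes: it assumes each "move one meter forward" command moves the robot a distance drawn uniformly from 0.95 m to 1.05 m (the notes give only the range, not the distribution), and the function names `execute_plan` and `final_position_spread` are hypothetical. It shows how the uncertainty in the robot's final position grows as a fixed, precomputed plan gets longer, which is why a plan found by deterministic search cannot be trusted to end where it expects.

```python
import random


def execute_plan(num_steps, rng, low_m=0.95, high_m=1.05):
    """Simulate issuing the command "move 1 m forward" num_steps times.

    Assumption for this sketch: each command moves the robot a distance
    drawn uniformly from [low_m, high_m] meters.
    """
    position_m = 0.0
    for _ in range(num_steps):
        position_m += rng.uniform(low_m, high_m)
    return position_m


def final_position_spread(num_steps, num_trials=1000, seed=0):
    """Run the same plan many times; return the (min, max) final position."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    finals = [execute_plan(num_steps, rng) for _ in range(num_trials)]
    return min(finals), max(finals)


if __name__ == "__main__":
    for steps in (1, 10, 100):
        nearest, farthest = final_position_spread(steps)
        print(f"{steps:3d} commands: final position between "
              f"{nearest:.2f} m and {farthest:.2f} m")
```

A deterministic planner would predict a final position of exactly `num_steps` meters. In the simulation, the gap between the nearest and farthest outcomes widens as more commands are executed without observing the result.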
This note was uploaded on 11/30/2009 for the course CS 221 taught by Professors Koller and Ng during the Winter '09 term at Stanford.