CS221 Lecture notes #8
Reinforcement learning I

In supervised learning, we had a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ for which we had the right answer $y^{(i)}$ for every instance $x^{(i)}$. For example, supervised learning algorithms would learn to drive by predicting the actions of an expert human driver. In this set of notes, we'll talk about a different setting called reinforcement learning, where we don't know the right answers ahead of time; we only know a reward function which tells us the goodness of particular states. (For instance, we might believe that a state where the helicopter is in the air is better than one where it is lying in bits and pieces on the ground.)

In detail, we specify a reward function mapping states of the world to real numbers. The algorithm's goal is to find a sequence of actions which maximizes this reward function over time. This temporal component means the algorithm must do more than make a single one-shot decision. For instance, the helicopter must choose actions which not only allow it to stay in the air at this exact moment, but which also keep it stable enough that it can remain in the air continually.

If the world were completely deterministic, it would be easy to maximize such a reward function using techniques from our lectures on search. However, in many robotics systems, the dynamics are stochastic, in that the same action doesn't lead to exactly the same result every time. For instance, telling a robot to move one meter forward might result in it moving anywhere between 95 and 105 cm forward, due to factors such as slippage of the wheels. Even a small amount of randomness would cause big problems for the deterministic search algorithms we covered earlier in the course.

We will take an approach where we reward the robot for desired outcomes and punish it for undesired ones. One challenge faced by reinforcement learning is the credit assignment problem.
Upon reaching a position with negative reward, it may not be obvious which previous action caused that negative reward. For instance, suppose you are driving and you crash your car. Chances are, you slammed on your brakes shortly before the crash. You wouldn't want to conclude from this that it's a bad idea to ever step on the brakes again. Rather, the crash was probably due to an action you chose much earlier, such as your decision to go 90 MPH down the highway.

In these notes, we present the standard reinforcement learning formalism, known as the Markov decision process (MDP). MDPs allow us to model the (stochastic) dynamics of the world as well as the desired outcomes. As we will see, within this formalism, we can tractably compute the optimal behaviors which maximize the reward function over time....
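
To make the earlier point about stochastic dynamics concrete, here is a minimal Python sketch of the "move one meter forward" robot. It is an illustration under stated assumptions, not part of the formalism: we assume each commanded meter lands uniformly between 95 and 105 cm (the notes only give the range, not the distribution), and the names `step`, `reward`, `run_open_loop`, the 10 m goal, and the 0.1 m tolerance are all hypothetical choices made for this example.

```python
import random


def step(position_m, commanded_m, rng):
    """Stochastic dynamics (assumed uniform): the robot actually moves
    between 95% and 105% of the commanded distance."""
    return position_m + commanded_m * rng.uniform(0.95, 1.05)


def reward(position_m, goal_m=10.0, tolerance_m=0.1):
    """Reward function: maps a state (here, a position in meters) to a
    real number. The goal and tolerance are hypothetical values."""
    return 1.0 if abs(position_m - goal_m) <= tolerance_m else 0.0


def run_open_loop(num_steps, rng):
    """Execute a fixed plan of 1 m moves, as a deterministic planner
    would produce, without ever observing where the robot ended up."""
    position = 0.0
    for _ in range(num_steps):
        position = step(position, 1.0, rng)
    return position


rng = random.Random(0)  # fixed seed so repeated runs match
finals = [run_open_loop(10, rng) for _ in range(1000)]

# The same 10-step plan ends at a different position on each trial.
print("closest stop:", min(finals), "farthest stop:", max(finals))

# Fraction of trials in which the fixed plan ends in a rewarded state.
hit_rate = sum(reward(p) for p in finals) / len(finals)
print("fraction of runs ending within tolerance of the goal:", hit_rate)
```

Under these assumptions, a plan that would be exactly right in a deterministic world ends anywhere between 9.5 and 10.5 m, so a fixed action sequence cannot guarantee a rewarded final state. This is one way to see why the choice of action needs to depend on the state the robot actually reaches.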
This note was uploaded on 12/15/2009 for the course CS 221, taught by Professors Koller and Ng during the Fall '09 term at Stanford.
