cs221-notes8 - CS221 Lecture notes#8 Reinforcement learning...

CS221 Lecture notes #8

Reinforcement learning I

In supervised learning, we had a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ for which we had the “right answer” $y^{(i)}$ for every instance $x^{(i)}$. For example, a supervised learning algorithm would learn to drive by predicting the actions of an expert human driver. In this set of notes, we’ll talk about a different setting called reinforcement learning, where we don’t know the right answers ahead of time; we only know a “reward function” which tells us the goodness of particular states. (For instance, we might believe that a state where the helicopter is in the air is better than one where it is lying in bits and pieces on the ground.)

In detail, we specify a reward function mapping states of the world to real numbers. The algorithm’s goal is to find a sequence of actions which maximizes this reward function over time. This temporal component means the algorithm is not simply making a one-shot decision. For instance, the helicopter must choose actions which not only allow it to stay in the air at this exact moment, but which also keep it stable enough that it can remain in the air continually.

If the world were completely deterministic, it would be easy to maximize such a reward function using techniques from our lectures on search. However, in many robotics systems, the dynamics are stochastic, in that the same action doesn’t always lead to the exact same result every time. For instance, telling a robot to move one meter forward could typically result in it moving anywhere between 95 and 105 cm forward, due to factors such as slippage of the wheels. Even a small amount of randomness would cause big problems for the deterministic search algorithms we covered earlier in the course.

We will take an approach where we “reward” the robot for desired outcomes and “punish” it for undesired ones. One challenge faced by reinforcement learning is the credit assignment problem. Upon reaching a position with negative reward, it may not be obvious which previous action caused that negative reward. For instance, suppose you are driving and you crash your car. Chances are, you slammed on your brakes shortly before the crash. You wouldn’t want to conclude from this that it’s a bad idea to ever step on the brakes again. Rather, the crash was probably due to an action you chose much earlier, such as your decision to go 90 MPH down the highway.

In these notes, we present the standard reinforcement learning formalism, known as the Markov decision process (MDP). MDPs allow us to model the (stochastic) dynamics of the world as well as the desired outcomes. As we will see, within this formalism, we can tractably compute the optimal behaviors which maximize the reward function over time.
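
To make the ideas of a state-based reward function and stochastic dynamics concrete, here is a minimal Python sketch of the "move one meter forward" example. Everything specific in it is an assumption made for illustration and does not come from the notes: the uniform noise model, the goal zone and cliff thresholds, and the function names `reward`, `step`, and `rollout`.

```python
import random

def reward(position_cm):
    """Hypothetical reward function mapping a state (position) to a real number."""
    if position_cm > 320.0:            # assumed "cliff": an undesired outcome
        return -1.0
    if 290.0 <= position_cm <= 320.0:  # assumed goal zone: a desired outcome
        return 1.0
    return 0.0

def step(position_cm, commanded_cm, rng):
    """Stochastic dynamics: the robot covers 95% to 105% of the commanded distance.

    The uniform distribution is an assumption; the notes only give the range.
    """
    return position_cm + commanded_cm * rng.uniform(0.95, 1.05)

def rollout(actions, rng, start_cm=0.0):
    """Execute a fixed action sequence and sum the reward collected along the way."""
    position = start_cm
    total = 0.0
    for commanded_cm in actions:
        position = step(position, commanded_cm, rng)
        total += reward(position)
    return position, total

rng = random.Random(0)           # seeded so that runs are repeatable
plan = [100.0, 100.0, 100.0]     # "move one meter forward" three times
results = [rollout(plan, rng) for _ in range(5)]
for final_position, total_reward in results:
    print(f"final position {final_position:6.1f} cm, total reward {total_reward:+.1f}")
```

In a deterministic world this plan would always end at exactly 300 cm, inside the assumed goal zone. With noise, the same plan can end anywhere between roughly 285 and 315 cm, so a fixed sequence of actions computed in advance no longer guarantees the same reward on every run.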