CS221 Lecture notes #8
Reinforcement learning I
In supervised learning, we had a training set
$\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ for which we had the
“right answer” $y^{(i)}$ for every instance $x^{(i)}$. For example,
supervised learning algorithms would learn to drive by predicting the actions
of an expert human driver. In this set of notes, we’ll talk about a different
setting called reinforcement learning, where we don’t know the right answers
ahead of time; we only know a “reward function” which tells us the
goodness of particular states. (For instance, we might believe that a state
where the helicopter is in the air is better than one where it is lying in bits
and pieces on the ground.)
In detail, we specify a reward function mapping states of the world to
real numbers.
The algorithm’s goal is to find a sequence of actions which
maximizes this reward function over time. This temporal component means
the algorithm does not simply make a one-shot decision. For instance,
the helicopter must choose actions which not only allow it to stay in the air
at this exact moment, but which also keep it stable enough that it can remain
in the air continually.
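
As a rough illustration of a reward function evaluated over time, the sketch below scores two short helicopter trajectories. Everything in it is an assumption made for the example: the state is reduced to a single altitude value, the reward values (+1 airborne, -100 crashed) are arbitrary, and "over time" is taken to mean a plain undiscounted sum. The names `reward` and `total_reward` are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical state: helicopter altitude in meters (an assumption for this sketch).
State = float


def reward(state: State) -> float:
    """Illustrative reward: +1 while airborne, -100 once on the ground."""
    return 1.0 if state > 0.0 else -100.0


def total_reward(
    states: Sequence[State],
    reward_fn: Callable[[State], float] = reward,
) -> float:
    """Sum the reward over every visited state (undiscounted sum, an assumption)."""
    return sum(reward_fn(s) for s in states)


# Two made-up trajectories of equal length.
steady = [10.0, 10.0, 10.0, 10.0]  # stays airborne throughout
risky = [10.0, 12.0, 5.0, 0.0]     # airborne at first, then hits the ground

steady_score = total_reward(steady)
risky_score = total_reward(risky)
```

Both trajectories look equally good at the first time step; only the score accumulated over the whole sequence separates them, which is why a one-shot decision is not enough.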
If the world were completely deterministic, it would be easy to maximize
such a reward function using techniques from our lectures on search. However,
in many robotics systems, the dynamics are stochastic, in that the
same action doesn’t always lead to the same result. For
instance, telling a robot to move one meter forward could typically result in
it moving anywhere between 95 and 105 cm forward, due to factors such as
slippage of the wheels. Even a small amount of randomness would cause big
problems for the deterministic search algorithms we covered earlier in the
course.
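
To make the effect of this randomness concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that each commanded 100 cm move lands uniformly between 95 and 105 cm (the notes only give the range, not the distribution), and it repeats a fixed ten-step plan many times without looking at the outcome of each step. The function names are hypothetical.

```python
import random


def noisy_move(commanded_cm: float, rng: random.Random) -> float:
    """Actual distance travelled: uniform within +/-5% of the command (assumed noise model)."""
    return commanded_cm * rng.uniform(0.95, 1.05)


def run_fixed_plan(num_steps: int, rng: random.Random) -> float:
    """Execute 'move 100 cm forward' num_steps times, ignoring where each step actually lands."""
    position_cm = 0.0
    for _ in range(num_steps):
        position_cm += noisy_move(100.0, rng)
    return position_cm


rng = random.Random(0)  # fixed seed so repeated runs of this script agree
final_positions = [run_fixed_plan(10, rng) for _ in range(1000)]

# How far apart the best and worst runs of the *same* plan end up.
spread_cm = max(final_positions) - min(final_positions)
```

A deterministic search would predict a single final position of exactly 1000 cm. Under the assumed noise model the same plan ends at a different place on each run, so a plan computed in advance cannot rely on the robot being where the search expected.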
We will take an approach where we “reward” the robot for desired outcomes
and “punish” it for undesired ones. One challenge faced by reinforcement
learning is the credit assignment problem. Upon reaching a
position with negative reward, it may not be obvious which previous action
caused that negative reward. For instance, suppose you are driving and
you crash your car. Chances are, you slammed on your brakes shortly before
the crash. You wouldn’t want to conclude from this that it’s a bad idea to
ever step on the brakes again.
Rather, the crash was probably due to an
action you chose much earlier, such as your decision to go 90 MPH down the
highway.
In these notes, we present the standard reinforcement learning formalism,
known as the Markov decision process (MDP). MDPs allow us to model
the (stochastic) dynamics of the world as well as the desired outcomes. As
we will see, within this formalism, we can tractably compute the optimal
behaviors which maximize the reward function over time.
 Fall '09
 KOLLER,NG
 Algorithms, optimal policy, Markov decision process, total payoff, policy iteration
