CS229 Lecture notes
Andrew Ng
Part XIII
Reinforcement Learning and Control
We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make their outputs mimic the labels $y$ given in the training set. In that setting, the labels gave an unambiguous “right answer” for each of the inputs $x$.
In contrast, for many sequential decision making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the “correct” actions to take are to make it walk, and so do not know how to provide explicit supervision for a learning algorithm to try to mimic.
In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm’s job to figure out how to choose actions over time so as to obtain large rewards.
Reinforcement learning has been successful in applications as diverse as autonomous helicopter flight, robot legged locomotion, cellphone network routing, marketing strategy selection, factory control, and efficient webpage indexing. Our study of reinforcement learning will begin with a definition of Markov decision processes (MDPs), which provide the formalism in which RL problems are usually posed.
1 Markov decision processes
A Markov decision process is a tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where:
• $S$ is a set of states. (For example, in autonomous helicopter flight, $S$ might be the set of all possible positions and orientations of the helicopter.)
• $A$ is a set of actions. (For example, the set of all possible directions in which you can push the helicopter’s control sticks.)
• $P_{sa}$ are the state transition probabilities. For each state $s \in S$ and action $a \in A$, $P_{sa}$ is a distribution over the state space. We’ll say more about this later, but briefly, $P_{sa}$ gives the distribution over what states we will transition to if we take action $a$ in state $s$.
• $\gamma \in [0, 1)$ is called the discount factor.
• $R : S \times A \mapsto \mathbb{R}$ is the reward function. (Rewards are sometimes also written as a function of a state $s$ only, in which case we would have $R : S \mapsto \mathbb{R}$.)
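To make the definition concrete, here is a minimal sketch of how a finite MDP could be represented in code. Everything below (the class name, field names, and array layout) is our own illustrative choice, not something from the notes; it simply stores the tuple $(S, A, \{P_{sa}\}, \gamma, R)$ with states and actions encoded as integer indices.

```python
import numpy as np

class FiniteMDP:
    """Illustrative container for a finite MDP (S, A, {P_sa}, gamma, R).

    States and actions are integer indices 0..n_states-1 and
    0..n_actions-1. All names here are hypothetical, not from the notes.
    """

    def __init__(self, P, R, gamma):
        # P[s, a] is the distribution P_sa over successor states:
        # P has shape (n_states, n_actions, n_states), and each
        # P[s, a] sums to 1.
        # R[s, a] is the reward for taking action a in state s.
        self.P = np.asarray(P, dtype=float)
        self.R = np.asarray(R, dtype=float)
        self.n_states, self.n_actions, _ = self.P.shape
        assert 0.0 <= gamma < 1.0, "discount factor lies in [0, 1)"
        self.gamma = gamma
```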
The dynamics of an MDP proceed as follows: we start in some state $s_0$, and get to choose some action $a_0 \in A$ to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state $s_1$, drawn according to $s_1 \sim P_{s_0 a_0}$. Then, we get to pick another action $a_1$.
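As a rough illustration of these dynamics, the sketch below rolls out a trajectory $s_0, a_0, s_1, a_1, \ldots$ for a fixed number of steps, drawing each successor state from $P_{s_t a_t}$. It reuses the hypothetical FiniteMDP container above; `policy` is any function mapping a state index to an action index, and the `horizon` argument is our own addition to truncate the (otherwise unbounded) rollout.

```python
def sample_trajectory(mdp, s0, policy, horizon, seed=None):
    """Simulate s_0, a_0, s_1, a_1, ... by repeatedly sampling
    s_{t+1} ~ P_{s_t a_t}. Illustrative sketch, not from the notes."""
    rng = np.random.default_rng(seed)
    s = s0
    states, actions, rewards = [s0], [], []
    for _ in range(horizon):
        a = policy(s)                # choose an action a_t in state s_t
        rewards.append(mdp.R[s, a])  # collect reward R(s_t, a_t)
        # Transition: draw the successor state from P_{s_t a_t}.
        s = int(rng.choice(mdp.n_states, p=mdp.P[s, a]))
        actions.append(a)
        states.append(s)
    return states, actions, rewards
```

For example, `sample_trajectory(mdp, s0=0, policy=lambda s: 0, horizon=10)` would roll out ten steps of the “always take action 0” policy from state 0.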