CS 4758/6758: Robot Learning: Homework 6 preview
Due: May 6th
1 Reinforcement Learning: The Inverted Pendulum (50 pts)
In this problem, you will apply reinforcement learning to automatically design a policy for a
diﬃcult control task, without ever using any explicit knowledge of the dynamics of the underlying
system.
[Figure: a thin pole at angle θ from the vertical, hinged to a cart at horizontal position x on a table.]
The problem we will consider is the inverted pendulum, or pole-balancing, problem.¹
Consider the ﬁgure shown. A thin pole is connected via
a free hinge to a cart, which can move laterally on a smooth
table surface. The controller is said to have failed if either the
angle of the pole deviates by more than a certain amount from
the vertical position (i.e., if the pole falls over), or if the cart’s
position goes out of bounds (i.e., if it falls oﬀ the end of the
table). Our objective is to develop a controller to balance the
pole with these constraints, by appropriately having the cart accelerate left and right.
We have written a simple Matlab simulator for this problem. The simulation proceeds in
discrete time cycles (steps). The state of the cart and pole at any time is completely characterized
by 4 parameters: the cart position x, the cart velocity ẋ, the angle of the pole θ measured as its deviation from the vertical position, and the angular velocity of the pole θ̇. Since it is simpler to consider reinforcement learning in a discrete state space, we have approximated the state space by a discretization that maps a state vector (x, ẋ, θ, θ̇) into a number from 1 to NUM_STATES. Your learning algorithm will need to deal only with this discretized representation of the states.
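The actual discretization routine is part of the provided Matlab code, but the idea can be sketched in Python. The bin edges below are made-up illustrations, not the assignment's actual boundaries; only the mixed-radix mapping from four bin indices to a single 1-based state number is the point.

```python
import numpy as np

# Assumed bin edges per dimension (illustrative only).
X_BINS = np.array([-1.5, 0.0, 1.5])        # cart position x
XDOT_BINS = np.array([-0.5, 0.5])          # cart velocity x-dot
THETA_BINS = np.array([-0.1, 0.0, 0.1])    # pole angle theta (radians)
THETADOT_BINS = np.array([-0.5, 0.5])      # angular velocity theta-dot

# n edges produce n + 1 bins per dimension.
NUM_STATES = 4 * 3 * 4 * 3

def discretize(x, x_dot, theta, theta_dot):
    """Map a continuous state vector to a single state index in 1..NUM_STATES."""
    idx = 0
    for value, bins in ((x, X_BINS), (x_dot, XDOT_BINS),
                        (theta, THETA_BINS), (theta_dot, THETADOT_BINS)):
        # np.digitize returns the bin index of `value` among the edges.
        idx = idx * (len(bins) + 1) + int(np.digitize(value, bins))
    return idx + 1  # 1-based, matching the Matlab convention
```

A finer grid gives a larger NUM_STATES and a more faithful model, at the cost of needing more experience to estimate transitions for every state.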
At every time step, the controller must choose one of two actions: push (accelerate) the cart right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These
are represented as actions 1 and 2 respectively in the code. When the action choice is made, the
simulator updates the state parameters according to the underlying dynamics, and provides a new
discretized state.
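Your learner never sees these dynamics directly, but for intuition, the kind of update the simulator performs can be sketched with the textbook cart-pole equations of Barto, Sutton & Anderson (1983) and Euler integration. All constants below are illustrative assumptions; the provided Matlab simulator may use different values.

```python
import math

GRAVITY = 9.8          # m/s^2
CART_MASS = 1.0        # kg (assumed)
POLE_MASS = 0.1        # kg (assumed)
TOTAL_MASS = CART_MASS + POLE_MASS
POLE_HALF_LENGTH = 0.5 # m (assumed)
FORCE_MAG = 10.0       # N applied per push (assumed)
DT = 0.02              # seconds per simulation step (assumed)

def step(state, action):
    """One Euler step of textbook cart-pole dynamics.

    action is 1 (push right) or 2 (push left), as in the assignment's code.
    """
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLE_MASS * POLE_HALF_LENGTH
            * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLE_MASS * POLE_HALF_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return (x + DT * x_dot,
            x_dot + DT * x_acc,
            theta + DT * theta_dot,
            theta_dot + DT * theta_acc)
```

The continuous output of such a step is what gets passed through the discretization before your learner sees it.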
We will assume that the reward R(s) is a function of the current state only. When the pole
angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given, and
the system is reinitialized randomly. At all other times, the reward is zero. Your program must
learn to balance the pole using only the state transitions and rewards observed.
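Learning from observed transitions and rewards amounts to simple bookkeeping: count how often each (state, action) pair led to each successor, and average the rewards seen in each state. The sketch below shows this in Python; the placeholder NUM_STATES, the 1-based indexing, and the uniform fallback for unvisited (state, action) pairs are assumptions, not requirements of the assignment.

```python
import numpy as np

NUM_STATES = 163   # placeholder; use the value defined in the provided code
NUM_ACTIONS = 2

trans_counts = np.zeros((NUM_STATES, NUM_ACTIONS, NUM_STATES))
reward_sum = np.zeros(NUM_STATES)
state_visits = np.zeros(NUM_STATES)

def record_transition(s, a, s_next, r):
    """Log one observed step: from state s, action a led to s_next with reward r."""
    trans_counts[s - 1, a - 1, s_next - 1] += 1
    # R(s) depends on the state only, so credit the reward to the state reached.
    reward_sum[s_next - 1] += r
    state_visits[s_next - 1] += 1

def estimate_model():
    """Return estimated transition probabilities P[s, a, s'] and rewards R[s]."""
    totals = trans_counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform distribution (an assumption).
    P = np.where(totals > 0, trans_counts / np.maximum(totals, 1), 1.0 / NUM_STATES)
    R = np.where(state_visits > 0, reward_sum / np.maximum(state_visits, 1), 0.0)
    return P, R
```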
The files for this problem are in hw6p1.zip. Most of the code has already been written for you, and you need to make changes only to control.m in the places specified. This file can be run in Matlab to show a display and to plot a learning curve at the end. Read the comments at the top of the file for more details on the working of the simulation.²
To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities
and rewards) for the underlying MDP, solve Bellman’s equations for this estimated MDP to obtain
a value function, and act greedily with respect to this value function.
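The planning step described above can be sketched as value iteration on the estimated model, followed by a greedy action choice. This assumes P[s, a, s'] and R[s] are the current model estimates; the discount factor and convergence tolerance below are illustrative choices, not values prescribed by the assignment.

```python
import numpy as np

def value_iteration(P, R, gamma=0.995, tol=1e-6):
    """Solve Bellman's optimality equation for the estimated MDP.

    Iterates V(s) <- R(s) + gamma * max_a sum_{s'} P(s' | s, a) V(s')
    until the update changes no state's value by more than tol.
    """
    V = np.zeros(R.shape[0])
    while True:
        # P @ V has shape (num_states, num_actions): one Q-like value per action.
        V_new = R + gamma * (P @ V).max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

def greedy_action(P, V, s):
    """Greedy 1-based action for 1-based state s under value estimate V."""
    return int(np.argmax(P[s - 1] @ V)) + 1
```

Re-solving for V after each model update is cheap here because the state space is small and each solve can warm-start from the previous value function.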
Brieﬂy, you will maintain a current model of the MDP and a current estimate of the value
function. Initially, each state has estimated reward zero, and the estimated transition probabilities
¹ The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html
² Note that the routine for drawing the cart does not work in Octave. Setting min_trial_length_to_start_display to a very large number disables it.