1. (Problem 1 in the attached file)
Reinforcement Learning: The Inverted Pendulum (50 pts)
In this problem, you will apply reinforcement learning to automatically design a policy for a
difficult control task, without ever using any explicit knowledge of the dynamics of the underlying
system. The problem we will consider is the inverted pendulum or
the pole-balancing problem.[1]
Consider the figure shown. A thin pole is connected via
a free hinge to a cart, which can move laterally on a smooth
table surface. The controller is said to have failed if either the
angle of the pole deviates by more than a certain amount from
the vertical position (i.e., if the pole falls over), or if the cart's
position goes out of bounds (i.e., if it falls off the end of the
table). Our objective is to develop a controller to balance the
pole with these constraints, by appropriately having the cart accelerate left and right.
We have written a simple Matlab simulator for this problem. The simulation proceeds in
discrete time cycles (steps). The state of the cart and pole at any time is completely characterized
by 4 parameters: the cart position $x$, the cart velocity $\dot{x}$, the angle of the pole $\theta$ measured as its
deviation from the vertical position, and the angular velocity of the pole $\dot{\theta}$. Since it'd be simpler to
consider reinforcement learning in a discrete state space, we have approximated the state space by
a discretization that maps a state vector $(x, \dot{x}, \theta, \dot{\theta})$ into a number from 1 to NUM_STATES. Your
learning algorithm will need to deal only with this discretized representation of the states.
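To make the idea of a discretized state concrete, here is a minimal Python sketch (not the Matlab simulator's own code). The bin edges, the resulting value of `NUM_STATES`, and the function name `discretize` are all assumptions chosen for illustration; the simulator's actual discretization is not given in the problem statement.

```python
from bisect import bisect_right

# Hypothetical bin edges, chosen only for illustration; the simulator's
# actual discretization is not specified in the problem statement.
X_EDGES = [-1.5, 1.5]             # 3 bins for cart position
X_DOT_EDGES = [-0.5, 0.5]         # 3 bins for cart velocity
THETA_EDGES = [-0.1, 0.0, 0.1]    # 4 bins for pole angle (radians)
THETA_DOT_EDGES = [-0.5, 0.5]     # 3 bins for angular velocity

BINS = [len(e) + 1 for e in (X_EDGES, X_DOT_EDGES, THETA_EDGES, THETA_DOT_EDGES)]
NUM_STATES = BINS[0] * BINS[1] * BINS[2] * BINS[3]   # 3 * 3 * 4 * 3 = 108 here


def discretize(x, x_dot, theta, theta_dot):
    """Map a continuous state vector to an integer in 1..NUM_STATES."""
    indices = [
        bisect_right(X_EDGES, x),
        bisect_right(X_DOT_EDGES, x_dot),
        bisect_right(THETA_EDGES, theta),
        bisect_right(THETA_DOT_EDGES, theta_dot),
    ]
    state = 0
    for idx, n_bins in zip(indices, BINS):
        state = state * n_bins + idx   # mixed-radix encoding of the four bin indices
    return state + 1                   # 1-based, as in the problem statement
```

Two nearby configurations, such as `discretize(0.0, 0.0, 0.05, 0.0)` and `discretize(0.01, 0.0, 0.06, 0.0)`, land in the same discrete state. This is why the transitions between discrete states look random even though the underlying dynamics are deterministic.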
At every time step, the controller must choose one of two actions - push (accelerate) the cart
right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These
are represented as actions 1 and 2 respectively in the code. When the action choice is made, the
simulator updates the state parameters according to the underlying dynamics, and provides a new
state.
We will assume that the reward R(s) is a function of the current state only. When the pole
angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given, and
the system is reinitialized randomly. At all other times, the reward is zero. Your program must
learn to balance the pole using only the state transitions and rewards observed.
The files for this problem are in hw6p1.zip. Most of the code has already been written for
you, and you need to make changes only to control.m in the places specified. This file can be run
in Matlab to show a display and to plot a learning curve at the end. Read the comments at the
top of the file for more details on the working of the simulation.[2]
To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities
and rewards) for the underlying MDP, solve Bellman's equations for this estimated MDP to obtain
a value function, and act greedily with respect to this value function.
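Acting greedily with respect to a value function can be sketched as follows, in Python rather than the Matlab of control.m. The function name `greedy_action`, the nested-list layout `trans_probs[a][s][s2]`, and the 0-based state index are assumptions made for this illustration. Ties go to action 1 here, which is only one possible choice.

```python
def greedy_action(state, trans_probs, value):
    """Return 1 (push right) or 2 (push left), whichever has the larger
    expected next-state value under the estimated model.

    trans_probs[a][s][s2] is the estimated probability of moving from
    state s to state s2 under action index a (0-based, so index 0 is action 1).
    value[s2] is the current value estimate of state s2.
    """
    expected = [
        sum(p * v for p, v in zip(trans_probs[a][state], value))
        for a in range(len(trans_probs))
    ]
    # max() returns the first maximal index, so ties go to action 1.
    best = max(range(len(expected)), key=lambda a: expected[a])
    return best + 1   # convert the 0-based index to the action labels 1 and 2
```

Because the reward depends only on the current state, it is the same for both actions, so comparing the expected next-state values is enough to pick the greedy action.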
Briefly, you will maintain a current model of the MDP and a current estimate of the value
function. Initially, each state has estimated reward zero, and the estimated transition probabilities
are uniform (equally likely to end up in any other state).
[1] The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html
[2] Note that the routine for drawing the cart does not work in Octave. Setting min_trial_length_to_start_display to a
very large number disables it.
During the simulation, you must choose actions at each time step according to some current
policy. As the program goes along taking actions, it will gather observations on transitions and
rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to
update the whole estimated MDP after every observation, we will store the state transitions and
reward observations each time, and update the model and value function/policy only periodically.
Thus, you must maintain counts of the total number of times the transition from state $s_i$ to state
$s_j$ using action $a$ has been observed (similarly for the rewards). Note that the rewards at any state
are deterministic, but the state transitions are not because of the discretization of the state space
(several different but close configurations may map onto the same discretized state).
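The bookkeeping described above might look like the following Python sketch. The array names, the small `NUM_STATES` value, and the choice to credit each reward to the state just entered are assumptions made for illustration; control.m may organize this differently. States and actions are 0-based here, whereas the problem statement numbers them from 1.

```python
NUM_STATES = 4    # hypothetical small value, for illustration only
NUM_ACTIONS = 2

# trans_counts[a][s][s2]: times action a taken in state s led to state s2.
trans_counts = [
    [[0] * NUM_STATES for _ in range(NUM_STATES)]
    for _ in range(NUM_ACTIONS)
]
reward_sums = [0.0] * NUM_STATES   # total reward observed in each state
reward_counts = [0] * NUM_STATES   # number of reward observations per state


def record(state, action, new_state, reward):
    """Store one observed transition and the reward seen on arrival."""
    trans_counts[action][state][new_state] += 1
    # Assumption: the reward R(new_state) is observed when new_state is entered.
    reward_sums[new_state] += reward
    reward_counts[new_state] += 1
```

Only the counts are updated at every time step; the model and the value function are recomputed from them later, which keeps the per-step cost small.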
Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition
probabilities and rewards as the average of the observed values (if any). Your program must then
use value iteration to solve Bellman's equations on the estimated MDP, to get the value function
and new optimal policy for the new model. For value iteration, use a convergence criterion that
checks if the maximum absolute change in the value function on an iteration exceeds some specified
tolerance.
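A minimal Python sketch of the re-estimation and value-iteration step follows. The function names, the nested-list data layout, and the use of `gamma` and `tolerance` as plain arguments are assumptions made for this illustration, not the layout used in control.m. State-action pairs with no observations fall back to the initial uniform estimate, and unobserved states keep reward zero.

```python
def estimate_model(trans_counts, reward_sums, reward_counts):
    """Turn observation counts into estimated transition probabilities and rewards."""
    num_actions = len(trans_counts)
    num_states = len(reward_sums)
    probs = []
    for a in range(num_actions):
        rows = []
        for s in range(num_states):
            total = sum(trans_counts[a][s])
            if total > 0:
                rows.append([c / total for c in trans_counts[a][s]])
            else:
                # No data for (s, a): keep the initial uniform estimate.
                rows.append([1.0 / num_states] * num_states)
        probs.append(rows)
    rewards = [
        reward_sums[s] / reward_counts[s] if reward_counts[s] > 0 else 0.0
        for s in range(num_states)
    ]
    return probs, rewards


def value_iteration(probs, rewards, gamma, tolerance, value):
    """Repeat Bellman updates until the largest change falls below tolerance.

    Returns the new value function and the number of iterations performed.
    """
    iterations = 0
    while True:
        iterations += 1
        new_value = [
            rewards[s] + gamma * max(
                sum(p * v for p, v in zip(probs[a][s], value))
                for a in range(len(probs))
            )
            for s in range(len(rewards))
        ]
        max_change = max(abs(n - o) for n, o in zip(new_value, value))
        value = new_value
        if max_change < tolerance:
            return value, iterations
```

Starting each solve from the previous value function, rather than from zeros, is what makes the iteration count informative: once the model stops changing, the solve finishes after a single iteration.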
Finally, assume that the whole learning procedure has converged once several consecutive
attempts (defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman's equation all
converge in the first iteration. Intuitively, this indicates that the estimated model has stopped
changing significantly.
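The stopping rule can be sketched in Python as a simple counter. The value given to `NO_LEARNING_THRESHOLD` below and both function names are assumptions made for illustration; the actual threshold and the exact comparison are set in control.m.

```python
NO_LEARNING_THRESHOLD = 20   # hypothetical value; the real one is set in control.m


def update_no_learning_count(consecutive, iterations_used):
    """Count consecutive value-iteration solves that converged in one iteration.

    Any solve that needed more than one iteration resets the count to zero.
    """
    return consecutive + 1 if iterations_used == 1 else 0


def learning_has_converged(consecutive, threshold=NO_LEARNING_THRESHOLD):
    """True once enough consecutive solves converged in their first iteration."""
    return consecutive >= threshold
```

For example, a sequence of solves that needed 3, 1, 1 and 1 iterations leaves the counter at 3: the first solve resets it, and each of the following ones increments it.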
The code outline for this problem is already in control.m, and you need to write code fragments
only at the places specified in the file. There are several details (convergence criteria etc.) that are
also explained inside the code. Use a discount factor of
Implement the reinforcement learning algorithm as specified, and run it. How many trials (how
many times did the pole fall over or the cart fall off) did it take before the algorithm converged?
Hand in your implementation of control.m, and the plot it produces.
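In this problem, we show that value iteration for an MDP is guaranteed to find the optimal policy. Consider an MDP
with finite state and action spaces, and discount factor $\gamma$. Let $B$ be the Bellman update operator
with $V$ a vector of values for each state. I.e., if $V' = B(V)$, then
$V'(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s')$,
where $P_{sa}(s')$ is the probability of moving from state $s$ to state $s'$ under action $a$.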
(c) We say that $V$ is a fixed point of $B$ if $B(V) = V$. Using the fact that the Bellman update
operator is a
$\gamma$-contraction in the max-norm, prove that $B$ has at most one fixed point, i.e.,
that there is at most one solution to the Bellman equations. You may assume that $B$ has at
least one fixed point.
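One way the argument for part (c) might go is sketched below. It assumes $\gamma < 1$ (not stated explicitly above) and takes the contraction property to mean $\|B(V_1) - B(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty$ for all $V_1, V_2$. Suppose $V_1$ and $V_2$ are both fixed points of $B$. Then:

```latex
\begin{align*}
\|V_1 - V_2\|_\infty
  &= \|B(V_1) - B(V_2)\|_\infty
  && \text{(both are fixed points of } B\text{)} \\
  &\le \gamma \, \|V_1 - V_2\|_\infty
  && \text{(} B \text{ is a } \gamma\text{-contraction in the max-norm)} \\
\Longrightarrow \quad (1 - \gamma)\, \|V_1 - V_2\|_\infty &\le 0 .
\end{align*}
```

Since $1 - \gamma > 0$ and a norm is never negative, this forces $\|V_1 - V_2\|_\infty = 0$, so $V_1 = V_2$. Any two fixed points therefore coincide, which means $B$ has at most one.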