
# CS 4758/6758: Robot Learning: Homework 6 (Due: May 6th)

Please look at the attached file for a better understanding of the problems: (problem 1 and problem 2c)

1. (Problem 1 in the attached file)

Reinforcement Learning: The Inverted Pendulum (50 pts)

In this problem, you will apply reinforcement learning to automatically design a policy for a difficult control task, without ever using any explicit knowledge of the dynamics of the underlying system.

The problem we will consider is the inverted pendulum or the pole-balancing problem [1]. Consider the figure shown. A thin pole is connected via a free hinge to a cart, which can move laterally on a smooth table surface. The controller is said to have failed if either the angle of the pole deviates by more than a certain amount from the vertical position (i.e., if the pole falls over), or if the cart's position goes out of bounds (i.e., if it falls off the end of the table). Our objective is to develop a controller to balance the pole with these constraints, by appropriately having the cart accelerate left and right.

[Figure: the cart and pole, labeled with the cart position x and the pole angle θ.]

We have written a simple Matlab simulator for this problem. The simulation proceeds in discrete time cycles (steps). The state of the cart and pole at any time is completely characterized by 4 parameters: the cart position $x$, the cart velocity $\dot{x}$, the angle of the pole $\theta$ measured as its deviation from the vertical position, and the angular velocity of the pole $\dot{\theta}$. Since it would be simpler to consider reinforcement learning in a discrete state space, we have approximated the state space by a discretization that maps a state vector $(x, \dot{x}, \theta, \dot{\theta})$ into a number from 1 to NUM_STATES. Your learning algorithm will need to deal only with this discretized representation of the states.

At every time step, the controller must choose one of two actions: push (accelerate) the cart right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These are represented as actions 1 and 2 respectively in the code. When the action choice is made, the simulator updates the state parameters according to the underlying dynamics, and provides a new discretized state.
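
To make the discretization concrete, here is a minimal sketch in Python (the homework's simulator is in Matlab and its actual bucketing is not shown in this text). The bin edges and the function name `discretize` are hypothetical; only the idea of mapping $(x, \dot{x}, \theta, \dot{\theta})$ to a number from 1 to NUM_STATES comes from the problem statement.

```python
from bisect import bisect_right

# Hypothetical bin edges for each state variable (NOT the homework's values).
X_EDGES = [-1.5, 1.5]            # cart position
X_DOT_EDGES = [-0.5, 0.5]        # cart velocity
THETA_EDGES = [-0.1, 0.0, 0.1]   # pole angle, in radians
THETA_DOT_EDGES = [-0.9, 0.9]    # pole angular velocity

ALL_EDGES = [X_EDGES, X_DOT_EDGES, THETA_EDGES, THETA_DOT_EDGES]

# The number of discrete states is the product of the bucket counts.
NUM_STATES = 1
for edges in ALL_EDGES:
    NUM_STATES *= len(edges) + 1


def discretize(x, x_dot, theta, theta_dot):
    """Map a continuous state vector to an index in 1..NUM_STATES."""
    index = 0
    for value, edges in zip((x, x_dot, theta, theta_dot), ALL_EDGES):
        bucket = bisect_right(edges, value)        # 0 .. len(edges)
        index = index * (len(edges) + 1) + bucket  # mixed-radix encoding
    return index + 1  # 1-based, as in the problem statement
```

Two nearby configurations, such as `discretize(0.01, 0, 0.05, 0)` and `discretize(0.02, 0, 0.06, 0)`, land in the same bucket, which is why transitions between discretized states look random even though the underlying dynamics are not.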

We will assume that the reward $R(s)$ is a function of the current state only. When the pole angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given, and the system is reinitialized randomly. At all other times, the reward is zero. Your program must learn to balance the pole using only the state transitions and rewards observed.

The files for this problem are in hw6p1.zip. Most of the code has already been written for you, and you need to make changes only to control.m in the places specified. This file can be run in Matlab to show a display and to plot a learning curve at the end. Read the comments at the top of the file for more details on the working of the simulation [2].

To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities and rewards) for the underlying MDP, solve Bellman's equations for this estimated MDP to obtain a value function, and act greedily with respect to this value function.
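
Acting greedily with respect to a value function can be sketched as follows, in Python rather than the Matlab of control.m. The function name `greedy_action`, the nested-list layout of `transition_probs`, and the tie-breaking rule (the lower-numbered action wins) are all assumptions made for illustration.

```python
def greedy_action(state, transition_probs, value):
    """Return the action (1 or 2) with the highest expected next-state value.

    transition_probs[s][a][s2] is the estimated probability of moving from
    state s to state s2 under action a. Indices here are 0-based, so the
    returned action is shifted back to the 1-based numbering of the text.
    """
    best_action, best_score = None, float("-inf")
    for a, row in enumerate(transition_probs[state]):
        # Expected value of the next state if action a is taken.
        score = sum(p * v for p, v in zip(row, value))
        if score > best_score:  # ties keep the lower-numbered action
            best_action, best_score = a + 1, score
    return best_action
```

Since $R(s)$ depends only on the current state, it is the same for both actions and can be left out of the comparison.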

Briefly, you will maintain a current model of the MDP and a current estimate of the value function. Initially, each state has estimated reward zero, and the estimated transition probabilities are uniform (equally likely to end up in any other state).

[1] The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html
[2] Note that the routine for drawing the cart does not work in Octave. Setting min_trial_length_to_start_display to a very large number disables it.

During the simulation, you must choose actions at each time step according to some current policy. As the program goes along taking actions, it will gather observations on transitions and rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to update the whole estimated MDP after every observation, we will store the state transitions and reward observations each time, and update the model and value function/policy only periodically. Thus, you must maintain counts of the total number of times the transition from state $s_i$ to state $s_j$ using action $a$ has been observed (similarly for the rewards). Note that the rewards at any state are deterministic, but the state transitions are not because of the discretization of the state space (several different but close configurations may map onto the same discretized state).
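
One possible way to hold the initial model and the running counts is sketched below in Python. The dictionary layout and the names `new_model` and `record_observation` are hypothetical, and crediting the reward to the state just arrived in is an assumption; control.m may organize this differently.

```python
def new_model(num_states, num_actions=2):
    """Start with zero rewards, uniform transitions, and empty counts."""
    uniform = 1.0 / num_states
    return {
        "transition_counts": [[[0] * num_states for _ in range(num_actions)]
                              for _ in range(num_states)],
        "transition_probs": [[[uniform] * num_states for _ in range(num_actions)]
                             for _ in range(num_states)],
        "reward_sums": [0.0] * num_states,
        "reward_visits": [0] * num_states,
        "rewards": [0.0] * num_states,
    }


def record_observation(model, state, action, new_state, reward):
    """Store one transition and the reward seen on arriving in new_state.

    state and new_state are in 1..num_states and action is 1 or 2, as in
    the problem statement; they are shifted to 0-based list indices here.
    """
    s, a, s2 = state - 1, action - 1, new_state - 1
    model["transition_counts"][s][a][s2] += 1
    model["reward_sums"][s2] += reward
    model["reward_visits"][s2] += 1
```

Recording an observation only touches the counts; the probability and reward estimates stay as they were until the model is re-estimated.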

Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition probabilities and rewards as the average of the observed values (if any). Your program must then use value iteration to solve Bellman's equations on the estimated MDP, to get the value function and new optimal policy for the new model. For value iteration, use a convergence criterion that checks if the maximum absolute change in the value function on an iteration exceeds some specified tolerance.
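
The re-estimation and value-iteration steps could look roughly like this in Python. This is a sketch under stated assumptions, not the contents of control.m: the function names, the argument layout, the choice to keep the previous estimate where nothing was observed, and the default tolerance of 0.01 are all hypothetical.

```python
def estimate_model(transition_counts, reward_sums, reward_visits,
                   old_probs, old_rewards):
    """Average the observations; keep the old estimate where there are none."""
    probs, rewards = [], []
    for s, per_action in enumerate(transition_counts):
        probs.append([])
        for a, counts in enumerate(per_action):
            total = sum(counts)
            if total > 0:
                probs[s].append([c / total for c in counts])
            else:
                probs[s].append(list(old_probs[s][a]))
        if reward_visits[s] > 0:
            rewards.append(reward_sums[s] / reward_visits[s])
        else:
            rewards.append(old_rewards[s])
    return probs, rewards


def value_iteration(probs, rewards, value, gamma=0.995, tolerance=0.01):
    """Apply Bellman updates until the largest change is within tolerance.

    Returns the new value function and the number of iterations used.
    """
    iterations = 0
    while True:
        iterations += 1
        new_value = [
            rewards[s] + gamma * max(
                sum(p * v for p, v in zip(row, value)) for row in probs[s]
            )
            for s in range(len(value))
        ]
        change = max(abs(n - o) for n, o in zip(new_value, value))
        value = new_value
        if change <= tolerance:
            return value, iterations
```

Returning the iteration count alongside the values makes it easy to tell whether a run converged in its first iteration, which the overall stopping rule relies on.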

Finally, assume that the whole learning procedure has converged once several consecutive attempts (defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman's equation all converge in the first iteration. Intuitively, this indicates that the estimated model has stopped changing significantly.
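
The stopping rule amounts to counting consecutive value-iteration runs that finish in one iteration. A minimal Python sketch follows; the function name `trials_until_converged` and the list-of-iteration-counts input are hypothetical, and `threshold` stands in for NO_LEARNING_THRESHOLD, whose actual value is set in control.m and not given here.

```python
def trials_until_converged(iterations_per_trial, threshold):
    """Return the 1-based trial at which learning is declared converged.

    iterations_per_trial[k] is the number of value-iteration sweeps needed
    after failure k+1. Returns None if the threshold is never reached.
    """
    consecutive = 0
    for trial, iterations in enumerate(iterations_per_trial, start=1):
        # A run that needed more than one sweep resets the streak.
        consecutive = consecutive + 1 if iterations == 1 else 0
        if consecutive >= threshold:
            return trial
    return None
```

For example, with a threshold of 3, the sequence `[5, 1, 1, 3, 1, 1, 1]` is declared converged at trial 7, because the streak of one-iteration runs is broken at trial 4.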

The code outline for this problem is already in control.m, and you need to write code fragments only at the places specified in the file. There are several details (convergence criteria etc.) that are also explained inside the code. Use a discount factor of $\gamma = 0.995$.

Implement the reinforcement learning algorithm as specified, and run it. How many trials (how many times did the pole fall over or the cart fall off) did it take before the algorithm converged? Hand in your implementation of control.m, and the plot it produces.

2c.

In this problem, we show that an MDP is guaranteed to find the optimal policy. Consider an MDP with finite state and action spaces, and discount factor $\gamma$. Let $B$ be the Bellman update operator with $V$ a vector of values for each state. I.e., if $V' = B(V)$, then

$$V'(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s')$$

(c) We say that $V$ is a fixed point of $B$ if $B(V) = V$. Using the fact that the Bellman update operator is a $\gamma$-contraction in the max-norm, prove that $B$ has at most one fixed point, i.e., that there is at most one solution to the Bellman equations. You may assume that $B$ has at least one fixed point.
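
One way the argument for (c) might go is sketched below. It assumes that $0 \le \gamma < 1$ (the text does not state the range of the discount factor) and reads "$\gamma$-contraction in the max-norm" as $\|B(V_1) - B(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty$ for all $V_1, V_2$. The names $V_1$ and $V_2$ for the two candidate fixed points are introduced only for this sketch.

```latex
\begin{align*}
&\text{Suppose } B(V_1) = V_1 \text{ and } B(V_2) = V_2. \text{ Then} \\
&\|V_1 - V_2\|_\infty
   = \|B(V_1) - B(V_2)\|_\infty
   \le \gamma \, \|V_1 - V_2\|_\infty , \\
&\text{so } (1 - \gamma)\, \|V_1 - V_2\|_\infty \le 0 . \\
&\text{Since } 1 - \gamma > 0 \text{ and norms are nonnegative, }
   \|V_1 - V_2\|_\infty = 0, \text{ hence } V_1 = V_2 .
\end{align*}
```

Any two fixed points therefore coincide, so there is at most one.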

Parts (a) and (b) of Problem 2 (Reinforcement Learning MDP) from the attached file:

(a) Prove that if $V_1(s) \le V_2(s)$ for all $s \in S$, then $B(V_1)(s) \le B(V_2)(s)$ for all $s \in S$.

(b) Prove that for any $V$,

$$\|B^\pi(V) - V^\pi\|_\infty \le \gamma \|V - V^\pi\|_\infty$$

where $\|V\|_\infty = \max_{s \in S} |V(s)|$. Intuitively, this means that applying the Bellman operator $B^\pi$ to any value function $V$ brings that value function closer to the value function for $\pi$, $V^\pi$. This also means that applying $B^\pi$ repeatedly (an infinite number of times), $B^\pi(B^\pi(B^\pi \cdots B^\pi(V) \cdots))$, will result in the value function $V^\pi$ (a little bit more is needed to make this completely formal, but we will not worry about that here). Use the fact that for any $\alpha, x \in \mathbb{R}^n$, if $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$, then $\sum_i \alpha_i x_i \le \max_i x_i$.