Please look at the attached file for a better understanding of the problems: (problem 1 and problem 2c)

1. (Problem 1 in the attached file)

Reinforcement Learning: The Inverted Pendulum (50 pts)

In this problem, you will apply reinforcement learning to automatically design a policy for a difficult control task, without ever using any explicit knowledge of the dynamics of the underlying system.

The problem we will consider is the inverted pendulum or the pole-balancing problem. [1]

Consider the figure shown. A thin pole is connected via a free hinge to a cart, which can move laterally on a smooth table surface. The controller is said to have failed if either the angle of the pole deviates by more than a certain amount from the vertical position (i.e., if the pole falls over), or if the cart's position goes out of bounds (i.e., if it falls off the end of the table). Our objective is to develop a controller to balance the pole with these constraints, by appropriately having the cart accelerate left and right.

We have written a simple Matlab simulator for this problem. The simulation proceeds in discrete time cycles (steps). The state of the cart and pole at any time is completely characterized by 4 parameters: the cart position $x$, the cart velocity $\dot{x}$, the angle of the pole $\theta$ measured as its deviation from the vertical position, and the angular velocity of the pole $\dot{\theta}$. Since it'd be simpler to consider reinforcement learning in a discrete state space, we have approximated the state space by a discretization that maps a state vector $(x, \dot{x}, \theta, \dot{\theta})$ into a number from 1 to NUM_STATES. Your learning algorithm will need to deal only with this discretized representation of the states.

At every time step, the controller must choose one of two actions: push (accelerate) the cart right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These are represented as actions 1 and 2 respectively in the code. When the action choice is made, the simulator updates the state parameters according to the underlying dynamics, and provides a new discretized state.

We will assume that the reward R(s) is a function of the current state only. When the pole angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given, and the system is reinitialized randomly. At all other times, the reward is zero. Your program must learn to balance the pole using only the state transitions and rewards observed.

The files for this problem are in hw6p1.zip. Most of the code has already been written for you, and you need to make changes only to control.m in the places specified. This file can be run in Matlab to show a display and to plot a learning curve at the end. Read the comments at the top of the file for more details on the working of the simulation. [2]

To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities and rewards) for the underlying MDP, solve Bellman's equations for this estimated MDP to obtain a value function, and act greedily with respect to this value function.
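Acting greedily with respect to a value function can be sketched as follows. This is a minimal Python illustration, not the Matlab code expected in control.m. The name `greedy_action`, the nested-list layout `trans_probs[a][s][s2]`, the 0-based indexing, and the random tie-breaking are all assumptions made for the example.

```python
import random

def greedy_action(state, trans_probs, value):
    """Return the action with the highest expected next-state value.

    trans_probs[a][s][s2] is the estimated P(s2 | s, a) (hypothetical layout);
    value[s2] is the current estimate V(s2). States and actions are 0-based
    here, unlike the 1-based Matlab code. Ties are broken at random
    (an assumption) so that neither push direction is systematically favoured.
    """
    scores = [
        sum(p * v for p, v in zip(trans_probs[a][state], value))
        for a in range(len(trans_probs))
    ]
    best = max(scores)
    return random.choice([a for a, score in enumerate(scores) if score == best])
```

Because R(s) depends only on the current state, the reward and the discount factor are the same for both actions and drop out of the comparison, so only the expected next-state value needs to be compared.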

Briefly, you will maintain a current model of the MDP and a current estimate of the value function. Initially, each state has estimated reward zero, and the estimated transition probabilities are uniform (equally likely to end up in any other state).

[1] The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html

[2] Note that the routine for drawing the cart does not work in Octave. Setting min_trial_length_to_start_display to a very large number disables it.

During the simulation, you must choose actions at each time step according to some current policy. As the program goes along taking actions, it will gather observations on transitions and rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to update the whole estimated MDP after every observation, we will store the state transitions and reward observations each time, and update the model and value function/policy only periodically. Thus, you must maintain counts of the total number of times the transition from state $s_i$ to state $s_j$ using action $a$ has been observed (similarly for the rewards). Note that the rewards at any state are deterministic, but the state transitions are not because of the discretization of the state space (several different but close configurations may map onto the same discretized state).
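The bookkeeping described above might look like the following Python sketch. It is not the Matlab code from control.m: the function names, the dictionary keys, and the choice to credit the reward to the state just entered are assumptions for illustration.

```python
def make_counters(num_states, num_actions):
    """Create zeroed observation counters (hypothetical layout).

    trans[a][s][s2] counts observed transitions s -> s2 under action a.
    reward_sum[s] and reward_n[s] accumulate the rewards seen in state s.
    """
    return {
        "trans": [
            [[0] * num_states for _ in range(num_states)]
            for _ in range(num_actions)
        ],
        "reward_sum": [0.0] * num_states,
        "reward_n": [0] * num_states,
    }

def record(counters, s, a, s_next, reward):
    """Store one observation without touching the model estimate."""
    counters["trans"][a][s][s_next] += 1
    # Assumption: the reward is attributed to the state that was just entered.
    counters["reward_sum"][s_next] += reward
    counters["reward_n"][s_next] += 1
```

Only the counters change at each time step; the model and value function are left alone until the next periodic update.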

Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition probabilities and rewards as the average of the observed values (if any). Your program must then use value iteration to solve Bellman's equations on the estimated MDP, to get the value function and new optimal policy for the new model. For value iteration, use a convergence criterion that checks if the maximum absolute change in the value function on an iteration exceeds some specified tolerance.
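A minimal Python sketch of the re-estimation step and of value iteration follows. It is an illustration under assumptions, not the Matlab code from control.m: the function names, the nested-list layout, and the default tolerance of 0.01 are hypothetical. State-action pairs and states with no observations keep their previous estimates, which is one reading of "(if any)".

```python
def estimate_model(trans_counts, reward_sum, reward_n, trans_probs, rewards):
    """Update the model in place, only where observations exist."""
    for a, per_action in enumerate(trans_counts):
        for s, row in enumerate(per_action):
            total = sum(row)
            if total > 0:
                trans_probs[a][s] = [count / total for count in row]
    for s, n in enumerate(reward_n):
        if n > 0:
            rewards[s] = reward_sum[s] / n

def value_iteration(trans_probs, rewards, value, gamma=0.995, tol=0.01):
    """Apply Bellman updates until the largest change is below tol.

    Returns (value, iterations); tol=0.01 is a hypothetical default.
    """
    iterations = 0
    while True:
        iterations += 1
        new_value = [
            rewards[s] + gamma * max(
                sum(p * v for p, v in zip(trans_probs[a][s], value))
                for a in range(len(trans_probs))
            )
            for s in range(len(value))
        ]
        delta = max(abs(n - o) for n, o in zip(new_value, value))
        value = new_value
        if delta < tol:
            return value, iterations
```

Returning the iteration count alongside the value function makes it easy to tell when value iteration converged on its first sweep.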

Finally, assume that the whole learning procedure has converged once several consecutive attempts (defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman's equation all converge in the first iteration. Intuitively, this indicates that the estimated model has stopped changing significantly.
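The stopping rule can be sketched in Python as a counter over successive value-iteration runs. The function name `learning_finished` is hypothetical, and whether the comparison against the threshold should be `>=` or strictly `>` is an assumption to check against the comments in control.m.

```python
def learning_finished(iteration_counts, threshold):
    """Return True once `threshold` consecutive runs converged in one iteration.

    iteration_counts lists, in order, how many iterations each run of
    value iteration needed. A run that needs more than one iteration
    resets the streak.
    """
    consecutive = 0
    for iterations in iteration_counts:
        consecutive = consecutive + 1 if iterations == 1 else 0
        if consecutive >= threshold:  # assumption: non-strict comparison
            return True
    return False
```

For example, with a threshold of 3, the sequence 5, 1, 1, 1 meets the criterion, while 1, 1, 4, 1, 1 does not, because the run that took 4 iterations resets the streak.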

The code outline for this problem is already in control.m, and you need to write code fragments only at the places specified in the file. There are several details (convergence criteria etc.) that are also explained inside the code. Use a discount factor of $\gamma = 0.995$.

Implement the reinforcement learning algorithm as specified, and run it. How many trials (how many times did the pole fall over or the cart fall off) did it take before the algorithm converged? Hand in your implementation of control.m, and the plot it produces.

2c.

In this problem, we show that MDP is guaranteed to find the optimal policy. Consider an MDP with finite state and action spaces, and discount factor $\gamma$. Let $B$ be the Bellman update operator, with $V$ a vector of values for each state. I.e., if $V' = B(V)$, then

$$V'(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s')$$

(c) We say that $V$ is a fixed point of $B$ if $B(V) = V$. Using the fact that the Bellman update operator is a $\gamma$-contraction in the max-norm, prove that $B$ has at most one fixed point, i.e., that there is at most one solution to the Bellman equations. You may assume that $B$ has at least one fixed point.
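One way the argument for (c) could go is sketched below. It assumes that "$\gamma$-contraction in the max-norm" means $\|B(V_1) - B(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty$ for all $V_1, V_2$, and that $0 \le \gamma < 1$; the names $V_1$ and $V_2$ are introduced here for illustration. Suppose $V_1$ and $V_2$ are both fixed points of $B$. Then

$$
\begin{aligned}
\|V_1 - V_2\|_\infty &= \|B(V_1) - B(V_2)\|_\infty && \text{(both are fixed points)} \\
&\le \gamma \, \|V_1 - V_2\|_\infty && \text{(contraction property)} \\
\Longrightarrow \quad (1 - \gamma)\, \|V_1 - V_2\|_\infty &\le 0 .
\end{aligned}
$$

Since $1 - \gamma > 0$ and a norm is never negative, this forces $\|V_1 - V_2\|_\infty = 0$, so $V_1 = V_2$ and any two fixed points coincide.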

#### Top Answer

We need you to clarify your question for our tutors! Clarification request: Dear Student, We...