

Hi,
can I get the answer for problem 2C in the attached file?
# CS 4758/6758: Robot Learning: Homework 6 (preview)

Due: May 6th

## 1. Reinforcement Learning: The Inverted Pendulum (50 pts)

In this problem, you will apply reinforcement learning to automatically design a policy for a difficult control task, without ever using any explicit knowledge of the dynamics of the underlying system.

[Figure: a cart moving laterally along a table, with a thin pole attached by a free hinge; $x$ is the cart position and $\theta$ is the pole's angle from vertical.]

The problem we will consider is the inverted pendulum, or pole-balancing, problem [1]. Consider the figure shown. A thin pole is connected via a free hinge to a cart, which can move laterally on a smooth table surface. The controller is said to have failed if either the angle of the pole deviates by more than a certain amount from the vertical position (i.e., the pole falls over), or the cart's position goes out of bounds (i.e., it falls off the end of the table). Our objective is to develop a controller that balances the pole within these constraints by appropriately having the cart accelerate left and right.

We have written a simple Matlab simulator for this problem. The simulation proceeds in discrete time cycles (steps). The state of the cart and pole at any time is completely characterized by four parameters: the cart position $x$, the cart velocity $\dot{x}$, the angle of the pole $\theta$ measured as its deviation from the vertical position, and the angular velocity of the pole $\dot{\theta}$. Since it is simpler to consider reinforcement learning in a discrete state space, we have approximated the state space by a discretization that maps a state vector $(x, \dot{x}, \theta, \dot{\theta})$ to a number from 1 to NUM_STATES. Your learning algorithm will need to deal only with this discretized representation of the states.

At every time step, the controller must choose one of two actions: push (accelerate) the cart right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These are represented as actions 1 and 2, respectively, in the code.
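The discretization step can be pictured with a small sketch. The snippet below is my own illustration in Python; the course's simulator is in Matlab, and the actual bin boundaries and value of NUM_STATES live in that code, so every constant and function name here is an assumption, not the assignment's implementation.

```python
# Hypothetical bin boundaries; the real simulator defines its own.
X_BINS = [-2.4, -0.8, 0.8, 2.4]           # cart position boundaries
XDOT_BINS = [-0.5, 0.5]                   # cart velocity boundaries
THETA_BINS = [-0.2, -0.1, 0.0, 0.1, 0.2]  # pole angle boundaries (rad)
THETADOT_BINS = [-0.87, 0.87]             # angular velocity boundaries

def bucket(value, boundaries):
    """Return the index of the interval that `value` falls into."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

def discretize(x, x_dot, theta, theta_dot):
    """Map a continuous state to a single integer in 1..NUM_STATES."""
    dims = [
        (bucket(x, X_BINS), len(X_BINS) + 1),
        (bucket(x_dot, XDOT_BINS), len(XDOT_BINS) + 1),
        (bucket(theta, THETA_BINS), len(THETA_BINS) + 1),
        (bucket(theta_dot, THETADOT_BINS), len(THETADOT_BINS) + 1),
    ]
    index = 0
    for idx, size in dims:
        index = index * size + idx   # mixed-radix encoding of the 4 buckets
    return index + 1                 # 1-based, as in the Matlab code

NUM_STATES = (len(X_BINS) + 1) * (len(XDOT_BINS) + 1) * \
             (len(THETA_BINS) + 1) * (len(THETADOT_BINS) + 1)
```

The key property is that nearby continuous states map to the same integer, which is also why the discretized transitions become stochastic even though the underlying dynamics are deterministic.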
When the action choice is made, the simulator updates the state parameters according to the underlying dynamics and provides a new discretized state. We will assume that the reward $R(s)$ is a function of the current state only. When the pole angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given and the system is reinitialized randomly. At all other times, the reward is zero. Your program must learn to balance the pole using only the state transitions and rewards observed.

The files for this problem are in hw6p1.zip. Most of the code has already been written for you, and you need to make changes only to control.m in the places specified. This file can be run in Matlab to show a display and to plot a learning curve at the end. Read the comments at the top of the file for more details on the workings of the simulation [2].

[1] The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html
[2] Note that the routine for drawing the cart does not work in Octave. Setting the minimum trial length at which the display starts to a very large number disables it.

To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities and rewards) for the underlying MDP, solve Bellman's equations for this estimated MDP to obtain a value function, and act greedily with respect to this value function. Briefly, you will maintain a current model of the MDP and a current estimate of the value function. Initially, each state has estimated reward zero, and the estimated transition probabilities
are uniform (equally likely to end up in any other state). During the simulation, you must choose actions at each time step according to some current policy. As the program goes along taking actions, it will gather observations on transitions and rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to update the whole estimated MDP after every observation, we will store the state-transition and reward observations each time, and update the model and value function/policy only periodically. Thus, you must maintain counts of the total number of times the transition from state $s_i$ to state $s_j$ using action $a$ has been observed (and similarly for the rewards). Note that the rewards at any state are deterministic, but the state transitions are not, because of the discretization of the state space (several different but close configurations may map onto the same discretized state).

Each time a failure occurs (such as the pole falling over), you should re-estimate the transition probabilities and rewards as the average of the observed values (if any). Your program must then use value iteration to solve Bellman's equations on the estimated MDP, to get the value function and new optimal policy for the new model. For value iteration, use a convergence criterion that checks whether the maximum absolute change in the value function on an iteration exceeds some specified tolerance.

Finally, assume that the whole learning procedure has converged once several consecutive attempts (defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman's equations all converge in the first iteration. Intuitively, this indicates that the estimated model has stopped changing significantly.

The code outline for this problem is already in control.m, and you need to write code fragments only at the places specified in the file. There are several details (convergence criteria, etc.) that are also explained inside the code. Use a discount factor of $\gamma = 0.995$.
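The loop described above (store observations, periodically re-estimate the model as averages of the observations, then run value iteration until the maximum absolute change falls below a tolerance) can be sketched as follows. The assignment's code is Matlab (control.m); this Python sketch uses my own names and illustrative constants (NUM_STATES, TOLERANCE), so treat it as a sketch of the logic, not the actual solution.

```python
from collections import defaultdict

# Illustrative constants; control.m defines the real ones.
NUM_STATES = 163
NUM_ACTIONS = 2      # push right, push left
GAMMA = 0.995
TOLERANCE = 0.01

trans_count = defaultdict(int)      # (s, a, s_next) -> observation count
reward_sum = defaultdict(float)     # s -> total observed reward
reward_count = defaultdict(int)     # s -> number of reward observations

def record(s, a, s_next, r):
    """Store one observed transition and the reward seen at s_next."""
    trans_count[(s, a, s_next)] += 1
    reward_sum[s_next] += r
    reward_count[s_next] += 1

def estimate_model():
    """Average the observations; keep the uniform prior / zero reward
    for any (state, action) pair never observed."""
    P = [[[1.0 / NUM_STATES] * NUM_STATES
          for _ in range(NUM_ACTIONS)] for _ in range(NUM_STATES)]
    R = [0.0] * NUM_STATES
    for s in range(NUM_STATES):
        for a in range(NUM_ACTIONS):
            total = sum(trans_count[(s, a, s2)] for s2 in range(NUM_STATES))
            if total > 0:
                P[s][a] = [trans_count[(s, a, s2)] / total
                           for s2 in range(NUM_STATES)]
        if reward_count[s] > 0:
            R[s] = reward_sum[s] / reward_count[s]
    return P, R

def value_iteration(P, R, V=None):
    """Apply the Bellman update until the max absolute change in V is
    below TOLERANCE; return V, the greedy policy, and the number of
    sweeps (a single sweep signals that learning has converged)."""
    n = len(R)
    if V is None:
        V = [0.0] * n
    sweeps = 0
    while True:
        sweeps += 1
        new_V = [R[s] + GAMMA * max(
                     sum(P[s][a][s2] * V[s2] for s2 in range(n))
                     for a in range(NUM_ACTIONS))
                 for s in range(n)]
        change = max(abs(new_V[s] - V[s]) for s in range(n))
        V = new_V
        if change < TOLERANCE:
            policy = [max(range(NUM_ACTIONS),
                          key=lambda a: sum(P[s][a][s2] * V[s2]
                                            for s2 in range(n)))
                      for s in range(n)]
            return V, policy, sweeps
```

The `sweeps == 1` case is what the NO_LEARNING_THRESHOLD test counts: once value iteration converges in a single sweep several times in a row, the estimated model has stopped changing significantly.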
Implement the reinforcement learning algorithm as specified, and run it. How many trials (how many times did the pole fall over or the cart fall off) did it take before the algorithm converged? Hand in your implementation of control.m and the plot it produces.

## 2. Reinforcement Learning: MDPs

In this problem, we show that value iteration on an MDP is guaranteed to find the optimal policy. Consider an MDP with finite state and action spaces and discount factor $\gamma$. Let $B$ be the Bellman update operator, with $V$ a vector of values for each state. That is, if $V' = B(V)$, then

$$V'(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s').$$

(a) Prove that if $V_1(s) \le V_2(s)$ for all $s \in S$, then $B(V_1)(s) \le B(V_2)(s)$ for all $s \in S$.

(b) Prove that for any $V$, $\|B^\pi(V) - V^\pi\|_\infty \le \gamma \|V - V^\pi\|_\infty$, where $\|V\|_\infty = \max_{s \in S} |V(s)|$ and $B^\pi$ is the Bellman operator for a fixed policy $\pi$. Intuitively, this means that applying the Bellman operator $B^\pi$ to any value function $V$ brings it closer to the value function for $\pi$, namely $V^\pi$. This also means that applying $B^\pi$ repeatedly (an infinite number of times), $B^\pi(B^\pi(\cdots B^\pi(V)\cdots))$, will result in the value function $V^\pi$ (a little more is needed to make this completely formal, but we will not worry about that here).

Hint: use the fact that for any $\alpha, x \in \mathbb{R}^n$, if $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$, then $\sum_i \alpha_i x_i \le \max_i x_i$.
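The contraction claim in part (b) can be sanity-checked numerically before attempting the proof. The sketch below is my own illustration, not part of the assignment: it builds a small random MDP under a fixed policy, approximates $V^\pi$ by applying $B^\pi$ many times, and verifies the bound $\|B^\pi(V) - V^\pi\|_\infty \le \gamma \|V - V^\pi\|_\infty$ on random value functions.

```python
import random

random.seed(0)
GAMMA = 0.9   # any discount in [0, 1) satisfies the bound
N = 5         # number of states in the toy MDP

def random_dist(n):
    """A random probability distribution over n states."""
    w = [random.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

# Transition matrix and rewards induced by a fixed policy pi.
P_pi = [random_dist(N) for _ in range(N)]
R = [random.uniform(-1, 1) for _ in range(N)]

def B_pi(V):
    """Bellman operator for the fixed policy pi (no max over actions)."""
    return [R[s] + GAMMA * sum(P_pi[s][s2] * V[s2] for s2 in range(N))
            for s in range(N)]

def sup_norm(V):
    return max(abs(v) for v in V)

# Approximate V_pi: B_pi is a gamma-contraction, so iterating it
# converges to its unique fixed point.
V_pi = [0.0] * N
for _ in range(2000):
    V_pi = B_pi(V_pi)

# Check the bound on random value functions.
for _ in range(100):
    V = [random.uniform(-10, 10) for _ in range(N)]
    lhs = sup_norm([a - b for a, b in zip(B_pi(V), V_pi)])
    rhs = GAMMA * sup_norm([a - b for a, b in zip(V, V_pi)])
    assert lhs <= rhs + 1e-9
```

This is evidence, not a proof; the proof itself follows from the hint, since each row of $P_\pi$ is a set of nonnegative weights summing to 1.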