CS 6375 Machine Learning
Homework 6
Due: 05/07/2008

1. Reinforcement learning application. (15 pts)
Read a paper about using reinforcement learning for an application. Briefly summarize the paper, and explain clearly the states, rewards, and actions for the task.

2. MDP. (30 pts)
The following figure shows an MDP with N states. All states have two actions (north and right) except Sn, which can only self-loop. As you can see from the figure, all state transitions are deterministic. The discount factor is γ.
(a) What is J*(Sn)?
(b) What is the optimal policy?
(c) What is J*(S1)?
(d) Use value iteration to solve this MDP. What are J1(S1) and J2(S1) after the first and second iterations, respectively? (A value iteration sketch follows the hint below.)
Hint: If you don't remember the formula for summing a geometric series, you will need the following one, which holds for 0 ≤ α < 1:

    1 + α + α² + α³ + … = 1 / (1 − α)
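For reference on part 2(d), here is a minimal value iteration sketch in Python. The figure is not reproduced in this preview, so the states, rewards, and deterministic transitions below are hypothetical placeholders that you would fill in from the figure; the numeric GAMMA is also an assumption, since the assignment leaves the discount factor symbolic.

    # Minimal value iteration sketch for a finite MDP with deterministic
    # transitions. The MDP below is a hypothetical placeholder: the real
    # states, rewards, and transitions come from the assignment's figure.

    GAMMA = 0.9  # placeholder; the assignment uses a symbolic gamma

    # next_state[s][a]: deterministic successor; reward[s][a]: immediate reward
    next_state = {
        "S1": {"north": "S1", "right": "S2"},
        "S2": {"north": "S2", "right": "Sn"},
        "Sn": {"self": "Sn"},          # Sn can only self-loop
    }
    reward = {
        "S1": {"north": 0.0, "right": 0.0},
        "S2": {"north": 0.0, "right": 1.0},
        "Sn": {"self": 0.0},
    }

    def value_iteration(num_iters):
        J = {s: 0.0 for s in next_state}   # J0 = 0 for every state
        for _ in range(num_iters):
            # Synchronous Bellman backup: J(s) <- max_a [ r(s,a) + GAMMA * J(s') ]
            J = {s: max(reward[s][a] + GAMMA * J[next_state[s][a]]
                        for a in next_state[s])
                 for s in next_state}
        return J

    print(value_iteration(1))  # J1
    print(value_iteration(2))  # J2

Running value_iteration with 1 and 2 iterations gives the J1(S1) and J2(S1) values asked for in (d); letting the iteration count grow shows where the geometric-series formula in the hint enters the closed-form answer for (c).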
3. Policy iteration. (25 pts)
Consider the following MDP with three states, whose rewards are −1, −2, and 0, respectively. State 3 is the terminal state. There are two possible actions, a and b, for states 1 and 2. The transition probabilities for the two actions are shown in the figure. Use a discount factor of 0.5.
(a) Assume the initial policy takes action b in both states 1 and 2. Apply policy iteration to determine the optimal policy and the values of states 1 and 2. Show your steps. (A sketch for checking the computation follows this problem.)
(b) What if the initial policy takes action a in both states?
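To check the hand computation in (a) and (b), here is a minimal policy iteration sketch in Python. Only the rewards (−1, −2, 0), the terminal state 3, and γ = 0.5 come from the problem statement; the transition probabilities P below are hypothetical placeholders for the ones in the figure, which this preview does not show.

    import numpy as np

    GAMMA = 0.5
    REWARD = {1: -1.0, 2: -2.0, 3: 0.0}  # from the problem; state 3 is terminal

    # Hypothetical transition probabilities P[(state, action)] = {next: prob};
    # replace these with the ones shown in the assignment's figure.
    P = {
        (1, "a"): {2: 0.8, 1: 0.2},
        (1, "b"): {3: 0.1, 1: 0.9},
        (2, "a"): {1: 0.8, 2: 0.2},
        (2, "b"): {3: 0.1, 2: 0.9},
    }

    def evaluate(policy):
        # Policy evaluation: solve V = R + GAMMA * P_pi V exactly for the
        # two non-terminal states (the terminal state contributes V = 0).
        states = [1, 2]
        A = np.eye(2)
        b = np.array([REWARD[s] for s in states])
        for i, s in enumerate(states):
            for s2, p in P[(s, policy[s])].items():
                if s2 in (1, 2):
                    A[i][states.index(s2)] -= GAMMA * p
        V = np.linalg.solve(A, b)
        return {1: V[0], 2: V[1], 3: 0.0}

    def policy_iteration(policy):
        while True:
            V = evaluate(policy)
            # Greedy policy improvement against the evaluated values
            new = {s: max(("a", "b"),
                          key=lambda a: REWARD[s] + GAMMA *
                          sum(p * V[s2] for s2, p in P[(s, a)].items()))
                   for s in (1, 2)}
            if new == policy:
                return policy, V
            policy = new

    print(policy_iteration({1: "b", 2: "b"}))  # part (a)
    print(policy_iteration({1: "a", 2: "a"}))  # part (b)

Solving the linear system in the evaluation step mirrors what the hand computation does with two simultaneous equations, so the printed values should match your step-by-step work once the placeholder probabilities are replaced with the figure's.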
4. Programming. (30 pts)
In the grid world shown below, the agent can move in the four compass directions, starting from S. The goal state is G. The reward for reaching the goal is 100, and γ = 0.9. Use Q-learning to learn the optimal policy, and calculate Q*(S, a). For learning, generate the training sequences randomly. (A starter sketch follows the instructions.)
Instructions:
(a) Print the optimal policy and the final Q values.
(b) Submit your code and a readme file.
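As a starting point for problem 4, here is a minimal tabular Q-learning sketch in Python with random action selection, as the problem requests. The grid size and the positions of S and G are hypothetical placeholders, since the figure is not shown in this preview; the reward of 100 on reaching G and γ = 0.9 come from the problem statement, while the learning rate ALPHA and the episode count are assumptions.

    import random

    # Hypothetical grid layout: the real size and S/G positions come
    # from the assignment's figure, which this preview does not show.
    ROWS, COLS = 3, 3
    START, GOAL = (2, 0), (0, 2)
    ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
    GAMMA, ALPHA, EPISODES = 0.9, 0.5, 5000  # ALPHA and EPISODES are assumptions

    def step(state, action):
        # Deterministic move; bumping into a wall leaves the state unchanged.
        r, c = state
        dr, dc = ACTIONS[action]
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            state = (nr, nc)
        # Reward 100 only on reaching the goal, per the problem statement.
        return state, (100.0 if state == GOAL else 0.0)

    Q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in ACTIONS}

    for _ in range(EPISODES):
        s = START
        while s != GOAL:
            a = random.choice(list(ACTIONS))  # random exploration, as required
            s2, reward = step(s, a)
            best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
            # Q-learning update: Q(s,a) += ALPHA * (r + GAMMA * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
            s = s2

    print({a: round(Q[(START, a)], 2) for a in ACTIONS})  # Q*(S, a) estimates

Because the environment is deterministic and the reward is only granted on reaching G, the learned Q*(S, a) values should settle at 100 · γ^(d−1), where d is the number of steps to the goal after taking action a from S; extending the final print to the greedy action in every cell gives the optimal policy for part (a).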