[...] terminal state. There are two possible actions, a and b, for states 1 and 2. The transition probabilities for the two actions are shown in the figure. Use a discount factor of 0.5.

(a) Assume the initial policy takes action b in both states 1 and 2. Apply policy iteration to determine the optimal policy and the values of states 1 and 2. Show the steps.

(b) What if the initial policy takes action a in both states?

4. Programming. (30 pts) In the grid world shown below, the agent can move in the four compass directions, starting from S. The goal state is G. The reward on reaching the goal is 100 and γ = 0.9. Use Q-learning to learn the optimal policy, and calculate Q*(S, a). For learning, generate the training sequences randomly.

Instructions:
(a) Print the optimal policy and the final Q values.
(b) Submit your code and a readme file.
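Part (a) asks for policy iteration by hand, but the procedure can be checked mechanically. The sketch below implements standard policy iteration (iterative policy evaluation followed by greedy improvement). The transition probabilities and rewards are placeholders, since the figure with the actual numbers is not reproduced in this preview; substitute the values from the figure before using the output.

```python
GAMMA = 0.5  # discount factor given in the problem

# P[(state, action)] = list of (probability, next_state, reward).
# State 0 is the terminal state (value fixed at 0).
# NOTE: these numbers are PLACEHOLDERS -- the real values are in the figure.
P = {
    (1, 'a'): [(0.9, 2, 0), (0.1, 0, 0)],
    (1, 'b'): [(1.0, 0, 1)],
    (2, 'a'): [(1.0, 0, 2)],
    (2, 'b'): [(0.5, 1, 0), (0.5, 0, 2)],
}
STATES = [1, 2]
ACTIONS = ['a', 'b']

def evaluate(policy, iters=100):
    """Iterative policy evaluation: V(s) = sum_s' P(s'|s,pi(s)) [r + gamma V(s')]."""
    V = {0: 0.0, 1: 0.0, 2: 0.0}
    for _ in range(iters):  # gamma = 0.5, so this converges quickly
        for s in STATES:
            V[s] = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, policy[s])])
    return V

def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy)
        new = {s: max(ACTIONS,
                      key=lambda a: sum(p * (r + GAMMA * V[s2])
                                        for p, s2, r in P[(s, a)]))
               for s in STATES}
        if new == policy:
            return policy, V
        policy = new

# Part (a): start from action b in both states.
pi, V = policy_iteration({1: 'b', 2: 'b'})
print(pi, V)
```

For part (b), call `policy_iteration({1: 'a', 2: 'a'})`; policy iteration converges to the same optimal policy from either start, which is the point of the comparison.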
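The Q-learning setup in problem 4 can be sketched as follows. The grid dimensions and the positions of S and G are assumptions, since the grid figure is not reproduced here; γ = 0.9, the reward of 100 on reaching G, and the randomly generated training sequences follow the problem statement. The learning rate α is an additional assumed parameter.

```python
import random

random.seed(0)

ROWS, COLS = 3, 4             # ASSUMED grid size -- use the grid in the figure
START, GOAL = (2, 0), (0, 3)  # ASSUMED positions of S and G
GAMMA, ALPHA, EPISODES = 0.9, 0.1, 5000
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E

def step(s, a):
    """Deterministic move; bumping a wall leaves the agent in place."""
    r, c = s[0] + a[0], s[1] + a[1]
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = s
    reward = 100 if (r, c) == GOAL else 0
    return (r, c), reward

Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}

for _ in range(EPISODES):
    s = START
    while s != GOAL:
        a = random.choice(ACTIONS)  # random action sequences, per the instructions
        s2, reward = step(s, a)
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, a2)] for a2 in ACTIONS)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
        s = s2

# Q*(S, a) for each action a, as the problem asks
print({a: round(Q[(START, a)], 2) for a in ACTIONS})
```

Because the environment is deterministic, the learned values should approach Q*(s, a) = 100·γ^(d−1), where d is the number of steps from s (after taking a) until the goal is entered; printing the argmax action per cell yields the optimal policy required in instruction (a).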
This note was uploaded on 01/25/2012 for the course CS 6375 (Machine Learning), taught by Professor Yang Liu during the Spring '09 term at the University of Texas at Dallas, Richardson.