Massachusetts Institute of Technology
16.410 Principles of Automated Reasoning and Decision Making
Problem Set #10
Due: Session 25
Learning, Control and Adversaries
Objectives
In this problem set you will develop your understanding of how agents act to
maximize their utility while interacting within a changing environment. The methods
you will apply include decision-tree learning for classification, Markov decision
processes and reinforcement learning, and adversarial game-tree search.
Readings
The material in this problem set corresponds primarily to Lectures 7, 23 and 24.
Please review the corresponding lecture notes and any assigned readings specified in the
notes. Note that Lecture 7 covers game-tree search, Lecture 23 covers decision-tree
learning, and Lecture 24 covers reinforcement learning and control, based on Markov
Decision Processes.
Problem 1 – MDPs: Tortoise and Hare
The following question is taken from last year’s final. We all know, as the story goes,
that the Tortoise beat the Hare to the finish line. The Tortoise was slow, but extremely
focused on the finish line, while the Hare was fast, but easily distracted. Although the
Tortoise crossed the finish line first, who really gained the greatest reward, the Tortoise or the
Hare? It’s a matter of perspective. To resolve this age-old question, we frame the race as
an MDP, solve for the optimal policy, and use this policy to determine once and for all
whose path is best, the Tortoise’s or the Hare’s.
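[Figure: state-transition diagram of the race MDP, with states 1, 2, 3, C and F. Edges are labeled action / reward: T / 10, H / 18, H / 50, T / 0, T or H / 100, T or H / 0, T or H / 0.]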
We model the race with the above MDP. The race starts at 1 and finishes at F. States 2 and 3
denote intermediate checkpoints along the race course, while C denotes a Cabbage patch,
which is very enticing to the Hare. Actions are T and H. T denotes an action focused
towards the finish line, while H denotes an action that grabs the greatest immediate
reward. The Tortoise’s sequence <T, T> is the shortest path to the finish line. The Hare’s
sequence <H, H> is the direct path to the cabbage patch, with rewards along the way.
<H, T, T> represents a mixed strategy, balancing immediate and long-term reward.

State    Action    Next State    Reward
1        T         3             10
1        H         2             18
2        H         C             50
2        T         3             0
3        T or H    F             100
C        T or H    C             0
F        T or H    F             0
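The three action sequences described above can be checked directly against the transition table. The following is a minimal sketch in Python, not part of the original problem set: the dictionary name MDP, the function name rollout, and the optional gamma parameter are all illustrative choices, and the default gamma = 1.0 (no discounting) is an assumption made only to compare raw reward totals.

```python
# Deterministic transitions copied from the table:
# (state, action) -> (next_state, reward).
MDP = {
    ("1", "T"): ("3", 10),
    ("1", "H"): ("2", 18),
    ("2", "H"): ("C", 50),
    ("2", "T"): ("3", 0),
    ("3", "T"): ("F", 100),
    ("3", "H"): ("F", 100),
    ("C", "T"): ("C", 0),
    ("C", "H"): ("C", 0),
    ("F", "T"): ("F", 0),
    ("F", "H"): ("F", 0),
}


def rollout(actions, start="1", gamma=1.0):
    """Follow a fixed action sequence; return (final_state, discounted_return).

    gamma = 1.0 is an assumed default that simply sums the rewards.
    """
    state, total, discount = start, 0.0, 1.0
    for action in actions:
        state, reward = MDP[(state, action)]
        total += discount * reward
        discount *= gamma
    return state, total


if __name__ == "__main__":
    for name, seq in [("Tortoise", ["T", "T"]),
                      ("Hare", ["H", "H"]),
                      ("Mixed", ["H", "T", "T"])]:
        print(name, seq, rollout(seq))
```

Without discounting, <T, T> collects 10 + 100 = 110 and ends at F, <H, H> collects 18 + 50 = 68 and ends at C, and <H, T, T> collects 18 + 0 + 100 = 118 and ends at F. Passing a gamma below 1 shows how discounting penalizes the longer mixed path.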
Part A. Value Function and Policy for Tortoise Discount
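The discount factor intended for this part is not visible in the text above, so the following is only a sketch of how a value function and greedy policy could be computed for this MDP by value iteration. The names value_iteration and q_value are hypothetical, and the value gamma = 0.9 used in the example run is an assumption chosen purely for illustration, not the value specified by the problem.

```python
# Deterministic transitions from the problem's table:
# (state, action) -> (next_state, reward).
MDP = {
    ("1", "T"): ("3", 10), ("1", "H"): ("2", 18),
    ("2", "T"): ("3", 0),  ("2", "H"): ("C", 50),
    ("3", "T"): ("F", 100), ("3", "H"): ("F", 100),
    ("C", "T"): ("C", 0),  ("C", "H"): ("C", 0),
    ("F", "T"): ("F", 0),  ("F", "H"): ("F", 0),
}
STATES = ("1", "2", "3", "C", "F")
ACTIONS = ("T", "H")


def q_value(state, action, values, gamma):
    """Immediate reward plus the discounted value of the successor state."""
    next_state, reward = MDP[(state, action)]
    return reward + gamma * values[next_state]


def value_iteration(gamma, tol=1e-9, max_iters=1000):
    """Return (values, policy) after synchronous Bellman backups converge."""
    values = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_values = {
            s: max(q_value(s, a, values, gamma) for a in ACTIONS)
            for s in STATES
        }
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            break
    # Greedy policy; ties (states 3, C, F) resolve to the first action, T.
    policy = {
        s: max(ACTIONS, key=lambda a: q_value(s, a, values, gamma))
        for s in STATES
    }
    return values, policy


if __name__ == "__main__":
    values, policy = value_iteration(gamma=0.9)  # gamma is an assumed value
    for s in STATES:
        print(s, round(values[s], 3), policy[s])
```

Under the assumed gamma = 0.9, state 1 compares 10 + 0.9 · 100 = 100 for T against 18 + 0.9 · 90 = 99 for H, so the greedy choice at the start is T. With gamma = 1.0 the same function prefers H at state 1 (118 versus 110), which illustrates how the answer depends on the discount.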
This note was uploaded on 11/07/2011 for the course AERO 16.410 taught by Professor Brian Williams during the Fall '05 term at MIT.