AMS 556 Dynamic Programming - Homework 01 Solution
1) Equivalent Randomized Markov Policy
Starting from x0 = 1, we list all the paths that can occur, together with their probabilities, in Table I.
From Table I we can easily read off the following…
Lecture 2: Properties of Strategic Measures
In this lecture, we describe two fundamental properties of strategic measures. The first property is that, for any initial distribution, the set of strategic measures is convex. This means that, if a decision maker…
Lecture 5: Optimality Operators.
For x ∈ X, a ∈ A(x), and for a nonnegative or nonpositive function f on X, we define

    P^a f(x) = Σ_{y ∈ X} p(y|x, a) f(y).    (1)

Lemma 1. P^a V+(x) ≤ V+(x) for any x ∈ X and for any a ∈ A(x).
Proof. If V+(x) = ∞, then the statement of the lemma…
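For a finite state space, the operator in (1) is just a matrix–vector product. A minimal sketch (the three-state kernel below is hypothetical, not from the lecture):

```python
import numpy as np

# Transition kernel of a hypothetical 3-state MDP: p[a][x, y] = p(y | x, a).
p = {
    "a": np.array([[0.5, 0.5, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.2, 0.3, 0.5]]),
}

def P(a, f, p=p):
    """Apply the operator P^a: (P^a f)(x) = sum over y of p(y|x,a) * f(y)."""
    return p[a] @ f

f = np.array([1.0, 2.0, 3.0])
print(P("a", f))  # expectation of f after one step under action a
```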
Lecture 3: Properties of Strategic Measures
Representation of strategic measures for randomized Markov
policies via strategic measures for (nonrandomized) Markov policies.
The set M of Markov policies is the set of functions from X × {0, 1, . . .} to A su…
Lecture 6: Optimality Equation
Theorem 1. If the General Convergence Condition holds, then

    V = T V.    (1)

Proof. First, we show that V(x) ≤ TV(x) for all x ∈ X. In view of Corollary 6 from Lecture 4, for any ε > 0 there exists an ε-optimal nonrandomized policy that…
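The operator T can be written down directly for a finite MDP. A sketch (the two-state data and the discount factor β = 0.9 are assumptions for illustration, not the lecture's example):

```python
import numpy as np

# Hypothetical MDP data: two states with per-state action sets.
beta = 0.9                                    # assumed discount factor
r = {0: {"a": 1.0}, 1: {"a": 0.0, "b": 2.0}}  # rewards r(x, a)
p = {                                         # transitions p(.|x, a)
    (0, "a"): np.array([0.5, 0.5]),
    (1, "a"): np.array([1.0, 0.0]),
    (1, "b"): np.array([0.0, 1.0]),
}

def T(V):
    """Optimality operator: (TV)(x) = max over a of r(x,a) + beta * P^a V(x)."""
    return np.array([max(r[x][a] + beta * p[(x, a)] @ V for a in r[x])
                     for x in range(len(V))])

print(T(np.zeros(2)))  # one application of T to V = 0 gives max_a r(x, a)
```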
Lecture 9: Value Iterations for Discounted Dynamic Programming
Consider a function V0 and the values T^n V0. It is natural to study the conditions under which the values T^n V0 are well-defined and the sequence {T^n V0} converges to V in some sense…
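Under a discount factor β < 1, the iterates T^n V0 can be computed until the sup-norm change is small. A sketch with a hypothetical two-state MDP (all data are assumptions):

```python
import numpy as np

# Value iteration sketch for an assumed discounted 2-state MDP.
beta = 0.9
r = np.array([[1.0, 0.0],                  # r[x, a]
              [0.0, 2.0]])
p = np.array([[[0.5, 0.5], [1.0, 0.0]],    # p[x, a, y]
              [[1.0, 0.0], [0.0, 1.0]]])

def T(V):
    return (r + beta * p @ V).max(axis=1)

V = np.zeros(2)        # V0; for beta < 1 the iterates converge for any bounded V0
for n in range(1000):
    V_next = T(V)
    if np.abs(V_next - V).max() < 1e-10:   # sup-norm stopping rule
        break
    V = V_next
print(V_next)
```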
Lecture 7: Stationary Optimal Policies
The following Lemma states that any stationary optimal policy is conserving.
Lemma 1 Let the General Convergence Condition hold. Then any stationary optimal policy is conserving.
Proof. For a stationary policy, we have…
Lecture 8: The Principle of Contraction Mappings
Metric Spaces. A metric space is a pair (Y, d) of two things: a set Y and a nonnegative function d on Y × Y, called the distance between points x and y of Y.
Definition. A pair (Y, d) forms a metric space if Y is a se…
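The principle behind this lecture is easy to see numerically: iterating any contraction from any starting point converges geometrically to its unique fixed point. A small example on the metric space (R, |·|), with a hypothetical map F:

```python
# Banach fixed-point iteration sketch: F is a contraction on (R, |.|)
# with modulus q = 0.5, so the iterates converge geometrically.
def F(x):
    return 0.5 * x + 1.0   # |F(x) - F(y)| = 0.5 |x - y|; fixed point x* = 2

x = 0.0
for n in range(60):        # error after n steps is at most 2 * 0.5**n
    x = F(x)
print(x)                   # approaches the fixed point 2.0
```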
Lecture 10: Finite-Horizon Models
Nonstationary Models. An MDP is called nonstationary if the sets of available actions, the transition probabilities, and the rewards depend on the time parameter. In particular, an infinite-horizon nonstationary policy is defined by…
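In a finite-horizon nonstationary model, optimal values are computed by backward induction, with the rewards r_n and transitions p_n allowed to vary with the time parameter n. A sketch under assumed data (horizon N = 3, two states, two actions):

```python
import numpy as np

# Backward induction for a hypothetical nonstationary finite-horizon model.
N = 3                                    # horizon
r = [np.array([[1.0, 0.0],
               [0.0, float(n)]]) for n in range(N)]   # time-dependent r_n[x, a]
p = [np.array([[[0.5, 0.5], [1.0, 0.0]],
               [[1.0, 0.0], [0.0, 1.0]]]) for n in range(N)]  # p_n[x, a, y]

V = np.zeros(2)                          # terminal value f(x_N) = 0
policy = []
for n in reversed(range(N)):             # n = N-1, ..., 0
    Q = r[n] + p[n] @ V                  # Q_n(x, a)
    policy.append(Q.argmax(axis=1))
    V = Q.max(axis=1)
policy.reverse()                         # policy[n][x] = optimal action at time n
print(V)                                 # optimal expected total reward from time 0
```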
Lecture 12: Optimal stopping problem
Consider a Markov chain with state space X and with transition probability matrix P. There is a function g(x). Each state has two actions: to continue or to stop. If the decision is to continue (c), the o…
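Such a stopping problem can be solved by iterating V ← max(g, c + β P V), where stopping in x collects g(x). The one-step continuation reward c(x) and the discount β < 1 below are assumptions added so the sketch is self-contained; they are not taken from the lecture:

```python
import numpy as np

# Optimal stopping sketch: stop collects g(x); continuing collects an assumed
# reward c(x) and moves according to P.  beta < 1 makes the iteration converge.
beta = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])
g = np.array([0.0, 1.0, 5.0])    # stopping reward
c = np.array([0.2, 0.2, 0.2])    # assumed continuation reward

V = np.zeros(3)
for _ in range(500):
    V = np.maximum(g, c + beta * P @ V)
stop = g >= c + beta * P @ V     # states where stopping is optimal
print(V, stop)
```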
Lecture 11: Value Iteration
Let V0 be a function on X such that there exist two positive numbers K1 and K2 with |V0| ≤ K1 V+ + K2. We recall that T^n V0(x) = V(x, n, V0). In particular, for V0 = 0, T^n V0(x) = V(x, n, 0) = V(x, n). Let Vn(x) = V(x, n). In this lecture…
Lecture 14: An example when Markov ε-optimal policies do not exist.
We consider an example of a positive dynamic programming problem in which the action sets are finite but, for any K > 0, there is no randomized Markov K-optimal policy. In particular…
Lecture 15: ε-optimal policies I.
Though stationary optimal and ε-optimal policies may not exist for positive dynamic programming problems, the following two theorems indicate that, in some sense, there exist good stationary policies for positive dynamic programming…
Lecture 13: Examples when optimal policies do not exist
We shall consider two examples of positive dynamic programming problems where the model satisfies the Compactness Conditions but there are no stationary optimal policies. In the first example, the state sp…
Lecture 16: ε-optimal policies II.
Theorem 3 in Lecture describes the existence of multiplicative ε-optimal policies in positive dynamic programming. In addition, Example 1 implies that ε-optimal policies may not exist. In negative dynamic programming, the s…
Lecture 17: Discounted MDPs with Finite State and Action Sets.
In the previous lectures, we considered the value iteration algorithm for discounted Markov Decision Processes. In this lecture, we consider two other methods: policy iteration and linear programming…
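Policy iteration alternates exact policy evaluation (a linear solve) with policy improvement. A sketch for a hypothetical finite discounted MDP (all data assumed for illustration):

```python
import numpy as np

# Policy iteration sketch for an assumed finite discounted MDP.
beta = 0.9
r = np.array([[1.0, 0.0],                  # r[x, a]
              [0.0, 2.0]])
p = np.array([[[0.5, 0.5], [1.0, 0.0]],    # p[x, a, y]
              [[1.0, 0.0], [0.0, 1.0]]])
n = len(r)

phi = np.zeros(n, dtype=int)               # initial stationary policy
while True:
    # Policy evaluation: solve (I - beta * P_phi) V = r_phi exactly.
    P_phi = p[np.arange(n), phi]
    r_phi = r[np.arange(n), phi]
    V = np.linalg.solve(np.eye(n) - beta * P_phi, r_phi)
    # Policy improvement: choose conserving actions for V.
    phi_next = (r + beta * p @ V).argmax(axis=1)
    if np.array_equal(phi_next, phi):
        break                              # phi is optimal
    phi = phi_next
print(phi, V)
```

For finitely many states and actions the loop terminates, since each improvement step produces a strictly better policy until the optimal one is reached.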
Lecture 4: Total Reward Criteria.
Finite-horizon expected total rewards are

    v_N(x, π, β, f) = E_x^π [ Σ_{n=0}^{N−1} β^n r(x_n, a_n) + β^N f(x_N) ],    (1)

whenever the expectation is well-defined. In this equation, N is the horizon length, x is the initial state, π is a policy, and β is the discount fac…
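For a fixed stationary policy, the reward (1) satisfies the backward recursion v_0 = f and v_{k+1} = r_φ + β P_φ v_k, so v_N is reached after N steps. A minimal sketch (the policy's transition matrix, rewards, β = 0.5, and N = 3 are all assumptions for illustration):

```python
import numpy as np

# Evaluating the finite-horizon reward (1) for a hypothetical stationary policy phi
# via v_0 = f and v_{k+1}(x) = r(x, phi(x)) + beta * sum_y p(y|x, phi(x)) v_k(y).
beta = 0.5                                   # assumed discount factor
P_phi = np.array([[0.5, 0.5], [0.0, 1.0]])   # transitions under phi
r_phi = np.array([1.0, 2.0])                 # one-step rewards under phi
f = np.array([0.0, 0.0])                     # terminal reward f(x_N)

v = f
for _ in range(3):                           # horizon N = 3
    v = r_phi + beta * P_phi @ v
print(v)   # v_3(x, phi, beta, f) for each initial state x
```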
Lecture 1: Model Definitions and Notations
Let N = {0, 1, . . .} and let R^n be the n-dimensional Euclidean space, R = R^1. A Markov Decision Process (MDP) is defined through the following objects:
a state space X (we also use the notations I and S for the state…
AMS 556 Dynamic Programming Homework Set 1
[Diagram: two states 1 and 2; the actions a and b and the transition probabilities 1/2, 1/2 are marked on the edges.]
Consider an MDP with two states 1 and 2 presented on the diagram.
There is one action at state 1 and two actions a and b at state 2.
Problem 1
Consider the policy that always selects action a at s…
AMS 556 Dynamic Programming Homework Set 2
Problem 1. Is it true that the GCC implies v+(x, π) < ∞ for all x and for all randomized Markov policies π?
Problem 2. Is it true that the GCC implies v+(x, π) < ∞ for all x and for all nonrandomized Markov policies π?
Problem 3. Consider a Pos…
AMS 556 Dynamic Programming Homework Assignment 3
Problem 1. Let X = {(0, 0)} ∪ {(i, j) : i = 1, 2, . . . , j = −i + 1, . . . , 0, 1} and A = {c, s}, where c stands for continue and s stands for stop. In states (i, j) where j < 0 or j = 1, we have…
AMS 556 Dynamic Programming Homework Set 1
[Diagram: two states 1 and 2; the actions a and b and the transition probabilities 1/2, 1/2 are marked on the edges.]
Consider an MDP with two states 1 and 2 presented on the diagram.
There is one action at state 1 and two actions a and b at state 2.
Problem 1
Consider a policy: if x1 = 1, then we always select act…
AMS 556 Dynamic Programming - Homework 03 Solutions
1) Example 5.8
On each edge, we indicate the reward (r), the action (a), and the transition probability (p) in the following form:
    r a(p)
where r takes the value 0 or 1, and a takes the value s or c. If (p) does not appear in…