Markov Decision Processes
CS 221
Section 6
October 30, 2009
Today we will discuss several sample MDP problems. The solutions are included here, so you
can work through them on your own if you like.
1. MDPs with Random Stopping Times
Suppose we have a Markov Decision Process (MDP) M = (S, A, P_sa, γ, R), where S is a discrete state space with n states, and the rewards are discounted by a factor γ. (Recall that P_sa(s′) is the “transition model” and R(s) is the reward function.) We can view this process as a game where we begin in some state s_0 ∈ S and take turns selecting actions and transitioning to new states, accumulating rewards along the way. At the n-th turn, we first receive some (discounted) reward, γ^n R(s), for the current state s. Then, we select an action, a ∈ A, and transition, randomly, to a new state s′ according to the probabilities P_sa(s′). Since the discount factor is γ < 1, our rewards become smaller and smaller as the game goes on. (Hence, the optimal strategy will try to accumulate big rewards early.)
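To make the turn structure concrete, here is a minimal simulation sketch of one play of this game. The dictionary encodings of P, R, and the policy are hypothetical illustrations; the problem itself doesn't fix a representation:

    import random

    def rollout(P, R, gamma, policy, s0, horizon=100):
        """Simulate one play of the discounted game and return the total reward.

        P[s][a] is a dict mapping each next state s' to P_sa(s'); R[s] is the
        reward for state s; policy[s] is the action chosen in state s.
        """
        total, s = 0.0, s0
        for n in range(horizon):
            total += (gamma ** n) * R[s]    # discounted reward at turn n
            a = policy[s]
            next_states, probs = zip(*P[s][a].items())
            s = random.choices(next_states, weights=probs)[0]  # sample s' ~ P_sa
        # the real game is infinite; a finite horizon truncates it, which is a
        # good approximation for gamma < 1 since later rewards shrink to zero
        return total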
Now consider a slight modification of this game. At the start of each turn we receive an undiscounted reward, R(s), and then flip a biased coin that lands heads with probability ε, 0 < ε ≤ 1. If the coin lands heads, then the game is stopped and we are left with whatever reward we have accumulated thus far. Otherwise, we choose our action and we transition to the next state according to P_sa, as usual. We will now show that this new game can be expressed as an MDP. In addition, we'll also show that the value of this game (i.e., the largest reward we expect to gain from playing it) is equivalent to the discounted reward in the original MDP, M.
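Before formalizing this, note the intuition: the game survives n turns with probability (1 − ε)^n, so the survival probability plays the same role in the new game that the discount factor γ^n plays in M. A sketch of the modified game, reusing the hypothetical structures from the rollout above:

    def rollout_with_stopping(P, R, eps, policy, s0):
        """Simulate the modified game: an undiscounted reward each turn, and
        the game stops (the coin lands heads) with probability eps."""
        total, s = 0.0, s0
        while True:
            total += R[s]                   # undiscounted reward this turn
            if random.random() < eps:       # biased coin lands heads
                return total                # game over; keep what we have
            a = policy[s]
            next_states, probs = zip(*P[s][a].items())
            s = random.choices(next_states, weights=probs)[0]

Averaging many such rollouts with eps = 1 − γ should approach the average discounted reward from the earlier rollout sketch; this is the equivalence the problem makes precise.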
Define a new MDP, M̃ = (S̃, A, P̃_sa, 1, R̃). This MDP has the same action space as M, but the discount factor is 1, and we have a different state space, transition model, and reward function. We'll construct the MDP M̃ so that it is just like the MDP M, but with some modifications to include the coin-flipping rules defined above.
In particular, we're going to add a new state called the “sink” state, which we'll denote e. If the coin toss comes up heads, then we'll transition, always, to this state and remain there forever (accumulating 0 reward each turn). If the coin toss is tails, then we'll just transition according to P_sa as before, with no chance of entering the sink state, e.
(a) Complete the construction by specifying explicitly S̃, P̃_sa, and R̃ for the new MDP, M̃.
Answer: Let S̃ = S ∪ {e}, where e is the new sink state.
Now, let's assume we're in a state s ∈ S (i.e., s ≠ e). Then we have:

    P̃_sa(s′ | heads) = { 1    if s′ = e,
                          0    if s′ ∈ S,

and

    P̃_sa(s′ | tails) = { 0           if s′ = e,
                          P_sa(s′)    if s′ ∈ S.
Thus, using p(heads) = ε, we can derive:

    P̃_sa(s′) = P̃_sa(s′ | heads) p(heads) + P̃_sa(s′ | tails) p(tails)
              = { ε                     if s′ = e,
                  (1 − ε) · P_sa(s′)    if s′ ∈ S.

To complete the construction, we also need the transitions out of the sink state and the new reward function. For every action a, let P̃_ea(e) = 1, so that once we enter e we remain there forever. Finally, let R̃(s) = R(s) for all s ∈ S, and R̃(e) = 0, since we accumulate no further reward once the game has stopped.
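As a sanity check on this derivation, here is a small sketch (using the same hypothetical dict representation as above) that builds P̃_sa from P_sa and ε and verifies that every row is still a probability distribution:

    def build_P_tilde(P, eps, actions, sink="e"):
        """With probability eps every transition is redirected to the sink e;
        otherwise we follow the original P_sa. The sink absorbs forever."""
        P_tilde = {}
        for s, dists in P.items():
            P_tilde[s] = {}
            for a, dist in dists.items():
                new_dist = {s2: (1 - eps) * p for s2, p in dist.items()}
                new_dist[sink] = eps                       # P~_sa(e) = eps
                P_tilde[s][a] = new_dist
        P_tilde[sink] = {a: {sink: 1.0} for a in actions}  # P~_ea(e) = 1
        # each non-sink row sums to eps + (1 - eps) * 1 = 1, as required
        for dists in P_tilde.values():
            for dist in dists.values():
                assert abs(sum(dist.values()) - 1.0) < 1e-9
        return P_tilde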