On Polynomial Cases of the Unichain Classification Problem
for Markov Decision Processes
Eugene A. Feinberg
*
Department of Applied Mathematics and Statistics
State University of New York at Stony Brook
Stony Brook, NY 11794-3600
Fenghsu Yang
†
Department of Applied Mathematics and Statistics
State University of New York at Stony Brook
Stony Brook, NY 11794-3600
Abstract

The unichain classification problem detects whether a finite state and action MDP is unichain under all deterministic policies. This problem is NP-hard [11]. This paper provides polynomial algorithms for the unichain classification of an MDP with either a state that is recurrent under all deterministic policies or a state that is absorbing under some action.

Keywords: Markov Decision Process; Unichain condition; Recurrent state
1 Introduction
This paper deals with discrete-time Markov Decision Processes (MDPs) with finite state and action sets. The probability structure of an MDP is defined by a state space S = {1, ..., N}, finite sets of actions A(i) for all i ∈ S, and transition probabilities p(j | i, a), where i, j ∈ S and a ∈ A(i). A deterministic policy ϕ is a function from S to ∪_{i∈S} A(i) which assigns an action ϕ(i) ∈ A(i) to each state i ∈ S. Each deterministic policy defines a stochastic matrix P(ϕ) = (p(j | i, ϕ(i)))_{i,j=1,...,N}. This stochastic matrix can be viewed as the transition matrix of a homogeneous Markov chain. A transition matrix defines which states of the Markov chain are recurrent, transient, and equivalent.
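To make the construction of P(ϕ) concrete, here is a minimal sketch on a hypothetical toy MDP (states 0-indexed for convenience; the dictionary `p`, the policy `phi`, and the function name `transition_matrix` are illustrative, not from the paper):

```python
# Hypothetical toy MDP: states S = {0, 1, 2} and transition probabilities
# p(j | i, a) stored as a dict mapping (state, action) -> probability row.
N = 3
p = {
    (0, 'a'): [0.0, 1.0, 0.0],
    (0, 'b'): [0.5, 0.0, 0.5],
    (1, 'a'): [0.0, 0.0, 1.0],
    (2, 'a'): [0.0, 1.0, 0.0],
}

def transition_matrix(policy, N):
    """Assemble P(phi) = (p(j | i, phi(i)))_{i,j} row by row."""
    return [p[(i, policy[i])] for i in range(N)]

phi = {0: 'b', 1: 'a', 2: 'a'}   # one deterministic policy
P = transition_matrix(phi, N)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)  # rows are stochastic
```

Each choice of policy thus picks one probability row per state, and the resulting matrix is the transition matrix of the induced homogeneous Markov chain.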
We say that a state i ∈ S is transient (recurrent) if it is transient (recurrent) under all deterministic policies. An MDP is called multichain if the transition matrix corresponding to at least one deterministic policy ϕ contains two or more nonempty recurrent classes. Otherwise, an MDP is called unichain. Thus, under any deterministic policy, the state space of a unichain MDP consists of a single recurrent class plus a possibly empty set of transient states.

* [email protected], Tel: 1-631-632-8179.
† [email protected]
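This definition can be checked directly, though not efficiently, by enumerating all deterministic policies and counting the recurrent (closed communicating) classes of each induced chain. The following sketch uses hypothetical toy data and illustrative function names; the brute-force enumeration is exponential in the number of states, consistent with the NP-hardness of the general problem:

```python
from itertools import product

# Hypothetical 2-state example: in each state, action 'a' moves to the
# other state and action 'b' stays put.
N = 2
A = {0: ['a', 'b'], 1: ['a', 'b']}
p = {
    (0, 'a'): [0.0, 1.0], (0, 'b'): [1.0, 0.0],
    (1, 'a'): [1.0, 0.0], (1, 'b'): [0.0, 1.0],
}

def recurrent_class_count(P):
    """Number of recurrent (closed communicating) classes of the chain P."""
    n = len(P)
    reach = []                       # reach[i] = states reachable from i
    for i in range(n):
        seen, stack = {i}, [i]
        while stack:
            u = stack.pop()
            for v in range(n):
                if P[u][v] > 0 and v not in seen:
                    seen.add(v)
                    stack.append(v)
        reach.append(seen)
    classes = set()
    for i in range(n):
        cls = frozenset(j for j in reach[i] if i in reach[j])
        if reach[i] == cls:          # class is closed, hence recurrent
            classes.add(cls)
    return len(classes)

def is_unichain(p, A, N):
    """True iff every deterministic policy yields one recurrent class."""
    for choice in product(*(A[i] for i in range(N))):
        P = [p[(i, choice[i])] for i in range(N)]
        if recurrent_class_count(P) > 1:
            return False             # this policy makes the MDP multichain
    return True

# Policy ('b', 'b') gives two absorbing states, so this MDP is multichain.
assert is_unichain(p, A, N) is False
```

The paper's contribution is precisely to avoid this exponential enumeration in the two polynomial cases named in the abstract.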
The unichain property is important for MDPs with average rewards per unit time. In particular, stronger results on the existence of optimal policies hold and better algorithms are available for unichain MDPs than for general MDPs; see Kallenberg [8] for details. Since Howard [6] introduced the policy iteration algorithm, unichain MDPs have been treated separately from general MDPs; see e.g. [4, 5, 8, 10]. Kallenberg [7] studied irreducibility, communicating, weakly communicating, and unichain classification problems for MDPs. For the first three problems, Kallenberg [7] constructed polynomial algorithms. For the unichain classification problem, Kallenberg [7], [8, p. 41] posed the question of whether a polynomial algorithm exists. Tsitsiklis [11] answered this question by proving that the unichain classification problem is NP-hard.