# On Polynomial Cases of the Unichain Classification Problem for Markov Decision Processes

Eugene A. Feinberg* and Fenghsu Yang†
Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11794-3600

*[email protected], Tel: 16316328179. †[email protected]

**Abstract.** The unichain classification problem detects whether a finite state and action MDP is unichain under all deterministic policies. This problem is NP-hard [11]. This paper provides polynomial algorithms for the unichain classification of an MDP with either a state that is recurrent under all deterministic policies or a state that is absorbing under some action.

**Keywords:** Markov Decision Process; Unichain condition; Recurrent state

## 1 Introduction

This paper deals with discrete-time Markov Decision Processes (MDPs) with finite state and action sets. The probability structure of an MDP is defined by a state space S = {1, ..., N}, finite sets of actions A(i) for all i ∈ S, and transition probabilities p(j | i, a), where i, j ∈ S and a ∈ A(i). A deterministic policy φ is a function from S to ∪_{i ∈ S} A(i) which assigns an action φ(i) ∈ A(i) to each state i ∈ S. Each deterministic policy φ defines a stochastic matrix P(φ) = (p(j | i, φ(i)))_{i,j = 1, ..., N}. This stochastic matrix can be viewed as the transition matrix of a homogeneous Markov chain. The transition matrix determines which states of the Markov chain are recurrent, which are transient, and which are equivalent.

We say that a state i ∈ S is *transient* (*recurrent*) if it is transient (recurrent) under all deterministic policies. An MDP is called *multichain* if the transition matrix corresponding to at least one deterministic policy φ contains two or more nonempty recurrent classes. Otherwise, an MDP is called *unichain*.
Thus, under any deterministic policy, the state space of a unichain MDP consists of a single recurrent class plus a possibly empty set of transient states.

The unichain property is important for MDPs with average rewards per unit time. In particular, stronger results on the existence of optimal policies hold and better algorithms are available for unichain MDPs than for general MDPs; see Kallenberg [8] for details. Since Howard [6] introduced the policy iteration algorithm, unichain MDPs have been treated separately from general MDPs; see e.g. [4, 5, 8, 10]. Kallenberg [7] studied the irreducibility, communicating, weakly communicating, and unichain classification problems for MDPs. For the first three problems, Kallenberg [7] constructed polynomial algorithms. For the unichain classification problem, Kallenberg [7], [8, p. 41] posed the question of whether a polynomial algorithm exists. Tsitsiklis [11] solved this problem by proving that the unichain classification problem is NP-hard.
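Note that for a *single fixed* policy, the chain structure of P(φ) is checkable in polynomial time (a recurrent class is a closed communicating class); the NP-hardness arises only from quantifying over all deterministic policies. A minimal sketch (helper names are my own, not from the paper) that finds the recurrent classes of one transition matrix, so that the chain of a given policy is unichain exactly when one class is returned:

```python
def recurrent_classes(P, eps=1e-12):
    """Recurrent classes of a finite Markov chain with transition matrix P.

    A state i is recurrent iff every state reachable from i can reach i back;
    the recurrent classes are then the closed communicating classes.
    """
    n = len(P)
    # reach[i][j]: can j be reached from i (in zero or more steps)?
    reach = [[P[i][j] > eps or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                 # boolean transitive closure, O(n^3)
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    reach[i][j] = reach[i][j] or reach[k][j]
    recurrent = [i for i in range(n)
                 if all(reach[j][i] for j in range(n) if reach[i][j])]
    classes, seen = [], set()
    for i in recurrent:                # group mutually reachable recurrent states
        if i not in seen:
            cls = {j for j in recurrent if reach[i][j] and reach[j][i]}
            seen |= cls
            classes.append(cls)
    return classes

# States {0, 1} communicate and state 2 is absorbing: two recurrent classes,
# so a policy inducing this matrix witnesses that the MDP is multichain.
P_multi = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
```

Running this check for every deterministic policy is of course infeasible in general, since the number of policies is exponential in N; the paper's contribution is identifying classes of MDPs where the quantification over policies can be avoided.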
