On Polynomial Cases of the Unichain Classification Problem for Markov Decision Processes

Eugene A. Feinberg*
Department of Applied Mathematics and Statistics
State University of New York at Stony Brook
Stony Brook, NY 11794-3600

Fenghsu Yang†
Department of Applied Mathematics and Statistics
State University of New York at Stony Brook
Stony Brook, NY 11794-3600

* efeinberg@notes.cc.sunysb.edu, Tel: 1-631-632-8179.
† fyang@ic.sunysb.edu

Abstract

The unichain classification problem asks whether a finite state and action MDP is unichain under all deterministic policies. This problem is NP-hard [11]. This paper provides polynomial algorithms for the unichain classification of an MDP that has either a state recurrent under all deterministic policies or a state absorbing under some action.

Keywords: Markov Decision Process; Unichain condition; Recurrent state

1 Introduction

This paper deals with discrete-time Markov Decision Processes (MDPs) with finite state and action sets. The probability structure of an MDP is defined by a state space S = {1, ..., N}, finite sets of actions A(i) for all i ∈ S, and transition probabilities p(j | i, a), where i, j ∈ S and a ∈ A(i). A deterministic policy ϕ is a function from S to ∪_{i∈S} A(i) that assigns an action ϕ(i) ∈ A(i) to each state i ∈ S. Each deterministic policy defines a stochastic matrix P(ϕ) = (p(j | i, ϕ(i)))_{i,j=1,...,N}. This stochastic matrix can be viewed as the transition matrix of a homogeneous Markov chain. A transition matrix determines which states of the Markov chain are recurrent, transient, and equivalent.

We say that a state i ∈ S is transient (recurrent) if it is transient (recurrent) under all deterministic policies. An MDP is called multichain if the transition matrix corresponding to at least one deterministic policy ϕ contains two or more nonempty recurrent classes. Otherwise, an MDP is called unichain.
Thus, under any deterministic policy, the state space of a unichain MDP consists of a single recurrent class plus a possibly empty set of transient states.

The unichain property is important for MDPs with average rewards per unit time. In particular, stronger results on the existence of optimal policies hold and better algorithms are available for unichain MDPs than for general MDPs; see Kallenberg [8] for details. Since Howard [6] introduced the policy iteration algorithm, unichain MDPs have been treated separately from general MDPs; see, e.g., [4, 5, 8, 10]. Kallenberg [7] studied the irreducibility, communicating, weakly communicating, and unichain classification problems for MDPs. For the first three problems, Kallenberg [7] constructed polynomial algorithms. For the unichain classification problem, Kallenberg [7], [8, p. 41] posed the question of whether a polynomial algorithm exists. Tsitsiklis [11] answered this question by proving that the unichain classification problem is NP-hard.
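To make the definitions above concrete, the following sketch (illustrative only, not an algorithm from this paper) classifies a small MDP by brute force: it enumerates every deterministic policy ϕ, builds the stochastic matrix P(ϕ), and counts the recurrent classes of the resulting Markov chain. The data-structure choices (a dict A of action sets, a dict p of transition rows) are assumptions for the example; the exponential enumeration over all ∏_i |A(i)| policies is consistent with the NP-hardness of the general problem.

```python
# Illustrative brute-force unichain classification for a tiny MDP.
# The representation (S, A, p) is a hypothetical encoding chosen for
# this sketch; it is not notation prescribed by the paper.
from itertools import product

def recurrent_classes(P):
    """Recurrent classes of a finite Markov chain with transition
    matrix P (list of rows): the communicating classes with no
    transitions leaving them (closed classes)."""
    n = len(P)
    # reach[i] = states reachable from i in zero or more steps
    reach = [{i} for i in range(n)]
    for i in range(n):
        stack = [i]
        while stack:
            u = stack.pop()
            for v in range(n):
                if P[u][v] > 0 and v not in reach[i]:
                    reach[i].add(v)
                    stack.append(v)
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        # i is recurrent iff every state reachable from i leads back to i;
        # then reach[i] is exactly the closed class containing i.
        if all(i in reach[j] for j in reach[i]):
            cls = frozenset(reach[i])
            classes.append(cls)
            seen |= cls
    return classes

def is_unichain(S, A, p):
    """True iff no deterministic policy produces two or more
    nonempty recurrent classes.
    S: list of states; A: dict state -> list of actions;
    p: dict (i, a) -> transition-probability row over S."""
    for choice in product(*(A[i] for i in S)):
        # Stochastic matrix P(phi) for the policy phi(i) = choice[i]
        P = [p[(i, a)] for i, a in zip(S, choice)]
        if len(recurrent_classes(P)) >= 2:
            return False  # this policy is multichain
    return True
```

For example, a two-state MDP in which state 0 may either stay put or move to an absorbing state 1 is multichain (the policy that stays yields the two recurrent classes {0} and {1}), so is_unichain returns False on it. The point of the polynomial algorithms developed in this paper is precisely to avoid this exponential enumeration when a state recurrent under all policies, or absorbing under some action, is available.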