MATHEMATICS OF OPERATIONS RESEARCH
Vol. 30, No. 3, August 2005, pp. 545–561
ISSN 0364-765X, EISSN 1526-5471, 05, 3003, 0545
doi 10.1287/moor.1050.0148
© 2005 INFORMS

On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies

Shie Mannor
Department of Electrical and Computer Engineering, McGill University, 3480 University Street, Montreal, Québec, Canada H3A 2A7, [email protected], www.ece.mcgill.ca/˜shie/

John N. Tsitsiklis
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, [email protected], web.mit.edu/˜jnt/www/home.html

We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.

Key words: Markov decision processes; state-action frequencies; large deviations; empirical measure
MSC2000 subject classification: Primary: 90C40; secondary: 60F99, 60B10
OR/MS subject classification: Primary: probability, Markov processes; secondary: dynamic programming/optimal control, Markov finite state
History: Received March 6, 2003; revised April 6, 2004.

1. Introduction. We consider a Markov decision process (MDP) that satisfies a weak communication assumption and describe a polytope of possible state-action frequency vectors. We show that for every point in the polytope, there exists a policy that gets "very close" to that point.
More accurately, for every point in the polytope, we specify a policy that guarantees that the empirical state-action frequency vector converges to that point, with probability one. Moreover, we show that under the prescribed policy, the probability of a large distance between the point and the empirical state-action frequency vector decays exponentially with time. On the other hand, we show that no policy can “get far” from this polytope even without the weak communication assumption. Specifically, we show that the probability of a large distance between the empirical state-action frequency vector and the polytope decays exponentially with time, uniformly over all admissible policies. While the emphasis of this work is on bounds on the empirical frequencies, we also derive some apparently new results on state-action frequency polytopes. Under the weak communication assumption, our results establish that the polytope we consider is the same as the set of possible limits (both in expectation and almost surely) of the empirical frequency vector under different policies. This extends results in Derman [ 7 ] and Puterman [ 15 ], which assumed a unichain structure. These references also showed that every point in the polytope can be achieved by a stationary policy. In contrast, for the more general case that
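The central object above, the empirical state-action frequency vector, can be made concrete with a small simulation. The sketch below is purely illustrative and is not a construction from the paper: it defines a toy 2-state, 2-action MDP (the transition probabilities and the stationary randomized policy `pi` are made up for the example), runs it for `T` steps, and records the fraction of time each state-action pair is visited. Under a stationary policy in a communicating chain, these fractions converge to a fixed point of the kind of polytope the paper studies.

```python
import random

# Toy MDP (hypothetical, for illustration only): 2 states, 2 actions.
# P[s][a] = probability distribution over the next state.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}

# A stationary randomized policy: pi[s][a] = probability of action a in state s.
pi = {0: [0.5, 0.5], 1: [0.3, 0.7]}

def simulate(T, seed=0):
    """Run the MDP for T steps; return the empirical state-action frequencies,
    i.e. the fraction of the first T time steps spent in each (state, action) pair."""
    rng = random.Random(seed)
    counts = {(s, a): 0 for s in P for a in (0, 1)}
    s = 0
    for _ in range(T):
        a = 0 if rng.random() < pi[s][0] else 1   # sample action from the policy
        counts[(s, a)] += 1
        s = 0 if rng.random() < P[s][a][0] else 1  # sample the next state
    return {sa: c / T for sa, c in counts.items()}

freq = simulate(100_000)
# The empirical frequencies form a probability vector over state-action pairs.
assert abs(sum(freq.values()) - 1.0) < 1e-9
```

Running `simulate` with increasing `T` (or different seeds) shows the frequency vector settling down, which is the almost-sure convergence the paper establishes for each point of the polytope; the paper's large-deviations results quantify how fast the probability of a large deviation from that limit decays.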