MATHEMATICS OF OPERATIONS RESEARCH
Vol. 30, No. 3, August 2005, pp. 545–561
ISSN 0364-765X, EISSN 1526-5471
doi 10.1287/moor.1050.0148
© 2005 INFORMS

On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies

Shie Mannor
Department of Electrical and Computer Engineering, McGill University, 3480 University Street, Montreal, Québec, Canada H3A 2A7, [email protected], www.ece.mcgill.ca/~shie/

John N. Tsitsiklis
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, [email protected], web.mit.edu/~jnt/www/home.html

We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.

Key words: Markov decision processes; state-action frequencies; large deviations; empirical measure
MSC2000 subject classification: Primary: 90C40; secondary: 60F99, 60B10
OR/MS subject classification: Primary: probability, Markov processes; secondary: dynamic programming/optimal control, Markov finite state
History: Received March 6, 2003; revised April 6, 2004.

1. Introduction. We consider a Markov decision process (MDP) that satisfies a weak communication assumption and describe a polytope of possible state-action frequency vectors. We show that for every point in the polytope, there exists a policy that gets “very close” to that point.
More accurately, for every point in the polytope, we specify a policy that guarantees that the empirical state-action frequency vector converges to that point, with probability one. Moreover, we show that under the prescribed policy, the probability of a large distance between the point and the empirical state-action frequency vector decays exponentially with time. On the other hand, we show that no policy can “get far” from this polytope even without the weak communication assumption. Specifically, we show that the probability of a large distance between the empirical state-action frequency vector and the polytope decays exponentially with time, uniformly over all admissible policies. While the emphasis of this work is on bounds on the empirical frequencies, we also derive some apparently new results on state-action frequency polytopes. Under the weak communication assumption, our results establish that the polytope we consider is the same as the set of possible limits (both in expectation and almost surely) of the empirical frequency vector under different policies. This extends results in Derman [7] and Puterman [...
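To make the notion of an empirical state-action frequency vector concrete, the following is a minimal simulation sketch, not the authors' construction. Everything in it is an assumption introduced for illustration: the two-state, two-action transition table `P`, the randomized stationary `policy`, and the choice to measure closeness to the polytope through the usual flow-balance constraints (for each state, the frequency of being in it equals the frequency of entering it). The preview does not state the paper's exact definition of the polytope.

```python
import random

# Hypothetical MDP (assumption): P[s][a] is the next-state distribution
# over states [0, 1] when action a is taken in state s.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}

def empirical_frequencies(policy, steps, seed=0):
    """Simulate the chain and return the fraction of time each (s, a) was used."""
    rng = random.Random(seed)
    counts = {(s, a): 0 for s in P for a in P[s]}
    s = 0
    for _ in range(steps):
        # Draw an action from the (assumed stationary, randomized) policy.
        a = rng.choices([0, 1], weights=policy[s])[0]
        counts[(s, a)] += 1
        # Draw the next state from the transition distribution.
        s = rng.choices([0, 1], weights=P[s][a])[0]
    return {sa: c / steps for sa, c in counts.items()}

def balance_residual(freq):
    """Largest violation of the flow-balance constraints over all states."""
    worst = 0.0
    for s_next in P:
        outflow = sum(freq[(s_next, a)] for a in P[s_next])
        inflow = sum(freq[(s, a)] * P[s][a][s_next] for (s, a) in freq)
        worst = max(worst, abs(outflow - inflow))
    return worst

policy = {0: [0.5, 0.5], 1: [0.3, 0.7]}  # hypothetical action probabilities
freq = empirical_frequencies(policy, steps=100_000)
residual = balance_residual(freq)
print(freq, residual)
```

In this sketch the residual shrinks as `steps` grows, which is in the spirit of the statement that the empirical frequency vector cannot stay far from the polytope; the exponential decay rate itself is a result of the paper and is not verified here.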
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R.srikant during the Spring '09 term at University of Illinois, Urbana Champaign.