MATHEMATICS OF OPERATIONS RESEARCH
Vol. 30, No. 3, August 2005, pp. 545–561
issn 0364-765X, eissn 1526-5471
doi 10.1287/moor.1050.0148
© 2005 INFORMS

On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies

Shie Mannor
Department of Electrical and Computer Engineering, McGill University, 3480 University Street, Montreal, Québec, Canada H3A 2A7, [email protected], www.ece.mcgill.ca/~shie/

John N. Tsitsiklis
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, [email protected], web.mit.edu/~jnt/www/home.html

We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.

Key words: Markov decision processes; state-action frequencies; large deviations; empirical measure
MSC2000 subject classification: Primary: 90C40; secondary: 60F99, 60B10
OR/MS subject classification: Primary: probability, Markov processes; secondary: dynamic programming/optimal control, Markov finite state
History: Received March 6, 2003; revised April 6, 2004.

1. Introduction. We consider a Markov decision process (MDP) that satisfies a weak communication assumption and describe a polytope of possible state-action frequency vectors. We show that for every point in the polytope, there exists a policy that gets "very close" to that point.
More accurately, for every point in the polytope, we specify a policy that guarantees that the empirical state-action frequency vector converges to that point, with probability one. Moreover, we show that under the prescribed policy, the probability of a large distance between the point and the empirical state-action frequency vector decays exponentially with time. On the other hand, we show that no policy can "get far" from this polytope even without the weak communication assumption. Specifically, we show that the probability of a large distance between the empirical state-action frequency vector and the polytope decays exponentially with time, uniformly over all admissible policies. While the emphasis of this work is on bounds on the empirical frequencies, we also derive some apparently new results on state-action frequency polytopes. Under the weak communication assumption, our results establish that the polytope we consider is the same as the set of possible limits (both in expectation and almost surely) of the empirical frequency vector under different policies. This extends results in Derman [7] and Puterman [...
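To make the central object concrete, the following minimal sketch simulates a small MDP under a stationary randomized policy and computes the empirical state-action frequency vector, i.e., the fraction of time each (state, action) pair is visited up to time T. The transition probabilities and policy below are hypothetical illustrative numbers, not taken from the paper; the sketch only illustrates the definition, not the paper's constructions or bounds.

```python
import random

# Hypothetical 2-state, 2-action MDP: P[s][a] is the next-state distribution.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.7, 0.3], 1: [0.1, 0.9]},
}
# A stationary randomized policy: pi[s][a] = probability of action a in state s.
pi = {0: [0.5, 0.5], 1: [0.3, 0.7]}

def empirical_frequencies(T, seed=0):
    """Run the chain for T steps and return the empirical state-action
    frequency vector f_T, indexed by (state, action) pairs."""
    rng = random.Random(seed)
    counts = {(s, a): 0 for s in P for a in P[s]}
    s = 0
    for _ in range(T):
        a = 0 if rng.random() < pi[s][0] else 1  # sample action from pi(. | s)
        counts[(s, a)] += 1
        s = 0 if rng.random() < P[s][a][0] else 1  # sample next state
    return {sa: c / T for sa, c in counts.items()}

f = empirical_frequencies(100_000)
```

For large T, the vector f is nonnegative, sums to one, and approximately satisfies the balance ("flow") constraints sum_a f(s, a) = sum_{s', a'} f(s', a') P(s | s', a') that, together with nonnegativity and normalization, define the state-action frequency polytope discussed in the paper.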
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at the University of Illinois, Urbana-Champaign.