MATHEMATICS OF OPERATIONS RESEARCH
Vol. 30, No. 3, August 2005, pp. 545–561
ISSN 0364-765X, EISSN 1526-5471
DOI 10.1287/moor.1050.0148
© 2005 INFORMS
On the Empirical State-Action Frequencies in Markov
Decision Processes Under General Policies
Shie Mannor
Department of Electrical and Computer Engineering, McGill University, 3480 University Street,
Montreal, Québec, Canada H3A 2A7, [email protected], www.ece.mcgill.ca/~shie/
John N. Tsitsiklis
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology,
Cambridge, Massachusetts 02139, [email protected], web.mit.edu/~jnt/www/home.html
We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state
Markov decision processes under general policies. We define a certain polytope and establish that every element of
this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we
show that the probability of exceeding a given distance between the empirical frequency vector and the polytope
decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.
Key words: Markov decision processes; state-action frequencies; large deviations; empirical measure
MSC2000 subject classification: Primary: 90C40; secondary: 60F99, 60B10
OR/MS subject classification: Primary: probability, Markov processes; secondary: dynamic programming/optimal control, Markov finite state
History: Received March 6, 2003; revised April 6, 2004.
1. Introduction.
We consider a Markov decision process (MDP) that satisfies a weak
communication assumption and describe a polytope of possible state-action frequency
vectors. We show that for every point in the polytope, there exists a policy that gets "very
close" to that point. More precisely, for every point in the polytope, we specify a policy
under which the empirical state-action frequency vector converges to that point with
probability one. Moreover, we show that under the prescribed policy, the probability of
a large distance between the point and the empirical state-action frequency vector decays
exponentially with time. Conversely, we show that no policy can "get far" from this
polytope, even without the weak communication assumption. Specifically, we show that the
probability of a large distance between the empirical state-action frequency vector and the
polytope decays exponentially with time, uniformly over all admissible policies.
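As a concrete illustration (a minimal sketch with a hypothetical two-state, two-action MDP; the transition kernel P and the policy parameterization below are invented for the example, not taken from the paper), the empirical state-action frequency vector simply records the fraction of time steps spent in each state-action pair. Under a stationary randomized policy it should, for large horizons, approximately satisfy the balance equations that define the state-action polytope:

```python
import random

# Hypothetical 2-state, 2-action MDP (illustrative only, not from the paper).
# P[s][a] = probability of transitioning to state 1 from state s under action a.
P = {0: {0: 0.2, 1: 0.8},
     1: {0: 0.5, 1: 0.9}}

def empirical_frequencies(policy, T, seed=0):
    """Simulate T steps and return f_T(s, a): the fraction of time steps at
    which the chain was in state s and action a was chosen."""
    rng = random.Random(seed)
    counts = {(s, a): 0 for s in (0, 1) for a in (0, 1)}
    s = 0
    for _ in range(T):
        a = 1 if rng.random() < policy[s] else 0   # policy[s] = Pr(a = 1 | s)
        counts[(s, a)] += 1
        s = 1 if rng.random() < P[s][a] else 0
    return {sa: c / T for sa, c in counts.items()}

f = empirical_frequencies(policy={0: 0.5, 1: 0.5}, T=200_000)

# Points x of the state-action polytope satisfy the balance equations
#   sum_a x(s, a) = sum_{s', a'} Pr(s | s', a') x(s', a')   for every state s.
# The empirical vector satisfies them up to O(1/sqrt(T)) fluctuations.
inflow_1 = sum(f[(s, a)] * P[s][a] for s in (0, 1) for a in (0, 1))
outflow_1 = f[(1, 0)] + f[(1, 1)]
assert abs(inflow_1 - outflow_1) < 0.02
```

The convergence results of the paper say much more than this law-of-large-numbers behavior: they identify exactly which limit points are achievable (the whole polytope) and give exponential bounds on the probability of being far from it, uniformly over all policies.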
While the emphasis of this work is on bounds on the empirical frequencies, we also
derive some apparently new results on state-action frequency polytopes. Under the weak
communication assumption, our results establish that the polytope we consider is the same as
the set of possible limits (both in expectation and almost surely) of the empirical frequency
vector under different policies. This extends results in Derman [7] and Puterman [15],
which assumed a unichain structure. These references also showed that every point in the
polytope can be achieved by a stationary policy. In contrast, for the more general case that