long term gratification (alternative 1).
[Figure 4.10: A Markov decision problem with two alternatives in state 2. Two panels, Decision 1
and Decision 2; the surviving labels are transition probabilities 0.01 and 0.99 and rewards
$r_1 = 0$, $r_2^{(1)} = 1$, $r_2^{(2)} = 50$.]
It is also possible to consider the situation in which the rewards for each decision are
associated with transitions; that is, for decision $k$ in state $i$, the reward $r_{ij}^{(k)}$ is
associated with a transition from $i$ to $j$. This means that the expected reward for a transition
from $i$ with decision $k$ is given by $r_i^{(k)} = \sum_j P_{ij}^{(k)} r_{ij}^{(k)}$. Thus, as in the
previous section, there is no essential loss in generality in restricting attention to the case in
which rewards are associated with the states.
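The reduction from transition rewards to state rewards is a single weighted sum per state. The following Python sketch evaluates $r_i^{(k)} = \sum_j P_{ij}^{(k)} r_{ij}^{(k)}$ for one fixed decision $k$. The function name `expected_state_reward` and the two-state numbers are illustrative assumptions and do not come from the text.

```python
def expected_state_reward(P_k, r_k):
    """Return the list of r_i^(k) = sum_j P_ij^(k) * r_ij^(k), one entry per state i.

    P_k[i][j]: transition probability from state i to state j under decision k.
    r_k[i][j]: reward attached to the transition i -> j under decision k.
    """
    return [sum(p * r for p, r in zip(P_row, r_row))
            for P_row, r_row in zip(P_k, r_k)]


# Hypothetical two-state example; the numbers are made up for illustration.
P = [[0.5, 0.5],
     [0.25, 0.75]]
r_trans = [[0.0, 2.0],
           [4.0, 8.0]]

r_state = expected_state_reward(P, r_trans)
print(r_state)
```

Once `r_state` has been computed, the transition rewards are no longer needed: every later formula only uses the per-state expected reward.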
The set of rules used by the decision maker in selecting different alternatives at each stage
of the chain is called a policy. We want to consider the expected aggregate reward over $n$
trials of the “Markov chain,” as a function of the policy used by the decision maker. If the
policy uses the same decision, say $k_i$, at each occurrence of state $i$, for each $i$, then that
policy corresponds to a homogeneous Markov chain with transition probabilities $P_{ij}^{(k_i)}$. We
denote the matrix of these transition probabilities as $[P^{\boldsymbol{k}}]$, where
$\boldsymbol{k} = (k_1, \ldots, k_M)$. Such a policy, i.e., making the decision for each state $i$
independent of time, is called a stationary policy. The aggregate reward for any such stationary
policy was found in the previous section. Since both rewards and transition probabilities depend
only on the state and the corresponding decision, and not on time, one feels intuitively that
stationary policies make a certain amount of sense over a long period of time. On the other hand,
assuming some final reward $u_i$ for being in state $i$ at the end of the $n$th trial, one might
expect the best policy to depend on time, at least close to the end of the $n$ trials.
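To make the notation concrete, here is a minimal Python sketch of how a decision vector $\boldsymbol{k}$ picks out one row per state to form $[P^{\boldsymbol{k}}]$ and the matching reward vector. The nested-list layout, the name `stationary_policy`, and the 0-based indexing are assumptions made for illustration. The numbers are loosely modeled on the labels that survive from Figure 4.10; in particular, which probability sits on which transition is an assumption.

```python
# P_alt[i][k] is the row of transition probabilities out of state i under
# decision k; r_alt[i][k] is the matching reward. Indices are 0-based, so
# "state 2, decision 2" in the text is P_alt[1][1] here.
# ASSUMPTION: values loosely follow the Figure 4.10 labels.
P_alt = [
    [[0.99, 0.01]],                # state 1: a single alternative
    [[0.01, 0.99], [1.0, 0.0]],    # state 2: decision 1, decision 2
]
r_alt = [
    [0.0],
    [1.0, 50.0],
]


def stationary_policy(P_alt, r_alt, k):
    """Return ([P^k], r^k) for the decision vector k = (k_1, ..., k_M)."""
    P_k = [P_alt[i][k_i] for i, k_i in enumerate(k)]
    r_k = [r_alt[i][k_i] for i, k_i in enumerate(k)]
    return P_k, r_k


# Stationary policy that always uses decision 2 in state 2.
P_k, r_k = stationary_policy(P_alt, r_alt, k=(0, 1))
print(P_k, r_k)
```

Because the same `k` is used at every stage, the resulting chain is homogeneous and the reward analysis for a fixed transition matrix applies unchanged.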
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING

In what follows, we first derive the optimal policy for maximizing expected aggregate reward
over an arbitrary number $n$ of trials. We shall see that the decision at time $m$, $0 \le m < n$,
for the optimal policy does in fact depend both on $m$ and on the final rewards
$\{u_i;\ 1 \le i \le M\}$. We call this optimal policy the optimal dynamic policy. This policy is
found from the dynamic programming algorithm, which, as we shall see, is conceptually very simple.
We then go on to find the relationship between the optimal dynamic policy and the optimal
stationary policy and show that each has the same long term gain per trial.

4.6.2 Dynamic programming algorithm

As in our development of Markov chains with rewards, we consider expected aggregate
reward over $n$ time periods and we use stages, counting backwards from the final trial.
First consider the optimum decision with just one trial (i.e., with just one stage). We start
in a given state $i$ at stage 1, make a decision $k$, obtain the reward $r_i^{(k)}$, then go to some
state $j$ with probability $P_{ij}^{(k)}$ and obtain the final reward $u_j$. This expected aggregate
reward is maximized over the choice of $k$, i.e.,

$$v_i^*(1, \boldsymbol{u}) = \max_k \Big\{ r_i^{(k)} + \sum_j P_{ij}^{(k)} u_j \Big\}. \tag{4.52}$$
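A small numerical sketch may help show how the one-stage maximum in (4.52) reacts to the final reward vector. The Python below evaluates the bracketed expression for every alternative in every state and keeps the best one. The function name `one_stage_optimum`, the nested-list layout, and all numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def one_stage_optimum(P_alt, r_alt, u):
    """Evaluate (4.52) for every state.

    P_alt[i][k]: row of transition probabilities out of state i under decision k.
    r_alt[i][k]: reward for decision k in state i.
    u[j]:        final reward for ending in state j.
    Returns (values, decisions); ties go to the lowest-numbered decision.
    """
    values, decisions = [], []
    for rows, rewards in zip(P_alt, r_alt):
        candidates = [r + sum(p * uj for p, uj in zip(row, u))
                      for row, r in zip(rows, rewards)]
        best = max(range(len(candidates)), key=candidates.__getitem__)
        values.append(candidates[best])
        decisions.append(best)
    return values, decisions


# Hypothetical data: state 0 has one alternative, state 1 has two (0-based).
P_alt = [[[0.5, 0.5]],
         [[0.5, 0.5], [1.0, 0.0]]]
r_alt = [[0.0],
         [1.0, 4.0]]

with_final_reward = one_stage_optimum(P_alt, r_alt, u=[0.0, 10.0])
without_final_reward = one_stage_optimum(P_alt, r_alt, u=[0.0, 0.0])
print(with_final_reward)
print(without_final_reward)
```

In this made-up example the best decision in the second state changes when the final reward for ending there is removed, which is the sense in which the one-stage optimum depends on $\boldsymbol{u}$.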
We use the notation $v_i^*(n, \boldsymbol{u})$ to represent the maximum expected aggregate reward
for $n$ stages starting in state $i$. Note that $v_i^*(1, \boldsymbol{u})$ depends on the final
reward vector $\boldsymbol{u} = (u_1, u_2, \ldots, u_M)^{\mathsf{T}}$. Next consider the maximum
expected aggregate reward starting in state $i$ at stage 2. For each state $j$, $1 \le j \le M$, let
$v_j(1, \boldsymbol{u})$ be the expected aggregate reward, over stages 1 and 0, for some arbitrary
policy, conditional on the chain being in state $j$ at stage 1. Then if decision $k$ is made in
state $i$ at stage 2, the expected aggregate reward from stage 2 onward is
$r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j(1, \boldsymbol{u})$. Note that no matter what policy is chosen at
stage 2, this expression is maximized at stage 1 by choosing the stage 1 policy that maximizes
$v_j(1, \boldsymbol{u})$. Thus, independent of what we choose at stage 2 (or at earlier times), we
must use $v_j^*(1, \boldsymbol{u})$ for the aggregate gain from stage 1 onward in order to maximize
the overall aggregate gain from stage 2. Thus, at stage 2, we achieve maximum expected aggregate
gain, $v_i^*(2, \boldsymbol{u})$, by choosing the $k$ that achieves the following maximum:

$$v_i^*(2, \boldsymbol{u}) = \max_k \Big\{ r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(1, \boldsymbol{u}) \Big\}. \tag{4.53}$$

Repeating this argument for successively larger $n$, we obtain the general expression
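
$$v_i^*(n, \boldsymbol{u}) = \max_k \Big\{ r_i^{(k)} + \sum_j P_{ij}^{(k)} v_j^*(n-1, \boldsymbol{u}) \Big\}. \tag{4.54}$$

Note that this is almost the same as (4.33), differing only by the maximization over $k$. We
can also write this in vector form, for $n \ge 1$, as

$$\boldsymbol{v}^*(n, \boldsymbol{u}) = \max_{\boldsymbol{k}} \big\{ \boldsymbol{r}^{\boldsymbol{k}} + [P^{\boldsymbol{k}}]\, \boldsymbol{v}^*(n-1, \boldsymbol{u}) \big\}, \tag{4.55}$$

where for $n = 1$, we take $\boldsymbol{v}^*(0, \boldsymbol{u}) = \boldsymbol{u}$. Here $\boldsymbol{k}$
is a set (or vector) of decisions, $\boldsymbol{k} = (k_1, k_2, \ldots, k_M)$, where $k_i$ is the
decision to be used in state $i$.

CHAPTER 4. FINITE-STATE MARKOV CHAINS

$[P^{\boldsymbol{k}}]$ denotes a matrix whose $(i, j)$ element is $P_{ij}^{(k_i)}$, and
$\boldsymbol{r}^{\boldsymbol{k}}$ denotes a vector whose $i$th element is $r_i^{(k_i)}$. The maximization
ove...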
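The recursion (4.54), started from $\boldsymbol{v}^*(0, \boldsymbol{u}) = \boldsymbol{u}$, can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the name `dynamic_program`, the nested-list layout, and the 0-based indices are assumptions. The numbers are loosely modeled on the labels that survive from Figure 4.10, and the placement of each probability on a transition is also an assumption.

```python
def dynamic_program(P_alt, r_alt, u, n):
    """Iterate (4.54) from v*(0, u) = u up to stage n.

    P_alt[i][k]: row of transition probabilities out of state i under decision k.
    r_alt[i][k]: reward for decision k in state i.
    Returns (v, policy): v[i] is v_i^*(n, u), and policy[m - 1][i] is the
    maximizing decision in state i when m stages remain.
    """
    v = list(u)                       # v*(0, u) = u
    policy = []
    for _ in range(n):
        new_v, decisions = [], []
        for rows, rewards in zip(P_alt, r_alt):
            # r_i^(k) + sum_j P_ij^(k) v_j^*(m - 1, u), one entry per decision k
            candidates = [r + sum(p * vj for p, vj in zip(row, v))
                          for row, r in zip(rows, rewards)]
            best = max(range(len(candidates)), key=candidates.__getitem__)
            new_v.append(candidates[best])
            decisions.append(best)
        v = new_v
        policy.append(decisions)
    return v, policy


# ASSUMPTION: two-state data loosely following the Figure 4.10 labels.
P_alt = [[[0.99, 0.01]],
         [[0.01, 0.99], [1.0, 0.0]]]
r_alt = [[0.0],
         [1.0, 50.0]]

v, policy = dynamic_program(P_alt, r_alt, u=[0.0, 0.0], n=2)
print(v, policy)
```

With zero final rewards and these assumed numbers, the second state takes its second alternative when one stage remains and its first alternative when two stages remain, so the maximizing decision varies with the stage. The cost is one pass over all alternatives per stage, with no search over whole policies.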
This note was uploaded on 09/27/2010 for the course EE 229 taught by Professor R. Srikant during the Spring '09 term at University of Illinois, Urbana-Champaign.
