Fig. 10.6. A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a
game; the dashed lines represent moves that we (our RL player) considered but did not make.

Here is how the tic-tac-toe problem would be approached using reinforcement
learning and approximate value functions. First we set up a table of numbers, one
for each possible state of the game. Each number will be the latest estimate of the
probability of our winning from that state. We treat this estimate as the state's
value, and the whole table is the learned value function. State A has higher value
than state B, or is considered "better" than state B, if the current estimate of the
probability of our winning from A is higher than it is from B. Assuming we
always play ◇s, then for all states with three ◇s in a row the probability of
winning is 1, because we have already won. Similarly, for all states with three Os
in a row, or that are "filled up," the correct probability is 0, as we cannot win
from them. We set the initial values of all the other states to 0.5, representing a
guess that we have a 50% chance of winning.
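
The initialization just described can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the board encoding (a 9-character string read row by row, with "X" standing in for our mark, "O" for the opponent, and a space for a blank) and the names `LINES`, `has_three_in_a_row`, `initial_value`, and `ValueTable` are all assumptions introduced here. The table is also filled lazily, on first lookup, rather than enumerated up front; the resulting initial values are the same.

```python
# All eight lines of three cells on a 3x3 board whose cells are indexed 0..8, row by row.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def has_three_in_a_row(state, mark):
    """Return True if `mark` occupies all three cells of some line."""
    return any(all(state[i] == mark for i in line) for line in LINES)


def initial_value(state, me="X", opponent="O"):
    """Initial estimate of our probability of winning from `state`."""
    if has_three_in_a_row(state, me):
        return 1.0  # we have already won
    if has_three_in_a_row(state, opponent) or " " not in state:
        return 0.0  # we have lost, or the board is filled up
    return 0.5      # initial guess: a 50% chance of winning


class ValueTable(dict):
    """Value function stored as a table; entries are created on first lookup."""

    def __missing__(self, state):
        self[state] = initial_value(state)
        return self[state]
```

With this encoding, `ValueTable()[" " * 9]` gives 0.5 for the empty board, and any state containing three of our marks in a row starts at 1.0.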
We play many games against the opponent. To select our moves we examine
the states that would result from each of our possible moves (one for each blank
space on the board) and look up their current values in the table. Most of the time,
we move greedily, selecting the move that leads to the state with the greatest
value, that is, with the highest estimated probability of winning. Occasionally,
however, we select randomly from among the other moves instead. These are
called exploratory moves because they cause us to experience states that we
might otherwise never see. A sequence of moves made and considered during a
game can be diagrammed as in Figure 10.6.
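
The move-selection rule above might look like the following sketch. The exploration probability `epsilon`, its default of 0.1, the helper names, and the 9-character string encoding of the board ("X" for our mark, a space for a blank) are assumptions made for illustration; the text only says that exploratory moves are made "occasionally". States missing from the table are treated as having the initial value 0.5.

```python
import random


def successor_states(state, me="X"):
    """States reachable by placing our mark in each blank cell of `state`."""
    return [state[:i] + me + state[i + 1:]
            for i, cell in enumerate(state) if cell == " "]


def choose_move(state, values, epsilon=0.1, rng=random):
    """Pick the next state; return (next_state, was_greedy).

    `state` must contain at least one blank cell. `values` maps states to
    estimated winning probabilities.
    """
    candidates = successor_states(state, "X")
    # Greedy choice: the successor with the highest estimated value.
    greedy = max(candidates, key=lambda s: values.get(s, 0.5))
    others = [s for s in candidates if s != greedy]
    # Occasionally explore: choose at random from among the other moves.
    if others and rng.random() < epsilon:
        return rng.choice(others), False
    return greedy, True
```

Returning the `was_greedy` flag lets the caller tell greedy moves from exploratory ones, which matters because values are backed up only after greedy moves.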
While we are playing, we change the values of the states in which we find
ourselves during the game. We attempt to make more accurate estimates of the
probabilities of winning. To do this, we “back up” the value of the state after
each greedy move to the state before the move, as suggested by the arrows in
Figure 10.6. More precisely, the current value of the earlier state is adjusted to be
closer to the value of the later state. This can be done by moving the earlier
state's value a fraction of the way toward the value of the later state. If we let
$s_n$ denote the state before the greedy move, and $s_{n+1}$ the state after the move, then
the update to the estimated value of $s_n$, denoted $V(s_n)$, can be written as

$$V(s_n) = V(s_n) + c\,\bigl(V(s_{n+1}) - V(s_n)\bigr) \qquad (10.14)$$
where $c$ is a small positive fraction called the step-size parameter, which
influences the rate of learning. This update rule is an example of a
temporal-difference learning method, so called because its changes are based on
a difference, $V(s_{n+1}) - V(s_n)$, between estimates at two different times.
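
As a small sketch of this update rule, the function below moves the earlier state's value a fraction `c` of the way toward the later state's value. The function name `td_update`, the dictionary representation of the table, the default step size of 0.1, and the fallback value 0.5 for states not yet in the table are assumptions made for illustration.

```python
def td_update(values, s_n, s_next, c=0.1, default=0.5):
    """Back up the value of s_next to s_n and return the new V(s_n).

    Implements V(s_n) = V(s_n) + c * (V(s_next) - V(s_n)), where `values`
    is a dict mapping states to estimated winning probabilities.
    """
    v_n = values.get(s_n, default)
    v_next = values.get(s_next, default)
    values[s_n] = v_n + c * (v_next - v_n)
    return values[s_n]


# Example: the later state is a win (value 1.0), the earlier one is unvisited.
values = {"after": 1.0}
new_value = td_update(values, "before", "after", c=0.1)  # about 0.55
```

Here the earlier state's estimate moves from 0.5 to about 0.55, one tenth of the way toward 1.0. Following the text, the update would be applied only after greedy moves, not after exploratory ones.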