Fig. 10.6. A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our RL player) considered but did not make.

Here is how the tic-tac-toe problem would be approached using reinforcement learning and approximate value functions. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has a higher value than state B, or is considered "better" than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play ◇s, then for all states with three ◇s in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are "filled up," the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.

We play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time, we move greedily, selecting the move that leads to the state with the greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. A sequence of moves made and considered during a game can be diagrammed as in Figure 10.6.

While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make more accurate estimates of the probabilities of winning. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 10.6. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state.
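The table setup and move selection described above can be sketched in Python as follows. This is a minimal illustration, not the book's implementation. The board encoding (a 9-tuple of "X", "O" and " ", with "X" standing in for our ◇), the names `ValueTable`, `initial_value` and `choose_move`, the lazy filling of the table, and the exploration probability `epsilon` are all assumptions introduced for the example.

```python
import random

# Index triples that form three in a row on a 3x3 board stored as a 9-tuple.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals


def has_line(board, mark):
    """True if `mark` occupies any complete row, column or diagonal."""
    return any(all(board[i] == mark for i in line) for line in LINES)


def initial_value(board, me="X", opp="O"):
    """Initial estimate of our probability of winning from `board`."""
    if has_line(board, me):
        return 1.0                      # we have already won
    if has_line(board, opp) or " " not in board:
        return 0.0                      # lost, or board filled up
    return 0.5                          # initial 50% guess


class ValueTable(dict):
    """Value table filled lazily (an assumption; the text builds it up front)."""

    def __missing__(self, board):
        self[board] = initial_value(board)
        return self[board]


def choose_move(values, board, me="X", epsilon=0.1, rng=random):
    """Return (next_state, was_greedy) for the player `me`.

    Assumes `board` has at least one blank square.
    """
    blanks = [i for i, cell in enumerate(board) if cell == " "]
    successors = [board[:i] + (me,) + board[i + 1:] for i in blanks]
    best = max(successors, key=lambda s: values[s])
    others = [s for s in successors if s != best]
    # Occasionally explore: pick at random among the non-greedy moves.
    if others and rng.random() < epsilon:
        return rng.choice(others), False
    return best, True
```

The second return value records whether the move was greedy, since the text backs up values only after greedy moves. Ties between equally valued successors are broken here by board order, which is a simplification.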
This can be done by moving the earlier state's value a fraction of the way toward the value of the later state. If we let $s_n$ denote the state before the greedy move, and $s_{n+1}$ the state after the move, then the update to the estimated value of $s_n$, denoted $V(s_n)$, can be written as

$$V(s_n) = V(s_n) + c\,\bigl(V(s_{n+1}) - V(s_n)\bigr) \qquad (10.14)$$

where $c$ is a small positive fraction called the step-size parameter, which influences the rate of learning. Here the equals sign denotes assignment: the right-hand side uses the old value of $V(s_n)$. This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, $V(s_{n+1}) - V(s_n)$, between estimates at two different times.
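The update in Equation (10.14) is short enough to sketch directly. The function name `td_update`, the use of a plain dictionary as the value table, the state labels, and the default step size of 0.1 are assumptions made for this illustration.

```python
def td_update(values, s_n, s_next, c=0.1):
    """Move V(s_n) a fraction c of the way toward V(s_next), as in (10.14).

    `values` maps states to current estimates; both states must already
    be present. Returns the new estimate for s_n.
    """
    values[s_n] = values[s_n] + c * (values[s_next] - values[s_n])
    return values[s_n]


# Worked example with two hypothetical states: "before" starts at the
# default guess of 0.5 and "after" is a winning state with value 1.
V = {"before": 0.5, "after": 1.0}
new_value = td_update(V, "before", "after", c=0.1)
# new_value is 0.5 + 0.1 * (1.0 - 0.5), i.e. approximately 0.55
```

Repeating the update pulls the earlier state's value steadily toward the later state's value, and a larger `c` makes each step bigger. In a full player the caller would apply this only after greedy moves, as the text specifies.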