A Modern Bayesian Look at the Multi-Armed Bandit

Steven L. Scott

August 9, 2010

Abstract

A multi-armed bandit is a particular type of experimental design where the goal is to accumulate the largest possible reward. Rewards come from a payoff distribution with unknown parameters that are to be learned through sequential experimentation. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods which are "optimal" in simpler contexts. We summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.

1 Introduction

A multi-armed bandit is a sequential experiment
with the goal of achieving the largest possible reward from a payoff distribution with unknown parameters. At each stage the experimenter must decide which arm of the experiment to observe next. The choice involves a fundamental trade-off between the utility gain from exploiting arms that appear to be doing well (based on limited sample information) and the information gain from exploring arms that might potentially be optimal, but which appear to be inferior because of sampling variability. This article reviews several techniques that have been used to manage the multi-armed bandit problem. Particular attention is paid to a technique known as randomized probability matching, which can be implemented quite simply in a modern Bayesian computing environment, and which can combine good ideas from both sequential and classical experimental design.

Multi-armed bandits have an important role to play in modern production environments that emphasize "continuous improvement," where products remain in a perpetual state of feature testing. Online software (such as a web site, an online advertisement, or a cloud service) is especially amenable to continuous improvement because experimental variation is easy to introduce, and because user responses to online stimuli are often quickly observed. Indeed, several frameworks for improving online services through experimentation have been developed. Google's Website Optimizer (Google, 2010) is one well known example. Designers provide Website Optimizer several versions of their website differing in font, image choice, layout, and other design elements. Website Optimizer randomly diverts traffic to the different configurations in search of configurations that have a high probability of producing successful outcomes, or conversions, as defined by the website owner. One disincentive for website owners to engage in online experiments is the fear that too much traffic will be diverted to inferior configurations in the name of experimental validity. Thus the exploration/exploitation trade-off arises because experimenters must weigh the potential gain of an increased conversion rate at the end of the experiment against the cost of a reduced conversion rate while it runs. Treating product improvement experiments as multi-armed bandits can dramatically reduce the cost of experimentation.
The name "multi-armed bandit" is an allusion to a "one-armed bandit," a colloquial term for a slot machine. The straightforward analogy is to imagine different website configurations as a row of slot machines, each with its own probability of producing a reward (i.e., a conversion). The multi-armed bandit problem is notoriously resistant to analysis (Whittle, 1979), though optimal solutions are available in certain special cases (Gittins, 1979; Bellman, 1956). Even then, the optimal solutions are hard to compute, rely on artificial discount factors, and fail to generalize to realistic reward distributions. They can also exhibit incomplete learning, meaning that there is a positive probability of playing the wrong arm forever (Brezzi and Lai, 2000).
Because of the drawbacks associated with optimal solutions, analysts often turn to heuristics to manage the exploration/exploitation trade-off. Randomized probability matching is a particularly appealing heuristic that plays each arm in proportion to its probability of being optimal. Randomized probability matching is easy to implement, broadly applicable, and combines several attractive features of other popular heuristics. Randomized probability matching is an old idea (Thompson, 1933, 1935), but modern Bayesian computation has dramatically broadened the class of reward distributions to which it can be applied.
The simplicity of randomized probability matching allows the multi-armed bandit to incorporate powerful ideas from classical design. For example, both bandits and classical experiments must face the exploding number of possible configurations as factors are added to the experiment. Classical experiments handle this problem using fractional factorial designs (see, e.g., Cox and Reid, 2000), which surrender the ability to fit certain complex interactions in order to reduce the number of experimental runs. These designs combat the curse of dimensionality by indirectly learning about an arm's reward distribution by examining rewards from other arms with similar characteristics. Bandits can use the fractional factorial idea by assuming that a model, such as a probit or logistic regression, determines the reward distributions of the different arms. Assuming a parametric model allows the bandit to focus on a lower dimensional parameter space and thus potentially achieve greater rewards than "optimal" solutions that make no parametric assumptions.
It is worth noting that there are also important differences between classical experiments and bandits. For example, the traditional optimality criteria from classical experiments (D-optimality, A-optimality, etc.) tend to produce balanced experiments where all treatment effects can be accurately estimated. In a multi-armed bandit it is actually undesirable to accurately estimate treatment effects (i.e., the parameters of the reward distribution) for inferior arms. Instead, the bandit aims to gather just enough information about a sub-optimal arm to determine that it is sub-optimal, at which point further exploration becomes wasteful. A second difference is the importance placed on statistical significance. Classical experiments are designed to be analyzed using methods that tightly control the Type I error rate under a null hypothesis of no effect. But when the cost of switching between products is small (as with software testing), the Type I error rate is of little relevance to the bandit. A Type I error corresponds to switching to a different arm that provides no material advantage over the current arm. By contrast, a Type II error means failing to switch to a superior arm, which could carry a substantial cost. Thus when switching costs are small almost all the costs lie in Type II errors, which makes the usual notion of statistical significance largely irrelevant. Finally, classical experiments typically focus on designs for linear models because the information matrix in a linear model is a function of the design matrix. Designs for nonlinear models like probit or logistic regression are complicated by the fact that the information matrix depends on unknown parameters (Chaloner and Verdinelli, 1995). This complication presents no particular difficulty to the multi-armed bandit played under randomized probability matching.
The remainder of this paper is structured as follows. Section 2 describes the principle of randomized probability matching in greater detail. Section 3 reviews other approaches for multi-armed bandits, including the Gittins index and several popular heuristics. Section 4 presents a simulation study that investigates the performance of randomized probability matching in the unstructured binomial bandit, where optimal solutions are available. Section 5 describes a second simulation study in which the reward distribution has low dimensional structure, where "optimal" methods do poorly. There is an important symmetry between Sections 4 and 5. Section 4 illustrates the cost savings that sequential learning can have over classic experiments. Section 5 illustrates the improvements that can be brought to sequential learning by incorporating classical ideas like fractional factorial design. Section 6 concludes with observations about extending multi-armed bandits to more elaborate settings.

2 Randomized Probability Matching
Let $\mathbf{y}_t = (y_1, \ldots, y_t)$ denote the sequence of rewards observed up to time $t$. Let $a_t$ denote the arm of the bandit that was played at time $t$. We suppose that each $y_t$ was generated independently from the reward distribution $f_{a_t}(y_t \mid \theta)$, where $\theta$ is an unknown parameter vector, and some components of $\theta$ may be shared across the different arms.

To make the notation concrete, consider two specific examples, both of which take $y_t \in \{0, 1\}$. Continuous rewards are also possible, of course, but we will focus on binary rewards because counts of clicks or conversions are the typical measure of success in e-commerce. The first example is the binomial bandit, in which $\theta = (\theta_1, \ldots, \theta_k)$, and $f_a(y_t \mid \theta)$ is the Bernoulli distribution with success probability $\theta_a$. The binomial bandit is the canonical bandit problem appearing most often in the literature. The second example is the fractional factorial bandit, where $a_t$ corresponds to a set of levels for a group of experimental factors (including potential interactions), coded as dummy variables in the vector $x_t$. Let $k$ denote the number of possible configurations of $x_t$, and let $a_t \in \{1, \ldots, k\}$ refer to a particular configuration according to some labeling scheme. The probability of success is $f_a(y_t = 1 \mid \theta) = g(\theta^T x_t)$, where $g$ is a binomial link function, such as probit or logistic. We refer to the case where $g$ is the CDF of the standard normal distribution as the probit bandit.
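To make the two reward models concrete, here is a minimal simulation sketch in Python. It is our own illustration, not part of the original presentation; the function names, the factor coding, and the parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def binomial_bandit_reward(theta, a):
    """Binomial bandit: arm a pays 1 with its own success probability theta[a]."""
    return rng.binomial(1, theta[a])

def probit_bandit_reward(beta, x):
    """Fractional factorial (probit) bandit: the success probability of a
    configuration x is g(beta' x), with g the standard normal CDF."""
    return rng.binomial(1, norm.cdf(beta @ x))

# Hypothetical 2x2 factorial coding: intercept, one dummy per factor, interaction.
beta = np.array([-1.0, 0.5, 0.3, 0.2])   # unknown in practice; learned from rewards
x = np.array([1, 1, 0, 0])               # factor 1 at its second level, factor 2 at its first
y = probit_bandit_reward(beta, x)
```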
Let $\mu_a(\theta) = E(y_t \mid \theta, a_t = a)$ denote the expected reward from $f_a(y \mid \theta)$. If $\theta$ were known then the optimal long run strategy would be to always choose the arm with the largest $\mu_a(\theta)$. Let $p(\theta)$ denote a prior distribution on $\theta$, from which we may compute

$$w_{a0} = \Pr\left(\mu_a(\theta) = \max\{\mu_1(\theta), \ldots, \mu_k(\theta)\}\right). \qquad (1)$$

The computation in equation (1) can be expressed as an integral of an indicator function. Let $I_a(\theta) = 1$ if $\mu_a(\theta) = \max\{\mu_1(\theta), \ldots, \mu_k(\theta)\}$, and $I_a(\theta) = 0$ otherwise. Then

$$w_{a0} = \Pr\left(\mu_a = \max\{\mu_1, \ldots, \mu_k\}\right) = E\left(I_a(\theta)\right) = \int I_a(\theta)\, p(\theta)\, d\theta. \qquad (2)$$

If a priori little is known about $\theta$ then the implied distribution on $\mu$ will be exchangeable, and thus $w_{a0}$ will be uniform. As rewards from the bandit are observed, the parameters of the reward distribution are learned through Bayesian updating. At time $t$ the posterior distribution of $\theta$ is

$$p(\theta \mid \mathbf{y}_t) \propto p(\theta) \prod_{\tau=1}^{t} f_{a_\tau}(y_\tau \mid \theta), \qquad (3)$$

from which one may compute

$$w_{at} = \Pr\left(\mu_a = \max\{\mu_1, \ldots, \mu_k\} \mid \mathbf{y}_t\right) = E\left(I_a(\theta) \mid \mathbf{y}_t\right), \qquad (4)$$

as in equation (2).
Randomized probability matching allocates observation $t + 1$ to arm $a$ with probability $w_{at}$. Randomized probability matching is not known to optimize any specific utility function, but it is easy to apply in general settings, it balances exploration and exploitation in a natural way, and it tends to allocate observations efficiently from both inferential and economic perspectives. It is compatible with batch updates of the posterior distribution, and the methods used to compute the allocation probabilities make it easy to compute the expected amount of lost reward relative to playing the optimal arm of the bandit from the beginning. Finally, randomized probability matching is free of arbitrary tuning parameters that must be set by the analyst.

2.1 Computing Allocation Probabilities
For some families of reward distributions it is possible to compute $w_{at}$ either analytically or by quadrature. In any case it is easy to compute $w_{at}$ by simulation. Let $\theta^{(1)}, \ldots, \theta^{(G)}$ be a sample of independent draws from $p(\theta \mid \mathbf{y}_t)$. Then by the law of large numbers,

$$w_{at} = \lim_{G \to \infty} \frac{1}{G} \sum_{g=1}^{G} I_a\left(\theta^{(g)}\right). \qquad (5)$$

Equation (5) simply says to estimate $w_{at}$ by the empirical proportion of Monte Carlo samples in which $\mu_a(\theta^{(g)})$ is maximal. If $f_a$ is in the exponential family and $p(\theta)$ is a conjugate prior distribution then independent draws of $\theta$ are possible. Otherwise we may draw a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ from an ergodic Markov chain with $p(\theta \mid \mathbf{y}_t)$ as its stationary distribution (Tierney, 1994). In the latter case equation (5) remains unchanged, but it is justified by the ergodic theorem rather than the law of large numbers.
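For the binomial bandit under independent beta priors, equation (5) reduces to a few lines of code. The Python sketch below is our own illustration, assuming Beta(1, 1) priors and hypothetical success and failure counts; the paper does not prescribe a particular implementation.

```python
import numpy as np

def optimality_probabilities(successes, failures, prior=(1.0, 1.0), draws=10000, seed=0):
    """Estimate w_at for the binomial bandit, as in equation (5): draw theta^(g)
    from each arm's Beta posterior and count how often each arm is maximal."""
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)
    # Independent Beta posteriors: Beta(prior_a + successes, prior_b + failures).
    theta = rng.beta(prior[0] + successes, prior[1] + failures,
                     size=(draws, len(successes)))
    best = theta.argmax(axis=1)   # arm with the largest mu_a(theta^(g)) in each draw
    return np.bincount(best, minlength=len(successes)) / draws

# Hypothetical counts for a three-armed bandit.
w = optimality_probabilities(successes=[20, 2, 20], failures=[30, 1, 10])
```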
Posterior draws of $\theta$ are all that is needed to apply randomized probability matching. Such draws are available for a very wide class of models through Markov chain Monte Carlo and other sampling algorithms, which means randomized probability matching can be applied with almost any family of reward distributions.

2.2 Implicit Allocation
The optimality probabilities do not need to be computed explicitly. It will usually be faster to simulate $a \sim w_{at}$ by simulating a single draw of $\theta^{(g)}$ from $p(\theta \mid \mathbf{y}_t)$, then choosing $a = \arg\max_a \mu_a(\theta^{(g)})$.
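A minimal sketch of this implicit scheme for the binomial bandit with beta posteriors (again our own illustration; the function name is hypothetical):

```python
import numpy as np

def choose_arm_implicitly(successes, failures, prior=(1.0, 1.0), rng=None):
    """Implicit randomized probability matching: draw one theta from the posterior
    and play the arm whose drawn success probability is largest. The chosen arm is
    then distributed according to w_at without computing w_at explicitly."""
    if rng is None:
        rng = np.random.default_rng()
    theta = rng.beta(prior[0] + np.asarray(successes, dtype=float),
                     prior[1] + np.asarray(failures, dtype=float))
    return int(theta.argmax())
```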
2.3 Balancing Exploration and Exploitation

Randomized probability matching naturally incorporates uncertainty about $\theta$ because $w_{at}$ is defined as an integral over the entire posterior distribution $p(\theta \mid \mathbf{y}_t)$. To illustrate, consider the binomial bandit with $k = 2$ under independent beta priors. Figure 1(a) plots $p(\theta_1, \theta_2 \mid \mathbf{y})$ assuming we have observed 20 successes and 30 failures from the first arm, along with 2 successes and 1 failure from the second arm. In panel (b) the second arm is replaced with an arm which has generated 20 successes and 10 failures. Thus it has the same empirical success rate as the second arm in panel (a), but with a larger sample size. In both plots the optimality probability for the first arm is the probability that a dot lands below the 45-degree line (which equation (5) estimates by counting simulated dots). In panel (a) the probability that the first arm is optimal is around 18%, despite having a lower empirical success rate than the second arm. In panel (b) the larger sample size causes the posterior distribution to tighten, which lowers the first arm's optimality probability to 0.8%.

[Figure 1: 1000 draws from the joint distribution of two independent beta distributions. In both cases the horizontal axis represents a beta(20,30) distribution. The vertical axis is (a) beta(2,1), (b) beta(20,10).]
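The two panels can be checked numerically with the optimality_probabilities sketch from Section 2.1 (a usage illustration of ours; passing a zero prior simply makes the counts equal to the beta parameters listed in the caption):

```python
# Panel (a): Beta(20,30) versus Beta(2,1); w for the first arm is roughly the 18% quoted above.
w_a = optimality_probabilities(successes=[20, 2], failures=[30, 1], prior=(0.0, 0.0))
# Panel (b): Beta(20,30) versus Beta(20,10); the first arm's probability drops to roughly 0.8%.
w_b = optimality_probabilities(successes=[20, 20], failures=[30, 10], prior=(0.0, 0.0))
```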
This example demonstrates that the need to experiment decreases as more is learned about the parameters of the reward distribution. If the two largest values of $\mu_a(\theta)$ are distinct then $\max_a w_{at}$ eventually converges to 1. If the $k_1 \le k$ largest values of $\mu_a(\theta)$ are identical then $\max_a w_{at}$ need not converge, but $w_{at}$ may drift on a subset of the probability simplex that divides 100% of probability among the $k_1$ optimal alternatives. This is obviously just as good as convergence from the perspective of total reward accumulation.
Notice the alignment between the inferential goal of finding the optimal arm, and the economic goal of accumulating reward. From both perspectives it is desirable for superior arms to rapidly accumulate observations. Downweighting inferior arms leads to larger economic rewards, while larger sample sizes for superior arms mean that the optimal arm can be more quickly distinguished from its close competitors.

3 Other Solutions
This section reviews a collection of strategies that have been used with multi-armed bandit problems. We discuss pure exploration strategies, purely greedy strategies, hybrid strategies, and Gittins indices. Of these, only Gittins indices carry any guarantee of optimality, and then only under a very particular scheme of discounting future rewards.

3.1 The Gittins Index
Gittins (1979) provided a method of computing the optimal strategy in certain bandit problems. Gittins assumed a geometrically discounted stream of future rewards with present value $PV = \sum_{t=0}^{\infty} \gamma^t y_t$, for some $0 \le \gamma < 1$. Gittins provided an algorithm for computing the expected discounted present value of playing arm $a$, assuming optimal play in the future, a quantity that has since become known as the "Gittins index." Thus, by definition, playing the arm with the largest Gittins index maximizes the expected present value of discounted future rewards. The Gittins index has the further remarkable property that it can be computed separately for each arm, in ignorance of the other arms. A policy with this property is known as an index policy.
Logical and computational difficulties have prevented the widespread adoption of Gittins indices. Powell (2007) notes that "Unfortunately, at the time of this writing, there do not exist easy to use software utilities for computing standard Gittins indices." Sutton and Barto (1998) add "Unfortunately, neither the theory nor the computational tractability of [Gittins indices] appear to generalize to the full reinforcement learning problem ..."
Although it is hard to compute Gittins indices exactly, Brezzi and Lai (2002) have developed an approximate Gittins index based on a normal approximation to $p(\theta \mid \mathbf{y})$. Figure 2 plots the approximation for the binomial bandit with two different values of $\gamma$. Both sets of indices converge to $a/(a + b)$ as $a$ and $b$ grow large, but the rate of convergence slows as $\gamma \to 1$. The Brezzi and Lai approximation to the Gittins index is as follows. Let $\hat\mu_{an} = E(\mu_a \mid \mathbf{y}_n)$, $v_{an} = \mathrm{Var}(\mu_a \mid \mathbf{y}_n)$, $\sigma_a^2(\theta) = \mathrm{Var}(y_t \mid \theta, a_t = a)$, and $c = -\log \gamma$. Then

$$\nu_a \approx \hat\mu_{an} + v_{an}^{1/2}\, \psi\!\left(\frac{v_{an}}{c\, \sigma_a^2(\hat\theta)}\right), \qquad (6)$$

where

$$\psi(s) = \begin{cases} \sqrt{s/2} & \text{if } s \le 0.2, \\ 0.49 - 0.11\, s^{-1/2} & \text{if } 0.2 < s \le 1, \\ 0.63 - 0.26\, s^{-1/2} & \text{if } 1 < s \le 5, \\ 0.77 - 0.58\, s^{-1/2} & \text{if } 5 < s \le 15, \\ \left(2 \log s - \log\log s - \log 16\pi\right)^{1/2} & \text{if } s > 15. \end{cases} \qquad (7)$$
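For the binomial bandit, where the Beta$(a, b)$ posterior gives the mean, the variance, and the per-observation variance in closed form, equations (6) and (7) translate directly into code. The sketch below is our own rendering of the approximation, not a reference implementation.

```python
import math

def psi(s):
    """Piecewise function psi(s) from equation (7)."""
    if s <= 0.2:
        return math.sqrt(s / 2.0)
    if s <= 1.0:
        return 0.49 - 0.11 * s ** -0.5
    if s <= 5.0:
        return 0.63 - 0.26 * s ** -0.5
    if s <= 15.0:
        return 0.77 - 0.58 * s ** -0.5
    return math.sqrt(2.0 * math.log(s) - math.log(math.log(s)) - math.log(16.0 * math.pi))

def approx_gittins_index(a, b, gamma):
    """Brezzi-Lai approximate Gittins index (equation (6)) for one arm of the
    binomial bandit whose posterior is Beta(a, b)."""
    c = -math.log(gamma)
    mu = a / (a + b)                              # posterior mean of the success probability
    v = a * b / ((a + b) ** 2 * (a + b + 1.0))    # posterior variance
    sigma2 = mu * (1.0 - mu)                      # Var(y | theta) evaluated at the posterior mean
    return mu + math.sqrt(v) * psi(v / (c * sigma2))
```

As the text notes, the index reduces to the posterior mean $a/(a+b)$ as $a$ and $b$ grow large, because the exploration bonus (the second term) shrinks with the posterior variance.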
Computing aside, there are three logical issues that challenge the Gittins index (and the broader class of index policies). The first is that it requires the arms to have distinct parameters. Thus Gittins indices cannot be applied to problems involving covariates or structured experimental factors. The second problem is the need to choose $\gamma$. Geometric discounting only makes sense if arms are played at equally spaced time intervals. Otherwise a higher discount factor should be used for periods of higher traffic, but if the discounting scheme is anything other than geometric then Gittins indices are no longer optimal (Gittins and Wang, 1992; Berry and Fristedt, 1985). A final issue is known as incomplete learning, which means that the Gittins index is an inconsistent estimator of the location of the optimal arm. This is because the Gittins policy eventually chooses one arm on which to continue forever, and there is a positive probability that the chosen arm is sub-optimal (Brezzi and Lai, 2000).

3.2 Heuristic strategies

3.2.1 Equal Allocation
One naive method of playing a multi-armed bandit is to equally allocate observations to arms until the maximum optimality probability exceeds some threshold, and then play the winning arm afterward. This strategy leads to stable estimates of $\theta$ for all the arms, but Section 4 demonstrates that it is grossly inefficient with respect to the overall reward. Of the methods considered here, equal allocation most closely corresponds to a non-sequential classical experiment (the full-factorial design).
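A rough sketch of this equal-allocation rule (our own illustration, reusing the optimality_probabilities helper from Section 2.1; the threshold and round limit are arbitrary choices, not values from the paper):

```python
def equal_allocation_winner(k, pull, threshold=0.95, max_rounds=10000):
    """Equal allocation: cycle through the k arms, pulling each once per round,
    until some arm's estimated optimality probability exceeds the threshold.
    `pull(a)` returns the 0/1 reward from playing arm a."""
    successes, failures = [0] * k, [0] * k
    for _ in range(max_rounds):
        for a in range(k):
            y = pull(a)
            successes[a] += y
            failures[a] += 1 - y
        w = optimality_probabilities(successes, failures)
        if w.max() >= threshold:
            return int(w.argmax())   # play this arm exclusively from now on
    return int(optimality_probabilities(successes, failures).argmax())
```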
3.2.2 Play-the-Winner

In play-the-winner, if arm $a$ results in a success at time $t$, then it will be played again at time $t + 1$. If a failure is observed, then the next arm is either chosen at random or the arms are cycled through deterministically. Play-the-winne...