A Modern Bayesian Look at the Multi-Armed Bandit

Steven L. Scott

August 9, 2010

Abstract

A multi-armed bandit is a particular type of experimental design where the goal is to accumulate the largest possible reward. Rewards come from a payoff distribution with unknown parameters that are to be learned through sequential experimentation. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods which are "optimal" in simpler contexts. We summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.

1 Introduction

A multi-armed bandit is a sequential experiment
with the goal of achieving the largest possible reward from a payoff distribution with unknown parameters. At each stage the experimenter must decide which arm of the experiment to observe next. The choice involves a fundamental trade-off between the utility gain from exploiting arms that appear to be doing well (based on limited sample information) and the information gain from exploring arms that might potentially be optimal, but which appear to be inferior because of sampling variability. This article reviews several techniques that have been used to manage the multi-armed bandit problem. Particular attention is paid to a technique known as randomized probability matching, which can be implemented quite simply in a modern Bayesian computing environment, and which can combine good ideas from both sequential and classical experimental design.

Multi-armed bandits have an important role to play in modern production environments that
emphasize "continuous improvement," where products remain in a perpetual state of feature testing. Online software (such as a web site, an online advertisement, or a cloud service) is especially amenable to continuous improvement because experimental variation is easy to introduce, and because user responses to online stimuli are often quickly observed. Indeed, several frameworks for improving online services through experimentation have been developed. Google's Website Optimizer (Google, 2010) is one well known example. Designers provide Website Optimizer several versions of their website differing in font, image choice, layout, and other design elements. Website Optimizer randomly diverts traffic to the different configurations in search of configurations that have a high probability of producing successful outcomes, or conversions, as defined by the website owner. One disincentive for website owners to engage in online experiments is the fear that too much traffic will be diverted to inferior configurations in the name of experimental validity. Thus the exploration/exploitation trade-off arises because experimenters must weigh the potential gain of an increased conversion rate at the end of the experiment against the cost of a reduced conversion rate while it runs. Treating product improvement experiments as multi-armed bandits can dramatically reduce the cost of experimentation.

The name "multi-armed bandit" is an allusion
to a "one-armed bandit," a colloquial term for a slot machine. The straightforward analogy is to imagine different website configurations as a row of slot machines, each with its own probability of producing a reward (i.e., a conversion). The multi-armed bandit problem is notoriously resistant to analysis (Whittle, 1979), though optimal solutions are available in certain special cases (Gittins, 1979; Bellman, 1956). Even then, the optimal solutions are hard to compute, rely on artificial discount factors, and fail to generalize to realistic reward distributions. They can also exhibit incomplete learning, meaning that there is a positive probability of playing the wrong arm
forever (Brezzi and Lai, 2000).

Because of the drawbacks associated with optimal solutions, analysts often turn to heuristics to manage the exploration/exploitation trade-off. Randomized probability matching is a particularly appealing heuristic that plays each arm in proportion to its probability of being optimal. It is easy to implement, broadly applicable, and combines several attractive features of other popular heuristics. Randomized probability matching is an old idea (Thompson, 1933, 1935), but modern Bayesian computation has dramatically broadened the class of reward distributions to which it can be applied.

The simplicity of randomized probability
matching allows the multi-armed bandit to incorporate powerful ideas from classical design. For example, both bandits and classical experiments must face the exploding number of possible configurations as factors are added to the experiment. Classical experiments handle this problem using fractional factorial designs (see, e.g., Cox and Reid, 2000), which surrender the ability to fit certain complex interactions in order to reduce the number of experimental runs. These designs combat the curse of dimensionality by learning about an arm's reward distribution indirectly, through the rewards from other arms with similar characteristics. Bandits can use the fractional factorial idea by assuming that a model, such as a probit or logistic regression, determines the reward distributions of the different arms. Assuming a parametric model allows the bandit to focus on a lower dimensional parameter space and thus potentially achieve greater rewards than "optimal" solutions that make no
parametric assumptions.

It is worth noting that there are also important differences between classical experiments and bandits. For example, the traditional optimality criteria from classical experiments (D-optimality, A-optimality, etc.) tend to produce balanced experiments where all treatment effects can be accurately estimated. In a multi-armed bandit it is actually undesirable to accurately estimate treatment effects (i.e., the parameters of the reward distribution) for inferior arms. Instead, the bandit aims to gather just enough information about a sub-optimal arm to determine that it is sub-optimal, at which point further exploration becomes wasteful. A second difference is the importance placed on statistical significance. Classical experiments are designed to be analyzed using methods that tightly control the type-I error rate under a null hypothesis of no effect. But when the cost of switching between products is small (as with software testing), the type-I error rate is of little relevance to the bandit. A type-I error corresponds to switching to a different arm that provides no material advantage over the current arm. By contrast, a type-II error means failing to switch to a superior arm, which could carry a substantial cost. Thus when switching costs are small almost all the costs lie in type-II errors, which makes the usual notion of statistical significance largely irrelevant. Finally,
classical experiments typically focus on designs for linear models because the information matrix in a linear model is a function of the design matrix. Designs for nonlinear models like probit or logistic regression are complicated by the fact that the information matrix depends on unknown parameters (Chaloner and Verdinelli, 1995). This complication presents no particular difficulty to the multi-armed bandit played under randomized probability matching.

The remainder of this paper is structured as
follows. Section 2 describes the principle of randomized probability matching in greater detail. Section 3 reviews other approaches for multi-armed bandits, including the Gittins index and several popular heuristics. Section 4 presents a simulation study that investigates the performance of randomized probability matching in the unstructured binomial bandit, where optimal solutions are available. Section 5 describes a second simulation study in which the reward distribution has low dimensional structure, where "optimal" methods do poorly. There is an important symmetry between Sections 4 and 5. Section 4 illustrates the cost savings that sequential learning can have over classic experiments. Section 5 illustrates the improvements that can be brought to sequential learning by incorporating classical ideas like fractional factorial design. Section 6 concludes with observations about extending multi-armed bandits to more elaborate settings.

2 Randomized Probability Matching

Let y^t = (y_1, …, y_t) denote the sequence of rewards observed up to time t. Let a_t denote the arm of the bandit that was played at time t. We suppose that each y_t was generated independently from the reward distribution f_{a_t}(y_t | θ), where θ is an unknown parameter vector, and some components of θ may be shared across the different arms.

To make the notation concrete, consider two
specific examples, both of which take y_t ∈ {0, 1}. Continuous rewards are also possible, of course, but we will focus on binary rewards because counts of clicks or conversions are the typical measure of success in e-commerce. The first example is the binomial bandit, in which θ = (θ_1, …, θ_k), and f_a(y_t | θ) is the Bernoulli distribution with success probability θ_a. The binomial bandit is the canonical bandit problem appearing most often in the literature. The second example is the fractional factorial bandit, where a_t corresponds to a set of levels for a group of experimental factors (including potential interactions), coded as dummy variables in the vector x_t. Let k denote the number of possible configurations of x_t, and let a_t ∈ {1, …, k} refer to a particular configuration according to some labeling scheme. The probability of success is f_a(y_t = 1 | θ) = g(θᵀx_t), where g is a binomial link function, such as probit or logistic. We refer to the case where g is the CDF of the standard normal distribution as the probit bandit.

Let μ_a(θ) = E(y_t | θ, a_t = a) denote the expected reward from f_a(y_t | θ). If θ were known then the optimal long run strategy would be to always choose the arm with the largest μ_a(θ). Let p(θ) denote a prior distribution on θ, from which we may compute

    w_a0 = Pr(μ_a(θ) = max{μ_1(θ), …, μ_k(θ)}).    (1)

The computation in equation (1) can be expressed as an integral of an indicator function. Let I_a(θ) = 1 if μ_a(θ) = max{μ_1(θ), …, μ_k(θ)}, and I_a(θ) = 0 otherwise. Then

    w_a0 = E(I_a(θ)) = ∫ I_a(θ) p(θ) dθ.    (2)

If a priori little is known about θ then the implied distribution on μ = (μ_1, …, μ_k) will be exchangeable, and thus w_a0 will be uniform. As rewards from the bandit are observed, the parameters of the reward distribution are learned through Bayesian updating. At time t the posterior distribution of θ is

    p(θ | y^t) ∝ p(θ) ∏_{τ=1}^t f_{a_τ}(y_τ | θ),    (3)

from which one may compute

    w_at = Pr(μ_a = max{μ_1, …, μ_k} | y^t) = E(I_a(θ) | y^t),    (4)

as in equation (2).

Randomized probability matching allocates
observation t + 1 to arm a with probability w_at. Randomized probability matching is not known to optimize any specific utility function, but it is easy to apply in general settings, it balances exploration and exploitation in a natural way, and it tends to allocate observations efficiently from both inferential and economic perspectives. It is compatible with batch updates of the posterior distribution, and the methods used to compute the allocation probabilities make it easy to compute the expected amount of lost reward relative to playing the optimal arm of the bandit from the beginning. Finally, randomized probability matching is free of arbitrary tuning parameters that must be set by the analyst.

2.1 Computing Allocation Probabilities

For some families of reward distributions it is possible to compute w_at either analytically or by quadrature. In any case it is easy to compute w_at by simulation. Let θ^(1), …, θ^(G) be a sample of independent draws from p(θ | y^t). Then by the law of large numbers,

    w_at = lim_{G→∞} (1/G) ∑_{g=1}^G I_a(θ^(g)).    (5)
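Equation (5) is only a few lines of code. The sketch below is in Python rather than the paper's R, and the function names are ours; for concreteness it assumes beta posteriors of the form Beta(y + 1, n − y + 1), as in the binomial bandit with uniform priors. It also includes the single-draw shortcut described in Section 2.2.

```python
import random

def posterior_draw(successes, trials, rng):
    """One joint posterior draw: theta_a ~ Beta(y_a + 1, n_a - y_a + 1)."""
    return [rng.betavariate(y + 1, n - y + 1)
            for y, n in zip(successes, trials)]

def prob_optimal(successes, trials, ndraws=100_000, seed=0):
    """Estimate w_a = Pr(arm a is optimal | data) by the Monte Carlo
    average of the indicator I_a(theta), as in equation (5)."""
    rng = random.Random(seed)
    k = len(successes)
    wins = [0] * k
    for _ in range(ndraws):
        draw = posterior_draw(successes, trials, rng)
        wins[max(range(k), key=draw.__getitem__)] += 1
    return [win / ndraws for win in wins]

def choose_arm(successes, trials, rng):
    """Single-draw allocation (Section 2.2): one posterior draw followed
    by argmax selects arm a with probability w_a, without computing w."""
    draw = posterior_draw(successes, trials, rng)
    return max(range(len(draw)), key=draw.__getitem__)

# Two arms: 20 successes in 50 trials versus 2 in 3 (the counts of Figure 1(a)).
w = prob_optimal([20, 2], [50, 3])
```

For these counts the estimate for the first arm comes out at roughly 0.19, in the neighborhood of the "around 18%" quoted for Figure 1(a) in Section 2.3.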
Equation (5) simply says to estimate w_at by the empirical proportion of Monte Carlo samples in which μ_a(θ^(g)) is maximal. If f_a is in the exponential family and p(θ) is a conjugate prior distribution then independent draws of θ are possible. Otherwise we may draw a sequence θ^(1), θ^(2), … from an ergodic Markov chain with p(θ | y^t) as its stationary distribution (Tierney, 1994). In the latter case equation (5) remains unchanged, but it is justified by the ergodic theorem rather than the law of large numbers.

Posterior draws of θ are all that is needed to
apply randomized probability matching. Such draws are available for a very wide class of models through Markov chain Monte Carlo and other sampling algorithms, which means randomized probability matching can be applied with almost any family of reward distributions.

2.2 Implicit Allocation

The optimality probabilities do not need to be computed explicitly. It will usually be faster to simulate a ~ w_at by simulating a single draw θ^(g) from p(θ | y^t), then choosing a = argmax_a μ_a(θ^(g)).

2.3 Balancing Exploration and Exploitation

Randomized probability matching naturally incorporates uncertainty about θ because w_at is
defined as an integral over the entire posterior distribution p(θ | y^t). To illustrate, consider the binomial bandit with k = 2 under independent beta priors. Figure 1(a) plots p(θ_1, θ_2 | y) assuming we have observed 20 successes and 30 failures from the first arm, along with 2 successes and 1 failure from the second arm. In panel (b) the second arm is replaced with an arm which has generated 20 successes and 10 failures. Thus it has the same empirical success rate as the second arm in panel (a), but with a larger sample size. In both plots the optimality probability for the first arm is the probability that a dot lands below the 45-degree line (which equation (5) estimates by counting simulated dots). In panel (a) the probability that the first arm is optimal is around 18%, despite its having a lower empirical success rate than the second arm. In panel (b) the larger sample size causes the posterior distribution to tighten, which lowers the first arm's optimality probability to 0.8%.

Figure 1: 1,000 draws from the joint distribution of two independent beta distributions. In both cases the horizontal axis represents a beta(20,30) distribution. The vertical axis is (a) beta(2,1), (b) beta(20,10).

This example demonstrates that the need to
experiment decreases as more is learned about the parameters of the reward distribution. If the two largest values of μ_a(θ) are distinct then max_a w_at eventually converges to 1. If the k₁ ≤ k largest values of μ_a(θ) are identical then max_a w_at need not converge, but w_at may drift on a subset of the probability simplex that divides 100% of the probability among the k₁ optimal alternatives. This is obviously just as good as convergence from the perspective of total reward accumulation.

Notice the alignment between the inferential goal of finding the optimal arm and the economic goal of accumulating reward. From both perspectives it is desirable for superior arms to rapidly accumulate observations. Down-weighting inferior arms leads to larger economic rewards, while larger sample sizes for superior arms mean that the optimal arm can be more quickly distinguished from its close competitors.

3 Other Solutions

This section reviews a collection of strategies
that have been used with multi-armed bandit problems. We discuss pure exploration strategies, purely greedy strategies, hybrid strategies, and Gittins indices. Of these, only Gittins indices carry any guarantee of optimality, and then only under a very particular scheme of discounting future rewards.

3.1 The Gittins Index

Gittins (1979) provided a method of computing the optimal strategy in certain bandit problems. Gittins assumed a geometrically discounted stream of future rewards with present value PV = ∑_{t=0}^∞ γ^t y_t, for some 0 ≤ γ < 1. Gittins provided an algorithm for computing the expected discounted present value of playing arm a, assuming optimal play in the future, a quantity that has since become known as the "Gittins index." Thus, by definition, playing the arm with the largest Gittins index maximizes the expected present value of discounted future rewards. The Gittins index has the further remarkable property that it can be computed separately for each arm, in ignorance of the other arms. A policy with this property is known as an index policy.

Logical and computational difficulties have
prevented the widespread adoption of Gittins indices. Powell (2007) notes that "Unfortunately, at the time of this writing, there do not exist easy to use software utilities for computing standard Gittins indices." Sutton and Barto (1998) add "Unfortunately, neither the theory nor the computational tractability of [Gittins indices] appear to generalize to the full reinforcement learning problem…"

Although it is hard to compute Gittins indices exactly, Brezzi and Lai (2002) have developed an approximate Gittins index based on a normal approximation to p(θ | y). Figure 2 plots the approximation for the binomial bandit with two different values of γ. Both sets of indices converge to a/(a + b) as a and b grow large, but the rate of convergence slows as γ → 1. The Brezzi and Lai approximation to the Gittins index is as follows. Let θ̂_n = E(θ_a | y^n, a_n), v_an = Var(θ_a | y^n), σ²(θ) = Var(y_t | θ, a_t = a), and c = −log γ. Then

    ν_a(y^n) ≈ θ̂_n + v_an^{1/2} ψ(v_an / (c σ²(θ̂_n))),    (6)

where

    ψ(s) = √(s/2)                                   if s ≤ 0.2,
           0.49 − 0.11 s^{−1/2}                     if 0.2 < s ≤ 1,
           0.63 − 0.26 s^{−1/2}                     if 1 < s ≤ 5,
           0.77 − 0.58 s^{−1/2}                     if 5 < s ≤ 15,
           (2 log s − log log s − log 16π)^{1/2}    if s > 15.    (7)
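Equations (6) and (7) transcribe directly into code. The following Python sketch (the function names are ours) evaluates the approximation for a binomial arm whose posterior is Beta(a, b), taking θ̂ and v to be the usual beta posterior mean and variance:

```python
import math

def psi(s):
    """The piecewise function psi(s) of equation (7)."""
    if s <= 0.2:
        return math.sqrt(s / 2.0)
    if s <= 1:
        return 0.49 - 0.11 / math.sqrt(s)
    if s <= 5:
        return 0.63 - 0.26 / math.sqrt(s)
    if s <= 15:
        return 0.77 - 0.58 / math.sqrt(s)
    return math.sqrt(2.0 * math.log(s) - math.log(math.log(s))
                     - math.log(16.0 * math.pi))

def approx_gittins(a, b, gamma):
    """Brezzi and Lai's approximate Gittins index, equation (6), for an
    arm with a Beta(a, b) posterior and discount factor gamma."""
    theta_hat = a / (a + b)                      # posterior mean
    v = a * b / ((a + b) ** 2 * (a + b + 1.0))   # posterior variance
    sigma2 = theta_hat * (1.0 - theta_hat)       # Bernoulli reward variance
    c = -math.log(gamma)
    return theta_hat + math.sqrt(v) * psi(v / (c * sigma2))
```

The index is the posterior mean plus an exploration bonus. As a and b grow the bonus vanishes and the index converges to a/(a + b), and the convergence is slower for γ nearer 1, matching the description of Figure 2.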
Computing aside, there are three logical issues that challenge the Gittins index (and the broader class of index policies). The first is that it requires the arms to have distinct parameters. Thus Gittins indices cannot be applied to problems involving covariates or structured experimental factors. The second problem is the need to choose γ. Geometric discounting only makes sense if arms are played at equally spaced time intervals. Otherwise a higher discount factor should be used for periods of higher traffic, but if the discounting scheme is anything other than geometric then Gittins indices are no longer optimal (Gittins and Wang, 1992; Berry and Fristedt, 1985). A final issue is known as incomplete learning, which means that the Gittins index is an inconsistent estimator of the location of the optimal arm. This is because the Gittins policy eventually chooses one arm on which to continue forever, and there is a positive probability that the chosen arm is sub-optimal (Brezzi and Lai, 2000).

3.2 Heuristic strategies

3.2.1 Equal Allocation

One naive method of playing a multi-armed bandit is to equally allocate observations to arms until the maximum optimality probability exceeds some threshold, and then play the winning arm afterward. This strategy leads to stable estimates of θ for all the arms, but Section 4 demonstrates that it is grossly inefficient with respect to the overall reward. Of the methods considered here, equal allocation most closely corresponds to a non-sequential classical experiment (the full-factorial design).

3.2.2 Play-the-Winner

In play-the-winner, if arm a results in a success at time t, then it will be played again at time t + 1. If a failure is observed, then the next arm is either chosen at random or the arms are cycled through deterministically. Play-the-winner can be nearly optimal when the best arm has a very high success rate. Berry and Fristedt (1985) show that if arm a is optimal at time t, and it results in a success, then it is also optimal to play arm a at time t + 1. However, play-the-winner over-explores when the success rate of the optimal arm is low. Play-the-winner tends toward equal allocation as the success rate of the optimal arm tends to zero.

Figure 2: Brezzi and Lai's approximation to the Gittins index for the binomial bandit problem with (a) γ = 0.8, and (b) γ = 0.999.

3.2.3 Deterministic Greedy Strategies

An algorithm that focuses purely on exploitation is said to be greedy. A textbook example of a greedy algorithm is to always choose the arm with the highest sample mean reward. This approach has been shown to do a poor job of producing large overall rewards because it fails to adequately explore the other arms (Sutton and Barto, 1998). A somewhat better greedy algorithm is deterministic probability matching, which always chooses the arm with the highest optimality probability w_at.

One practicality to keep in mind with greedy
strategies is that they can do a poor job when batch updating is employed. For logistical reasons the bandit may have to be played multiple times before it can learn from recent activity. This can occur if data arrive in batches instead of in real time (e.g., computer logs might be scraped once per hour, or once per day). Greedy algorithms can suffer in batch updating because they play the same arm for an entire update cycle. This can add substantial variance to the total reward relative to randomized probability matching. Greedy strategies perform especially poorly in the early phases of batch updating because they only learn about one arm per update cycle.

3.2.4 Hybrid Strategies

Hybrid strategies are greedy strategies that have
been modified to force some amount of exploration. One example is an ε-greedy strategy. Here each allocation takes place according to a greedy algorithm with probability 1 − ε; otherwise a random allocation takes place. The equal allocation strategy is an ε-greedy algorithm with ε = 1. One can criticize an ε-greedy strategy on the grounds that it has poor asymptotic behavior, because it continues to explore long after the optimal solution becomes apparent. This leads to the notion of an ε-decreasing strategy, which is an ε-greedy strategy where ε
decreases over time.

Both ε-greedy and ε-decreasing strategies are wasteful in the sense that they use simple random sampling as the basis for exploration. A more fruitful approach would be to use stratified sampling that under-samples arms that are likely to be sub-optimal. Softmax learning (Luce, 1959) is one such example. Softmax learning is a randomized strategy that allocates observation t + 1 to arm a with probability

    w_at = exp(μ̂_at / τ) / ∑_{j=1}^k exp(μ̂_jt / τ),    (8)

where τ is a tuning parameter to be chosen experimentally. Softmax learning with fixed τ shares the same asymptotic inefficiency as ε-greedy strategies, which can be eliminated by gradually decreasing τ to zero.

Randomized probability matching combines
aspects of all the preceding strategies. It is ε-greedy in the sense that it employs deterministic probability matching with probability max_a w_at, and random (though non-uniform) exploration with probability ε = 1 − max_a w_at. It is ε-decreasing in the sense that in non-degenerate cases max_a w_at → 1; however, max_a w_at can sometimes decrease in the short run if the data warrant. The stratified exploration provided by softmax learning matches that used by randomized probability matching to the extent that w_at is determined by μ̂_t through a multinomial logistic regression with coefficient 1/τ. The benefit of randomized probability matching is that the tuning parameters and their decay schedules evolve in principled, data-determined ways rather than being arbitrarily set by the analyst.

3.3 The Importance of Randomization

Randomization is an important component of a
bandit allocation strategy because, like the Gittins index described in Section 3.1, greedy strategies can fail to consistently estimate which arm is optimal. Yang and Zhu (2002) proved that an ε-decreasing randomization can restore consistency to a greedy policy. We conjecture that randomized probability matching does consistently estimate the optimal arm. If a* denotes the index of the optimal arm, then a sufficient condition for the conjecture to be true is for w_{a*,t} > ε for some ε > 0, where ε is independent of t.

4 The Binomial Bandit

This section describes a collection of simulation studies comparing randomized probability matching to various other learning algorithms in the context of the binomial bandit, where there are ample competitors. The binomial bandit assumes that the rewards in each configuration are independent Bernoulli random variables with success probabilities θ_1, …, θ_k. For simplicity we assume the uniform prior distribution θ_a ~ U(0, 1), independently across a. Let Y_at and N_at denote the cumulative number of successes and trials observed for arm a up to time t. Then the posterior distribution of θ = (θ_1, …, θ_k) is

    p(θ | y^t) = ∏_{a=1}^k Be(θ_a | Y_at + 1, N_at − Y_at + 1),    (9)

where Be(θ | α, β) denotes the density of the beta distribution for random variable θ with parameters α and β. The optimality probability

    w_at = ∫_0^1 Be(θ_a | Y_at + 1, N_at − Y_at + 1) ∏_{j≠a} Pr(θ_j < θ_a | Y_jt + 1, N_jt − Y_jt + 1) dθ_a    (10)
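As a sanity check on equation (10), the integral can also be approximated on a simple grid. The sketch below is Python rather than R (the function names are ours), and it assumes the uniform priors used in this section; it evaluates the integrand at grid midpoints, obtaining the beta CDFs by cumulative sums of the densities:

```python
import math

def beta_logpdf(x, a, b):
    """Log of the Beta(a, b) density at x, via log-gamma."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def prob_optimal_grid(successes, trials, npoints=2000):
    """Approximate equation (10) with the midpoint rule: for each arm a,
    integrate its Beta(Y+1, N-Y+1) density times the product over the
    other arms of Pr(theta_j < theta_a)."""
    k = len(successes)
    h = 1.0 / npoints
    xs = [(i + 0.5) * h for i in range(npoints)]
    # densities for each arm on the grid, and CDFs by cumulative sums
    pdf = [[math.exp(beta_logpdf(x, y + 1, n - y + 1)) for x in xs]
           for y, n in zip(successes, trials)]
    cdf = []
    for row in pdf:
        acc, cum = 0.0, []
        for p in row:
            acc += p * h
            cum.append(acc)
        cdf.append(cum)
    w = []
    for a in range(k):
        total = 0.0
        for i in range(npoints):
            term = pdf[a][i]
            for j in range(k):
                if j != a:
                    term *= cdf[j][i]
            total += term * h
        w.append(total)
    return w
```

For the counts of Figure 1(a) (20 successes in 50 trials versus 2 in 3), prob_optimal_grid returns roughly 0.19 for the first arm, in the neighborhood of the "around 18%" quoted in Section 2.3.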
can easily be computed either by quadrature or simulation (see Figures 3 and 4).

compute.probopt <- function(y, n){
  k <- length(y)
  ans <- numeric(k)
  for(i in 1:k){
    indx <- (1:k)[-i]
    f <- function(x){
      r <- dbeta(x, y[i]+1, n[i]-y[i]+1)
      for(j in indx)
        r <- r * pbeta(x, y[j]+1, n[j]-y[j]+1)
      return(r)
    }
    ans[i] <- integrate(f, 0, 1)$value
  }
  return(ans)
}

Figure 3: R code for computing equation (10) by quadrature.

sim.post <- function(y, n, ndraws){
  k <- length(y)
  ans <- matrix(nrow=ndraws, ncol=k)
  no <- n - y
  for(i in 1:k) ans[,i] <- rbeta(ndraws, y[i]+1, no[i]+1)
  return(ans)
}

prob.winner <- function(post){
  k <- ncol(post)
  w <- table(factor(max.col(post), levels=1:k))
  return(w/sum(w))
}

compute.win.prob <- function(y, n, ndraws){
  return(prob.winner(sim.post(y, n, ndraws)))
}

Figure 4: R code for computing equation (10) by simulation.

Our simulation studies focus on regret, the cumulative expected lost reward, relative to playing the optimal arm from the beginning of the experiment. Let μ*(θ) = max_a{μ_a(θ)} denote the expected reward under the truly optimal arm, and let n_at denote the number of observations that were allocated to arm a at time t. Then the expected regret at time t is

    L_t = ∑_a n_at (μ*(θ) − μ_a(θ)),    (11)

and the cumulative regret at time T is L = ∑_{t=1}^T L_t. The units of L_t are the units of reward, so with 0/1 rewards L_t is the expected number of lost successes relative to the unknown optimal strategy.

The simulation study consisted of 100 experiments, each with k = 10 "true" values of θ independently generated from the U(0, 1/10) distribution. The first such value was assumed to be the current configuration and was assigned 10^6 prior observations, so that it was effectively a known "champion," with the remaining 9 arms being new "challengers." We consider both batch and
real time updates.

4.1 Batch Updating

In this study, each update contains a Poisson(1000) number of observations allocated to the different arms based on one of several allocation schemes. We ran each experiment until either the maximal w_at exceeded a threshold of 0.95 or else 100 time periods had elapsed.

4.1.1 RPM vs. Equal Allocation

Our first example highlights the substantial
gains that can be had from sequential learning relative to a classical non-sequential experiment. Figure 5 compares the cumulative regret for the randomized probability matching and equal allocation strategies across 100 simulation experiments. The regret from the equal allocation strategy is more than an order of magnitude greater than under probability matching. Figure 6 shows why it is so large. Each boxplot in Figure 6 represents the distribution of L_t for experiments that were still active at time t. The regret distribution at time 1 is the same in both panels, but the expected regret under randomized probability matching quickly falls to zero as sub-optimal arms are identified and excluded from further study. The equal allocation strategy continues allocating observations to sub-optimal arms for the duration of the experiment. This produces more accurate estimates of θ for the sub-optimal arms, but at great expense.

The inferential consequences of the multi-armed bandit can be seen in Figure 7, which shows how the posterior distribution of θ_a evolves over time for a single experiment. Notice how the 95% credibility interval for the optimal arm (in the upper left plot of each panel) is smaller in panel (a) than in panel (b), despite the fact that the experiment in panel (b) ran longer. Likewise, notice that the 95% credible intervals for the sub-optimal arms are much wider
9—1
>‘
u
c
a.)
3
:7
93
LL
m
o Expected Number of Lost Conversions 50 100 l: [ElTﬂllM I H 150 200 (a) ll 1 LIL“
250 Frequency 30 25 20 15 10 Expected Number of Lost Conversions W7 3000 4000 (b) I 5000 K
6000 Figure 5: Cumulative regret in each of 100 simulation experiments under (a) randomized probability match
ing) and (b) equal allocation. Note that the scales diﬁ‘er by an order of magnitude. For comparison purposes,
both panels include a rug—plot showing the regret distribution under probability matching. 4O 50
1 L055 10
L v
s
x
l
u
K
«
x F
20 Test Period (81) Loss 60 50 4O 30 20 10 n
l. .
V
an: « n '
m. y .l c
n. x . .
. . l .. n
3: “Que ~;« m“,
Mu. . x' .' v mm .
v‘. ., ,
x u'vn
u, v: .
l l .I‘
“Mn. mm,
“Hummu.
s";.HI‘n‘ .x
V . “mun”, .
“mumu.
“ \lullu‘ul
u, u. l,
. . l u
“mumn
n. u I
H‘ v u.” v .V v
v H .0 ' l y;
I'M"  I"I v‘ ' ll
,H.“ u "‘3 IV: h
3m, “1...” :l c :H
"H"’” "‘x' ‘H ‘n « m u ' x v
m . v v . .v
.,u ’ H. “mg _ [rm
m.” H . ,: n H...
“mm.” l... “Wm,”
..Hli§”1".‘x‘#v‘ll‘hl‘ﬂ
,'.‘...u.nu.,~.,ul ,.
' n:".»‘~,'v.x‘n'u
u l. A h It
,. w H
i .u .. u K “II t' v m...qu .
. l n.
.. .u Wavy” H
I: m .. u." 'n . .u. um
.;.;.n.;;;w,;:‘:‘..umm
A “ ‘ A, 4‘ {Au .
1 ﬁt v
e H a. ‘ w
l. ,
., l
i” n 4m m
mun”
'mm'n
m'u
.,.A ,
l i .9 . .
nun“. m, u. U‘ \
“Hum.
HAM I.
L Y
Y " 'I
.
m . . ..
. ;..
. ‘ ... _..
m. .... n. munmm u.
“H'u'unn u
1n.4ln.nuu ~
m'. , .
"""n‘4uu
mu.” m.“
m‘mm. A u 0 20 I
60 Test Period 03) 80 Figure 6: Expected regret per time period under (a) randomized probability matching, and { b ) equal allocation.
Bomplots show variation across 100 simulated experiments. 10 0 Ln 6 D “'7 O
Figure 7: Evolution of the posterior distribution (means and upper and lower 95% credibility bounds) of θ under (a) randomized probability matching, (b) equal allocation. Each panel corresponds to an arm. Arms are sorted according to optimality probability when the experiment ends.

Figure 8: Regret from deterministic probability matching. The first ten periods were spent learning the parameters of each arm.

Method                Mean     SD
RPM                   50.1    62.2
DPM                   54.7   122.3
Largest Sample Mean   81.4   225.7

Figure 9: (a) Stacked stripchart showing cumulative regret after excluding the first 10 test periods. Panel (b) shows means and standard deviations of the expected losses plotted in panel (a). Greedy methods have a higher chance of zero regret. Randomized probability matching has a lower variance.

under probability matching than under equal allocation.

4.1.2 RPM vs. Greedy Algorithms

The experiment described above was also run under the purely greedy strategies of deterministic
probability matching and playing the arm with the largest mean. Neither purely greedy strategy is suitable for the batch updating used in our experiment. Because a U(0, 1) prior was used for each θ_a, and because the "true" values were simulated from U(0, 1/10), an arm with no observations will have a higher mean and higher optimality probability than an arm with observed rewards. Thus a purely greedy strategy will spend the first k batches cycling through the arms, assigning all the observations to each one in turn (see Figure 8). This form of exploration has the same expected regret as equal allocation, but with a larger variance, because all bets are placed on a single arm.

Figure 9 compares the cumulative expected regret for the two greedy strategies to randomized probability matching after excluding the first 10 time steps, after the greedy algorithms have explored all 10 arms. The reward under randomized probability matching has a much lower standard deviation than under either of the greedy strategies. It also has the lowest sample mean, though the difference between its mean and that of deterministic probability matching is not statistically significant. Notice that the frequency of exact zero regret is lowest for randomized probability matching, but its positive losses are less than those suffered by the greedy methods.

4.2 Real Time Updating

A third simulation study pitted randomized
probability matching against the Gittins index, in the setting where Gittins is optimal. This time the experiment was run for 10,000 time steps, each with a single play of the bandit. Again the simulation uses k = 10 and independently draws the true success probabilities from U(0, 1/10). Figure 10 compares randomized and deterministic probability matching with Gittins index strategies where γ = .999 and γ = .8. Of the four methods, randomized probability matching did the worst job of accumulating total reward, but it had the smallest standard deviation, and it selected the optimal arm at the end of the experiment the largest number of times. The Gittins index with γ = .999 gathered the largest reward, but with a larger standard deviation and lower probability of selecting the optimal arm. Deterministic probability matching did slightly worse than the better Gittins policy on all three metrics. Finally, the Gittins index with γ = .8 shows a much thicker tail than the other methods, illustrating the fact that you can lower your overall total reward by too heavily discounting the future.

5 Fractional Factorial Bandit

The binomial bandit described in Section 4 fails
to take advantage of potential structure in the arms. For example, suppose the arms are websites differing in font family, font size, image location, and background color. If each characteristic has 5 levels then there are 5⁴ = 625 possible configurations to test. One could analyze this problem using an unstructured binomial bandit with 625 arms, but the size of the parameter estimation problem can be dramatically reduced by assuming additive structure. In the preceding example, suppose each 5-level factor is represented by four indicator variables in a probit or logistic regression. If we include an intercept term and assume a strictly additive structure then there are only 1 + (5 − 1) × 4 = 17 parameters that need estimating. Interaction terms can be included if the additivity assumption is too restrictive. Choosing a set of interactions to allow into the bandit is analogous to choosing a particular fractional factorial design in a classical problem.

Let x_t denote the vector of indicator variables describing the characteristics of the arm played at time t. For the purposes of this
Section we will assume that the probability of a reward depends on x_t through a probit regression model, but any other model can be substituted as long as posterior draws of the model's parameters can be easily obtained. Let Φ(z) denote the standard normal cumulative distribution function. The probit regression model assumes Pr(y_t = 1) = Φ(θᵀx_t).

The probit regression model has no conjugate prior distribution, but a well known data augmentation algorithm (Albert and Chib, 1993) can be used to produce serially correlated draws from p(θ|y). The algorithm is described in Section 7.1 of the Appendix. Each iteration of Albert and Chib's algorithm requires a latent variable to be imputed for each y_t. This can cause the posterior sampling algorithm to slow as more observations are observed. However, if x is small enough to permit all of its possible configurations to be enumerated then Albert and Chib's algorithm can be optimized to run much faster as t → ∞. Section 7.2 of the Appendix explains the modified algorithm. Other modifications based on large sample theory are possible.

We conducted a simulation study to compare
the effectiveness of the fractional factorial and binomial bandits. In the simulation, data were drawn from a probit regression model based on 4 discrete factors with 2, 3, 4, and 5 levels. This model has 120 possible configurations, but only 11 parameters, including an intercept term. The intercept term was drawn from a normal distribution with mean Φ⁻¹(.05) and variance 0.1. The other coefficients were simulated independently from the N(0, 0.1) distribution. These levels were chosen to produce arms with a mean success probability of around .05, and a standard deviation of about .5 on the probit scale. Figure 11 shows the 120 simulated success probabilities for one of the simulated experiments. We replicated this simulation 100 times, to produce 100 bandit processes on which to experiment.

Method          Mean    SD    % Correct
Gittins (.999)  49.0   47.0   63
Gittins (.8)    84.0   94.2   48
DPM             51.9   58.4   58
RPM             87.3   21.7   76

Figure 10: (a) Expected regret under real time sampling across 100 experiments, each simulated for 10,000 time steps. (b) Mean and standard deviation of the expected losses plotted in panel (a), along with the percentage of experiments for which the optimal arm was selected at time 10,000.

Figure 11: True success probabilities on each arm of the fractional factorial bandit.

Let {x_a : a = 1, . . . , 120} denote the possible configurations of x. In each update period, 100
observations were randomly allocated to the different possible configurations according to w_at. At each time step we produced 1000 draws from p(θ|y_t) using the algorithm in Section 7.2, assuming independent N(0, 1) priors for all coefficients. These draws were used to compute w_at as described in Section 2.1, with μ_a(θ) = Φ(θᵀx_a). Note that Φ is a monotonic function, so the arm with the largest μ_a(θ) is the same as the arm with the largest θᵀx_a. This allows us to speed up the computation by skipping the application of Φ.

For every experiment we ran with the fractional factorial bandit, we ran a parallel experiment under the binomial bandit with the same "true" success probabilities. Figure 12 compares the regret distributions across the 100 experiments for the fractional factorial and binomial bandits. The mean regret for the binomial bandit is 745 conversions out of 10,000 trials.

Figure 12: Cumulative regret after 10K trials for the fractional factorial (solid) and binomial (dashed) bandits.

The mean regret for the fractional factorial bandit is 166 conversions out of 10,000 trials, a factor of 4.5 improvement over the binomial bandit. Figure 13 compares the period-by-period regret distributions for the fractional factorial and binomial bandits. The regret falls to zero much faster under the fractional factorial bandit scheme. Figure 14 compares w_at for the fractional factorial and binomial bandits for one of the 100 experiments. In this particular experiment the top four success probabilities are within 2% of one another. The fractional factorial bandit is able to identify the optimal arm within a few hundred observations. The binomial bandit remains confused after 10,000 observations.

6 Conclusion

This paper has shown how randomized probability matching can be used to manage the
multi-armed bandit problem. The method is easy to apply, assuming one can generate posterior draws from p(θ|y_t) by Markov chain Monte Carlo or other methods. Randomized probability matching performs reasonably well compared to optimal methods, when they are available, and it is simple enough to generalize to situations that optimal methods cannot handle. It is especially well suited to situations where learning occurs in batch updates, and it is robust in the sense that it had the lowest standard deviation of any method in all of the simulation studies we tried. Finally, it combines features from several popular heuristics without the burden of specifying artificial tuning parameters.

We have illustrated the advantages of combining sequential and classical experiments by focusing on fractional factorial designs. Other important design ideas are similarly easy to incorporate. For example, randomized blocks for controlling non-experimental variation (i.e. variables which cannot be set by the experimenter) can be included in x, either as first order factors or as interactions. Thus it is straightforward to control for temporal effects (e.g. day of week), or demographic characteristics of the customer using the product.

These techniques can be brought to several generalizations of multi-armed bandits by simply modifying the model used for the reward distribution. For instance, our examples focus on 0/1 rewards, but continuous rewards can be handled by substituting regression models for logit or probit regressions. Restless bandits (Whittle, 1988) that assume θ varies slowly over time can be handled by replacing f_a(y|θ) with a dynamic linear model (West and Harrison, 1997). Arm-acquiring bandits (Whittle, 1981) are handled gracefully by randomized probability matching by simply extending the design matrix used in the probit regression. Finally, one can imagine a network of experiments sharing information through a hierarchical model, practically begging for the name "multi-armed mafia."
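The posterior draws that the method requires can come from any sampler. As an illustration of the data augmentation scheme of Albert and Chib (1993) described in Section 7.1 of the Appendix, the sketch below specializes to a hypothetical intercept-only probit model Pr(y_t = 1) = Φ(θ), so the full conditional for θ is univariate normal and no matrix algebra is needed:

```python
import random
from statistics import NormalDist

std = NormalDist()  # standard normal: cdf() and inv_cdf()

def draw_truncated(mean, positive):
    """Draw z ~ N(mean, 1) truncated to z > 0 (positive) or z < 0,
    by inverting the normal CDF on the allowed interval."""
    lo = std.cdf(-mean)                   # P(z <= 0)
    u = random.uniform(lo, 1.0) if positive else random.uniform(0.0, lo)
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # guard inv_cdf's open interval
    return mean + std.inv_cdf(u)

def probit_gibbs(y, ndraws=500, prior_mean=0.0, prior_var=1.0):
    """Data augmentation for an intercept-only probit model,
    Pr(y_t = 1) = Phi(theta), with a N(prior_mean, prior_var) prior."""
    n = len(y)
    theta, draws = 0.0, []
    for _ in range(ndraws):
        # Step 1: impute latent z_t, truncated according to y_t.
        z = [draw_truncated(theta, yt == 1) for yt in y]
        # Step 2: draw theta from its normal full conditional.
        prec = 1.0 / prior_var + n
        mean = (sum(z) + prior_mean / prior_var) / prec
        theta = random.gauss(mean, prec ** -0.5)
        draws.append(theta)
    return draws

random.seed(0)
y = [1] * 30 + [0] * 70                              # 30% observed successes
draws = probit_gibbs(y)
posterior_mean = sum(draws[100:]) / len(draws[100:])  # discard burn-in
```

The general algorithm replaces the scalar update with a multivariate normal draw whose precision is Σ⁻¹ + XᵀX, as in step 2 of Section 7.1.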
Figure 13: Regret distributions for (a) the fractional factorial bandit and (b) the binomial bandit, under randomized probability matching, when the underlying process has probit structure.
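As noted in Section 5, monotonicity of Φ means the optimality probabilities w_at can be computed from posterior draws of θ without ever evaluating the normal CDF: an arm wins a draw whenever its linear predictor θᵀx_a is largest. A sketch with hypothetical names and toy numbers:

```python
import random

def optimality_weights(theta_draws, configs):
    """Estimate w_a = Pr(arm a is optimal) from posterior draws of theta.
    Because Phi is monotone, the arm maximizing Phi(theta'x_a) is the arm
    maximizing theta'x_a, so the normal CDF is never evaluated."""
    k = len(configs)
    counts = [0] * k
    for theta in theta_draws:
        scores = [sum(t * x for t, x in zip(theta, xa)) for xa in configs]
        counts[max(range(k), key=scores.__getitem__)] += 1
    total = len(theta_draws)
    return [c / total for c in counts]

# Hypothetical example: 2 coefficients (intercept + one indicator),
# 2 configurations, and fake "posterior draws" of theta.
random.seed(0)
theta_draws = [(random.gauss(-1.5, 0.2), random.gauss(0.3, 0.2))
               for _ in range(1000)]
configs = [(1, 0),   # arm 1: indicator off
           (1, 1)]   # arm 2: indicator on
w = optimality_weights(theta_draws, configs)
```

Here arm 2 is optimal exactly when the indicator's coefficient draw is positive, so w[1] estimates Pr(θ₂ > 0 | y).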
Figure 14: Evolution of w_at for one of the 100 fractional factorial bandit experiments in Section 5. (a) fractional factorial bandit, (b) binomial bandit.

7 Appendix: Details of the
Probit MCMC Algorithms

7.1 Posterior Sampling for Probit Regression: Standard Case

This section describes the algorithm introduced by Albert and Chib (1993) to simulate draws from p(θ|y) in a probit regression model. Assume the prior distribution θ ~ N(b, Σ), meaning the normal distribution with mean vector b and variance matrix Σ. Let N_a⁺(μ, σ²) denote the normal distribution with mean μ and variance σ², truncated to only have support on the half line z > a. The complementary distribution N_a⁻(μ, σ²) is truncated to have support on z < a. For the purpose of this section, let y_t be coded as 1/−1 instead of 1/0. The corresponding predictor variables are x_t, and X is a matrix with x_t in row t.

1. For each t, simulate z_t ~ N₀^{y_t}(θᵀx_t, 1), truncated to z_t > 0 if y_t = 1 and to z_t < 0 if y_t = −1.

2. Let z = (z_1, . . . , z_n). Sample θ ~ N(θ̃, Ω), where Ω⁻¹ = Σ⁻¹ + XᵀX and θ̃ = Ω(Xᵀz + Σ⁻¹b).

Repeatedly cycling through steps 1 and 2 produces a sequence of draws (θ, z)⁽¹⁾, (θ, z)⁽²⁾, . . . from a Markov chain with p(θ, z|y) as its stationary distribution. Simply ignoring z yields the desired marginal distribution p(θ|y).

7.2 Probit Posterior Sampling for Denumerable Experimental Designs

When x_t contains only indicator variables it is possible to list out all possible configurations of x. Suppose there are k of them, stacked to form the k × p matrix X̃. Now let ỹ and ñ be k-vectors with elements y_a and n_a denoting the number of conversions and trials from configuration a. Then XᵀX = X̃ᵀ diag(ñ) X̃, and we can approximate Xᵀz = Σ_t z_t x_t = Σ_a x_a Σ_{t : x_t = x_a} z_t = Σ_a x_a z̄_a using the central limit theorem.

Let z̄_a = z_a⁺ + z_a⁻, where z_a⁺ is the sum of the y_a draws from N₀⁺(θᵀx_a, 1), and z_a⁻ is the sum of the n_a − y_a draws from N₀⁻(θᵀx_a, 1). Write μ_a = θᵀx_a, and let λ(α) = φ(α)/(1 − Φ(α)). If y_a is large then z_a⁺ is approximately normal with mean y_a{μ_a + λ(−μ_a)} and variance y_a{1 − λ(−μ_a)[λ(−μ_a) + μ_a]}. Likewise, let δ(α) = φ(α)/Φ(α). If n_a − y_a is large then z_a⁻ is approximately normal with mean (n_a − y_a){μ_a − δ(−μ_a)} and variance (n_a − y_a){1 − δ(−μ_a)[δ(−μ_a) − μ_a]}.

For configurations with large (> 50) values of y_a or n_a − y_a we compute z_a⁺ or z_a⁻ using its asymptotic distribution. If y_a < 50 we compute z_a⁺ by directly summing y_a draws from N₀⁺(μ_a, 1). Likewise if n_a − y_a < 50 we compute z_a⁻ by directly summing n_a − y_a draws from N₀⁻(μ_a, 1).

References

Albert, J. H. and Chib, S. (1993). Bayesian analysis
of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.

Bellman, R. E. (1956). A problem in the sequential design of experiments. Sankhyā Series A 30, 221–252.

Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall.

Brezzi, M. and Lai, T. L. (2000). Incomplete learning from endogenous data in dynamic allocation. Econometrica 68, 1511–1516.

Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27, 87–108.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science 10, 273–304.

Cox, D. R. and Reid, N. (2000). The Theory of the Design of Experiments. Chapman and Hall, CRC.

Gittins, J. and Wang, Y. G. (1992). The learning component of dynamic allocation indices. The Annals of Statistics 20, 1625–1636.

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B: Methodological 41, 148–177.

Google (2010). www.google.com/websiteoptimizer.

Luce, D. (1959). Individual Choice Behavior. Wiley.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, Inc.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294.

Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics 57, 450–456.

Tierney, L. (1994). Markov chains for exploring posterior distributions (disc: p. 1728–1762). The Annals of Statistics 22, 1701–1728.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer.

Whittle, P. (1979). Discussion of "Bandit processes and dynamic allocation indices". Journal of the Royal Statistical Society, Series B: Methodological 41, 165.

Whittle, P. (1981). Arm-acquiring bandits. The Annals of Probability 9, 284–292.

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability 25A, 287–298.

Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics 30, 100–121.