A Modern Bayesian Look at the Multi-Armed Bandit

Steven L. Scott

August 9, 2010

Abstract

A multi-armed bandit is a particular type of experimental design where the goal is to accumulate the largest possible reward. Rewards come from a payoff distribution with unknown parameters that are to be learned through sequential experimentation. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods which are "optimal" in simpler contexts. We summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.

1 Introduction

A multi-armed bandit is a sequential experiment with the goal of achieving the largest possible reward from a payoff distribution with unknown parameters. At each stage the experimenter must decide which arm of the experiment to observe next. The choice involves a fundamental trade-off between the utility gain from exploiting arms that appear to be doing well (based on limited sample information) and the information gain from exploring arms that might potentially be optimal, but which appear to be inferior because of sampling variability. This article reviews several techniques that have been used to manage the multi-armed bandit problem. Particular attention is paid to a technique known as randomized probability matching, which can be implemented quite simply in a modern Bayesian computing environment, and which can combine good ideas from both sequential and classical experimental design.

Multi-armed bandits have an important role to play in modern production environments that emphasize "continuous improvement," where products remain in a perpetual state of feature testing. Online software (such as a web site, an online advertisement, or a cloud service) is especially amenable to continuous improvement because experimental variation is easy to introduce, and because user responses to online stimuli are often quickly observed. Indeed, several frameworks for improving online services through experimentation have been developed. Google's Website Optimizer (Google, 2010) is one well known example. Designers provide Website Optimizer with several versions of their website, differing in font, image choice, layout, and other design elements. Website Optimizer randomly diverts traffic to the different configurations in search of configurations that have a high probability of producing successful outcomes, or conversions, as defined by the website owner. One disincentive for website owners to engage in online experiments is the fear that too much traffic will be diverted to inferior configurations in the name of experimental validity. Thus the exploration/exploitation trade-off arises because experimenters must weigh the potential gain of an increased conversion rate at the end of the experiment against the cost of a reduced conversion rate while it runs.
Treating product improvement experiments as multi-armed bandits can dramatically reduce the cost of experimentation.

The name "multi-armed bandit" is an allusion to a "one-armed bandit," a colloquial term for a slot machine. The straightforward analogy is to imagine different website configurations as a row of slot machines, each with its own probability of producing a reward (i.e. a conversion). The multi-armed bandit problem is notoriously resistant to analysis (Whittle, 1979), though optimal solutions are available in certain special cases (Gittins, 1979; Bellman, 1956). Even then, the optimal solutions are hard to compute, rely on artificial discount factors, and fail to generalize to realistic reward distributions. They can also exhibit incomplete learning, meaning that there is a positive probability of playing the wrong arm forever (Brezzi and Lai, 2000).

Because of the drawbacks associated with optimal solutions, analysts often turn to heuristics to manage the exploration/exploitation trade-off. Randomized probability matching is a particularly appealing heuristic that plays each arm in proportion to its probability of being optimal. Randomized probability matching is easy to implement, broadly applicable, and combines several attractive features of other popular heuristics. Randomized probability matching is an old idea (Thompson, 1933, 1935), but modern Bayesian computation has dramatically broadened the class of reward distributions to which it can be applied.

The simplicity of randomized probability matching allows the multi-armed bandit to incorporate powerful ideas from classical design. For example, both bandits and classical experiments must face the exploding number of possible configurations as factors are added to the experiment. Classical experiments handle this problem using fractional factorial designs (see, e.g., Cox and Reid, 2000), which surrender the ability to fit certain complex interactions in order to reduce the number of experimental runs. These designs combat the curse of dimensionality by learning about an arm's reward distribution indirectly, through the rewards of other arms with similar characteristics. Bandits can use the fractional factorial idea by assuming that a model, such as a probit or logistic regression, determines the reward distributions of the different arms. Assuming a parametric model allows the bandit to focus on a lower dimensional parameter space and thus potentially achieve greater rewards than "optimal" solutions that make no parametric assumptions.

It is worth noting that there are also important differences between classical experiments and bandits. For example, the traditional optimality criteria from classical experiments (D-optimality, A-optimality, etc.) tend to produce balanced experiments where all treatment effects can be accurately estimated. In a multi-armed bandit it is actually undesirable to accurately estimate treatment effects (i.e. the parameters of the reward distribution) for inferior arms. Instead, the bandit aims to gather just enough information about a sub-optimal arm to determine that it is sub-optimal, at which point further exploration becomes wasteful. A second difference is the importance placed on statistical significance. Classical experiments are designed to be analyzed using methods that tightly control the type-I error rate under a null hypothesis of no effect.
But when the cost of switching between products is small (as with software testing), the type-I error rate is of little relevance to the bandit. A type-I error corresponds to switching to a different arm that provides no material advantage over the current arm. By contrast, a type-II error means failing to switch to a superior arm, which could carry a substantial cost. Thus when switching costs are small, almost all the costs lie in type-II errors, which makes the usual notion of statistical significance largely irrelevant. Finally, classical experiments typically focus on designs for linear models because the information matrix in a linear model is a function of the design matrix. Designs for nonlinear models like probit or logistic regression are complicated by the fact that the information matrix depends on unknown parameters (Chaloner and Verdinelli, 1995). This complication presents no particular difficulty to the multi-armed bandit played under randomized probability matching.

The remainder of this paper is structured as follows. Section 2 describes the principle of randomized probability matching in greater detail. Section 3 reviews other approaches for multi-armed bandits, including the Gittins index and several popular heuristics. Section 4 presents a simulation study that investigates the performance of randomized probability matching in the unstructured binomial bandit, where optimal solutions are available. Section 5 describes a second simulation study in which the reward distribution has low dimensional structure, where "optimal" methods do poorly. There is an important symmetry between Sections 4 and 5. Section 4 illustrates the cost savings that sequential learning can have over classical experiments. Section 5 illustrates the improvements that can be brought to sequential learning by incorporating classical ideas like fractional factorial design. Section 6 concludes with observations about extending multi-armed bandits to more elaborate settings.

2 Randomized Probability Matching

Let $\mathbf{y}_t = (y_1, \ldots, y_t)$ denote the sequence of rewards observed up to time $t$. Let $a_t$ denote the arm of the bandit that was played at time $t$. We suppose that each $y_t$ was generated independently from the reward distribution $f_{a_t}(y \mid \theta)$, where $\theta$ is an unknown parameter vector, and some components of $\theta$ may be shared across the different arms. To make the notation concrete, consider two specific examples, both of which take $y_t \in \{0, 1\}$. Continuous rewards are also possible, of course, but we will focus on binary rewards because counts of clicks or conversions are the typical measure of success in e-commerce. The first example is the binomial bandit, in which $\theta = (\theta_1, \ldots, \theta_k)$ and $f_a(y_t \mid \theta)$ is the Bernoulli distribution with success probability $\theta_a$. The binomial bandit is the canonical bandit problem appearing most often in the literature. The second example is the fractional factorial bandit, where $a_t$ corresponds to a set of levels for a group of experimental factors (including potential interactions), coded as dummy variables in the vector $x_t$. Let $k$ denote the number of possible configurations of $x_t$, and let $a_t \in \{1, \ldots, k\}$ refer to a particular configuration according to some labeling scheme. The probability of success is $f_a(y_t = 1 \mid \theta) = g(\theta^T x_t)$, where $g$ is a binomial link function, such as probit or logistic. We refer to the case where $g$ is the CDF of the standard normal distribution as the probit bandit.
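To make these two reward models concrete, here is a minimal sketch that draws a single binary reward from each. It is an illustration written for this summary rather than code from the paper; the function names, the use of NumPy and SciPy, and the parameter values are all assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def binomial_bandit_reward(theta, arm):
    """Binomial bandit: arm a pays 1 with probability theta[a]."""
    return rng.binomial(1, theta[arm])

def probit_bandit_reward(theta, x):
    """Fractional factorial (probit) bandit: success probability is
    g(theta' x) with g the standard normal CDF."""
    return rng.binomial(1, norm.cdf(x @ theta))

# A 3-arm binomial bandit and one configuration of a 2-factor probit bandit.
theta_binom = np.array([0.04, 0.05, 0.06])   # per-arm conversion probabilities
print(binomial_bandit_reward(theta_binom, arm=1))

theta_probit = np.array([-1.5, 0.3, 0.2])    # intercept plus two main effects
x = np.array([1.0, 1.0, 0.0])                # dummy coding of one configuration
print(probit_bandit_reward(theta_probit, x))
```

The point of the second model is that the $k$ configurations share the low-dimensional coefficient vector, so observing one configuration is informative about the others.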
Let $\mu_a(\theta) = E(y_t \mid \theta, a_t = a)$ denote the expected reward from $f_a(y \mid \theta)$. If $\theta$ were known then the optimal long run strategy would be to always choose the arm with the largest $\mu_a(\theta)$. Let $p(\theta)$ denote a prior distribution on $\theta$, from which we may compute

\[
w_{a0} = \Pr\bigl(\mu_a = \max\{\mu_1, \ldots, \mu_k\}\bigr). \tag{1}
\]

The computation in equation (1) can be expressed as an integral of an indicator function. Let $I_a(\theta) = 1$ if $\mu_a(\theta) = \max\{\mu_1(\theta), \ldots, \mu_k(\theta)\}$, and $I_a(\theta) = 0$ otherwise. Then

\[
w_{a0} = \Pr\bigl(\mu_a = \max\{\mu_1, \ldots, \mu_k\}\bigr) = E\bigl(I_a(\theta)\bigr) = \int I_a(\theta)\, p(\theta)\, d\theta. \tag{2}
\]

If a priori little is known about $\theta$ then the implied distribution on $\mu$ will be exchangeable, and thus $w_{a0}$ will be uniform. As rewards from the bandit are observed, the parameters of the reward distribution are learned through Bayesian updating. At time $t$ the posterior distribution of $\theta$ is

\[
p(\theta \mid \mathbf{y}_t) \propto p(\theta) \prod_{\tau=1}^{t} f_{a_\tau}(y_\tau \mid \theta), \tag{3}
\]

from which one may compute

\[
w_{at} = \Pr\bigl(\mu_a = \max\{\mu_1, \ldots, \mu_k\} \mid \mathbf{y}_t\bigr) = E\bigl(I_a(\theta) \mid \mathbf{y}_t\bigr), \tag{4}
\]

as in equation (2).

Randomized probability matching allocates observation $t + 1$ to arm $a$ with probability $w_{at}$. Randomized probability matching is not known to optimize any specific utility function, but it is easy to apply in general settings, it balances exploration and exploitation in a natural way, and it tends to allocate observations efficiently from both inferential and economic perspectives. It is compatible with batch updates of the posterior distribution, and the methods used to compute the allocation probabilities make it easy to compute the expected amount of lost reward relative to playing the optimal arm of the bandit from the beginning. Finally, randomized probability matching is free of arbitrary tuning parameters that must be set by the analyst.

2.1 Computing Allocation Probabilities

For some families of reward distributions it is possible to compute $w_{at}$ either analytically or by quadrature. In any case it is easy to compute $w_{at}$ by simulation. Let $\theta^{(1)}, \ldots, \theta^{(G)}$ be a sample of independent draws from $p(\theta \mid \mathbf{y}_t)$. Then by the law of large numbers,

\[
w_{at} = \lim_{G \to \infty} \frac{1}{G} \sum_{g=1}^{G} I_a\bigl(\theta^{(g)}\bigr). \tag{5}
\]

Equation (5) simply says to estimate $w_{at}$ by the empirical proportion of Monte Carlo samples in which $\mu_a(\theta^{(g)})$ is maximal. If $f_a$ is in the exponential family and $p(\theta)$ is a conjugate prior distribution then independent draws of $\theta$ are possible. Otherwise we may draw a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ from an ergodic Markov chain with $p(\theta \mid \mathbf{y}_t)$ as its stationary distribution (Tierney, 1994). In the latter case equation (5) remains unchanged, but it is justified by the ergodic theorem rather than the law of large numbers.

Posterior draws of $\theta$ are all that is needed to apply randomized probability matching. Such draws are available for a very wide class of models through Markov chain Monte Carlo and other sampling algorithms, which means randomized probability matching can be applied with almost any family of reward distributions.

2.2 Implicit Allocation

The optimality probabilities do not need to be computed explicitly. It will usually be faster to simulate $a \sim w_{at}$ by simulating a single draw of $\theta^{(g)}$ from $p(\theta \mid \mathbf{y}_t)$, then choosing $a = \arg\max_a \mu_a(\theta^{(g)})$.

2.3 Balancing Exploration and Exploitation

Randomized probability matching naturally incorporates uncertainty about $\theta$ because $w_{at}$ is defined as an integral over the entire posterior distribution $p(\theta \mid \mathbf{y}_t)$. To illustrate, consider the binomial bandit with $k = 2$ under independent beta priors. Figure 1(a) plots $p(\theta_1, \theta_2 \mid \mathbf{y})$ assuming we have observed 20 successes and 30 failures from the first arm, along with 2 successes and 1 failure from the second arm.
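As a concrete illustration of the Monte Carlo recipe in equation (5) and the implicit allocation of Section 2.2, the sketch below works through the two-armed binomial bandit of Figure 1(a), treating the beta distributions shown in the figure as the two arms' posteriors. This is my own NumPy sketch, not code from the paper, and it reproduces the optimality probabilities discussed in the next paragraph only up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 100_000  # number of posterior draws

# Posterior draws for the two arms, taking the beta distributions of
# Figure 1(a) as the posteriors: arm 1 ~ Beta(20, 30), arm 2 ~ Beta(2, 1).
draws = np.column_stack([rng.beta(20, 30, size=G),
                         rng.beta(2, 1, size=G)])

# Equation (5): estimate w_at by the proportion of draws in which each
# arm's success probability is the largest.
best = np.argmax(draws, axis=1)
w = np.bincount(best, minlength=draws.shape[1]) / G
print(w)  # [Pr(arm 1 is optimal), Pr(arm 2 is optimal)]

# Section 2.2, implicit allocation: instead of computing w explicitly,
# take one posterior draw and play the arm that maximizes it.
single_draw = np.array([rng.beta(20, 30), rng.beta(2, 1)])
print(int(np.argmax(single_draw)))  # arm to play at the next time step
```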
In panel (b) the second arm is replaced with an arm which has generated 20 successes and 10 failures. Thus it has the same empirical success rate as the second arm in panel (a), but with a larger sample size. In both plots the optimality probability for the first arm is the probability that a dot lands below the 45-degree line (which equation (5) estimates by counting simulated dots). In panel (a) the probability that the first arm is optimal is around 18%, despite having a lower empirical success rate than the second arm. In panel (b) the larger sample size causes the posterior distribution to tighten, which lowers the first arm's optimality probability to 0.8%.

[Figure 1: 1000 draws from the joint distribution of two independent beta distributions. In both cases the horizontal axis represents a beta(20, 30) distribution. The vertical axis is (a) beta(2, 1), (b) beta(20, 10).]

This example demonstrates that the need to experiment decreases as more is learned about the parameters of the reward distribution. If the two largest values of $\mu_a(\theta)$ are distinct then $\max_a w_{at}$ eventually converges to 1. If the $k_1 \le k$ largest values of $\mu_a(\theta)$ are identical then $\max_a w_{at}$ need not converge, but $w_{at}$ may drift on a subset of the probability simplex that divides 100% of the probability among the $k_1$ optimal alternatives. This is obviously just as good as convergence from the perspective of total reward accumulation.

Notice the alignment between the inferential goal of finding the optimal arm and the economic goal of accumulating reward. From both perspectives it is desirable for superior arms to rapidly accumulate observations. Downweighting inferior arms leads to larger economic rewards, while larger sample sizes for superior arms mean that the optimal arm can be more quickly distinguished from its close competitors.

3 Other Solutions

This section reviews a collection of strategies that have been used with multi-armed bandit problems. We discuss pure exploration strategies, purely greedy strategies, hybrid strategies, and Gittins indices. Of these, only Gittins indices carry any guarantee of optimality, and then only under a very particular scheme of discounting future rewards.

3.1 The Gittins Index

Gittins (1979) provided a method of computing the optimal strategy in certain bandit problems. Gittins assumed a geometrically discounted stream of future rewards with present value $PV = \sum_{t=0}^{\infty} \gamma^t y_t$, for some $0 \le \gamma < 1$. Gittins provided an algorithm for computing the expected discounted present value of playing arm $a$, assuming optimal play in the future, a quantity that has since become known as the "Gittins index." Thus, by definition, playing the arm with the largest Gittins index maximizes the expected present value of discounted future rewards. The Gittins index has the further remarkable property that it can be computed separately for each arm, in ignorance of the other arms. A policy with this property is known as an index policy.

Logical and computational difficulties have prevented the widespread adoption of Gittins indices. Powell (2007) notes that "Unfortunately, at the time of this writing, there do not exist easy to use software utilities for computing standard Gittins indices." Sutton and Barto (1998) add "Unfortunately, neither the theory nor the computational tractability of [Gittins indices] appear to generalize to the full reinforcement learning problem ..."
Although it is hard to compute Gittins indices exactly, Brezzi and Lai (2002) have developed an approximate Gittins index based on a normal approximation to $p(\theta \mid \mathbf{y})$. Figure 2 plots the approximation for the binomial bandit with two different values of $\gamma$. Both sets of indices converge to $a/(a + b)$ as $a$ and $b$ grow large, but the rate of convergence slows as $\gamma \to 1$.

The Brezzi and Lai approximation to the Gittins index is as follows. Let $\hat{\theta}_{an} = E(\theta_a \mid \mathbf{y}_n)$, $v_{an} = \mathrm{Var}(\theta_a \mid \mathbf{y}_n)$, $\sigma_a^2(\theta) = \mathrm{Var}(y_t \mid \theta, a_t = a)$, and $c = -\log \gamma$. Then

\[
\nu_a(\mathbf{y}_n) \approx \hat{\theta}_{an} + v_{an}^{1/2}\, \psi\!\left(\frac{v_{an}}{c\, \sigma_a^2(\hat{\theta}_{an})}\right), \tag{6}
\]

where

\[
\psi(s) =
\begin{cases}
\sqrt{s/2} & \text{if } s \le 0.2,\\
0.49 - 0.11\, s^{-1/2} & \text{if } 0.2 < s \le 1,\\
0.63 - 0.26\, s^{-1/2} & \text{if } 1 < s \le 5,\\
0.77 - 0.58\, s^{-1/2} & \text{if } 5 < s \le 15,\\
\bigl(2 \log s - \log \log s - \log 16\pi\bigr)^{1/2} & \text{if } s > 15.
\end{cases} \tag{7}
\]

(A short code sketch of this approximation is given at the end of this excerpt.)

Computing aside, there are three logical issues that challenge the Gittins index (and the broader class of index policies). The first is that it requires the arms to have distinct parameters. Thus Gittins indices cannot be applied to problems involving covariates or structured experimental factors. The second problem is the need to choose $\gamma$. Geometric discounting only makes sense if arms are played at equally spaced time intervals. Otherwise a higher discount factor should be used for periods of higher traffic, but if the discounting scheme is anything other than geometric then Gittins indices are no longer optimal (Gittins and Wang, 1992; Berry and Fristedt, 1985). A final issue is known as incomplete learning, which means that the Gittins index is an inconsistent estimator of the location of the optimal arm. This is because the Gittins policy eventually chooses one arm on which to continue forever, and there is a positive probability that the chosen arm is sub-optimal (Brezzi and Lai, 2000).

3.2 Heuristic strategies

3.2.1 Equal Allocation

One naive method of playing a multi-armed bandit is to equally allocate observations to arms until the maximum optimality probability exceeds some threshold, and then play the winning arm afterward. This strategy leads to stable estimates of $\theta$ for all the arms, but Section 4 demonstrates that it is grossly inefficient with respect to the overall reward. Of the methods considered here, equal allocation most closely corresponds to a non-sequential classical experiment (the full-factorial design).

3.2.2 Play-the-Winner

In play-the-winner, if arm $a$ results in a success at time $t$, then it will be played again at time $t + 1$. If a failure is observed, then the next arm is either chosen at random or the arms are cycled through deterministically. Play-the-winne...
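As a supplement to Section 3.1 (referenced after equation (7) above), here is a sketch of the Brezzi and Lai approximation in equations (6) and (7), specialized to a single arm of the binomial bandit with a Beta(a, b) posterior, so that the posterior mean is a/(a+b) and sigma^2(theta) = theta(1 - theta). This is my own translation under that assumption, not code from the paper or from Brezzi and Lai (2002).

```python
import math

def psi(s):
    """Piecewise function psi(s) from equation (7)."""
    if s <= 0.2:
        return math.sqrt(s / 2.0)
    if s <= 1.0:
        return 0.49 - 0.11 / math.sqrt(s)
    if s <= 5.0:
        return 0.63 - 0.26 / math.sqrt(s)
    if s <= 15.0:
        return 0.77 - 0.58 / math.sqrt(s)
    return math.sqrt(2.0 * math.log(s) - math.log(math.log(s)) - math.log(16.0 * math.pi))

def approx_gittins_beta(a, b, gamma):
    """Equation (6) for a Beta(a, b) posterior in the binomial bandit:
    posterior mean plus posterior sd times psi(v / (c * sigma^2)),
    where c = -log(gamma)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))   # Var(theta | y)
    sigma2 = mean * (1.0 - mean)                 # Var(y | theta) at the posterior mean
    c = -math.log(gamma)
    return mean + math.sqrt(var) * psi(var / (c * sigma2))

# The index exceeds the posterior mean a / (a + b), and the exploration
# bonus shrinks as a and b grow large, as described in the text.
print(approx_gittins_beta(2, 1, gamma=0.99))     # small sample: large bonus
print(approx_gittins_beta(200, 100, gamma=0.99)) # large sample: index near 2/3
```

Under geometric discounting, the approximately optimal policy plays the arm whose index is largest at each step.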