A Modern Bayesian Look at the Multi-Armed Bandit

Steven L. Scott

August 9, 2010

Abstract

A multi-armed bandit is a particular type of experimental design where the goal is to accumulate the largest possible reward. Rewards come from a payoff distribution with unknown parameters that are to be learned through sequential experimentation. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according to the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods which are "optimal" in simpler contexts. We summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature.

1 Introduction

A multi-armed bandit is a sequential experiment with the goal of achieving the largest possible reward from a payoff distribution with unknown parameters. At each stage the experimenter must decide which arm of the experiment to observe next. The choice involves a fundamental trade-off between the utility gain from exploiting arms that appear to be doing well (based on limited sample information) and the information gain from exploring arms that might potentially be optimal, but which appear to be inferior because of sampling variability. This article reviews several techniques that have been used to manage the multi-armed bandit problem. Particular attention is paid to a technique known as randomized probability matching, which can be implemented quite simply in a modern Bayesian computing environment, and which can combine good ideas from both sequential and classical experimental design.

Multi-armed bandits have an important role to play in modern production environments that emphasize "continuous improvement," where products remain in a perpetual state of feature testing. Online software (such as a web site, an online advertisement, or a cloud service) is especially amenable to continuous improvement because experimental variation is easy to introduce, and because user responses to online stimuli are often quickly observed. Indeed, several frameworks for improving online services through experimentation have been developed. Google's Website Optimizer (Google, 2010) is one well known example. Designers provide Website Optimizer with several versions of their website differing in font, image choice, layout, and other design elements. Website Optimizer randomly diverts traffic to the different configurations in search of configurations that have a high probability of producing successful outcomes, or conversions, as defined by the website owner. One disincentive for website owners to engage in online experiments is the fear that too much traffic will be diverted to inferior configurations in the name of experimental validity. Thus the exploration/exploitation trade-off arises because experimenters must weigh the potential gain of an increased conversion rate at the end of the experiment against the cost of a reduced conversion rate while it runs.
Treating product improvement experiments as multi-armed bandits can dramatically reduce the cost of experimentation.

The name "multi-armed bandit" is an allusion to a "one-armed bandit," a colloquial term for a slot machine. The straightforward analogy is to imagine different website configurations as a row of slot machines, each with its own probability of producing a reward (i.e. a conversion). The multi-armed bandit problem is notoriously resistant to analysis (Whittle, 1979), though optimal solutions are available in certain special cases (Gittins, 1979; Bellman, 1956). Even then, the optimal solutions are hard to compute, rely on artificial discount factors, and fail to generalize to realistic reward distributions. They can also exhibit incomplete learning, meaning that there is a positive probability of playing the wrong arm forever (Brezzi and Lai, 2000).

Because of the drawbacks associated with optimal solutions, analysts often turn to heuristics to manage the exploration/exploitation trade-off. Randomized probability matching is a particularly appealing heuristic that plays each arm in proportion to its probability of being optimal. Randomized probability matching is easy to implement, broadly applicable, and combines several attractive features of other popular heuristics. Randomized probability matching is an old idea (Thompson, 1933, 1935), but modern Bayesian computation has dramatically broadened the class of reward distributions to which it can be applied.

The simplicity of randomized probability matching allows the multi-armed bandit to incorporate powerful ideas from classical design. For example, both bandits and classical experiments must face the exploding number of possible configurations as factors are added to the experiment. Classical experiments handle this problem using fractional factorial designs (see, e.g., Cox and Reid, 2000), which surrender the ability to fit certain complex interactions in order to reduce the number of experimental runs. These designs combat the curse of dimensionality by learning about an arm's reward distribution indirectly, through the rewards observed from other arms with similar characteristics. Bandits can use the fractional factorial idea by assuming that a model, such as a probit or logistic regression, determines the reward distributions of the different arms. Assuming a parametric model allows the bandit to focus on a lower dimensional parameter space and thus potentially achieve greater rewards than "optimal" solutions that make no parametric assumptions.

It is worth noting that there are also important differences between classical experiments and bandits. For example, the traditional optimality criteria from classical experiments (D-optimality, A-optimality, etc.) tend to produce balanced experiments where all treatment effects can be accurately estimated. In a multi-armed bandit it is actually undesirable to accurately estimate treatment effects (i.e. the parameters of the reward distribution) for inferior arms. Instead, the bandit aims to gather just enough information about a sub-optimal arm to determine that it is sub-optimal, at which point further exploration becomes wasteful. A second difference is the importance placed on statistical significance. Classical experiments are designed to be analyzed using methods that tightly control the type-I error rate under a null hypothesis of no effect.
But when the cost of switching between products is small (as with software testing), the type-I error rate is of little relevance to the bandit. A type-I error corresponds to switching to a different arm that provides no material advantage over the current arm. By contrast, a type-II error means failing to switch to a superior arm, which could carry a substantial cost. Thus when switching costs are small almost all the costs lie in type-II errors, which makes the usual notion of statistical significance largely irrelevant. Finally, classical experiments typically focus on designs for linear models because the information matrix in a linear model is a function of the design matrix. Designs for nonlinear models like probit or logistic regression are complicated by the fact that the information matrix depends on unknown parameters (Chaloner and Verdinelli, 1995). This complication presents no particular difficulty to the multi-armed bandit played under randomized probability matching.

The remainder of this paper is structured as follows. Section 2 describes the principle of randomized probability matching in greater detail. Section 3 reviews other approaches for multi-armed bandits, including the Gittins index and several popular heuristics. Section 4 presents a simulation study that investigates the performance of randomized probability matching in the unstructured binomial bandit, where optimal solutions are available. Section 5 describes a second simulation study in which the reward distribution has low dimensional structure, where "optimal" methods do poorly. There is an important symmetry between Sections 4 and 5. Section 4 illustrates the cost savings that sequential learning can have over classical experiments. Section 5 illustrates the improvements that can be brought to sequential learning by incorporating classical ideas like fractional factorial design. Section 6 concludes with observations about extending multi-armed bandits to more elaborate settings.

2 Randomized Probability Matching

Let y_t = (y_1, ..., y_t) denote the sequence of rewards observed up to time t. Let a_t denote the arm of the bandit that was played at time t. We suppose that each y_t was generated independently from the reward distribution f_{a_t}(y|θ), where θ is an unknown parameter vector, and some components of θ may be shared across the different arms. To make the notation concrete, consider two specific examples, both of which take y_t in {0, 1}. Continuous rewards are also possible, of course, but we will focus on binary rewards because counts of clicks or conversions are the typical measure of success in e-commerce. The first example is the binomial bandit, in which θ = (θ_1, ..., θ_k) and f_a(y_t|θ) is the Bernoulli distribution with success probability θ_a. The binomial bandit is the canonical bandit problem appearing most often in the literature. The second example is the fractional factorial bandit, where a_t corresponds to a set of levels for a group of experimental factors (including potential interactions), coded as dummy variables in the vector x_t. Let k denote the number of possible configurations of x_t, and let a_t in {1, ..., k} refer to a particular configuration according to some labeling scheme. The probability of success is f_a(y_t = 1|θ) = g(θ^T x_t), where g is a binomial link function, such as the probit or logistic link. We refer to the case where g is the CDF of the standard normal distribution as the probit bandit.
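As a concrete illustration of the dummy-variable coding in the fractional factorial bandit, the short R sketch below (not from the paper) builds a small hypothetical probit bandit with two design factors. The factor names, levels, and the value of theta are invented purely for illustration.

## Minimal sketch (not from the paper): success probability in a hypothetical
## probit bandit.  Each arm is one configuration of two design factors,
## dummy-coded into a row of the design matrix.
arms <- expand.grid(font = c("serif", "sans"),
                    layout = c("A", "B", "C"))       # 2 x 3 = 6 configurations
X <- model.matrix(~ font + layout, data = arms)      # intercept + 1 + 2 = 4 dummy columns
theta <- c(-1.5, 0.2, -0.1, 0.3)                     # hypothetical parameter vector
p_success <- pnorm(X %*% theta)                      # Pr(y = 1 | arm) = Phi(theta' x)
cbind(arms, p_success)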
Let μ_a(θ) = E(y_t | θ, a_t = a) denote the expected reward from f_a(y|θ). If θ were known then the optimal long run strategy would be to always choose the arm with the largest μ_a(θ). Let p(θ) denote a prior distribution on θ, from which we may compute the prior probability that arm a is optimal,

  w_{a0} = \Pr\left( \mu_a(\theta) = \max\{\mu_1(\theta), \ldots, \mu_k(\theta)\} \right).    (1)

The computation in equation (1) can be expressed as an integral of an indicator function. Let I_a(θ) = 1 if μ_a(θ) = max{μ_1(θ), ..., μ_k(θ)}, and I_a(θ) = 0 otherwise. Then

  w_{a0} = \Pr(\mu_a = \max\{\mu_1, \ldots, \mu_k\}) = E(I_a(\theta)) = \int I_a(\theta)\, p(\theta)\, d\theta.    (2)

If a priori little is known about θ then the implied distribution on μ will be exchangeable, and thus w_{a0} will be uniform. As rewards from the bandit are observed, the parameters of the reward distribution are learned through Bayesian updating. At time t the posterior distribution of θ is

  p(\theta \mid \mathbf{y}_t) \propto p(\theta) \prod_{\tau=1}^{t} f_{a_\tau}(y_\tau \mid \theta),    (3)

from which one may compute

  w_{at} = \Pr(\mu_a = \max\{\mu_1, \ldots, \mu_k\} \mid \mathbf{y}_t) = E(I_a(\theta) \mid \mathbf{y}_t),    (4)

as in equation (2).

Randomized probability matching allocates observation t + 1 to arm a with probability w_{at}. Randomized probability matching is not known to optimize any specific utility function, but it is easy to apply in general settings, it balances exploration and exploitation in a natural way, and it tends to allocate observations efficiently from both inferential and economic perspectives. It is compatible with batch updates of the posterior distribution, and the methods used to compute the allocation probabilities make it easy to compute the expected amount of lost reward relative to playing the optimal arm of the bandit from the beginning. Finally, randomized probability matching is free of arbitrary tuning parameters that must be set by the analyst.

2.1 Computing Allocation Probabilities

For some families of reward distributions it is possible to compute w_{at} either analytically or by quadrature. In any case it is easy to compute w_{at} by simulation. Let θ^(1), ..., θ^(G) be a sample of independent draws from p(θ|y_t). Then by the law of large numbers,

  w_{at} = \lim_{G \to \infty} \frac{1}{G} \sum_{g=1}^{G} I_a(\theta^{(g)}).    (5)

Equation (5) simply says to estimate w_{at} by the empirical proportion of Monte Carlo samples in which μ_a(θ^(g)) is maximal. If f_a is in the exponential family and p(θ) is a conjugate prior distribution then independent draws of θ are possible. Otherwise we may draw a sequence θ^(1), θ^(2), ... from an ergodic Markov chain with p(θ|y_t) as its stationary distribution (Tierney, 1994). In the latter case equation (5) remains unchanged, but it is justified by the ergodic theorem rather than the law of large numbers.

Posterior draws of θ are all that is needed to apply randomized probability matching. Such draws are available for a very wide class of models through Markov chain Monte Carlo and other sampling algorithms, which means randomized probability matching can be applied with almost any family of reward distributions.

2.2 Implicit Allocation

The optimality probabilities do not need to be computed explicitly. It will usually be faster to simulate a ~ w_{at} by simulating a single draw θ^(g) from p(θ|y_t) and then choosing a = arg max_a μ_a(θ^(g)).

2.3 Balancing Exploration and Exploitation

Randomized probability matching naturally incorporates uncertainty about θ because w_{at} is defined as an integral over the entire posterior distribution p(θ|y_t). To illustrate, consider the binomial bandit with k = 2 under independent beta priors. Figure 1(a) plots p(θ_1, θ_2 | y) assuming we have observed 20 successes and 30 failures from the first arm, along with 2 successes and 1 failure from the second arm.
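The optimality probabilities for both panels of Figure 1 can be estimated with a few lines of simulation, exactly as in equation (5). Below is a minimal sketch (not the paper's code), assuming uniform priors so that the panel (a) posteriors are Beta(21, 31) and Beta(3, 2), and the replacement arm in panel (b), described next, has posterior Beta(21, 11).

## Minimal sketch (not the paper's code): Monte Carlo estimate of the
## optimality probability of arm 1 for the two panels of Figure 1,
## assuming uniform Beta(1, 1) priors.
set.seed(1)                        # arbitrary seed, for reproducibility only
ndraws <- 100000
theta1  <- rbeta(ndraws, 21, 31)   # arm 1: 20 successes, 30 failures
theta2a <- rbeta(ndraws, 3, 2)     # panel (a) arm 2: 2 successes, 1 failure
theta2b <- rbeta(ndraws, 21, 11)   # panel (b) arm 2: 20 successes, 10 failures
mean(theta1 > theta2a)             # close to the "around 18%" quoted in the text
mean(theta1 > theta2b)             # close to the 0.8% quoted in the text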
In panel (b) the second arm is replaced with an arm which has generated 20 successes and 10 failures. Thus it has the same empirical success rate as the second arm in panel (a), but with a larger sample size. In both plots the optimality probability for the first arm is the probability that a dot lands below the 45-degree line (which equation (5) estimates by counting simulated dots). In panel (a) the probability that the first arm is optimal is around 18%, despite having a lower empirical success rate than the second arm. In panel (b) the larger sample size causes the posterior distribution to tighten, which lowers the first arm's optimality probability to 0.8%.

Figure 1: 1000 draws from the joint distribution of two independent beta distributions. In both panels the horizontal axis represents a beta(20, 30) distribution. The vertical axis is (a) beta(2, 1), (b) beta(20, 10).

This example demonstrates that the need to experiment decreases as more is learned about the parameters of the reward distribution. If the two largest values of μ_a(θ) are distinct then max_a w_{at} eventually converges to 1. If the k_1 ≤ k largest values of μ_a(θ) are identical then max_a w_{at} need not converge, but w_{at} may drift on a subset of the probability simplex that divides 100% of the probability among the k_1 optimal alternatives. This is obviously just as good as convergence from the perspective of total reward accumulation.

Notice the alignment between the inferential goal of finding the optimal arm and the economic goal of accumulating reward. From both perspectives it is desirable for superior arms to rapidly accumulate observations. Downweighting inferior arms leads to larger economic rewards, while larger sample sizes for superior arms mean that the optimal arm can be more quickly distinguished from its close competitors.

3 Other Solutions

This section reviews a collection of strategies that have been used with multi-armed bandit problems. We discuss pure exploration strategies, purely greedy strategies, hybrid strategies, and Gittins indices. Of these, only Gittins indices carry any guarantee of optimality, and then only under a very particular scheme of discounting future rewards.

3.1 The Gittins Index

Gittins (1979) provided a method of computing the optimal strategy in certain bandit problems. Gittins assumed a geometrically discounted stream of future rewards with present value

  \mathrm{PV} = \sum_{t=0}^{\infty} \gamma^t y_t,

for some 0 ≤ γ < 1. Gittins provided an algorithm for computing the expected discounted present value of playing arm a, assuming optimal play in the future, a quantity that has since become known as the "Gittins index." Thus, by definition, playing the arm with the largest Gittins index maximizes the expected present value of discounted future rewards. The Gittins index has the further remarkable property that it can be computed separately for each arm, in ignorance of the other arms. A policy with this property is known as an index policy.

Logical and computational difficulties have prevented the widespread adoption of Gittins indices. Powell (2007) notes that "Unfortunately, at the time of this writing, there do not exist easy to use software utilities for computing standard Gittins indices." Sutton and Barto (1998) add "Unfortunately, neither the theory nor the computational tractability of [Gittins indices] appear to generalize to the full reinforcement learning problem."
Although it is hard to compute Gittins indices exactly, Brezzi and Lai (2002) have developed an approximate Gittins index based on a normal approximation to p(θ|y). Figure 2 plots the approximation for the binomial bandit with two different values of γ. Both sets of indices converge to a/(a + b) as a and b grow large, but the rate of convergence slows as γ → 1.

Figure 2: Brezzi and Lai's approximation to the Gittins index for the binomial bandit problem with (a) γ = .8, and (b) γ = .999.

The Brezzi and Lai approximation to the Gittins index is as follows. Let θ̂_{an} = E(θ_a | y_n), v_{an} = Var(θ_a | y_n), σ_a²(θ) = Var(y_t | θ, a_t = a), and c = −log γ. Then the approximate index for arm a is

  \nu_a \approx \hat\theta_{an} + v_{an}^{1/2}\, \psi\!\left(\frac{v_{an}}{c\,\sigma_a^2(\hat\theta_{an})}\right),    (6)

where

  \psi(s) =
  \begin{cases}
    \sqrt{s/2} & \text{if } s \le 0.2, \\
    0.49 - 0.11\, s^{-1/2} & \text{if } 0.2 < s \le 1, \\
    0.63 - 0.26\, s^{-1/2} & \text{if } 1 < s \le 5, \\
    0.77 - 0.58\, s^{-1/2} & \text{if } 5 < s \le 15, \\
    (2 \log s - \log\log s - \log 16\pi)^{1/2} & \text{if } s > 15.
  \end{cases}    (7)

Computing aside, there are three logical issues that challenge the Gittins index (and the broader class of index policies). The first is that it requires the arms to have distinct parameters. Thus Gittins indices cannot be applied to problems involving covariates or structured experimental factors. The second problem is the need to choose γ. Geometric discounting only makes sense if arms are played at equally spaced time intervals. Otherwise a higher discount factor should be used for periods of higher traffic, but if the discounting scheme is anything other than geometric then Gittins indices are no longer optimal (Gittins and Wang, 1992; Berry and Fristedt, 1985). A final issue is known as incomplete learning, which means that the Gittins index is an inconsistent estimator of the location of the optimal arm. This is because the Gittins policy eventually chooses one arm on which to continue forever, and there is a positive probability that the chosen arm is sub-optimal (Brezzi and Lai, 2000).
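The approximation is simple to code. The R sketch below is a direct transcription of equations (6)-(7) as reconstructed above (the garbled original was rebuilt from the standard truncated-normal and beta-posterior facts), specialized to a binomial arm with a uniform prior; the function names are mine, not the paper's.

## Sketch (not the paper's code) of the Brezzi-Lai approximate Gittins index
## in equations (6)-(7), for one arm of the binomial bandit with a uniform
## prior, so the arm's posterior is Beta(y + 1, n - y + 1).
psi <- function(s) {
  if (s <= 0.2)      sqrt(s / 2)
  else if (s <= 1)   0.49 - 0.11 / sqrt(s)
  else if (s <= 5)   0.63 - 0.26 / sqrt(s)
  else if (s <= 15)  0.77 - 0.58 / sqrt(s)
  else               sqrt(2 * log(s) - log(log(s)) - log(16 * pi))
}

approx.gittins <- function(y, n, gamma) {
  theta.hat <- (y + 1) / (n + 2)                      # posterior mean
  v         <- theta.hat * (1 - theta.hat) / (n + 3)  # posterior variance
  sigma2    <- theta.hat * (1 - theta.hat)            # Var(y | theta) at the posterior mean
  cc        <- -log(gamma)
  theta.hat + sqrt(v) * psi(v / (cc * sigma2))        # equation (6)
}

approx.gittins(y = 20, n = 50, gamma = 0.999)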
3.2 Heuristic Strategies

3.2.1 Equal Allocation

One naive method of playing a multi-armed bandit is to equally allocate observations to arms until the maximum optimality probability exceeds some threshold, and then play the winning arm afterward. This strategy leads to stable estimates of θ for all the arms, but Section 4 demonstrates that it is grossly inefficient with respect to the overall reward. Of the methods considered here, equal allocation most closely corresponds to a non-sequential classical experiment (the full factorial design).

3.2.2 Play-the-Winner

In play-the-winner, if arm a results in a success at time t, then it will be played again at time t + 1. If a failure is observed, then the next arm is either chosen at random or the arms are cycled through deterministically. Play-the-winner can be nearly optimal when the best arm has a very high success rate. Berry and Fristedt (1985) show that if arm a is optimal at time t, and it results in a success, then it is also optimal to play arm a at time t + 1. However, play-the-winner over-explores when the success rate of the optimal arm is low. Play-the-winner tends toward equal allocation as the success rate of the optimal arm tends to zero.

3.2.3 Deterministic Greedy Strategies

An algorithm that focuses purely on exploitation is said to be greedy. A textbook example of a greedy algorithm is to always choose the arm with the highest sample mean reward. This approach has been shown to do a poor job of producing large overall rewards because it fails to adequately explore the other arms (Sutton and Barto, 1998). A somewhat better greedy algorithm is deterministic probability matching, which always chooses the arm with the highest optimality probability w_{at}.

One practicality to keep in mind with greedy strategies is that they can do a poor job when batch updating is employed. For logistical reasons the bandit may have to be played multiple times before it can learn from recent activity. This can occur if data arrive in batches instead of in real time (e.g. computer logs might be scraped once per hour, or once per day). Greedy algorithms can suffer in batch updating because they play the same arm for an entire update cycle. This can add substantial variance to the total reward relative to randomized probability matching. Greedy strategies perform especially poorly in the early phases of batch updating because they only learn about one arm per update cycle.

3.2.4 Hybrid Strategies

Hybrid strategies are greedy strategies that have been modified to force some amount of exploration. One example is an ε-greedy strategy. Here each allocation takes place according to a greedy algorithm with probability 1 − ε; otherwise a random allocation takes place. The equal allocation strategy is an ε-greedy algorithm with ε = 1. One can criticize an ε-greedy strategy on the grounds that it has poor asymptotic behavior, because it continues to explore long after the optimal solution becomes apparent. This leads to the notion of an ε-decreasing strategy, which is an ε-greedy strategy where ε decreases over time.

Both ε-greedy and ε-decreasing strategies are wasteful in the sense that they use simple random sampling as the basis for exploration. A more fruitful approach would be to use stratified sampling that under-samples arms that are likely to be sub-optimal. Softmax learning (Luce, 1959) is one such example. Softmax learning is a randomized strategy that allocates observation t + 1 to arm a with probability

  w_{at} = \frac{\exp(\hat\mu_{at}/\tau)}{\sum_{j=1}^{k} \exp(\hat\mu_{jt}/\tau)},    (8)

where τ is a tuning parameter to be chosen experimentally. Softmax learning with fixed τ shares the same asymptotic inefficiency as ε-greedy strategies, which can be eliminated by gradually decreasing τ to zero.

Randomized probability matching combines aspects of all the preceding strategies. It is ε-greedy in the sense that it employs deterministic probability matching with probability max_a w_{at}, and a random (though nonuniform) exploration with probability ε = 1 − max_a w_{at}. It is ε-decreasing in the sense that in non-degenerate cases max_a w_{at} → 1. However, it should be noted that max_a w_{at} can sometimes decrease in the short run if the data warrant. The stratified exploration provided by softmax learning matches that used by randomized probability matching to the extent that w_{at} is determined by μ_{at} through a multinomial logistic regression with coefficient 1/τ. The benefit of randomized probability matching is that the tuning parameters and their decay schedules evolve in principled, data-determined ways rather than being arbitrarily set by the analyst.
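For concreteness, here is a minimal R sketch (not from the paper) of the softmax allocation rule in equation (8); the function name and the numerical inputs are illustrative only.

## Minimal sketch (not the paper's code): softmax allocation of equation (8).
## mu.hat is a vector of estimated mean rewards; tau is the tuning parameter.
softmax.allocate <- function(mu.hat, tau) {
  w <- exp((mu.hat - max(mu.hat)) / tau)       # subtract the max for numerical stability
  w <- w / sum(w)
  sample(length(mu.hat), size = 1, prob = w)   # index of the arm to play next
}
softmax.allocate(mu.hat = c(0.04, 0.05, 0.07), tau = 0.01)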
3.3 The Importance of Randomization

Randomization is an important component of a bandit allocation strategy because, like the Gittins index described in Section 3.1, greedy strategies can fail to consistently estimate which arm is optimal. Yang and Zhu (2002) proved that an ε-decreasing randomization can restore consistency to a greedy policy. We conjecture that randomized probability matching does consistently estimate the optimal arm. If a* denotes the index of the optimal arm, then a sufficient condition for the conjecture to be true is for w_{a*t} > ε for some ε > 0, where ε is independent of t.

4 The Binomial Bandit

This section describes a collection of simulation studies comparing randomized probability matching to various other learning algorithms in the context of the binomial bandit, where there are ample competitors. The binomial bandit assumes that the rewards in each configuration are independent Bernoulli random variables with success probabilities θ_1, ..., θ_k. For simplicity we assume the uniform prior distribution θ_a ~ U(0, 1), independently across a. Let Y_{at} and N_{at} denote the cumulative number of successes and trials observed for arm a up to time t. Then the posterior distribution of θ = (θ_1, ..., θ_k) is

  p(\theta \mid \mathbf{y}_t) = \prod_{a=1}^{k} \mathrm{Be}(\theta_a \mid Y_{at} + 1,\; N_{at} - Y_{at} + 1),    (9)

where Be(θ | α, β) denotes the density of the beta distribution for random variable θ with parameters α and β. The optimality probability

  w_{at} = \int_0^1 \mathrm{Be}(\theta_a \mid Y_{at}+1,\; N_{at}-Y_{at}+1) \prod_{j \ne a} \Pr(\theta_j < \theta_a \mid Y_{jt}+1,\; N_{jt}-Y_{jt}+1)\, d\theta_a    (10)

can easily be computed either by quadrature or by simulation (see Figures 3 and 4).

compute.probopt <- function(y, n) {
  k <- length(y)
  ans <- numeric(k)
  for (i in 1:k) {
    indx <- (1:k)[-i]
    f <- function(x) {
      r <- dbeta(x, y[i] + 1, n[i] - y[i] + 1)
      for (j in indx) r <- r * pbeta(x, y[j] + 1, n[j] - y[j] + 1)
      return(r)
    }
    ans[i] <- integrate(f, 0, 1)$value
  }
  return(ans)
}

Figure 3: R code for computing equation (10) by quadrature.

sim.post <- function(y, n, ndraws) {
  k <- length(y)
  ans <- matrix(nrow = ndraws, ncol = k)
  no <- n - y
  for (i in 1:k) ans[, i] <- rbeta(ndraws, y[i] + 1, no[i] + 1)
  return(ans)
}

prob.winner <- function(post) {
  k <- ncol(post)
  w <- table(factor(max.col(post), levels = 1:k))
  return(w / sum(w))
}

compute.win.prob <- function(y, n, ndraws) {
  return(prob.winner(sim.post(y, n, ndraws)))
}

Figure 4: R code for computing equation (10) by simulation.

Our simulation studies focus on regret, the cumulative expected lost reward relative to playing the optimal arm from the beginning of the experiment. Let μ*(θ) = max_a{μ_a(θ)}, the expected reward under the truly optimal arm, and let n_{at} denote the number of observations that were allocated to arm a at time t. Then the expected regret at time t is

  L_t = \sum_{a} n_{at}\left( \mu^*(\theta) - \mu_a(\theta) \right),    (11)

and the cumulative regret at time T is L = \sum_{t=1}^{T} L_t. The units of L_t are the units of reward, so with 0/1 rewards L_t is the expected number of lost successes relative to the unknown optimal strategy.

The simulation study consisted of 100 experiments, each with k = 10 "true" values of θ independently generated from the U(0, 1/10) distribution. The first such value was assumed to be the current configuration and was assigned 10^6 prior observations, so that it was effectively a known "champion," with the remaining 9 arms being new "challengers." We consider both batch and real time updates.

4.1 Batch Updating

In this study, each update contains a Poisson(1000) number of observations allocated to the different arms based on one of several allocation schemes. We ran each experiment until either the maximal w_{at} exceeded a threshold of 0.95 or else 100 time periods had elapsed.
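Before turning to the results, here is a minimal sketch (not the paper's simulation code) of one such batch-updated experiment under randomized probability matching. It reuses compute.win.prob from Figure 4 and the regret definition in equation (11); the champion's 10^6 prior observations are approximated by pseudo-counts rather than simulated draws, and the seed is arbitrary.

## Minimal sketch (not the paper's code): one batch-updated binomial bandit
## experiment under randomized probability matching.
set.seed(1)
k <- 10
theta <- runif(k, 0, 0.1)                       # "true" success probabilities
y <- c(round(theta[1] * 1e6), rep(0, k - 1))    # cumulative successes (arm 1 = champion)
n <- c(1e6, rep(0, k - 1))                      # cumulative trials
regret <- numeric(100)
for (t in 1:100) {
  w <- compute.win.prob(y, n, ndraws = 1000)                     # optimality probabilities
  batch <- rmultinom(1, rpois(1, 1000), prob = as.numeric(w))[, 1]  # allocate one batch ~ w
  y <- y + rbinom(k, size = batch, prob = theta)                 # observe the rewards
  n <- n + batch
  regret[t] <- sum(batch * (max(theta) - theta))                 # equation (11)
  if (max(w) > 0.95) break                                       # stopping rule from the text
}
sum(regret)                                                      # cumulative regret L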
4.1.1 RPM vs. Equal Allocation

Our first example highlights the substantial gains that can be had from sequential learning relative to a classical non-sequential experiment. Figure 5 compares the cumulative regret for the randomized probability matching and equal allocation strategies across 100 simulation experiments. The regret from the equal allocation strategy is more than an order of magnitude greater than under probability matching. Figure 6 shows why it is so large. Each boxplot in Figure 6 represents the distribution of L_t for experiments that were still active at time t. The regret distribution at time 1 is the same in both panels, but the expected regret under randomized probability matching quickly falls to zero as sub-optimal arms are identified and excluded from further study. The equal allocation strategy continues allocating observations to sub-optimal arms for the duration of the experiment. This produces more accurate estimates of θ for the sub-optimal arms, but at great expense.

The inferential consequences of the multi-armed bandit can be seen in Figure 7, which shows how the posterior distribution of θ_a evolves over time for a single experiment. Notice how the 95% credibility interval for the optimal arm (in the upper left plot of each panel) is smaller in panel (a) than in panel (b), despite the fact that the experiment in panel (b) ran longer. Likewise, notice that the 95% credible intervals for the sub-optimal arms are much wider under probability matching than under equal allocation.

Figure 5: Cumulative regret in each of 100 simulation experiments under (a) randomized probability matching, and (b) equal allocation. Note that the scales differ by an order of magnitude. For comparison purposes, both panels include a rug-plot showing the regret distribution under probability matching.

Figure 6: Expected regret per time period under (a) randomized probability matching, and (b) equal allocation. Boxplots show variation across 100 simulated experiments.

Figure 7: Evolution of the posterior distribution (means and upper and lower 95% credibility bounds) of θ under (a) randomized probability matching, (b) equal allocation. Each panel corresponds to an arm. Arms are sorted according to optimality probability when the experiment ends.
Figure 8: Regret from deterministic probability matching. The first ten periods were spent learning the parameters of each arm.

Figure 9: (a) Stacked stripchart showing cumulative regret after excluding the first 10 test periods. Panel (b) shows means and standard deviations of the expected losses plotted in panel (a). Greedy methods have a higher chance of zero regret. Randomized probability matching has a lower variance.

  Method                Mean   SD
  RPM                   50.1    62.2
  DPM                   54.7   122.3
  Largest Sample Mean   81.4   225.7

4.1.2 RPM vs. Greedy Algorithms

The experiment described above was also run under the purely greedy strategies of deterministic probability matching and playing the arm with the largest mean. Neither purely greedy strategy is suitable for the batch updating used in our experiment. Because a U(0, 1) prior was used for each θ_a, and because the "true" values were simulated from U(0, 1/10), an arm with no observations will have a higher mean and higher optimality probability than an arm with observed rewards. Thus a purely greedy strategy will spend the first k batches cycling through the arms, assigning all the observations to each one in turn (see Figure 8). This form of exploration has the same expected regret as equal allocation, but with a larger variance, because all bets are placed on a single arm.

Figure 9 compares the cumulative expected regret for the two greedy strategies to randomized probability matching after excluding the first 10 time steps, by which point the greedy algorithms have explored all 10 arms. The reward under randomized probability matching has a much lower standard deviation than under either of the greedy strategies. It also has the lowest sample mean, though the difference between its mean and that of deterministic probability matching is not statistically significant. Notice that the frequency of exact zero regret is lowest for randomized probability matching, but its positive losses are less than those suffered by the greedy methods.

4.2 Real Time Updating

A third simulation study pitted randomized probability matching against the Gittins index, in the setting where Gittins is optimal. This time the experiment was run for 10,000 time steps, each with a single play of the bandit. Again the simulation uses k = 10 and independently draws the true success probabilities from the U(0, 1/10) distribution. Figure 10 compares randomized and deterministic probability matching with Gittins index strategies where γ = .999 and γ = .8. Of the four methods, randomized probability matching did the worst job of accumulating total reward, but it had the smallest standard deviation, and it selected the optimal arm at the end of the experiment the largest number of times. The Gittins index with γ = .999 gathered the largest reward, but with a larger standard deviation and lower probability of selecting the optimal arm. Deterministic probability matching did slightly worse than the better Gittins policy on all three metrics. Finally, the Gittins index with γ = .8 shows a much thicker tail than the other methods, illustrating the fact that you can lower your overall total reward by too heavily discounting the future.

Figure 10: (a) Expected regret under real time sampling across 100 experiments, each simulated for 10,000 time steps. (b) Mean and standard deviation of the expected losses plotted in panel (a), along with the percentage of experiments for which the optimal arm was selected at time 10,000.

  Method          Mean   SD     % Correct
  Gittins (.999)  49.0   47.0   63
  Gittins (.8)    84.0   94.2   48
  DPM             51.9   58.4   58
  RPM             87.3   21.7   76
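The real time setting is easy to prototype. Below is a minimal sketch (not the paper's code) of single-play randomized probability matching for the binomial bandit, using the implicit allocation of Section 2.2: draw θ once from the posterior and play the arm with the largest drawn success probability. For brevity the champion arm's prior observations are omitted and the seed is arbitrary.

## Minimal sketch (not the paper's code): real time randomized probability
## matching via the implicit allocation of Section 2.2.
set.seed(1)
k <- 10
theta <- runif(k, 0, 0.1)                  # "true" success probabilities
y <- rep(0, k); n <- rep(0, k)             # cumulative successes and trials
for (t in 1:10000) {
  draw <- rbeta(k, y + 1, n - y + 1)       # one posterior draw per arm
  a <- which.max(draw)                     # implicit allocation: a ~ w_at
  y[a] <- y[a] + rbinom(1, 1, theta[a])    # observe one reward
  n[a] <- n[a] + 1
}
which.max(n)                               # the arm played most often
sum(n * (max(theta) - theta))              # cumulative regret, as in equation (11)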
5 Fractional Factorial Bandit

The binomial bandit described in Section 4 fails to take advantage of potential structure in the arms. For example, suppose the arms are websites differing in font family, font size, image location, and background color. If each characteristic has 5 levels then there are 5^4 = 625 possible configurations to test. One could analyze this problem using an unstructured binomial bandit with 625 arms, but the size of the parameter estimation problem can be dramatically reduced by assuming additive structure. In the preceding example, suppose each 5-level factor is represented by four indicator variables in a probit or logistic regression. If we include an intercept term and assume a strictly additive structure then there are only 1 + (5 − 1) × 4 = 17 parameters that need estimating. Interaction terms can be included if the additivity assumption is too restrictive. Choosing a set of interactions to allow into the bandit is analogous to choosing a particular fractional factorial design in a classical problem.

Let x_t denote the vector of indicator variables describing the characteristics of the arm played at time t. For the purposes of this section we will assume that the probability of a reward depends on x_t through a probit regression model, but any other model can be substituted as long as posterior draws of the model's parameters can be easily obtained. Let Φ(z) denote the standard normal cumulative distribution function. The probit regression model assumes Pr(y_t = 1) = Φ(θ^T x_t).

The probit regression model has no conjugate prior distribution, but a well known data augmentation algorithm (Albert and Chib, 1993) can be used to produce serially correlated draws from p(θ|y). The algorithm is described in Section 7.1 of the Appendix. Each iteration of Albert and Chib's algorithm requires a latent variable to be imputed for each y_t. This can cause the posterior sampling algorithm to slow as more observations are observed. However, if x has few enough possible configurations for all of them to be enumerated then Albert and Chib's algorithm can be optimized to run much faster as t → ∞. Section 7.2 of the Appendix explains the modified algorithm. Other modifications based on large sample theory are possible.

We conducted a simulation study to compare the effectiveness of the fractional factorial and binomial bandits. In the simulation, data were drawn from a probit regression model based on 4 discrete factors with 2, 3, 4, and 5 levels. This model has 120 possible configurations, but only 11 parameters, including an intercept term. The intercept term was drawn from a normal distribution with mean Φ^{-1}(.05) and variance 0.1. The other coefficients were simulated independently from the N(0, 0.1) distribution. These levels were chosen to produce arms with a mean success probability of around .05, and a standard deviation of about .5 on the probit scale. Figure 11 shows the 120 simulated success probabilities for one of the simulated experiments. We replicated this simulation 100 times, to produce 100 bandit processes on which to experiment.

Figure 11: True success probabilities on each arm of the fractional factorial bandit.
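The configuration count and parameter count in this setup are easy to verify. The R sketch below (not the paper's code) enumerates the 120 configurations, builds the additive dummy-coded design matrix, and draws one set of "true" success probabilities following the distributions described in the text; the variable names are mine, and N(0, 0.1) is interpreted as variance 0.1.

## Minimal sketch (not the paper's code): the simulated fractional factorial
## bandit's design matrix and true success probabilities.
arms <- expand.grid(f1 = factor(1:2), f2 = factor(1:3),
                    f3 = factor(1:4), f4 = factor(1:5))
X <- model.matrix(~ f1 + f2 + f3 + f4, data = arms)
dim(X)                                              # 120 configurations, 11 columns
theta <- c(qnorm(0.05) + rnorm(1, 0, sqrt(0.1)),    # intercept, mean qnorm(.05), variance 0.1
           rnorm(ncol(X) - 1, 0, sqrt(0.1)))        # remaining coefficients
success.prob <- pnorm(X %*% theta)                  # true success probability of each arm
summary(as.vector(success.prob))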
Let {x_a : a = 1, ..., 120} denote the possible configurations of x. In each update period, 100 observations were randomly allocated to the different possible configurations according to w_{at}. At each time step we produced 1000 draws from p(θ|y_t) using the algorithm in Section 7.2, assuming independent N(0, 1) priors for all coefficients. These draws were used to compute w_{at} as described in Section 2.1, with μ_a(θ) = Φ(θ^T x_a). Note that Φ is a monotonic function, so the arm with the largest μ_a(θ) is the same as the arm with the largest θ^T x_a. This allows us to speed up the computation by skipping the application of Φ.

For every experiment we ran with the fractional factorial bandit, we ran a parallel experiment under the binomial bandit with the same "true" success probabilities. Figure 12 compares the regret distributions across the 100 experiments for the fractional factorial and binomial bandits. The mean regret for the binomial bandit is 745 conversions out of 10,000 trials. The mean regret for the fractional factorial bandit is 166 conversions out of 10,000 trials, a factor of 4.5 improvement over the binomial bandit. Figure 13 compares the period-by-period regret distributions for the fractional factorial and binomial bandits. The regret falls to zero much faster under the fractional factorial bandit scheme. Figure 14 compares w_{at} for the fractional factorial and binomial bandits for one of the 100 experiments. In this particular experiment the top four success probabilities are within 2% of one another. The fractional factorial bandit is able to identify the optimal arm within a few hundred observations. The binomial bandit remains confused after 10,000 observations.

Figure 12: Cumulative regret after 10,000 trials for the fractional factorial (solid) and binomial (dashed) bandits.

6 Conclusion

This paper has shown how randomized probability matching can be used to manage the multi-armed bandit problem. The method is easy to apply, assuming one can generate posterior draws from p(θ|y_t) by Markov chain Monte Carlo or other methods. Randomized probability matching performs reasonably well compared to optimal methods, when they are available, and it is simple enough to generalize to situations that optimal methods cannot handle. It is especially well suited to situations where learning occurs in batch updates, and it is robust in the sense that it had the lowest standard deviation of any method in all of the simulation studies we tried. Finally, it combines features from several popular heuristics without the burden of specifying artificial tuning parameters.

We have illustrated the advantages of combining sequential and classical experiments by focusing on fractional factorial designs. Other important design ideas are similarly easy to incorporate. For example, randomized blocks for controlling non-experimental variation (i.e. variables which cannot be set by the experimenter) can be included in x, either as first order factors or as interactions. Thus it is straightforward to control for temporal effects (e.g. day of week), or demographic characteristics of the customer using the product.
These techniques can be brought to several generalizations of multi-armed bandits by simply modifying the model used for the reward distribution. For instance, our examples focus on 0/1 rewards, but continuous rewards can be handled by substituting regression models for logit or probit regressions. Restless bandits (Whittle, 1988) that assume θ varies slowly over time can be handled by replacing f_a(y|θ) with a dynamic linear model (West and Harrison, 1997). Arm acquiring bandits (Whittle, 1981) are handled gracefully by randomized probability matching by simply extending the design matrix used in the probit regression. Finally, one can imagine a network of experiments sharing information through a hierarchical model, practically begging for the name "multi-armed mafia."

Figure 13: Regret distributions for (a) the fractional factorial bandit and (b) the binomial bandit, under randomized probability matching, when the underlying process has probit structure.

Figure 14: Evolution of w_{at} for one of the 100 fractional factorial bandit experiments in Section 5. (a) fractional factorial bandit, (b) binomial bandit.

7 Appendix: Details of the Probit MCMC Algorithms

7.1 Posterior Sampling for Probit Regression: Standard Case

This section describes the algorithm introduced by Albert and Chib (1993) to simulate draws from p(θ|y) in a probit regression model. Assume the prior distribution θ ~ N(b, Σ), meaning the normal distribution with mean vector b and variance matrix Σ. Let N_a^+(μ, σ²) denote the normal distribution with mean μ and variance σ², truncated to have support only on the half line z > a. The complementary distribution N_a^-(μ, σ²) is truncated to have support on z < a. For the purpose of this section, let y_t be coded as 1/−1 instead of 1/0. The corresponding predictor variables are x_t, and X is a matrix with x_t in row t.

1. For each t, simulate z_t ~ N_0^+(θ^T x_t, 1) if y_t = 1, or z_t ~ N_0^-(θ^T x_t, 1) if y_t = −1.

2. Let z = (z_1, ..., z_n). Sample θ ~ N(δ, Ω), where Ω^{-1} = Σ^{-1} + X^T X and δ = Ω(X^T z + Σ^{-1} b).

Repeatedly cycling through steps 1 and 2 produces a sequence of draws (θ, z)^(1), (θ, z)^(2), ... from a Markov chain with p(θ, z|y) as its stationary distribution. Simply ignoring z yields the desired marginal distribution p(θ|y).
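A minimal R sketch of steps 1-2 follows (not the paper's code). One simple way to draw the truncated normals in step 1 is by inverting the normal CDF, which is numerically adequate for moderate linear predictors; the function and variable names are mine.

## Minimal sketch (not the paper's code) of the Albert-Chib data augmentation
## sampler.  X is the n x p design matrix, y is coded +1/-1, and b and Sigma
## are the prior mean and variance.
albert.chib <- function(X, y, b, Sigma, niter = 1000) {
  p <- ncol(X); n <- nrow(X)
  Omega <- solve(solve(Sigma) + crossprod(X))     # posterior variance of theta given z
  Sigma.inv.b <- solve(Sigma, b)
  theta <- rep(0, p)
  draws <- matrix(NA, niter, p)
  for (g in 1:niter) {
    m <- drop(X %*% theta)
    cut <- pnorm(0, mean = m, sd = 1)             # P(z <= 0) for each observation
    u <- ifelse(y == 1, runif(n, cut, 1), runif(n, 0, cut))
    z <- qnorm(u, mean = m, sd = 1)               # step 1: truncated normal draws
    delta <- Omega %*% (crossprod(X, z) + Sigma.inv.b)
    theta <- drop(delta + t(chol(Omega)) %*% rnorm(p))  # step 2: theta ~ N(delta, Omega)
    draws[g, ] <- theta
  }
  draws
}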
7.2 Probit Posterior Sampling for Denumerable Experimental Designs

When x_t contains only indicator variables it is possible to list out all possible configurations of x. Suppose there are k of them, stacked to form the k × p matrix X̃. Now let ỹ and ñ be k-vectors with elements y_a and n_a denoting the number of conversions and trials from configuration a. Then X^T X = X̃^T diag(ñ) X̃, and we can approximate X^T z = ∑_t x_t z_t = ∑_a x_a ∑_{t: a_t = a} z_t = ∑_a x_a z_a using the central limit theorem. Let z_a = z_a^+ + z_a^-, where z_a^+ is the sum of y_a draws from N_0^+(θ^T x_a, 1), and z_a^- is the sum of n_a − y_a draws from N_0^-(θ^T x_a, 1).

Let λ(α) = φ(α)/(1 − Φ(α)). If y_a is large then z_a^+ is approximately normal with mean y_a {θ^T x_a + λ(−θ^T x_a)} and variance y_a {1 − λ(−θ^T x_a)[λ(−θ^T x_a) + θ^T x_a]}. Likewise, let δ(α) = φ(α)/Φ(α). If n_a − y_a is large then z_a^- is approximately normal with mean (n_a − y_a){θ^T x_a − δ(−θ^T x_a)} and variance (n_a − y_a){1 + θ^T x_a δ(−θ^T x_a) − [δ(−θ^T x_a)]²}.

For configurations with large (> 50) values of y_a or n_a − y_a we compute z_a^+ or z_a^- using its asymptotic distribution. If y_a < 50 we compute z_a^+ by directly summing y_a draws from N_0^+(θ^T x_a, 1). Likewise, if n_a − y_a < 50 we compute z_a^- by directly summing n_a − y_a draws from N_0^-(θ^T x_a, 1).

References

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669-679.

Bellman, R. E. (1956). A problem in the sequential design of experiments. Sankhya, Series A 30, 221-252.

Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall.

Brezzi, M. and Lai, T. L. (2000). Incomplete learning from endogenous data in dynamic allocation. Econometrica 68, 1511-1516.

Brezzi, M. and Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27, 87-108.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science 10, 273-304.

Cox, D. R. and Reid, N. (2000). The Theory of the Design of Experiments. Chapman and Hall/CRC.

Gittins, J. and Wang, Y.-G. (1992). The learning component of dynamic allocation indices. The Annals of Statistics 20, 1625-1636.

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B: Methodological 41, 148-177.

Google (2010). www.google.com/websiteoptimizer.

Luce, D. (1959). Individual Choice Behavior. Wiley.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, Inc.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285-294.

Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics 57, 450-456.

Tierney, L. (1994). Markov chains for exploring posterior distributions (disc: p. 1728-1762). The Annals of Statistics 22, 1701-1728.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer.

Whittle, P. (1979). Discussion of "Bandit processes and dynamic allocation indices". Journal of the Royal Statistical Society, Series B: Methodological 41, 165.

Whittle, P. (1981). Arm-acquiring bandits. The Annals of Probability 9, 284-292.

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability 25A, 287-298.

Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics 30, 100-121.