LG MLDec 23, 2023

Short-lived High-volume Multi-A(rmed)/B(andits) Testing

Su Jia, Andrew Li, R. Ravi, Nishant Oli, Paul Duff, Ian Anderson

arXiv:2312.15356v12.0

Originality Incremental advance

AI Analysis

This addresses the challenge of efficient decision-making in platforms with rapidly changing content, offering incremental improvements over existing methods.

The paper tackles the problem of multi-armed bandit testing with high-volume, short-lived arms, such as in content recommendation platforms, by proposing a Bayesian policy that achieves a loss bound of ω(n^{-ρ}) and demonstrates a 4.32% improvement in duration and 7.48% in click-throughs in a field experiment.

Modern platforms leverage randomized experiments to make informed decisions from a given set of items (``treatments''). As a particularly challenging scenario, these items may (i) arrive in high volume, with thousands of new items being released per hour, and (ii) have short lifetime, say, due to the item's transient nature or underlying non-stationarity that impels the platform to perceive the same item as distinct copies over time. Thus motivated, we study a Bayesian multiple-play bandit problem that encapsulates the key features of the multivariate testing (or ``multi-A/B testing'') problem with a high volume of short-lived arms. In each round, a set of $k$ arms arrive, each available for $w$ rounds. Without knowing the mean reward for each arm, the learner selects a multiset of $n$ arms and immediately observes their realized rewards. We aim to minimize the loss due to not knowing the mean rewards, averaged over instances generated from a given prior distribution. We show that when $k = O(n^ρ)$ for some constant $ρ>0$, our proposed policy has $\tilde O(n^{-\min \{ρ, \frac 12 (1+\frac 1w)^{-1}\}})$ loss on a sufficiently large class of prior distributions. We complement this result by showing that every policy suffers $Ω(n^{-\min \{ρ, \frac 12\}})$ loss on the same class of distributions. We further validate the effectiveness of our policy through a large-scale field experiment on {\em Glance}, a content-card-serving platform that faces exactly the above challenge. A simple variant of our policy outperforms the platform's current recommender by 4.32\% in total duration and 7.48\% in total number of click-throughs.

View on arXiv PDF

Similar