LGApr 24, 2024
Long-term Off-Policy Evaluation and LearningYuta Saito, Himan Abdollahpouri, Jesse Anderton et al.
Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work thus studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either need a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, which is inefficient. Therefore, we propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.
EMApr 24, 2020
A Comparison of Methods for Treatment Assignment with an Application to Playlist GenerationCarlos Fernández-Loría, Foster Provost, Jesse Anderton et al.
This study presents a systematic comparison of methods for individual treatment assignment, a general problem that arises in many applications and has received significant attention from economists, computer scientists, and social scientists. We group the various methods proposed in the literature into three general classes of algorithms (or metalearners): learning models to predict outcomes (the O-learner), learning models to predict causal effects (the E-learner), and learning models to predict optimal treatment assignments (the A-learner). We compare the metalearners in terms of (1) their level of generality and (2) the objective function they use to learn models from data; we then discuss the implications that these characteristics have for modeling and decision making. Notably, we demonstrate analytically and empirically that optimizing for the prediction of outcomes or causal effects is not the same as optimizing for treatment assignments, suggesting that in general the A-learner should lead to better treatment assignments than the other metalearners. We demonstrate the practical implications of our findings in the context of choosing, for each user, the best algorithm for playlist generation in order to optimize engagement. This is the first comparison of the three different metalearners on a real-world application at scale (based on more than half a billion individual treatment assignments). In addition to supporting our analytical findings, the results show how large A/B tests can provide substantial value for learning treatment assignment policies, rather than simply choosing the variant that performs best on average.
LGMay 19, 2018
Adaptively Pruning Features for Boosted Decision TreesMaryam Aziz, Jesse Anderton, Javed Aslam
Boosted decision trees enjoy popularity in a variety of applications; however, for large-scale datasets, the cost of training a decision tree in each round can be prohibitively expensive. Inspired by ideas from the multi-arm bandit literature, we develop a highly efficient algorithm for computing exact greedy-optimal decision trees, outperforming the state-of-the-art Quick Boost method. We further develop a framework for deriving lower bounds on the problem that applies to a wide family of conceivable algorithms for the task (including our algorithm and Quick Boost), and we demonstrate empirically on a wide variety of data sets that our algorithm is near-optimal within this family of algorithms. We also derive a lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this best-achievable lower bound.
MLMar 13, 2018
Pure Exploration in Infinitely-Armed Bandit Models with Fixed-ConfidenceMaryam Aziz, Jesse Anderton, Emilie Kaufmann et al.
We consider the problem of near-optimal arm identification in the fixed confidence setting of the infinitely armed bandit problem when nothing is known about the arm reservoir distribution. We (1) introduce a PAC-like framework within which to derive and cast results; (2) derive a sample complexity lower bound for near-optimal arm identification; (3) propose an algorithm that identifies a nearly-optimal arm with high probability and derive an upper bound on its sample complexity which is within a log factor of our lower bound; and (4) discuss whether our log^2(1/delta) dependence is inescapable for "two-phase" (select arms first, identify the best later) algorithms in the infinite setting. This work permits the application of bandit models to a broader class of problems where fewer assumptions hold.
AIFeb 16, 2018
Measuring Human-perceived Similarity in Heterogeneous CollectionsJesse Anderton, Pavel Metrikov, Virgil Pavlu et al.
We present a technique for estimating the similarity between objects such as movies or foods whose proper representation depends on human perception. Our technique combines a modest number of human similarity assessments to infer a pairwise similarity function between the objects. This similarity function captures some human notion of similarity which may be difficult or impossible to automatically extract, such as which movie from a collection would be a better substitute when the desired one is unavailable. In contrast to prior techniques, our method does not assume that all similarity questions on the collection can be answered or that all users perceive similarity in the same way. When combined with a user model, we find how each assessor's tastes vary, affecting their perception of similarity.