LG · IR · ML · Apr 21, 2023

A Common Misassumption in Online Experiments with Machine Learning Models

arXiv:2304.10900v1 · 8 citations · h-index: 14
Originality: Incremental advance
AI Analysis

The paper highlights a critical flaw in how practitioners and researchers use online experiments to compare ML models, though its treatment of the underlying statistical assumptions is an incremental advance.

The paper identifies that A/B-tests of machine learning models often rely on assumptions that are not met in practice, most notably the absence of interference between models. Because the variants typically learn from pooled data, this assumption cannot be guaranteed, which biases causal effect estimates and undermines decisions based on online experiments.

Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests are the bread and butter of modern platforms on the web. They are conducted continuously to allow platforms to estimate the causal effect of replacing system variant "A" with variant "B" on some metric of interest. These variants can differ in many aspects. In this paper, we focus on the common use-case where they correspond to machine learning models. The online experiment then serves as the final arbiter that decides which model is superior and should thus be shipped. The statistical literature on causal effect estimation from RCTs has a substantial history, which contributes deservedly to the level of trust researchers and practitioners have in this "gold standard" of evaluation practices. Nevertheless, in the particular case of machine learning experiments, we remark that certain critical issues remain. Specifically, the assumptions required to guarantee that A/B-tests yield unbiased estimates of the causal effect are seldom met in practical applications. We argue that, because variants typically learn using pooled data, the absence of interference between models cannot be guaranteed. This undermines the conclusions we can draw from online experiments with machine learning models. We discuss the implications this has for practitioners, and for the research literature.
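The interference argument in the abstract can be illustrated with a toy simulation. The sketch below is not taken from the paper: the arm click probabilities, the two policies (a greedy variant "A" and an epsilon-greedy variant "B"), and all names such as `Stats` and `run` are hypothetical choices made for illustration. It compares the effect measured in a 50/50 A/B-test where both variants learn from one pooled log with the difference in reward when each variant is deployed alone on its own data.

```python
import random

# Hypothetical click probabilities for five items (an assumption for this toy example).
ARM_MEANS = [0.1, 0.2, 0.3, 0.4, 0.8]


class Stats:
    """Per-arm click statistics. Sharing one instance between variants pools their data."""

    def __init__(self, n_arms):
        self.pulls = [0] * n_arms
        self.clicks = [0] * n_arms

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.clicks[arm] += reward

    def best_arm(self):
        # Empirical mean per arm; unseen arms count as 0, ties go to the lowest index.
        means = [c / p if p else 0.0 for c, p in zip(self.clicks, self.pulls)]
        return max(range(len(means)), key=means.__getitem__)


def choose(stats, epsilon, rng):
    """Epsilon-greedy choice; epsilon=0 gives a purely greedy policy."""
    if rng.random() < epsilon:
        return rng.randrange(len(ARM_MEANS))
    return stats.best_arm()


def run(n_users, traffic_to_b, pooled, seed=0):
    """Simulate n_users and return the mean reward of each variant.

    Variant A is greedy (epsilon=0); variant B explores (epsilon=0.2).
    If pooled is True, both variants read and write the same statistics.
    """
    rng = random.Random(seed)
    stats_a = Stats(len(ARM_MEANS))
    stats_b = stats_a if pooled else Stats(len(ARM_MEANS))
    total = {"A": 0, "B": 0}
    count = {"A": 0, "B": 0}
    for _ in range(n_users):
        variant = "B" if rng.random() < traffic_to_b else "A"
        stats, eps = (stats_b, 0.2) if variant == "B" else (stats_a, 0.0)
        arm = choose(stats, eps, rng)
        reward = 1 if rng.random() < ARM_MEANS[arm] else 0
        stats.update(arm, reward)
        total[variant] += reward
        count[variant] += 1
    return {v: (total[v] / count[v] if count[v] else None) for v in total}


# Effect measured by a 50/50 A/B-test in which both variants learn from pooled data.
ab = run(20000, traffic_to_b=0.5, pooled=True)
ab_estimate = ab["B"] - ab["A"]

# Effect of actually replacing A with B: each variant deployed alone on its own data.
only_a = run(20000, traffic_to_b=0.0, pooled=False)["A"]
only_b = run(20000, traffic_to_b=1.0, pooled=False)["B"]
true_effect = only_b - only_a

print(f"A/B-test estimate of B - A: {ab_estimate:+.3f}")
print(f"Full-deployment B - A:      {true_effect:+.3f}")
```

In this toy setting, the greedy variant free-rides on the exploration data that B contributes to the pooled log, so the A/B-test favours A. Deployed alone, A never leaves the first item it tries, and B comes out clearly ahead. The sign flip is a property of this constructed example, not a result reported in the paper.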

Code Implementations: 1 repo
