Early Stopping Based on Repeated Significance
This work addresses how practitioners can maintain statistical confidence in A/B testing, though it appears incremental because it builds on existing correction methods such as the Bonferroni correction.
The paper tackles the challenge of early stopping in bucket tests with multiple criteria by proposing a method that requires criteria to be successful at multiple decision points, avoiding overly strict p-value requirements.
For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $α$ for the success criterion produces statistical confidence at level $1 - α$. For multiple criteria, a Bonferroni correction that partitions $α$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping by further partitioning $α$ among the decision points, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points.
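To make the cost of the Bonferroni baseline concrete, the following minimal sketch computes the per-test $p$-value threshold when $α$ is split among criteria and, for early stopping, among decision points as well. It illustrates only the baseline correction described above, not the repeated-significance method the paper proposes. The equal split of $α$ and the function names `bonferroni_threshold` and `all_criteria_pass` are assumptions made for illustration; the source says only that $α$ is partitioned.

```python
from typing import Sequence


def bonferroni_threshold(alpha: float, n_criteria: int, n_decision_points: int = 1) -> float:
    """Per-test p-value threshold under an equal Bonferroni split.

    Assumption: alpha is divided equally among every (criterion, decision point)
    pair. Unequal partitions are also valid as long as the shares sum to alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if n_criteria < 1 or n_decision_points < 1:
        raise ValueError("counts must be positive integers")
    return alpha / (n_criteria * n_decision_points)


def all_criteria_pass(p_values: Sequence[float], alpha: float, n_decision_points: int = 1) -> bool:
    """Return True if every criterion's p-value is below its Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values), n_decision_points)
    return all(p < threshold for p in p_values)


if __name__ == "__main__":
    p_values = [0.004, 0.02]  # hypothetical p-values for two success criteria
    # Fixed-length test: alpha is split among the two criteria only.
    print(all_criteria_pass(p_values, alpha=0.05))
    # Five possible stopping points: the same alpha is split ten ways.
    print(all_criteria_pass(p_values, alpha=0.05, n_decision_points=5))
```

With these hypothetical inputs, both criteria pass in the fixed-length test, where the threshold is 0.05 / 2, but the second criterion fails once five decision points are allowed, where the threshold drops to 0.05 / 10. This is the strictness that motivates requiring success at multiple decision points instead.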