Efficient Multi-Cohort Inference for Long-Term Effects and Lifetime Value in A/B Testing with User Learning
This work addresses the challenge of making reliable product decisions on streaming platforms by providing a more accurate method for evaluating long-term user engagement and value. The contribution is incremental: it builds on existing A/B testing frameworks and adds specific improvements.
The paper tackles the problem of accurately estimating long-term treatment effects and residual lifetime value in A/B testing for streaming platforms, where short-term metrics can mislead decisions due to user churn. It introduces a multi-cohort inference method that improves precision in these estimates and identifies scenarios where relying on short-term or long-term metrics alone leads to incorrect product decisions.
In streaming platforms, churn is extremely costly, yet A/B tests are typically evaluated using outcomes observed within a limited experimental horizon. Even when both short-term and predicted long-term engagement metrics are considered, they may fail to capture how a treatment affects user retention. Consequently, an intervention may appear beneficial in the short term and neutral in the long term while still generating lower total value than the control because of user churn. To address this limitation, we introduce a method that estimates long-term treatment effects (LTE) and residual lifetime value change ($ΔERLV$) in short multi-cohort A/B tests under user learning. To estimate time-varying treatment effects efficiently, we propose an inverse-variance weighted estimator that combines estimates from multiple cohorts, reducing variance relative to standard approaches in the literature. The estimated treatment trajectory is then modeled as a parametric decay to recover both the asymptotic treatment effect and the cumulative value generated over time. Our framework enables simultaneous evaluation of steady-state impact and residual user value within a single experiment. Empirical results show improved precision in estimating LTE and $ΔERLV$ and identify scenarios in which relying on either short-term or long-term metrics alone would lead to incorrect product decisions.
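As a rough illustration of the inverse-variance weighting idea mentioned above, the sketch below combines per-cohort estimates of the same treatment effect at one point in time. It assumes the cohort estimates are independent and unbiased for a common effect; the function name `inverse_variance_combine` and all numbers are hypothetical, and the paper's actual estimator may differ in how it aligns cohorts and estimates variances.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent estimates of one quantity by inverse-variance weighting.

    Assumption (not from the paper): each cohort gives an unbiased,
    independent estimate of the same effect with a known variance.
    Returns (combined_estimate, combined_variance).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("estimates and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Each cohort is weighted by its precision (1 / variance).
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)

    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    # Under independence, the combined variance is the inverse of total precision.
    combined_variance = 1.0 / total_weight
    return combined, combined_variance


# Hypothetical per-cohort treatment-effect estimates and their variances.
cohort_effects = [0.10, 0.14, 0.08]
cohort_variances = [0.04, 0.01, 0.02]

effect, variance = inverse_variance_combine(cohort_effects, cohort_variances)
print(f"combined effect = {effect:.4f}, variance = {variance:.4f}")
```

Under these assumptions the combined variance is never larger than the smallest single-cohort variance, which is the sense in which pooling cohorts can improve precision.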