ME Sep 14, 2023
Choosing a Proxy Metric from Past Experiments
Nilesh Tripuraneni, Lee Richardson, Alexander D'Amour et al.
In many randomized experiments, the treatment effect of the long-term metric (i.e. the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to faithfully estimate in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope that they closely track the long-term metric, so that they can be used to effectively guide decision-making in the near-term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not a priori fixed; rather, it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines.
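The dependence of the optimal proxy on sample size can be illustrated with a small sketch. It is a simplified stand-in, not the paper's actual formulation: it assumes the proxy is a linear combination of short-term metrics, that the latent covariance of proxy effects (`latent_cov`), the per-unit noise covariance (`noise_cov`), and the covariance with the long-term effect (`cross_cov`) are known, and that the goal is to maximize correlation with the long-term effect. All names and numbers are hypothetical.

```python
import numpy as np

def proxy_weights(latent_cov, noise_cov, cross_cov, n):
    """Weights w maximizing corr(w @ observed_proxy_effects, long_term_effect).

    Assumption: observed proxy effects = latent effects + noise with
    covariance noise_cov / n, so the maximizer is proportional to
    (latent_cov + noise_cov / n)^{-1} @ cross_cov.
    """
    total_cov = latent_cov + noise_cov / n
    w = np.linalg.solve(total_cov, cross_cov)
    return w / np.abs(w).sum()  # rescale so absolute weights sum to 1

# Hypothetical setup: proxy 0 tracks the long-term metric better but is noisier.
latent_cov = np.array([[1.0, 0.2], [0.2, 1.0]])
noise_cov = np.diag([100.0, 1.0])
cross_cov = np.array([0.9, 0.5])

w_small = proxy_weights(latent_cov, noise_cov, cross_cov, n=10)
w_large = proxy_weights(latent_cov, noise_cov, cross_cov, n=100_000)
print("small experiment:", w_small)
print("large experiment:", w_large)
```

Under these assumed inputs, the small experiment puts most weight on the quieter proxy, while the large experiment shifts weight toward the proxy that tracks the long-term metric more closely.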
ME Jul 3, 2023
Pareto optimal proxy metrics
Alessandro Zito, Dylan Greaves, Jacopo Soriano et al.
North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system, and found proxy metrics that are eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.
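The trade-off between prediction accuracy and sensitivity can be made concrete with a generic Pareto-front filter. This is only a sketch of the notion of Pareto optimality, not the multi-objective algorithm from the paper; the candidate scores are made-up `(accuracy, sensitivity)` pairs, and both objectives are assumed to be maximized.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point is dominated if another point is at least as good on both
    objectives and strictly better on at least one.
    """
    front = []
    for i, (acc, sens) in enumerate(points):
        dominated = any(
            (a2 >= acc and s2 >= sens) and (a2 > acc or s2 > sens)
            for j, (a2, s2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((acc, sens))
    return front

# Hypothetical candidate proxies scored as (prediction accuracy, sensitivity).
candidates = [(0.9, 0.2), (0.7, 0.6), (0.6, 0.5), (0.3, 0.9), (0.2, 0.1)]
front = pareto_front(candidates)
print(front)
```

A candidate on the front cannot be improved on one objective without giving up some of the other, which is the sense in which a proxy metric is Pareto optimal here.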
ME Apr 17, 2017
Mixture modeling on related samples by $ψ$-stick breaking and kernel perturbation
Jacopo Soriano, Li Ma
There has been great interest recently in applying nonparametric kernel mixtures in a hierarchical manner to model multiple related data samples jointly. In such settings several data features are commonly present: (i) the related samples often share some, if not all, of the mixture components but with differing weights, (ii) only some, not all, of the mixture components vary across the samples, and (iii) often the shared mixture components across samples are not aligned perfectly in terms of their location and spread, but rather display small misalignments either due to systematic cross-sample difference or more often due to uncontrolled, extraneous causes. Properly incorporating these features in mixture modeling will enhance the efficiency of inference, whereas ignoring them not only reduces efficiency but can jeopardize the validity of the inference due to issues such as confounding. We introduce two techniques for incorporating these features in modeling related data samples using kernel mixtures. The first technique, called $ψ$-stick breaking, is a joint generative process for the mixing weights through the breaking of both a stick shared by all the samples for the components that do not vary in size across samples and an idiosyncratic stick for each sample for those components that do vary in size. The second technique is to introduce random perturbations into the kernels, thereby accounting for cross-sample misalignment. These techniques can be used either separately or together in both parametric and nonparametric kernel mixtures. We derive efficient Bayesian inference recipes based on MCMC sampling for models featuring these techniques, and illustrate their use through both simulated data and a real flow cytometry data set in prediction/estimation, cross-sample calibration, and testing multi-sample differences.
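The idea of one shared stick plus one idiosyncratic stick per sample can be sketched roughly as follows. This is not the $ψ$-stick breaking construction itself, whose details are not given in the abstract; it uses ordinary truncated stick breaking with Beta(1, alpha) breaks, and the mixing proportion `rho`, the truncation level `k`, and all other names are assumptions made for illustration.

```python
import numpy as np

def stick_breaking(alpha, k, rng):
    """Truncated stick-breaking weights with Beta(1, alpha) break fractions."""
    v = rng.beta(1.0, alpha, size=k)
    v[-1] = 1.0  # last piece takes the remaining stick so weights sum to 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
k, alpha, rho, n_samples = 5, 1.0, 0.6, 3  # hypothetical settings

# One stick shared by all samples: these component weights do not vary.
shared = rho * stick_breaking(alpha, k, rng)

# One idiosyncratic stick per sample: these component weights do vary.
weights = [
    np.concatenate([shared, (1.0 - rho) * stick_breaking(alpha, k, rng)])
    for _ in range(n_samples)
]
for w in weights:
    print(np.round(w, 3), w.sum())
```

In this sketch the first `k` weights are identical across samples, while the last `k` differ from sample to sample.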
ME Feb 12, 2016
Efficient functional ANOVA through wavelet-domain Markov groves
Li Ma, Jacopo Soriano
We introduce a wavelet-domain functional analysis of variance (fANOVA) method based on a Bayesian hierarchical model. The factor effects are modeled through a spike-and-slab mixture at each location-scale combination along with a normal-inverse-Gamma (NIG) conjugate setup for the coefficients and errors. A graphical model called the Markov grove (MG) is designed to jointly model the spike-and-slab statuses at all location-scale combinations, which incorporates the clustering of each factor effect in the wavelet-domain thereby allowing borrowing of strength across location and scale. The posterior of this NIG-MG model is analytically available through a pyramid algorithm of the same computational complexity as Mallat's pyramid algorithm for discrete wavelet transform, i.e., linear in both the number of observations and the number of locations. Posterior probabilities of factor contributions can also be computed through pyramid recursion, and exact samples from the posterior can be drawn without MCMC. We investigate the performance of our method through extensive simulation and show that it outperforms existing wavelet-domain fANOVA methods in a variety of common settings. We apply the method to analyzing the orthosis data.
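For reference, the pyramid structure that the abstract compares against can be sketched with a Haar discrete wavelet transform. This shows only the wavelet pyramid recursion, not the NIG-MG posterior computation; the choice of the Haar wavelet and the requirement that the input length be a power of two are simplifying assumptions.

```python
import numpy as np

def haar_dwt(x):
    """Orthonormal Haar transform by pyramid recursion.

    Each level halves the signal, so the total work is linear in len(x).
    Assumes len(x) is a power of two.
    """
    approx = np.asarray(x, dtype=float)
    details = []  # detail coefficients, finest scale first
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

signal = [4.0, 2.0, 6.0, 8.0]
coarse, details = haar_dwt(signal)
print(coarse, details)

# The transform is orthonormal, so it preserves the sum of squares.
energy_in = sum(v * v for v in signal)
energy_out = float(coarse @ coarse + sum(d @ d for d in details))
```

Each location-scale combination mentioned in the abstract corresponds to one detail coefficient in `details`.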