IRMay 30, 2022
Uncertainty Quantification for Fairness in Two-Stage Recommender SystemsLequn Wang, Thorsten Joachims
Many large-scale recommender systems consist of two stages. The first stage efficiently screens the complete pool of items for a small subset of promising candidates, from which the second-stage model curates the final recommendations. In this paper, we investigate how to ensure group fairness to the items in this two-stage architecture. In particular, we find that existing first-stage recommenders might select an irrecoverably unfair set of candidates such that there is no hope for the second-stage recommender to deliver fair recommendations. To this end, motivated by recent advances in uncertainty quantification, we propose two threshold-policy selection rules that can provide distribution-free and finite-sample guarantees on fairness in first-stage recommenders. More concretely, given any relevance model of queries and items and a point-wise lower confidence bound on the expected number of relevant items for each threshold-policy, the two rules find near-optimal sets of candidates that contain enough relevant items in expectation from each group of items. To instantiate the rules, we demonstrate how to derive such confidence bounds from potentially partial and biased user feedback data, which are abundant in many large-scale recommender systems. In addition, we provide both finite-sample and asymptotic analyses of how close the two threshold selection rules are to the optimal thresholds. Beyond this theoretical analysis, we show empirically that these two rules can consistently select enough relevant items from each group while minimizing the size of the candidate sets for a wide range of settings.
LGJun 13, 2023
Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual BanditsLequn Wang, Akshay Krishnamurthy, Aleksandrs Slivkins
We consider offline policy optimization (OPO) in contextual bandits, where one is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations thereof are either specialized or computationally inefficient. We present the first general oracle-efficient algorithm for pessimistic OPO: it reduces to supervised learning, leading to broad applicability. We obtain statistical guarantees analogous to those for prior pessimistic approaches. We instantiate our approach for both discrete and continuous actions and perform experiments in both settings, showing advantage over unregularized OPO across a wide range of configurations.
LGOct 24, 2023
Off-Policy Evaluation for Large Action Spaces via Policy ConvolutionNoveen Sachdeva, Lequn Wang, Dawen Liang et al.
Developing accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiased value estimation but often comes with the trade-off of high variance, even in the simpler case of one-step contextual bandits. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators. These methods leverage latent structure within actions -- made available through action embeddings -- to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off, which can be controlled by adjusting the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC, especially when either the action space or policy mismatch becomes large, with gains of up to 5 - 6 orders of magnitude over existing estimators.
CLFeb 28, 2019Code
FastFusionNet: New State-of-the-Art for DAWNBench SQuADFelix Wu, Boyi Li, Lequn Wang et al.
In this technical report, we introduce FastFusionNet, an efficient variant of FusionNet [12]. FusionNet is a high performing reading comprehension architecture, which was designed primarily for maximum retrieval accuracy with less regard towards computational requirements. For FastFusionNets we remove the expensive CoVe layers [21] and substitute the BiLSTMs with far more efficient SRU layers [19]. The resulting architecture obtains state-of-the-art results on DAWNBench [5] while achieving the lowest training and inference time on SQuAD [25] to-date. The code is available at https://github.com/felixgwu/FastFusionNet.
CVMay 22, 2018Code
Resource Aware Person Re-identification across Multiple ResolutionsYan Wang, Lequn Wang, Yurong You et al.
Not all people are equally easy to identify: color statistics might be enough for some cases while others might require careful reasoning about high- and low-level details. However, prevailing person re-identification(re-ID) methods use one-size-fits-all high-level embeddings from deep convolutional networks for all cases. This might limit their accuracy on difficult examples or makes them needlessly expensive for the easy ones. To remedy this, we present a new person re-ID model that combines effective embeddings built on multiple convolutional network layers, trained with deep-supervision. On traditional re-ID benchmarks, our method improves substantially over the previous state-of-the-art results on all five datasets that we evaluate on. We then propose two new formulations of the person re-ID problem under resource-constraints, and show how our model can be used to effectively trade off accuracy and computation in the presence of resource constraints. Code and pre-trained models are available at https://github.com/mileyan/DARENet.
LGSep 30, 2025
Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exitingXi Wang, James McInerney, Lequn Wang et al.
Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts and find that the model often begins to always produce the correct answer early in the reasoning, making extra reasoning a waste of tokens. To detect and prevent overthinking, we propose a simple and inexpensive novel signal -- Entropy After </Think> (EAT) -- for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 13 - 21% without harming accuracy, and it remains effective in black box settings where logits from the reasoning model are not accessible, and EAT is computed with proxy models.
LGFeb 2, 2022
Improving Screening Processes via Calibrated Subset SelectionLequn Wang, Thorsten Joachims, Manuel Gomez Rodriguez
Many selection processes such as finding patients qualifying for a medical trial or retrieval pipelines in search engines consist of multiple stages, where an initial screening stage focuses the resources on shortlisting the most promising candidates. In this paper, we investigate what guarantees a screening classifier can provide, independently of whether it is constructed manually or trained. We find that current solutions do not enjoy distribution-free theoretical guarantees -- we show that, in general, even for a perfectly calibrated classifier, there always exist specific pools of candidates for which its shortlist is suboptimal. Then, we develop a distribution-free screening algorithm -- called Calibrated Subset Selection (CSS) -- that, given any classifier and some amount of calibration data, finds near-optimal shortlists of candidates that contain a desired number of qualified candidates in expectation. Moreover, we show that a variant of CSS that calibrates a given classifier multiple times across specific groups can create shortlists with provable diversity guarantees. Experiments on US Census survey data validate our theoretical results and show that the shortlists provided by our algorithm are superior to those provided by several competitive baselines.
LGJan 28, 2022
Improving Expert Predictions with Conformal PredictionEleni Straitouri, Lequn Wang, Nastaran Okati et al.
Automated decision support systems promise to help human experts solve multiclass classification tasks more efficiently and accurately. However, existing systems typically require experts to understand when to cede agency to the system or when to exercise their own agency. Otherwise, the experts may be better off solving the classification tasks on their own. In this work, we develop an automated decision support system that, by design, does not require experts to understand when to trust the system to improve performance. Rather than providing (single) label predictions and letting experts decide when to trust these predictions, our system provides sets of label predictions constructed using conformal prediction$\unicode{x2014}$prediction sets$\unicode{x2014}$and forcefully asks experts to predict labels from these sets. By using conformal prediction, our system can precisely trade-off the probability that the true label is not in the prediction set, which determines how frequently our system will mislead the experts, and the size of the prediction set, which determines the difficulty of the classification task the experts need to solve using our system. In addition, we develop an efficient and near-optimal search method to find the conformal predictor under which the experts benefit the most from using our system. Simulation experiments using synthetic and real expert predictions demonstrate that our system may help experts make more accurate predictions and is robust to the accuracy of the classifier the conformal predictor relies on.
LGMar 3, 2021
Fairness of Exposure in Stochastic BanditsLequn Wang, Yiwei Bai, Wen Sun et al.
Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get exposed to the users. This raises questions of fairness to the items -- and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward regret in this setting, and present algorithms for both stochastic multi-armed bandits and stochastic linear bandits. We prove that the algorithms achieve sub-linear fairness regret and reward regret. Beyond the theoretical analysis, we also provide empirical evidence that these algorithms can fairly allocate exposure to different arms effectively.
IROct 4, 2020
User Fairness, Item Fairness, and Diversity for Rankings in Two-Sided MarketsLequn Wang, Thorsten Joachims
Ranking items by their probability of relevance has long been the goal of conventional ranking systems. While this maximizes traditional criteria of ranking performance, there is a growing understanding that it is an oversimplification in online platforms that serve not only a diverse user population, but also the producers of the items. In particular, ranking algorithms are expected to be fair in how they serve all groups of users -- not just the majority group -- and they also need to be fair in how they divide exposure among the items. These fairness considerations can partially be met by adding diversity to the rankings, as done in several recent works. However, we show in this paper that user fairness, item fairness and diversity are fundamentally different concepts. In particular, we find that algorithms that consider only one of the three desiderata can fail to satisfy and even harm the other two. To overcome this shortcoming, we present the first ranking algorithm that explicitly enforces all three desiderata. The algorithm optimizes user and item fairness as a convex optimization problem which can be solved optimally. From its solution, a ranking policy can be derived via a novel Birkhoff-von Neumann decomposition algorithm that optimizes diversity. Beyond the theoretical analysis, we investigate empirically on a new benchmark dataset how effectively the proposed ranking algorithm can control user fairness, item fairness and diversity, as well as the trade-offs between them.
CLSep 28, 2019
Integrated Triaging for Fast Reading ComprehensionFelix Wu, Boyi Li, Lequn Wang et al.
Although according to several benchmarks automatic machine reading comprehension (MRC) systems have recently reached super-human performance, less attention has been paid to their computational efficiency. However, efficiency is of crucial importance for training and deployment in real world applications. This paper introduces Integrated Triaging, a framework that prunes almost all context in early layers of a network, leaving the remaining (deep) layers to scan only a tiny fraction of the full corpus. This pruning drastically increases the efficiency of MRC models and further prevents the later layers from overfitting to prevalent short paragraphs in the training set. Our framework is extremely flexible and naturally applicable to a wide variety of models. Our experiment on doc-SQuAD and TriviaQA tasks demonstrates its effectiveness in consistently improving both speed and quality of several diverse MRC models.
LGNov 6, 2018
CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and LearningYi Su, Lequn Wang, Michele Santacatterina et al.
The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how some new policy would have performed, if it had been used instead of the logging policy. In this paper, we identify a family of counterfactual estimators which subsumes most such estimators proposed to date. Our analysis of this family identifies a new estimator - called Continuous Adaptive Blending (CAB) - which enjoys many advantageous theoretical and practical properties. In particular, it can be substantially less biased than clipped Inverse Propensity Score (IPS) weighting and the Direct Method, and it can have less variance than Doubly Robust and IPS estimators. In addition, it is sub-differentiable such that it can be used for learning, unlike the SWITCH estimator. Experimental results show that CAB provides excellent evaluation accuracy and outperforms other counterfactual estimators in terms of learning performance.