h-index43
31papers
496citations
Novelty60%
AI Score58

31 Papers

LGOct 16, 2022
Loss Minimization through the Lens of Outcome Indistinguishability

Parikshit Gopalan, Lunjia Hu, Michael P. Kim et al.

We present a new perspective on loss minimization and the recent notion of Omniprediction through the lens of Outcome Indistingusihability. For a collection of losses and hypothesis class, omniprediction requires that a predictor provide a loss-minimization guarantee simultaneously for every loss in the collection compared to the best (loss-specific) hypothesis in the class. We present a generic template to learn predictors satisfying a guarantee we call Loss Outcome Indistinguishability. For a set of statistical tests--based on a collection of losses and hypothesis class--a predictor is Loss OI if it is indistinguishable (according to the tests) from Nature's true probabilities over outcomes. By design, Loss OI implies omniprediction in a direct and intuitive manner. We simplify Loss OI further, decomposing it into a calibration condition plus multiaccuracy for a class of functions derived from the loss and hypothesis classes. By careful analysis of this class, we give efficient constructions of omnipredictors for interesting classes of loss functions, including non-convex losses. This decomposition highlights the utility of a new multi-group fairness notion that we call calibrated multiaccuracy, which lies in between multiaccuracy and multicalibration. We show that calibrated multiaccuracy implies Loss OI for the important set of convex losses arising from Generalized Linear Models, without requiring full multicalibration. For such losses, we show an equivalence between our computational notion of Loss OI and a geometric notion of indistinguishability, formulated as Pythagorean theorems in the associated Bregman divergence. We give an efficient algorithm for calibrated multiaccuracy with computational complexity comparable to that of multiaccuracy. In all, calibrated multiaccuracy offers an interesting tradeoff point between efficiency and generality in the omniprediction landscape.

LGOct 24, 2022
Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

John Duchi, Vitaly Feldman, Lunjia Hu et al.

Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $μ_i$. Our goal is to recover the linear subspace shared by $μ_1,\ldots,μ_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $μ_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently-computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.

LGSep 15, 2022
Omnipredictors for Constrained Optimization

Lunjia Hu, Inbal Livni-Navon, Omer Reingold et al.

The notion of omnipredictors (Gopalan, Kalai, Reingold, Sharan and Wieder ITCS 2021), suggested a new paradigm for loss minimization. Rather than learning a predictor based on a known loss function, omnipredictors can easily be post-processed to minimize any one of a rich family of loss functions compared with the loss of hypotheses in a class $\mathcal C$. It has been shown that such omnipredictors exist and are implied (for all convex and Lipschitz loss functions) by the notion of multicalibration from the algorithmic fairness literature. In this paper, we introduce omnipredictors for constrained optimization and study their complexity and implications. The notion that we introduce allows the learner to be unaware of the loss function that will be later assigned as well as the constraints that will be later imposed, as long as the subpopulations that are used to define these constraints are known. We show how to obtain omnipredictors for constrained optimization problems, relying on appropriate variants of multicalibration. We also investigate the implications of this notion when the constraints used are so-called group fairness notions.

LGApr 19, 2023
Loss Minimization Yields Multicalibration for Large Neural Networks

Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu et al.

Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a distinct goal than loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the predictors are neural networks of size $n > k$. We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$. We also give evidence that our bound on the number of unlucky values is tight, given our proof technique. Previously, results of the flavor that loss minimization yields multicalibration were known only for predictors that were near the ground truth, hence were rather limited in applicability. Unlike these, our results rely on the expressivity of neural nets and utilize the representation of the predictor.

LGNov 30, 2022
A Unifying Theory of Distance from Calibration

Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu et al.

We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well-understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to each other, and many popular measures such as Expected Calibration Error (ECE) fail to satisfy basic properties like continuity. We present a rigorous framework for analyzing calibration measures, inspired by the literature on property testing. We propose a ground-truth notion of distance from calibration: the $\ell_1$ distance to the nearest perfectly calibrated predictor. We define a consistent calibration measure as one that is polynomially related to this distance. Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently: smooth calibration, interval calibration, and Laplace kernel calibration. The former two give quadratic approximations to the ground truth distance, which we show is information-theoretically optimal in a natural model for measuring calibration which we term the prediction-only access model. Our work thus establishes fundamental lower and upper bounds on measuring the distance to calibration, and also provides theoretical justification for preferring certain metrics (like Laplace kernel calibration) in practice.

LGMar 9, 2022
Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability

Lunjia Hu, Charlotte Peale, Omer Reingold

We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $D$ of distinguishers examining the outcomes generated according to the predictors' predictions. In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class $P$ containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of $P$ w.r.t. the dual Minkowski norm defined by $D$, and equivalently by the metric entropy of $D$ w.r.t. the dual Minkowski norm defined by $P$. This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where $P$ contains all possible predictors, hence the sample complexity only depends on $D$. In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of $D$. We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).

LGOct 25, 2023
Simple, Scalable and Effective Clustering via One-Dimensional Projections

Moses Charikar, Monika Henzinger, Lunjia Hu et al.

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $Ω(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\smash{\widetilde{O}(k^4)}$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.

CCFeb 24, 2023
Generative Models of Huge Objects

Lunjia Hu, Inbal Livni-Navon, Omer Reingold

This work initiates the systematic study of explicit distributions that are indistinguishable from a single exponential-size combinatorial object. In this we extend the work of Goldreich, Goldwasser and Nussboim (SICOMP 2010) that focused on the implementation of huge objects that are indistinguishable from the uniform distribution, satisfying some global properties (which they coined truthfulness). Indistinguishability from a single object is motivated by the study of generative models in learning theory and regularity lemmas in graph theory. Problems that are well understood in the setting of pseudorandomness present significant challenges and at times are impossible when considering generative models of huge objects. We demonstrate the versatility of this study by providing a learning algorithm for huge indistinguishable objects in several natural settings including: dense functions and graphs with a truthfulness requirement on the number of ones in the function or edges in the graphs, and a version of the weak regularity lemma for sparse graphs that satisfy some global properties. These and other results generalize basic pseudorandom objects as well as notions introduced in algorithmic fairness. The results rely on notions and techniques from a variety of areas including learning theory, complexity theory, cryptography, and game theory.

LGNov 16, 2022
Comparative Learning: A Sample Complexity Theory for Two Hypothesis Classes

Lunjia Hu, Charlotte Peale

In many learning theory problems, a central role is played by a hypothesis class: we might assume that the data is labeled according to a hypothesis in the class (usually referred to as the realizable setting), or we might evaluate the learned model by comparing it with the best hypothesis in the class (the agnostic setting). Taking a step beyond these classic setups that involve only a single hypothesis class, we introduce comparative learning as a combination of the realizable and agnostic settings in PAC learning: given two binary hypothesis classes $S$ and $B$, we assume that the data is labeled according to a hypothesis in the source class $S$ and require the learned model to achieve an accuracy comparable to the best hypothesis in the benchmark class $B$. Even when both $S$ and $B$ have infinite VC dimensions, comparative learning can still have a small sample complexity. We show that the sample complexity of comparative learning is characterized by the mutual VC dimension $\mathsf{VC}(S,B)$ which we define to be the maximum size of a subset shattered by both $S$ and $B$. We also show a similar result in the online setting, where we give a regret characterization in terms of the mutual Littlestone dimension $\mathsf{Ldim}(S,B)$. These results also hold for partial hypotheses. We additionally show that the insights necessary to characterize the sample complexity of comparative learning can be applied to characterize the sample complexity of realizable multiaccuracy and multicalibration using the mutual fat-shattering dimension, an analogue of the mutual VC dimension for real-valued hypotheses. This not only solves an open problem proposed by Hu, Peale, Reingold (2022), but also leads to independently interesting results extending classic ones about regression, boosting, and covering number to our two-hypothesis-class setting.

LGNov 7, 2025
Efficient Swap Multicalibration of Elicitable Properties

Lunjia Hu, Haipeng Luo, Spandan Senapati et al.

Multicalibration [HJKRR18] is an algorithmic fairness perspective that demands that the predictions of a predictor are correct conditional on themselves and membership in a collection of potentially overlapping subgroups of a population. The work of [NR23] established a surprising connection between multicalibration for an arbitrary property $Γ$ (e.g., mean or median) and property elicitation: a property $Γ$ can be multicalibrated if and only if it is elicitable, where elicitability is the notion that the true property value of a distribution can be obtained by solving a regression problem over the distribution. In the online setting, [NR23] proposed an inefficient algorithm that achieves $\sqrt T$ $\ell_2$-multicalibration error for a hypothesis class of group membership functions and an elicitable property $Γ$, after $T$ rounds of interaction between a forecaster and adversary. In this paper, we generalize multicalibration for an elicitable property $Γ$ from group membership functions to arbitrary bounded hypothesis classes and introduce a stronger notion -- swap multicalibration, following [GKR23]. Subsequently, we propose an oracle-efficient algorithm which, when given access to an online agnostic learner, achieves $T^{1/(r+1)}$ $\ell_r$-swap multicalibration error with high probability (for $r\ge2$) for a hypothesis class with bounded sequential Rademacher complexity and an elicitable property $Γ$. For the special case of $r=2$, this implies an oracle-efficient algorithm that achieves $T^{1/3}$ $\ell_2$-swap multicalibration error, which significantly improves on the previously established bounds for the problem [NR23, GMS25, LSS25a], and completely resolves an open question raised in [GJRR24] on the possibility of an oracle-efficient algorithm that achieves $\sqrt{T}$ $\ell_2$-mean multicalibration error by answering it in a strongly affirmative sense.

86.8LGMay 18
Testable and Actionable Calibration for Full Swap Regret

Konstantina Bairaktari, Lunjia Hu, Huy L. Nguyen et al.

AI generated predictions increasingly inform decision making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, there is no existing calibration measure that fully satisfies both properties, and all existing measures relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL compared to other measures lead to better performance in practice.

LGFeb 24
Equitable Evaluation via Elicitation

Elbert Du, Cynthia Dwork, Lunjia Hu et al.

Individuals with similar qualifications and skills may vary in their demeanor, or outward manner: some tend toward self-promotion while others are modest to the point of omitting crucial information. Comparing the self-descriptions of equally qualified job-seekers with different self-presentation styles is therefore problematic. We build an interactive AI for skill elicitation that provides accurate determination of skills while simultaneously allowing individuals to speak in their own voice. Such a system can be deployed, for example, when a new user joins a professional networking platform, or when matching employees to needs during a company reorganization. To obtain sufficient training data, we train an LLM to act as synthetic humans. Elicitation mitigates endogenous bias arising from individuals' own self-reports. To address systematic model bias we enforce a mathematically rigorous notion of equitability ensuring that the covariance between self-presentation manner and skill evaluation error is small.

LGFeb 9
Near-optimal Swap Regret Minimization for Convex Losses

Lunjia Hu, Jon Schneider, Yifan Wu

We give a randomized online algorithm that guarantees near-optimal $\widetilde O(\sqrt T)$ expected swap regret against any sequence of $T$ adaptively chosen Lipschitz convex losses on the unit interval. This improves the previous best bound of $\widetilde O(T^{2/3})$ and answers an open question of Fishelson et al. [2025b]. In addition, our algorithm is efficient: it runs in $\mathsf{poly}(T)$ time. A key technical idea we develop to obtain this result is to discretize the unit interval into bins at multiple scales of granularity and simultaneously use all scales to make randomized predictions, which we call multi-scale binning and may be of independent interest. A direct corollary of our result is an efficient online algorithm for minimizing the calibration error for general elicitable properties. This result does not require the Lipschitzness assumption of the identification function needed in prior work, making it applicable to median calibration, for which we achieve the first $\widetilde O(\sqrt T)$ calibration error guarantee.

LGFeb 12, 2024
On Computationally Efficient Multi-Class Calibration

Parikshit Gopalan, Lunjia Hu, Guy N. Rothblum

Consider a multi-class labelling problem, where the labels can take values in $[k]$, and a predictor predicts a distribution over the labels. In this work, we study the following foundational question: Are there notions of multi-class calibration that give strong guarantees of meaningful predictions and can be achieved in time and sample complexities polynomial in $k$? Prior notions of calibration exhibit a tradeoff between computational efficiency and expressivity: they either suffer from having sample complexity exponential in $k$, or needing to solve computationally intractable problems, or give rather weak guarantees. Our main contribution is a notion of calibration that achieves all these desiderata: we formulate a robust notion of projected smooth calibration for multi-class predictions, and give new recalibration algorithms for efficiently calibrating predictors under this definition with complexity polynomial in $k$. Projected smooth calibration gives strong guarantees for all downstream decision makers who want to use the predictor for binary classification problems of the form: does the label belong to a subset $T \subseteq [k]$: e.g. is this an image of an animal? It ensures that the probabilities predicted by summing the probabilities assigned to labels in $T$ are close to some perfectly calibrated binary predictor for that task. We also show that natural strengthenings of our definition are computationally hard to achieve: they run into information theoretic barriers or computational intractability. Underlying both our upper and lower bounds is a tight connection that we prove between multi-class calibration and the well-studied problem of agnostic learning in the (standard) binary prediction setting.

DSFeb 19
Simultaneous Blackwell Approachability and Applications to Multiclass Omniprediction

Lunjia Hu, Kevin Tian, Chutong Yang

Omniprediction is a learning problem that requires suboptimality bounds for each of a family of losses $\mathcal{L}$ against a family of comparator predictors $\mathcal{C}$. We initiate the study of omniprediction in a multiclass setting, where the comparator family $\mathcal{C}$ may be infinite. Our main result is an extension of the recent binary omniprediction algorithm of [OKK25] to the multiclass setting, with sample complexity (in statistical settings) or regret horizon (in online settings) $\approx \varepsilon^{-(k+1)}$, for $\varepsilon$-omniprediction in a $k$-class prediction problem. En route to proving this result, we design a framework of potential broader interest for solving Blackwell approachability problems where multiple sets must simultaneously be approached via coupled actions.

LGFeb 20, 2024
Testing Calibration in Nearly-Linear Time

Lunjia Hu, Arun Jambulapati, Kevin Tian et al.

In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on $(predictions, binary outcomes)$, our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$, and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $Ω(n^ω)$ time, where $ω> 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.

LGApr 21, 2024
Calibration Error for Decision Making

Lunjia Hu, Yifan Wu

Calibration allows predictions to be reliably interpreted as probabilities by decision makers. We propose a decision-theoretic calibration error, the Calibration Decision Loss (CDL), defined as the maximum improvement in decision payoff obtained by calibrating the predictions, where the maximum is over all payoff-bounded decision tasks. Vanishing CDL guarantees the payoff loss from miscalibration vanishes simultaneously for all downstream decision tasks. We show separations between CDL and existing calibration error metrics, including the most well-studied metric Expected Calibration Error (ECE). Our main technical contribution is a new efficient algorithm for online calibration that achieves near-optimal $O(\frac{\log T}{\sqrt{T}})$ expected CDL, bypassing the $Ω(T^{-0.472})$ lower bound for ECE by Qiao and Valiant (2021).

CCJul 8, 2025
Generalized and Unified Equivalences between Hardness and Pseudoentropy

Lunjia Hu, Salil Vadhan

Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be simultaneously achieved by a single, universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithm fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.

LGNov 20, 2024
Omnipredicting Single-Index Models with Multi-Index Models

Lunjia Hu, Kevin Tian, Chutong Yang

Recent work on supervised learning [GKR+22] defined the notion of omnipredictors, i.e., predictor functions $p$ over features that are simultaneously competitive for minimizing a family of loss functions $\mathcal{L}$ against a comparator class $\mathcal{C}$. Omniprediction requires approximating the Bayes-optimal predictor beyond the loss minimization paradigm, and has generated significant interest in the learning theory community. However, even for basic settings such as agnostically learning single-index models (SIMs), existing omnipredictor constructions require impractically-large sample complexities and runtimes, and output complex, highly-improper hypotheses. Our main contribution is a new, simple construction of omnipredictors for SIMs. We give a learner outputting an omnipredictor that is $\varepsilon$-competitive on any matching loss induced by a monotone, Lipschitz link function, when the comparator class is bounded linear predictors. Our algorithm requires $\approx \varepsilon^{-4}$ samples and runs in nearly-linear time, and its sample complexity improves to $\approx \varepsilon^{-2}$ if link functions are bi-Lipschitz. This significantly improves upon the only prior known construction, due to [HJKRR18, GHK+23], which used $\gtrsim \varepsilon^{-10}$ samples. We achieve our construction via a new, sharp analysis of the classical Isotron algorithm [KS09, KKKS11] in the challenging agnostic learning setting, of potential independent interest. Previously, Isotron was known to properly learn SIMs in the realizable setting, as well as constant-factor competitive hypotheses under the squared loss [ZWDD24]. As they are based on Isotron, our omnipredictors are multi-index models with $\approx \varepsilon^{-2}$ prediction heads, bringing us closer to the tantalizing goal of proper omniprediction for general loss families and comparators.

LGMay 1, 2024
Multigroup Robustness

Lunjia Hu, Charlotte Peale, Judy Hanwen Shen · stanford

To address the shortcomings of real-world datasets, robust learning algorithms have been designed to overcome arbitrary and indiscriminate data corruption. However, practical processes of gathering data may lead to patterns of data corruption that are localized to specific partitions of the training dataset. Motivated by critical applications where the learned model is deployed to make predictions about people from a rich collection of overlapping subpopulations, we initiate the study of multigroup robust algorithms whose robustness guarantees for each subpopulation only degrade with the amount of data corruption inside that subpopulation. When the data corruption is not distributed uniformly over subpopulations, our algorithms provide more meaningful robustness guarantees than standard guarantees that are oblivious to how the data corruption and the affected subpopulations are related. Our techniques establish a new connection between multigroup fairness and robustness.

LGOct 7, 2025
Making and Evaluating Calibrated Forecasts

Yuxuan Lu, Yifan Wu, Jason Hartline et al.

Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting. We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.

LGSep 2, 2025
Calibration through the Lens of Indistinguishability

Parikshit Gopalan, Lunjia Hu

Calibration is a classical notion from the forecasting literature which aims to address the question: how should predicted probabilities be interpreted? In a world where we only get to observe (discrete) outcomes, how should we evaluate a predictor that hypothesizes (continuous) probabilities over possible outcomes? The study of calibration has seen a surge of recent interest, given the ubiquity of probabilistic predictions in machine learning. This survey describes recent work on the foundational questions of how to define and measure calibration error, and what these measures mean for downstream decision makers who wish to use the predictions to make decisions. A unifying viewpoint that emerges is that of calibration as a form of indistinguishability, between the world hypothesized by the predictor and the real world (governed by nature or the Bayes optimal predictor). In this view, various calibration measures quantify the extent to which the two worlds can be told apart by certain classes of distinguishers or statistical measures.

LGAug 18, 2025
A Perfectly Truthful Calibration Measure

Jason Hartline, Lunjia Hu, Yifan Wu

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE.

LGJun 20, 2025
How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension

Cynthia Dwork, Lunjia Hu, Han Shao

We study a fundamental question of domain generalization: given a family of domains (i.e., data distributions), how many randomly sampled domains do we need to collect data from in order to learn a model that performs reasonably well on every seen and unseen domain in the family? We model this problem in the PAC framework and introduce a new combinatorial measure, which we call the domain shattering dimension. We show that this dimension characterizes the domain sample complexity. Furthermore, we establish a tight quantitative relationship between the domain shattering dimension and the classic VC dimension, demonstrating that every hypothesis class that is learnable in the standard PAC setting is also learnable in our setting.

LGMay 30, 2023
When Does Optimizing a Proper Loss Yield Calibration?

Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu et al.

Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors, that are unlikely to contain the ground truth. Under what circumstances does optimizing proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a rigorous answer to these questions. We replace the global optimality with a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions. We show that any predictor with this local optimality satisfies smooth calibration as defined in Kakade-Foster (2008), Błasiok et al. (2023). Local optimality is plausibly satisfied by well-trained DNNs, which suggests an explanation for why they are calibrated from proper loss minimization alone. Finally, we show that the connection between local optimality and calibration error goes both ways: nearly calibrated predictors are also nearly locally optimal.

LGJun 29, 2021
Near-Optimal Explainable $k$-Means for All Dimensions

Moses Charikar, Lunjia Hu

Many clustering algorithms are guided by certain cost functions such as the widely-used $k$-means cost. These algorithms divide data points into clusters with often complicated boundaries, creating difficulties in explaining the clustering decision. In a recent work, Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020) introduced explainable clustering, where the cluster boundaries are axis-parallel hyperplanes and the clustering is obtained by applying a decision tree to the data. The central question here is: how much does the explainability constraint increase the value of the cost function? Given $d$-dimensional data points, we show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1 - 2/d}\,\mathrm{poly}(d\log k)$ times the minimum cost achievable by a clustering without the explainability constraint, assuming $k,d\ge 2$. Taking the minimum of this bound and the $k\,\mathrm{polylog} (k)$ bound in independent work by Makarychev-Shan (ICML 2021), Gamlath-Jia-Polak-Svensson (2021), or Esfandiari-Mirrokni-Narayanan (2021), we get an improved bound of $k^{1 - 2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k,d\ge 2$ up to a poly-logarithmic factor in $k$. For $d = 2$ in particular, we show an $O(\log k\log\log k)$ bound, improving near-exponentially over the previous best bound of $O(k\log k)$ by Laber and Murtinho (ICML 2021).

DSAug 18, 2020
Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers

Lunjia Hu, Omer Reingold

We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $\varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $\varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time $\widetilde O(Nd)$. Our results extend recent work on computationally-efficient robust estimation to a more widely applicable incomplete-data setting.

LGMar 24, 2020
Robust and On-the-fly Dataset Denoising for Image Classification

Jiaming Song, Lunjia Hu, Michael Auli et al.

Memorization in over-parameterized neural networks could severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are hard to avoid in extremely large datasets collected with weak supervision. We address this problem by reasoning counterfactually about the loss distribution of examples with uniform random labels had they were trained with the real examples, and use this information to remove noisy examples from the training set. First, we observe that examples with uniform random labels have higher losses when trained with stochastic gradient descent under large learning rates. Then, we propose to model the loss distribution of the counterfactual examples using only the network parameters, which is able to model such examples with remarkable success. Finally, we propose to remove examples whose loss exceeds a certain quantile of the modeled loss distribution. This leads to On-the-fly Data Denoising (ODD), a simple yet effective algorithm that is robust to mislabeled examples, while introducing almost zero computational overhead compared to standard training. ODD is able to achieve state-of-the-art results on a wide range of datasets including real-world ones such as WebVision and Clothing1M.

LGOct 28, 2018
Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation

Liwei Wang, Lunjia Hu, Jiayuan Gu et al.

It is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations do deep neural networks learn. In this work, we move a tiny step towards a theory and better understanding of the representations. Specifically, we study a simpler problem: How similar are the representations learned by two networks with identical architecture but trained from different initializations. We develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are maximum match and simple match which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, we conduct extensive experiments using our algorithms. Experimental results suggest that, surprisingly, representations learned by the same convolutional layers of networks trained from different initializations are not as similar as prevalently expected, at least in terms of subspace match.

MLNov 1, 2017
Active Tolerant Testing

Avrim Blum, Lunjia Hu

In this work, we give the first algorithms for tolerant testing of nontrivial classes in the active model: estimating the distance of a target function to a hypothesis class C with respect to some arbitrary distribution D, using only a small number of label queries to a polynomial-sized pool of unlabeled examples drawn from D. Specifically, we show that for the class D of unions of d intervals on the line, we can estimate the error rate of the best hypothesis in the class to an additive error epsilon from only $O(\frac{1}{ε^6}\log \frac{1}ε)$ label queries to an unlabeled pool of size $O(\frac{d}{ε^2}\log \frac{1}ε)$. The key point here is the number of labels needed is independent of the VC-dimension of the class. This extends the work of Balcan et al. [2012] who solved the non-tolerant testing problem for this class (distinguishing the zero-error case from the case that the best hypothesis in the class has error greater than epsilon). We also consider the related problem of estimating the performance of a given learning algorithm A in this setting. That is, given a large pool of unlabeled examples drawn from distribution D, can we, from only a few label queries, estimate how well A would perform if the entire dataset were labeled? We focus on k-Nearest Neighbor style algorithms, and also show how our results can be applied to the problem of hyperparameter tuning (selecting the best value of k for the given learning problem).

LGFeb 18, 2017
Quadratic Upper Bound for Recursive Teaching Dimension of Finite VC Classes

Lunjia Hu, Ruihan Wu, Tianhong Li et al.

In this work we study the quantitative relation between the recursive teaching dimension (RTD) and the VC dimension (VCD) of concept classes of finite sizes. The RTD of a concept class $\mathcal C \subseteq \{0, 1\}^n$, introduced by Zilles et al. (2011), is a combinatorial complexity measure characterized by the worst-case number of examples necessary to identify a concept in $\mathcal C$ according to the recursive teaching model. For any finite concept class $\mathcal C \subseteq \{0,1\}^n$ with $\mathrm{VCD}(\mathcal C)=d$, Simon & Zilles (2015) posed an open problem $\mathrm{RTD}(\mathcal C) = O(d)$, i.e., is RTD linearly upper bounded by VCD? Previously, the best known result is an exponential upper bound $\mathrm{RTD}(\mathcal C) = O(d \cdot 2^d)$, due to Chen et al. (2016). In this paper, we show a quadratic upper bound: $\mathrm{RTD}(\mathcal C) = O(d^2)$, much closer to an answer to the open problem. We also discuss the challenges in fully solving the problem.