Siddartha Devic

LG
h-index55
15papers
118citations
Novelty55%
AI Score56

15 Papers

LGSep 24, 2023
Regularization and Optimal Multiclass Learning

Julian Asilis, Siddartha Devic, Shaddin Dughmi et al.

The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. It is therefore unsurprising that the practice of machine learning is rife with considerably richer algorithmic techniques for successfully controlling model capacity. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam's Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian reasoning. Most notably, we introduce an optimal learner which relaxes structural risk minimization on two dimensions: it allows the regularization function to be "local" to datapoints, and uses an unsupervised learning stage to learn this regularizer at the outset. We justify these relaxations by showing that they are necessary: removing either dimension fails to yield a near-optimal learner. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem's transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs -- judged using nodes' outdegrees minus a system of node-dependent credits -- characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.

LGFeb 8, 2023
Fairness in Matching under Uncertainty

Siddartha Devic, David Kempe, Vatsal Sharan et al.

The prevalence and importance of algorithmic two-sided marketplaces has drawn attention to the issue of fairness in such settings. Algorithmic decisions are used in assigning students to schools, users to advertisers, and applicants to job interviews. These decisions should heed the preferences of individuals, and simultaneously be fair with respect to their merits (synonymous with fit, future performance, or need). Merits conditioned on observable features are always \emph{uncertain}, a fact that is exacerbated by the widespread use of machine learning algorithms to infer merit from the observables. As our key contribution, we carefully axiomatize a notion of individual fairness in the two-sided marketplace setting which respects the uncertainty in the merits; indeed, it simultaneously recognizes uncertainty as the primary potential cause of unfairness and an approach to address it. We design a linear programming framework to find fair utility-maximizing distributions over allocations, and we show that the linear program is robust to perturbations in the estimated parameters of the uncertain merit distributions, a key property in combining the approach with machine learning techniques.

LGJan 12
Are LLM Decisions Faithful to Verbal Confidence?

Jiawei Wang, Yanfei Zhou, Siddartha Devic et al.

Large Language Models (LLMs) can produce surprisingly sophisticated estimates of their own uncertainty. However, it remains unclear to what extent this expressed confidence is tied to the reasoning, knowledge, or decision making of the model. To test this, we introduce $\textbf{RiskEval}$: a framework designed to evaluate whether models adjust their abstention policies in response to varying error penalties. Our evaluation of several frontier models reveals a critical dissociation: models are neither cost-aware when articulating their verbal confidence, nor strategically responsive when deciding whether to engage or abstain under high-penalty conditions. Even when extreme penalties render frequent abstention the mathematically optimal strategy, models almost never abstain, resulting in utility collapse. This indicates that calibrated verbal confidence scores may not be sufficient to create trustworthy and interpretable AI systems, as current models lack the strategic agency to convert uncertainty signals into optimal and risk-sensitive decisions.

64.1LGMay 8
Flexible Routing via Uncertainty Decomposition

Charlotte Peale, Siddartha Devic, Parikshit Gopalan et al.

A key strategy for balancing performance and cost in modern machine learning systems is to dynamically route queries to either a low-cost model or a more expensive oracle (such as a large pretrained model or human expert), an approach known as model routing. In this work we present a new uncertainty-aware router that (1) avoids unnecessary oracle calls on inherently ambiguous queries, and (2) adapts dynamically to different loss functions and cost parameters through simple hyperparameter changes, without retraining. Our method, applicable to any classification setting where multiple independent annotations per input are available, is based on decomposing total uncertainty into irreducible and reducible components using higher-order predictors [Ahdritz et al., 2025]. This enables a unified approach to both routing and abstention: predict with the weak model when uncertainty is low, route to the oracle when reducible uncertainty is high, and abstain when irreducible uncertainty is high. Our router comes with strong theoretical guarantees bounding regret relative to optimal task-specific routers. We conduct experiments on both synthetic and real-world datasets that demonstrate the benefits of our approach in suitable regimes -- in particular, whenever reducible and irreducible uncertainty are not too correlated.

LGFeb 14, 2024
Stability and Multigroup Fairness in Ranking with Uncertain Predictions

Siddartha Devic, Aleksandra Korolova, David Kempe et al.

Rankings are ubiquitous across many applications, from search engines to hiring committees. In practice, many rankings are derived from the output of predictors. However, when predictors trained for classification tasks have intrinsic uncertainty, it is not obvious how this uncertainty should be represented in the derived rankings. Our work considers ranking functions: maps from individual predictions for a classification task to distributions over rankings. We focus on two aspects of ranking functions: stability to perturbations in predictions and fairness towards both individuals and subgroups. Not only is stability an important requirement for its own sake, but -- as we show -- it composes harmoniously with individual fairness in the sense of Dwork et al. (2012). While deterministic ranking functions cannot be stable aside from trivial scenarios, we show that the recently proposed uncertainty aware (UA) ranking functions of Singh et al. (2021) are stable. Our main result is that UA rankings also achieve multigroup fairness through successful composition with multiaccurate or multicalibrated predictors. Our work demonstrates that UA rankings naturally interpolate between group and individual level fairness guarantees, while simultaneously satisfying stability guarantees important whenever machine-learned predictions are used.

CLJun 9, 2025
From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered

Siddartha Devic, Tejas Srinivasan, Jesse Thomason et al.

Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community's progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.

LGFeb 15, 2024
Transductive Learning Is Compact

Julian Asilis, Siddartha Devic, Shaddin Dughmi et al.

We demonstrate a compactness result holding broadly across supervised learning with a general class of loss functions: Any hypothesis class $H$ is learnable with transductive sample complexity $m$ precisely when all of its finite projections are learnable with sample complexity $m$. We prove that this exact form of compactness holds for realizable and agnostic learning with respect to any proper metric loss function (e.g., any norm on $\mathbb{R}^d$) and any continuous loss on a compact space (e.g., cross-entropy, squared loss). For realizable learning with improper metric losses, we show that exact compactness of sample complexity can fail, and provide matching upper and lower bounds of a factor of 2 on the extent to which such sample complexities can differ. We conjecture that larger gaps are possible for the agnostic case. Furthermore, invoking the equivalence between sample complexities in the PAC and transductive models (up to lower order factors, in the realizable case) permits us to directly port our results to the PAC model, revealing an almost-exact form of compactness holding broadly in PAC learning.

AIOct 12, 2025
Trace Length is a Simple Uncertainty Signal in Reasoning Models

Siddartha Devic, Charlotte Peale, Arwen Bradley et al.

Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., "overthinking"). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or "forking" tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.

LGSep 21, 2025
Auditability and the Landscape of Distance to Multicalibration

Nathan Derhake, Siddartha Devic, Dutch Hansen et al.

Calibration is a critical property for establishing the trustworthiness of predictors that provide uncertainty estimates. Multicalibration is a strengthening of calibration which requires that predictors be calibrated on a potentially overlapping collection of subsets of the domain. As multicalibration grows in popularity with practitioners, an essential question is: how do we measure how multicalibrated a predictor is? Błasiok et al. (2023) considered this question for standard calibration by introducing the distance to calibration framework (dCE) to understand how calibration metrics relate to each other and the ground truth. Building on the dCE framework, we consider the auditability of the distance to multicalibration of a predictor $f$. We begin by considering two natural generalizations of dCE to multiple subgroups: worst group dCE (wdMC), and distance to multicalibration (dMC). We argue that there are two essential properties of any multicalibration error metric: 1) the metric should capture how much $f$ would need to be modified in order to be perfectly multicalibrated; and 2) the metric should be auditable in an information theoretic sense. We show that wdMC and dMC each fail to satisfy one of these two properties, and that similar barriers arise when considering the auditability of general distance to multigroup fairness notions. We then propose two (equivalent) multicalibration metrics which do satisfy these requirements: 1) a continuized variant of dMC; and 2) a distance to intersection multicalibration, which leans on intersectional fairness desiderata. Along the way, we shed light on the loss-landscape of distance to multicalibration and the geometry of the set of perfectly multicalibrated predictors. Our findings may have implications for the development of stronger multicalibration algorithms as well as multigroup auditing more generally.

LGMar 3, 2025
An Efficient Plugin Method for Metric Optimization of Black-Box Models

Siddartha Devic, Nurendra Choudhary, Anirudh Srinivasan et al.

Many machine learning algorithms and classifiers are available only via API queries as a ``black-box'' -- that is, the downstream user has no ability to change, re-train, or fine-tune the model on a particular target distribution. Indeed, the downstream user may not even have knowledge of the \emph{original} training distribution or performance metric used to construct and optimize the black-box model. We propose a simple and efficient method, Plugin, which \emph{post-processes} arbitrary multiclass predictions from any black-box classifier in order to simultaneously (1) adapt these predictions to a target distribution; and (2) optimize a particular metric of the confusion matrix. Importantly, Plugin is a completely \textit{post-hoc} method which does not rely on feature information, only requires a small amount of probabilistic predictions along with their corresponding true label, and optimizes metrics by querying. We empirically demonstrate that Plugin is both broadly applicable and has performance competitive with related methods on a variety of tabular and language tasks.

LGFeb 14, 2025
Proper Learnability and the Role of Unlabeled Data

Julian Asilis, Siddartha Devic, Shaddin Dughmi et al.

Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class $H$, and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest.

LGJun 10, 2024
When is Multicalibration Post-Processing Necessary?

Dutch Hansen, Siddartha Devic, Preetum Nakkiran et al.

Calibration is a well-studied property of predictors which guarantees meaningful uncertainty estimates. Multicalibration is a related notion -- originating in algorithmic fairness -- which requires predictors to be simultaneously calibrated over a potentially complex and overlapping collection of protected subpopulations (such as groups defined by ethnicity, race, or income). We conduct the first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90 million parameter fine-tuned LLMs. Our findings can be summarized as follows: (1) models which are calibrated out of the box tend to be relatively multicalibrated without any additional post-processing; (2) multicalibration post-processing can help inherently uncalibrated models and large vision and language models; and (3) traditional calibration measures may sometimes provide multicalibration implicitly. More generally, we also distill many independent observations which may be useful for practical and effective applications of multicalibration post-processing in real-world contexts. We also release a python package implementing multicalibration algorithms, available via `pip install multicalibration'.

LGJul 12, 2021
Polynomial Time Reinforcement Learning in Factored State MDPs with Linear Value Functions

Zihao Deng, Siddartha Devic, Brendan Juba

Many reinforcement learning (RL) environments in practice feature enormous state spaces that may be described compactly by a "factored" structure, that may be modeled by Factored Markov Decision Processes (FMDPs). We present the first polynomial-time algorithm for RL in Factored State MDPs (generalizing FMDPs) that neither relies on an oracle planner nor requires a linear transition model; it only requires a linear value function with a suitable local basis with respect to the factorization, permitting efficient variable elimination. With this assumption, we can solve this family of Factored State MDPs in polynomial time by constructing an efficient separation oracle for convex optimization. Importantly, and in contrast to prior work on FMDPs, we do not assume that the transitions on various factors are conditionally independent.

LGFeb 18, 2020
ResiliNet: Failure-Resilient Inference in Distributed Neural Networks

Ashkan Yousefpour, Brian Q. Nguyen, Siddartha Devic et al.

Federated Learning aims to train distributed deep models without sharing the raw data with the centralized server. Similarly, in distributed inference of neural networks, by partitioning the network and distributing it across several physical nodes, activations and gradients are exchanged between physical nodes, rather than raw data. Nevertheless, when a neural network is partitioned and distributed among physical nodes, failure of physical nodes causes the failure of the neural units that are placed on those nodes, which results in a significant performance drop. Current approaches focus on resiliency of training in distributed neural networks. However, resiliency of inference in distributed neural networks is less explored. We introduce ResiliNet, a scheme for making inference in distributed neural networks resilient to physical node failures. ResiliNet combines two concepts to provide resiliency: skip hyperconnection, a concept for skipping nodes in distributed neural networks similar to skip connection in resnets, and a novel technique called failout, which is introduced in this paper. Failout simulates physical node failure conditions during training using dropout, and is specifically designed to improve the resiliency of distributed neural networks. The results of the experiments and ablation studies using three datasets confirm the ability of ResiliNet to provide inference resiliency for distributed neural networks.

NISep 3, 2019
Guardians of the Deep Fog: Failure-Resilient DNN Inference from Edge to Cloud

Ashkan Yousefpour, Siddartha Devic, Brian Q. Nguyen et al.

Partitioning and distributing deep neural networks (DNNs) over physical nodes such as edge, fog, or cloud nodes, could enhance sensor fusion, and reduce bandwidth and inference latency. However, when a DNN is distributed over physical nodes, failure of the physical nodes causes the failure of the DNN units that are placed on these nodes. The performance of the inference task will be unpredictable, and most likely, poor, if the distributed DNN is not specifically designed and properly trained for failures. Motivated by this, we introduce deepFogGuard, a DNN architecture augmentation scheme for making the distributed DNN inference task failure-resilient. To articulate deepFogGuard, we introduce the elements and a model for the resiliency of distributed DNN inference. Inspired by the concept of residual connections in DNNs, we introduce skip hyperconnections in distributed DNNs, which are the basis of deepFogGuard's design to provide resiliency. Next, our extensive experiments using two existing datasets for the sensing and vision applications confirm the ability of deepFogGuard to provide resiliency for distributed DNNs in edge-cloud networks.