LGSep 24, 2023
Regularization and Optimal Multiclass LearningJulian Asilis, Siddartha Devic, Shaddin Dughmi et al.
The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. It is therefore unsurprising that the practice of machine learning is rife with considerably richer algorithmic techniques for successfully controlling model capacity. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam's Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian reasoning. Most notably, we introduce an optimal learner which relaxes structural risk minimization on two dimensions: it allows the regularization function to be "local" to datapoints, and uses an unsupervised learning stage to learn this regularizer at the outset. We justify these relaxations by showing that they are necessary: removing either dimension fails to yield a near-optimal learner. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem's transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs -- judged using nodes' outdegrees minus a system of node-dependent credits -- characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.
LGMar 2
Relatively Smart: A New Approach for Instance-Optimal LearningShaddin Dughmi, Alireza F. Pour
We revisit the framework of Smart PAC learning, which seeks supervised learners which compete with semi-supervised learners that are provided full knowledge of the marginal distribution on unlabeled data. Prior work has shown that such marginal-by-marginal guarantees are possible for "most" marginals, with respect to an arbitrary fixed and known measure, but not more generally. We discover that this failure can be attributed to an "indistinguishability" phenomenon: There are marginals which cannot be statistically distinguished from other marginals that require different learning approaches. In such settings, semi-supervised learning cannot certify its guarantees from unlabeled data, rendering them arguably non-actionable. We propose relatively smart learning, a new framework which demands that a supervised learner compete only with the best "certifiable" semi-supervised guarantee. We show that such modest relaxation suffices to bypass the impossibility results from prior work. In the distribution-free setting, we show that the OIG learner is relatively smart up to squaring the sample complexity, and show that no supervised learning algorithm can do better. For distribution-family settings, we show that relatively smart learning can be impossible or can require idiosyncratic learning approaches, and its difficulty can be non-monotone in the inclusion order on distribution families.
LGMay 17
Adaptive Generate-Rank-Verify: Inference-Time Search with Costly VerificationShaddin Dughmi, Mahdi Haghifam, Yusuf Hakan Kalayci
Many inference-time language-model pipelines combine a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation. We formalize this setting using a learning-theoretic lens as generative active search: a cost-sensitive first-positive search problem in which a policy adaptively samples candidates from an unknown distribution, observes cheap scores, and pays for verifier labels until it finds a positive example. For a fixed prompt, the generator and reward model induce two unknown objects: a distribution over reward scores and a score-conditioned success function. When these quantities are known, we characterize the distribution-aware optimal policy using a dynamic programming approach. In the realistic and practical setting where both the score distribution and success function are unknown, we propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and top-ranked verifications. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, we show that ADAP achieves expected cost within a constant factor of the distribution-aware optimum. We complement this result with learning-theoretic lower bounds, based on a centered star number, showing that structural assumptions on the score--label relationship are necessary. Experiments on mathematical reasoning and competitive programming validate the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines.
LGMay 11
A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode CollapseAtul Ganju, Travis McVoy, Shaddin Dughmi et al.
We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As in [arXiv:2504.14370, arXiv:2511.05295], we aim for \emph{breadth}, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors ``simpler'' or ``more plausible'' outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators -- the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.
MLMay 8, 2024
Is Transductive Learning Equivalent to PAC Learning?Shaddin Dughmi, Yusuf Kalayci, Grayson York
Much of learning theory is concerned with the design and analysis of probably approximately correct (PAC) learners. The closely related transductive model of learning has recently seen more scrutiny, with its learners often used as precursors to PAC learners. Our goal in this work is to understand and quantify the exact relationship between these two models. First, we observe that modest extensions of existing results show the models to be essentially equivalent for realizable learning for most natural loss functions, up to low order terms in the error and sample complexity. The situation for agnostic learning appears less straightforward, with sample complexities potentially separated by a $\frac{1}ε$ factor. This is therefore where our main contributions lie. Our results are two-fold: 1. For agnostic learning with bounded losses (including, for example, multiclass classification), we show that PAC learning reduces to transductive learning at the cost of low-order terms in the error and sample complexity via an adaptation of the reduction of arXiv:2304.09167 to the agnostic setting. 2. For agnostic binary classification, we show the converse: transductive learning is essentially no more difficult than PAC learning. Together with our first result this implies that the PAC and transductive models are essentially equivalent for agnostic binary classification. This is our most technical result, and involves two steps: A symmetrization argument on the agnostic one-inclusion graph (OIG) of arXiv:2309.13692 to derive the worst-case agnostic transductive instance, and expressing the error of the agnostic OIG algorithm for this instance in terms of the empirical Rademacher complexity of the class. We leave as an intriguing open question whether our second result can be extended beyond binary classification to show the transductive and PAC models equivalent more broadly.
LGFeb 15, 2024
Transductive Learning Is CompactJulian Asilis, Siddartha Devic, Shaddin Dughmi et al.
We demonstrate a compactness result holding broadly across supervised learning with a general class of loss functions: Any hypothesis class $H$ is learnable with transductive sample complexity $m$ precisely when all of its finite projections are learnable with sample complexity $m$. We prove that this exact form of compactness holds for realizable and agnostic learning with respect to any proper metric loss function (e.g., any norm on $\mathbb{R}^d$) and any continuous loss on a compact space (e.g., cross-entropy, squared loss). For realizable learning with improper metric losses, we show that exact compactness of sample complexity can fail, and provide matching upper and lower bounds of a factor of 2 on the extent to which such sample complexities can differ. We conjecture that larger gaps are possible for the agnostic case. Furthermore, invoking the equivalence between sample complexities in the PAC and transductive models (up to lower order factors, in the realizable case) permits us to directly port our results to the PAC model, revealing an almost-exact form of compactness holding broadly in PAC learning.
LGOct 1, 2025
Optimal Stopping vs Best-of-$N$ for Inference Time OptimizationYusuf Kalayci, Vinod Raman, Shaddin Dughmi
Large language model (LLM) generation often requires balancing output quality against inference cost, especially when using multiple generations. We introduce a new framework for inference-time optimization based on the classical Pandora's Box problem. Viewing each generation as opening a costly "box" with random reward, we develop algorithms that decide when to stop generating without knowing the underlying reward distribution. Our first contribution is a UCB-style Pandora's Box algorithm, which achieves performance that is provably close to Weitzman's algorithm, the optimal strategy when the distribution is known. We further adapt this method to practical LLM settings by addressing reward scaling across prompts via a Bradley-Terry inspired transformation. This leads to an adaptive inference-time optimization method that normalizes rewards and learns stopping thresholds on the fly. Experiments on the AlpacaFarm and HH-RLHF datasets, using multiple LLM-reward model pairs, show that our adaptive strategy can obtain the same performance as non-adaptive Best-of-N sampling while requiring 15-35 percent fewer generations on average. Our results establish a principled bridge between optimal stopping theory and inference-time scaling, providing both theoretical performance bounds and practical efficiency gains for LLM deployment.
LGFeb 14, 2025
Proper Learnability and the Role of Unlabeled DataJulian Asilis, Siddartha Devic, Shaddin Dughmi et al.
Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class $H$, and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest.
LGFeb 11, 2025
Local Regularizers Are Not Transductive LearnersSky Jafar, Julian Asilis, Shaddin Dughmi
We partly resolve an open question raised by Asilis et al. (COLT 2024): whether the algorithmic template of local regularization -- an intriguing generalization of explicit regularization, a.k.a. structural risk minimization -- suffices to learn all learnable multiclass problems. Specifically, we provide a negative answer to this question in the transductive model of learning. We exhibit a multiclass classification problem which is learnable in both the transductive and PAC models, yet cannot be learned transductively by any local regularizer. The corresponding hypothesis class, and our proof, are based on principles from cryptographic secret sharing. We outline challenges in extending our negative result to the PAC model, leaving open the tantalizing possibility of a PAC/transductive separation with respect to local regularization.
LGFeb 2, 2025
PAC Learning is just Bipartite Matching (Sort of)Shaddin Dughmi
The main goal of this article is to convince you, the reader, that supervised learning in the Probably Approximately Correct (PAC) model is closely related to -- of all things -- bipartite matching! En-route from PAC learning to bipartite matching, I will overview a particular transductive model of learning, and associated one-inclusion graphs, which can be viewed as a generalization of some of the hat puzzles that are popular in recreational mathematics. Whereas this transductive model is far from new, it has recently seen a resurgence of interest as a tool for tackling deep questions in learning theory. A secondary purpose of this article could be as a (biased) tutorial on the connections between the PAC and transductive models of learning.
GTNov 21, 2017
On the Distortion of Voting with Multiple Representative CandidatesYu Cheng, Shaddin Dughmi, David Kempe
We study positional voting rules when candidates and voters are embedded in a common metric space, and cardinal preferences are naturally given by distances in the metric space. In a positional voting rule, each candidate receives a score from each ballot based on the ballot's rank order; the candidate with the highest total score wins the election. The cost of a candidate is his sum of distances to all voters, and the distortion of an election is the ratio between the cost of the elected candidate and the cost of the optimum candidate. We consider the case when candidates are representative of the population, in the sense that they are drawn i.i.d. from the population of the voters, and analyze the expected distortion of positional voting rules. Our main result is a clean and tight characterization of positional voting rules that have constant expected distortion (independent of the number of candidates and the metric space). Our characterization result immediately implies constant expected distortion for Borda Count and elections in which each voter approves a constant fraction of all candidates. On the other hand, we obtain super-constant expected distortion for Plurality, Veto, and approving a constant number of candidates. These results contrast with previous results on voting with metric preferences: When the candidates are chosen adversarially, all of the preceding voting rules have distortion linear in the number of candidates or voters. Thus, the model of representative candidates allows us to distinguish voting rules which seem equally bad in the worst case.
GTMay 4, 2017
Of the People: Voting Is More Effective with Representative CandidatesYu Cheng, Shaddin Dughmi, David Kempe
In light of the classic impossibility results of Arrow and Gibbard and Satterthwaite regarding voting with ordinal rules, there has been recent interest in characterizing how well common voting rules approximate the social optimum. In order to quantify the quality of approximation, it is natural to consider the candidates and voters as embedded within a common metric space, and to ask how much further the chosen candidate is from the population as compared to the socially optimal one. We use this metric preference model to explore a fundamental and timely question: does the social welfare of a population improve when candidates are representative of the population? If so, then by how much, and how does the answer depend on the complexity of the metric space? We restrict attention to the most fundamental and common social choice setting: a population of voters, two independently drawn candidates, and a majority rule election. When candidates are not representative of the population, it is known that the candidate selected by the majority rule can be thrice as far from the population as the socially optimal one. We examine how this ratio improves when candidates are drawn independently from the population of voters. Our results are two-fold: When the metric is a line, the ratio improves from $3$ to $4-2\sqrt{2}$, roughly $1.1716$; this bound is tight. When the metric is arbitrary, we show a lower bound of $1.5$ and a constant upper bound strictly better than $2$ on the approximation ratio of the majority rule. The positive result depends in part on the assumption that candidates are independent and identically distributed. However, we show that independence alone is not enough to achieve the upper bound: even when candidates are drawn independently, if the population of candidates can be different from the voters, then an upper bound of $2$ on the approximation is tight.
GTMar 11, 2017
Mitigating the Curse of Correlation in Security Games by Entropy MaximizationHaifeng Xu, Milind Tambe, Shaddin Dughmi et al.
In Stackelberg security games, a defender seeks to randomly allocate limited security resources to protect critical targets from an attack. In this paper, we study a fundamental, yet underexplored, phenomenon in security games, which we term the \emph{Curse of Correlation} (CoC). Specifically, we observe that there are inevitable correlations among the protection status of different targets. Such correlation is a crucial concern, especially in \emph{spatio-temporal} domains like conservation area patrolling, where attackers can surveil patrollers at certain areas and then infer their patrolling routes using such correlations. To mitigate this issue, we propose to design entropy-maximizing defending strategies for spatio-temporal security games, which frequently suffer from CoC. We prove that the problem is \#P-hard in general. However, it admits efficient algorithms in well-motivated special settings. Our experiments show significant advantages of max-entropy algorithms over previous algorithms. A scalable implementation of our algorithm is currently under pre-deployment testing for integration into FAMS software to improve the scheduling of US federal air marshals.
GTApr 23, 2015
Security Games with Information Leakage: Modeling and ComputationHaifeng Xu, Albert X. Jiang, Arunesh Sinha et al.
Most models of Stackelberg security games assume that the attacker only knows the defender's mixed strategy, but is not able to observe (even partially) the instantiated pure strategy. Such partial observation of the deployed pure strategy -- an issue we refer to as information leakage -- is a significant concern in practical applications. While previous research on patrolling games has considered the attacker's real-time surveillance, our settings, therefore models and techniques, are fundamentally different. More specifically, after describing the information leakage model, we start with an LP formulation to compute the defender's optimal strategy in the presence of leakage. Perhaps surprisingly, we show that a key subproblem to solve this LP (more precisely, the defender oracle) is NP-hard even for the simplest of security game models. We then approach the problem from three possible directions: efficient algorithms for restricted cases, approximation algorithms, and heuristic algorithms for sampling that improves upon the status quo. Our experiments confirm the necessity of handling information leakage and the advantage of our algorithms.