David McAllester

h-index55

20papers

4,544citations

Novelty40%

AI Score48

Ranked #26,774 of 194,257 authors (top 14%)#5,577 in CL (top 18%)

20 Papers

14.9LGJan 25, 2023

On the Mathematics of Diffusion Models

David McAllester

This paper gives direct derivations of the differential equations and likelihood formulas of diffusion models assuming only knowledge of Gaussian distributions. A VAE analysis derives both forward and backward stochastic differential equations (SDEs) as well as non-variational integral expressions for likelihood formulas. A score-matching analysis derives the reverse diffusion ordinary differential equation (ODE) and a family of reverse-diffusion SDEs parameterized by noise level. The paper presents the mathematics directly with attributions saved for a final section.

13.7AIApr 28

Training Transformers as a Universal Computer

Ruize Xu, Chenxiao Yang, Yanhong Li et al.

We demonstrate that a small transformer can learn to execute programs in MicroPy, a simplified yet computationally universal programming language. Given procedure definitions together with an expression to evaluate, the transformer predicts small-step execution using PENCIL scaffolding for space-efficient execution within a bounded context window. After training on randomly generated, meaningless MicroPy programs, the learned transformer generalizes to various human-written programs including bit copying and flipping, binary addition and multiplication, and SAT verification and solving. We note that the trained model can achieve out-of-distribution generalization; i.e., evaluate novel programs from distribution on programs. Since MicroPy can express any computation, our results provide empirical evidence that a standard transformer can be trained to act as a universal computer.

12.0CLOct 31, 2025Code

OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

Yanhong Li, Tianyang Xu, Kenan Tang et al.

Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.

11.1CVDec 14, 2020Code

Information-Theoretic Segmentation by Inpainting Error Maximization

Pedro Savarese, Sunnie S. Y. Kim, Michael Maire et al.

We study image segmentation from an information-theoretic perspective, proposing a novel adversarial method that performs unsupervised segmentation by partitioning images into maximally independent sets. More specifically, we group image pixels into foreground and background, with the goal of minimizing predictability of one set from the other. An easily computed loss drives a greedy search process to maximize inpainting error over these partitions. Our method does not involve training deep networks, is computationally cheap, class-agnostic, and even applicable in isolation to a single unlabeled image. Experiments demonstrate that it achieves a new state-of-the-art in unsupervised segmentation quality, while being substantially faster and more general than competing approaches.

31.1CLJul 3, 2020

On-The-Fly Information Retrieval Augmentation for Language Models

Hai Wang, David McAllester

Here we experiment with the use of information retrieval as an augmentation for pre-trained language models. The text corpus used in information retrieval can be viewed as form of episodic memory which grows over time. By augmenting GPT 2.0 with information retrieval we achieve a zero shot 15% relative reduction in perplexity on Gigaword corpus without any re-training. We also validate our IR augmentation on an event co-reference task.

3.3LOMay 12, 2020

MathZero, The Classification Problem, and Set-Theoretic Type Theory

David McAllester

AlphaZero learns to play go, chess and shogi at a superhuman level through self play given only the rules of the game. This raises the question of whether a similar thing could be done for mathematics -- a MathZero. MathZero would require a formal foundation and an objective. We propose the foundation of set-theoretic dependent type theory and an objective defined in terms of the classification problem -- the problem of classifying concept instances up to isomorphism. The natural numbers arise as the solution to the classification problem for finite sets. Here we generalize classical Bourbaki set-theoretic isomorphism to set-theoretic dependent type theory. To our knowledge we give the first isomorphism inference rules for set-theoretic dependent type theory with propositional set-theoretic equality. The presentation is intended to be accessible to mathematicians with no prior exposure to type theory.

9.1LGDec 4, 2019Code

Domain-independent Dominance of Adaptive Methods

Pedro Savarese, David McAllester, Sudarshan Babu et al.

From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling of learning rate and adaptability, greatly simplifying hyperparameter search. In light of this observation, we demonstrate that, against conventional wisdom, Adam can also outperform SGD on vision tasks, as long as the coupling between its learning rate and adaptability is taken into account. In practice, AvaGrad matches the best results, as measured by generalization accuracy, delivered by any existing optimizer (SGD or adaptive) across image classification (CIFAR, ImageNet) and character-level language modelling (Penn Treebank) tasks.

30.6CLFeb 23, 2019Code

Evidence Sentence Extraction for Machine Reading Comprehension

Hai Wang, Dian Yu, Kai Sun et al.

Remarkable success has been achieved in the last few years on some limited machine reading comprehension (MRC) tasks. However, it is still difficult to interpret the predictions of existing MRC models. In this paper, we focus on extracting evidence sentences that can explain or support the answers of multiple-choice MRC tasks, where the majority of answer options cannot be directly extracted from reference documents. Due to the lack of ground truth evidence sentence labels in most cases, we apply distant supervision to generate imperfect labels and then use them to train an evidence sentence extractor. To denoise the noisy labels, we apply a recently proposed deep probabilistic logic learning framework to incorporate both sentence-level and cross-sentence linguistic indicators for indirect supervision. We feed the extracted evidence sentences into existing MRC models and evaluate the end-to-end performance on three challenging multiple-choice MRC datasets: MultiRC, RACE, and DREAM, achieving comparable or better performance than the same models that take as input the full reference document. To the best of our knowledge, this is the first work extracting evidence sentences for multiple-choice MRC.

37.3ITNov 10, 2018Code

Formal Limitations on the Measurement of Mutual Information

David McAllester, Karl Stratos

Measuring mutual information from finite data is difficult. Recent work has considered variational methods maximizing a lower bound. In this paper, we prove that serious statistical limitations are inherent to any method of measuring mutual information. More specifically, we show that any distribution-free high-confidence lower bound on mutual information estimated from N samples cannot be larger than O(ln N ).

12.0LGFeb 21, 2018

Information Theoretic Co-Training

David McAllester

This paper introduces an information theoretic co-training objective for unsupervised learning. We consider the problem of predicting the future. Rather than predict future sensations (image pixels or sound waves) we predict "hypotheses" to be confirmed by future sensations. More formally, we assume a population distribution on pairs $(x,y)$ where we can think of $x$ as a past sensation and $y$ as a future sensation. We train both a predictor model $P_Φ(z|x)$ and a confirmation model $P_Ψ(z|y)$ where we view $z$ as hypotheses (when predicted) or facts (when confirmed). For a population distribution on pairs $(x,y)$ we focus on the problem of measuring the mutual information between $x$ and $y$. By the data processing inequality this mutual information is at least as large as the mutual information between $x$ and $z$ under the distribution on triples $(x,z,y)$ defined by the confirmation model $P_Ψ(z|y)$. The information theoretic training objective for $P_Φ(z|x)$ and $P_Ψ(z|y)$ can be viewed as a form of co-training where we want the prediction from $x$ to match the confirmation from $y$.

47.3LGJun 27, 2017Code

Exploring Generalization in Deep Learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester et al.

With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.

11.3CLNov 23, 2016

Emergent Predication Structure in Hidden State Vectors of Neural Readers

Hai Wang, Takeshi Onishi, Kevin Gimpel et al.

A significant number of neural architectures for reading comprehension have recently been developed and evaluated on large cloze-style datasets. We present experiments supporting the emergence of "predication structure" in the hidden state vectors of these readers. More specifically, we provide evidence that the hidden state vectors represent atomic formulas $Φ[c]$ where $Φ$ is a semantic property (predicate) and $c$ is a constant symbol entity identifier.

13.3CLOct 26, 2016

Broad Context Language Modeling as Reading Comprehension

Zewei Chu, Hai Wang, Kevin Gimpel et al.

Progress in text understanding has been driven by large datasets that test particular capabilities, like recent datasets for reading comprehension (Hermann et al., 2015). We focus here on the LAMBADA dataset (Paperno et al., 2016), a word prediction task requiring broader context than the immediate sentence. We view LAMBADA as a reading comprehension problem and apply comprehension models based on neural networks. Though these models are constrained to choose a word from the context, they improve the state of the art on LAMBADA from 7.3% to 49%. We analyze 100 instances, finding that neural network readers perform well in cases that involve selecting a name from the context based on dialogue or discourse cues but struggle when coreference resolution or external knowledge is needed.

22.7CLAug 19, 2016

Who did What: A Large-Scale Person-Centered Cloze Dataset

Takeshi Onishi, Hai Wang, Mohit Bansal et al.

We have constructed a new "Who-did-What" dataset of over 200,000 fill-in-the-gap (cloze) multiple choice reading comprehension problems constructed from the LDC English Gigaword newswire corpus. The WDW dataset has a variety of novel features. First, in contrast with the CNN and Daily Mail datasets (Hermann et al., 2015) we avoid using article summaries for question formation. Instead, each problem is formed from two independent articles --- an article given as the passage to be read and a separate article on the same events used to form the question. Second, we avoid anonymization --- each choice is a person named entity. Third, the problems have been filtered to remove a fraction that are easily solved by simple baselines, while remaining 84% solvable by humans. We report performance benchmarks of standard systems and propose the WDW dataset as a challenge task for the community.

28.4LGJul 8, 2013

A PAC-Bayesian Tutorial with A Dropout Bound

David McAllester

This tutorial gives a concise overview of existing PAC-Bayesian theory focusing on three generalization bounds. The first is an Occam bound which handles rules with finite precision parameters and which states that generalization loss is near training loss when the number of bits needed to write the rule is small compared to the sample size. The second is a PAC-Bayesian bound providing a generalization guarantee for posterior distributions rather than for individual rules. The PAC-Bayesian bound naturally handles infinite precision rule parameters, $L_2$ regularization, {\em provides a bound for dropout training}, and defines a natural notion of a single distinguished PAC-Bayesian posterior distribution. The third bound is a training-variance bound --- a kind of bias-variance analysis but with bias replaced by expected training loss. The training-variance bound dominates the other bounds but is more difficult to interpret. It seems to suggest variance reduction methods such as bagging and may ultimately provide a more meaningful analysis of dropouts.

20.5AIJan 23, 2013

Approximate Planning for Factored POMDPs using Belief State Simplification

David A. McAllester, Satinder Singh

We are interested in the problem of planning for factored POMDPs. Building on the recent results of Kearns, Mansour and Ng, we provide a planning algorithm for factored POMDPs that exploits the accuracy-efficiency tradeoff in the belief state simplification introduced by Boyen and Koller.

1.4LGOct 8, 2012

Blending Learning and Inference in Structured Prediction

Tamir Hazan, Alexander Schwing, David McAllester et al.

In this paper we derive an efficient algorithm to learn the parameters of structured predictors in general graphical models. This algorithm blends the learning and inference tasks, which results in a significant speedup over traditional approaches, such as conditional random fields and structured support vector machines. For this purpose we utilize the structures of the predictors to describe a low dimensional structured prediction task which encourages local consistencies within the different structures while learning the parameters of the model. Convexity of the learning task provides the means to enforce the consistencies between the different parts. The inference-learning blending algorithm that we propose is guaranteed to converge to the optimum of the low dimensional primal and dual programs. Unlike many of the existing approaches, the inference-learning blending allows us to learn efficiently high-order graphical models, over regions of any size, and very large number of parameters. We demonstrate the effectiveness of our approach, while presenting state-of-the-art results in stereo estimation, semantic segmentation, shape reconstruction, and indoor scene understanding.

2.4AIAug 25, 2012

Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (2008)

David McAllester, Petri Myllymaki

This is the Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, which was held in Helsinki, Finland, July 9 - 12 2008.

18.1AIJul 11, 2012

Case-Factor Diagrams for Structured Probabilistic Modeling

David A. McAllester, Michael Collins, Fernando Pereira

We introduce a probabilistic formalism subsuming Markov random fields of bounded tree width and probabilistic context free grammars. Our models are based on a representation of Boolean formulas that we call case-factor diagrams (CFDs). CFDs are similar to binary decision diagrams (BDDs) but are concise for circuits of bounded tree width (unlike BDDs) and can concisely represent the set of parse trees over a given string undera given context free grammar (also unlike BDDs). A probabilistic model consists of aCFD defining a feasible set of Boolean assignments and a weight (or cost) for each individual Boolean variable. We give an insideoutside algorithm for simultaneously computing the marginal of each Boolean variable, and a Viterbi algorithm for finding the mininum cost variable assignment. Both algorithms run in time proportional to the size of the CFD.

19.4CVApr 6, 2012

Continuous Markov Random Fields for Robust Stereo Estimation

Koichiro Yamaguchi, Tamir Hazan, David McAllester et al.

In this paper we present a novel slanted-plane MRF model which reasons jointly about occlusion boundaries as well as depth. We formulate the problem as the one of inference in a hybrid MRF composed of both continuous (i.e., slanted 3D planes) and discrete (i.e., occlusion boundaries) random variables. This allows us to define potentials encoding the ownership of the pixels that compose the boundary between segments, as well as potentials encoding which junctions are physically possible. Our approach outperforms the state-of-the-art on Middlebury high resolution imagery as well as in the more challenging KITTI dataset, while being more efficient than existing slanted plane MRF-based methods, taking on average 2 minutes to perform inference on high resolution imagery.