Kevin Leyton-Brown

AI
h-index97
43papers
4,216citations
Novelty52%
AI Score53

43 Papers

CLJul 13, 2023Code
Generating Benchmarks for Factuality Evaluation of Language Models

Dor Muhlgay, Ori Ram, Inbal Magar et al.

Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing methods for factuality evaluation of LLM generation focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent domain specific or rare facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create three benchmarks: Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available in https://github.com/AI21Labs/factor.

CLDec 21, 2022Code
Parallel Context Windows for Large Language Models

Nir Ratner, Yoav Levine, Yonatan Belinkov et al.

When applied to processing long text, Large Language Models (LLMs) are limited by their context window. Existing efforts to address this limitation involve training specialized architectures, and cannot be easily applied to off-the-shelf LLMs. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The key to the approach is to carve a long context into chunks (``windows''), restrict the attention mechanism to apply only within each window, and re-use the positional embeddings across the windows. Our main results test the PCW approach on in-context learning with models that range in size between 750 million and 178 billion parameters, and show substantial improvements for tasks with diverse input and output spaces. We show additional benefits in other settings where long context windows may be beneficial: multi-hop questions and retrieval-augmented question answering with multiple retrieved documents. Our results highlight Parallel Context Windows as a promising method for applying off-the-shelf LLMs in a range of settings that require long text sequences. We make our code publicly available at https://github.com/ai21labs/parallel-context-windows.

CLJan 31, 2023
In-Context Retrieval-Augmented Language Models

Ori Ram, Yoav Levine, Itay Dalmedigos et al.

Retrieval-Augmented Language Modeling (RALM) methods, which condition a language model (LM) on relevant documents from a grounding corpus during generation, were shown to significantly improve language modeling performance. In addition, they can mitigate the problem of factually inaccurate text generation and provide natural source attribution mechanism. Existing RALM approaches focus on modifying the LM architecture in order to facilitate the incorporation of external information, significantly complicating deployment. This paper considers a simple alternative, which we dub In-Context RALM: leaving the LM architecture unchanged and prepending grounding documents to the input, without any further training of the LM. We show that In-Context RALM that builds on off-the-shelf general purpose retrievers provides surprisingly large LM gains across model sizes and diverse corpora. We also demonstrate that the document retrieval and ranking mechanism can be specialized to the RALM setting to further boost performance. We conclude that In-Context RALM has considerable potential to increase the prevalence of LM grounding, particularly in settings where a pretrained LM must be used without modification or even via API access.

CYOct 31, 2022
Artificial Intelligence and Life in 2030: The One Hundred Year Study on Artificial Intelligence

Peter Stone, Rodney Brooks, Erik Brynjolfsson et al.

In September 2016, Stanford's "One Hundred Year Study on Artificial Intelligence" project (AI100) issued the first report of its planned long-term periodic assessment of artificial intelligence (AI) and its impact on society. It was written by a panel of 17 study authors, each of whom is deeply rooted in AI research, chaired by Peter Stone of the University of Texas at Austin. The report, entitled "Artificial Intelligence and Life in 2030," examines eight domains of typical urban settings on which AI is likely to have impact over the coming years: transportation, home and service robots, healthcare, education, public safety and security, low-resource communities, employment and workplace, and entertainment. It aims to provide the general public with a scientifically and technologically accurate portrayal of the current state of AI and its potential and to help guide decisions in industry and governments, as well as to inform research and development in the field. The charge for this report was given to the panel by the AI100 Standing Committee, chaired by Barbara Grosz of Harvard University.

CLMay 1, 2022
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Ehud Karpas, Omri Abend, Yonatan Belinkov et al.

Huge language models (LMs) have ushered in a new era for AI, serving as a gateway to natural-language-based knowledge tasks. Although an essential element of modern AI, LMs are also inherently limited in a number of ways. We discuss these limitations and how they can be avoided by adopting a systems approach. Conceptualizing the challenge as one that involves knowledge and reasoning in addition to linguistic processing, we define a flexible architecture with multiple neural models, complemented by discrete knowledge and reasoning modules. We describe this neuro-symbolic architecture, dubbed the Modular Reasoning, Knowledge and Language (MRKL, pronounced "miracle") system, some of the technical challenges in implementing it, and Jurassic-X, AI21 Labs' MRKL system implementation.

CLApr 21, 2022
Standing on the Shoulders of Giant Frozen Language Models

Yoav Levine, Itay Dalmedigos, Ori Ram et al.

Huge pretrained language models (LMs) have demonstrated surprisingly good zero-shot capabilities on a wide variety of tasks. This gives rise to the appealing vision of a single, versatile model with a wide range of functionalities across disparate applications. However, current leading techniques for leveraging a "frozen" LM -- i.e., leaving its weights untouched -- still often underperform fine-tuning approaches which modify these weights in a task-dependent way. Those, in turn, suffer forgetfulness and compromise versatility, suggesting a tradeoff between performance and versatility. The main message of this paper is that current frozen-model techniques such as prompt tuning are only the tip of the iceberg, and more powerful methods for leveraging frozen LMs can do just as well as fine tuning in challenging domains without sacrificing the underlying model's versatility. To demonstrate this, we introduce three novel methods for leveraging frozen models: input-dependent prompt tuning, frozen readers, and recursive LMs, each of which vastly improves on current frozen-model approaches. Indeed, some of our methods even outperform fine-tuning approaches in domains currently dominated by the latter. The computational cost of each method is higher than that of existing frozen model methods, but still negligible relative to a single pass through a huge frozen LM. Each of these methods constitutes a meaningful contribution in its own right, but by presenting these contributions together we aim to convince the reader of a broader message that goes beyond the details of any given method: that frozen models have untapped potential and that fine-tuning is often unnecessary.

AIMay 25, 2022
Formalizing Preferences Over Runtime Distributions

Devon R. Graham, Kevin Leyton-Brown, Tim Roughgarden

When trying to solve a computational problem, we are often faced with a choice between algorithms that are guaranteed to return the right answer but differ in their runtime distributions (e.g., SAT solvers, sorting algorithms). This paper aims to lay theoretical foundations for such choices by formalizing preferences over runtime distributions. It might seem that we should simply prefer the algorithm that minimizes expected runtime. However, such preferences would be driven by exactly how slow our algorithm is on bad inputs, whereas in practice we are typically willing to cut off occasional, sufficiently long runs before they finish. We propose a principled alternative, taking a utility-theoretic approach to characterize the scoring functions that describe preferences over algorithms. These functions depend on the way our value for solving our problem decreases with time and on the distribution from which captimes are drawn. We describe examples of realistic utility functions and show how to leverage a maximum-entropy approach for modeling underspecified captime distributions. Finally, we show how to efficiently estimate an algorithm's expected utility from runtime samples.

AISep 2, 2022
Better Peer Grading through Bayesian Inference

Hedayat Zarkoob, Greg d'Eon, Lena Podina et al.

Peer grading systems aggregate noisy reports from multiple students to approximate a true grade as closely as possible. Most current systems either take the mean or median of reported grades; others aim to estimate students' grading accuracy under a probabilistic model. This paper extends the state of the art in the latter approach in three key ways: (1) recognizing that students can behave strategically (e.g., reporting grades close to the class average without doing the work); (2) appropriately handling censored data that arises from discrete-valued grading rubrics; and (3) using mixed integer programming to improve the interpretability of the grades assigned to students. We show how to make Bayesian inference practical in this model and evaluate our approach on both synthetic and real-world data obtained by using our implemented system in four large classes. These extensive experiments show that grade aggregation using our model accurately estimates true grades, students' likelihood of submitting uninformative grades, and the variation in their inherent grading error; we also characterize our models' robustness.

AINov 22, 2022
UNSAT Solver Synthesis via Monte Carlo Forest Search

Chris Cameron, Jason Hartford, Taylor Lundy et al.

We introduce Monte Carlo Forest Search (MCFS), a class of reinforcement learning (RL) algorithms for learning policies in {tree MDPs}, for which policy execution involves traversing an exponential-sized tree. Examples of such problems include proving unsatisfiability of a SAT formula; counting the number of solutions of a satisfiable SAT formula; and finding the optimal solution to a mixed-integer program. MCFS algorithms can be seen as extensions of Monte Carlo Tree Search (MCTS) to cases where, rather than finding a good path (solution) within a tree, the problem is to find a small tree within a forest of candidate trees. We instantiate and evaluate our ideas in an algorithm that we dub Knuth Synthesis, an MCFS algorithm that learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem, with the objective of achieving good average-case performance on a given distribution of unsatisfiable problem instances. Knuth Synthesis is the first RL approach to avoid the prohibitive costs of policy evaluations in an exponentially-sized tree, leveraging two key ideas: first, we estimate tree size by randomly sampling paths and measuring their lengths, drawing on an unbiased approximation due to Knuth (1975); second, we query a strong solver at a user-defined depth rather than learning a policy across the whole tree, to focus our policy search on early decisions that offer the greatest potential for reducing tree size. We matched or exceeded the performance of a strong baseline on three well-known SAT distributions, facing problems that were two orders of magnitude more challenging than those addressed in previous RL studies.

AIOct 31, 2023
Utilitarian Algorithm Configuration

Devon R. Graham, Kevin Leyton-Brown, Tim Roughgarden

We present the first nontrivial procedure for configuring heuristic algorithms to maximize the utility provided to their end users while also offering theoretical guarantees about performance. Existing procedures seek configurations that minimize expected runtime. However, very recent theoretical work argues that expected runtime minimization fails to capture algorithm designers' preferences. Here we show that the utilitarian objective also confers significant algorithmic benefits. Intuitively, this is because mean runtime is dominated by extremely long runs even when they are incredibly rare; indeed, even when an algorithm never gives rise to such long runs, configuration procedures that provably minimize mean runtime must perform a huge number of experiments to demonstrate this fact. In contrast, utility is bounded and monotonically decreasing in runtime, allowing for meaningful empirical bounds on a configuration's performance. This paper builds on this idea to describe effective and theoretically sound configuration procedures. We prove upper bounds on the runtime of these procedures that are similar to theoretical lower bounds, while also demonstrating their performance empirically.

LGJun 7, 2023
How to Evaluate Behavioral Models

Greg d'Eon, Sophie Greenwood, Kevin Leyton-Brown et al.

Researchers building behavioral models, such as behavioral game theorists, use experimental data to evaluate predictive models of human behavior. However, there is little agreement about which loss function should be used in evaluations, with error rate, negative log-likelihood, cross-entropy, Brier score, and squared L2 error all being common choices. We attempt to offer a principled answer to the question of which loss functions should be used for this task, formalizing axioms that we argue loss functions should satisfy. We construct a family of loss functions, which we dub "diagonal bounded Bregman divergences", that satisfy all of these axioms. These rule out many loss functions used in practice, but notably include squared L2 error; we thus recommend its use for evaluating behavioral models.

CLApr 20
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

Taylor Lundy, Narun K. Raman, Kevin Leyton-Brown

LLM benchmarks are increasingly dynamic: instead of containing a fixed set of questions, they define templates and parameters that can generate an effectively unlimited number of question variants. This flexibility is valuable, but it makes evaluation expensive -- especially when the goal is not just determining an average score, but reliably identifying a model's weak spots. This paper introduces a new methodology for identifying hard questions in dynamic benchmarks. It leverages COUP, a recent Bayesian optimization algorithm (Graham, Velez & Leyton-Brown, 2026), after introducing several substantive modifications to make the algorithm suitable for practical LLM pipelines. We also wrap it in a tool that supports flexible choices of datasets and utility functions, enabling users to target the kinds of questions they care about (e.g., low-accuracy questions; questions that are unusually hard relative to their measured complexity). In experiments across a range of benchmarks, we show that our method, dubbed $\texttt{QuickScope}$, discovers truly difficult questions more sample efficiently than standard baselines, while also reducing false positives from noisy outcomes.

CLFeb 18, 2025Code
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Narun Raman, Taylor Lundy, Thiago Amin et al.

How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examined each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.

AIJun 11, 2017Code
Deep Optimization for Spectrum Repacking

Neil Newman, Alexandre Fréchette, Kevin Leyton-Brown

Over 13 months in 2016-17 the FCC conducted an "incentive auction" to repurpose radio spectrum from broadcast television to wireless internet. In the end, the auction yielded $19.8 billion, $10.05 billion of which was paid to 175 broadcasters for voluntarily relinquishing their licenses across 14 UHF channels. Stations that continued broadcasting were assigned potentially new channels to fit as densely as possible into the channels that remained. The government netted more than $7 billion (used to pay down the national debt) after covering costs. A crucial element of the auction design was the construction of a solver, dubbed SATFC, that determined whether sets of stations could be "repacked" in this way; it needed to run every time a station was given a price quote. This paper describes the process by which we built SATFC. We adopted an approach we dub "deep optimization", taking a data-driven, highly parametric, and computationally intensive approach to solver design. More specifically, to build SATFC we designed software that could pair both complete and local-search SAT-encoded feasibility checking with a wide range of domain-specific techniques. We then used automatic algorithm configuration techniques to construct a portfolio of eight complementary algorithms to be run in parallel, aiming to achieve good performance on instances that arose in proprietary auction simulations. To evaluate the impact of our solver in this paper, we built an open-source reverse auction simulator. We found that within the short time budget required in practice, SATFC solved more than 95% of the problems it encountered. Furthermore, the incentive auction paired with SATFC produced nearly optimal allocations in a restricted setting and substantially outperformed other alternatives at national scale.

CLMay 10, 2024
What Can Natural Language Processing Do for Peer Review?

Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen et al.

The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.

CLFeb 14, 2024
STEER: Assessing the Economic Rationality of Large Language Models

Narun Raman, Taylor Lundy, Samuel Amouyal et al.

There is increasing interest in using LLMs as decision-making "agents." Doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLMs performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing the both current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.

CYMay 18, 2025
How Malicious AI Swarms Can Threaten Democracy: The Fusion of Agentic AI and LLMs Marks a New Frontier in Information Warfare

Daniel Thilo Schroeder, Meeyoung Cha, Andrea Baronchelli et al.

Public opinion manipulation has entered a new phase, amplifying its roots in rhetoric and propaganda. Advances in large language models (LLMs) and autonomous agents now let influence campaigns reach unprecedented scale and precision. Researchers warn AI could foster mass manipulation. Generative tools can expand propaganda output without sacrificing credibility and inexpensively create election falsehoods that are rated as more human-like than those written by humans. Techniques meant to refine AI reasoning, such as chain-of-thought prompting, can just as effectively be used to generate more convincing falsehoods. Enabled by these capabilities, another disruptive threat is emerging: swarms of collaborative, malicious AI agents. Fusing LLM reasoning with multi-agent architectures, these systems are capable of coordinating autonomously, infiltrating communities, and fabricating consensus cheaply. By adaptively mimicking human social dynamics, they threaten democracy.

CLJul 21, 2025
Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Narun Raman, Taylor Lundy, Kevin Leyton-Brown

When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K) and 27 different LLMs (including small models such as Qwen-2.5 7B, mid-sized models such as Llama-3.3 70B, and large state-of-the-art models such as OpenAI's o3). For each model--benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only \emph{before} being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning \emph{after} being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We identify and quantify the signals models are using when answering MCQA questions, and offer practical guidelines when analyzing results from MCQA that better reflect LLMs' genuine reasoning capabilities.

AIOct 16, 2025
Practical, Utilitarian Algorithm Configuration

Devon Graham, Eros Rojas Velez, Kevin Leyton-Brown

Utilitarian algorithm configuration identifies a parameter setting for a given algorithm that maximizes a user's utility. Utility functions offer a theoretically well-grounded approach to optimizing decision-making under uncertainty and are flexible enough to capture a user's preferences over algorithm runtimes (e.g., they can describe a sharp cutoff after which a solution is no longer required, a per-hour cost for compute, or diminishing returns from algorithms that take longer to run). COUP is a recently-introduced utilitarian algorithm configuration procedure which was designed mainly to offer strong theoretical guarantees about the quality of the configuration it returns, with less attention paid to its practical performance. This paper closes that gap, bringing theoretically-grounded, utilitarian algorithm configuration to the point where it is competitive with widely used, heuristic configuration procedures that offer no performance guarantees. We present a series of improvements to COUP that improve its empirical performance without degrading its theoretical guarantees and demonstrate their benefit experimentally. Using a case study, we also illustrate ways of exploring the robustness of a given solution to the algorithm selection problem to variations in the utility function.

LGMar 7, 2025
ElementaryNet: A Non-Strategic Neural Network for Predicting Human Behavior in Normal-Form Games

Greg d'Eon, Hala Murad, Kevin Leyton-Brown et al.

Behavioral game theory models serve two purposes: yielding insights into how human decision-making works, and predicting how people would behave in novel strategic settings. A system called GameNet represents the state of the art for predicting human behavior in the setting of unrepeated simultaneous-move games, combining a simple "level-k" model of strategic reasoning with a complex neural network model of non-strategic "level-0" behavior. Although this reliance on well-established ideas from cognitive science ought to make GameNet interpretable, the flexibility of its level-0 model raises the possibility that it is able to emulate strategic reasoning. In this work, we prove that GameNet's level-0 model is indeed too general. We then introduce ElementaryNet, a novel neural network that is provably incapable of expressing strategic behavior. We show that these additional restrictions are empirically harmless, with ElementaryNet and GameNet having statistically indistinguishable performance. We then show how it is possible to derive insights about human behavior by varying ElementaryNet's features and interpreting its parameters, finding evidence of iterative reasoning, learning about the depth of this reasoning process, and showing the value of a rich level-0 specification.

AIJun 16, 2024
Understanding Understanding: A Pragmatic Framework Motivated by Large Language Models

Kevin Leyton-Brown, Yoav Shoham

Motivated by the rapid ascent of Large Language Models (LLMs) and debates about the extent to which they possess human-level qualities, we propose a framework for testing whether any agent (be it a machine or a human) understands a subject matter. In Turing-test fashion, the framework is based solely on the agent's performance, and specifically on how well it answers questions. Elements of the framework include circumscribing the set of questions (the "scope of understanding"), requiring general competence ("passing grade"), avoiding "ridiculous answers", but still allowing wrong and "I don't know" answers to some questions. Reaching certainty about these conditions requires exhaustive testing of the questions which is impossible for nontrivial scopes, but we show how high confidence can be achieved via random sampling and the application of probabilistic confidence bounds. We also show that accompanying answers with explanations can improve the sample complexity required to achieve acceptable bounds, because an explanation of an answer implies the ability to answer many similar questions. According to our framework, current LLMs cannot be said to understand nontrivial domains, but as the framework provides a practical recipe for testing understanding, it thus also constitutes a tool for building AI agents that do understand.

GTFeb 29, 2024
Understanding Iterative Combinatorial Auction Designs via Multi-Agent Reinforcement Learning

Greg d'Eon, Neil Newman, Kevin Leyton-Brown

Iterative combinatorial auctions are widely used in high stakes settings such as spectrum auctions. Such auctions can be hard to analyze, making it difficult for bidders to determine how to behave and for designers to optimize auction rules to ensure desirable outcomes such as high revenue or welfare. In this paper, we investigate whether multi-agent reinforcement learning (MARL) algorithms can be used to understand iterative combinatorial auctions, given that these algorithms have recently shown empirical success in several other domains. We find that MARL can indeed benefit auction analysis, but that deploying it effectively is nontrivial. We begin by describing modelling decisions that keep the resulting game tractable without sacrificing important features such as imperfect information or asymmetry between bidders. We also discuss how to navigate pitfalls of various MARL algorithms, how to overcome challenges in verifying convergence, and how to generate and interpret multiple equilibria. We illustrate the promise of our resulting approach by using it to evaluate a specific rule change to a clock auction, finding substantially different auction outcomes due to complex changes in bidders' behavior.

AIFeb 24, 2022
Matching Papers and Reviewers at Large Conferences

Kevin Leyton-Brown, Mausam, Yatin Nandwani et al.

Peer-reviewed conferences, the main publication venues in CS, rely critically on matching highly qualified reviewers for each paper. Because of the growing scale of these conferences, the tight timelines on which they operate, and a recent surge in explicitly dishonest behavior, there is now no alternative to performing this matching in an automated way. This paper studies a novel reviewer-paper matching approach that was recently deployed in the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), and has since been adopted (wholly or partially) by other conferences including ICML 2022, AAAI 2022, and IJCAI 2022. This approach has three main elements: (1) collecting and processing input data to identify problematic matches and generate reviewer-paper scores; (2) formulating and solving an optimization problem to find good reviewer-paper matchings; and (3) a two-phase reviewing process that shifts reviewing resources away from papers likely to be rejected and towards papers closer to the decision boundary. This paper also describes an evaluation of these innovations based on an extensive post-hoc analysis on real data -- including a comparison with the matching algorithm used in AAAI's previous (2020) iteration -- and supplements this with additional numerical experimentation.

LGJul 1, 2021
The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models

Greg d'Eon, Jason d'Eon, James R. Wright et al.

Supervised learning models often make systematic errors on rare subsets of the data. When these subsets correspond to explicit labels in the data (e.g., gender, race) such poor performance can be identified straightforwardly. This paper introduces a method for discovering systematic errors that do not correspond to such explicitly labelled subgroups. The key idea is that similar inputs tend to have similar representations in the final hidden layer of a neural network. We leverage this structure by "shining a spotlight" on this representation space to find contiguous regions where the model performs poorly. We show that the spotlight surfaces semantically meaningful areas of weakness in a wide variety of existing models spanning computer vision, NLP, and recommender systems.

LGJun 18, 2021
The Perils of Learning Before Optimizing

Chris Cameron, Jason Hartford, Taylor Lundy et al.

Formulating real-world optimization problems often begins with making predictions from historical data (e.g., an optimizer that aims to recommend fast routes relies upon travel-time predictions). Typically, learning the prediction model used to generate the optimization problem and solving that problem are performed in two separate stages. Recent work has showed how such prediction models can be learned end-to-end by differentiating through the optimization task. Such methods often yield empirical improvements, which are typically attributed to end-to-end making better error tradeoffs than the standard loss function used in a two-stage solution. We refine this explanation and more precisely characterize when end-to-end can improve performance. When prediction targets are stochastic, a two-stage solution must make an a priori choice about which statistics of the target distribution to model-we consider expectations over prediction targets-while an end-to-end solution can make this choice adaptively. We show that the performance gap between a two-stage and end-to-end approach is closely related to the price of correlation concept in stochastic optimization and show the implications of some existing POC results for the predict-then-optimize problem. We then consider a novel and particularly practical setting, where multiple prediction targets are combined to obtain each of the objective function's coefficients. We give explicit constructions where (1) two-stage performs unboundedly worse than end-to-end; and (2) two-stage is optimal. We use simulations to experimentally quantify performance gaps and identify a wide range of real-world applications from the literature whose objective functions rely on multiple prediction targets, suggesting that end-to-end learning could yield significant improvements.

LGNov 2, 2020
Exemplar Guided Active Learning

Jason Hartford, Kevin Leyton-Brown, Hadas Raviv et al.

We consider the problem of wisely using a limited budget to label a small subset of a large unlabeled dataset. We are motivated by the NLP problem of word sense disambiguation. For any word, we have a set of candidate labels from a knowledge base, but the label set is not necessarily representative of what occurs in the data: there may exist labels in the knowledge base that very rarely occur in the corpus because the sense is rare in modern English; and conversely there may exist true labels that do not exist in our knowledge base. Our aim is to obtain a classifier that performs as well as possible on examples of each "common class" that occurs with frequency above a given threshold in the unlabeled set while annotating as few examples as possible from "rare classes" whose labels occur with less than this frequency. The challenge is that we are not informed which labels are common and which are rare, and the true label distribution may exhibit extreme skew. We describe an active learning approach that (1) explicitly searches for rare classes by leveraging the contextual embedding spaces provided by modern language models, and (2) incorporates a stopping rule that ignores classes once we prove that they occur below our target threshold with high probability. We prove that our algorithm only costs logarithmically more than a hypothetical approach that knows all true label frequencies and show experimentally that incorporating automated search can significantly reduce the number of samples needed to reach target accuracy levels.

LGOct 5, 2020
PMI-Masking: Principled masking of correlated spans

Yoav Levine, Barak Lenz, Opher Lieber et al.

Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training.

MEJun 19, 2020
Valid Causal Inference with (Some) Invalid Instruments

Jason Hartford, Victor Veitch, Dhanya Sridhar et al.

Instrumental variable methods provide a powerful approach to estimating causal effects in the presence of unobserved confounding. But a key challenge when applying them is the reliance on untestable "exclusion" assumptions that rule out any relationship between the instrument variable and the response that is not mediated by the treatment. In this paper, we show how to perform consistent IV estimation despite violations of the exclusion assumption. In particular, we show that when one has multiple candidate instruments, only a majority of these candidates---or, more generally, the modal candidate-response relationship---needs to be valid to estimate the causal effect. Our approach uses an estimate of the modal prediction from an ensemble of instrumental variable estimators. The technique is simple to apply and is "black-box" in the sense that it may be used with any instrumental variable estimator as long as the treatment effect is identified for each valid instrument independently. As such, it is compatible with recent machine-learning based estimators that allow for the estimation of conditional average treatment effects (CATE) on complex, high dimensional data. Experimentally, we achieve accurate estimates of conditional average treatment effects using an ensemble of deep network-based estimators, including on a challenging simulated Mendelian Randomization problem.

GTJun 8, 2020
Learning under Invariable Bayesian Safety

Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown et al.

A recent body of work addresses safety constraints in explore-and-exploit systems. Such constraints arise where, for example, exploration is carried out by individuals whose welfare should be balanced with overall welfare. In this paper, we adopt a model inspired by recent work on a bandit-like setting for recommendations. We contribute to this line of literature by introducing a safety constraint that should be respected in every round and determines that the expected value in each round is above a given threshold. Due to our modeling, the safe explore-and-exploit policy deserves careful planning, or otherwise, it will lead to sub-optimal welfare. We devise an asymptotically optimal algorithm for the setting and analyze its instance-dependent convergence rate.

CYMar 21, 2020
Smarter Parking: Using AI to Identify Parking Inefficiencies in Vancouver

Devon Graham, Satish Kumar Sarraf, Taylor Lundy et al.

On-street parking is convenient, but has many disadvantages: on-street spots come at the expense of other road uses such as traffic lanes, transit lanes, bike lanes, or parklets; drivers looking for parking contribute substantially to traffic congestion and hence to greenhouse gas emissions; safety is reduced both due to the fact that drivers looking for spots are more distracted than other road users and that people exiting parked cars pose a risk to cyclists. These social costs may not be worth paying when off-street parking lots are nearby and have surplus capacity. To see where this might be true in downtown Vancouver, we used artificial intelligence techniques to estimate the amount of time it would take drivers to both park on and off street for destinations throughout the city. For on-street parking, we developed (1) a deep-learning model of block-by-block parking availability based on data from parking meters and audits and (2) a computational simulation of drivers searching for an on-street spot. For off-street parking, we developed a computational simulation of the time it would take drivers drive from their original destination to the nearest city-owned off-street lot and then to queue for a spot based on traffic and lot occupancy data. Finally, in both cases we also computed the time it would take the driver to walk from their parking spot to their original destination. We compared these time estimates for destinations in each block of Vancouver's downtown core and each hour of the day. We found many areas where off street would actually save drivers time over searching the streets for a spot, and many more where the time cost for parking off street was small. The identification of such areas provides an opportunity for the city to repurpose valuable curbside space for community-friendly uses more in line with its transportation goals.

GTMay 16, 2019
Fiduciary Bandits

Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown et al.

Recommendation systems often face exploration-exploitation tradeoffs: the system can only learn about the desirability of new options by recommending them to some user. Such systems can thus be modeled as multi-armed bandit settings; however, users are self-interested and cannot be made to follow recommendations. We ask whether exploration can nevertheless be performed in a way that scrupulously respects agents' interests---i.e., by a system that acts as a fiduciary. More formally, we introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone. Our main contribution is a positive result: an asymptotically optimal, incentive compatible, and ex-ante individually rational recommendation algorithm.

AIFeb 14, 2019
Procrastinating with Confidence: Near-Optimal, Anytime, Adaptive Algorithm Configuration

Robert Kleinberg, Kevin Leyton-Brown, Brendan Lucier et al.

Algorithm configuration methods optimize the performance of a parameterized heuristic algorithm on a given distribution of problem instances. Recent work introduced an algorithm configuration procedure ("Structured Procrastination") that provably achieves near optimal performance with high probability and with nearly minimal runtime in the worst case. It also offers an $\textit{anytime}$ property: it keeps tightening its optimality guarantees the longer it is run. Unfortunately, Structured Procrastination is not $\textit{adaptive}$ to characteristics of the parameterized algorithm: it treats every input like the worst case. Follow-up work ("LeapsAndBounds") achieves adaptivity but trades away the anytime property. This paper introduces a new algorithm, "Structured Procrastination with Confidence", that preserves the near-optimality and anytime properties of Structured Procrastination while adding adaptivity. In particular, the new algorithm will perform dramatically faster in settings where many algorithm configurations perform poorly. We show empirically both that such settings arise frequently in practice and that the anytime property is useful for finding good configurations quickly.

MLMar 7, 2018
Deep Models of Interactions Across Sets

Jason Hartford, Devon R Graham, Kevin Leyton-Brown et al.

We use deep learning to model interactions across two or more sets of objects, such as user-movie ratings, protein-drug bindings, or ternary user-item-tag interactions. The canonical representation of such interactions is a matrix (or a higher-dimensional tensor) with an exchangeability property: the encoding's meaning is not changed by permuting rows or columns. We argue that models should hence be Permutation Equivariant (PE): constrained to make the same predictions across such permutations. We present a parameter-sharing scheme and prove that it could not be made any more expressive without violating PE. This scheme yields three benefits. First, we demonstrate state-of-the-art performance on multiple matrix completion benchmarks. Second, our models require a number of parameters independent of the numbers of objects, and thus scale well to large datasets. Third, models can be queried about new objects that were not available at training time, but for which interactions have since been observed. In experiments, our models achieved surprisingly good generalization performance on this matrix extrapolation task, both within domains (e.g., new users and new movies drawn from the same distribution used for training) and even across domains (e.g., predicting music ratings after training on movies).

AIMar 30, 2017
Efficient Benchmarking of Algorithm Configuration Procedures via Model-Based Surrogates

Katharina Eggensperger, Marius Lindauer, Holger H. Hoos et al.

The optimization of algorithm (hyper-)parameters is crucial for achieving peak performance across a wide range of domains, ranging from deep neural networks to solvers for hard combinatorial problems. The resulting algorithm configuration (AC) problem has attracted much attention from the machine learning community. However, the proper evaluation of new AC procedures is hindered by two key hurdles. First, AC benchmarks are hard to set up. Second and even more significantly, they are computationally expensive: a single run of an AC procedure involves many costly runs of the target algorithm whose performance is to be optimized in a given AC benchmark scenario. One common workaround is to optimize cheap-to-evaluate artificial benchmark functions (e.g., Branin) instead of actual algorithms; however, these have different properties than realistic AC problems. Here, we propose an alternative benchmarking approach that is similarly cheap to evaluate but much closer to the original AC problem: replacing expensive benchmarks by surrogate benchmarks constructed from AC benchmarks. These surrogate benchmarks approximate the response surface corresponding to true target algorithm performance using a regression model, and the original and surrogate benchmark share the same (hyper-)parameter space. In our experiments, we construct and evaluate surrogate benchmarks for hyperparameter optimization as well as for AC problems that involve performance optimization of solvers for hard combinatorial problems, drawing training data from the runs of existing AC procedures. We show that our surrogate benchmarks capture overall important characteristics of the AC scenarios, such as high- and low-performing regions, from which they were derived, while being much easier to use and orders of magnitude cheaper to evaluate.

APDec 30, 2016
Counterfactual Prediction with Deep Instrumental Variables Networks

Jason Hartford, Greg Lewis, Kevin Leyton-Brown et al.

We are in the middle of a remarkable rise in the use and capability of artificial intelligence. Much of this growth has been fueled by the success of deep learning architectures: models that map from observables to outputs via multiple layers of latent representations. These deep learning algorithms are effective tools for unstructured prediction, and they can be combined in AI systems to solve complex automated reasoning problems. This paper provides a recipe for combining ML algorithms to solve for causal effects in the presence of instrumental variables -- sources of treatment randomization that are conditionally independent from the response. We show that a flexible IV specification resolves into two prediction tasks that can be solved with deep neural nets: a first-stage network for treatment prediction and a second-stage network whose loss function involves integration over the conditional treatment distribution. This Deep IV framework imposes some specific structure on the stochastic gradient descent routine used for training, but it is general enough that we can take advantage of off-the-shelf ML capabilities and avoid extensive algorithm customization. We outline how to obtain out-of-sample causal validation in order to avoid over-fit. We also introduce schemes for both Bayesian and frequentist inference: the former via a novel adaptation of dropout training, and the latter via a data splitting routine.

AIJun 8, 2015
ASlib: A Benchmark Library for Algorithm Selection

Bernd Bischl, Pascal Kerschke, Lars Kotthoff et al.

The task of algorithm selection involves choosing an algorithm from a set of algorithms on a per-instance basis in order to exploit the varying performance of algorithms over a set of instances. The algorithm selection problem is attracting increasing attention from researchers and practitioners in AI. Years of fruitful applications in a number of domains have resulted in a large amount of data, but the community lacks a standard format or repository for this data. This situation makes it difficult to share and compare different approaches effectively, as is done in other, more established fields. It also unnecessarily hinders new researchers who want to work in this area. To address this problem, we introduce a standardized format for representing algorithm selection scenarios and a repository that contains a growing number of data sets from the literature. Our format has been designed to be able to express a wide variety of different scenarios. Demonstrating the breadth and power of our platform, we describe a set of example experiments that build and evaluate algorithm selection models through a common interface. The results display the potential of algorithm selection to achieve significant performance improvements across a broad range of problems and algorithms.

AIMay 5, 2015
The Configurable SAT Solver Challenge (CSSC)

Frank Hutter, Marius Lindauer, Adrian Balint et al.

It is well known that different solution strategies work well for different types of instances of hard combinatorial problems. As a consequence, most solvers for the propositional satisfiability problem (SAT) expose parameters that allow them to be customized to a particular family of instances. In the international SAT competition series, these parameters are ignored: solvers are run using a single default parameter setting (supplied by the authors) for all benchmark instances in a given track. While this competition format rewards solvers with robust default settings, it does not reflect the situation faced by a practitioner who only cares about performance on one particular application and can invest some time into tuning solver parameters for this application. The new Configurable SAT Solver Competition (CSSC) compares solvers in this latter setting, scoring each solver by the performance it achieved after a fully automated configuration step. This article describes the CSSC in more detail, and reports the results obtained in its two instantiations so far, CSSC 2013 and 2014.

GTAug 4, 2014
Computational Analysis of Perfect-Information Position Auctions

David R. M Thompson, Kevin Leyton-Brown

After experimentation with other designs, the major search engines converged on the weighted, generalized second-price auction (wGSP) for selling keyword advertisements. Notably, this convergence occurred before position auctions were well understood (or, indeed, widely studied) theoretically. While much progress has been made since, theoretical analysis is still not able to settle the question of why search engines found wGSP preferable to other position auctions. We approach this question in a new way, adopting a new analytical paradigm we dub "computational mechanism analysis." By sampling position auction games from a given distribution, encoding them in a computationally efficient representation language, computing their Nash equilibria, and then calculating economic quantities of interest, we can quantitatively answer questions that theoretical methods have not. We considered seven widely studied valuation models from the literature and three position auction variants (generalized first price, unweighted generalized second price, and wGSP). We found that wGSP consistently showed the best ads of any position auction, measured both by social welfare and by relevance (expected number of clicks). Even in models where wGSP was already known to have bad worse-case efficiency, we found that it almost always performed well on average. In contrast, we found that revenue was extremely variable across auction mechanisms, and was highly sensitive to equilibrium selection, the preference model, and the valuation distribution.

GTJan 31, 2014
Empirically Evaluating Multiagent Learning Algorithms

Erik Zawadzki, Asher Lipson, Kevin Leyton-Brown

There exist many algorithms for learning how to play repeated bimatrix games. Most of these algorithms are justified in terms of some sort of theoretical guarantee. On the other hand, little is known about the empirical performance of these algorithms. Most such claims in the literature are based on small experiments, which has hampered understanding as well as the development of new multiagent learning (MAL) algorithms. We have developed a new suite of tools for running multiagent experiments: the MultiAgent Learning Testbed (MALT). These tools are designed to facilitate larger and more comprehensive experiments by removing the need to build one-off experimental code. MALT also provides baseline implementations of many MAL algorithms, hopefully eliminating or reducing differences between algorithm implementations and increasing the reproducibility of results. Using this test suite, we ran an experiment unprecedented in size. We analyzed the results according to a variety of performance metrics including reward, maxmin distance, regret, and several notions of equilibrium convergence. We confirmed several pieces of conventional wisdom, but also discovered some surprising results. For example, we found that single-agent $Q$-learning outperformed many more complicated and more modern MAL algorithms.

AIJan 15, 2014
ParamILS: An Automatic Algorithm Configuration Framework

Frank Hutter, Thomas Stuetzle, Kevin Leyton-Brown et al.

The identification of performance-optimizing parameter settings is an important part of the development and application of algorithms. We describe an automatic framework for this algorithm configuration problem. More formally, we provide methods for optimizing a target algorithm's performance on a given class of problem instances by varying a set of ordinal and/or categorical parameters. We review a family of local-search-based algorithm configuration procedures and present novel techniques for accelerating them by adaptively limiting the time spent for evaluating individual configurations. We describe the results of a comprehensive experimental evaluation of our methods, based on the configuration of prominent complete and incomplete algorithms for SAT. We also present what is, to our knowledge, the first published work on automatically configuring the CPLEX mixed integer programming solver. All the algorithms we considered had default parameter settings that were manually identified with considerable effort. Nevertheless, using our automated algorithm configuration procedures, we achieved substantial and consistent performance improvements.

AIOct 7, 2013
Bayesian Optimization With Censored Response Data

Frank Hutter, Holger Hoos, Kevin Leyton-Brown

Bayesian optimization (BO) aims to minimize a given blackbox function using a model that is updated whenever new evidence about the function becomes available. Here, we address the problem of BO under partially right-censored response data, where in some evaluations we only obtain a lower bound on the function value. The ability to handle such response data allows us to adaptively censor costly function evaluations in minimization problems where the cost of a function evaluation corresponds to the function value. One important application giving rise to such censored data is the runtime-minimizing variant of the algorithm configuration problem: finding settings of a given parametric algorithm that minimize the runtime required for solving problem instances from a given distribution. We demonstrate that terminating slow algorithm runs prematurely and handling the resulting right-censored observations can substantially improve the state of the art in model-based algorithm configuration.

LGAug 18, 2012
Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms

Chris Thornton, Frank Hutter, Holger H. Hoos et al.

Many different machine learning algorithms exist; taking into account each algorithm's hyperparameters, there is a staggeringly large number of possible alternatives overall. We consider the problem of simultaneously selecting a learning algorithm and setting its hyperparameters, going beyond previous work that addresses these issues in isolation. We show that this problem can be addressed by a fully automated approach, leveraging recent innovations in Bayesian optimization. Specifically, we consider a wide range of feature selection techniques (combining 3 search and 8 evaluator methods) and all classification approaches implemented in WEKA, spanning 2 ensemble methods, 10 meta-methods, 27 base classifiers, and hyperparameter settings for each classifier. On each of 21 popular datasets from the UCI repository, the KDD Cup 09, variants of the MNIST dataset and CIFAR-10, we show classification performance often much better than using standard selection/hyperparameter optimization methods. We hope that our approach will help non-expert users to more effectively identify machine learning algorithms and hyperparameter settings appropriate to their applications, and hence to achieve improved performance.

GTMay 9, 2012
Temporal Action-Graph Games: A New Representation for Dynamic Games

Albert Xin Jiang, Kevin Leyton-Brown, Avi Pfeffer

In this paper we introduce temporal action graph games (TAGGs), a novel graphical representation of imperfect-information extensive form games. We show that when a game involves anonymity or context-specific utility independencies, its encoding as a TAGG can be much more compact than its direct encoding as a multiagent influence diagram (MAID).We also show that TAGGs can be understood as indirect MAID encodings in which many deterministic chance nodes are introduced. We provide an algorithm for computing with TAGGs, and show both theoretically and empirically that our approach improves significantly on the previous state of the art.