Jayson Lynch

LG
h-index: 45
16 papers
630 citations
Novelty: 53%
AI Score: 58

16 Papers

CC · May 8
Pushing Blocks without Fixed Walls via Checkable Gizmos: Push-1 is PSPACE-Complete

MIT Hardness Group, Josh Brunner, Lily Chung et al. · mit

We prove PSPACE-completeness of Push-1: given a rectangular grid of 1 x 1 cells, each possibly occupied by a movable block, can a robot move from one specified location to another, given the ability to push up to one block at a time? In particular, we remove the need for fixed (immovable) walls from a 2022 result. This fundamental model of block pushing, introduced in 1999, abstracts the mechanics of many video games. It was shown NP-hard in 2000, but its final complexity remained open for 25 years. Our result uses a new framework for checkable gadgets/gizmos, extending a prior framework for checkable gadgets to handle reconfiguration problems, at the cost of requiring a stronger auxiliary gadget. We also introduce a new connection between the motion-planning-through-gadgets framework (with an agent) and the Graph Orientation Reconfiguration Problem (with no agent), including Nondeterministic Constraint Logic.
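
The move rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's construction: the names `push1_step`, `in_bounds`, and `DIRECTIONS` are hypothetical, and the sketch assumes that neither the robot nor a block may leave the rectangular grid.

```python
from typing import FrozenSet, Optional, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical direction encoding for this sketch.
DIRECTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def in_bounds(cell: Cell, rows: int, cols: int) -> bool:
    r, c = cell
    return 0 <= r < rows and 0 <= c < cols


def push1_step(
    robot: Cell,
    blocks: FrozenSet[Cell],
    direction: str,
    rows: int,
    cols: int,
) -> Optional[Tuple[Cell, FrozenSet[Cell]]]:
    """Return the state after one move, or None if the move is illegal.

    Assumption: the grid boundary is impassable for both robot and blocks.
    """
    dr, dc = DIRECTIONS[direction]
    target = (robot[0] + dr, robot[1] + dc)
    if not in_bounds(target, rows, cols):
        return None
    if target not in blocks:
        return target, blocks  # plain step into an empty cell
    beyond = (target[0] + dr, target[1] + dc)
    # Push-1: the robot can push one block, so the cell behind it must be free.
    if not in_bounds(beyond, rows, cols) or beyond in blocks:
        return None
    return target, (blocks - {target}) | {beyond}
```

Because every block is movable, there are no fixed walls in the state; only the grid dimensions constrain movement.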

LG · Nov 26, 2025 · Code
EvilGenie: A Reward Hacking Benchmark

Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI, using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
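
One simple way to detect edits to test files is to hash them before and after an agent runs. The sketch below illustrates that general idea only; it is not taken from the EvilGenie codebase, and the function names `snapshot` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(test_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to test_dir) to the SHA-256 of its bytes."""
    return {
        str(p.relative_to(test_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return paths that were added, deleted, or modified between snapshots."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Taking one snapshot before the agent starts and another after it finishes flags any test file that was added, deleted, or rewritten in between.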

LG · Jan 24, 2025
Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

LG · Nov 26, 2025
On the Origin of Algorithmic Progress in AI

Hans Gundlach, Alex Fogelson, Jayson Lynch et al.

Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling law while finding little scaling difference for many other innovations. These experiments demonstrate that, contrary to standard assumptions, an algorithm's efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.
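
The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers quoted in the abstract and assumes that efficiency gains compose multiplicatively, as the "less than 10x" and "less than 10x" yielding "under 100x" step implies; the variable names are illustrative.

```python
import math

reported_total = 22_000   # Ho et al. estimate quoted in the abstract
ablation_bound = 10       # "less than 10x" from the ablation experiments
literature_bound = 10     # "less than 10x" from innovations outside the ablations
accounted_total = 6_930   # total accounted for once scale dependence is included

# Upper bound on the gains explained without scale dependence.
scale_independent_bound = ablation_bound * literature_bound

# Factor left unexplained if only those gains are counted.
residual_gap = reported_total / scale_independent_bound

# Share of the reported gain that is accounted for, as a plain ratio
# and on a log scale (the natural scale for multiplicative factors).
share_linear = accounted_total / reported_total
share_log = math.log(accounted_total) / math.log(reported_total)

print(scale_independent_bound, residual_gap, share_linear, share_log)
```

Under these assumptions, the scale-independent innovations leave a gap of at least 220x, and the 6,930x figure covers about 31.5% of the reported 22,000x as a plain ratio.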

DS · Apr 30
The Impact of Approximation on Algorithmic Progress

Jeffery Li, Jayson Lynch, Liva Olina et al.

In nearly every discipline, scientific computations are limited by the cost and speed of computation. For example, the best-known exact algorithms for the canonical Traveling Salesman Problem would take centuries to run on an instance of size 1 million. A natural response to such limits is to try to find new algorithms or to parallelize existing ones, but many algorithms are already at their theoretically optimal level and parallelization is often impossible or prohibitively expensive. Starting in the 1960s, computer scientists pursued another solution: allowing solutions to have a small amount of error (i.e., approximating them). In this paper, we survey 118 of the most important algorithm problems in computer science, quantifying the gains and tradeoffs from approximation that have been discovered over the history of the field. Overall, only approximately 20% of problems have benefited from approximation. However, those with good approximate algorithms can be dramatically faster to compute with little cost to accuracy. For example, a quarter of computationally intractable problems (e.g., those that take exponential time to compute) have polynomial-time approximate algorithms. Approximation also increases the number of algorithms that can run in linear time by 23%, opening up new computational opportunities for those working in the big data regime. This work also sheds light on what should be expected from progress in AI, where approximation is at the heart of how deep learning works.

QUANT-PH · Nov 3, 2025
Quantum Deep Learning Still Needs a Quantum Leap

Hans Gundlach, Hrvoje Kukina, Jayson Lynch et al.

Quantum computing technology is advancing rapidly. Yet, even accounting for these trends, a quantum leap would be needed for quantum computers to meaningfully impact deep learning over the coming decade or two. We arrive at this conclusion based on a first-of-its-kind survey of quantum algorithms and how they match potential deep learning applications. This survey reveals three important areas where quantum computing could potentially accelerate deep learning, each of which faces a challenging roadblock to realizing its potential. First, quantum algorithms for matrix multiplication and other algorithms central to deep learning offer small theoretical improvements in the number of operations needed, but this advantage is overwhelmed on practical problem sizes by how slowly quantum computers do each operation. Second, some promising quantum algorithms depend on practical Quantum Random Access Memory (QRAM), which is underdeveloped. Finally, there are quantum algorithms that offer large theoretical advantages, but which are only applicable to special cases, limiting their practical benefits. In each of these areas, we support our arguments using quantitative forecasts of quantum advantage that build on the work by Choi et al. [2023] as well as new research on limitations and quantum hardware trends. Our analysis outlines the current scope of quantum deep learning and points to research directions that could lead to greater practical advances in the field.

AI · Feb 13, 2025
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

Clinton J. Wang, Dean Lee, Cristina Menghini et al.

As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity -- each typically requiring teams of skilled solvers hours to days to complete -- with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity's Last Exam, unveiling models' shortcomings when challenged with problems requiring unstructured and lateral reasoning.

LG · Nov 28, 2025
The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference

Hans Gundlach, Jayson Lynch, Matthias Mertens et al.

Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities per dollar. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset of current and historical prices to run benchmarks to date. We find that the price for a given level of benchmark performance has decreased remarkably fast, around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around $3\times$ per year. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.

AI · Jul 10, 2025
Meek Models Shall Inherit the Earth

Hans Gundlach, Jayson Lynch, Neil Thompson

The past decade has seen incredible scaling of AI systems by a few companies, leading to inequality in AI model performance. This paper argues that, contrary to prevailing intuition, the diminishing returns to compute scaling will lead to a convergence of AI model capabilities. In other words, meek models (those with limited computation budget) shall inherit the earth, approaching the performance level of the best models overall. We develop a model illustrating that under a fixed-distribution next-token objective, the marginal capability returns to raw compute shrink substantially. Given current scaling practices, we argue that these diminishing returns are strong enough that even companies that can scale their models exponentially faster than other organizations will eventually have little advantage in capabilities. As part of our argument, we give several reasons that proxies like training loss differences capture important capability measures using evidence from benchmark data and theoretical performance models. In addition, we analyze empirical data on the capability difference of AI models over time. Finally, in light of the increasing ability of meek models, we argue that AI strategy and policy require reexamination, and we outline the areas this shift will affect.

LG · Dec 31, 2021
A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level

Iddo Drori, Sarah Zhang, Reece Shuttleworth et al.

We demonstrate that a neural network pre-trained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates new questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a new dataset of questions from MIT's largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pre-trained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work is the first to automatically solve university-level mathematics course questions at a human level and the first work to explain and generate university-level mathematics course questions at scale, a milestone for higher education.

LG · Sep 23, 2021
Multidimensional Scaling: Approximation and Complexity

Erik Demaine, Adam Hesterberg, Frederic Koehler et al.

Metric multidimensional scaling (MDS) is a classical method for generating meaningful (non-linear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into low dimensions. Despite its ubiquity, our theoretical understanding of MDS remains limited as its objective function is highly non-convex. In this paper, we prove that minimizing the Kamada-Kawai objective is NP-hard and give a provable approximation algorithm for optimizing it, which in particular is a PTAS on low-diameter graphs. We supplement this result with experiments suggesting possible connections between our greedy approximation algorithm and gradient-based methods.
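
For concreteness, here is a sketch of one common form of the Kamada-Kawai stress, the sum over vertex pairs of (‖x_i − x_j‖ / d(i, j) − 1)², where d is the shortest-path distance in the graph. This normalization is an assumption and may differ from the paper's exact objective; the function names are hypothetical, and the graph is assumed to be connected and unweighted.

```python
import math
from collections import deque
from typing import List, Sequence, Tuple


def graph_distances(n: int, edges: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """All-pairs shortest-path distances of a connected, unweighted graph (BFS)."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    all_dist = []
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        all_dist.append(dist)
    return all_dist


def kamada_kawai_stress(
    points: Sequence[Sequence[float]], dist: List[List[int]]
) -> float:
    """Sum over pairs i < j of (Euclidean distance / graph distance - 1)^2."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) / dist[i][j] - 1.0) ** 2
    return total
```

A path graph laid out on a straight line with unit spacing has zero stress under this definition, since every Euclidean distance matches the corresponding graph distance.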

LG · Jul 2, 2021
Solving Machine Learning Problems

Sunny Tran, Pranav Krishna, Ishan Pakuwal et al.

Can a machine learn Machine Learning? This work trains a machine learning model to solve machine learning problems from a university undergraduate-level course. We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT's 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system demonstrates an overall accuracy of 96% for open-response questions and 97% for multiple-choice questions, compared with MIT students' average of 93%, achieving grade A performance in the course, all in real time. Questions cover all 12 topics taught in the course, excluding coding questions or questions with images. Topics include: (i) basic machine learning principles; (ii) perceptrons; (iii) feature extraction and selection; (iv) logistic regression; (v) regression; (vi) neural networks; (vii) advanced neural networks; (viii) convolutional neural networks; (ix) recurrent neural networks; (x) state machines and MDPs; (xi) reinforcement learning; and (xii) decision trees. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations. An important aspect of our approach is a data-augmentation scheme for generating new example problems. We also train a machine learning model to generate problem hints. Thus, our system automatically generates new questions across topics, answers both open-response questions and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.

CG · Dec 14, 2020
Characterizing Universal Reconfigurability of Modular Pivoting Robots

Hugo A. Akitaya, Erik D. Demaine, Andrei Gonczi et al.

We give both efficient algorithms and hardness results for reconfiguring between two connected configurations of modules in the hexagonal grid. The reconfiguration moves that we consider are "pivots", where a hexagonal module rotates around a vertex shared with another module. Following prior work on modular robots, we define two natural sets of hexagon pivoting moves of increasing power: restricted and monkey moves. When we allow both moves, we present the first universal reconfiguration algorithm, which transforms between any two connected configurations using $O(n^3)$ monkey moves. This result contrasts strongly with the analogous problem for squares, where there are rigid examples that do not have a single pivoting move preserving connectivity. On the other hand, if we only allow restricted moves, we prove that the reconfiguration problem becomes PSPACE-complete. Moreover, we show that, in contrast to hexagons, the reconfiguration problem for pivoting squares is PSPACE-complete regardless of the set of pivoting moves allowed. In the process, we strengthen the reduction framework of Demaine et al. [FUN'18], which we consider to be of independent interest.

CC · Jun 1, 2020
Walking through Doors is Hard, even without Staircases: Proving PSPACE-hardness via Planar Assemblies of Door Gadgets

Hayashi Ani, Jeffrey Bosboom, Erik D. Demaine et al.

A door gadget has two states and three tunnels that can be traversed by an agent (player, robot, etc.): the "open" and "close" tunnels set the gadget's state to open and closed, respectively, while the "traverse" tunnel can be traversed if and only if the door is in the open state. We prove that it is PSPACE-complete to decide whether an agent can move from one location to another through a planar assembly of such door gadgets, removing the traditional need for crossover gadgets and thereby simplifying past PSPACE-hardness proofs of Lemmings and Nintendo games Super Mario Bros., Legend of Zelda, and Donkey Kong Country. Our result holds in all but one of the possible local planar embeddings of the open, close, and traverse tunnels within a door gadget; in the one remaining case, we prove NP-hardness. We also introduce and analyze a simpler type of door gadget, called the self-closing door. This gadget has two states and only two tunnels, similar to the "open" and "traverse" tunnels of doors, except that traversing the traverse tunnel also closes the door. In a variant called the symmetric self-closing door, the "open" tunnel can be traversed if and only if the door is closed. We prove that it is PSPACE-complete to decide whether an agent can move from one location to another through a planar assembly of either type of self-closing door. Then we apply this framework to prove new PSPACE-hardness results for eight different 3D Mario games and Sokobond.
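
The gadget definitions above translate directly into small state machines. The sketch below is illustrative only: the class and method names are hypothetical, and each method returns whether the agent was allowed through the corresponding tunnel.

```python
class Door:
    """Door gadget: two states, three tunnels (open, close, traverse)."""

    def __init__(self, is_open: bool = False) -> None:
        self.is_open = is_open

    def use_open_tunnel(self) -> bool:
        self.is_open = True   # always traversable; sets the state to open
        return True

    def use_close_tunnel(self) -> bool:
        self.is_open = False  # always traversable; sets the state to closed
        return True

    def use_traverse_tunnel(self) -> bool:
        return self.is_open   # traversable only while open; state unchanged


class SelfClosingDoor:
    """Self-closing door: two states, two tunnels (open, traverse)."""

    def __init__(self, is_open: bool = False, symmetric: bool = False) -> None:
        self.is_open = is_open
        self.symmetric = symmetric

    def use_open_tunnel(self) -> bool:
        # Symmetric variant: the open tunnel is usable only while closed.
        if self.symmetric and self.is_open:
            return False
        self.is_open = True
        return True

    def use_traverse_tunnel(self) -> bool:
        if not self.is_open:
            return False
        self.is_open = False  # traversing also closes the door
        return True
```

The hardness results concern assemblies of many such gadgets connected in the plane; a single gadget on its own is trivial to analyze.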

AI · Feb 12, 2020
Recursed is not Recursive: A Jarring Result

Erik Demaine, Justin Kopinsky, Jayson Lynch

Recursed is a 2D puzzle platform video game featuring treasure chests that, when jumped into, instantiate a room that can later be exited (similar to function calls), optionally generating a jar that returns back to that room (similar to continuations). We prove that Recursed is RE-complete and thus undecidable (not recursive) by a reduction from the Post Correspondence Problem. Our reduction is "practical": the reduction from PCP results in fully playable levels that abide by all constraints governing levels (including the 15x20 room size) designed for the main game. Our reduction is also "efficient": a Turing machine can be simulated by a Recursed level whose size is linear in the encoding size of the Turing machine and whose solution length is polynomial in the running time of the Turing machine.
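
For readers unfamiliar with the Post Correspondence Problem: an instance is a list of tiles, each with a top and a bottom string, and a solution is a nonempty sequence of tile indices (repeats allowed) whose concatenated tops equal the concatenated bottoms. Checking a proposed solution is easy, as the sketch below shows; the undecidable part is determining whether any solution exists. The function name and the example instance are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def is_pcp_solution(tiles: Sequence[Tuple[str, str]], indices: List[int]) -> bool:
    """Check whether a sequence of tile indices solves a PCP instance."""
    if not indices:
        return False  # the empty sequence is not a valid solution
    top = "".join(tiles[i][0] for i in indices)
    bottom = "".join(tiles[i][1] for i in indices)
    return top == bottom


# A standard textbook-style instance (0-indexed tiles).
tiles = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
print(is_pcp_solution(tiles, [2, 1, 2, 0]))  # True: both rows spell "bbaabbbaa"
```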

CC · Jun 9, 2018
Computational Complexity of Motion Planning of a Robot through Simple Gadgets

Erik D. Demaine, Isaac Grosof, Jayson Lynch et al.

We initiate a general theory for analyzing the complexity of motion planning of a single robot through a graph of "gadgets", each with its own state, set of locations, and allowed traversals between locations that can depend on and change the state. This type of setup is common to many robot motion planning hardness proofs. We characterize the complexity for a natural simple case: each gadget connects up to four locations in a perfect matching (but each direction can be traversable or not in the current state), has one or two states, every gadget traversal is immediately undoable, and gadget locations are connected by an always-traversable forest, possibly restricted to avoid crossings in the plane. Specifically, we show that any single nontrivial four-location two-state gadget type is enough for motion planning to become PSPACE-complete, while any set of simpler gadgets (effectively two-location or one-state) has a polynomial-time motion planning algorithm. As a sample application, our results show that motion planning games with "spinners" are PSPACE-complete, establishing a new hard aspect of Zelda: Oracle of Seasons.
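
The model can be made concrete with a brute-force reachability search over configurations, where a configuration is the robot's location together with the state of every gadget. This is only an illustrative sketch with a hypothetical encoding; it is not the paper's polynomial-time algorithm, and its configuration space grows exponentially with the number of gadgets.

```python
from collections import deque
from typing import Dict, Hashable, List, Sequence, Tuple

Location = Hashable
State = Hashable
# Hypothetical encoding: a gadget maps (state, entry location) to the
# (exit location, new state) pairs it allows.
Gadget = Dict[Tuple[State, Location], List[Tuple[Location, State]]]


def reachable(
    start: Location,
    goal: Location,
    gadgets: Sequence[Gadget],
    initial_states: Sequence[State],
    free_edges: Dict[Location, List[Location]],
) -> bool:
    """Breadth-first search over (location, gadget states) configurations."""
    start_config = (start, tuple(initial_states))
    seen = {start_config}
    queue = deque([start_config])
    while queue:
        loc, states = queue.popleft()
        if loc == goal:
            return True
        # Always-traversable connections leave every gadget state unchanged.
        successors = [(nxt, states) for nxt in free_edges.get(loc, [])]
        # Gadget traversals may change the state of the gadget that is used.
        for g, transitions in enumerate(gadgets):
            for exit_loc, new_state in transitions.get((states[g], loc), []):
                successors.append(
                    (exit_loc, states[:g] + (new_state,) + states[g + 1:])
                )
        for config in successors:
            if config not in seen:
                seen.add(config)
                queue.append(config)
    return False
```

A gadget whose transition table never changes the state behaves like a one-state gadget, in which case the search reduces to ordinary graph reachability.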