Mathematics and Coding are Universal AI Benchmarks
This work provides a foundational framework for AI evaluation and proposes universal benchmarks for advanced agents, though its contribution is incremental, building on the existing AAI and GVU concepts.
The paper tackles the problem of evaluating AI agents by showing that the batteries generated by mathematical theorem-proving and coding tasks are dense in the moduli space of psychometric batteries. Coding alone is universal in this sense, while mathematics offers spectral stability for self-improvement.
We study the special role of mathematics and coding inside the moduli space of psychometric batteries for AI agents. Building on the AAI framework and GVU dynamics from previous work, we define the Mathematics Fiber and show that, when paired with formal proof kernels (e.g., Lean, Coq), GVU flows on this fiber admit spectrally stable self-improvement regimes due to oracle-like verification. Our main technical result is a density theorem: under uniform tightness of agent outputs and a Lipschitz AAI functional, the subspace of batteries generated by mathematical theorem-proving and coding tasks is dense in the moduli space of batteries with respect to the evaluation metric. Coding alone is universal in this sense, while pure mathematics is not; its privilege is spectral rather than expressive. We interpret this as evidence that mathematics and coding provide "universal coordinates" for evaluation, and that formal mathematics is a natural ignition domain for recursive self-improvement in advanced AI agents.
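As a reading aid, the density claim can be unpacked in symbols. This is only a sketch of what "dense with respect to the evaluation metric" means, not the paper's formal statement. All notation is assumed for illustration: $\mathcal{M}$ for the moduli space of batteries, $d_{\mathrm{eval}}$ for the evaluation metric, $\mathcal{B}_{\mathrm{MC}}$ for the subspace generated by mathematical theorem-proving and coding tasks, and $\Phi$ for the AAI functional with Lipschitz constant $L$. The second line also assumes that the Lipschitz condition is taken with respect to $d_{\mathrm{eval}}$, which the abstract does not state.

```latex
% Hypothetical notation (not from the source): \mathcal{M}, d_{\mathrm{eval}},
% \mathcal{B}_{\mathrm{MC}}, \Phi, L.

% Density: every battery is approximated arbitrarily well by one
% generated from mathematics and coding tasks.
\[
  \forall B \in \mathcal{M},\ \forall \varepsilon > 0,\
  \exists B' \in \mathcal{B}_{\mathrm{MC}} :\quad
  d_{\mathrm{eval}}(B, B') < \varepsilon .
\]

% Consequence, assuming \Phi is L-Lipschitz with respect to d_{\mathrm{eval}}:
% the AAI score of B is recovered from B' up to an error below L\varepsilon.
\[
  \lvert \Phi(B) - \Phi(B') \rvert
  \;\le\; L \, d_{\mathrm{eval}}(B, B')
  \;<\; L \varepsilon .
\]
```

Under these assumptions, the "universal coordinates" interpretation amounts to saying that evaluating an agent on a suitable mathematics-and-coding battery determines its score on any other battery to within a controllable error.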