NOVA: Fundamental Limits of Knowledge Discovery Through AI

Salman Avestimehr, Ken Duffy, Muriel Médard

arXiv:2605.15219

AI Analysis

This work provides a theoretical foundation for understanding the fundamental limits and costs of AI-driven knowledge discovery, relevant to researchers and practitioners in AI safety and scientific discovery.

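The paper introduces the NOVA framework, which models iterative AI self-improvement as adaptive sampling over a knowledge space. It identifies sufficient conditions for covering a finite knowledge domain and the failure modes that arise when those conditions are violated: contamination, forgetting, exploration failure, and acceptance failure. For Zipf-distributed discoveries with exponent α > 1, it proves that the cumulative generation cost of obtaining D distinct genuine discoveries scales as Θ(c_gen D^α), where c_gen is the per-candidate generation cost, which quantifies diminishing returns.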

Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common ``generate, verify, accumulate, retrain'' loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good--Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $α>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
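The contamination trap described in the abstract can be illustrated with simple per-candidate arithmetic. The sketch below is an assumed toy model, not the paper's formal definition: each generated candidate is a new valid artifact with probability `new_valid_mass` or an invalid artifact with probability `invalid_mass`, and a verifier accepts valid artifacts with true-positive rate `tpr` and invalid ones with false-positive rate `fpr`. All names and default values are hypothetical.

```python
def acceptance_rates(new_valid_mass, invalid_mass, tpr=1.0, fpr=0.01):
    """Per-candidate rates at which new valid and invalid artifacts are accepted.

    Assumption: candidates are independent and the verifier's error rates
    do not depend on which artifact is being checked.
    """
    genuine_rate = new_valid_mass * tpr
    contamination_rate = invalid_mass * fpr
    return genuine_rate, contamination_rate

def in_contamination_trap(new_valid_mass, invalid_mass, tpr=1.0, fpr=0.01):
    """True when invalid artifacts are accepted faster than genuine discoveries."""
    genuine_rate, contamination_rate = acceptance_rates(
        new_valid_mass, invalid_mass, tpr, fpr
    )
    return contamination_rate > genuine_rate

def critical_new_valid_mass(invalid_mass, tpr=1.0, fpr=0.01):
    """New-valid mass at which the two acceptance rates are equal."""
    return invalid_mass * fpr / tpr
```

With `invalid_mass=0.3` and a 1% false-positive rate, a model that still places mass 0.1 on new valid artifacts is far from the trap, while one whose new-valid mass has shrunk to 0.001 is inside it. Under these assumptions the crossover sits at `invalid_mass * fpr / tpr`, so lowering the false-positive rate only postpones the trap as the new-valid mass keeps shrinking.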
