LG AI Mar 16

Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

arXiv:2603.15916 64.71 citations

AI Analysis

This paper provides a large-scale empirical framework for LLM-guided combinatorial ML experiment design, addressing autonomous experiment design for machine learning researchers.

The study analyzed 10,469 experiments run by LLM agents to determine whether they perform genuine architecture search or default to hyperparameter tuning. It finds that architectural choices explain 94% of performance variance, that the agents discovered a novel configuration achieving 0.9245 AP, and that LLM-guided search reaches 0.985 AP at N=50 versus 0.965 for random search.
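The "variance explained" figure is an effect size from a one-way ANOVA decomposition: the between-group sum of squares divided by the total sum of squares. The following is a minimal sketch of that calculation, not the paper's analysis code; the function name `eta_squared` and the toy architecture labels and AP values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def eta_squared(groups, scores):
    """Share of total variance explained by the grouping factor (one-way ANOVA).

    groups: one label per experiment (here, a hypothetical architecture id)
    scores: one metric value per experiment (here, AP)
    """
    by_group = defaultdict(list)
    for label, score in zip(groups, scores):
        by_group[label].append(score)

    grand_mean = sum(scores) / len(scores)

    # Total sum of squares: spread of every score around the grand mean.
    ss_total = sum((s - grand_mean) ** 2 for s in scores)

    # Between-group sum of squares: spread of group means around the grand
    # mean, weighted by group size.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_group.values()
    )
    return ss_between / ss_total


# Toy data (made up): two architectures, three hyperparameter settings each.
archs = ["A", "A", "A", "B", "B", "B"]
aps = [0.90, 0.91, 0.92, 0.70, 0.71, 0.72]
eta2 = eta_squared(archs, aps)
```

In this toy example the gap between the two architectures dwarfs the spread within each one, so the ratio is close to 1. A value of 0.94, as reported in the summary, would mean the remaining 6% of variance is within-architecture variation.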

When LLM agents autonomously design ML experiments, do they perform genuine architecture search -- or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that \textbf{architectural choices explain 94\% of performance variance} ($F = 1324$, $\eta^2 = 0.94$), while hyperparameter variation within a fixed architecture explains only 6\%. Cross-task validation on a second collision dataset confirms this finding (75\% architecture-explained variance) with a \emph{different} winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V-JEPA\,2 video features with Zipformer temporal encoders achieve 0.9245 AP -- a configuration no human proposed -- and concentrating search on productive architectural regions: at $N = 50$, LLM-guided search reaches AP $= 0.985$ versus $0.965$ for from-scratch random search. Post-bugfix convergence follows a power law ($c = 0.11$, $R^2 = 0.93$); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen--Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
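The abstract does not define how "entropy cycles" or "Jensen--Shannon specialization" are computed. One plausible reading, sketched below as an assumption, is to take each agent's empirical distribution over architectures, use Shannon entropy to measure how broadly one agent explores, and use the Jensen-Shannon divergence between two agents' distributions to measure how far they have specialized. The helper names and the toy choice lists are hypothetical.

```python
import math
from collections import Counter


def distribution(choices, support):
    """Empirical probability of each option in `support` among `choices`."""
    counts = Counter(choices)
    return [counts[option] / len(choices) for option in support]


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2, M = (P + Q) / 2.

    Ranges from 0 (identical distributions) to 1 (disjoint supports).
    """
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2


# Toy data (made up): architectures tried by two agents in one time window.
agent_a = ["vjepa", "vjepa", "vjepa", "resnet"]
agent_b = ["resnet", "resnet", "vit", "vit"]
support = sorted(set(agent_a) | set(agent_b))

p = distribution(agent_a, support)
q = distribution(agent_b, support)
specialization = js_divergence(p, q)   # higher means the agents explore different regions
breadth_a = entropy(p)                 # higher means agent A explores more broadly
```

Tracking `breadth_a` over successive time windows would show whether exploration widens and narrows in cycles, while `specialization` over the same windows would show whether the two agents drift toward different parts of the configuration space.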
