AI · CL · Apr 4, 2025

Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time

arXiv:2504.03635v3 · 5 citations · h-index: 40
Originality: Incremental advance
AI Analysis

This work helps AI researchers understand how scaling affects reasoning and offers counterintuitive insights into choosing model size, though its contribution is incremental and confined to synthetic environments.

The study investigated how scaling model size and data affects language models' implicit reasoning abilities during pretraining, finding that overparameterization can impair performance due to memorization, with optimal models achieving approximately 0.008 bits of reasoning information per parameter.

Reasoning is an integral part of many tasks performed by language models (LMs). However, the effects of scaling model size and data on reasoning abilities at pretraining time remain understudied. To rigorously investigate this problem, we pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining. Interestingly, we observe that overparameterization can impair implicit reasoning performance due to excessive memorization. We investigate different factors that affect the loss curve when scaling different components of the knowledge graph, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law showing that optimal-sized LMs can reason over approximately 0.008 bits of information per parameter. This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in LLMs.
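As a rough illustration of how the quoted figure could be used, the sketch below converts between a knowledge graph's information content and a model size using the 0.008 bits-per-parameter constant from the abstract. It assumes a simple linear relationship and treats the graph's information content in bits as a given input; how that quantity is measured is not described in this summary. The function names are hypothetical and are not taken from the paper.

```python
# Illustrative only: a linear reading of the "0.008 bits per parameter" figure
# quoted in the abstract. Function names are hypothetical, not from the paper.

BITS_PER_PARAM = 0.008  # constant quoted in the abstract


def estimate_optimal_params(graph_info_bits: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimate the parameter count suited to a graph with the given
    information content (in bits), assuming a linear relationship."""
    if graph_info_bits < 0:
        raise ValueError("graph_info_bits must be non-negative")
    if bits_per_param <= 0:
        raise ValueError("bits_per_param must be positive")
    return graph_info_bits / bits_per_param


def reasoning_capacity_bits(num_params: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Inverse view: bits of graph information an optimally sized model
    with num_params parameters could reason over, under the same assumption."""
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    return num_params * bits_per_param


if __name__ == "__main__":
    # Hypothetical graph carrying 8,000 bits of information.
    print(f"{estimate_optimal_params(8_000):,.0f} parameters")
    print(f"{reasoning_capacity_bits(1_000_000):,.0f} bits")
```

Under this reading, a model much larger than the estimate would sit in the overparameterized regime where the abstract reports that excessive memorization impairs implicit reasoning.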

