LG · CL · Apr 22

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis

arXiv:2604.21106 · 19.05 citations · h-index: 13

Predicted impact: top 22% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers designing looped or depth-recurrent language models, this provides a scaling law that quantifies the trade-off between recurrence and model size, enabling predictable cost-benefit analysis.

The authors measure the effect of adding recurrences to looped language models and fit a recurrence-equivalence exponent of φ = 0.46: a block looped r times counts for about r^0.46 times its own parameters in effective unique parameters, well short of the factor of r that full equivalence would give. At r=4, a 410M looped model matches a 580M non-looped model in validation loss but costs the training compute of a 1B non-looped model.

We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
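
The parameter term of the fitted law can be evaluated directly to see how the 410M / 580M / 1B example might arise. The sketch below is illustrative only: the split of the 410M model into 220M run-once and 190M looped parameters is a hypothetical choice (the abstract does not state it), and using N_once + r * N_rec as a stand-in for per-token training cost is likewise an assumption, not the authors' accounting.

```python
# Minimal sketch of the effective-parameter term N_once + r**phi * N_rec.
# PHI comes from the abstract; the parameter split below is HYPOTHETICAL.

PHI = 0.46  # recurrence-equivalence exponent reported in the abstract


def effective_params(n_once: float, n_rec: float, r: int, phi: float = PHI) -> float:
    """Effective unique parameters: N_once + r**phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def unrolled_params(n_once: float, n_rec: float, r: int) -> float:
    """Assumed cost proxy: parameters applied per token when the block runs r times."""
    return n_once + r * n_rec


# Hypothetical 410M looped model: 220M parameters used once, 190M in the looped block.
N_ONCE = 220e6
N_REC = 190e6
R = 4

eff = effective_params(N_ONCE, N_REC, R)
cost = unrolled_params(N_ONCE, N_REC, R)

print(f"stored parameters:    {(N_ONCE + N_REC) / 1e6:.0f}M")
print(f"effective parameters: {eff / 1e6:.0f}M")
print(f"cost-proxy params:    {cost / 1e6:.0f}M")
```

Under this assumed split, the effective count lands near the 580M figure and the cost proxy near 1B, which matches the abstract's example; a different split would shift both numbers.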
