CL · AI · LG · Apr 9

Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

arXiv:2604.07822 · 78.16 citations · h-index: 3
Predicted impact: top 78% in CL · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses a key limitation in the compositional generalization of large language models; its contributions to model architecture and training strategy are incremental.

The paper tackles the problem of implicit multi-hop reasoning in transformers, showing that recurrent-depth transformers can generalize to unseen compositions and deeper reasoning depths, achieving systematic generalization through a grokking process and enabling depth extrapolation with scaled inference-time recurrence.

We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can achieve both. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, a finding supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
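
The link between recurrence and reasoning depth can be illustrated with a purely symbolic toy. This is a minimal sketch, not the paper's model: there is no learned transformer here, and every name (`make_facts`, `shared_block`, `recurrent_forward`, `reference_answer`) and every size is a hypothetical choice made for illustration. One shared step resolves a single hop, and the same step is reused for a configurable number of iterations, so a 10-hop query that is left unfinished after 5 iterations is completed once the iteration budget is raised to 10.

```python
import random


def make_facts(num_entities, num_relations, seed=0):
    """Build random single-hop facts: facts[r][e] is the tail entity of (e, r)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_entities) for _ in range(num_entities)]
        for _ in range(num_relations)
    ]


def shared_block(entity, hops_done, relations, facts):
    """One pass through the shared step: resolve at most one pending hop."""
    if hops_done < len(relations):
        entity = facts[relations[hops_done]][entity]
        hops_done += 1
    return entity, hops_done


def recurrent_forward(entity, relations, facts, num_iterations):
    """Reuse the same block num_iterations times (the step is tied across depth)."""
    hops_done = 0
    for _ in range(num_iterations):
        entity, hops_done = shared_block(entity, hops_done, relations, facts)
    return entity, hops_done


def reference_answer(entity, relations, facts):
    """Ground-truth answer for a multi-hop query, composed hop by hop."""
    for relation in relations:
        entity = facts[relation][entity]
    return entity


# Hypothetical toy setup: 50 entities, 8 relation types, one 10-hop query.
facts = make_facts(num_entities=50, num_relations=8, seed=0)
query_rng = random.Random(1)
relations = [query_rng.randrange(8) for _ in range(10)]
start = 3

# Five iterations only resolve the first five hops of the query.
_, hops_at_5 = recurrent_forward(start, relations, facts, num_iterations=5)
# Ten iterations of the same block resolve the full 10-hop composition.
answer_at_10, hops_at_10 = recurrent_forward(start, relations, facts, num_iterations=10)
```

In this toy, extra iterations beyond the query depth leave the answer unchanged, so it does not reproduce the overthinking failure described in the abstract; that behavior belongs to learned models and is outside what a lookup table can show.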

