Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

arXiv:2604.21999 · 9.11 citations · Has Code

Predicted impact: top 47% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers working on adaptive computation in transformers, this paper identifies and solves a critical initialization failure in ACT, enabling reliable training and demonstrating the necessity of memory tokens for recursive reasoning.

The paper shows that memory tokens are necessary for a single-block Universal Transformer to solve Sudoku-Extreme: no configuration without them achieves non-trivial performance, while 8-32 tokens reach 57.4% exact-match accuracy. It also identifies a router initialization trap that causes more than 70% of training runs to fail and fixes it with a negative router bias of -3. With training made reliable, ACT with lambda warmup achieves matching accuracy using 34% fewer ponder steps.

We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested -- 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing -- no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 ("deep start," p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
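The initialization trap described in the abstract can be illustrated with a small arithmetic sketch. It assumes the standard ACT halting rule from Graves (a token halts once its cumulative halting probability reaches 1 - eps) and a constant per-step halting probability equal to the sigmoid of the router bias, which is roughly the situation at initialization when the router weights contribute little. The function names, `eps = 0.01`, and `max_steps = 32` are illustrative assumptions, not values taken from the paper or its code; the bias of 1.0 is inferred from the reported p ~ 0.73.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps a router bias to a halting probability."""
    return 1.0 / (1.0 + math.exp(-x))


def steps_until_halt(bias: float, eps: float = 0.01, max_steps: int = 32) -> int:
    """Steps before a token halts if every step emits the same probability.

    Assumption: Graves-style ACT, where a token halts at the first step
    whose cumulative halting probability reaches 1 - eps. `eps` and
    `max_steps` are hypothetical values chosen for illustration.
    """
    p = sigmoid(bias)
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += p
        if cumulative >= 1.0 - eps:
            return step
    return max_steps  # never crossed the threshold: run to the depth cap


if __name__ == "__main__":
    # Zero bias, a positive bias of 1.0, and the "deep start" bias of -3.
    for bias in (0.0, 1.0, -3.0):
        p = sigmoid(bias)
        print(f"bias={bias:+.1f}  p={p:.3f}  halts after {steps_until_halt(bias)} steps")
```

Under these assumptions, both the zero and positive biases halt after 2 steps, matching the "~2 steps at initialization" reported above, while the -3 bias lets a token run for about 21 steps before halting, so the model begins training with deep recursion available rather than having to discover it.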
