Hierarchical vs. Flat Iteration in Shared-Weight Transformers
For language-modeling researchers, this paper reports an empirical negative result: hierarchical shared-weight recurrence does not match the performance of independent-layer stacking. The finding is incremental.
The study investigates whether hierarchical shared-weight recurrence can match the representational quality of independent-layer stacking in Transformers. The HRM-LM model, using a two-speed recurrent pair, shows a sharp empirical gap compared to a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five runs, indicating that hierarchical recurrence underperforms independent stacking.
We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module that runs at every step for local refinement, and a Slow module that runs every T steps for global compression. This recurrent hierarchy is unrolled for M = N × T steps with shared parameters, so the Slow module is applied N times in total. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.
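The two-speed schedule described above can be illustrated with a minimal Python sketch. It shows only the update timing: the Fast module is called at every step and the Slow module every T steps, over M = N × T steps with the same functions reused throughout. Everything else is an assumption made for illustration and is not taken from the paper: the names `unroll_two_speed`, `toy_fast`, and `toy_slow`, the zero initialization of both states, the way each module conditions on the other's state and on the input, and the toy arithmetic standing in for Transformer blocks.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def unroll_two_speed(
    x: Vector,
    fast: Callable[[Vector, Vector, Vector], Vector],
    slow: Callable[[Vector, Vector], Vector],
    n_cycles: int,  # N in the text
    period: int,    # T in the text
) -> Tuple[Vector, Vector, int, int]:
    """Unroll a two-speed recurrence for M = n_cycles * period steps.

    The same `fast` and `slow` callables are reused at every step,
    which stands in for parameter sharing across the unrolled depth.
    """
    # Assumption: both states start at zero.
    z_fast = [0.0] * len(x)
    z_slow = [0.0] * len(x)
    fast_calls = 0
    slow_calls = 0

    for step in range(1, n_cycles * period + 1):
        # Fast module: runs at every step.
        z_fast = fast(z_fast, z_slow, x)
        fast_calls += 1

        # Slow module: runs once every `period` steps.
        if step % period == 0:
            z_slow = slow(z_slow, z_fast)
            slow_calls += 1

    return z_fast, z_slow, fast_calls, slow_calls


def toy_fast(z_fast: Vector, z_slow: Vector, x: Vector) -> Vector:
    # Hypothetical stand-in for a shared Transformer block.
    return [0.5 * f + 0.25 * s + 0.25 * xi for f, s, xi in zip(z_fast, z_slow, x)]


def toy_slow(z_slow: Vector, z_fast: Vector) -> Vector:
    # Hypothetical stand-in for the global compression step.
    return [0.5 * s + 0.5 * f for s, f in zip(z_slow, z_fast)]


z_fast, z_slow, fast_calls, slow_calls = unroll_two_speed(
    [1.0, -2.0], toy_fast, toy_slow, n_cycles=3, period=4
)
print(fast_calls, slow_calls)
```

With N = 3 and T = 4, the loop runs M = 12 steps, calling the Fast function 12 times and the Slow function 3 times. A flat shared-weight baseline in this sketch would correspond to dropping the Slow branch and iterating a single function for all M steps.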