CL LG Apr 1

Associative-State Universal Transformers: Sparse Retrieval Meets Structured Recurrence

arXiv:2604.25930 7.6
AI Analysis

For language model designers, this work shows that structured recurrent states are parameter-efficient but require explicit sparse retrieval for exact long-range recall.

UniMatrix, a Universal Transformer variant with hybrid state updates, slightly beats a Transformer baseline on byte-level WikiText-2 (5.083 vs 5.124 bits-per-byte for its best variant) while using fewer parameters, but stays near chance on associative recall, where the Transformer reaches 25.4%. Adding sparse slot routing and pointer-logit fusion (UniMatrix-SparsePointer) raises recall to 75.6% on the original pilot recipe and 99.2% on a no-dropout follow-up, with 53.8% fewer parameters than the Transformer baseline.
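To put the headline numbers in proportion, the short sketch below converts a byte-level cross-entropy into bits-per-byte and computes the relative gap between the two reported scores. The helper names `bits_per_byte` and `relative_reduction` are illustrative assumptions, not code from the paper; only the figures 5.083, 5.124, and 53.8% come from the text above.

```python
import math


def bits_per_byte(total_nats: float, num_bytes: int) -> float:
    """Convert a summed cross-entropy in nats over num_bytes bytes to bits-per-byte."""
    return total_nats / (num_bytes * math.log(2))


def relative_reduction(baseline: float, model: float) -> float:
    """Fraction by which `model` is lower than `baseline`."""
    return (baseline - model) / baseline


# Reported scores: Transformer 5.124 bpb, best UniMatrix variant 5.083 bpb.
gap = relative_reduction(5.124, 5.083)
print(f"relative bpb improvement: {gap:.2%}")  # prints "relative bpb improvement: 0.80%"

# "53.8 percent fewer parameters" means this fraction of the baseline count remains.
remaining = 1.0 - 0.538
print(f"fraction of baseline parameters kept: {remaining:.3f}")
```

The perplexity gap is therefore under one percent, while the parameter count is roughly halved, which is why the summary describes the result as competitive rather than clearly better.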

We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval. We introduce UniMatrix, a Universal Transformer-style family that reuses a shared recurrent block across depth and augments it with hybrid state updates, a ROSA-style residual path, and token-conditioned embedding modulation. We evaluate these models on byte-level WikiText-2, synthetic associative recall, throughput profiling on Apple MPS, and a corrected benchmark for triple-token interactions. At small scale, UniMatrix-Core and UniMatrix-ROSA slightly outperform a parameter-matched Transformer on WikiText-2 while using many fewer parameters, reaching 5.084 and 5.083 bits-per-byte versus 5.124. The main negative result is equally important: on associative recall, the original UniMatrix family remains near chance while the Transformer reaches 25.4 percent, showing that compressed recurrent state alone is not enough for exact lookup. A retrieval-oriented follow-up, UniMatrix-Assoc, helps only marginally. By contrast, UniMatrix-SparsePointer, which adds sparse slot routing and direct pointer-logit fusion, reaches 75.6 percent on the original pilot recipe and 99.2 percent on a no-dropout follow-up while using 53.8 percent fewer parameters than the Transformer baseline. Ablations show that the gain comes from sufficient slot capacity and exact pointer-level output routing. Overall, structured recurrent state is promising and parameter-efficient, but strong long-range behavior still requires explicit sparse retrieval and better kernels.
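The abstract does not spell out how sparse slot routing and pointer-logit fusion are implemented, so the following is only a generic sketch of the idea under stated assumptions: stored key/value slots, dot-product scoring, top-k selection, and adding the resulting pointer probabilities onto the vocabulary logits. The function `sparse_pointer_logits`, its parameters, and the toy data are all hypothetical and should not be read as the paper's method.

```python
import math


def sparse_pointer_logits(query, slot_keys, slot_values, base_logits,
                          top_k=2, weight=10.0):
    """Fuse top-k slot matches into vocabulary logits (hypothetical helper).

    query:       list of floats, the current lookup vector
    slot_keys:   list of key vectors, one per stored slot
    slot_values: list of token ids, the value stored in each slot
    base_logits: list of floats, the model's ordinary output logits
    """
    # Dot-product score between the query and every stored slot key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in slot_keys]
    # Sparse routing (assumed form): keep only the top_k best-matching slots.
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the kept slots.
    top = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    # Pointer fusion (assumed form): add each kept slot's probability,
    # scaled by `weight`, onto the logit of the token stored in that slot.
    fused = list(base_logits)
    for i in kept:
        fused[slot_values[i]] += weight * exps[i] / total
    return fused


# Toy associative recall: three key -> value pairs over a 5-token vocabulary.
keys = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
values = [2, 3, 4]             # token id stored with each key
query = [0.0, 4.0, 0.0]        # matches the second key exactly
fused = sparse_pointer_logits(query, keys, values, base_logits=[0.0] * 5)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)  # prints 3, the value paired with the matching key
```

The point of the sketch is the contrast drawn in the abstract: a compressed state has to reconstruct the value, whereas a pointer path copies the stored token id straight into the output distribution, which is what makes exact lookup possible.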

