CL May 28

A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

arXiv:2605.30202
84.6
AI Analysis

For LLM practitioners, the dual-path block offers a parameter-efficient way to scale compute and capacity independently, and it outperforms standard transformers at fixed FLOPs.

The paper proposes a dual-path architecture for LLMs that separately scales compute (via repeated deep sublayers) and capacity (via a wide FFN), achieving better language modeling and downstream performance than iso-FLOP baselines with fewer parameters. Learned per-token gates reveal interpretable patterns: function words and lexical content favor wide paths, while punctuation and arithmetic favor deep paths.
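The trade-off between compute and capacity can be made concrete with a back-of-the-envelope count. The sketch below is not from the paper: it counts only feed-forward weights (attention, biases, gates, and embeddings are ignored), assumes roughly 2 FLOPs per weight per token, and uses hypothetical sizes (`d_model=512`, `k=4`, and the `d_ff` values) chosen only so that the three budgets line up.

```python
def ffn_params(d_model, d_ff):
    """Weights in a two-matrix feed-forward network (biases ignored)."""
    return 2 * d_model * d_ff


def ffn_flops(d_model, d_ff):
    """Approximate per-token forward FLOPs: one multiply and one add per weight."""
    return 2 * ffn_params(d_model, d_ff)


def unshared_cost(d_model, d_ff, n_layers):
    """Baseline: n_layers separate FFNs, each applied once."""
    params = n_layers * ffn_params(d_model, d_ff)
    flops = n_layers * ffn_flops(d_model, d_ff)
    return params, flops


def looped_cost(d_model, d_ff, k):
    """Looped: one FFN whose weights are re-applied k times."""
    params = ffn_params(d_model, d_ff)
    flops = k * ffn_flops(d_model, d_ff)
    return params, flops


def dual_path_cost(d_model, d_ff_deep, d_ff_wide, k):
    """Dual path: a shared deep FFN applied k times plus a wide FFN applied once."""
    params = ffn_params(d_model, d_ff_deep) + ffn_params(d_model, d_ff_wide)
    flops = k * ffn_flops(d_model, d_ff_deep) + ffn_flops(d_model, d_ff_wide)
    return params, flops


# Hypothetical sizes, picked so that all three configurations cost the same FLOPs.
baseline = unshared_cost(d_model=512, d_ff=4096, n_layers=2)
looped = looped_cost(d_model=512, d_ff=2048, k=4)
dual = dual_path_cost(d_model=512, d_ff_deep=1024, d_ff_wide=4096, k=4)

for name, (params, flops) in [("baseline", baseline), ("looped", looped), ("dual", dual)]:
    print(f"{name:>8}: params={params:,} flops/token={flops:,}")
```

Under these assumptions the three configurations spend the same FLOPs per token, but the looped model holds the fewest weights, the unshared baseline the most, and the dual-path layout sits in between. That ordering only illustrates the accounting; it says nothing about model quality.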

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). To this end, we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine the two pathways and enable detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP-matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
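The block described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: attention and normalization are omitted, both paths are plain ReLU feed-forward networks, the deep path uses a residual update on each of its K applications, and each gate is an independent per-token sigmoid over a learned projection. The names `init_params`, `dual_path_block`, and the parameter keys are hypothetical.

```python
import numpy as np


def ffn(x, w_in, w_out):
    """Two-matrix feed-forward network with ReLU, applied to each token row."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(d_model, d_ff_deep, d_ff_wide, seed=0, scale=0.1):
    """Random weights for illustration only (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    return {
        "deep_in": rng.normal(0.0, scale, (d_model, d_ff_deep)),
        "deep_out": rng.normal(0.0, scale, (d_ff_deep, d_model)),
        "wide_in": rng.normal(0.0, scale, (d_model, d_ff_wide)),
        "wide_out": rng.normal(0.0, scale, (d_ff_wide, d_model)),
        "gate_deep": rng.normal(0.0, scale, (d_model, 1)),
        "gate_wide": rng.normal(0.0, scale, (d_model, 1)),
    }


def dual_path_block(x, p, k):
    """x has shape (tokens, d_model); returns the output and both gate values."""
    # Deep path: the SAME weights are re-applied k times (more compute, no new parameters).
    h = x
    for _ in range(k):
        h = h + ffn(h, p["deep_in"], p["deep_out"])  # assumed residual update
    deep = h - x  # net change contributed by the deep path

    # Wide path: an enlarged FFN applied once (more parameters, a single step).
    wide = ffn(x, p["wide_in"], p["wide_out"])

    # Independent per-token gates, one scalar in (0, 1) per token and per path.
    g_deep = sigmoid(x @ p["gate_deep"])
    g_wide = sigmoid(x @ p["gate_wide"])

    y = x + g_deep * deep + g_wide * wide
    return y, (g_deep, g_wide)


params = init_params(d_model=8, d_ff_deep=16, d_ff_wide=64)
tokens = np.random.default_rng(1).normal(size=(5, 8))
out, (g_deep, g_wide) = dual_path_block(tokens, params, k=3)
print(out.shape, g_deep.ravel().round(2), g_wide.ravel().round(2))
```

Because each gate is a scalar per token, comparing `g_deep` and `g_wide` across token types is the kind of routing analysis the abstract refers to. The sigmoid form used here is an assumption rather than a detail taken from the paper.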

