LG · Dec 14, 2025

Improving Recursive Transformers with Mixture of LoRAs

arXiv:2512.12880v2 · 1 citation
Originality: Incremental advance
AI Analysis

The paper addresses the efficiency-expressivity trade-off in recursive transformers, which matters to NLP practitioners, though it appears incremental because it builds on existing LoRA and recursive-transformer concepts.

The paper tackles the reduced expressivity that parameter sharing causes in recursive transformers by proposing Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that modulates the shared feed-forward network. The resulting ModernALBERT models (50M-120M parameters) achieve state-of-the-art performance among compact models on the GLUE, SQuAD-v2, and BEIR benchmarks, surpassing larger fully parameterized baselines.

Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M-120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
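
To make the idea of token-conditional weight-space modulation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a dense softmax router, a single shared projection standing in for the full GeGLU FFN, and a weighted average as the merging rule. None of these details is given in the abstract, and all names and sizes (`W_shared`, `W_router`, `mol_projection`, `merge_experts`, the dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
d_model, d_ff, rank, n_experts = 8, 16, 2, 4

# Shared (tied) projection, reused at every recursion step of the backbone.
W_shared = rng.normal(scale=0.1, size=(d_model, d_ff))

# One low-rank pair (A_e, B_e) per LoRA expert.
A = rng.normal(scale=0.1, size=(n_experts, d_model, rank))
B = rng.normal(scale=0.1, size=(n_experts, rank, d_ff))

# Assumed router: a linear map from each token to per-expert logits.
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mol_projection(x):
    """Apply the shared projection plus a gated mixture of LoRA updates.

    x: (tokens, d_model) -> (tokens, d_ff)
    Per token, this equals x @ (W_shared + sum_e g_e(x) * A_e @ B_e).
    """
    gates = softmax(x @ W_router)                 # (tokens, n_experts)
    base = x @ W_shared                           # (tokens, d_ff)
    low = np.einsum("td,edr->ter", x, A)          # (tokens, n_experts, rank)
    delta = np.einsum("ter,erf->tef", low, B)     # (tokens, n_experts, d_ff)
    return base + np.einsum("te,tef->tf", gates, delta)


def merge_experts(weights=None):
    """Collapse all experts into one static update of shape (d_model, d_ff).

    Assumption: a weighted average of the expert updates (uniform by
    default). The paper's actual merging procedure may differ.
    """
    if weights is None:
        weights = np.full(n_experts, 1.0 / n_experts)
    return np.einsum("e,edr,erf->df", weights, A, B)


x = rng.normal(size=(5, d_model))
y_routed = mol_projection(x)                       # token-conditional path
y_merged = x @ (W_shared + merge_experts())        # static path after merging
```

In this sketch the backbone weight `W_shared` stays tied while each token sees its own effective weight. After merging, the router is no longer needed, so inference cost matches that of a single adapter; how closely the merged path tracks the routed one depends on the merging rule, which is only assumed here.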

