LG · CL · Apr 2

Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation

arXiv:2604.02051 · 64.2 · Has Code

Predicted impact: top 40% in LG · last 90 days
Originality: Incremental advance

AI Analysis

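This work addresses a core limitation of recursive transformers: every depth step applies the same transformation. It makes each step input-dependent so the model can compose distinct operations across depth. The advance is incremental, since it builds on existing recursive-transformer and LoRA methods.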

The paper tackles the limitation that recursive transformers apply the same transformation at every depth step. It introduces Ouroboros, a system in which a Controller hypernetwork generates input-dependent weights via LoRA modulation. On Qwen2.5-3B with 17 of 36 layers retained, Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline and recovers 51.3% of the performance gap caused by layer removal.

Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
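The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single-linear-layer `controller`, the `tanh` nonlinearities, the scalar gate, the parameter-free `layer_norm`, and every function name are hypothetical choices made for illustration. The sketch shows only the two ideas stated above: a per-step diagonal modulation vector applied between frozen SVD-initialized LoRA bases, and a gate whose bias is chosen so that the sigmoid starts at 88% retention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_bias_for_retention(retention):
    """Bias b with sigmoid(b) == retention (the logit of the retention)."""
    return float(np.log(retention / (1.0 - retention)))

def layer_norm(h, eps=1e-5):
    """Parameter-free normalization; a stand-in for per-step LayerNorm."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def svd_lora_bases(W, rank):
    """Frozen bases from a truncated SVD of W: B = U[:, :r], A = Vt[:r, :]."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank], Vt[:rank, :]

def controller(h, W_c):
    """Hypothetical controller: maps the hidden state to a rank-sized vector."""
    return np.tanh(W_c @ h)

def recurrent_step(h, W, B, A, W_c, gate_bias):
    m = controller(h, W_c)                # input-dependent modulation, shape (rank,)
    delta = B @ (m * (A @ h))             # equals B @ diag(m) @ A @ h
    candidate = np.tanh(W @ h + delta)    # shared weights plus modulated low-rank update
    g = sigmoid(gate_bias)                # fraction of the previous state retained
    return layer_norm(g * h + (1.0 - g) * candidate)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, rank, depth = 8, 2, 4
    W = rng.normal(size=(d, d)) / np.sqrt(d)   # shared recurrent weight
    B, A = svd_lora_bases(W, rank)             # frozen bases
    W_c = rng.normal(size=(rank, d))           # controller weights (trainable in practice)
    b = gate_bias_for_retention(0.88)          # roughly 1.99
    h = rng.normal(size=d)
    for _ in range(depth):                     # same block reused at every depth step
        h = recurrent_step(h, W, B, A, W_c, b)
    print(h.shape)
```

Because `m` is recomputed from the current hidden state at every iteration, the effective low-rank update differs from step to step even though `W`, `B`, and `A` are shared. A static per-step LoRA would instead fix a separate update for each depth index regardless of the input.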
