NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

arXiv:2603.06492v15.9


AI Analysis

NOBLE offers substantial training efficiency improvements for large language models, BERT, VQGAN, and ViT, benefiting researchers and practitioners by reducing pretraining time and computational costs.

This paper introduces NOBLE, a new architectural augmentation for transformers that adds nonlinear low-rank branches to linear layers, designed for pretraining from scratch. NOBLE achieves up to 1.47x step speedup to reach baseline evaluation loss (up to 32% fewer training steps) with only 4% additional parameters and 7% step time overhead, resulting in up to 1.22x net wallclock speedup.
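The relationship between the quoted step speedup and the fraction of training steps saved can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the wallclock formula assumes a constant per-step overhead. The "up to" and "as low as" figures need not come from the same configuration, so combining the best-case numbers gives an idealized upper bound, not a reproduction of the reported 1.22x.

```python
def fraction_fewer_steps(step_speedup: float) -> float:
    """Fraction of baseline steps saved when the baseline loss is reached
    step_speedup times faster (measured in steps)."""
    return 1.0 - 1.0 / step_speedup


def net_wallclock_speedup(step_speedup: float, step_time_overhead: float) -> float:
    """Wallclock speedup when each step costs (1 + overhead) times the baseline.

    Assumes the overhead is the same on every step (an assumption of this
    sketch, not a claim from the source).
    """
    return step_speedup / (1.0 + step_time_overhead)


# 1.47x step speedup corresponds to roughly 32% fewer training steps.
print(f"{fraction_fewer_steps(1.47):.1%}")

# Idealized pairing of the best-case step speedup with the lowest overhead.
print(f"{net_wallclock_speedup(1.47, 0.07):.2f}x")
```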

We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch: the branch is a permanent part of the architecture rather than an adapter fine-tuned on top of frozen weights. The branch computes σ(x W_down) W_up, where σ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet performs best. CosNet consists of two cosine nonlinearities with learnable frequency and phase, separated by a linear projection in the bottleneck space. NOBLE achieves substantial improvements with minimal overhead: up to 1.47x step speedup to reach baseline eval loss (up to 32% fewer training steps), with as little as 4% additional parameters and 7% step time overhead, resulting in up to 1.22x net wallclock speedup. Experiments on LLMs (250M and 1.5B parameters), BERT, VQGAN, and ViT consistently show improved training efficiency. We identify one caveat: in ImageNet classification, Mixup/CutMix and other stochastic augmentations interfere with NOBLE's benefits, but when they are disabled, ViT also improves. A possible explanation is that these regularization techniques encourage smoother fits to the target function, whereas NOBLE may specialize in its sharper aspects.
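As a rough illustration of the branch described in the abstract, the following NumPy sketch adds σ(x W_down) W_up to the output of a plain linear layer, with σ modeled as a CosNet-like stack (cosine, linear projection, cosine). Several details are assumptions of this sketch and are not confirmed by the abstract: that the branch output is summed with the main linear output, that frequencies and phases are per-feature vectors, the initialization scheme (including the LoRA-style zero initialization of `W_up`), and all function and variable names.

```python
import numpy as np


def cosnet(h, freq1, phase1, mix, freq2, phase2):
    """CosNet-like nonlinearity in the bottleneck space (illustrative).

    cos(freq1 * h + phase1) -> linear projection `mix` -> cos(freq2 * . + phase2)
    Per-feature frequencies and phases are an assumption of this sketch.
    """
    a = np.cos(freq1 * h + phase1)
    b = a @ mix
    return np.cos(freq2 * b + phase2)


def noble_linear(x, W, W_down, W_up, cos_params):
    """Main linear path plus a nonlinear low-rank branch.

    Summing the two paths is an assumption; the abstract only states that
    the branch computes sigma(x W_down) W_up.
    """
    main = x @ W
    branch = cosnet(x @ W_down, *cos_params) @ W_up
    return main + branch


rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4  # toy sizes, not taken from the paper

W = rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
W_down = rng.normal(scale=d_in ** -0.5, size=(d_in, rank))
W_up = np.zeros((rank, d_out))  # zero init (assumed) makes the branch a no-op at start

cos_params = (
    np.ones(rank),    # freq1
    np.zeros(rank),   # phase1
    rng.normal(scale=rank ** -0.5, size=(rank, rank)),  # projection between the cosines
    np.ones(rank),    # freq2
    np.zeros(rank),   # phase2
)

x = rng.normal(size=(8, d_in))
y = noble_linear(x, W, W_down, W_up, cos_params)

# Parameter overhead of the branch relative to the main weight matrix.
extra_params = W_down.size + W_up.size + cos_params[2].size + 4 * rank
print(y.shape, f"extra parameters: {extra_params / W.size:.1%}")
```

The overhead printed here reflects the toy dimensions only. In this parameterization the branch cost grows roughly as 2·d·r for a d×d layer, so a small rank relative to the layer width keeps the added parameter count low.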
