LG AI CL May 12, 2025

An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock

arXiv:2505.08823v1 h-index: 1

Originality: Incremental advance

AI Analysis

The method makes ultra-low-bit inference practical, which lowers the cost of deploying LLMs. The advance is incremental, since it builds on prior work on bias-free, RMS-normalized Transformers.

Fine-tuning large language models (LLMs) to ternary (1.58-bit) precision is unstable. The paper addresses this by inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule. The resulting fine-tuning is stable and matches or surpasses more complex methods on standard language-modeling benchmarks.

Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (1.58-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
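The forward pass described in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. The abstract does not specify the quantizer or the schedule, so the absmean scaling in `ternarize` and the linear blend controlled by `lam` are assumptions, and all function names are hypothetical. The straight-through estimator only affects gradients and is not shown in this forward-only sketch.

```python
import math

def rms_norm(x, eps=1e-6):
    """Rescale a vector so its root-mean-square is about 1 (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def ternarize(weights, eps=1e-8):
    """Map each weight to scale * {-1, 0, +1}.

    Assumption: the scale is the mean absolute weight (absmean); the
    abstract does not say which scaling rule the paper uses.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    return [[scale * max(-1, min(1, round(w / scale))) for w in row]
            for row in weights]

def blended_weights(weights, lam):
    """Interpolate between full-precision (lam=0) and ternary (lam=1) weights.

    Assumption: a hypothetical gradual schedule would raise lam from 0 to 1
    during fine-tuning, possibly at a different pace for each layer.
    """
    quantized = ternarize(weights)
    return [[(1 - lam) * w + lam * wq for w, wq in zip(row, qrow)]
            for row, qrow in zip(weights, quantized)]

def normed_linear(x, weights, lam):
    """Bias-free linear projection with RMS normalization applied to its input."""
    h = rms_norm(x)
    w_eff = blended_weights(weights, lam)
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in w_eff]

# A ternary weight takes one of three values, so it carries log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585, hence "1.58-bit"

W = [[0.9, -0.05, -1.2], [0.1, 0.6, -0.7]]   # toy 2x3 weight matrix
x = [3.0, -4.0, 12.0]                        # toy input activation

y_fp = normed_linear(x, W, lam=0.0)       # full-precision weights
y_ternary = normed_linear(x, W, lam=1.0)  # fully ternary weights
```

Normalizing the input immediately before the projection keeps the activation scale fixed regardless of how coarse the weights become, which is one plausible reading of why the extra RMSNorm stabilizes fine-tuning.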

