LG, AI, CL · May 12, 2025

An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

arXiv:2505.08823v1 · h-index: 1
Originality: Incremental advance
AI Analysis

The method enables more cost-effective deployment of LLMs by making ultra-low-bit inference practical. The advance is incremental, since it builds on prior work on bias-free, RMS-normalized Transformers.

The paper tackles the instability of fine-tuning large language models (LLMs) to ternary (1.58-bit) precision. It inserts RMS normalization before linear projections and applies a gradual quantization schedule, achieving stable fine-tuning that matches or surpasses more complex methods on standard benchmarks.
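
For readers wondering where the "1.58-bit" figure comes from, the short calculation below may help. It assumes only that each weight takes one of the three values -1, 0, or +1, so the information content per weight is the base-2 logarithm of 3:

```latex
\[
  w \in \{-1,\, 0,\, +1\}
  \quad\Longrightarrow\quad
  \text{bits per weight} = \log_2 3 \approx 1.585
\]
```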

Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (1.58-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
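
The forward pass described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the absmean ternary quantizer (in the style of BitNet b1.58), the blending factor `lam` that mixes full-precision and ternary weights, the linear ramp in `lam_schedule`, and all function names are hypothetical. The straight-through estimator is omitted because it only affects gradients, and the RMS normalization has no learned gain for brevity.

```python
import math

def rms_norm(x, eps=1e-6):
    """Rescale vector x to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization (assumed scheme, not confirmed by the source).

    Returns (codes, scale) with codes in {-1, 0, 1} and weights ~= scale * codes.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    codes = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return codes, scale

def blended_weights(weights, lam):
    """Mix full-precision and ternary weights; lam=0 is full precision, lam=1 is ternary."""
    codes, scale = ternarize(weights)
    return [[(1 - lam) * w + lam * scale * c for w, c in zip(w_row, c_row)]
            for w_row, c_row in zip(weights, codes)]

def quant_linear(x, weights, lam):
    """RMS-normalize the input, then apply a bias-free linear projection."""
    x_normed = rms_norm(x)
    w = blended_weights(weights, lam)
    return [sum(w_i * x_i for w_i, x_i in zip(row, x_normed)) for row in w]

def lam_schedule(step, warmup_steps):
    """Hypothetical linear ramp of lam from 0 to 1; the paper's schedule may differ."""
    return min(1.0, step / warmup_steps)

# Example: a 2x3 weight matrix applied to a 3-dimensional input.
W = [[0.9, -0.05, -1.2], [0.3, 0.0, -0.4]]
x = [2.0, -1.0, 0.5]
for step in (0, 50, 100):
    lam = lam_schedule(step, warmup_steps=100)
    print(step, lam, quant_linear(x, W, lam))
```

In this sketch a layer starts at `lam = 0` (the original checkpoint's weights) and ends at `lam = 1` (purely ternary weights times a single scale). A layer-wise schedule could, under the same assumptions, give each layer its own `warmup_steps` or start offset.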

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
