CL, CV, LG | Nov 28, 2025

TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies

arXiv:2511.23225v1 | 4 citations
Originality: Highly original
AI Analysis

TWEO enables efficient FP8 training and quantization of large Transformers without complex engineering, which lowers computational costs for AI practitioners.

The paper addresses extreme activation outliers, which hinder FP8 training of Transformers. It shows that these outliers are mechanically produced artifacts of weight matrix colinearity rather than a data-driven effect, and it introduces TWEO, a non-invasive loss function that reduces outliers from over 10000 to under 20. This enables full-model FP8 pre-training with a 36% throughput increase and SOTA performance in W8A8 per-tensor static quantization.

Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
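
To illustrate why outliers make per-tensor static quantization so fragile, here is a minimal sketch. It is not taken from the paper: it assumes symmetric INT8 quantization with a single fixed scale per tensor, and the Gaussian activations, the helper names, and the two calibration maxima (20 and 10000, borrowed from the magnitudes quoted in the abstract) are illustrative assumptions only.

```python
import random

def quantize_per_tensor_int8(values, scale):
    """Symmetric per-tensor INT8: one static scale shared by every element."""
    quantized = []
    for v in values:
        q = round(v / scale)
        q = max(-127, min(127, q))  # clamp to the signed 8-bit range
        quantized.append(q)
    return quantized

def dequantize(quantized, scale):
    """Map integer codes back to real values using the same scale."""
    return [q * scale for q in quantized]

def mean_abs_error(values, scale):
    """Average round-trip error after quantizing and dequantizing."""
    restored = dequantize(quantize_per_tensor_int8(values, scale), scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
# Hypothetical activations: typical values drawn from a unit Gaussian.
typical = [random.gauss(0.0, 1.0) for _ in range(4096)]

# Scale calibrated for a tensor whose largest magnitude is about 20.
scale_small = 20.0 / 127
# Scale calibrated for a tensor that contains an outlier of magnitude 10000.
scale_large = 10000.0 / 127

err_small = mean_abs_error(typical, scale_small)
err_large = mean_abs_error(typical, scale_large)

print(f"step size, max 20:    {scale_small:.4f}")
print(f"step size, max 10000: {scale_large:.4f}")
print(f"mean abs error on typical values, max 20:    {err_small:.4f}")
print(f"mean abs error on typical values, max 10000: {err_large:.4f}")
```

With the scale set by a 10000-magnitude outlier, the quantization step is far larger than any typical activation, so every typical value rounds to zero. With a maximum near 20, the same 8 bits keep a fine enough step to preserve most of the signal, which is consistent with the abstract's claim that removing extreme outliers makes per-tensor static W8A8 usable.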

