LG · AI · DC · May 30, 2025

Recipes for Pre-training LLMs with MXFP8

arXiv:2506.08027v2 · 14 citations · h-index: 7
AI Analysis

This work addresses efficiency challenges for AI researchers and practitioners by enabling more efficient LLM pre-training with minimal accuracy loss, though it is incremental as it builds on existing quantization techniques.

The paper tackles GPU efficiency in pre-training large language models (LLMs) by using the MXFP8-E4M3 datatype together with a specific number conversion algorithm. Training runs with this recipe match BF16 accuracy for models of up to 8B parameters trained on datasets of up to 15T tokens.

Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats, introduced in the NVIDIA Blackwell generation of GPUs, represent a major advancement of this technique, making it practical to combine narrow floating-point data types with finer-granularity per-block scaling factors. In turn, this enables both quantization of more tensors than previous approaches and more efficient execution of operations on those tensors. Effective use of MX formats requires careful choices of various parameters. In this paper we review these choices and show how the MXFP8-E4M3 datatype and a specific number conversion algorithm result in training sessions that match those carried out in BF16. We present results using models with up to 8B parameters, trained on high-quality datasets of up to 15T tokens.
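To make the idea of per-block scaling concrete, the following is a minimal NumPy sketch of MX-style quantization. It is an illustration under stated assumptions, not the paper's implementation: the block size of 32 and the power-of-two shared scale follow the general MX format design, the E4M3 rounding is simulated in float64 rather than done in hardware, and rounding the scale exponent up (so that no element overflows) is one plausible conversion choice that the abstract itself does not spell out. The function names `round_to_e4m3`, `mx_quantize`, and `mx_dequantize` are hypothetical.

```python
import numpy as np

BLOCK = 32          # assumed MX block size: elements sharing one scale
E4M3_MAX = 448.0    # largest finite E4M3 magnitude
E4M3_MIN_EXP = -6   # exponent of the smallest normal E4M3 value
MANT_BITS = 3       # explicit mantissa bits in E4M3


def round_to_e4m3(x):
    """Simulate round-to-nearest-even onto the E4M3 grid, saturating at +-448."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.abs(x)
    # Per-element binade; zeros get a dummy exponent and still round to zero.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    # Below the smallest normal, use the fixed subnormal spacing.
    exp = np.maximum(exp, E4M3_MIN_EXP)
    quantum = 2.0 ** (exp - MANT_BITS)
    q = np.round(x / quantum) * quantum  # np.round rounds halves to even
    return np.clip(q, -E4M3_MAX, E4M3_MAX)


def mx_quantize(x):
    """Split x into blocks of 32 and give each block a power-of-two scale."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Assumption: round the scale exponent UP so amax / scale <= 448.
    scale_exp = np.clip(np.ceil(np.log2(safe / E4M3_MAX)), -127, 127)
    scale = 2.0 ** scale_exp
    return round_to_e4m3(blocks / scale), scale


def mx_dequantize(elems, scale):
    """Recover an approximation of the original values."""
    return (elems * scale).reshape(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(4 * BLOCK)
    elems, scale = mx_quantize(x)
    x_hat = mx_dequantize(elems, scale)
    print("max abs error:", np.max(np.abs(x_hat - x)))
```

Because each block of 32 values gets its own scale, an outlier only costs precision within its own block, which is the finer-granularity benefit the abstract describes relative to per-tensor scaling.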

