LG · Feb 2

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

arXiv:2602.02400v1 · 1 citation · h-index: 22
AI Analysis

This paper addresses training instability, a problem for LLM developers, and offers empirical insight into noise-induced divergence. Its contribution is incremental, since it builds on existing speculation about noise effects.

The study systematically investigates whether noisy data causes loss divergence in large language model pretraining. It finds that controlled synthetic noise does induce divergence, with a probability that depends on noise type, amount of noise, and model scale, and that noise-induced divergences show activation patterns distinct from those of failures caused by high learning rates.

Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.
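
To make the idea of "injecting controlled synthetic uniformly random noise into otherwise clean datasets" concrete, here is a minimal sketch of one way such an injection could look. It is not the paper's procedure: the abstract does not specify the noise types or how noise is mixed in, so the function name `inject_uniform_noise`, the choice to replace whole sequences, and the parameters `vocab_size`, `noise_fraction`, and `seed` are all assumptions made for illustration.

```python
import random


def inject_uniform_noise(sequences, vocab_size, noise_fraction, seed=0):
    """Replace a fraction of token sequences with uniformly random token IDs.

    Hypothetical illustration only: the noise types studied in the paper
    are not described in the abstract. Here "noise" means that a chosen
    sequence is overwritten, token by token, with IDs drawn uniformly
    from [0, vocab_size).
    """
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be in [0, 1]")
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")

    rng = random.Random(seed)  # seeded so the corruption is reproducible
    n_noisy = round(noise_fraction * len(sequences))
    noisy_idx = set(rng.sample(range(len(sequences)), n_noisy))

    mixed = []
    for i, seq in enumerate(sequences):
        if i in noisy_idx:
            # Same length as the original, but every token is random.
            mixed.append([rng.randrange(vocab_size) for _ in seq])
        else:
            # Clean sequences are copied unchanged.
            mixed.append(list(seq))
    return mixed, noisy_idx


# Usage: corrupt 30% of a toy "clean" dataset of 10 sequences.
clean = [[1, 2, 3, 4] for _ in range(10)]
mixed, noisy_idx = inject_uniform_noise(
    clean, vocab_size=100, noise_fraction=0.3, seed=0
)
```

Returning the set of corrupted indices alongside the mixed data keeps the noise amount an explicit, controlled variable, which is the kind of knob a study like this would vary together with model scale.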
