CL · May 11

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

Xu Guo, Runyu Peng, Jian Tong, Yunhua Zhou, Haijun Lv, Zhihui Lu, Qipeng Guo

arXiv:2605.10129 · 94.5 · Has Code

Predicted impact: top 14% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners training large language models on noisy web-scale data, this method offers a lightweight way to improve data efficiency and robustness.

The paper proposes a synthetic pre-pre-training (PPT) stage that improves language model robustness to noisy pre-training data, achieving the same final loss as the baseline while using up to 49% fewer natural-text tokens for a 1B-parameter model.

Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.
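
To make the idea of a "noise level" concrete, the following is a minimal sketch of one possible corruption setting: each token is independently replaced by a uniformly random vocabulary id with a fixed probability. The abstract does not specify the paper's corruption schemes, so the function name `corrupt_tokens`, its parameters, and the uniform-replacement rule are all assumptions made for illustration and are not taken from the authors' code.

```python
import random


def corrupt_tokens(token_ids, noise_rate, vocab_size, seed=0):
    """Hypothetical corruption scheme (an assumption, not the paper's method).

    Each token is replaced, independently with probability `noise_rate`,
    by an id drawn uniformly from range(vocab_size).
    Returns the corrupted sequence and a boolean mask marking which
    positions were replaced.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    corrupted = []
    mask = []
    for tok in token_ids:
        if rng.random() < noise_rate:
            # Draw a replacement id; it may coincide with the original token.
            corrupted.append(rng.randrange(vocab_size))
            mask.append(True)
        else:
            corrupted.append(tok)
            mask.append(False)
    return corrupted, mask


if __name__ == "__main__":
    clean = [5, 17, 42, 8, 99, 3, 64, 21]
    noisy, replaced = corrupt_tokens(clean, noise_rate=0.3, vocab_size=100)
    print(noisy)
    print(replaced)
```

Returning the mask alongside the corrupted sequence is a design choice for this sketch: it would let an analysis distinguish corrupted from clean positions, for example when inspecting attention between corrupted tokens as the abstract describes.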
