CL AI | Oct 28, 2025

Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, Jaejin Lee

arXiv:2510.24139v1 | h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses data quality issues in LLM pretraining for researchers and practitioners, but it is incremental as it builds on existing filtering techniques.

The paper tackles the problem that traditional line-level filtering can discard valuable content in LLM pretraining corpora. It introduces pattern-aware line-level deduplication and pattern-aware trailing punctuation filtering, which improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on SQuAD v1 and KorQuAD v1.

While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
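
To make the problem concrete, here is a minimal sketch of the two conventional baselines the abstract names: a trailing-punctuation filter and line-level deduplication. It is not the paper's PLD or PTF, whose details the abstract does not give. The function names, the punctuation set, and the `max_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed set of terminal punctuation marks; real pipelines vary.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def trailing_punctuation_filter(doc: str) -> str:
    """Keep only the lines that end in a terminal punctuation mark."""
    kept = [
        line
        for line in doc.splitlines()
        if line.rstrip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


def line_level_dedup(docs: list[str], max_count: int = 1) -> list[str]:
    """Drop a line once it has already appeared max_count times in the corpus."""
    seen: Counter[str] = Counter()
    cleaned = []
    for doc in docs:
        kept = []
        for line in doc.splitlines():
            key = line.strip()
            seen[key] += 1
            if seen[key] <= max_count:
                kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned


# A heading and a list item carry structure but lack trailing punctuation,
# so the baseline filter removes them.
recipe = "Ingredients\n2 cups flour\nMix the flour with water.\nBake for 20 minutes."
filtered_recipe = trailing_punctuation_filter(recipe)

# The repeated "Answer:" marker is structural, yet the baseline dedup
# removes it from every document after the first.
qa_docs = [
    "Q: What is PLD?\nAnswer:\nA filtering method.",
    "Q: What is PTF?\nAnswer:\nAnother filtering method.",
]
deduped_docs = line_level_dedup(qa_docs)
```

In both toy cases the removed lines are exactly the kind of structurally important content the abstract says can be lost, which is the motivation it gives for also considering how such lines are distributed across a document.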
