CL · Mar 25

Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

arXiv:2603.24826 · 75.7 · h-index: 5
Predicted impact: top 80% in CL · last 90 days
Originality: Incremental advance
AI Analysis

The paper addresses how to optimize synthetic data generation for non-English languages and shows that source data quality is crucial. The findings are incremental, as they extend known principles to Portuguese.

The study investigated how synthetic rewriting of Portuguese text interacts with source data quality in language model pretraining. At the 7B scale, rewriting high-quality data improved performance by 3.4 NPM over the same data unmodified, while rewriting low-quality data improved it by only 0.5 NPM.

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
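The arithmetic behind the abstract's headline numbers can be sketched as follows. This is an illustrative calculation only, not the authors' code: the variable names are hypothetical, and the token estimate assumes each rewrite is roughly as long as its source document, which the abstract implies but does not state.

```python
# Figures are taken from the abstract; all names are illustrative.
SOURCE_TOKENS_B = 10  # tokens per quality subset, in billions
NUM_STYLES = 4        # rewriting styles applied to each subset

# Assumption: each rewrite is roughly as long as its source document.
synthetic_tokens_b = SOURCE_TOKENS_B * NUM_STYLES

# Reported 7B-scale gains of rewritten over unmodified data, in NPM points.
gain_7b = {"high_quality": 3.4, "low_quality": 0.5}

# Interaction contrast: how much more rewriting helps on high-quality data.
interaction_7b = round(gain_7b["high_quality"] - gain_7b["low_quality"], 1)

print(f"synthetic tokens per condition: ~{synthetic_tokens_b}B")
print(f"quality x rewriting interaction at 7B: {interaction_7b:+.1f} NPM")
```

The contrast between the two gains, rather than either gain alone, is what supports the "quality multiplier" reading: if rewriting substituted for curation, the low-quality gain would be expected to match or exceed the high-quality one.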
