CL | May 30, 2025

Domain Pre-training Impact on Representations

Cesar Gonzalez-Gutierrez, Ariadna Quattoni

arXiv:2505.24455v14.91 citations | h-index: 18 | EMNLP

Originality: Synthesis-oriented

AI Analysis

This study offers NLP practitioners insights into optimizing pre-training strategies, though the contribution is incremental because it builds on existing pre-training methods.

The study investigated how the choice of pre-training corpus affects transformer representation quality, finding that small specialized corpora can be effective and that combining generic and specialized corpora works best when the specialized corpus is distributionally similar to the target task.

This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
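The abstract does not say how distributional similarity between the target task and the specialized corpus is measured. As one illustrative possibility only, the sketch below compares unigram distributions with the Jensen-Shannon divergence. The function names, the whitespace tokenization, and the choice of divergence are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Map each lowercased, whitespace-split token to its relative frequency."""
    counts = Counter(token for text in texts for token in text.lower().split())
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    divergence = 0.0
    for token in set(p) | set(q):
        p_t, q_t = p.get(token, 0.0), q.get(token, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mixture distribution at this token
        if p_t > 0:
            divergence += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            divergence += 0.5 * q_t * math.log2(q_t / m_t)
    return divergence


# Hypothetical toy corpora, invented purely for illustration.
task_texts = ["the patient was given aspirin", "the patient reported chest pain"]
specialized_texts = ["aspirin reduces pain in the patient", "chest pain was reported"]
generic_texts = ["the match ended in a draw", "stocks fell sharply on monday"]

task_dist = unigram_distribution(task_texts)
div_specialized = js_divergence(task_dist, unigram_distribution(specialized_texts))
div_generic = js_divergence(task_dist, unigram_distribution(generic_texts))
```

A lower divergence means the corpus is lexically closer to the task data. On these toy inputs the specialized corpus scores lower than the generic one, but a unigram comparison is a coarse proxy and may not reflect the notion of similarity used in the paper.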
