CL, Sep 8, 2025

LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection

arXiv:2509.06524v1
1 citation
h-index: 10
Originality: Highly original
AI Analysis

This work addresses data selection for domain adaptation of LLMs, proposing a method that the authors report is more efficient and more accurate than existing approaches.

The paper tackles the adaptation of large language models (LLMs) to specific domains, where scarce high-quality data is the bottleneck. It introduces LAMDAS, which uses the LLM itself as an implicit classifier for data selection. The authors report that it outperforms nine SOTA baselines while using a fraction of the data.

Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to achieve both goals simultaneously. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization processes. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS, using only a fraction of the data, not only exceeds the performance of full-data training but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency among all evaluated baselines.
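
The one-class framing can be illustrated with a toy sketch. This is not the LAMDAS scoring procedure, which the abstract does not specify. As a stand-in for the LLM, it fits smoothed unigram models on the reference set and on the candidate pool, and keeps candidates whose mean log-likelihood ratio exceeds a threshold. The names `fit_unigram`, `domain_score`, `select`, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def fit_unigram(texts, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probabilities over a fixed vocabulary.

    Assumption: a unigram model stands in for the LLM purely for illustration.
    """
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def domain_score(text, domain_lm, general_lm):
    """Mean per-token log-likelihood ratio: domain model vs. general model."""
    toks = [t for t in text.lower().split() if t in domain_lm]
    if not toks:
        return float("-inf")
    return sum(domain_lm[t] - general_lm[t] for t in toks) / len(toks)


def select(candidates, reference, threshold=0.0):
    """One-class selection: keep candidates that look like the reference set."""
    vocab = {tok for t in candidates + reference for tok in t.lower().split()}
    domain_lm = fit_unigram(reference, vocab)    # target domain, small reference set
    general_lm = fit_unigram(candidates, vocab)  # unchecked candidate pool
    return [c for c in candidates
            if domain_score(c, domain_lm, general_lm) > threshold]


# Hypothetical toy data: a small reference set and a noisy candidate pool.
reference = [
    "the patient received a dose of insulin",
    "insulin dose was adjusted for the patient",
]
candidates = [
    "the patient needs an insulin dose",
    "the football match ended in a draw",
    "stock prices fell sharply today",
]
selected = select(candidates, reference)
print(selected)
```

Only the reference set defines the target class. No negative examples are labeled, which is what makes this a one-class decision rather than a binary classifier trained on both classes.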

