CL AI | Mar 3, 2025

Rethinking Data: Towards Better Performing Domain-Specific Small Language Models

Boris Nazarov, Darya Frolova, Yackov Lubarsky, Alexei Gaissinski, Pavel Kisilev

arXiv:2503.01464v1 | 2.71 citations | h-index: 13 | 2024 IEEE Globecom Workshops (GC Wkshps)

Originality: Incremental advance

AI Analysis

This work addresses the computational cost limitations of deploying large language models commercially by enhancing small models for domain-specific applications, though it is incremental in nature.

The paper tackles the problem of poor performance in domain-specific small language models by improving data quality throughout the training pipeline, achieving high accuracy in multiple-choice question answering tasks.

Fine-tuning Large Language Models (LLMs) on domain-specific data for downstream tasks has shown significant promise. However, commercial use of such LLMs is limited by the high computational cost of deploying them at scale. Small Language Models (LMs), on the other hand, are much more cost-effective but perform worse in a similar setup. This paper presents our approach to fine-tuning a small LM that reaches high accuracy on a multiple-choice question answering task. We achieve this by improving data quality at each stage of the LM training pipeline. In particular, we start with data structuring, which extracts compact, semantically meaningful text chunks for use by a retriever. This allows the LM to digest knowledge more efficiently. Further, we improve the retrieved context by training a lightweight Chunk Re-Ranker (CRR) that generates more accurate relative relevance scores for chunks. Finally, we improve the model's generalization ability by merging models fine-tuned with different parameters on different data subsets. We present detailed descriptions of each procedure, along with experimental findings that show the improvement contributed by each of the proposed techniques.
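
The abstract does not say how the fine-tuned models are merged. As a minimal sketch only, assuming the simplest common choice (a weighted average of corresponding parameters across checkpoints that share one architecture), the merging step might look like the following. The `merge_state_dicts` function and the plain-list `StateDict` type are hypothetical names introduced for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def merge_state_dicts(
    models: List[StateDict], weights: Optional[List[float]] = None
) -> StateDict:
    """Merge checkpoints by weighted parameter averaging (an assumed method)."""
    if not models:
        raise ValueError("need at least one model to merge")
    if weights is None:
        # Default to a uniform average over all checkpoints.
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    names = models[0].keys()
    merged: StateDict = {}
    for name in names:
        size = len(models[0][name])
        for m in models:
            # All checkpoints must share the same parameter names and shapes.
            if m.keys() != names or len(m[name]) != size:
                raise ValueError(f"incompatible parameter: {name}")
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(norm, models)) for i in range(size)
        ]
    return merged


# Two toy "checkpoints", e.g. fine-tuned on different data subsets.
model_a = {"layer.weight": [1.0, 2.0]}
model_b = {"layer.weight": [3.0, 4.0]}
uniform = merge_state_dicts([model_a, model_b])
weighted = merge_state_dicts([model_a, model_b], weights=[3.0, 1.0])
```

With real checkpoints the same loop would run over tensors rather than lists, and the per-model weights could be tuned on a held-out set; whether the authors do either is not stated in the abstract.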
