CL · Nov 10, 2025

Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

arXiv:2511.07380v1 · 4.93 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the challenge of applying LLMs in low-resource domains such as medicine and finance, where data is scarce, by providing an efficient method for leveraging auxiliary data. The contribution is incremental, since it builds on existing NTK and fine-tuning techniques.

The paper tackles data scarcity in low-resource domains for large language models by proposing NTK-Selector, a framework that uses neural tangent kernels to select valuable auxiliary data from general domains. It reports gains of up to +8.7 points, which is 10.9x the gain obtained from domain-only fine-tuning.

Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, vast amounts of similar general-domain data exist, and our initial findings reveal that this data could serve as auxiliary supervision for domain enhancement. This observation leads us to our central research question: how to effectively select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable due to a lack of large in-domain data pools or validation sets. To address this, we propose NTK-Selector, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via neural tangent kernels (NTK). Our method tackles two challenges of directly applying NTK to LLMs, namely theoretical assumptions and prohibitive computational cost, by empirically demonstrating a stable NTK-like behavior in LLMs during LoRA fine-tuning and proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone yielded only +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led to substantial gains of +8.7 and +5.1 points, which correspond to 10.9x and 5.7x improvements over the domain-only setting.
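
To make the idea of kernel-based auxiliary data selection concrete, here is a minimal sketch on a toy model. It is not the authors' NTK-Selector: the abstract does not specify the scoring rule or the Jacobian-free approximation, so everything below is an assumption made for illustration. The sketch estimates the empirical NTK, K(x, x') = <grad f(x), grad f(x')>, without forming Jacobians by taking finite differences along random Gaussian directions, then ranks auxiliary samples by their mean kernel value against the in-domain samples. The names `model`, `ntk_features`, `select_auxiliary`, and `HIDDEN` are hypothetical.

```python
import numpy as np

HIDDEN = 4  # hidden width of the toy model (illustrative assumption)


def model(theta, X):
    """Toy scalar model: one tanh hidden layer. theta packs W (HIDDEN x d) and v (HIDDEN)."""
    d = X.shape[-1]
    W = theta[: HIDDEN * d].reshape(HIDDEN, d)
    v = theta[HIDDEN * d:]
    return np.tanh(X @ W.T) @ v  # shape (n,)


def ntk_features(theta, X, num_proj=256, eps=1e-4, seed=0):
    """Random-projection features whose inner products approximate the empirical NTK.

    Each column is a finite-difference estimate of J(x) @ u for a random
    direction u ~ N(0, I), so no Jacobian is ever materialized.
    """
    rng = np.random.default_rng(seed)
    base = model(theta, X)
    feats = np.empty((X.shape[0], num_proj))
    for j in range(num_proj):
        u = rng.standard_normal(theta.shape)
        feats[:, j] = (model(theta + eps * u, X) - base) / eps
    # Scaling makes feats @ feats.T an unbiased-style estimate of J @ J.T.
    return feats / np.sqrt(num_proj)


def select_auxiliary(theta, X_aux, X_dom, k, **kwargs):
    """Return indices of the k auxiliary samples with the highest mean kernel
    value against the in-domain samples (an assumed scoring rule)."""
    # Stack so auxiliary and in-domain samples share the same random directions.
    feats = ntk_features(theta, np.vstack([X_aux, X_dom]), **kwargs)
    f_aux, f_dom = feats[: len(X_aux)], feats[len(X_aux):]
    scores = (f_aux @ f_dom.T).mean(axis=1)
    return np.argsort(-scores)[:k], scores
```

For an LLM, `model` would be replaced by a scalar function of the network output (for example a loss or logit) and `theta` by the trainable parameters such as LoRA weights. The cost is then one extra forward pass per random direction rather than a full per-sample gradient.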
