CL · AI · Feb 3

Task-Specificity Score: Measuring How Much Instructions Really Matter for Supervision

arXiv:2602.03103v1 · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses ambiguous supervision in instruction tuning for large language models and offers a practical tool for data selection. The advance is incremental, since it builds on existing filtering methods.

The paper tackles the problem of weakly specified instructions in instruction tuning by proposing the Task-Specificity Score (TSS) to measure how much an instruction uniquely determines the target output, and shows that selecting task-specific examples improves downstream performance under tight token budgets across three datasets and three LLMs.

Instruction tuning is now the default way to train and adapt large language models, but many instruction-input-output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: does the instruction uniquely determine the target output? We propose the Task-Specificity Score (TSS) to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce TSS++, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
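The abstract does not state the exact formula for TSS, so the following is only a rough sketch of one plausible reading of "contrasting the true instruction against plausible alternatives": the log-likelihood of the output under the true instruction, minus the log of the mean likelihood under the alternatives. The names `contrastive_specificity`, `select_under_budget`, and `toy_log_prob` are hypothetical, the toy scorer stands in for a real language model, and the greedy budget selection is an assumption rather than the paper's procedure. TSS++ (hard alternatives plus a quality term) is not modelled here.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer: returns log p(output | instruction, input) under some model.
LogProbFn = Callable[[str, str, str], float]


def contrastive_specificity(
    log_prob: LogProbFn,
    instruction: str,
    alternatives: Sequence[str],
    inp: str,
    output: str,
) -> float:
    """Assumed score: log p(y | x, true instruction) minus the
    log-mean-exp of log p(y | x, alternative) over the alternatives."""
    if not alternatives:
        raise ValueError("need at least one alternative instruction")
    true_lp = log_prob(instruction, inp, output)
    alt_lps = [log_prob(alt, inp, output) for alt in alternatives]
    # Numerically stable log of the mean likelihood under the alternatives.
    m = max(alt_lps)
    log_mean_exp = m + math.log(sum(math.exp(lp - m) for lp in alt_lps) / len(alt_lps))
    return true_lp - log_mean_exp


def select_under_budget(
    scored: Sequence[Tuple[float, int, str]], budget: int
) -> List[str]:
    """Greedy selection (an assumption): take examples in descending score
    order, skipping any that would exceed the token budget.
    Each item is (score, token_count, example_id)."""
    chosen: List[str] = []
    used = 0
    for _score, n_tokens, example_id in sorted(scored, key=lambda t: t[0], reverse=True):
        if used + n_tokens <= budget:
            chosen.append(example_id)
            used += n_tokens
    return chosen


def toy_log_prob(instruction: str, inp: str, output: str) -> float:
    # Stand-in for a real LM: rewards word overlap between instruction and output.
    overlap = len(set(instruction.lower().split()) & set(output.lower().split()))
    return -5.0 + overlap
```

Under this reading, a score near zero means the output is about as likely under the alternatives as under the true instruction (a weakly specified pair), while a large positive score means the instruction does real work in determining the output. In practice `toy_log_prob` would be replaced by summed token log-probabilities from an actual model.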

