#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
This work addresses the need to better understand and optimize SFT datasets for large language models, which matters to researchers and practitioners aiming to improve instruction-following capabilities. The contribution is incremental in that it builds on existing SFT methods.
The authors propose InsTag, a method that tags instruction samples by semantics and intention, and use the tags to define and quantify the diversity and complexity of supervised fine-tuning (SFT) datasets for large language models. They find that model ability improves with more diverse and complex data. Models fine-tuned on 6K samples selected this way outperform open-source models trained on considerably larger SFT datasets on MT-Bench, which underlines the importance of both factors.
Foundation language models acquire instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, yet their definitions remain unclear and quantitative analyses are lacking. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions, and we define instruction diversity and complexity in terms of tags. We obtain 6.6K tags to describe comprehensive user queries. We then analyze popular open-source SFT datasets and find that model ability grows with more diverse and complex data. Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source datasets and fine-tune models on InsTag-selected data. The resulting models, TagLM, outperform open-source models trained on considerably larger SFT data, as evaluated by MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag at https://github.com/OFA-Sys/InsTag.
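To make the tag-based notions of diversity and complexity more concrete, here is a minimal Python sketch. It is not the released InsTag code. The abstract does not spell out the formulas, so the sketch assumes that a sample's complexity is the number of tags attached to it and that a dataset's diversity is the number of distinct tags it covers. The greedy selector, the function names, and the toy tags are all hypothetical and only illustrate one way to pick samples that are both complex and diverse.

```python
from typing import Dict, List, Set


def complexity(tags: Set[str]) -> int:
    """Assumed complexity measure: number of tags on one sample."""
    return len(tags)


def diversity(samples: Dict[str, Set[str]]) -> int:
    """Assumed diversity measure: number of distinct tags across samples."""
    covered: Set[str] = set()
    for tags in samples.values():
        covered |= tags
    return len(covered)


def select(samples: Dict[str, Set[str]], budget: int) -> List[str]:
    """Hypothetical greedy selector (an illustration, not the paper's code).

    Samples are visited from most to least complex. Within one pass, a
    sample is kept only if it contributes at least one tag that the pass
    has not covered yet. Passes repeat until the budget is filled or no
    sample can be added.
    """
    # Sort by tag count (descending); break ties by id for determinism.
    remaining = sorted(samples, key=lambda sid: (-complexity(samples[sid]), sid))
    selected: List[str] = []
    while remaining and len(selected) < budget:
        covered: Set[str] = set()  # tags covered during this pass
        leftover: List[str] = []
        for sid in remaining:
            adds_new_tag = not samples[sid] <= covered
            if len(selected) < budget and adds_new_tag:
                selected.append(sid)
                covered |= samples[sid]
            else:
                leftover.append(sid)
        if len(leftover) == len(remaining):
            break  # no progress, e.g. only untagged samples are left
        remaining = leftover
    return selected


# Toy data: sample id -> set of tags (made up for illustration).
toy = {
    "a": {"math", "code", "python"},
    "b": {"math", "code"},
    "c": {"poetry"},
    "d": {"math"},
}
picked = select(toy, budget=2)
picked_diversity = diversity({sid: toy[sid] for sid in picked})
```

With this toy data, `picked` contains the most complex sample plus the one sample that adds an uncovered tag, so the two selected samples already cover every tag in the pool. A real pipeline would first need a tagger to produce the tag sets, a step this sketch leaves out.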