AS / AI / CL, Apr 12, 2025

SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

MIT
arXiv:2504.09081v2 | 11 citations | h-index: 46 | ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for large-scale, multilingual datasets to improve instruction-following capabilities in speech-text LLMs, representing an incremental advancement in the field.

The authors address instruction fine-tuning for speech-text large language models by introducing SIFT-50M, a 50M-example multilingual dataset built from 14K hours of speech. They report that SIFT-LLM, the model trained on this dataset, outperforms existing speech-text LLMs on instruction-following benchmarks while remaining competitive on foundational speech tasks.

We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.

