CL AI AS

Nov 3, 2023

COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

arXiv:2311.02248v2

9.141 citations

h-index: 34

Originality: Incremental advance

AI Analysis

This work addresses the challenge of data-efficient speech processing, but the advance is incremental because it builds on existing LLM and speech technologies.

The paper tackles the integration of speech into large language models for in-context learning. It proposes COSMIC, a cost-effective method that uses GPT-3.5 to generate instruction-tuning data from speech transcriptions, and reports a maximum 33.18 BLEU score in 0-shot EN-to-X speech-to-text translation and an average 25.8% relative WER reduction in 1-shot cross-domain adaptation.

We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters and only 450 hours of English speech data, COSMIC demonstrates emerging capabilities in instruction-following and in-context learning. Equipped with such capabilities, COSMIC achieves a maximum 33.18 BLEU score in 0-shot EN-to-X speech-to-text translation (S2TT) and a significant boost in the 1-shot setting. Additionally, there is an average 25.8% relative Word Error Rate (WER) reduction for 1-shot cross-domain adaptation. COSMIC exhibits a significant automatic speech recognition (ASR) accuracy gain in contextual biasing tasks due to its instruction-following capability.
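For readers unfamiliar with the "relative WER reduction" metric quoted above, the following is a minimal sketch of how such a figure is usually computed. It is not taken from the paper: the function names are hypothetical, the WER is the standard word-level edit distance divided by the reference length, and any numbers passed in are illustrative rather than results from COSMIC.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by the adapted system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer
```

Under this definition, a hypothetical drop from 20% WER (0-shot) to 15% WER (1-shot) is a 25% relative reduction, even though the absolute change is only 5 points. An average figure such as the one in the abstract would be the mean of this quantity over several test domains, assuming the authors follow the usual convention.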

