AI · CL · SD · AS · Jun 20, 2024

Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

arXiv:2406.14701v1 · 4 citations
AI Analysis

This work addresses constraints in applying LLMs to ASR, particularly for Indic languages, but it is incremental because it builds on existing prefixLM-type models.

The paper tackles the problem of applying large language models (LLMs) to automatic speech recognition (ASR) by proposing speech prefix-tuning with RNNT loss. The approach improves ASR performance without increasing model complexity or altering the inference pipeline. It yields a 12% relative WER improvement over the baseline with a fine-tuned LLM, and a 31% relative improvement over basic soft-prompting prefixLM with a frozen LLM.
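The "relative WER improvement" figures quoted above are ratios against the baseline error rate, not absolute percentage-point drops. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, since the summary does not report absolute WERs.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the fractional WER reduction relative to the baseline.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper: a baseline WER of 10.0%
# falling to 8.8% is a 1.2-point absolute drop, i.e. roughly 12% relative.
improvement = relative_wer_improvement(10.0, 8.8)
print(f"relative improvement: {improvement:.0%}")
```

Under this definition, the same relative figure can correspond to very different absolute gains depending on how high the baseline WER is.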

In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a real-time test set from 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Our recognition results, averaged over the 10 Indic languages, show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.
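To make the training objective more concrete, here is a minimal sketch of an RNN-T negative log-likelihood in plain Python, computed with the standard forward recursion over the (frame, label) lattice. It is not the authors' implementation. The abstract does not say how the RNNT loss is combined with the LLM loss, so `combined_loss` and its `rnnt_weight` are assumptions, and the toy log-probabilities stand in for whatever joint network scores the speech prefix.

```python
import math


def logsumexp2(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b)), tolerating -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss(log_probs, targets, blank=0):
    """Negative log-likelihood of `targets` under an RNN-T lattice.

    log_probs[t][u][k] is the log-probability of emitting symbol k at
    frame t after u target labels have already been emitted.
    Shape: T frames x (U + 1) label positions x vocabulary size.
    """
    T, U = len(log_probs), len(targets)
    # alpha[t][u]: log-probability of reaching node (t, u) of the lattice.
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            from_blank = -math.inf  # arrived by emitting blank at frame t - 1
            from_label = -math.inf  # arrived by emitting targets[u - 1] at frame t
            if t > 0:
                from_blank = alpha[t - 1][u] + log_probs[t - 1][u][blank]
            if u > 0:
                from_label = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]]
            alpha[t][u] = logsumexp2(from_blank, from_label)
    # Every alignment ends with a final blank emitted from node (T - 1, U).
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


def combined_loss(llm_ce_loss, rnnt, rnnt_weight=1.0):
    """Hypothetical multi-task objective: LLM cross-entropy plus weighted RNNT loss."""
    return llm_ce_loss + rnnt_weight * rnnt


# Toy example: 3 frames, 2 target labels, vocabulary of 4 (index 0 is blank),
# with a uniform distribution at every lattice node.
T, U, V = 3, 2, 4
uniform = [[[math.log(1.0 / V)] * V for _ in range(U + 1)] for _ in range(T)]
aux = rnnt_loss(uniform, [1, 2])
print(combined_loss(llm_ce_loss=2.0, rnnt=aux, rnnt_weight=0.5))
```

In this sketch the RNNT term is an auxiliary training signal only. Dropping it at inference leaves the decoding path untouched, which is consistent with the abstract's statement that the inference pipeline is not altered.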

