AS · CL · LG · Jan 28

Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

arXiv:2601.20898v1
Originality: Incremental advance
AI Analysis

This work targets the instability that prompt choice introduces in LLM-based ASR systems by making them less sensitive to prompt design. The contribution is incremental, since it builds on existing LLM-based ASR architectures.

The paper tackles the problem of prompt sensitivity in LLM-based speech recognition by analyzing how prompt choice affects performance and proposing a learnable prompt projector to reduce variability. Experiments on four datasets show that this method consistently improves performance and outperforms manually selected prompts.

LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.
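The prompt projector idea can be illustrated with a small, framework-free sketch. The abstract does not specify the module's architecture, so everything below is an assumption made for illustration: the residual two-layer form, the zero-initialized output layer (so the projector starts as an identity map over the prompt embeddings), the ordering of prompt and speech embeddings, and all function names (`make_projector`, `project_prompt`, `build_llm_input`) are hypothetical and not taken from the paper.

```python
import random

def make_projector(dim, hidden, seed=0):
    """Create parameters for a hypothetical residual prompt projector."""
    rng = random.Random(seed)
    # Small random input layer (hidden x dim).
    w_in = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
    # Zero output layer (dim x hidden): the projector initially changes nothing.
    w_out = [[0.0] * hidden for _ in range(dim)]
    return {"w_in": w_in, "w_out": w_out}

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def project_prompt(params, prompt_embeds):
    """Map each prompt embedding e to e + W_out @ relu(W_in @ e)."""
    projected = []
    for e in prompt_embeds:
        h = [max(0.0, x) for x in matvec(params["w_in"], e)]
        delta = matvec(params["w_out"], h)
        projected.append([x + d for x, d in zip(e, delta)])
    return projected

def build_llm_input(params, prompt_embeds, speech_embeds):
    """Assemble the LLM input sequence (assumed order: prompt, then speech).

    Only the prompt embeddings pass through the projector; the speech
    embeddings, as produced by the speech-to-LLM projector, are untouched.
    """
    return project_prompt(params, prompt_embeds) + speech_embeds

# Toy usage: 3 prompt tokens and 5 speech frames in a 4-dimensional space.
params = make_projector(dim=4, hidden=8)
prompt = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
speech = [[0.5] * 4 for _ in range(5)]
sequence = build_llm_input(params, prompt, speech)
print(len(sequence), len(sequence[0]))
```

In a setup like this, only the projector parameters would be trained while the speech encoder, the speech-to-LLM projector, and the LLM stay frozen, which matches the abstract's description of an extension that does not modify the underlying model.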

