AS · AI · CL · SD · May 31, 2025

CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models

arXiv:2506.12059v1 · 2 citations · h-index: 55 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the challenge that complex speech scenarios pose for ASR systems. The advance is incremental because it builds on existing methods such as pretrained encoders and LLMs.

The paper tackles overlapping speech and rare-word recognition in automatic speech recognition by proposing a unified framework that integrates multi-talker ASR and contextual biasing. It reports a WER of 7.9% on LibriMix and 32.9% on AMI SDM with a biasing list size of 1,000.

In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing separately, limiting performance in complex scenarios. We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our ASR method integrates pretrained speech encoders and large language models (LLMs), using optimized finetuning strategies. We also introduce a two-stage filtering algorithm to efficiently identify relevant rare words from large biasing lists and incorporate them into the LLM's prompt input, enhancing rare word recognition. Experiments show that our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM when the biasing size is 1,000, demonstrating its effectiveness in complex speech scenarios.
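
The abstract does not say what the two filtering stages are, so the following is only an illustrative sketch of how a coarse-then-fine filter over a biasing list could feed an LLM prompt. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the use of a first-pass hypothesis, the trigram and similarity scores, the thresholds, the prompt wording, and the function names `coarse_filter`, `fine_filter` and `build_prompt`.

```python
from difflib import SequenceMatcher


def char_ngrams(word, n=3):
    """Character n-grams of a lowercased word padded with '#' markers."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coarse_filter(hyp_words, biasing_list, min_overlap=0.3):
    """Stage 1 (assumed): cheap pass that keeps biasing words sharing enough
    character trigrams with at least one word of a first-pass hypothesis."""
    hyp_grams = [char_ngrams(w) for w in hyp_words]
    return [
        cand for cand in biasing_list
        if any(jaccard(char_ngrams(cand), g) >= min_overlap for g in hyp_grams)
    ]


def fine_filter(hyp_words, candidates, min_ratio=0.7, top_k=5):
    """Stage 2 (assumed): costlier string-similarity pass over the survivors,
    returning at most top_k words ordered by best similarity."""
    scored = []
    for cand in candidates:
        best = max(
            (SequenceMatcher(None, cand.lower(), w.lower()).ratio()
             for w in hyp_words),
            default=0.0,
        )
        if best >= min_ratio:
            scored.append((best, cand))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cand for _, cand in scored[:top_k]]


def build_prompt(selected_words):
    """Hypothetical prompt template; the paper's actual prompt is not given."""
    hint = ", ".join(selected_words) if selected_words else "none"
    return f"Possible rare words: {hint}. Transcribe the overlapping speech."


if __name__ == "__main__":
    hypothesis = "the patient was given acetaminofen".split()
    biasing = ["acetaminophen", "ibuprofen", "zebra", "Kubernetes"]
    shortlist = fine_filter(hypothesis, coarse_filter(hypothesis, biasing))
    print(build_prompt(shortlist))
```

The point of the two-pass shape is cost. The cheap first pass shrinks a list of around 1,000 entries to a handful, so the more expensive comparison runs on few candidates and only a short hint reaches the prompt.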

