AS AI Jun 9, 2024

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

arXiv:2406.05839v2
5 citations
Originality: Incremental advance
AI Analysis

This work aims to improve ASR accuracy in multimedia-rich settings such as conferences. It is an incremental advance: it uses an LLM to integrate auxiliary textual information into recognition.

The paper improves automatic speech recognition (ASR) of conference content by feeding textual keywords extracted from presentation slides to an LLM-based model. It reports average word error rates (WERs) of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, which are relative WER reductions of 27.9% and 44.7% over the SlideSpeech baseline. Adding the keywords to the input prompt reduces the biased WER (B-WER) by a relative 46.0% and 44.2%.
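The figures above mix absolute WERs with relative reductions, so a small worked example may help keep the two apart. The following is a minimal sketch, not the paper's evaluation code: `wer`, `relative_reduction`, and `implied_baseline` are illustrative names, and the baseline recovered by back-solving is only an approximation, since the summary does not state the baseline WERs directly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic Levenshtein dynamic programme over words, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, reduction: float) -> float:
    """Back-solve the baseline WER from a new WER and a relative reduction.

    Assumption: the reduction was computed on the same averaged WER.
    """
    return new_wer / (1.0 - reduction)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(relative_reduction(20.0, 15.0))
    # Approximate baselines implied by the reported numbers (not reported values).
    print(implied_baseline(9.4, 0.279), implied_baseline(11.7, 0.447))
```

A relative reduction of 27.9% therefore does not mean the WER fell by 27.9 percentage points; it means roughly 28% of the baseline's errors were eliminated.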

As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLMs can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing significant relative WER drops of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores the strong performance of LLMs in speech tasks and their capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word error rate (B-WER) falls by a relative 46.0% and 44.2%, establishing a new SOTA on this dataset.
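The abstract says keywords are added to the input prompt but does not give the prompt format. The sketch below shows one plausible way to assemble such a prompt; the template wording, the `build_prompt` name, and the `max_keywords` cap are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable


# Hypothetical instruction text; the prompt actually used by MaLa-ASR is not given here.
BASE_INSTRUCTION = "Transcribe speech to text."
KEYWORD_PREFIX = "Keywords from the slides that may help: "


def build_prompt(keywords: Iterable[str], max_keywords: int = 50) -> str:
    """Build a text prompt that optionally carries slide keywords.

    Keywords are stripped, de-duplicated case-insensitively (first spelling
    wins), and truncated to `max_keywords` to keep the prompt short.
    """
    seen = set()
    cleaned = []
    for keyword in keywords:
        text = keyword.strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    cleaned = cleaned[:max_keywords]

    # With no usable keywords, fall back to the plain ASR instruction.
    if not cleaned:
        return BASE_INSTRUCTION
    return f"{BASE_INSTRUCTION} {KEYWORD_PREFIX}{', '.join(cleaned)}."


if __name__ == "__main__":
    print(build_prompt([" LLM ", "llm", "", "SlideSpeech"]))
    print(build_prompt([]))
```

Falling back to the plain instruction when no keywords are available keeps the same model usable for slides with no extractable text, which is one reason a prompt-based design is convenient for auxiliary information.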

Code Implementations: 1 repo
