Oct 10, 2023

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

arXiv:2310.06434v2 | 155 citations | h-index: 40 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses error correction in speech recognition, offering a novel approach that could benefit applications requiring high transcription accuracy, though it appears incremental as it builds on existing pre-trained models.

The paper tackles generative error correction in automatic speech recognition by introducing a cross-modal fusion technique that uses acoustic and linguistic information to improve transcription accuracy, achieving a 37.66% relative improvement in word error rate compared to n-best hypotheses.

We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost the ASR performance obtained from pre-trained speech and text models. Through experiments across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique and demonstrate a 37.66% relative improvement in word error rate (WERR) over the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
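The headline figure is a relative word error rate improvement (WERR), which is easy to confuse with an absolute drop in WER. The sketch below shows one common way to compute both quantities. It is not taken from the paper's repository: the function names, the toy sentences, and the sentence-level scoring are illustrative assumptions, and the abstract does not say how the authors aggregate WER over a dataset.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def werr(baseline_wer, corrected_wer):
    """Relative WER improvement of the corrected output over the baseline."""
    if baseline_wer == 0:
        raise ValueError("baseline WER is zero; relative improvement is undefined")
    return (baseline_wer - corrected_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical toy sentences, not data from the paper.
    reference = "the cat sat on the mat"
    baseline = "a cat sat in a mat"       # stands in for a top n-best hypothesis
    corrected = "the cat sat on a mat"    # stands in for a corrected transcript
    base, corr = wer(reference, baseline), wer(reference, corrected)
    print(f"baseline WER={base:.3f} corrected WER={corr:.3f} WERR={werr(base, corr):.1%}")
```

Under this definition, a WERR of 37.66% means the corrected transcripts retain about 62% of the baseline's word errors, not that WER fell by 37.66 percentage points.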

Code Implementations: 1 repo