AS AI Dec 26, 2024

Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

NVIDIA
arXiv:2412.19005v1
3 citations
h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses recognition of noisy and spontaneous speech in videos, with applications such as transcription services. The advance appears incremental because it builds on existing fine-tuning approaches.

The paper tackles the challenge of improving audiovisual speech recognition in unconstrained real-world scenarios by proposing a bifocal preference optimization method that simulates errors from both input and output perspectives, resulting in significant accuracy improvements across various domains.

Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.
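The abstract does not state the training objective, so the following is only a minimal sketch of how input-side and output-side preferences could be combined, assuming a DPO-style pairwise loss. The function names `dpo_loss` and `bifocal_loss`, the `beta` and `weight` parameters, and the way the input-side pair is formed (the same transcript scored under clean versus manipulated audio or vision input) are illustrative assumptions, not the authors' implementation.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO-style loss for one preference pair (assumed objective).

    Each argument is a sequence log-probability. "chosen" is the preferred
    item, "rejected" the simulated error. Returns -log(sigmoid(beta * margin)).
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bifocal_loss(output_pair, input_pair, beta=0.1, weight=1.0):
    """Hypothetical combination of the two preference types.

    output_pair: log-probs where the rejected item is a rewritten
                 (error-injected) transcript for the same input.
    input_pair:  log-probs where the rejected item is the same transcript
                 scored under manipulated audio or vision input.
    Both are 4-tuples in the argument order of dpo_loss.
    """
    out_term = dpo_loss(*output_pair, beta=beta)
    in_term = dpo_loss(*input_pair, beta=beta)
    return out_term + weight * in_term


# Toy log-probabilities, invented purely for illustration.
output_pair = (-1.0, -5.0, -3.0, -3.0)  # policy favors the clean transcript
input_pair = (-2.0, -4.0, -3.0, -3.0)   # policy favors the clean input
total = bifocal_loss(output_pair, input_pair)
```

Under these assumptions, each term falls as the model assigns relatively more probability to the preferred side than the reference model does, so minimizing the sum pushes the model away from both kinds of simulated error.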

Code Implementations: 1 repo
