CV AI MM AS SP
Aug 26, 2025

Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

arXiv:2508.18734v1
2 citations
h-index: 6
Originality: Highly original
AI Analysis

The paper addresses noise robustness in audio-visual speech recognition (AVSR) for applications such as hearing aids and voice assistants. It delivers a strong, specific gain by applying a novel method to a known bottleneck.

The paper tackles robust audio-visual speech recognition in noisy environments by proposing a router-gated cross-modal feature fusion framework that adaptively reweights audio and visual features based on acoustic corruption scores, achieving a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT on LRS3.
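The headline figure is a relative, not absolute, reduction in word error rate. As a minimal sketch of how such a figure is computed, the snippet below uses the standard definition (baseline minus new, divided by baseline). The function name and the WER values 10.0 and 8.0 are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are word error rates in the same unit (for example, percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (NOT taken from the paper):
# a baseline at 10.0% WER and a new system at 8.0% WER.
example = relative_wer_reduction(10.0, 8.0)
print(f"{example:.2f}% relative reduction")
```

Under this definition, the same relative reduction corresponds to a smaller absolute gain when the baseline WER is already low, which is one reason results are usually reported per noise condition.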

Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
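The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the router architecture, the gating formula, or any dimensions. The linear-plus-sigmoid router, the scalar gate taken as the mean corruption score, and every function and variable name below are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def corruption_scores(audio, video, w, b):
    """Hypothetical router: one score in (0, 1) per token, where values
    near 1 mean the audio token is judged unreliable."""
    fused = np.concatenate([audio, video], axis=-1)  # (T, 2D)
    return sigmoid(fused @ w + b)                    # (T,)


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention without projections."""
    logits = queries @ memory.T / np.sqrt(queries.shape[-1])  # (N, T)
    return softmax(logits) @ memory                           # (N, D)


def gated_fusion_layer(dec, audio, video, scores):
    """One decoder-layer step (assumed form): down-weight corrupted audio
    tokens, then add visual context scaled by a gate."""
    reliable_audio = audio * (1.0 - scores)[:, None]
    audio_ctx = cross_attention(dec, reliable_audio)
    visual_ctx = cross_attention(dec, video)
    gate = scores.mean()  # assumption: a scalar gate from the mean score
    return dec + audio_ctx + gate * visual_ctx


rng = np.random.default_rng(0)
T, N, D = 6, 3, 8                 # frames, decoder tokens, feature size
audio = rng.normal(size=(T, D))
video = rng.normal(size=(T, D))
dec = rng.normal(size=(N, D))
w = rng.normal(size=2 * D)        # random router weights (untrained)

# A large negative or positive bias forces the router to treat the audio
# as clean or as fully corrupted, to show the two extremes.
clean = corruption_scores(audio, video, w, b=-50.0)
noisy = corruption_scores(audio, video, w, b=50.0)
out_clean = gated_fusion_layer(dec, audio, video, clean)
out_noisy = gated_fusion_layer(dec, audio, video, noisy)
```

In the clean extreme the layer reduces to ordinary audio cross-attention, and in the noisy extreme the audio context vanishes and the output depends on the visual stream alone. That is the pivot toward the visual modality that the abstract describes, here in a deliberately simplified form.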

