AS, CL, LG | Feb 11, 2025

MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

arXiv:2502.10447v2 | 4 citations | h-index: 12 | ICML
Originality: Highly original
AI Analysis

This paper addresses scalability constraints in AVSR for robust speech recognition in noisy environments. It is a strong incremental advance with novel architectural improvements.

The paper tackles the problem of scaling audio-visual speech recognition (AVSR) systems without compromising computational efficiency. It introduces MoHAVE, a framework that uses a Mixture-of-Experts architecture to adapt dynamically to its inputs, and reports state-of-the-art performance on benchmarks such as LRS3 and MuAViC.

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
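The abstract does not spell out how the hierarchical gating is computed, so the following is only a minimal sketch of a generic two-level sparse router, not the paper's implementation. It assumes two expert groups (audio and visual), a softmax group-level router, and top-k routing inside each group. All names (`HierarchicalMoE`, `top_k_gate`, `experts_per_group`) and the use of plain linear maps as experts are hypothetical choices made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_gate(logits, k):
    """Return (index, weight) pairs for the k largest logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return list(zip(idx, softmax([logits[i] for i in idx])))


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class HierarchicalMoE:
    """Two-level router: group weights first, then top-k experts per group.

    Assumption: experts are plain linear maps; a real AVSR layer would use
    feed-forward blocks and learned (not random) router weights.
    """

    def __init__(self, dim, experts_per_group=4, k=2, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.groups = ("audio", "visual")
        self.group_router = rand_matrix(rng, len(self.groups), dim)
        self.expert_routers = {
            g: rand_matrix(rng, experts_per_group, dim) for g in self.groups
        }
        self.experts = {
            g: [rand_matrix(rng, dim, dim) for _ in range(experts_per_group)]
            for g in self.groups
        }

    def forward(self, x):
        # Level 1: how much to trust each modality-specific group.
        group_w = softmax(matvec(self.group_router, x))
        out = [0.0] * len(x)
        active = 0
        for gw, g in zip(group_w, self.groups):
            # Level 2: pick k experts inside the group; the rest never run.
            logits = matvec(self.expert_routers[g], x)
            for i, gate in top_k_gate(logits, self.k):
                y = matvec(self.experts[g][i], x)
                out = [o + gw * gate * yi for o, yi in zip(out, y)]
                active += 1
        return out, group_w, active


moe = HierarchicalMoE(dim=8)
token = [0.5] * 8
output, group_weights, n_active = moe.forward(token)
print(len(output), n_active)
```

In this sketch only `k` experts per group are evaluated for each token, which is where the "minimal computational overhead" of a sparse MoE comes from: total capacity grows with `experts_per_group` while per-token compute grows only with `k`.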

