SD CV MM AS
Dec 10, 2022

Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng

arXiv:2212.05301v220.235 citations
h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses noise robustness in audio-visual speech recognition (AVSR), a domain-specific problem for speech recognition systems. It introduces a novel integration strategy, but the advance is incremental because it builds on existing fusion methods.

The paper tackles the tendency of audio-visual speech recognition (AVSR) models to over-rely on audio in clean conditions, which reduces their robustness to noise. It uses visual modality-specific representations to provide complementary information. The proposed reinforcement learning framework, MSRL, dynamically harmonizes these representations with the fused modality-invariant ones during decoding. It achieves state-of-the-art results on the LRS3 dataset in both clean and noisy conditions and generalizes better to unseen noise.
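The idea of harmonizing two representations at each decoding step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each branch yields per-token logits and that the agent's action is a single scalar weight in [0, 1]; the function names `softmax` and `integrate_step` are hypothetical, and the summary does not confirm that MSRL combines its representations this way.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_step(invariant_logits, specific_logits, weight):
    """Blend two next-token distributions for one decoding step.

    Assumption: `weight` is the action chosen by a (hypothetical) agent;
    1.0 trusts the modality-invariant branch, 0.0 the visual-specific one.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(invariant_logits) != len(specific_logits):
        raise ValueError("both branches must score the same vocabulary")
    p_inv = softmax(invariant_logits)
    p_spec = softmax(specific_logits)
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_inv, p_spec)]
```

Under this assumption, an agent facing heavy audio noise would pick a lower weight so that the visual-specific distribution dominates the next-token choice.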

Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.
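Since the reward is tied to word error rate (WER), a short sketch of that metric may help. WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The `wer_reward` function below is a hypothetical shaping (negative WER), given only as an assumption; the abstract does not state the exact form of the reward.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward: negative WER, so fewer errors score higher."""
    return -word_error_rate(reference, hypothesis)
```

With this shaping, a perfect transcript earns a reward of 0 and every word error lowers it, so maximizing the reward corresponds to minimizing WER.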
