SD LG MM AS Jan 23, 2025

Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement

arXiv:2501.13375v2, 3 citations, h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses speech enhancement in noisy environments, a task central to applications such as hearing aids and communication systems. It adds linguistic information to audio-visual enhancement to bridge the gaps between modalities, an incremental advance over existing audio-visual methods.

The paper integrates audio, visual, and linguistic modalities for speech enhancement through DLAV-SE, a diffusion-based framework that embeds linguistic knowledge via cross-modal knowledge transfer. The authors report significantly improved speech quality and fewer generative artifacts than state-of-the-art methods.

Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. Given that human speech communication naturally involves audio, visual, and linguistic modalities, it is reasonable to expect additional improvements by integrating linguistic information. However, effectively bridging these modality gaps, particularly during knowledge transfer, remains a significant challenge. In this paper, we propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information for audio-visual speech enhancement (AVSE). Within this framework, the linguistic modality is modeled using a pretrained language model (PLM), which transfers linguistic knowledge to the audio-visual domain through a cross-modal knowledge transfer (CMKT) mechanism during training. After training, the PLM is no longer required at inference, as its knowledge is embedded into the AVSE model through the CMKT process. We conduct a series of SE experiments to evaluate the effectiveness of our approach. Results show that the proposed DLAV-SE system significantly improves speech quality and reduces generative artifacts, such as phonetic confusion, compared to state-of-the-art (SOTA) methods. Furthermore, visualization analyses confirm that the CMKT method enhances the generation quality of the AVSE outputs. These findings highlight both the promise of diffusion-based methods for advancing AVSE and the value of incorporating linguistic information to further improve system performance.
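
The abstract does not specify how CMKT is formulated, so the following is only a minimal illustrative sketch of the general idea: a training-time auxiliary term that pulls audio-visual features toward PLM embeddings, and that is simply dropped at inference. The function names, the cosine-based alignment term, the one-to-one pairing of feature vectors, and the `weight` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def alignment_loss(av_features, plm_embeddings):
    """Hypothetical transfer term: mean (1 - cosine similarity) over pairs.

    Assumes each audio-visual feature vector is already paired with one
    PLM embedding of the same dimension (an assumption, not the paper's
    stated alignment scheme).
    """
    sims = [cosine_similarity(a, p) for a, p in zip(av_features, plm_embeddings)]
    return 1.0 - sum(sims) / len(sims)


def training_loss(enhancement_loss, av_features, plm_embeddings, weight=0.1):
    """Training objective: enhancement loss plus the weighted transfer term.

    At inference only the audio-visual model runs, so neither the PLM
    embeddings nor this extra term are needed.
    """
    return enhancement_loss + weight * alignment_loss(av_features, plm_embeddings)


if __name__ == "__main__":
    av = [[1.0, 0.0], [0.0, 1.0]]    # toy audio-visual features
    plm = [[0.0, 1.0], [1.0, 0.0]]   # toy PLM embeddings, orthogonal to av
    print(training_loss(2.0, av, plm, weight=0.5))
```

In this toy setup the transfer term is near zero when the two feature sets point in the same directions and grows as they diverge, which is the sense in which linguistic knowledge could be "embedded" during training without the PLM being present afterwards.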

