MM · AI · LG · SD · AS · Jun 2, 2024

Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach

arXiv:2406.00901v1
Originality: Incremental advance
AI Analysis

This paper addresses speech in-painting in challenging real-world environments, but the advance is incremental: it extends existing techniques with multi-modal robustness.

The paper tackles the problem of reconstructing missing parts of speech audio in multi-modal scenarios where both audio and visual data may be corrupted. The result is a sequence-to-sequence model that outperforms the state-of-the-art transformer solution by 38.8% in speech quality and 7.14% in speech intelligibility.

The process of reconstructing missing parts of speech audio from context is called speech in-painting. Human perception of speech is inherently multi-modal, involving both audio and visual (AV) cues. In this paper, we introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features. Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted. To achieve this, we employ a multi-modal training paradigm that boosts the robustness of our model across various conditions involving acoustic and visual distortions. This makes our distortion-aware model a plausible solution for real-world challenging environments. We compare our method with existing transformer-based and recurrent neural network-based models, which attempt to reconstruct missing speech gaps ranging from a few milliseconds to over a second. Our experimental results demonstrate that our novel seq2seq architecture outperforms the state-of-the-art transformer solution by 38.8% in terms of enhancing speech quality and 7.14% in terms of improving speech intelligibility. We exploit a multi-task learning framework that simultaneously performs lip-reading (transcribing video components to text) while reconstructing missing parts of the associated speech.
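The multi-task idea in the abstract (reconstructing the missing speech while also lip-reading) can be illustrated with a minimal sketch of a combined objective. Everything specific here is an assumption for illustration, not the paper's formulation: the function names, the choice of an L1 reconstruction term restricted to the gap frames, the cross-entropy term for the transcript, and the weight `lam` are all hypothetical.

```python
import math

def masked_l1(pred, target, gap_mask):
    """Mean absolute error over frames flagged as missing (gap_mask[i] == 1).

    pred and target are lists of spectrogram frames (lists of floats).
    Assumption: only the in-painted gap contributes to the loss.
    """
    total, count = 0.0, 0
    for p_frame, t_frame, is_gap in zip(pred, target, gap_mask):
        if is_gap:
            for p, t in zip(p_frame, t_frame):
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0

def cross_entropy(token_probs, tokens):
    """Mean negative log-likelihood of the reference tokens.

    token_probs[i] is a probability distribution over the vocabulary
    predicted at step i; tokens[i] is the reference token index.
    """
    if not tokens:
        return 0.0
    nll = -sum(math.log(probs[y]) for probs, y in zip(token_probs, tokens))
    return nll / len(tokens)

def multitask_loss(pred, target, gap_mask, token_probs, tokens, lam=0.1):
    """Hypothetical joint objective: in-painting loss + lam * lip-reading loss."""
    return masked_l1(pred, target, gap_mask) + lam * cross_entropy(token_probs, tokens)

# Toy example: two frames, the second one is the gap to be in-painted.
pred = [[0.0, 0.0], [1.0, 1.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
gap_mask = [0, 1]
token_probs = [[0.5, 0.5]]  # one decoding step over a two-token vocabulary
tokens = [0]
loss = multitask_loss(pred, target, gap_mask, token_probs, tokens)
```

In this sketch the auxiliary lip-reading term only shapes training; at inference time a model of this kind would still be judged on the reconstructed audio.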
