MM AI LG SD AS Jun 2, 2024

Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach

Mahsa Kadkhodaei Elyaderani, Shahram Shirani

arXiv:2406.00901v11.2

Originality: Incremental advance

AI Analysis

The paper addresses speech in-painting for challenging real-world environments, but the advance is incremental: it extends existing audio-visual in-painting techniques with robustness to multi-modal corruption.

The paper tackles the reconstruction of missing parts of speech audio in multi-modal scenarios where both audio and visual data may be corrupted. The resulting sequence-to-sequence model outperforms the state-of-the-art transformer solution by 38.8% in speech quality and 7.14% in speech intelligibility.

The process of reconstructing missing parts of speech audio from context is called speech in-painting. Human perception of speech is inherently multi-modal, involving both audio and visual (AV) cues. In this paper, we introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features. Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted. To achieve this, we employ a multi-modal training paradigm that boosts the robustness of our model across various conditions involving acoustic and visual distortions. This makes our distortion-aware model a plausible solution for real-world challenging environments. We compare our method with existing transformer-based and recurrent neural network-based models, which attempt to reconstruct missing speech gaps ranging from a few milliseconds to over a second. Our experimental results demonstrate that our novel seq2seq architecture outperforms the state-of-the-art transformer solution by 38.8% in terms of enhancing speech quality and 7.14% in terms of improving speech intelligibility. We exploit a multi-task learning framework that simultaneously performs lip-reading (transcribing video components to text) while reconstructing missing parts of the associated speech.
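The multi-task idea in the abstract (reconstructing a missing speech gap while also transcribing the video) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the L1 reconstruction term, the cross-entropy term for lip-reading, and the `weight` parameter are all assumptions chosen for illustration, and the abstract does not state which losses or weights the paper uses.

```python
import math

def mask_gap(frames, start, length):
    """Zero out frames [start, start + length) and return (corrupted, mask).

    `frames` is a list of feature vectors (e.g. spectrogram frames).
    mask[t] is 1 where the frame is missing, 0 where it is intact.
    """
    mask = [1 if start <= t < start + length else 0 for t in range(len(frames))]
    corrupted = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    return corrupted, mask

def gap_l1(pred, target, mask):
    """Mean absolute error computed over the missing frames only."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += len(t)
    return total / count if count else 0.0

def cross_entropy(token_probs, tokens):
    """Mean negative log-likelihood of the correct token at each step."""
    return -sum(math.log(p[y]) for p, y in zip(token_probs, tokens)) / len(tokens)

def multitask_loss(pred, target, mask, token_probs, tokens, weight=0.1):
    """Hypothetical combined objective: in-painting loss plus a weighted
    lip-reading (transcription) loss. `weight` is an assumed hyperparameter."""
    return gap_l1(pred, target, mask) + weight * cross_entropy(token_probs, tokens)

# Toy data: four 2-dimensional frames, with frames 1 and 2 removed.
target = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
corrupted, mask = mask_gap(target, start=1, length=2)

# Toy lip-reading output: per-step probabilities over a 2-token vocabulary.
token_probs = [[0.5, 0.5], [0.25, 0.75]]
tokens = [0, 1]

# Using the corrupted input as the "prediction" gives the loss of doing nothing.
baseline = multitask_loss(corrupted, target, mask, token_probs, tokens)
# A perfect reconstruction leaves only the transcription term.
perfect = multitask_loss(target, target, mask, token_probs, tokens)
```

Restricting the reconstruction term to the masked frames keeps the objective focused on the gap, and the shared weight lets the transcription task act as an auxiliary signal rather than dominate training.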
