SD · CV · MM · AS · Jul 10, 2024

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

arXiv:2407.07825v1 · 8 citations · h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses the problem of low-delay speech enhancement for real-time applications, though it is incremental as it re-designs an existing model for causal inference.

The paper tackles real-time audio-visual speech enhancement from a live video stream and a noisy audio stream without using future inputs. It reports state-of-the-art results in all real-time scenarios on the AVSpeech dataset, with an end-to-end processing latency of 28.15ms per 40ms input frame.

In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, to perform causal real-time inference with a 40ms input frame. We do so by devising new visual and audio encoders that rely solely on past frames, replacing the Transformer encoder with the Emformer, and designing a new causal neural vocoder C-HiFi-GAN. On the popular AVSpeech dataset, we show that our algorithm achieves state-of-the-art results in all real-time scenarios. More importantly, each component is carefully tuned to minimize the algorithm latency to the theoretical minimum (40ms) while maintaining a low end-to-end processing latency of 28.15ms per frame, enabling real-time frame-by-frame enhancement with minimal delay.
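The two latency figures in the abstract can be related with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes that the overall delay is the algorithmic latency (one buffered 40ms frame) plus the per-frame processing time, and that the audio is sampled at 16 kHz, a rate this summary does not state. The function name `latency_budget` is hypothetical.

```python
# Figures quoted in the abstract.
FRAME_MS = 40.0        # input frame length, also the algorithmic latency
PROCESSING_MS = 28.15  # reported end-to-end processing latency per frame

# ASSUMPTION: the sample rate is not given in this summary; 16 kHz is a guess.
SAMPLE_RATE_HZ = 16_000


def latency_budget(frame_ms: float, processing_ms: float) -> tuple[float, float, float]:
    """Return (real_time_factor, total_delay_ms, headroom_ms).

    real_time_factor < 1 means each frame is processed before the next arrives.
    total_delay_ms assumes buffering time and processing time simply add up.
    headroom_ms is the slack left in each frame period.
    """
    real_time_factor = processing_ms / frame_ms
    total_delay_ms = frame_ms + processing_ms
    headroom_ms = frame_ms - processing_ms
    return real_time_factor, total_delay_ms, headroom_ms


rtf, total_ms, headroom_ms = latency_budget(FRAME_MS, PROCESSING_MS)
samples_per_frame = int(SAMPLE_RATE_HZ * FRAME_MS / 1000)
keeps_up = rtf < 1.0

print(f"real-time factor: {rtf:.3f}")
print(f"total delay: {total_ms:.2f} ms, headroom: {headroom_ms:.2f} ms")
print(f"samples per frame at the assumed rate: {samples_per_frame}")
```

Under these assumptions the processing time fits inside one frame period, which is the condition for frame-by-frame enhancement to keep pace with the incoming streams.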

