SD CL CV MM AS | Jul 8, 2022

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

arXiv:2207.03800v2 | 12.213 citations | h-index: 27 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses the need for efficient, high-quality lip-to-speech synthesis in applications such as video conferencing and assistive technologies. The advance is incremental, since it builds on prior sequence-to-sequence models.

The paper tackles speech generation from silent talking-face videos with no constraints on head pose or vocabulary. It proposes FastLTS, a non-autoregressive end-to-end model that synthesizes audio waveforms directly, and reports a 19.76x speedup in waveform generation over the current autoregressive model on 3-second inputs, along with superior audio quality.

Unconstrained lip-to-speech synthesis aims to generate the corresponding speech from silent videos of talking faces with no restriction on head pose or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audio, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from the spectrograms. This makes deployment cumbersome and degrades speech quality through error propagation; 2) The audio reconstruction algorithm used by these models limits inference speed and audio quality, while neural vocoders cannot be used with these models because their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither is efficient in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency and has a relatively small model size. In addition, in contrast to the widely used 3D-CNN visual frontend for lip movement encoding, we propose, for the first time, a transformer-based visual frontend for this task. Experiments show that our model achieves a $19.76\times$ speedup for audio waveform generation compared with the current autoregressive model on input sequences of 3 seconds, and obtains superior audio quality.
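The latency contrast between autoregressive and non-autoregressive decoding described in the abstract can be illustrated with a toy sketch. This is not the FastLTS architecture: the function names, the scalar "features", the toy update rules, and the 25 fps frame rate are all assumptions chosen for illustration. The sketch only counts how many decoding steps must run one after another, and that count is not the reported $19.76\times$ wall-clock figure.

```python
# Toy illustration only: scalar "features" stand in for visual embeddings,
# and the update rules are arbitrary. None of this is the FastLTS model.

def autoregressive_decode(visual_feats, step_fn):
    """Each output frame depends on the previous output, so steps cannot be parallelized."""
    outputs, prev = [], 0.0
    for feat in visual_feats:
        prev = step_fn(feat, prev)  # must wait for the previous frame
        outputs.append(prev)
    return outputs, len(visual_feats)  # one sequential step per frame


def non_autoregressive_decode(visual_feats, frame_fn):
    """Each output frame depends only on the input, so all frames fit in one parallel pass."""
    outputs = [frame_fn(feat) for feat in visual_feats]  # conceptually one batched call
    return outputs, 1  # a single sequential step


FPS = 25      # hypothetical video frame rate, not taken from the paper
SECONDS = 3   # input length used in the abstract's speed comparison
visual_feats = [0.1 * i for i in range(FPS * SECONDS)]

ar_out, ar_steps = autoregressive_decode(visual_feats, lambda f, p: 0.5 * p + f)
nar_out, nar_steps = non_autoregressive_decode(visual_feats, lambda f: 2.0 * f)

print(ar_steps, nar_steps)  # prints: 75 1
```

Under these assumptions the autoregressive loop needs 75 dependent steps for a 3-second clip, while the non-autoregressive pass needs one. Actual speedups depend on model size, hardware, and the waveform generation stage, which this sketch does not model.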
