SD LG AS | Jul 3, 2023

RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, Vineet Gandhi

arXiv:2307.01233v14.25 citations | h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating accurate speech from silent videos for applications like assistive technologies, though it is incremental as it builds on existing non-autoregressive architectures.

The paper tackles the problem of speaker-dependent lip-to-speech synthesis by proposing RobustL2S, a modular framework that disentangles speech content from ambient information and speaker characteristics, achieving state-of-the-art performance on datasets like Lip2Wav, GRID, and TCD-TIMIT.

Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that direct mel-prediction hampers training and model efficiency because speech content is entangled with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, which achieves state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/
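The two-stage design described in the abstract can be sketched as a simple composition of functions. This is only an illustration of the data flow, not the authors' implementation: `lip_to_speech`, `toy_content_model`, and `toy_vocoder` are hypothetical names, and the toy stages are arithmetic stand-ins for the learned sequence-to-sequence model and the vocoder.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one feature vector per time step (hypothetical type)


def lip_to_speech(
    visual_features: Sequence[Frame],
    content_model: Callable[[Sequence[Frame]], List[Frame]],
    vocoder: Callable[[Sequence[Frame]], List[float]],
) -> List[float]:
    """Compose the two stages: visual features -> content -> waveform."""
    # Stage 1: map visual features to speech content features.
    content = content_model(visual_features)
    # Stage 2: convert content features to waveform samples.
    return vocoder(content)


def toy_content_model(frames: Sequence[Frame]) -> List[Frame]:
    # Stand-in for the learned model (assumption): each output frame is
    # computed from the input frames only, never from earlier outputs,
    # which is the non-autoregressive property.
    return [[sum(f) / len(f)] for f in frames]


def toy_vocoder(content: Sequence[Frame], samples_per_frame: int = 4) -> List[float]:
    # Stand-in for a neural vocoder (assumption): repeat each content
    # value to upsample from frame rate to sample rate.
    wave: List[float] = []
    for c in content:
        wave.extend([c[0]] * samples_per_frame)
    return wave


waveform = lip_to_speech([[1.0, 3.0], [2.0, 4.0]], toy_content_model, toy_vocoder)
```

Because the stages only meet at the content representation, either one could in principle be swapped or retrained without touching the other, which is the kind of modularity the abstract argues for.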

