SD CV LG AS, May 4, 2022

SVTS: Scalable Video-to-Speech Synthesis

arXiv:2205.02058v2, 43 citations, h-index: 105
Originality: Incremental advance
AI Analysis

This work addresses the challenge of scaling video-to-speech synthesis to large, unconstrained datasets, which is important for applications in accessibility and human-computer interaction, though it is incremental in its method.

The authors tackle speech synthesis from silent lip movements with a scalable framework that achieves state-of-the-art results on GRID, outperforms previous methods on LRW, and is, to the authors' knowledge, the first to show intelligible results on the challenging LRS3 dataset.

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
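The two-component design described in the abstract can be illustrated with a minimal, shape-level sketch. This is not the authors' implementation: the function names, the placeholder arithmetic, and every constant (80 mel bins, 4 mel frames per video frame, 240 waveform samples per mel frame) are assumptions chosen only to show how a predictor and a vocoder compose and how sequence lengths grow at each stage.

```python
from typing import Callable, List

Frame = List[float]      # one flattened video frame (placeholder features)
MelFrame = List[float]   # one mel-spectrogram frame with N_MELS bins

# All constants are illustrative assumptions, not values taken from the paper.
N_MELS = 80                # mel bins per spectrogram frame
MELS_PER_VIDEO_FRAME = 4   # e.g. 25 fps video -> 100 mel frames per second
HOP_LENGTH = 240           # waveform samples per mel frame (e.g. 24 kHz audio)


def predict_spectrogram(video: List[Frame]) -> List[MelFrame]:
    """Hypothetical stand-in for the video-to-spectrogram predictor.

    A real predictor would be a trained feedforward network; this placeholder
    only reproduces the input/output shapes.
    """
    mel: List[MelFrame] = []
    for frame in video:
        mean = sum(frame) / len(frame)
        # Each video frame is expanded into several mel frames.
        for _ in range(MELS_PER_VIDEO_FRAME):
            mel.append([mean] * N_MELS)
    return mel


def vocode(mel: List[MelFrame]) -> List[float]:
    """Hypothetical stand-in for the pre-trained neural vocoder."""
    wave: List[float] = []
    for mel_frame in mel:
        level = sum(mel_frame) / len(mel_frame)
        # Each mel frame is expanded into HOP_LENGTH waveform samples.
        wave.extend([level] * HOP_LENGTH)
    return wave


def video_to_speech(
    video: List[Frame],
    predictor: Callable[[List[Frame]], List[MelFrame]] = predict_spectrogram,
    vocoder: Callable[[List[MelFrame]], List[float]] = vocode,
) -> List[float]:
    """Compose the two stages: silent video -> mel spectrogram -> waveform."""
    return vocoder(predictor(video))


if __name__ == "__main__":
    one_second = [[0.5] * 16 for _ in range(25)]  # 25 dummy video frames
    print(len(predict_spectrogram(one_second)), len(video_to_speech(one_second)))
```

Because the vocoder is passed in as a plain callable, the sketch mirrors the separation the abstract describes: only the spectrogram predictor would need training on video data, while the vocoder can be a fixed, pre-trained component.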

Code Implementations: 2 repos