CV | CL | Oct 23, 2020

Show and Speak: Directly Synthesize Spoken Description of Images

arXiv:2010.12267v2
3 citations
Originality: Incremental advance
AI Analysis

This work enables direct image-to-speech synthesis, a novel but incremental step in audio-visual integration.

The paper tackles the problem of generating spoken descriptions directly from images without intermediate text or phonemes, achieving natural-sounding speech synthesis as demonstrated on the Flickr8k benchmark.

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
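The data flow described in the abstract (image in, spectrogram out, vocoder last) can be illustrated with a toy sketch. This is not the authors' implementation: the weights are random, the sizes are hypothetical, and the attention step and frame-by-frame decoding are assumptions made for illustration, since the abstract only states that the model is an encoder-decoder. All function and variable names are invented here. The vocoder stage (WaveNet in the paper) is omitted, so the sketch stops at the predicted spectrogram.

```python
import math
import random

# Hypothetical toy sizes, chosen only to keep the sketch small.
N_REGIONS, REGION_DIM, FEAT_DIM, N_MELS = 4, 6, 8, 5


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_image(image_regions, W_enc):
    """Encoder: map each image-region vector to a feature vector."""
    return [[math.tanh(v) for v in matvec(W_enc, r)] for r in image_regions]


def decode_spectrogram(features, W_query, W_out, n_frames):
    """Decoder: emit one spectrogram frame per step, attending over image features."""
    frame = [0.0] * N_MELS  # all-zero start frame
    frames = []
    for _ in range(n_frames):
        # Assumption: the previous frame forms the attention query.
        query = matvec(W_query, frame)
        weights = softmax([sum(q * f for q, f in zip(query, feat)) for feat in features])
        context = [sum(w * feat[i] for w, feat in zip(weights, features))
                   for i in range(FEAT_DIM)]
        # Next frame is predicted from the attended context and the previous frame.
        frame = matvec(W_out, context + frame)
        frames.append(frame)
    return frames


rng = random.Random(0)
image_regions = rand_matrix(N_REGIONS, REGION_DIM, rng)  # stand-in for image inputs
W_enc = rand_matrix(FEAT_DIM, REGION_DIM, rng)
W_query = rand_matrix(FEAT_DIM, N_MELS, rng)
W_out = rand_matrix(N_MELS, FEAT_DIM + N_MELS, rng)

features = encode_image(image_regions, W_enc)
spectrogram = decode_spectrogram(features, W_query, W_out, n_frames=10)
print(len(spectrogram), len(spectrogram[0]))  # 10 5
```

No text or phoneme representation appears anywhere between the image features and the spectrogram frames, which is the property the paper emphasizes. In a real system a neural vocoder would then turn the spectrogram into a waveform.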

Code Implementations: 1 repo