AS | CL | SD | Jun 11, 2024

Translating speech with just images

arXiv:2406.07133v1 | 1 citation
Originality: Incremental advance
AI Analysis

This is an incremental approach to speech translation for low-resource languages like Yorùbá, potentially aiding communication in data-scarce settings.

The authors tackle speech translation for low-resource languages by mapping speech audio to text via images. Their Yorùbá-to-English model leverages pretrained components to cope with data scarcity. Results indicate that the predicted translations capture the main semantics of the spoken audio, but in a simpler and shorter form.

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
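The training-data idea in the abstract can be illustrated with a minimal sketch: each spoken utterance is paired with an image, an English captioner describes that image, and the captions become translation targets for the audio. Everything named below is hypothetical. `TOY_CAPTIONER` stands in for a pretrained captioning system, the identifiers are made up, and probability-weighted sampling is only one assumed way to obtain diverse captions; the abstract does not say which decoding scheme the authors use.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for a pretrained English image captioner:
# each image maps to candidate captions with made-up probabilities.
CaptionDist = Sequence[Tuple[str, float]]

TOY_CAPTIONER: Dict[str, CaptionDist] = {
    "img_001": [
        ("a dog runs on the grass", 0.5),
        ("a brown dog playing outside", 0.3),
        ("a dog in a field", 0.2),
    ],
}


def greedy_caption(image_id: str) -> str:
    """Return the single most probable caption (no diversity)."""
    return max(TOY_CAPTIONER[image_id], key=lambda c: c[1])[0]


def sampled_caption(image_id: str, rng: random.Random) -> str:
    """Draw one caption in proportion to its probability (assumed scheme)."""
    captions, probs = zip(*TOY_CAPTIONER[image_id])
    return rng.choices(captions, weights=probs, k=1)[0]


def build_training_pairs(
    speech_image_pairs: Sequence[Tuple[str, str]],
    n_captions: int,
    rng: random.Random,
    diverse: bool = True,
) -> List[Tuple[str, str]]:
    """Pair each utterance with n_captions English captions of its image."""
    pairs = []
    for audio_id, image_id in speech_image_pairs:
        for _ in range(n_captions):
            if diverse:
                caption = sampled_caption(image_id, rng)
            else:
                caption = greedy_caption(image_id)
            pairs.append((audio_id, caption))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    data = [("yo_utt_001", "img_001")]  # hypothetical Yorùbá utterance id
    for audio_id, caption in build_training_pairs(data, 5, rng):
        print(audio_id, "->", caption)
```

With greedy decoding every utterance sees the same target caption repeatedly, whereas sampling exposes the speech-to-text model to several paraphrases of the same image content, which is one plausible reading of why diverse captions help limit overfitting.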

Code Implementations: 1 repo