CL AS Oct 10, 2022

YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

Kayode Olaleye, Dan Oneata, Herman Kamper

arXiv:2210.04600v21.98 citations h-index: 29

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of building speech systems for unwritten or low-resource languages such as Yorùbá. The contribution is incremental, since it extends existing visually grounded speech (VGS) methods to a new language.

The authors tackle the lack of visually grounded speech datasets for low-resource languages by collecting and releasing a Yorùbá speech-image dataset. The dataset enables cross-lingual keyword localisation, in which a written English query is detected and located in Yorùbá speech. Performance is compared to English systems trained on similar and larger amounts of data.

Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá -- a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech. To quantify the effect of the smaller dataset, we compare to English systems trained on similar and more data. We hope that this new dataset will stimulate research in the use of VGS models for real low-resource languages.
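
To make the idea of attention-based keyword localisation concrete, the following is a minimal toy sketch, not the authors' model. It assumes an utterance has already been encoded as a list of per-frame feature vectors and that each written keyword has an embedding of the same size; the function `localise_keyword`, its parameters, and the dot-product scoring are all hypothetical choices made for illustration. Attention weights over frames give both a detection score (is the keyword present?) and a location (which frame attracts the most attention?).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def localise_keyword(frames, query, threshold=0.5):
    """Toy attention-based keyword detection and localisation.

    frames:    per-frame feature vectors for one utterance (assumed given).
    query:     embedding of a written keyword (assumed given).
    threshold: hypothetical decision threshold on the detection score.
    Returns (detected, probability, index of the peak-attention frame).
    """
    # Dot-product similarity between the query and every speech frame.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Attention weights over time.
    weights = softmax(scores)
    # Attention-pooled score, squashed to (0, 1) as a detection probability.
    pooled = sum(w * s for w, s in zip(weights, scores))
    prob = 1.0 / (1.0 + math.exp(-pooled))
    # The frame with the largest attention weight is the predicted location.
    peak = max(range(len(frames)), key=lambda t: weights[t])
    return prob >= threshold, prob, peak


# Four frames of 2-dimensional toy features; the third frame matches the query.
frames = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.0], [0.0, 0.0]]
detected, prob, peak = localise_keyword(frames, [1.0, 1.0])
```

In a trained system the frame features and keyword embeddings would be learned jointly, with the visual tags of the paired image serving as the training targets; here they are hand-picked numbers so that the mechanics are easy to follow.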
