CL · SD · AS · Apr 24, 2019

On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

arXiv:1904.10947v2 · 12 citations
Originality: Incremental advance
AI Analysis

This work aims to improve semantic speech retrieval for low-resource applications. The advance is incremental: it reuses a previously studied data set and task.

The paper studies semantic speech retrieval in low-resource settings, asking whether visual grounding remains useful when some textual supervision is also available. With about 5 hours of transcribed speech, adding visual supervision yields 23% higher average precision.

Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task of semantic speech retrieval in a low-resource setting. We use a previously studied data set and task, where models are trained on images with spoken captions and evaluated on human judgments of semantic relevance. We propose a multitask learning approach to leverage both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger. We find that visual grounding is helpful even in the presence of textual supervision, and we analyze this effect over a range of sizes of transcribed data sets. With ~5 hours of transcribed speech, we obtain 23% higher average precision when also using visual supervision.
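The multitask idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a shared speech model emits per-keyword probabilities from two output heads, that the visual target is the vector of soft keyword probabilities from the external tagger, that the textual target is a bag-of-words indicator derived from the transcript, and that the two losses are combined with a weight `alpha`. The function names, the cross-entropy form, and `alpha` are all hypothetical choices made for illustration.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and
    (possibly soft) target probabilities over a keyword vocabulary."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def multitask_loss(visual_pred, tagger_probs,
                   text_pred=None, bow_target=None, alpha=0.5):
    """Weighted sum of a visual loss and a textual loss (assumed form).

    visual_pred  : keyword probabilities from the (hypothetical) visual head
    tagger_probs : soft keyword probabilities from the external image tagger
    text_pred    : keyword probabilities from the (hypothetical) textual head
    bow_target   : 0/1 bag-of-words vector from the transcript, if available
    alpha        : assumed weight on the visual term
    """
    visual = bce(visual_pred, tagger_probs)
    if text_pred is None or bow_target is None:
        # Untranscribed utterance: only the visual term contributes.
        return alpha * visual
    textual = bce(text_pred, bow_target)
    return alpha * visual + (1.0 - alpha) * textual


# Toy usage with a two-keyword vocabulary.
loss_both = multitask_loss([0.5, 0.5], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0])
loss_visual_only = multitask_loss([0.5, 0.5], [1.0, 0.0])
```

Dropping the textual term for untranscribed utterances lets the same objective cover the case the abstract describes, where only part of the speech has transcriptions.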
