CL · CV · Jun 13, 2018

Visually grounded cross-lingual keyword spotting in speech

arXiv:1806.05030v1 · 34 citations
Originality: Incremental advance
AI Analysis

This could enable searching through speech in low-resource languages using text queries in high-resource languages, though the work is a proof of concept and an incremental advance.

The paper tackles cross-lingual keyword spotting in speech by using visual grounding as supervision, enabling retrieval of spoken utterances in one language (English) with text queries in another (German) without parallel data, achieving a precision at ten of 58%.

Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding these would improve P@10 to 91%.
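The precision at ten (P@10) figure quoted above can be illustrated with a short sketch. It assumes each keyword has a score per utterance and a set of utterances known to contain that keyword, and that the reported number is an average over keywords; the function names, the toy scores, and the averaging step are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Set


def precision_at_k(scores: Dict[str, float], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the k highest-scoring utterances that contain the keyword."""
    # Rank utterance ids by model score, highest first, and keep the top k.
    ranked = sorted(scores, key=lambda utt: scores[utt], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for utt in ranked if utt in relevant)
    return hits / len(ranked)


def mean_precision_at_k(
    per_keyword_scores: Dict[str, Dict[str, float]],
    per_keyword_relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average P@k over keywords (the averaging scheme is an assumption)."""
    values = [
        precision_at_k(utt_scores, per_keyword_relevant.get(keyword, set()), k)
        for keyword, utt_scores in per_keyword_scores.items()
    ]
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: one German keyword scored against four English utterances.
toy_scores = {"Hund": {"utt1": 0.9, "utt2": 0.8, "utt3": 0.1, "utt4": 0.7}}
toy_relevant = {"Hund": {"utt1", "utt4"}}

p_at_2 = precision_at_k(toy_scores["Hund"], toy_relevant["Hund"], k=2)
```

In this toy case the top two utterances are `utt1` and `utt2`, of which only `utt1` is relevant, so `p_at_2` is 0.5. Counting semantically related retrievals as correct, as the abstract describes, would amount to enlarging the `relevant` sets before scoring.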
