AS | AI | CL | SD | Mar 28, 2022

Word Discovery in Visually Grounded, Self-Supervised Speech Models

arXiv:2203.15081v5 | 52 citations | h-index: 16 | Has Code
Originality: Highly original
AI Analysis

This paper addresses unsupervised spoken term discovery in speech processing, offering a novel approach that uses visual grounding to improve word segmentation.

The authors tackle word discovery from speech without text supervision by training self-supervised speech models with visual grounding. Word segmentation and clustering capabilities emerge from this training, and the method performs on par with or better than existing methods on several metrics of the Buckeye and ZeroSpeech benchmarks.

We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we perform on par with or better than currently published methods on several metrics. Code and model weights are available at https://github.com/jasonppy/word-discovery.
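The abstract states that word segmentation capability emerges within the model's self-attention heads. One plausible way to read segments out of such a head is to threshold a per-frame attention score and group consecutive high-attention frames into spans. The sketch below illustrates only that idea in plain Python. It is an assumption made for illustration, not the authors' procedure: the function name `attention_segments`, the `keep_fraction` parameter, and the cutoff rule are all hypothetical, and the input is a plain list of floats rather than real model output.

```python
from typing import List, Tuple


def attention_segments(
    weights: List[float], keep_fraction: float = 0.5
) -> List[Tuple[int, int]]:
    """Group high-attention frames into (start, end) spans, end exclusive.

    Hypothetical readout: `weights` holds one attention score per speech
    frame, and the top `keep_fraction` of frames count as "attended".
    """
    if not weights:
        return []

    # Assumed cutoff rule: the score of the n_keep-th highest frame.
    ranked = sorted(weights, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    cutoff = ranked[n_keep - 1]

    segments: List[Tuple[int, int]] = []
    start = None
    for i, w in enumerate(weights):
        if w >= cutoff and start is None:
            start = i  # a new attended span opens here
        elif w < cutoff and start is not None:
            segments.append((start, i))  # span closes before frame i
            start = None
    if start is not None:
        segments.append((start, len(weights)))  # span runs to the last frame
    return segments


if __name__ == "__main__":
    toy = [0.0, 0.9, 0.8, 0.1, 0.0, 0.7, 0.6, 0.0]
    print(attention_segments(toy))
```

On the toy input, frames 1-2 and 5-6 carry the highest scores, so they form two candidate spans. Mapping frame indices to time would additionally require the model's frame rate, which is not stated here.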

Code Implementations: 3 repos
