ASCLIR, Sep 16, 2021

Fast-Slow Transformer for Visually Grounding Speech

arXiv:2109.08186v4, 34 citations
Originality: Incremental advance
AI Analysis

This work addresses visually grounding speech, a problem in multimodal AI. It is an incremental advance that combines two existing architectures, the dual encoder and cross-attention, in one model.

The paper tackles the problem of associating raw speech waveforms with visual images by introducing FaST-VGS, a Transformer-based model that unifies dual-encoder and cross-attention architectures, achieving state-of-the-art speech-image retrieval accuracy on benchmarks and strong performance on ZeroSpeech 2021 tasks.

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
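One way to get the speed of a dual encoder together with the accuracy of cross-attention is coarse-to-fine retrieval: score every candidate with a cheap dot product over pooled embeddings, then rerank only a short list with the more expensive pairwise scorer. The sketch below illustrates that idea in plain Python under loud assumptions. The function names (`fast_scores`, `slow_score`, `coarse_to_fine`), the mean pooling, and the toy attention scorer are all hypothetical stand-ins, not the learned Transformer components or the API of FaST-VGS.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors):
    """Average a list of vectors into one vector (assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fast_scores(query_vec, image_vecs):
    """Dual-encoder style scoring: one dot product per pooled image embedding."""
    return [dot(query_vec, v) for v in image_vecs]


def slow_score(query_frames, image_regions):
    """Toy stand-in for a cross-attention scorer (an assumption, not the paper's).

    Each speech frame attends over the image regions with softmax weights,
    and the score is the mean attention-weighted similarity.
    """
    total = 0.0
    for frame in query_frames:
        sims = [dot(frame, region) for region in image_regions]
        peak = max(sims)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in sims]
        norm = sum(weights)
        total += sum(w * s for w, s in zip(weights, sims)) / norm
    return total / len(query_frames)


def coarse_to_fine(query_frames, images, k):
    """Shortlist k images with the fast scorer, then rerank them with the slow one."""
    query_vec = mean_pool(query_frames)
    image_vecs = [mean_pool(regions) for regions in images]
    coarse = fast_scores(query_vec, image_vecs)
    shortlist = sorted(range(len(images)), key=lambda i: coarse[i], reverse=True)[:k]
    return sorted(
        shortlist,
        key=lambda i: slow_score(query_frames, images[i]),
        reverse=True,
    )


# Random toy features: 10 "images" of 3 regions each, one "utterance" of 5 frames.
random.seed(0)
dim = 4
images = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    for _ in range(10)
]
query_frames = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

ranked = coarse_to_fine(query_frames, images, k=3)
print(ranked)  # indices of the 3 shortlisted images, best slow score first
```

In this sketch the slow scorer runs only `k` times per query instead of once per image in the collection, which is where the speed advantage of the shortlist comes from.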

Code Implementations: 1 repo
