Fast-Slow Transformer for Visually Grounding Speech
This work addresses visually grounded speech learning, a problem relevant to multimodal AI, and advances it incrementally by combining two existing architectures: dual encoders and cross-attention.
The paper tackles the problem of associating raw speech waveforms with visual images. It introduces FaST-VGS, a Transformer-based model that unifies dual-encoder and cross-attention architectures, and it reports state-of-the-art speech-image retrieval accuracy on benchmark datasets along with strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance on the ZeroSpeech 2021 phonetic and semantic tasks.
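One plausible reading of the fast-slow combination is coarse-to-fine retrieval: a cheap dual-encoder similarity shortlists candidates, and a costlier cross-attention scorer re-ranks only that shortlist. The abstract does not spell out the mechanism, so the following is a minimal sketch under that assumption, not the authors' implementation. The names `coarse_to_fine_retrieve`, `slow_score`, and `k` are hypothetical; a plain dot product stands in for the dual-encoder similarity, and a toy lookup table stands in for the cross-attention scorer.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here as a stand-in for dual-encoder similarity."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_retrieve(
    query_vec: Vector,
    candidate_vecs: Sequence[Vector],
    slow_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return candidate indices ranked by the slow scorer, restricted to
    the top-k candidates under the fast scorer.

    slow_score is a hypothetical callable that maps a candidate index to
    a score; in a real system it would run cross-attention over the
    query and that candidate.
    """
    # Fast pass: one dot product per candidate (embeddings can be precomputed).
    fast = [dot(query_vec, c) for c in candidate_vecs]
    shortlist = sorted(
        range(len(candidate_vecs)), key=lambda i: fast[i], reverse=True
    )[:k]
    # Slow pass: the expensive scorer runs on the k shortlisted items only.
    return sorted(shortlist, key=slow_score, reverse=True)


# Toy data: one speech query embedding and three image embeddings.
query = [1.0, 0.0]
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_slow_scores = {0: 0.2, 1: 0.8, 2: 0.99}  # assumed values for illustration

ranking = coarse_to_fine_retrieve(query, images, toy_slow_scores.__getitem__, k=2)
print(ranking)  # [1, 0]
```

In this toy run, image 2 has the highest slow score but is never re-ranked, because the fast pass excludes it from the shortlist. That is the trade-off of this kind of scheme: the slow scorer's cost scales with `k` rather than with the size of the candidate pool, at the risk of missing items the fast pass ranks poorly.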