CV CL SD AS
Sep 19, 2019

Large-scale representation learning from visually grounded untranscribed speech

Gabriel Ilharco, Yuan Zhang, Jason Baldridge

arXiv:1909.08782v150.41023 citations

Originality: Highly original

AI Analysis

This work addresses the challenge of visually grounded language learning for AI systems, representing an incremental advance with strong specific gains in audio-image retrieval.

The paper addresses the problem of learning multimodal representations from untranscribed speech and images. It develops a scalable method to generate audio for image captioning datasets and trains a dual encoder with a masked margin softmax loss. The fine-tuned models reach state-of-the-art performance on the Flickr8k Audio Captions Corpus, improving recall in the top 10 from 29.6% to 49.5%.
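This summary names the masked margin softmax loss but does not define it. The sketch below is one plausible reading, not the authors' implementation. It assumes a softmax over in-batch similarities, computed in both retrieval directions, with a margin subtracted from each matching pair's score and a mask that drops selected non-matching pairs from the denominator. The function name, the default margin, and the mask convention are all hypothetical.

```python
import math


def masked_margin_softmax(sim, margin=0.1, mask=None):
    """Illustrative masked margin softmax over a square similarity matrix.

    sim[i][j] is the similarity of audio i and image j; the diagonal holds
    the matching pairs. mask[i][j] == 0 drops the off-diagonal pair (i, j)
    from the softmax denominator. The margin default is arbitrary.
    """
    n = len(sim)
    if mask is None:
        mask = [[1] * n for _ in range(n)]

    def one_direction(s, m):
        total = 0.0
        for i in range(n):
            # Penalize the positive score by the margin.
            pos = s[i][i] - margin
            terms = [pos] + [s[i][j] for j in range(n) if j != i and m[i][j]]
            # Log-sum-exp with max subtraction for numerical stability.
            top = max(terms)
            lse = top + math.log(sum(math.exp(t - top) for t in terms))
            total += lse - pos  # negative log-probability of the positive
        return total / n

    # Transpose to score the opposite retrieval direction.
    sim_t = [list(col) for col in zip(*sim)]
    mask_t = [list(col) for col in zip(*mask)]
    return one_direction(sim, mask) + one_direction(sim_t, mask_t)


# Toy 2x2 batch with made-up similarities.
toy = [[2.0, 0.5], [0.3, 1.5]]
print(masked_margin_softmax(toy))
```

Unlike a triplet loss, which compares each positive against one sampled negative, this form contrasts each positive against every unmasked item in the batch at once.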

Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results, improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality of the retrieved results.
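The "recall in the top 10" figure is a retrieval metric: the share of queries whose correct item appears among the 10 highest-scoring candidates. A minimal sketch of that computation follows. It assumes exactly one correct candidate per query, located at the same index, which is a simplification. The function name and the toy scores are illustrative and not taken from the paper.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    The correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score and keep the k best.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Made-up scores for 3 audio queries against 3 images.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
print(recall_at_k(scores, 1))
print(recall_at_k(scores, 2))
```

A metric of this kind counts a retrieved image as wrong whenever it is not the annotated match, even if it fits the caption. That is the effect the human ratings in the abstract are meant to measure.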
