CL CV SD AS | Aug 14, 2020

Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images

arXiv:2008.06258v11.010 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses efficient multimodal learning for AI agents. Its contribution is incremental: it primarily compares existing methods rather than introducing a novel approach.

The study compared unsupervised autoencoder-like models to supervised transfer learning for multimodal one-shot speech-image matching, finding that transfer learning outperformed unsupervised training on a dataset of paired spoken and visual digits.

We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learning: supervised models are trained on labelled background data not containing any of the one-shot classes. Here we compare transfer learning to unsupervised models trained on unlabelled in-domain data. On a dataset of paired isolated spoken and visual digits, we specifically compare unsupervised autoencoder-like models to supervised classifier and Siamese neural networks. In both unimodal and multimodal few-shot matching experiments, we find that transfer learning outperforms unsupervised training. We also present experiments towards combining the two methodologies, but find that transfer learning still performs best (despite idealised experiments showing the benefits of unsupervised learning).
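The abstract does not spell out how a spoken query is matched to an unseen picture. One plausible setup, assumed here purely for illustration, goes through the one-shot support set: find the support speech example closest to the query, take its paired image, then pick the closest image in the matching set. The sketch below assumes speech and images have already been mapped to fixed-size embeddings by some model (supervised or unsupervised). The function names, the cosine distance, and the toy vectors are hypothetical choices, not taken from the paper.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, candidates):
    """Index of the candidate with the smallest cosine distance to the query."""
    return min(range(len(candidates)),
               key=lambda i: cosine_distance(query, candidates[i]))


def one_shot_match(speech_query, support_set, matching_images):
    """Pick the matching-set image for a spoken query (assumed two-step scheme).

    support_set: list of (speech_embedding, image_embedding) pairs,
                 one pair per one-shot class.
    matching_images: embeddings of the unseen candidate pictures.
    """
    # Step 1 (speech to speech): closest spoken example in the support set.
    support_speech = [speech for speech, _ in support_set]
    k = nearest(speech_query, support_speech)
    # Step 2 (image to image): closest unseen picture to the paired image.
    _, paired_image = support_set[k]
    return nearest(paired_image, matching_images)


# Toy embeddings, made up for the example: two classes, three candidates.
support_set = [
    ([1.0, 0.0], [0.0, 1.0, 0.0]),
    ([0.0, 1.0], [1.0, 0.0, 0.0]),
]
matching_images = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.0, 1.0]]
chosen = one_shot_match([0.2, 0.9], support_set, matching_images)
```

Under this scheme the quality of the match depends entirely on the two unimodal embedding spaces, which is where the transfer-learning and unsupervised models compared in the paper would differ.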
