CLCVMar 30, 2016

Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings

arXiv:1603.09188v154 citationsHas Code
Originality Incremental advance
AI Analysis

This addresses a multimodal task useful for applications like image retrieval and description, but it is incremental as it adapts existing methods to a new dataset.

The paper tackles the problem of visual sense disambiguation for verbs by introducing a new task and dataset (VerSe), and finds that textual embeddings work well with annotations while multimodal embeddings perform better on unannotated images, achieving competitive results in unsupervised and supervised settings.

We introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce VerSe, a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, or multimodal embeddings. We find that textual embeddings perform well when gold-standard textual annotations (object labels and image descriptions) are available, while multimodal embeddings perform well on unannotated images. We also verify our findings by using the textual and multimodal embeddings as features in a supervised setting and analyse the performance of visual sense disambiguation task. VerSe is made publicly available and can be downloaded at: https://github.com/spandanagella/verse.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes