CV, Jul 26, 2021

Language Models as Zero-shot Visual Semantic Learners

arXiv:2107.12021v1, 1 citation
Originality: Incremental advance
AI Analysis

This work addresses visual semantic embedding for object recognition and zero-shot learning, but it is incremental as it builds on existing language model techniques.

The authors tackled the problem of visual semantic understanding by proposing a Visual Semantic Embedding Probe (VSEP) that leverages contextualized word embeddings from transformer language models, showing it outperforms static embeddings in zero-shot tasks with short compositional chains.

Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE rely heavily on static word embedding techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP) designed to probe the semantic information of contextualized word embeddings in visual semantic understanding tasks. We show that the knowledge encoded in transformer language models can be exploited for tasks requiring visual semantic understanding. The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner. We further introduce a zero-shot setting with VSEPs to evaluate a model's ability to associate a novel word with a novel visual category. We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short. We notice that current visual semantic embedding models lack a mutual exclusivity bias, which limits their performance.
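To make the idea of mapping images into a semantic embedding space concrete, here is a minimal toy sketch of zero-shot label prediction by nearest word embedding. It is not the authors' VSEP: the linear projection, the 2-D label vectors, and the names `project`, `zero_shot_predict`, `weights`, and `label_embeddings` are all assumptions made up for illustration, and a real system would presumably learn the projection and take label vectors from a language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(image_feature, weights):
    """Map an image feature into the word-embedding space with a linear layer.

    `weights` is a hypothetical projection matrix given as a list of rows.
    """
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weights]


def zero_shot_predict(image_feature, weights, label_embeddings):
    """Return the label whose embedding is closest to the projected image."""
    projected = project(image_feature, weights)
    scores = {
        label: cosine(projected, embedding)
        for label, embedding in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy word embeddings (illustrative values, not taken from any language model).
# "zebra" stands in for a label with no training images.
label_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "zebra": [0.7, 0.7],
}

# Hypothetical projection from a 3-D image feature to the 2-D word space.
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

image_feature = [0.9, 1.0, 0.3]
prediction, scores = zero_shot_predict(image_feature, weights, label_embeddings)
print(prediction)
```

Because prediction only compares the projected image against label vectors, a new category can be added by supplying its word embedding alone, which is the property zero-shot settings rely on.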

