CV · Jan 3, 2022

Semantically Grounded Visual Embeddings for Zero-Shot Learning

arXiv:2201.00577v2 · 6 citations
AI Analysis

This work addresses a key bottleneck in zero-shot learning for computer vision by enhancing alignment between visual and textual representations, though it is incremental as it builds on existing methods.

The paper tackles the problem of disjoint visual and semantic embeddings in zero-shot learning by proposing a joint image-text model that leverages ancillary captions for semantic grounding, resulting in performance improvements of up to +2.6% on benchmark datasets.

Zero-shot learning methods rely on fixed visual and semantic embeddings, extracted from independent vision and language models, both pre-trained for other large-scale tasks. This is a weakness of current zero-shot learning frameworks as such disjoint embeddings fail to adequately associate visual and textual information to their shared semantic content. Therefore, we propose to learn semantically grounded and enriched visual information by computing a joint image and text model with a two-stream network on a proxy task. To improve this alignment between image and textual representations, provided by attributes, we leverage ancillary captions to provide grounded semantic information. Our method, dubbed joint embeddings for zero-shot learning, is evaluated on several benchmark datasets, improving the performance of existing state-of-the-art methods in both standard (+1.6% on aPY, +2.6% on FLO) and generalized (+2.1% on AWA2, +2.2% on CUB) zero-shot recognition.
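
As a rough illustration of why a shared image-text space matters for zero-shot recognition, the sketch below scores an image against classes it has never seen by comparing projected visual features with projected class attribute vectors. It is not the paper's architecture: the linear maps `W_VISUAL` and `W_SEMANTIC`, the toy attribute vectors, and the cosine-similarity scoring rule are all assumptions made for this example, standing in for the learned two-stream network described above.

```python
import math

def project(matrix, vector):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_predict(image_feature, class_attributes, w_visual, w_semantic):
    """Pick the class whose projected attributes best match the projected image."""
    z_img = project(w_visual, image_feature)
    scores = {
        name: cosine(z_img, project(w_semantic, attrs))
        for name, attrs in class_attributes.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy values (assumptions, not taken from the paper).
W_VISUAL = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d visual feature -> 2-d joint space
W_SEMANTIC = [[1.0, 0.0], [0.0, 1.0]]          # 2-d attribute vector -> 2-d joint space
UNSEEN = {"zebra": [1.0, 0.1], "dolphin": [0.1, 1.0]}  # attribute vectors of unseen classes

label, scores = zero_shot_predict([0.9, 0.2, 0.5], UNSEEN, W_VISUAL, W_SEMANTIC)
print(label, scores)
```

With fixed, independently pre-trained embeddings, the two projections are the only place where the modalities meet. Learning the visual stream jointly with text, as the abstract proposes, aims to make the image features themselves carry the semantics that this comparison relies on.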
