CV · Mar 13, 2025

SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

arXiv:2503.10252v2 · 7 citations · h-index: 14 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the recognition of unseen classes in computer vision, but the advance is incremental because it builds on existing transformer and zero-shot learning methods.

The paper tackles semantic misalignment in zero-shot learning by introducing SVIP, a transformer-based framework that identifies semantic-unrelated visual patches at the input stage and replaces them with learnable patch embeddings, achieving state-of-the-art performance on ZSL benchmarks.

Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information contained in visual features introduces ambiguity into visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc, either in the feature space or in the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This mechanism is trained with supervision from attention scores aggregated across all transformer layers, which estimate each patch's semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. Initializing these embeddings from word embeddings ensures that they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance while providing more interpretable and semantically rich feature representations. Code is available at https://github.com/uqzhichen/SVIP.
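
The input-stage idea in the abstract can be illustrated with a small, self-contained sketch. It is not taken from the SVIP repository, and every name in it (`aggregate_scores`, `select_unrelated`, `replace_patches`, the toy values) is hypothetical. It assumes that "aggregated" means a plain average over layers and that a fixed number of lowest-scoring patches is flagged; the abstract specifies neither. The learned patch selector and its training loop are omitted, so the sketch only shows how per-patch semantic scores could be derived from attention and how flagged patches could be swapped for a replacement embedding without changing the sequence length.

```python
from typing import List, Sequence


def aggregate_scores(attn_per_layer: Sequence[Sequence[float]]) -> List[float]:
    """Average each patch's attention score over all layers (assumed aggregation)."""
    num_layers = len(attn_per_layer)
    num_patches = len(attn_per_layer[0])
    return [
        sum(layer[p] for layer in attn_per_layer) / num_layers
        for p in range(num_patches)
    ]


def select_unrelated(scores: Sequence[float], num_replace: int) -> List[int]:
    """Return indices of the num_replace lowest-scoring patches, sorted by index."""
    ranked = sorted(range(len(scores)), key=lambda p: scores[p])
    return sorted(ranked[:num_replace])


def replace_patches(
    patches: Sequence[Sequence[float]],
    unrelated: Sequence[int],
    replacement: Sequence[float],
) -> List[List[float]]:
    """Copy the patch sequence, swapping flagged patches for the replacement.

    The sequence length is unchanged, so the spatial layout is preserved.
    """
    flagged = set(unrelated)
    return [
        list(replacement) if i in flagged else list(vec)
        for i, vec in enumerate(patches)
    ]


# Toy input: 2 transformer layers, 4 patches, 2-dimensional patch embeddings.
attn = [
    [0.40, 0.05, 0.35, 0.20],  # layer 1 attention received by each patch
    [0.50, 0.10, 0.30, 0.10],  # layer 2 attention received by each patch
]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Stand-in for a learnable embedding; in the paper it is initialized
# from word embeddings, here it is just a fixed hypothetical vector.
replacement = [0.1, 0.2]

scores = aggregate_scores(attn)
unrelated = select_unrelated(scores, num_replace=2)
contextualized = replace_patches(patches, unrelated, replacement)
```

In this toy input, patches 1 and 3 receive the least attention on average, so they are the ones replaced, while patches 0 and 2 pass through unchanged. In the framework described above, scores of this kind serve only as a training signal for a selector that operates on the input before feature extraction.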

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
