CV | Jan 23, 2025

Learning Visual Proxy for Compositional Zero-Shot Learning

arXiv:2501.13859v4 | 1 citation | h-index: 6
Originality: Incremental advance
AI Analysis

This work improves compositional generalization for AI systems in visual recognition tasks, representing an incremental advancement over existing methods.

The paper tackles the problem of recognizing novel attribute-object compositions in Compositional Zero-Shot Learning by addressing modality gaps and lack of fine-grained visual cues, achieving state-of-the-art performance in closed-world scenarios and competitive results in open-world settings across four benchmarks.

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Current methods align textual prototypes with visual features via Vision-Language Models (VLMs), but suffer from two limitations: (1) modality gaps hinder the discrimination of semantically similar pairs, and (2) single-modal textual prototypes lack fine-grained visual cues. In this paper, we introduce Visual Proxy Learning, a method that reduces modality gaps and enhances compositional generalization. We initialize visual proxies for attributes, objects, and their compositions using text representations and optimize the visual space to capture fine-grained cues, improving visual representations. Additionally, we propose Cross-Modal Joint Learning (CMJL), which imposes cross-modal constraints between the text-image and fine-grained visual spaces, improving generalization for unseen compositions and discriminating similar pairs. Experiments show state-of-the-art performance in closed-world scenarios and competitive results in open-world settings across four CZSL benchmarks, demonstrating the effectiveness of our approach in compositional generalization.
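The abstract's two ideas, initializing visual proxies from text representations and scoring an image against both the textual and the visual space, can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the losses, the architecture, or how the two spaces are combined. The function names, the `alpha` weight, the `temperature` value, the weighted-average fusion, and the hand-edited proxy that stands in for training are all assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(feature, prototypes):
    """Cosine similarity between one feature and each prototype."""
    f = normalize(feature)
    return [sum(a * b for a, b in zip(f, normalize(p))) for p in prototypes]


def softmax(scores, temperature=0.1):
    """Temperature-scaled softmax (the temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def init_visual_proxies(text_prototypes):
    """Start each visual proxy as a copy of its text prototype."""
    return [list(p) for p in text_prototypes]


def joint_predict(image_feature, text_prototypes, visual_proxies, alpha=0.5):
    """Average the class probabilities from the two spaces.

    The weighted average is a hypothetical stand-in for the paper's
    cross-modal constraints, which the abstract does not detail.
    """
    p_text = softmax(cosine_scores(image_feature, text_prototypes))
    p_vis = softmax(cosine_scores(image_feature, visual_proxies))
    return [alpha * t + (1 - alpha) * v for t, v in zip(p_text, p_vis)]


# Toy embeddings for three compositions (made-up numbers).
text_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
visual_proxies = init_visual_proxies(text_prototypes)

# Stand-in for training: pretend optimization moved one proxy
# toward the visual features of its class.
visual_proxies[1] = [0.2, 0.9, 0.1]

image_feature = [0.3, 0.8, 0.1]
probs = joint_predict(image_feature, text_prototypes, visual_proxies)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the proxies only differ from the text prototypes after the simulated update; in a real system they would be learned from image features of seen compositions.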

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
