CV · Jan 23, 2025

Learning Visual Proxy for Compositional Zero-Shot Learning

Shiyu Zhang, Cheng Yan, Yang Liu, Chenchen Jing, Lei Zhou, Wenjun Wang

arXiv:2501.13859v46.21 citations · h-index: 6

Originality: Incremental advance

AI Analysis

This work improves compositional generalization for AI systems in visual recognition tasks, representing an incremental advancement over existing methods.

The paper tackles the problem of recognizing novel attribute-object compositions in Compositional Zero-Shot Learning by addressing modality gaps and lack of fine-grained visual cues, achieving state-of-the-art performance in closed-world scenarios and competitive results in open-world settings across four benchmarks.

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Current methods align textual prototypes with visual features via Vision-Language Models (VLMs), but suffer from two limitations: (1) modality gaps hinder the discrimination of semantically similar pairs, and (2) single-modal textual prototypes lack fine-grained visual cues. In this paper, we introduce Visual Proxy Learning, a method that reduces modality gaps and enhances compositional generalization. We initialize visual proxies for attributes, objects, and their compositions using text representations and optimize the visual space to capture fine-grained cues, improving visual representations. Additionally, we propose Cross-Modal Joint Learning (CMJL), which imposes cross-modal constraints between the text-image and fine-grained visual spaces, improving generalization for unseen compositions and discriminating similar pairs. Experiments show state-of-the-art performance in closed-world scenarios and competitive results in open-world settings across four CZSL benchmarks, demonstrating the effectiveness of our approach in compositional generalization.
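The idea of initializing visual proxies from text representations and then scoring an image in both spaces can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`init_visual_proxies`, `update_proxy`, `predict`), the 3-d embeddings, the simple interpolation update, and the weighted-sum fusion with `alpha` are all assumptions made for illustration, and the paper's actual optimization and CMJL constraints are not reproduced here.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def init_visual_proxies(text_prototypes):
    """Start each visual proxy as a copy of its text prototype."""
    return {label: list(vec) for label, vec in text_prototypes.items()}

def update_proxy(proxy, image_feature, lr=0.5):
    """Hypothetical update: move a proxy toward an image feature of its class."""
    return [p + lr * (f - p) for p, f in zip(proxy, image_feature)]

def predict(image_feature, text_prototypes, visual_proxies, alpha=0.5):
    """Score each composition by a weighted sum of the text-space and
    proxy-space similarities (the fusion rule is an assumption)."""
    scores = {
        label: alpha * cosine(image_feature, text_prototypes[label])
        + (1 - alpha) * cosine(image_feature, visual_proxies[label])
        for label in text_prototypes
    }
    return max(scores, key=scores.get), scores

# Made-up embeddings for two semantically close compositions.
text_prototypes = {
    "wet dog": [1.0, 0.9, 0.0],
    "damp dog": [1.0, 1.0, 0.0],
}
visual_proxies = init_visual_proxies(text_prototypes)

# Pretend a training image of "wet dog" carries a visual cue in the third dimension.
visual_proxies["wet dog"] = update_proxy(visual_proxies["wet dog"], [1.0, 0.9, 1.0])

test_image = [1.0, 0.95, 0.8]
label, scores = predict(test_image, text_prototypes, visual_proxies)
print(label, scores)
```

In this toy setup the two text prototypes score almost identically for the test image, while the updated proxy for "wet dog" adds a visual cue that separates them, so the combined score selects "wet dog".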
