EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning
This work addresses compositional generalization for recognizing unseen state-object pairs in computer vision, offering an incremental improvement over existing methods.
The paper tackles the problem of compositional zero-shot learning by proposing EVA, a mixture-of-experts framework that improves primitive representation and alignment, achieving state-of-the-art results on three benchmarks in both closed- and open-world settings.
Compositional Zero-Shot Learning (CZSL) studies compositional generalization: recognizing unseen state-object pairs from learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, all-to-one cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaption, which leverages multiple experts to achieve token-aware learning and to model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment, which selects semantically relevant representations for image-primitive matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed approach.
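The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names, not the authors' method. It assumes a softmax gate with top-k routing for the token-aware mixture of experts, and cosine similarity for choosing among several candidate representations ("variants") of one primitive. The function names `moe_token` and `select_variant`, the gating scheme, and the similarity measure are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def moe_token(token, gate_weights, experts, top_k=1):
    """Route one token to its top-k experts and mix their outputs.

    Assumption: a linear gate followed by softmax, with the selected
    gate values renormalized to sum to 1. Each expert is assumed to
    return a vector of the same length as the token.
    """
    gate = softmax([dot(w, token) for w in gate_weights])
    chosen = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:top_k]
    norm = sum(gate[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)
        out = [o + (gate[i] / norm) * v for o, v in zip(out, expert_out)]
    return out, chosen


def select_variant(image_feat, variants):
    """Pick the variant of a primitive that best matches the image.

    Assumption: "semantically relevant" is approximated by the highest
    cosine similarity, instead of matching every image to one prototype.
    """
    sims = [cosine(image_feat, v) for v in variants]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best, sims[best]


# Toy usage with two hypothetical experts and three variants.
experts = [lambda t: [2.0 * x for x in t], lambda t: [-x for x in t]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
mixed, routed_to = moe_token([3.0, 1.0], gate_weights, experts, top_k=1)
best_idx, best_sim = select_variant([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
```

In this toy setup, different tokens can be routed to different experts, and an image is matched against the closest variant of a state or object instead of a single shared prototype, which is the contrast with all-to-one matching that the abstract draws.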