Dynamic Visual-semantic Alignment for Zero-shot Learning with Ambiguous Labels
For zero-shot learning practitioners, DVSA addresses the practical problem of label noise, offering a robust framework that outperforms existing methods under ambiguous supervision.
Zero-shot learning typically assumes clean labels, but real-world label noise and ambiguity degrade performance. DVSA combines a dynamic label disambiguation mechanism with bidirectional visual-semantic alignment and contrastive optimization, and reports stronger results than existing methods under ambiguous labels on standard benchmarks.
Zero-shot learning (ZSL) aims to recognize unseen classes, that is, classes for which no visual instances are available during training. However, existing methods usually assume clean labels and overlook the label noise and ambiguity found in real-world data, which degrades their performance. To bridge this gap, we propose Dynamic Visual-semantic Alignment (DVSA), a robust ZSL framework for learning from ambiguous labels. DVSA uses an attention-based bidirectional visual-semantic alignment module to mutually calibrate visual features and attribute prototypes, together with an attribute-level contrastive objective grounded in mutual information (MI) that strengthens discriminative, semantically consistent attributes. In addition, a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, which narrows the instance-label gap and improves generalization. Extensive experiments on standard benchmarks show that DVSA outperforms existing methods under ambiguous supervision.
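To make the idea of dynamic label disambiguation more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the update rule, so everything here is an assumption: each sample is taken to come with a candidate label set, compatibility is computed as temperature-scaled cosine similarity between visual features and class attribute prototypes, and soft labels are refined with a momentum update restricted to the candidates. The names `compatibility`, `update_soft_labels`, `temperature`, and `momentum` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compatibility(visual, prototypes, temperature=0.1):
    """Temperature-scaled cosine similarity between visual features
    (n_samples, dim) and class attribute prototypes (n_classes, dim).
    Assumed scoring function; the paper's attention-based alignment
    module would take its place."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return v @ p.T / temperature


def update_soft_labels(soft_labels, scores, candidate_mask, momentum=0.9):
    """One disambiguation step (assumed rule): renormalize the model's
    predictions over each sample's candidate labels, then blend them
    into the current soft labels with a moving average."""
    probs = softmax(scores, axis=1) * candidate_mask
    probs = probs / probs.sum(axis=1, keepdims=True)
    return momentum * soft_labels + (1.0 - momentum) * probs


# Toy data: 3 classes with 4-dimensional attribute prototypes.
prototypes = np.eye(3, 4)
# Each visual feature lies close to its true class prototype (classes 0, 1, 2).
visual = prototypes + 0.1

# Ambiguous supervision: every sample lists its true class plus one distractor.
candidate_mask = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
])

# Start from a uniform distribution over each candidate set.
soft_labels = candidate_mask / candidate_mask.sum(axis=1, keepdims=True)

# In real training the scores would change as the model is updated;
# they are held fixed here to isolate the label update.
scores = compatibility(visual, prototypes)
for _ in range(20):
    soft_labels = update_soft_labels(soft_labels, scores, candidate_mask)

disambiguated = soft_labels.argmax(axis=1)
```

In this toy setting the soft labels stay valid distributions over the candidate sets and gradually concentrate on the class whose prototype best matches each visual feature. The momentum term is a design choice that keeps early, unreliable predictions from overwriting the supervision too quickly.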