Xudong Yan

h-index: 10

3 papers

3 citations

Novelty: 57%

AI Score: 47

Ranked #32,173 of 194,257 authors (top 17%) · #11,493 in CV (top 19%)

3 Papers

3.6 · CV · Oct 23, 2025 · Code

TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
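The dynamic priority queue mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ConfidenceQueue`, the fixed `capacity`, the scalar confidence score, and the use of a plain mean as the visual prototype are all assumptions made for illustration.

```python
import heapq
import itertools


class ConfidenceQueue:
    """Hypothetical bounded queue that keeps the most confident features."""

    def __init__(self, capacity=3):
        self.capacity = capacity          # assumed fixed size per composition
        self._heap = []                   # min-heap of (confidence, order, feature)
        self._order = itertools.count()   # tie-breaker so features are never compared

    def push(self, confidence, feature):
        entry = (confidence, next(self._order), list(feature))
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif confidence > self._heap[0][0]:
            # Evict the least confident stored feature.
            heapq.heapreplace(self._heap, entry)

    def prototype(self):
        """Mean of the stored features (an assumed way to form a visual prototype)."""
        if not self._heap:
            return None
        n = len(self._heap)
        dim = len(self._heap[0][2])
        return [sum(e[2][d] for e in self._heap) / n for d in range(dim)]


queue = ConfidenceQueue(capacity=2)
queue.push(0.5, [1.0, 0.0])
queue.push(0.9, [0.0, 1.0])
queue.push(0.2, [5.0, 5.0])   # rejected: less confident than everything stored
queue.push(0.7, [1.0, 1.0])   # replaces the 0.5 entry
print(queue.prototype())
```

A min-heap keeps the least confident entry at the root, so each new test image is accepted or rejected in logarithmic time.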

9.4 · LG · Feb 18, 2025 · Code

MaxSup: Overcoming Representation Collapse in Label Smoothing

Yuxuan Zhou, Heng Li, Zhi-Qi Cheng et al. · cmu, uw

Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS, consistently reducing overconfidence while preserving richer feature representations. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
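The difference between the two penalties can be sketched in a few lines. The sketch assumes the standard uniform label-smoothing target, for which the loss equals cross-entropy plus `alpha * (z_gt - mean(z))`; the MaxSup variant below swaps the ground-truth logit for the top-1 logit, as the abstract describes. The function names and the value `alpha=0.1` are illustrative, and the repository linked above remains the reference implementation.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    return -log_softmax(logits)[target]


def smoothed_cross_entropy(logits, target, alpha=0.1):
    """Cross-entropy against the uniform label-smoothing target."""
    k = len(logits)
    logp = log_softmax(logits)
    return -sum(((1 - alpha) * (i == target) + alpha / k) * lp
                for i, lp in enumerate(logp))


def label_smoothing_penalty(logits, target, alpha=0.1):
    """Regularizer implied by LS: penalizes the ground-truth logit."""
    return alpha * (logits[target] - sum(logits) / len(logits))


def maxsup_penalty(logits, alpha=0.1):
    """MaxSup-style regularizer: penalizes the top-1 logit instead."""
    return alpha * (max(logits) - sum(logits) / len(logits))


def maxsup_loss(logits, target, alpha=0.1):
    return cross_entropy(logits, target) + maxsup_penalty(logits, alpha)


logits = [2.0, 1.0, 0.0]
# Correct prediction (target 0): both penalties coincide.
print(label_smoothing_penalty(logits, 0), maxsup_penalty(logits))
# Misclassification (target 1): only the MaxSup penalty still pushes down the top logit.
print(label_smoothing_penalty(logits, 1), maxsup_penalty(logits))
```

When the prediction is correct, the ground-truth logit is the top-1 logit and the two penalties are identical; they differ only on misclassified samples, which is where the abstract locates the error-amplification term.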

3.7 · CV · Nov 18, 2024 · Code

Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng, Yang Zhang et al.

Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Being aware of these, we propose a novel framework named multimodal large language model (MLLM) embeddings and attribute smoothing guided disentanglement for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multi-granularity features for disentanglement. Moreover, the last hidden states of MLLM are employed as word embeddings for their superior representation capabilities. Furthermore, we propose attribute smoothing with auxiliary attributes generated by the large language model (LLM) for seen compositions to address the overconfidence challenge. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three challenging datasets. The source code will be available at https://github.com/xud-yan/Trident .
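One way to picture attribute smoothing is as a soft target that moves a small amount of probability mass from the ground-truth attribute to its auxiliary attributes. The sketch below is an assumption-laden illustration rather than the paper's formulation: the function name, the equal split of `alpha` across auxiliary attributes, and the example attribute names are hypothetical, and the auxiliary list stands in for output that would come from an LLM.

```python
def smoothed_attribute_target(attributes, true_attr, auxiliary, alpha=0.1):
    """Build a soft label over `attributes` (hypothetical helper).

    The ground-truth attribute keeps 1 - alpha; the auxiliary attributes
    that exist in the vocabulary share alpha equally (an assumed split).
    """
    aux = [a for a in auxiliary if a in attributes and a != true_attr]
    target = {a: 0.0 for a in attributes}
    if not aux:
        # No usable auxiliary attributes: fall back to a one-hot label.
        target[true_attr] = 1.0
        return target
    target[true_attr] = 1.0 - alpha
    for a in aux:
        target[a] += alpha / len(aux)
    return target


vocab = ["wet", "damp", "dry", "old"]
# "moist" is outside the vocabulary and is ignored.
print(smoothed_attribute_target(vocab, "wet", ["damp", "moist"], alpha=0.2))
```

Training against such a target instead of a one-hot label keeps the model from assigning all probability to the seen attribute, which is the overconfidence issue the abstract raises.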