CV | LG | Jan 22, 2024

Multi-level Cross-modal Alignment for Image Clustering

arXiv:2401.11740v1 | 8 citations | h-index: 5 | AAAI
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in cross-modal learning for image clustering, representing an incremental improvement over existing approaches.

The paper tackles the problem of erroneous alignments in cross-modal pretraining models that degrade image clustering performance by proposing a Multi-level Cross-modal Alignment method that builds a better semantic space and aligns images and texts at three levels. Experimental results on five benchmark datasets demonstrate the superiority of their method.

Recently, cross-modal pretraining models have been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pretraining model can produce poor-quality pseudo-labels and degrade clustering performance. To solve this issue, we propose a novel Multi-level Cross-modal Alignment method that improves the alignments in a cross-modal pretraining model for downstream tasks by building a smaller but better semantic space and aligning images and texts at three levels: instance level, prototype level, and semantic level. Theoretical results show that our proposed method converges and suggest effective means to reduce its expected clustering risk. Experimental results on five benchmark datasets show the superiority of our new method.
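To make the pseudo-labeling step mentioned in the abstract concrete, here is a minimal sketch of how image-text similarity can be turned into pseudo-labels. It is not the paper's method and does not implement the three alignment levels. The function names `normalize` and `pseudo_labels`, the `temperature` value, and the toy embeddings are all assumptions made for illustration; in practice the embeddings would come from a pretrained cross-modal encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pseudo_labels(image_embs, text_embs, temperature=0.1):
    """Assign each image the index of its most similar text embedding.

    Returns a list of (label, confidence) pairs, where confidence is the
    softmax probability of the chosen text. The temperature is a
    hypothetical setting, not a value taken from the paper.
    """
    texts = [normalize(t) for t in text_embs]
    results = []
    for img in image_embs:
        img = normalize(img)
        # Cosine similarity between this image and every text embedding.
        sims = [sum(a * b for a, b in zip(img, t)) for t in texts]
        # Numerically stable softmax over the similarities.
        top = max(sims)
        exps = [math.exp((s - top) / temperature) for s in sims]
        total = sum(exps)
        probs = [e / total for e in exps]
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs[label]))
    return results


# Toy embeddings (made up): two text concepts and three images.
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.5, 0.5, 0.1]]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = pseudo_labels(image_embs, text_embs)
```

In this toy data the third image is equally similar to both texts, so its pseudo-label carries a confidence of only 0.5. Ambiguous assignments of this kind are one simple picture of how a poor alignment can yield a poor-quality pseudo-label.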

