AI · Sep 28, 2022

Clustering-Induced Generative Incomplete Image-Text Clustering (CIGIT-C)

arXiv:2209.13763v2 · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a practical issue in multi-modal clustering for real-world applications where data is often incomplete, though it appears incremental as it builds on existing methods to handle missing data.

The paper tackles the problem of incomplete image-text clustering (IITC), where data may be missing in one modality, by proposing a clustering-induced generative network that uses adversarial generation and KL divergence losses to explore latent connections and improve feature learning. The method outperforms existing approaches on public datasets, demonstrating effectiveness in IITC tasks.

The goal of image-text clustering (ITC) is to find the correct clusters by integrating the complementary and consistent information carried by multiple modalities of heterogeneous samples. However, most current studies analyse ITC on the ideal premise that the samples in every modality are complete, a presumption that does not always hold in real-world situations. Missing data degrades image-text feature learning and ultimately harms generalization in ITC tasks. Although a series of methods have been proposed to address incomplete image-text clustering (IITC), the following problems remain: 1) most existing methods hardly consider the distinct gap between heterogeneous feature domains; 2) for missing data, the representations generated by existing methods are rarely guaranteed to suit clustering tasks; 3) existing methods do not tap into the latent connections both within and across modalities. In this paper, we propose a Clustering-Induced Generative Incomplete Image-Text Clustering (CIGIT-C) network to address these challenges. More specifically, we first use modality-specific encoders to map the original features to more distinctive subspaces. We then explore the latent connections within and across modalities by using an adversarial generative network to produce one modality conditioned on the other. Finally, we update the corresponding modality-specific encoders using two KL divergence losses. Experimental results on public image-text datasets demonstrate that the proposed method outperforms existing approaches on the IITC task.
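The abstract does not spell out the form of the two KL divergence losses. As a rough illustration only, the sketch below assumes a DEC-style clustering loss (Student's t soft assignments pulled toward a sharpened target distribution) computed once per modality. This is a swapped-in, well-known technique and not necessarily the authors' formulation; the function names, the toy embeddings, and the shared centroids are all hypothetical.

```python
import math

def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of embedding z_i to centroid mu_j."""
    q = []
    for zi in z:
        row = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, mu))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, with f_j = sum_i q_ij."""
    k = len(q[0])
    f = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(k)]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        pij * math.log((pij + eps) / (qij + eps))
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
    )

# Hypothetical 2-D outputs of the image and text encoders (toy values).
image_z = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1]]
text_z = [[0.1, 0.0], [1.9, 2.0], [2.1, 1.8]]
# Assumption: both modalities share one set of cluster centroids.
centroids = [[0.0, 0.0], [2.0, 2.0]]

q_img = soft_assign(image_z, centroids)
q_txt = soft_assign(text_z, centroids)
p_img = target_distribution(q_img)
p_txt = target_distribution(q_txt)

# One KL loss per modality; in training, each would drive its own encoder.
loss_img = kl_divergence(p_img, q_img)
loss_txt = kl_divergence(p_txt, q_txt)
print(loss_img, loss_txt)
```

In a real training loop these two scalars would be backpropagated through the image and text encoders respectively, alongside the adversarial generation objective; the pure-Python version here only shows how the quantities are formed.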

