Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets
This work addresses a practical issue in real-world multimodal applications, namely heterogeneous category sets across modalities; the contribution is incremental in that it builds on existing multimodal methods.
The paper tackles the problem of multimodal classification when different modalities have inconsistent category sets, proposing a new setting called MMHCL and a method (CSCF) that aligns features and fuses modalities based on class similarity. Experimental results show it significantly outperforms state-of-the-art approaches on multiple benchmarks.
Existing multimodal methods typically assume that different modalities share the same category set. However, in real-world applications, category distributions across modalities are often inconsistent, which can hinder a model's ability to use cross-modal information effectively when recognizing all categories. In this work, we propose a practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL), in which models are trained on multimodal data with heterogeneous category sets and aim to recognize the complete class set of all modalities at test time. To address this task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF). Specifically, CSCF first aligns modality-specific features to a shared semantic space to enable knowledge transfer between seen and unseen classes. It then uses uncertainty estimation to select the most discriminative modality for decision fusion. Finally, it integrates cross-modal information based on class similarity, with the auxiliary modality refining the prediction of the dominant one. Experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) approaches on multiple benchmark datasets, effectively addressing the MMHCL task.
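The last two steps described above (uncertainty-based modality selection and class-similarity-based refinement) can be illustrated with a minimal sketch. This is not the CSCF implementation: the abstract does not specify the uncertainty measure or the fusion rule, so the sketch assumes Shannon entropy as the uncertainty estimate, cosine similarity between class embeddings in the shared semantic space as the class similarity, and a simple additive refinement weighted by a hypothetical parameter `alpha`. All function names, embeddings, and probabilities are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def class_similarity(class_embeddings):
    """Row-normalised, non-negative cosine similarity between class embeddings."""
    sims = [[max(cosine(u, v), 0.0) for v in class_embeddings]
            for u in class_embeddings]
    return [[s / (sum(row) or 1.0) for s in row] for row in sims]

def fuse(probs_by_modality, class_embeddings, alpha=0.5):
    """Pick the least uncertain modality, then refine it with the others.

    probs_by_modality: dict mapping modality name -> probabilities over the
    complete class set. alpha is a hypothetical weight for auxiliary evidence.
    """
    # Dominant modality = lowest predictive entropy (an assumption).
    dominant = min(probs_by_modality,
                   key=lambda m: entropy(probs_by_modality[m]))
    sim = class_similarity(class_embeddings)
    n = len(class_embeddings)
    fused = list(probs_by_modality[dominant])
    for name, probs in probs_by_modality.items():
        if name == dominant:
            continue
        # Spread each auxiliary class probability over similar classes, so a
        # modality that never saw class j can still support it via neighbours.
        for j in range(n):
            fused[j] += alpha * sum(probs[i] * sim[i][j] for i in range(n))
    total = sum(fused)
    return dominant, [f / total for f in fused]

# Toy example: three classes; classes 0 and 1 are semantically close.
class_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
probs_by_modality = {
    "image": [0.7, 0.2, 0.1],  # confident prediction
    "text": [0.4, 0.3, 0.3],   # nearly uniform, hence more uncertain
}
dominant, fused = fuse(probs_by_modality, class_embeddings)
print(dominant, [round(p, 3) for p in fused])
```

In this toy case the image modality is selected as dominant because its prediction has lower entropy, and the text prediction only adjusts the final distribution through the similarity matrix. A learned uncertainty estimator or a different refinement rule could replace either assumed component without changing the overall flow.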