Multimodal Generalized Category Discovery
This work extends Generalized Category Discovery (GCD) to multimodal classification, which matters for open-world scientific discovery because real-world data is inherently multimodal. It is an incremental advance over unimodal GCD methods.
The paper extends Generalized Category Discovery (GCD) to multimodal data, where inputs are classified into known and novel categories using richer information from multiple modalities. The proposed method achieves state-of-the-art performance, surpassing previous methods by 11.5% on UPMC-Food101 and 4.7% on N24News.
Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.
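The two kinds of alignment named in the abstract can be illustrated with a minimal sketch. This is not the MM-GCD implementation: the abstract says only that contrastive learning and distillation are used, so the specific losses here (a symmetric InfoNCE-style loss for the feature space and a KL divergence for the output space), the choice of which modality acts as teacher, the temperature values, and all function names are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def cross_modal_contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Feature-space alignment (assumed form): symmetric InfoNCE-style loss.

    The i-th image and the i-th text are treated as the positive pair;
    every other pairing in the batch is a negative.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        row = sim[i]                          # image i against all texts
        col = [sim[j][i] for j in range(n)]   # text i against all images
        loss -= log_softmax(row, temperature)[i]
        loss -= log_softmax(col, temperature)[i]
    return loss / (2 * n)


def output_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Output-space alignment (assumed form): mean KL(teacher || student)
    between the class distributions predicted from the two modalities."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)
        log_q = log_softmax(s_row, temperature)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)
```

Under these assumptions, the contrastive term is small when paired image and text features point in the same direction and large when the pairing is shuffled, while the distillation term is zero when both modalities produce identical class distributions and positive otherwise.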