Multimodal Multilabel Classification by CLIP
This work is aimed at researchers and practitioners working on multimodal multilabel classification. Its contribution is incremental: it applies the existing CLIP model to a specific task and adds task-specific optimizations.
The paper tackles multimodal multilabel classification by using CLIP as a feature extractor and fine-tuning it with various classification heads, fusion methods, and loss functions, achieving an F1 score above 90% on a public Kaggle competition leaderboard.
Multimodal multilabel classification (MMC) is a challenging task that requires a learning algorithm to handle two data sources, images and text, and to learn a comprehensive semantic feature representation across the modalities. In this work, we review a broad range of state-of-the-art approaches to MMC and leverage a novel technique that uses Contrastive Language-Image Pre-training (CLIP) as the feature extractor and fine-tunes the model while exploring different classification heads, fusion methods and loss functions. Our best result achieves an F_1 score above 90% on the public Kaggle competition leaderboard. This paper provides detailed descriptions of the novel training methods and a quantitative analysis of the experimental results.
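To make the pipeline described above more concrete, the following is a minimal sketch of one possible configuration, not the configuration used in this work. It assumes that image and text embeddings have already been extracted (the short vectors below are hypothetical stand-ins for CLIP features), that fusion is done by concatenation, that the head is a single linear layer with a sigmoid per label, that the loss is binary cross-entropy, and that F_1 is micro-averaged. All function names, weights and dimensions are illustrative assumptions.

```python
import math


def fuse_concat(image_emb, text_emb):
    """Fuse two modality embeddings by concatenation (one assumed fusion choice)."""
    return list(image_emb) + list(text_emb)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear_head(features, weights, biases):
    """Produce one logit per label; weights has shape [num_labels][feature_dim]."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def bce_loss(logits, targets):
    """Mean binary cross-entropy, treating every label as an independent binary task."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Turn logits into a 0/1 decision per label; several labels may be active."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical 2-dimensional stand-ins for the image and text embeddings.
image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
fused = fuse_concat(image_emb, text_emb)

# Illustrative head for 3 labels over the 4-dimensional fused feature.
weights = [[2.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, -2.0],
           [1.0, 0.0, 0.0, 1.0]]
biases = [0.0, 0.0, -1.0]

logits = linear_head(fused, weights, biases)
targets = [1, 0, 0]
loss = bce_loss(logits, targets)
score = micro_f1([targets], [predict(logits)])
```

In this sketch each label gets its own sigmoid rather than sharing a softmax, which is what allows several labels to be predicted for the same image-text pair. Other fusion methods, heads or losses would replace `fuse_concat`, `linear_head` or `bce_loss` without changing the rest of the flow.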