CV · Jun 23, 2024

Multimodal Multilabel Classification by CLIP

arXiv:2406.16141v15.21 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses multimodal multilabel classification for researchers and practitioners. It is incremental: it applies the existing CLIP model to a specific task and tunes the classification head, fusion method, and loss function.

The paper tackles multimodal multilabel classification by using CLIP as a feature extractor and fine-tuning it with various classification heads, fusion methods, and loss functions, achieving an F1 score above 90% on a public Kaggle competition leaderboard.

Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm that handles two data sources, image and text, and learns a comprehensive semantic feature representation across the modalities. In this work, we review a large number of state-of-the-art approaches in MMC and leverage a novel technique that utilises Contrastive Language-Image Pre-training (CLIP) as the feature extractor, fine-tuning the model by exploring different classification heads, fusion methods and loss functions. Our best result achieved an F1 score of more than 90% on the public Kaggle competition leaderboard. This paper provides detailed descriptions of novel training methods and a quantitative analysis of the experimental results.
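The pipeline described in the abstract (feature extraction, fusion, classification head, loss, F1 evaluation) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes concatenation as the fusion method, a single linear layer with per-label sigmoid as the classification head, and binary cross-entropy as the loss; these are common choices for multilabel problems, and the abstract does not say which variants performed best. The toy vectors stand in for CLIP image and text embeddings, and all function names and weights are hypothetical.

```python
import math

def concat_fusion(image_emb, text_emb):
    """Fuse modalities by concatenation (assumed fusion method)."""
    return list(image_emb) + list(text_emb)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_head(features, weights, biases):
    """Return one logit per label; weights holds one row per label."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels of one sample."""
    total = sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets))
    return -total / len(targets)

def f1_score(pred, target):
    """F1 over the binary label vector of one sample."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
image_emb = [0.5, -1.0]
text_emb = [1.0, 2.0]

# Hypothetical head parameters for three labels.
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, -1.0],
           [0.0, 0.0, 0.0, 1.0]]
biases = [0.0, 0.0, -1.0]
targets = [1, 0, 0]

features = concat_fusion(image_emb, text_emb)
logits = linear_head(features, weights, biases)
probs = [sigmoid(z) for z in logits]
pred = [1 if p >= 0.5 else 0 for p in probs]  # independent threshold per label
loss = bce_loss(probs, targets)
f1 = f1_score(pred, targets)
```

Because each label gets its own sigmoid and threshold, a sample can receive any number of labels, which is what separates the multilabel setting from single-label softmax classification. In this toy case the head predicts labels 0 and 2 while only label 0 is correct, so precision is 1/2 and recall is 1.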
