CV · Jun 23, 2024

Multimodal Multilabel Classification by CLIP

arXiv:2406.16141v1 · 1 citation
Originality: Synthesis-oriented
AI Analysis

This work addresses multimodal multilabel classification for researchers and practitioners, but it is incremental: it applies the existing CLIP model to a specific task and tunes it for that task.

The paper tackled multimodal multilabel classification by using CLIP as a feature extractor and fine-tuning it with various classification heads, fusion methods, and loss functions, achieving over 90% F1 score on a public Kaggle competition leaderboard.

Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm that handles two data sources, image and text, and learns a comprehensive semantic feature representation across the modalities. In this work, we review a wide range of state-of-the-art approaches to MMC and leverage a novel technique that uses Contrastive Language-Image Pre-training (CLIP) as the feature extractor, fine-tuning the model while exploring different classification heads, fusion methods and loss functions. Our best result achieved an F1 score above 90% on the public Kaggle competition leaderboard. This paper provides detailed descriptions of the novel training methods and a quantitative analysis of the experimental results.
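The pipeline outlined in the abstract (image and text features, a fusion step, a classification head, a multilabel loss, and an F1 evaluation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are random stand-ins rather than real CLIP outputs, concatenation is only one possible fusion method, a single linear layer is only one possible head, binary cross-entropy is only one possible loss, and micro-averaging is only one way to compute F1. The abstract does not say which of these options the authors' best model used, and the function names below are hypothetical.

```python
import math
import random


def fuse_concat(image_emb, text_emb):
    """Fuse two embeddings by concatenation (an assumed fusion choice)."""
    return list(image_emb) + list(text_emb)


def linear_head(features, weights, bias):
    """Return one logit per label; weights has shape [num_labels][len(features)]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logits, targets):
    """Mean binary cross-entropy; each label is treated as an independent yes/no."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(targets)


def predict(logits, threshold=0.5):
    """Turn logits into a 0/1 decision per label."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def micro_f1(pred_rows, true_rows):
    """Micro-averaged F1: pool true/false positives and negatives over all labels."""
    tp = fp = fn = 0
    for pred, true in zip(pred_rows, true_rows):
        for p, t in zip(pred, true):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Stand-in embeddings (hypothetical; a real pipeline would take them from CLIP).
random.seed(0)
image_emb = [random.gauss(0, 1) for _ in range(4)]
text_emb = [random.gauss(0, 1) for _ in range(4)]
num_labels = 3

features = fuse_concat(image_emb, text_emb)
weights = [[random.gauss(0, 0.1) for _ in features] for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(features, weights, bias)
print("loss:", bce_loss(logits, [1, 0, 1]))
print("prediction:", predict(logits))
```

In a trained system the head weights would be learned by minimising the loss over the training set, and the decision threshold could be tuned on validation data instead of being fixed at 0.5.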
