CV · AI · LG · Jun 1, 2025

Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

arXiv:2506.03195v1 · 3 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of distinguishing visually similar subcategories in images, a task relevant to computer vision researchers and practitioners. It represents an incremental advance in unsupervised fine-grained classification.

The paper tackles fine-grained image classification with multimodal LLMs by introducing AutoSEP, a self-supervised prompt learning framework that uses unlabeled data to improve accuracy. On average, it achieves a 13% improvement over standard zero-shot classification and a 5% improvement over the best-performing baselines.

Despite Multimodal Large Language Models (MLLMs) showing promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image and boosts classification accuracy. We developed an automatic self-enhancing prompt learning framework called AutoSEP to iteratively improve the description prompt using unlabeled data, based on an instance-level classification scoring function. AutoSEP only requires black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP on average improves 13 percent over standard zero-shot classification and 5 percent over the best-performing baselines. Code is available at: https://github.com/yq-hong/AutoSEP
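
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how candidate prompts are generated or how the instance-level score is computed, so `optimize_prompt`, `propose`, `score`, and the toy stand-ins below are hypothetical names chosen for illustration. In a real setting, `propose` and `score` would wrap black-box MLLM calls; here they are simple local functions so the example runs on its own.

```python
import random
from typing import Callable, List, Sequence


def optimize_prompt(
    initial_prompt: str,
    unlabeled_images: Sequence[str],
    propose: Callable[[str], List[str]],
    score: Callable[[str, Sequence[str]], float],
    iterations: int = 5,
    batch_size: int = 4,
    seed: int = 0,
) -> str:
    """Greedy iterative search over description prompts (illustrative only)."""
    rng = random.Random(seed)
    best_prompt = initial_prompt
    for _ in range(iterations):
        # Sample a batch of unlabeled images; no labels are used anywhere.
        pool = list(unlabeled_images)
        batch = rng.sample(pool, min(batch_size, len(pool)))
        # Score the incumbent and its variants on the same batch so the
        # scores are comparable, then keep the highest-scoring prompt.
        candidates = [best_prompt] + propose(best_prompt)
        best_prompt = max(candidates, key=lambda p: score(p, batch))
    return best_prompt


def toy_propose(prompt: str) -> List[str]:
    # Hypothetical stand-in: a real system might ask the MLLM to rewrite the prompt.
    return [prompt + " Focus on beak shape.", prompt + " Focus on wing pattern."]


def toy_score(prompt: str, batch: Sequence[str]) -> float:
    # Hypothetical stand-in for the scoring function: it rewards prompts that
    # mention more visual cues. A real scorer would query the MLLM on the batch.
    cues = ("beak", "wing")
    return float(sum(cue in prompt for cue in cues))


result = optimize_prompt(
    "Describe the bird.",
    ["img1.jpg", "img2.jpg", "img3.jpg"],
    toy_propose,
    toy_score,
    iterations=3,
)
print(result)
```

Because the loop only needs two callables, it reflects the black-box property mentioned in the abstract: no gradients or model weights are touched, and only the prompt text changes between iterations.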

Code Implementations: 1 repo
