CV | Oct 4, 2025

Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models

Md. Atabuzzaman, Andrew Zhang, Chris Thomas

arXiv:2510.03903v18.41 citations | h-index: 2 | Has Code | EMNLP

Originality: Highly original

AI Analysis

This paper addresses precise image classification without labeled data, a problem relevant to researchers and practitioners in computer vision. It represents a novel application rather than an incremental improvement.

The paper recasts zero-shot fine-grained image classification as a visual question-answering task for large vision-language models, and it reports outperforming the current state-of-the-art method across multiple benchmarks.

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task requiring precise differentiation between visually similar categories, remains underexplored. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework, leveraging LVLMs' comprehensive understanding capabilities rather than relying on direct class name generation. We enhance model performance through a novel attention intervention technique. We also address a key limitation in existing datasets by developing more comprehensive and precise class description benchmarks. We validate the effectiveness of our method through extensive experimentation across multiple fine-grained image classification benchmarks. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both the effectiveness of our method and the broader potential of LVLMs for zero-shot fine-grained classification tasks. Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification
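The reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes one yes/no question per candidate class and a scoring callable that returns how strongly a model answers "yes". The names `build_question`, `classify_via_vqa`, and `toy_yes_score`, and the prompt wording, are all hypothetical. The attention intervention technique is not sketched here.

```python
from typing import Callable, Dict, Sequence


def build_question(class_name: str, description: str) -> str:
    """Format one yes/no question for a candidate class (hypothetical wording)."""
    return (
        f"Does this image show a {class_name}? "
        f"Description: {description} Answer yes or no."
    )


def classify_via_vqa(
    image: object,
    class_descriptions: Dict[str, str],
    yes_score: Callable[[object, str], float],
) -> str:
    """Ask one question per class and return the class with the highest 'yes' score.

    `yes_score` stands in for an LVLM call; it is an assumption, not a real API.
    """
    scores = {
        name: yes_score(image, build_question(name, desc))
        for name, desc in class_descriptions.items()
    }
    return max(scores, key=scores.get)


def toy_yes_score(image: Sequence[str], question: str) -> float:
    """Toy stand-in for a model: counts image attributes mentioned in the question."""
    text = question.lower()
    return float(sum(attribute in text for attribute in image))


# Toy data: the "image" is just a list of visible attributes.
descriptions = {
    "Indigo Bunting": "small bird with deep blue plumage",
    "Blue Grosbeak": "blue bird with chestnut wing bars and thick beak",
}
image = ["blue", "chestnut", "thick beak"]
prediction = classify_via_vqa(image, descriptions, toy_yes_score)
print(prediction)
```

In a real pipeline, `toy_yes_score` would be replaced by a call to a vision-language model, for example one that compares the likelihood of a "yes" answer against a "no" answer. The class descriptions play the role of the more precise description benchmarks the abstract mentions.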
