CV · Jan 24, 2024

Democratizing Fine-grained Visual Recognition with Large Language Models

arXiv:2401.13837v2 · 27 citations · ICLR
Originality: Incremental advance
AI Analysis

This work addresses the bottleneck of requiring expert annotations for FGVR, making it more accessible for real-world applications like species identification, though it is incremental as it builds on existing LLM capabilities.

The paper tackles the problem of fine-grained visual recognition (FGVR) by proposing Fine-grained Semantic Category Reasoning (FineR), which uses large language models (LLMs) to reason about subordinate-level categories from part-level visual attributes, eliminating the need for expert annotations. The training-free method outperforms state-of-the-art FGVR and vision-language models, showing promise for real-world applications in new domains.

Identifying subordinate-level categories from images is a longstanding task in computer vision and is referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications since an average layperson does not excel at differentiating species of birds or mushrooms due to subtle differences among the species. A major bottleneck in developing FGVR systems is the need for high-quality paired expert annotations. To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy in order to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLMs, we extract part-level visual attributes from images as text and feed that information to an LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language and vision assistant models and shows promise in working in the wild and in new domains where gathering expert annotation is arduous.
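
The attribute-to-LLM step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the prompt wording, and the `llm` callable are all hypothetical, and the part-level attributes are assumed to have already been extracted from the image as text by some upstream vision-language model.

```python
from typing import Callable, Dict, List


def build_reasoning_prompt(super_category: str, attributes: Dict[str, str]) -> str:
    """Format part-level attribute descriptions as a text prompt (hypothetical wording)."""
    lines = [f"- {part}: {description}" for part, description in attributes.items()]
    return (
        f"An image shows a {super_category} with these visual attributes:\n"
        + "\n".join(lines)
        + "\nList plausible fine-grained category names, one per line."
    )


def parse_candidate_names(response: str) -> List[str]:
    """Split the LLM response into candidate names, dropping bullets, numbering and blanks."""
    names = []
    for raw in response.splitlines():
        name = raw.strip().lstrip("-*0123456789. ").strip()
        if name:
            names.append(name)
    return names


def reason_about_category(
    super_category: str,
    attributes: Dict[str, str],
    llm: Callable[[str], str],  # assumption: any text-in, text-out model wrapper
) -> List[str]:
    """Ask the LLM for candidate subordinate-level names given text attributes."""
    prompt = build_reasoning_prompt(super_category, attributes)
    return parse_candidate_names(llm(prompt))


if __name__ == "__main__":
    # Stand-in for a real LLM call, used only to make the sketch runnable.
    def fake_llm(prompt: str) -> str:
        return "1. Scarlet Tanager\n- Summer Tanager\n"

    attrs = {"wing": "black", "body": "bright red", "beak": "short and pale"}
    print(reason_about_category("bird", attrs, fake_llm))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM provider; how candidate names are then matched back to images is outside what this sketch covers.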

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
