CV · Sep 12, 2023

Zero-Shot Visual Classification with Guided Cropping

arXiv:2309.06581v1 · 1 citation · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses a specific limitation of zero-shot visual classification that matters to researchers and practitioners using models like CLIP. It is an incremental advance, since it adds a preprocessing step on top of existing methods.

The paper tackles the degraded zero-shot classification performance of pretrained vision-language models like CLIP when the objects of interest are small. It proposes CLIP with Guided Cropping (GC-CLIP), which uses an off-the-shelf zero-shot object detection model to crop images before classification. The result is improved classification across architectures and datasets, especially for small objects.

Pretrained vision-language models, such as CLIP, show promising zero-shot performance across a wide variety of datasets. For closed-set classification tasks, however, there is an inherent limitation: CLIP image encoders are typically designed to extract generic image-level features, which also summarize information that is superfluous or confounding for the target task. This degrades classification performance, especially when the objects of interest cover small areas of the input image. In this work, we propose CLIP with Guided Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection model in a preprocessing step to increase the focus of the zero-shot classifier on the object of interest and to minimize the influence of extraneous image regions. We empirically show that our approach improves zero-shot classification results across architectures and datasets, most notably for small objects.
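The detect-crop-classify idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `detect`, `encode_image`, and `encode_text` are hypothetical callables standing in for an off-the-shelf zero-shot detector and the CLIP encoders, the `margin` parameter and the fallback to the full image when nothing is detected are assumptions, and images are plain nested lists so the sketch runs without any vision library.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[float]]                # toy image: rows of pixel values


def expand_and_clamp(box: Box, width: int, height: int,
                     margin: float = 0.2) -> Tuple[int, int, int, int]:
    """Grow a box by `margin` of its size on each side, clamped to the image."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * margin
    dy = (y1 - y0) * margin
    return (
        max(0, math.floor(x0 - dx)),
        max(0, math.floor(y0 - dy)),
        min(width, math.ceil(x1 + dx)),
        min(height, math.ceil(y1 + dy)),
    )


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image covered by an integer pixel box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def guided_crop_classify(
    image: Image,
    class_names: Sequence[str],
    detect: Callable[[Image, Sequence[str]], Optional[Box]],   # hypothetical detector
    encode_image: Callable[[Image], Sequence[float]],          # hypothetical image encoder
    encode_text: Callable[[str], Sequence[float]],             # hypothetical text encoder
    margin: float = 0.2,                                       # assumed crop margin
) -> str:
    """Crop to the detected object (if any), then pick the best-matching class."""
    height, width = len(image), len(image[0])
    box = detect(image, class_names)
    if box is not None:
        # Preprocessing step: keep only the region around the detected object.
        image = crop(image, expand_and_clamp(box, width, height, margin))
    # Standard zero-shot step: compare the image embedding with each class embedding.
    image_emb = encode_image(image)
    scores = [cosine(image_emb, encode_text(name)) for name in class_names]
    return class_names[scores.index(max(scores))]
```

In a real setting the three callables would wrap pretrained models and the crop would operate on an actual image object. The structure stays the same: the detector only decides where to look, and the zero-shot classifier is left unchanged.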

