Same or Not? Enhancing Visual Perception in Vision-Language Models
This work addresses the problem of enhancing fine-grained visual perception in VLMs for applications that require detailed recognition. The contribution is incremental: it keeps existing VLM architectures and adds new training data.
The paper tackles the tendency of vision-language models (VLMs) to be coarse-grained and to miss subtle visual details. It introduces TWIN, a large-scale dataset of 561,000 image-pair queries that ask models to determine whether two visually similar images depict the same object. Fine-tuning VLMs on TWIN yields gains of up to 19.3% on FGVQA, a new fine-grained benchmark, without compromising general VQA performance.
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/
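The "same or not" task described above can be illustrated with a minimal sketch. This is not the authors' data pipeline: the record schema `(image_path, object_id, category)`, the function names `build_pair_queries` and `accuracy`, and the choice to draw negatives from the same category are all assumptions made for illustration. The sketch pairs images of the same object instance as positives, pairs images of different instances as negatives, and scores yes/no predictions.

```python
import random
from itertools import combinations


def build_pair_queries(images, negatives_per_positive=1, seed=0):
    """Build yes/no image-pair queries from instance-labeled images.

    `images` is a list of (image_path, object_id, category) tuples.
    This schema is hypothetical, not the TWIN format.
    """
    rng = random.Random(seed)

    # Group image paths by object instance.
    by_object = {}
    for path, obj_id, category in images:
        by_object.setdefault((category, obj_id), []).append(path)

    queries = []
    for (category, obj_id), paths in by_object.items():
        # Candidate negatives: images of a different instance in the same
        # category, so the pair is visually similar (an assumption).
        others = [
            p
            for (c, o), ps in by_object.items()
            if c == category and o != obj_id
            for p in ps
        ]
        for a, b in combinations(paths, 2):
            # Positive pair: two different images of the same object.
            queries.append({"image_a": a, "image_b": b, "answer": "yes"})
            # Negative pairs: same anchor image, different object.
            for _ in range(min(negatives_per_positive, len(others))):
                queries.append(
                    {"image_a": a, "image_b": rng.choice(others), "answer": "no"}
                )
    return queries


def accuracy(predictions, queries):
    """Fraction of yes/no predictions that match the query answers."""
    if not queries:
        return 0.0
    correct = sum(p == q["answer"] for p, q in zip(predictions, queries))
    return correct / len(queries)


if __name__ == "__main__":
    toy_images = [
        ("mug1_a.jpg", "mug1", "mug"),
        ("mug1_b.jpg", "mug1", "mug"),
        ("mug2_a.jpg", "mug2", "mug"),
    ]
    toy_queries = build_pair_queries(toy_images)
    # A trivial baseline that always answers "yes".
    print(accuracy(["yes"] * len(toy_queries), toy_queries))
```

In this toy setup, a model that always answers "yes" gets every negative pair wrong, which is why a balanced mix of positives and negatives matters when reporting accuracy on such queries.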