CV | Nov 30, 2023

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

arXiv:2312.00081v2 | 53 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the evaluation and improvement of fine-grained visual-linguistic comprehension in vision-language models, which matters for applications that require detailed scene understanding. Its diagnostic method is novel, but its optimization of existing models is an incremental advance.

The paper tackles fine-grained vision-language understanding in three steps. It introduces a pipeline that synthesizes images varying in one specific attribute while everything else stays consistent, and uses that pipeline to build SPEC, a benchmark that diagnoses VLMs on object size, position, existence, and count. Evaluated on SPEC, four leading VLMs perform close to random guessing. The paper then proposes an optimization approach that improves performance on SPEC and on two additional fine-grained benchmarks without harming zero-shot capabilities.

Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.
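
To make "close to random guessing" concrete, here is a minimal sketch of how matching accuracy could be scored for one group of candidate images and captions. It is an illustration under stated assumptions, not the SPEC evaluation code: the abstract does not spell out the protocol, so the square score matrix, the diagonal ground truth, and the names `matching_accuracy` and `chance_level` are all hypothetical.

```python
from typing import List, Tuple


def matching_accuracy(similarity: List[List[float]]) -> Tuple[float, float]:
    """Return (image_to_text, text_to_image) accuracy for one k x k group.

    similarity[i][j] is a model's score for image i paired with caption j.
    Assumption (not from the paper): caption i is the correct match for image i.
    """
    k = len(similarity)
    # Image-to-text: for each image (row), check whether the top caption is its own.
    i2t_hits = sum(
        1 for row in range(k)
        if max(range(k), key=lambda col: similarity[row][col]) == row
    )
    # Text-to-image: for each caption (column), check whether the top image is its own.
    t2i_hits = sum(
        1 for col in range(k)
        if max(range(k), key=lambda row: similarity[row][col]) == col
    )
    return i2t_hits / k, t2i_hits / k


def chance_level(k: int) -> float:
    """Expected accuracy of a uniform random guess among k candidates."""
    return 1.0 / k


# Made-up scores for three images that differ in a single attribute.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
i2t, t2i = matching_accuracy(scores)
print(f"image-to-text: {i2t:.2f}, text-to-image: {t2i:.2f}, chance: {chance_level(3):.2f}")
```

A model whose accuracy stays near `chance_level(k)` across many such groups is not using the attribute that distinguishes the candidates, which is the kind of failure the abstract reports.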

Code Implementations: 1 repo
