CV · Apr 11, 2024

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

arXiv:2404.07449v1 · 79 citations · h-index: 48 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses a critical limitation of visual-LLMs in applications that require precise spatial understanding. It is an incremental improvement over existing methods.

The paper tackles weak spatial reasoning and localization awareness in visual-LLMs, such as failing to distinguish left from right. It explores image-space coordinate based instruction fine-tuning and reports improved spatial awareness, better VQA performance across image and video domains, reduced hallucination, and better object descriptions, evaluated on 5 tasks and 14 datasets.

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
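
To make "image-space coordinate based instruction fine-tuning" more concrete, here is a minimal sketch of how a bounding box might be turned into a text instruction sample. It is an illustration only, not the paper's implementation: the `Box` class, the prompt wording, the normalized `[x_min, y_min, x_max, y_max]` text format, and the helper names are all assumptions, since the abstract does not specify which coordinate representation or objectives the authors selected.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical container for a labeled box in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def normalize_box(box: Box, width: int, height: int, digits: int = 3) -> str:
    """Scale pixel coordinates to [0, 1] and render them as plain text.

    Normalized text coordinates are one possible representation (assumed
    here); the paper compares representations but the abstract does not
    say which one it found optimal.
    """
    coords = (
        box.x_min / width,
        box.y_min / height,
        box.x_max / width,
        box.y_max / height,
    )
    return "[" + ", ".join(f"{c:.{digits}f}" for c in coords) + "]"


def make_location_sample(box: Box, width: int, height: int) -> dict:
    """Build one instruction/response pair asking where an object is."""
    return {
        "instruction": (
            f"Where is the {box.label} in the image? "
            "Answer with normalized [x_min, y_min, x_max, y_max]."
        ),
        "response": normalize_box(box, width, height),
    }


def left_or_right(box_a: Box, box_b: Box) -> str:
    """Return whether box_a's center lies left or right of box_b's center."""
    center_a = (box_a.x_min + box_a.x_max) / 2
    center_b = (box_b.x_min + box_b.x_max) / 2
    return "left" if center_a < center_b else "right"


dog = Box("dog", 64, 120, 320, 360)
cat = Box("cat", 400, 100, 600, 300)
sample = make_location_sample(dog, width=640, height=480)
print(sample["response"])       # [0.100, 0.250, 0.500, 0.750]
print(left_or_right(dog, cat))  # left
```

A pair like `sample` could serve as one fine-tuning example, and `left_or_right` shows how ground-truth boxes could supply labels for simple left vs. right questions of the kind the abstract says existing V-LLMs fail at.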

Code Implementations: 1 repo