CL | AI | Jun 28, 2024

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

arXiv:2406.19934v2 | 27 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of multi-step reasoning data for VLMs with a reproducible and cost-efficient solution, though it is incremental because it builds on existing paradigms and tools.

The paper tackles the challenge of multi-step reasoning in vision-language models by introducing a least-to-most paradigm and a data synthesis approach, resulting in a plug-and-play visual reasoner that significantly improves four VLMs on four VQA benchmarks.

We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging because reasoning data consisting of multiple steps of visual and language processing is scarce. To overcome this challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves decomposing a question into sub-questions with invoking external tools to resolve them. Based on this paradigm, we further propose a novel data synthesis approach that automatically creates questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks and relies (almost entirely) on open-source models to accomplish them. The entire synthesis process is therefore reproducible and cost-efficient, and the quality of the synthesized data is guaranteed. With this approach, we construct 50k visual reasoning examples. We then develop a visual reasoner through supervised fine-tuning, which can enhance the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner consistently and significantly improves four VLMs on four VQA benchmarks. Our code and dataset are available at https://github.com/steven-ccq/VisualReasoner.
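
The interleaving of question decomposition and tool invocation described in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `Step` record, the `run_least_to_most` function, the tool names (`grounding`, `vqa`), and the scripted planner are hypothetical stand-ins for the fine-tuned reasoner and its external tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: (image_id, sub_question) -> observation text.
Tool = Callable[[str, str], str]


@dataclass
class Step:
    sub_question: str
    tool: str
    observation: str


# Hypothetical planner signature: given the question and the steps so far,
# return the next (sub_question, tool_name), or None when reasoning is done.
Planner = Callable[[str, List[Step]], Optional[Tuple[str, str]]]


def run_least_to_most(
    image_id: str,
    question: str,
    propose: Planner,
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> List[Step]:
    """Alternate between proposing a sub-question and calling a tool on it."""
    steps: List[Step] = []
    for _ in range(max_steps):
        proposal = propose(question, steps)
        if proposal is None:  # planner signals that no further step is needed
            break
        sub_question, tool_name = proposal
        observation = tools[tool_name](image_id, sub_question)
        steps.append(Step(sub_question, tool_name, observation))
    return steps


# Toy stand-ins (assumptions): a fixed two-step plan and canned tool outputs.
def scripted_planner(question: str, history: List[Step]) -> Optional[Tuple[str, str]]:
    plan = [
        ("Where is the dog?", "grounding"),
        ("What color is the highlighted region?", "vqa"),
    ]
    return plan[len(history)] if len(history) < len(plan) else None


toy_tools: Dict[str, Tool] = {
    "grounding": lambda image_id, q: "box(10, 20, 110, 140)",
    "vqa": lambda image_id, q: "brown",
}

trace = run_least_to_most("img-001", "What color is the dog?", scripted_planner, toy_tools)
for step in trace:
    print(f"{step.tool}: {step.sub_question} -> {step.observation}")
```

In this sketch the planner sees the accumulated steps before proposing the next sub-question, which is what lets later sub-questions depend on earlier tool outputs. In the paper's setting a fine-tuned model would play the planner role, and the final observation would be handed to an existing VLM.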

Code Implementations: 2 repos
