CV | Nov 15, 2024

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Tsinghua
arXiv:2411.10440v6 | 485 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

The work addresses the need for stronger reasoning in VLMs on tasks such as visual question answering, though it appears incremental because it builds on existing chain-of-thought methods.

The paper tackles the problem of Vision-Language Models struggling with systematic reasoning in complex visual question-answering tasks by introducing LLaVA-CoT, which achieves a 9.4% improvement over its base model and outperforms larger models like Gemini-1.5-pro on multimodal reasoning benchmarks.

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.
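The four sequential stages named in the abstract (summarization, visual interpretation, logical reasoning, conclusion) can be pictured as a response split into ordered, labeled segments. The sketch below is only an illustration under stated assumptions: the tag names `SUMMARY`, `CAPTION`, `REASONING`, and `CONCLUSION`, the helper `split_stages`, and the example response are hypothetical and are not taken from the text above or from the linked repository.

```python
import re
from typing import Dict

# Hypothetical stage tags: the abstract names the four stages but does not
# specify a markup format, so these labels are an assumption for illustration.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_stages(response: str) -> Dict[str, str]:
    """Extract each stage's text, requiring the stages to appear in order.

    Raises ValueError if a stage is missing or appears out of sequence.
    """
    parts: Dict[str, str] = {}
    pos = 0  # search for each stage only after the end of the previous one
    for stage in STAGES:
        pattern = re.compile(rf"<{stage}>(.*?)</{stage}>", re.DOTALL)
        match = pattern.search(response, pos)
        if match is None:
            raise ValueError(f"missing or out-of-order stage: {stage}")
        parts[stage] = match.group(1).strip()
        pos = match.end()
    return parts


# Made-up response used only to exercise the parser.
example = (
    "<SUMMARY>Count the red objects.</SUMMARY>"
    "<CAPTION>Two red cubes and one blue sphere.</CAPTION>"
    "<REASONING>Only the cubes are red, so the count is 2.</REASONING>"
    "<CONCLUSION>2</CONCLUSION>"
)
stages = split_stages(example)
print(stages["CONCLUSION"])
```

Keeping the stages as separate, ordered segments is what would let a stage-wise test-time search score or regenerate one stage at a time rather than the whole answer; the search itself is not sketched here.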

Code Implementations: 2 repos
