CV · AI · LG · RO · Mar 27, 2025

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

arXiv:2503.22020v1 · 414 citations · h-index: 12 · CVPR
Originality: Highly original
AI Analysis

This work addresses the limited reasoning capabilities of VLAs for robotics and enables better performance in manipulation tasks, though it is incremental in that it builds on existing VLA paradigms.

The paper tackles the lack of intermediate reasoning in vision-language-action models (VLAs) for complex manipulation tasks. It introduces CoT-VLA, which adds visual chain-of-thought reasoning through autoregressive future image prediction. The authors report that it outperforms the state-of-the-art VLA model by 17% on real-world manipulation tasks and by 6% on simulation benchmarks.

Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
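
The two-stage decoding order described in the abstract (first a predicted future frame as a visual goal, then a short action sequence conditioned on it) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `VisualCoTPolicy`, `next_token`, `num_image_tokens`, `num_action_tokens`, and `toy_next_token` are hypothetical, tokens are plain integers, and the stand-in "model" is a deterministic arithmetic function rather than a 7B VLA.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Token = int  # assumption: image, text, and action tokens share one integer vocabulary


@dataclass
class VisualCoTPolicy:
    """Toy two-stage decoder: subgoal image tokens first, then an action chunk."""

    next_token: Callable[[Sequence[Token]], Token]  # hypothetical autoregressive model
    num_image_tokens: int   # tokens per predicted subgoal frame (assumed)
    num_action_tokens: int  # tokens per short action sequence (assumed)

    def _decode(self, context: List[Token], n: int) -> List[Token]:
        # Generate n tokens one at a time, each conditioned on everything so far.
        out: List[Token] = []
        for _ in range(n):
            out.append(self.next_token(context + out))
        return out

    def step(
        self,
        observation_tokens: Sequence[Token],
        instruction_tokens: Sequence[Token],
    ) -> Tuple[List[Token], List[Token]]:
        context = list(instruction_tokens) + list(observation_tokens)
        # Stage 1: predict a future frame (the visual goal) as image tokens.
        subgoal = self._decode(context, self.num_image_tokens)
        # Stage 2: predict actions conditioned on the context AND the subgoal.
        actions = self._decode(context + subgoal, self.num_action_tokens)
        return subgoal, actions


def toy_next_token(context: Sequence[Token]) -> Token:
    # Stand-in for a learned model: a deterministic function of the context.
    return (sum(context) + len(context)) % 10


policy = VisualCoTPolicy(toy_next_token, num_image_tokens=4, num_action_tokens=3)
subgoal, actions = policy.step(observation_tokens=[1, 2, 3], instruction_tokens=[7])
print("subgoal tokens:", subgoal)
print("action tokens:", actions)
```

The point of the sketch is only the ordering: the action tokens are decoded with the predicted subgoal already in the context, so the visual goal acts as the intermediate reasoning step before any action is emitted.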

