CV · Nov 20, 2025

Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou

arXiv:2511.16670v18.42 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in visual reasoning for AI systems, though it is incremental as it builds on existing RL methods and VLM frameworks.

The paper tackles the excessive computational cost of visual language models (VLMs) by proposing a reinforcement learning approach that lets a model switch automatically between fast and slow thinking modes based on task difficulty. The resulting model performs on par with state-of-the-art visual reasoning models while maintaining high token efficiency.

When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
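The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the token thresholds, and the choice to leave mid-length samples unlabeled are all hypothetical, since the abstract only says that labels are derived from model output length. The second function shows the group-relative reward standardization that GRPO is generally built on; the use of the population standard deviation and the `eps` term are choices made here for the sketch.

```python
from statistics import mean, pstdev


def label_thinking_mode(response_lengths, fast_max_tokens=100, slow_min_tokens=200):
    """Stage 1 (sketch): derive a thinking-mode label from output lengths.

    `response_lengths` holds token counts of answers sampled from the base
    model for one question. Both thresholds are illustrative assumptions,
    not values taken from the paper.
    Returns "fast", "slow", or None when the average length is ambiguous.
    """
    avg_len = mean(response_lengths)
    if avg_len <= fast_max_tokens:
        return "fast"
    if avg_len >= slow_min_tokens:
        return "slow"
    return None  # ambiguous length: left unlabeled in this sketch


def group_relative_advantages(rewards, eps=1e-6):
    """Stage 2 (sketch): GRPO-style advantages for one group of samples.

    Each reward is standardized against the mean and standard deviation of
    the rewards of all responses sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(label_thinking_mode([40, 60, 50]))
    print(label_thinking_mode([300, 250]))
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In a full pipeline, the stage-1 label would presumably be attached to each training example and used during GRPO training; how the label enters the prompt or the reward is not specified in the abstract, so it is omitted here.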
