Thinking with Images via Self-Calling Agent
This work addresses training efficiency and data scarcity in visual reasoning. It is aimed at AI researchers and represents an incremental improvement over existing methods.
The paper tackles the difficulty of optimizing interleaved multimodal Chain-of-Thought (iMCoT) for visual reasoning by proposing Self-Calling Chain-of-Thought (sCoT), which reformulates iMCoT as a language-only CoT with self-calling. On HR-Bench 4K, sCoT improves performance by up to 1.9% while using about 75% fewer GPU hours than strong baselines.
Thinking-with-images paradigms have shown remarkable visual reasoning capability by integrating visual information as dynamic elements of the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes a complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. Because it requires no explicit interleaving between modalities, sCoT is substantially more effective and efficient to train. To further improve optimization, sCoT uses group-relative policy optimization to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
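The self-calling control flow described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the released code. The `Model` callable, the `Subtask` type, the string image handles, and the function names are all hypothetical. The sketch also assumes the subtasks are supplied by the caller, whereas in sCoT the main agent produces the decomposition itself, and it assumes the main agent receives the full image as input.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: one shared model, called as model(prompt, image) -> text.
Model = Callable[[str, str], str]


@dataclass
class Subtask:
    prompt: str  # atomic question for a subagent
    image: str   # hypothetical handle for the image (or crop) to inspect


def call_subagent(model: Model, task: Subtask) -> str:
    # Isolated context: the subagent sees only its own subtask,
    # none of the main agent's reasoning history.
    return model(task.prompt, task.image)


def self_calling_cot(model: Model, question: str, image: str,
                     subtasks: List[Subtask]) -> str:
    # The main agent's chain of thought stays text-only: subagent
    # results are appended as plain strings, not as visual content.
    trace = [f"Question: {question}"]
    for i, task in enumerate(subtasks, start=1):
        answer = call_subagent(model, task)  # same model, fresh context
        trace.append(f"Subtask {i}: {task.prompt} -> {answer}")
    trace.append("Final answer:")
    # Assumption: the main agent answers from the text trace plus the input image.
    return model("\n".join(trace), image)
```

Because the main agent and the subagents are the same callable, a single set of parameters serves both roles, which is the parameter-sharing property the abstract refers to.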