CV · Jun 29, 2025

Empowering Small VLMs to Think with Dynamic Memorization and Exploration

arXiv:2506.23061v1 · 5 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses unreliable thinking in small vision-language models (VLMs). It is an incremental advance that combines two existing training methods, SFT and RLVR, in a new way.

The paper tackles the challenge of enabling small-scale vision-language models (SVLMs) to perform reliable thinking by addressing their limited capacity and poor instruction-following, resulting in substantial performance improvements across diverse domains.

Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capabilities of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME
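The per-step choice between Memorization and Exploration described in the abstract can be illustrated with a minimal sketch. The abstract does not state DyME's actual selection rule, so the criterion below is an assumption made for illustration only: fall back to memorization (SFT) whenever the sampled responses all earn the same reward, since group-normalized advantages are then all zero (one reading of "advantage collapse"). The names `group_advantages` and `select_mode` are hypothetical and are not taken from the DyME repository.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    If every reward in the group is identical, every advantage is 0,
    so an RLVR update driven by these advantages carries no signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_mode(rewards, tol=1e-6):
    """Pick the training mode for one optimization step.

    ASSUMPTION (not from the paper): use "memorization" (SFT) when the
    reward spread is effectively zero, otherwise "exploration" (RLVR).
    Note that this also routes all-correct groups to memorization.
    """
    if pstdev(rewards) <= tol:
        return "memorization"
    return "exploration"


if __name__ == "__main__":
    # Hypothetical verifiable rewards (1 = correct, 0 = incorrect).
    for rewards in ([0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]):
        print(rewards, select_mode(rewards), group_advantages(rewards))
```

In a real training loop, the selected mode would decide whether the step applies a supervised loss on a reference answer or a policy-gradient loss on the sampled responses. How DyME actually makes that decision should be checked against the paper and the linked repository.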

