VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
VOLD addresses the scarcity of high-quality image-text reasoning data for VLMs by leveraging abundant text-only resources, which enables more efficient training. The contribution is incremental, since it builds on existing distillation and reinforcement learning methods.
The paper tackles the challenge of training vision-language models for complex reasoning. It proposes VOLD, a framework that transfers reasoning capabilities from text-only teacher models to VLM students by combining reinforcement learning with on-policy distillation. The authors report significant gains over the baseline and improvements over the state of the art on benchmarks such as MMMU-Pro and MathVista.
Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model and results in a significant gain over using GRPO alone. Our ablations further show that a cold-start alignment via SFT is essential for effective transfer during the online training phase in this scenario: without sufficient distributional alignment between teacher and student, on-policy distillation with a text-only teacher fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art.
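To make the combination of GRPO and on-policy distillation concrete, the following is a minimal sketch and not the paper's implementation. It assumes a simplified policy-gradient surrogate (no importance ratios or clipping), a per-token reverse KL from student to teacher evaluated on the student's own sampled traces, and a hypothetical weighting parameter `distill_weight`. The abstract does not specify the exact objective, so all function names and the form of the combined loss are illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))

def combined_loss(student_logits, teacher_logits, sampled_tokens, rewards,
                  distill_weight=0.1):
    """Illustrative combined objective (an assumption, not the paper's loss).

    student_logits, teacher_logits: nested lists shaped [G][T][V], where the
        teacher logits are assumed to be computed on the student's own samples.
    sampled_tokens: token ids shaped [G][T], drawn from the student (on-policy).
    rewards: one scalar reward per sampled trace, shaped [G].
    distill_weight: hypothetical coefficient balancing the two terms.
    """
    advantages = group_relative_advantages(rewards)
    pg_terms, kl_terms = [], []
    for adv, s_seq, t_seq, toks in zip(advantages, student_logits,
                                       teacher_logits, sampled_tokens):
        for s_tok, t_tok, tok in zip(s_seq, t_seq, toks):
            # Policy-gradient term: raise log-prob of tokens in above-average traces.
            pg_terms.append(-adv * log_softmax(s_tok)[tok])
            # Distillation term: pull the student distribution toward the teacher.
            kl_terms.append(reverse_kl(s_tok, t_tok))
    pg_loss = sum(pg_terms) / len(pg_terms)
    kl_loss = sum(kl_terms) / len(kl_terms)
    return pg_loss + distill_weight * kl_loss, pg_loss, kl_loss
```

The sketch also hints at why distributional alignment matters: the KL term is computed on traces sampled from the student, so a teacher that assigns very low probability to those traces would yield large, uninformative penalties rather than useful guidance.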