Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
This work addresses the challenge of improving reasoning in resource-efficient MLLMs. The contribution is incremental, as it builds on existing methods such as SFT and RL.
This paper tackles the problem of enhancing reasoning abilities in lightweight multimodal language models (MLLMs) with fewer than seven billion parameters. It finds that Supervised Fine-Tuning (SFT) with long Chain-of-Thought (long CoT) data significantly improves their reasoning, and that a subsequent Reinforcement Learning (RL) stage yields additional performance gains.
While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent Reinforcement Learning (RL) stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.