CV · Jan 29

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

arXiv:2601.21821v124 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the shortage of high-quality reasoning data for open-source VLMs and improves performance in domains such as STEM and visual puzzles. The advance is incremental, since it builds on existing data-centric methods.

The paper tackles the multimodal reasoning gap in open-source Vision Language Models by introducing MMFineReason, a large-scale dataset with 1.8M samples and 5.1B solution tokens. Models fine-tuned on it achieve state-of-the-art results for their size class: MMFineReason-8B outperforms Qwen3-VL-30B-A3B-Thinking and approaches Qwen3-VL-32B-Thinking.

Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack the consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. We also reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
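The dataset figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below uses only the rounded numbers given there (1.8M samples, 5.1B solution tokens, 123K-sample subset), so the results are approximate; the function and constant names are illustrative and do not come from the paper or its code.

```python
# Figures quoted in the abstract, rounded as reported there.
TOTAL_SAMPLES = 1_800_000        # "1.8M samples"
TOTAL_SOLUTION_TOKENS = 5.1e9    # "5.1B solution tokens"
SUBSET_SAMPLES = 123_000         # "123K samples"


def average_tokens_per_sample(total_tokens: float, total_samples: int) -> float:
    """Mean solution length implied by the dataset totals."""
    return total_tokens / total_samples


def subset_fraction(subset_samples: int, total_samples: int) -> float:
    """Share of the full dataset kept by the filtered subset."""
    return subset_samples / total_samples


if __name__ == "__main__":
    avg = average_tokens_per_sample(TOTAL_SOLUTION_TOKENS, TOTAL_SAMPLES)
    frac = subset_fraction(SUBSET_SAMPLES, TOTAL_SAMPLES)
    print(f"average solution tokens per sample: {avg:.0f}")
    print(f"subset share of full dataset: {frac:.1%}")
```

With these rounded inputs, the average solution runs to roughly 2.8K tokens, which fits the abstract's emphasis on long-form CoT annotations, and the 123K subset comes to a little under 7% of the full set, in line with the stated figure.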

