DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
DeepVision-103K targets the limited performance gains that AI researchers and developers see when training multimodal reasoning models on existing data. The contribution is incremental: it builds on prior RLVR methods and improves the training data rather than the method itself.
The authors address the limited diversity and coverage of existing datasets for Reinforcement Learning with Verifiable Rewards (RLVR) by introducing DeepVision-103K, a comprehensive dataset covering K12 mathematical topics. Models trained on it perform better on multimodal benchmarks and show enhanced visual perception and reasoning capabilities.
Reinforcement Learning with Verifiable Rewards (RLVR) has been shown to be effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and coverage and thereby constrains further gains in model performance. To this end, we introduce \textbf{DeepVision-103K}, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection, and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: \href{https://huggingface.co/datasets/skylenage/DeepVision-103K}{this url}.
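To make the "verifiable" part of RLVR concrete, the following is a minimal sketch of a rule-based reward that checks a model's final answer against a reference answer. It is an illustration only: the abstract does not describe the verifier used with DeepVision-103K, and the function names, the `\boxed{...}` answer convention, and the normalization rules here are all assumptions.

```python
import re
from typing import Optional

# Assumption: the model is prompted to put its final answer in \boxed{...}.
# This simple pattern does not handle nested braces such as \boxed{\frac{3}{4}}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def normalize(answer: str) -> str:
    """Remove whitespace, a trailing period, and case differences."""
    return re.sub(r"\s+", "", answer).rstrip(".").lower()


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Hypothetical binary reward: 1.0 for a matching final answer, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0  # no parseable final answer, so nothing to verify
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0
```

For example, `verifiable_reward("So the area is \\boxed{ 3 / 4 }", "3/4")` yields `1.0`, while a response without a boxed answer yields `0.0`. A practical verifier would likely also need symbolic or numeric equivalence checks, which this sketch omits.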