CL CVJul 8, 2025

Skywork-R1V3 Technical Report

Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou

arXiv:2507.06167v317.010 citationsh-index: 19Has Code

Originality Incremental advance

AI Analysis

This represents a significant leap in multimodal reasoning for advancing open-source VLM capabilities, though it builds on existing methods with incremental innovations.

The paper tackles the problem of visual reasoning by introducing Skywork-R1V3, an open-source vision-language model that transfers reasoning skills from text-only LLMs to visual tasks using a post-training RL framework, achieving state-of-the-art results on MMMU with an improvement from 64.3% to 76.0% and matching entry-level human capabilities.

We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continue pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.

View on arXiv PDF Code

Similar