AI · Aug 13, 2025

MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li

arXiv:2508.09670v1 · 13 citations · h-index: 14

Originality: Incremental advance

AI Analysis

The paper addresses a bottleneck in using RLVR to improve LLM reasoning, offering a domain-specific incremental advance.

The paper tackles reward sparsity in reinforcement learning with verifiable rewards (RLVR) for large language models. It proposes MEML-GRPO, which uses diverse expert prompts and inter-expert mutual learning to raise the chance that at least one sampled answer is correct. Across reasoning benchmarks, the reported average performance gains are 4.89% with Qwen and 11.33% with Llama.

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model's performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
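The reward-sparsity problem described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO formulation, in which each reward is standardized against the mean and standard deviation of its own sampling group; the abstract does not give MEML-GRPO's exact objective, so the function name `group_advantages`, the `eps` constant, and the example reward vectors are hypothetical and for illustration only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampling group (assumed GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every candidate answer is wrong: all rewards are zero, so every
# advantage is zero and the group contributes no learning signal.
sparse = group_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical case where one of several expert prompts yields a correct
# answer: that sample gets a positive advantage, the others a negative one.
mixed = group_advantages([0.0, 0.0, 1.0, 0.0])

print(sparse)
print(mixed)
```

Under this assumption, a group with identical rewards yields all-zero advantages, which is why a single correct answer from any expert prompt is enough to restore a usable gradient for that question.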

