LG · AI · Aug 25, 2025

GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

arXiv:2508.17850v7 · 4 citations · h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses stable RL training in decentralized, heterogeneous computing environments and offers an incremental improvement over existing methods.

It tackles training instability in decentralized reinforcement learning caused by network latency and resource heterogeneity. The proposed method, GEPO, limits the performance drop to 3% when moving from online training to 1800 s of latency and cuts the best-to-last gap by 85% relative to GSPO.

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized distributed training due to the tight coupling between parameter learning and rollout sampling. For this, we propose HeteroRL, a heterogeneous RL architecture that decouples these processes, enabling stable training across geographically distributed nodes connected via the Internet. The core component is Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency caused by network delays or heterogeneity in computational resources. Our study reveals that high latency significantly increases KL divergence, leading to higher variance of importance weights and training instability. GEPO mitigates this issue by using group expectation weighting to exponentially reduce the variance of importance weights, with theoretical guarantees. Experiments show GEPO achieves superior stability (only a 3% performance drop from online to 1800 s latency) and reduces the best-to-last gap by 85% versus GSPO (1.8 vs. 12.0) while attaining the highest scores, highlighting its effectiveness in decentralized, resource-heterogeneous environments.
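
The link between policy drift and weight variance can be made concrete with a small sketch. For the standard importance weight w = p/q with samples drawn from the sampler policy q, the variance equals the chi-squared divergence between p and q, which is at least exp(KL(p‖q)) − 1; this is a standard identity, not a result taken from the paper. The code below compares that standard weight with one possible reading of "group expectation weighting", in which the per-sample denominator q(y_i) is replaced by a group-level estimate of E_q[q(y)]. The function names and the choice of the plain sample mean as the estimator are assumptions made for illustration; the paper's exact estimator and objective may differ.

```python
import math
from statistics import pvariance


def log_mean_exp(log_values):
    """Return log(mean(exp(v))) using the max-shift trick for stability."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def standard_weights(logp_learner, logp_sampler):
    """Per-sample importance weights p(y_i) / q(y_i)."""
    return [math.exp(a - b) for a, b in zip(logp_learner, logp_sampler)]


def group_expectation_weights(logp_learner, logp_sampler):
    """Weights p(y_i) / E_hat, where E_hat is a group estimate of E_q[q(y)].

    Assumption: E_hat is the sample mean of q over the group, which is a
    Monte Carlo estimate because the responses were drawn from q.
    """
    log_denom = log_mean_exp(logp_sampler)
    return [math.exp(a - log_denom) for a in logp_learner]


if __name__ == "__main__":
    # Toy sequence log-probabilities for one group of four responses
    # to the same prompt (made-up numbers, not data from the paper).
    logp_sampler = [-1.0, -2.0, -3.0, -4.0]   # stale sampler policy q
    logp_learner = [-1.5, -1.5, -3.5, -2.0]   # current learner policy p

    w_std = standard_weights(logp_learner, logp_sampler)
    w_grp = group_expectation_weights(logp_learner, logp_sampler)
    print("variance, standard weights:", pvariance(w_std))
    print("variance, group weights:   ", pvariance(w_grp))
```

In this toy group the shared denominator removes the large weight that the standard ratio assigns to a response the stale sampler considered unlikely. When all sampler probabilities in a group are equal, the two weightings coincide.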

