Hybrid Group Relative Policy Optimization: A Multi-Sample Approach to Enhancing Policy Optimization
This work addresses the challenge of balancing stability and efficiency in reinforcement learning for researchers and practitioners, though it is incremental as it builds on existing methods like PPO and GRPO.
The paper tackles the problem of improving policy optimization in reinforcement learning by introducing Hybrid GRPO, which combines empirical multi-sample action evaluation with value function-based learning to enhance stability and sample efficiency, achieving superior convergence speed and more stable policy updates compared to existing methods like PPO and DeepSeek GRPO.
Hybrid Group Relative Policy Optimization (Hybrid GRPO) is a reinforcement learning framework that extends Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) by incorporating empirical multi-sample action evaluation while preserving the stability of value function-based learning. Unlike DeepSeek GRPO, which eliminates the value function in favor of purely empirical reward estimation, Hybrid GRPO introduces a structured advantage computation method that balances empirical action sampling with bootstrapped value estimation. This approach enhances sample efficiency, improves learning stability, and mitigates variance amplification observed in purely empirical methods. A detailed mathematical comparison between PPO, DeepSeek GRPO, and Hybrid GRPO is presented, highlighting key differences in advantage estimation and policy updates. Experimental validation in a controlled reinforcement learning environment demonstrates that Hybrid GRPO achieves superior convergence speed, more stable policy updates, and improved sample efficiency compared to existing methods. Several extensions to Hybrid GRPO are explored, including entropy-regularized sampling, hierarchical multi-step sub-sampling, adaptive reward normalization, and value-based action selection. Beyond reinforcement learning in simulated environments, Hybrid GRPO provides a scalable framework for bridging the gap between large language models (LLMs) and real-world agent-based decision-making. By integrating structured empirical sampling with reinforcement learning stability mechanisms, Hybrid GRPO has potential applications in autonomous robotics, financial modeling, and AI-driven control systems. These findings suggest that Hybrid GRPO serves as a robust and adaptable reinforcement learning methodology, paving the way for further advancements in policy optimization.