LG AI Nov 24, 2025

Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning

arXiv:2511.18871v44.1

Originality: Incremental advance

AI Analysis

This work addresses training-efficiency challenges for researchers and practitioners who use on-policy RL methods such as GRPO, though it appears incremental because it builds on existing strategies for separating inference from training.

The paper tackles a computational bottleneck in LLM reinforcement learning: synchronous execution prevents inference and training from running concurrently. It introduces a periodically asynchronous framework, enabled by data loader improvements, and reports significant end-to-end training efficiency gains on NPU platforms.
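The periodically asynchronous idea can be illustrated with a toy producer/consumer loop. This is only a sketch under stated assumptions, not the paper's implementation: the function name `run_periodic_async`, the use of threads, a queue, and a semaphore, and the "version" counter standing in for model weights are all hypothetical. It shows rollouts streaming to the trainer while they are still being generated, with weights changing only once per period, so every rollout is trained on under the same policy version that generated it.

```python
import queue
import threading

def run_periodic_async(num_periods, prompts_per_period):
    """Toy loop: rollouts stream to the trainer; weights change once per period."""
    rollouts = queue.Queue()
    weights_ready = threading.Semaphore(0)  # trainer signals that new weights exist
    policy = {"version": 0}                 # stand-in for model weights (assumption)
    log = []  # (version used for generation, version seen at training time)

    def generator():
        for _ in range(num_periods):
            version = policy["version"]     # weights stay frozen for the whole period
            for prompt_id in range(prompts_per_period):
                # Stand-in for an inference call on a separate device pool.
                rollouts.put((prompt_id, version))
            weights_ready.acquire()         # wait for the periodic weight sync

    def trainer():
        for _ in range(num_periods):
            for _ in range(prompts_per_period):
                _, gen_version = rollouts.get()  # consume as soon as a rollout arrives
                # Stand-in for forward/backward with gradient accumulation.
                log.append((gen_version, policy["version"]))
            policy["version"] += 1          # one parameter update per period
            weights_ready.release()         # publish the new weights

    threads = [threading.Thread(target=generator), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Within a period the two sides overlap; across periods they synchronize once. Under these assumptions, no rollout is ever consumed under a different policy version than the one that produced it, which is the property that keeps the scheme on-policy.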

Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of deploying inference and training separately. By introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows demand-driven, independent, and elastic scaling of each component. The accuracy of the algorithm remains completely equivalent to that of the synchronous method, and both are on-policy. We also apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, our approach consistently delivers significant end-to-end training efficiency improvements on NPU platforms, indicating its potential for widespread application.
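The abstract does not spell out how the shared-prompt attention mask is built, so the following is a minimal sketch of one plausible reading rather than the authors' construction. It assumes a single prompt is packed once, followed by several sampled responses, and that each response token may attend to the whole prompt and to earlier tokens of its own response only. The function name `shared_prompt_mask` and its parameters are hypothetical.

```python
def shared_prompt_mask(prompt_len, response_lens):
    """Boolean mask (True = may attend) for one prompt followed by packed responses."""
    total = prompt_len + sum(response_lens)
    mask = [[False] * total for _ in range(total)]

    # Prompt tokens attend causally within the prompt.
    for q in range(prompt_len):
        for k in range(q + 1):
            mask[q][k] = True

    # Each response sees the full prompt plus its own causal prefix,
    # and never the tokens of a sibling response.
    start = prompt_len
    for n in response_lens:
        for q in range(start, start + n):
            for k in range(prompt_len):
                mask[q][k] = True
            for k in range(start, q + 1):
                mask[q][k] = True
        start += n
    return mask
```

With a 2-token prompt and responses of lengths 2 and 1, the packed sequence has 5 tokens instead of the 7 needed when the prompt is repeated for each response. In a real model, position indices would presumably also need to restart after the prompt for each response; that detail is left out of this sketch.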
