LG | Mar 24, 2025

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

arXiv:2503.18929v1 | 28 citations
Originality: Incremental advance
AI Analysis

This paper addresses slow, inefficient RL for LLM post-training and offers a scalable solution for researchers and practitioners. The advance appears incremental, since it builds on existing methods such as Trajectory Balance.

The paper observes that the on-policy RL algorithms used for LLM post-training are incompatible with experience replay buffers. It proposes Trajectory Balance with Asynchrony (TBA), a scalable system that decouples exploration from learning. The authors report wall-clock training speedups of 4x or more, along with improved performance on mathematical reasoning, preference-tuning, and automated red-teaming.

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
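
The two mechanisms the abstract names, a central replay buffer sampled by reward or recency and a Trajectory Balance update, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `ReplayBuffer`, the function `trajectory_balance_loss`, the softmax-style reward weighting, and the scalar `log_z` are all assumptions made for illustration. The loss uses the standard squared TB residual, (log Z + log pi(y|x) - log R(x, y))^2, where `log_probs` stands for the sequence-level log-probability of each response under the current policy.

```python
import math
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity buffer of (prompt, response, log_reward) tuples.

    In a TBA-style setup, off-policy actors would call add() continuously
    while a separate training node calls sample().
    """

    def __init__(self, capacity=10_000, seed=0):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, prompt, response, log_reward):
        self.items.append((prompt, response, log_reward))

    def sample(self, k, mode="reward"):
        items = list(self.items)
        if mode == "recency":
            # Take the k most recently added trajectories.
            return items[-k:]
        # Reward-prioritised sampling (assumed scheme): weight each item by
        # exp(log_reward), shifted by the max for numerical stability.
        max_lr = max(lr for _, _, lr in items)
        weights = [math.exp(lr - max_lr) for _, _, lr in items]
        return self.rng.choices(items, weights=weights, k=k)


def trajectory_balance_loss(log_z, log_probs, log_rewards):
    """Mean squared TB residual over a batch.

    residual = log Z + log pi(y | x) - log R(x, y)
    The loss is zero exactly when the policy samples y in proportion to R.
    """
    residuals = [log_z + lp - lr for lp, lr in zip(log_probs, log_rewards)]
    return sum(r * r for r in residuals) / len(residuals)


# Usage: fill the buffer with toy trajectories, then sample both ways.
buffer = ReplayBuffer(capacity=100)
for i in range(5):
    buffer.add(f"prompt-{i}", f"response-{i}", log_reward=-float(i))

recent = buffer.sample(2, mode="recency")
by_reward = buffer.sample(3, mode="reward")
loss = trajectory_balance_loss(
    log_z=1.0, log_probs=[-1.0, -2.0], log_rewards=[-1.0, -2.0]
)
```

Because the residual depends only on stored rewards and the current policy's log-probabilities, the loss can be evaluated on trajectories generated by any earlier policy, which is what makes a replay buffer usable here. How the paper estimates log Z, tempers rewards, or mixes the two sampling modes is not specified in this abstract.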

Code Implementations: 1 repo
