CL · Apr 27

DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

arXiv:2604.24320 · 93.9 · Has Code
AI Analysis

This work addresses the limited exploration and incomplete environmental understanding of LLM agents by letting them interact with multiple environments in parallel, a novel approach to improving agent performance on complex tasks.

DPEPO introduces a new paradigm where LLM agents interact with multiple environments simultaneously, using a reinforcement learning algorithm with hierarchical rewards to promote diverse parallel exploration. It achieves state-of-the-art success rates on ALFWorld and ScienceWorld while maintaining efficiency comparable to sequential baselines.

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO has two stages: an initial supervised fine-tuning (SFT) stage imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, the Diverse Action Reward and the Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
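The hierarchical reward described in the abstract can be illustrated with a rough sketch. The abstract does not give the reward formulas, so everything below is an assumption made for illustration: the 1/count scoring rule, the weights `w_action`, `w_state`, and `w_step`, and the function names are all hypothetical and are not taken from the paper or its code. The sketch only shows the general idea that parallel branches which repeat the same action or reach the same next state earn less step-level reward than branches that explore something different.

```python
from collections import Counter
from typing import Hashable, Sequence


def diversity_reward(items: Sequence[Hashable]) -> list[float]:
    """Score each parallel branch as 1 / (number of branches sharing its item).

    Unique items score 1.0; duplicates split the credit.
    Hypothetical rule, not the paper's definition.
    """
    counts = Counter(items)
    return [1.0 / counts[item] for item in items]


def step_rewards(actions, next_states, w_action=0.5, w_state=0.5):
    """Combine action diversity and state-transition diversity for one step.

    The weights are assumed values chosen for illustration.
    """
    r_action = diversity_reward(actions)
    r_state = diversity_reward(next_states)
    return [w_action * a + w_state * s for a, s in zip(r_action, r_state)]


def total_rewards(step_reward_history, success, w_step=0.1):
    """Add one shared trajectory-level success reward to each branch's
    mean step-level diversity bonus (assumed combination rule)."""
    n_steps = len(step_reward_history)
    n_branches = len(step_reward_history[0])
    totals = []
    for b in range(n_branches):
        mean_step = sum(step[b] for step in step_reward_history) / n_steps
        totals.append(float(success) + w_step * mean_step)
    return totals


# Three parallel branches; the first two take the same action and
# land in the same state, so they share the diversity credit.
actions = ["open drawer", "open drawer", "go to desk"]
next_states = ["drawer_open", "drawer_open", "at_desk"]
per_step = step_rewards(actions, next_states)
totals = total_rewards([per_step], success=True)
```

In this toy setup the redundant branches receive a smaller step-level bonus than the branch that explored a different action, while all branches share the same success signal. How DPEPO actually computes and weights these terms should be checked against the paper and the linked repository.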

