CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning
CoDA addresses a critical bottleneck for LLM agents on complex, multi-step tasks and offers a robust solution to context overload, though it is an incremental advance in hierarchical agent design.
The paper tackles 'Context Explosion' in LLM agents, where accumulated long text outputs overwhelm the context window and cause reasoning failures. It introduces CoDA, a hierarchical RL framework that decouples high-level planning from low-level execution. CoDA achieves significant performance improvements over SOTA baselines on complex multi-hop QA benchmarks and maintains stable performance in long-context scenarios where the other baselines degrade severely.
Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.
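The context isolation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PlannerContext`, `run_executor`, `run_agent`, and the toy planner, tool, and summarizer are all hypothetical stand-ins for what would be calls to the shared LLM backbone and real tools. The sketch only shows the idea that raw tool output stays in a per-subtask workspace while the Planner sees short summaries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PlannerContext:
    """Concise strategic context: the question plus short subtask results."""
    question: str
    history: List[str] = field(default_factory=list)


def run_executor(subtask: str,
                 tool: Callable[[str], str],
                 summarize: Callable[[str, str], str]) -> str:
    """Handle one subtask in a fresh workspace and return only a summary."""
    workspace = [subtask]              # ephemeral: created anew per subtask
    raw_output = tool(subtask)         # may be very long
    workspace.append(raw_output)
    return summarize(subtask, raw_output)  # workspace is dropped on return


def run_agent(question: str,
              planner: Callable[[PlannerContext], Optional[str]],
              tool: Callable[[str], str],
              summarize: Callable[[str, str], str],
              max_steps: int = 4) -> PlannerContext:
    """Alternate planning and execution; raw tool text never enters ctx."""
    ctx = PlannerContext(question)
    for _ in range(max_steps):
        subtask = planner(ctx)
        if subtask is None:            # planner signals it is done
            break
        result = run_executor(subtask, tool, summarize)
        ctx.history.append(f"{subtask} -> {result}")
    return ctx


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_planner(ctx: PlannerContext) -> Optional[str]:
    steps = ["find author", "find birthplace"]
    done = len(ctx.history)
    return steps[done] if done < len(steps) else None


def toy_tool(query: str) -> str:
    # Simulates a long, noisy tool response with the answer at the end.
    return "noise " * 2000 + f"answer to {query}"


def toy_summarize(subtask: str, raw: str) -> str:
    # Keeps only the text after the last block of noise.
    return raw.rsplit("noise ", 1)[-1]


ctx = run_agent("Where was the author born?", toy_planner, toy_tool, toy_summarize)
```

In this sketch the Planner's context grows by one short line per subtask regardless of how long each tool response is, which is the property the hierarchical design relies on to avoid context overload.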