Kaiqiang Ke

AI
h-index: 19
Papers: 5
Citations: 5
Novelty: 56%
AI Score: 51

5 Papers

LG · May 27
Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

Kaiqiang Ke, Shenghong He, Chengdong Xu et al.

Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state-goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introducing intermediate subgoals, but fixed temporal abstractions or fixed hierarchy depths can be mismatched to state-goal pairs with different reachability horizons. We propose Coarse-to-Fine Hierarchical Goal Reinforcement Learning (CFHRL), a fully offline GCRL framework that adaptively refines distant goals before execution. Starting from the final goal, CFHRL recursively proposes intermediate targets, trained from replay-supported candidates, and stops refinement once a learned reachability cost estimates the current target to be locally executable. The key idea is that a subgoal need not be an exact midpoint or globally optimal waypoint; it only needs to provide reliable progress and reduce the remaining reaching difficulty, enabling subsequent refinement over shorter horizons. A stylized analysis further supports the robustness of approximate recursive contraction. Experiments on OGBench show substantial gains on several long-horizon tasks, with ablations validating the proposed refinement and stopping mechanisms.
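The recursive refine-then-stop loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `refine_goal`, the toy proposer `toy_propose`, the toy cost `toy_cost`, and the integer state space are all hypothetical stand-ins for the learned components, chosen only to show how an inexact (non-midpoint) proposer still shrinks the remaining distance until the stopping test passes.

```python
from typing import Callable, List


def refine_goal(
    state: int,
    goal: int,
    propose: Callable[[int, int], int],
    cost: Callable[[int, int], float],
    threshold: float,
    max_depth: int = 16,
) -> List[int]:
    """Return the targets visited, from the final goal down to the first
    target whose estimated reaching cost is at most `threshold`."""
    chain = [goal]
    target = goal
    for _ in range(max_depth):
        # Stop refining once the current target looks locally executable.
        if cost(state, target) <= threshold:
            break
        # Otherwise ask the proposer for a closer intermediate target.
        target = propose(state, target)
        chain.append(target)
    return chain


# Hypothetical stand-ins for the learned models (assumptions, not from the paper).
def toy_cost(s: int, g: int) -> float:
    return abs(g - s)  # reachability cost = distance on a 1-D line


def toy_propose(s: int, g: int) -> int:
    return s + (3 * (g - s)) // 5  # deliberately not an exact midpoint


chain = refine_goal(0, 100, toy_propose, toy_cost, threshold=10)
print(chain)  # [100, 60, 36, 21, 12, 7]
```

Each proposal keeps only about three fifths of the remaining distance, so the chain contracts even though no target is an exact midpoint; the last entry is the target a low-level policy would be asked to reach first.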

AI · May 9
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

Chengdong Xu, Kaiqiang Ke, Ziheng Liu et al.

Large language model (LLM)-based multi-agent systems have shown strong potential on complex tasks through agent specialization, tool use, and collaborative reasoning. However, most automated multi-agent system design methods still follow a one-shot paradigm: a workflow is optimized or selected before execution and then reused unchanged throughout the task. This static coordination strategy is ill-suited for long-horizon tasks whose subgoals, intermediate evidence, and information needs evolve over multiple execution stages. We propose EvoMAS, a framework for execution-time multi-agent workflow construction. EvoMAS formulates workflow construction as a meta-level sequential decision problem along a single task trajectory. At each stage, it constructs an explicit task state through a Planner-Evaluator-Updater pipeline and uses a learned Workflow Adapter to instantiate a stage-specific layered workflow from a fixed pool of candidate agents. The adapter is trained with policy gradients using sparse, verifiable terminal task success as the main supervision signal, while evaluator-based process reward is analyzed separately under very-hard sparse-reward settings. Experiments on GAIA, HLE, and DeepResearcher show that EvoMAS outperforms single-agent baselines and recent automated multi-agent workflow design methods. Our analyses further show that explicit task-state construction and learned workflow adaptation provide complementary benefits. Additional results indicate that process reward is most useful when terminal success is extremely sparse, and qualitative case studies illustrate that EvoMAS adapts agent coordination as the task state evolves.

AI · Dec 16, 2025
Context-Picker: Dynamic context selection using multi-stage reinforcement learning

Siyuan Zhu, Chengdong Xu, Kaiqiang Ke et al.

In long-context question answering, selecting the appropriate scope of context for a query remains a key and unresolved challenge. Insufficient context can lead to missing essential information, whereas excessive context often introduces noise and degrades answer quality. Conventional methods, such as retrieving a fixed number of passages or applying reranking, struggle to dynamically determine which context to include. This is especially problematic for factoid questions, which typically depend only on a few precise pieces of evidence. To overcome this limitation, we propose Context-Picker, a reasoning-aware framework that reframes context selection as the task of identifying a minimal sufficient evidence subset, moving beyond conventional similarity-based ranking. Context-Picker uses a human-inspired two-stage reinforcement learning schedule: stage 1 focuses on improving the recall rate of critical passages, and stage 2 prioritizes pruning redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines "minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense and task-aligned supervision. Experiments on five long-context and multi-hop QA datasets demonstrate that our method outperforms strong RAG baselines and achieves higher answer accuracy. Ablation studies also indicate that the coarse-to-fine optimization schedule, the redundancy-aware reward shaping, and the rationale generated by the policy all contribute substantially to these gains.
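One way to picture the Leave-One-Out mining step is a greedy pass that drops each passage in turn and keeps the drop whenever the remaining set is still sufficient. The sketch below is an assumption about the general shape of such a procedure, not the paper's pipeline: `loo_minimal_set` and the keyword-based `toy_is_sufficient` oracle are hypothetical, and a real system would presumably replace the oracle with a check that a reader model still answers correctly.

```python
from typing import Callable, List, Sequence


def loo_minimal_set(
    passages: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Greedy leave-one-out pruning: remove any passage whose removal
    leaves the evidence set sufficient. Raises if the full set is not."""
    kept = list(passages)
    if not is_sufficient(kept):
        raise ValueError("the full passage set is not sufficient")
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if is_sufficient(trial):
            kept = trial  # passage i was redundant; re-test the same index
        else:
            i += 1  # passage i is needed; move on
    return kept


# Hypothetical oracle (assumption): the question needs one passage that
# mentions "Eiffel" and one that mentions "1889".
def toy_is_sufficient(subset: List[str]) -> bool:
    return any("Eiffel" in p for p in subset) and any("1889" in p for p in subset)


passages = [
    "The Eiffel Tower is in Paris.",
    "Bananas are yellow.",
    "The tower opened in 1889.",
    "Paris is in France.",
]
minimal = loo_minimal_set(passages, toy_is_sufficient)
print(minimal)  # ['The Eiffel Tower is in Paris.', 'The tower opened in 1889.']
```

With a monotone oracle (adding passages never breaks sufficiency), the result is minimal in the sense that removing any single remaining passage makes it insufficient; it is not guaranteed to be the smallest possible set.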

AI · Sep 16, 2025
H$^2$R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents

Shicheng Ye, Chao Yu, Kaiqiang Ke et al.

Large language model (LLM)-based agents have shown strong potential in multi-task scenarios, owing to their ability to transfer knowledge across diverse tasks. However, existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse-grained knowledge transfer. In this work, we propose a novel hierarchical memory architecture that enables fine-grained knowledge transfer by decoupling high-level planning memory from low-level execution memory. To construct and refine these hierarchical memories, we introduce Hierarchical Hindsight Reflection (H$^2$R), a mechanism that distills reusable and hierarchical knowledge from past agent-environment interactions. At test time, H$^2$R retrieves high-level and low-level memories separately, allowing LLM-based agents to efficiently access and utilize task-relevant knowledge for new tasks. Experimental results across two benchmarks demonstrate that H$^2$R can improve generalization and decision-making performance, outperforming prior baselines such as ExpeL.

LG · Aug 8, 2025
GCHR: Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning

Xing Lei, Wenyan Yang, Kaiqiang Ke et al.

Goal-conditioned reinforcement learning (GCRL) with sparse rewards remains a fundamental challenge in reinforcement learning. While hindsight experience replay (HER) has shown promise by relabeling collected trajectories with achieved goals, we argue that trajectory relabeling alone does not fully exploit the available experiences in off-policy GCRL methods, resulting in limited sample efficiency. In this paper, we propose Hindsight Goal-conditioned Regularization (HGR), a technique that generates action regularization priors based on hindsight goals. When combined with hindsight self-imitation regularization (HSR), our approach enables off-policy RL algorithms to maximize experience utilization. Compared to existing GCRL methods that employ HER and self-imitation techniques, our hindsight regularizations achieve substantially more efficient sample reuse and the best performance, which we empirically demonstrate on a suite of navigation and manipulation tasks.
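For readers unfamiliar with the HER baseline this abstract builds on, the relabeling step can be sketched as follows. This shows only standard hindsight relabeling with the "final" strategy, not the HGR or HSR regularizers proposed in the paper; the `Transition` layout, the integer states, and the 0 / -1 sparse reward convention are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Transition(NamedTuple):
    state: int
    action: int
    next_state: int
    goal: int
    reward: float


def relabel_final(trajectory: List[Transition]) -> List[Transition]:
    """Relabel every transition with the state actually reached at the end
    of the trajectory, recomputing the sparse reward for the new goal."""
    achieved = trajectory[-1].next_state
    return [
        t._replace(
            goal=achieved,
            # Assumed sparse reward: 0 when the goal is reached, -1 otherwise.
            reward=0.0 if t.next_state == achieved else -1.0,
        )
        for t in trajectory
    ]


# A failed attempt to reach goal 10 that only got as far as state 3.
failed = [
    Transition(0, 1, 1, 10, -1.0),
    Transition(1, 1, 2, 10, -1.0),
    Transition(2, 1, 3, 10, -1.0),
]
relabeled = relabel_final(failed)
print([t.goal for t in relabeled])    # [3, 3, 3]
print([t.reward for t in relabeled])  # [-1.0, -1.0, 0.0]
```

The relabeled copy turns a trajectory with no reward signal into one that ends in success for the achieved goal, which is the extra experience HER feeds to an off-policy learner.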