AI CLOct 29, 2025

KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo

arXiv:2510.25101v21 citationsh-index: 17

Originality Highly original

AI Analysis

This work addresses the challenge of improving agentic reasoning capabilities in KBQA for users needing accurate question-answering over structured knowledge bases, representing an incremental advancement over prior methods.

The paper tackles the problem of weak incentives for exploration in agentic reasoning for Knowledge Base Question Answering (KBQA) by proposing KnowCoder-A1, which trains an LLM with outcome-only supervision via multi-stage curriculum reinforcement learning, resulting in up to an 11.1% relative improvement on the zero-shot subset of GrailQA while using only one-twelfth of the training data.

Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.

View on arXiv PDF

Similar