AI · Jan 16

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

arXiv:2601.11037v1 · 2 citations · h-index: 13
Originality: Highly original
AI Analysis

This paper addresses reliability issues in agentic search for LLMs, which matter for real-world applications, though it is an incremental improvement focused on boundary awareness.

The paper tackles unreliable answers in RL-based agentic search for LLMs, where agents rarely admit 'I DON'T KNOW' when evidence is insufficient. It proposes BAPO, an RL framework intended to improve reliability without compromising accuracy, and reports substantial improvements in experiments on four benchmarks.

RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit "I DON'T KNOW" (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
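
The two components described in the abstract can be illustrated with a rough sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the `Rollout` type, the reward values, the rule that "reasoning reaches its limit" means no rollout in the sampled group is correct, and the fixed warm-up step count standing in for the adaptive modulator. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled answer for a question (hypothetical structure)."""
    is_correct: bool  # the final answer matches the reference
    is_idk: bool      # the agent answered "I DON'T KNOW"


def boundary_aware_rewards(
    group: List[Rollout],
    idk_bonus: float = 0.5,    # assumed value, not taken from the paper
    idk_enabled: bool = True,  # set by the modulator below
) -> List[float]:
    """Assign rewards to a group of rollouts sampled for the same question.

    Assumption: the question counts as beyond the reasoning boundary only
    when no rollout in the group is correct. IDK earns a partial reward in
    that case alone, so it cannot replace an answer the policy could reach.
    """
    any_correct = any(r.is_correct for r in group)
    rewards = []
    for r in group:
        if r.is_correct:
            rewards.append(1.0)
        elif r.is_idk and idk_enabled and not any_correct:
            rewards.append(idk_bonus)
        else:
            rewards.append(0.0)
    return rewards


def idk_reward_enabled(step: int, warmup_steps: int = 100) -> bool:
    """Simplified stand-in for the modulator: suspend the IDK reward
    during an assumed fixed warm-up period of early exploration."""
    return step >= warmup_steps
```

Under these assumptions, a group in which every rollout fails rewards only the IDK rollouts, while a group containing at least one correct rollout gives IDK nothing. Calling `boundary_aware_rewards(group, idk_enabled=idk_reward_enabled(step))` during early steps removes the IDK reward entirely, which is the shortcut-prevention idea the abstract describes.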

