AI · Feb 16

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

arXiv:2602.14643v2
AI Analysis

This paper addresses reliable navigation of critical conversation flows in domains such as healthcare. It proposes an architectural decomposition that yields large gains in accuracy, latency, and cost over single-prompt baselines.

The paper tackles the difficulty large language models have in adhering to structured workflows in high-stakes domains such as healthcare triage. It presents Arbor, a framework that decomposes decision tree navigation into specialized tasks. Compared with single-prompt baselines, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and reduces per-turn cost by a factor of 14.4.
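The three headline figures use different kinds of comparison: an absolute difference in percentage points for accuracy, a relative percentage for latency, and a multiplicative factor for cost. The Python sketch below computes each kind from hypothetical before/after values. The numbers are illustrative only and are not the paper's measurements, and the function names are our own.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Absolute change between two rates given as fractions, in percentage points."""
    return (after - before) * 100


def relative_change_percent(before: float, after: float) -> float:
    """Relative change in percent; negative values are reductions."""
    return (after - before) / before * 100


def fold_reduction(before: float, after: float) -> float:
    """How many times smaller `after` is than `before` (5.0 means 5x cheaper)."""
    return before / after


# Hypothetical values for a single model, chosen only to illustrate the arithmetic.
acc_before, acc_after = 0.50, 0.75   # mean turn accuracy as a fraction
lat_before, lat_after = 2.0, 0.5     # per-turn latency in seconds
cost_before, cost_after = 5.0, 1.0   # per-turn cost in arbitrary units

print(percentage_point_change(acc_before, acc_after))  # 25.0 (percentage points)
print(relative_change_percent(acc_before, acc_after))  # 50.0 (same gain as a relative percent)
print(relative_change_percent(lat_before, lat_after))  # -75.0 (a 75% latency reduction)
print(fold_reduction(cost_before, cost_after))         # 5.0 (a 5x cost reduction)
```

The same accuracy gain reads as 25 points or as 50 percent depending on the convention. For this reason, a 29.4 percentage point improvement should not be read as a 29.4% relative improvement.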

Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. We evaluate Arbor against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations. Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
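The per-turn loop described above can be sketched in Python under a few loud assumptions. The sketch uses a toy triage tree in edge-list form, and the `Edge` fields, the tree itself, and the helpers `outgoing_edges`, `navigate_turn`, `keyword_evaluator`, and `template_generator` are all hypothetical. The two stub functions stand in for the dedicated transition-evaluation LLM call and the separate response-generation call. The DAG-based orchestration is reduced to a plain sequential function, so this is not Arbor's implementation, only an illustration of node-level retrieval and the split between the two steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Edge:
    source: str     # node the conversation is currently at
    target: str     # node to move to if the condition holds
    condition: str  # natural-language criterion for taking this edge


# Hypothetical triage fragment in edge-list form (illustrative, not from the paper).
EDGES = [
    Edge("start", "chest_pain", "patient reports chest pain"),
    Edge("start", "fever", "patient reports a fever"),
    Edge("chest_pain", "emergency", "pain is severe or spreads to the arm"),
    Edge("chest_pain", "gp_referral", "pain is mild"),
]

# (candidate edges, user message) -> chosen target node, or None if no edge applies
Evaluator = Callable[[list[Edge], str], Optional[str]]
# (current node, user message) -> reply text
Generator = Callable[[str, str], str]


def outgoing_edges(edges: list[Edge], node: str) -> list[Edge]:
    """Return only the edges leaving `node`; the rest of the tree is never sent."""
    return [e for e in edges if e.source == node]


def navigate_turn(edges: list[Edge], node: str, message: str,
                  evaluate: Evaluator, generate: Generator) -> tuple[str, str]:
    candidates = outgoing_edges(edges, node)
    # Step 1: a dedicated call decides which transition (if any) is valid.
    target = evaluate(candidates, message) if candidates else None
    # Accept only real outgoing edges; otherwise stay at the current node.
    if target not in {e.target for e in candidates}:
        target = node
    # Step 2: a separate call writes the reply for the node we ended up at.
    return target, generate(target, message)


# Hypothetical keywords used by the stub evaluator below.
KEYWORDS = {
    "chest_pain": ["chest"],
    "fever": ["fever", "temperature"],
    "emergency": ["severe", "arm"],
    "gp_referral": ["mild"],
}


def keyword_evaluator(candidates: list[Edge], message: str) -> Optional[str]:
    """Stand-in for the LLM transition call: first edge whose keywords appear."""
    text = message.lower()
    for edge in candidates:
        if any(k in text for k in KEYWORDS.get(edge.target, [])):
            return edge.target
    return None


def template_generator(node: str, message: str) -> str:
    """Stand-in for the LLM response-generation call."""
    return f"[reply for node '{node}']"


node = "start"
for msg in ["I've had chest pain since this morning",
            "It's severe and spreading to my left arm"]:
    node, reply = navigate_turn(EDGES, node, msg, keyword_evaluator, template_generator)
    print(node, reply)
# chest_pain [reply for node 'chest_pain']
# emergency [reply for node 'emergency']
```

Because each turn sees only the outgoing edges of the current node, the transition prompt is bounded by that node's fan-out rather than by the size of the whole tree. This is how node-level decomposition sidesteps the prompt-length problems attributed to monolithic prompts. Rejecting targets that are not real outgoing edges is a design choice of this sketch that keeps a misbehaving evaluator from jumping elsewhere in the tree.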
