CL
Jun 21, 2024

FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents

arXiv:2406.14884v1
44 citations
Originality: Incremental advance
AI Analysis

This work addresses the reliability of LLM-based agents in expertise-intensive tasks by providing a standardized benchmark. It is an incremental advance, since it builds on existing approaches that incorporate workflow knowledge.

The authors tackle planning hallucinations in LLM-based agents by formalizing workflow knowledge formats and introducing FlowBench, a benchmark covering 51 scenarios across 6 domains. Their evaluation shows that current agents need considerable improvement to plan satisfactorily.

LLM-based agents have emerged as promising tools, which are crafted to fulfill complex tasks by iterative planning and action. However, these agents are susceptible to undesired planning hallucinations when lacking specific knowledge for expertise-intensive tasks. To address this, preliminary attempts are made to enhance planning reliability by incorporating external workflow-related knowledge. Despite the promise, such infused knowledge is mostly disorganized and diverse in formats, lacking rigorous formalization and comprehensive comparisons. Motivated by this, we formalize different formats of workflow knowledge and present FlowBench, the first benchmark for workflow-guided planning. FlowBench covers 51 different scenarios from 6 domains, with knowledge presented in diverse formats. To assess different LLMs on FlowBench, we design a multi-tiered evaluation framework. We evaluate the efficacy of workflow knowledge across multiple formats, and the results indicate that current LLM agents need considerable improvements for satisfactory planning. We hope that our challenging benchmark can pave the way for future agent planning research.

