AI · May 30, 2025

FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation

Vishal Pallagani, Nitin Gupta, John Aydin, Biplav Srivastava

arXiv:2505.24258v1 · 3.3 · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

FABLE is presented as the first diagnostic benchmark for systematically evaluating data-flow reasoning in LLMs, which makes it relevant to researchers and developers working on procedural understanding in AI.

The authors address the lack of systematic evaluation of LLMs' data-flow reasoning over procedural text by introducing FABLE, a benchmark of 2,400 question-answer pairs spanning eight data-flow analyses and three domains. In their evaluation, a reasoning-focused model achieved higher accuracy but ran over 20 times slower than the other two models, which performed near random chance.

Understanding how data moves, transforms, and persists, known as data flow, is fundamental to reasoning in procedural tasks. Large language models (LLMs) are fluent in natural and programming languages and are increasingly applied to decisions involving procedural tasks, yet they have not been systematically evaluated for their ability to perform data-flow reasoning. We introduce FABLE, an extensible benchmark designed to assess LLMs' understanding of data flow using structured, procedural text. FABLE adapts eight classical data-flow analyses from software engineering: reaching definitions, very busy expressions, available expressions, live variable analysis, interval analysis, type-state analysis, taint analysis, and concurrency analysis. These analyses are instantiated across three real-world domains: cooking recipes, travel routes, and automated plans. The benchmark includes 2,400 question-answer pairs, with 100 examples for each domain-analysis combination. We evaluate three types of LLMs: a reasoning-focused model (DeepSeek-R1 8B), a general-purpose model (LLaMA 3.1 8B), and a code-specific model (Granite Code 8B). Each model is tested using majority voting over five sampled completions per prompt. Results show that the reasoning model achieves higher accuracy, but its inference is over 20 times slower than that of the other models. In contrast, the general-purpose and code-specific models perform close to random chance. FABLE provides the first diagnostic benchmark to systematically evaluate data-flow reasoning and offers insights for developing models with stronger procedural understanding.
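The abstract states that each model is scored by majority voting over five sampled completions per prompt. A minimal sketch of that aggregation step follows. It is an illustration only, not the authors' code: the function name majority_vote, the lowercase/whitespace normalization, and the tie-breaking rule (earliest answer wins) are all assumptions, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(completions):
    """Return the most frequent answer among sampled completions.

    Hypothetical helper: normalization and tie-breaking are assumptions,
    not details taken from the FABLE paper.
    """
    # Normalize so that "Yes", "yes " and "YES" count as the same answer.
    normalized = [c.strip().lower() for c in completions if c.strip()]
    if not normalized:
        return None  # no usable completion was sampled
    # With equal counts, the answer that appeared first is returned.
    return Counter(normalized).most_common(1)[0][0]


# Five sampled completions for one hypothetical yes/no question.
samples = ["Yes", "no", "yes ", "Yes", "No"]
print(majority_vote(samples))
```

With five samples and a binary answer a strict majority always exists; the tie-breaking rule only matters for questions with more than two possible answers.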

