CL, LG | Aug 12, 2025

Complex Logical Instruction Generation

Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song

Microsoft

arXiv:2508.09125v14.91 citations | h-index: 8 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation of LLMs' instruction-following abilities on complex logic tasks, which is crucial for advancing reasoning and agentic capabilities, though it is incremental as it focuses on benchmarking rather than solving the underlying problem.

The authors tackled the problem of evaluating how well large language models (LLMs) follow complex logic-rich instructions by proposing LogicIFGen, a framework for generating verifiable instructions from code functions, and LogicIFEval, a benchmark of 426 such instructions. Their experiments showed that current state-of-the-art LLMs struggle, with most following fewer than 60% of the instructions, revealing significant deficiencies.

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
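To make the idea of a "verifiable instruction from a code function" concrete, here is a minimal sketch. It is not the LogicIFGen implementation: the function `running_clamp_sum`, the instruction text, and the helpers `is_followed` and `follow_rate` are hypothetical names chosen for illustration. The only assumption is that an instruction counts as followed when the model's reported result equals the result of actually running the source function on the same input.

```python
from typing import Any, Callable, List, Sequence, Tuple


def running_clamp_sum(values: List[int], limit: int) -> int:
    """Hypothetical source function with a loop and two conditionals."""
    total = 0
    for v in values:
        if v < 0:
            continue  # negative numbers are skipped
        total += v
        if total > limit:
            total = limit  # clamp whenever the limit is exceeded
    return total


# Hypothetical natural-language rendering of the function above.
# A model would receive this text plus an input, never the code itself.
INSTRUCTION = (
    "Start with a total of 0. Go through the numbers in order. "
    "Skip any negative number. Otherwise add it to the total; "
    "if the total is now above the limit, set the total to the limit. "
    "Report the final total."
)


def is_followed(func: Callable[..., Any], args: Tuple[Any, ...], model_answer: Any) -> bool:
    """An instruction is treated as followed if the model's answer
    matches the output of executing the source function."""
    return func(*args) == model_answer


def follow_rate(func: Callable[..., Any], cases: Sequence[Tuple[Tuple[Any, ...], Any]]) -> float:
    """Fraction of (input, model_answer) pairs that match execution."""
    correct = sum(is_followed(func, args, answer) for args, answer in cases)
    return correct / len(cases)


if __name__ == "__main__":
    # Made-up model answers, for illustration only.
    cases = [
        (([3, -1, 4, 5], 10), 10),  # matches execution
        (([1, 2], 10), 4),          # does not match (execution gives 3)
    ]
    print(follow_rate(running_clamp_sum, cases))
```

Because the ground truth comes from executing code, checking an answer needs no human judgment, which is what makes instructions of this kind automatically verifiable.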
