CL · AI · Mar 9

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng

arXiv:2603.07886v17.61 citations · h-index: 1

AI Analysis

This work addresses the inadequate evaluation of LLMs' ability to follow complex instructions in real-world industrial applications, and reveals the limitations of current models.

This paper introduces CCR-Bench, a new benchmark to evaluate Large Language Models (LLMs) on complex instructions involving entangled content and formatting, intricate control flows, and real-world industrial scenarios. Experiments show that state-of-the-art LLMs have substantial performance deficiencies on this benchmark, highlighting a significant gap in their ability to follow complex instructions.

Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
