CL · Jun 3

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

arXiv:2606.03318 · 72.6 · Has Code
AI Analysis

For researchers and developers of LLM-based agents, this benchmark reveals a critical gap between idealized evaluations and real-world performance, highlighting the need for robustness against ambiguous and uncooperative user inputs.

The paper introduces RUT-Bench, a benchmark for evaluating LLMs in realistic tool-use scenarios with non-ideal user behaviors. Evaluations on 19 LLMs show that no model exceeds a 40% overall success rate, and that nearly all models suffer noticeable performance drops on more complicated non-ideal user inputs.

Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated, idealized user assumptions and lack experience-oriented evaluation. As a result, they fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool-calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations of 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/Miaow-Lab/RUT-Bench.

