WildIFEval: Instruction Following in the Wild
This paper addresses the problem of evaluating and improving instruction-following in LLMs for real-world applications. The contribution is incremental: it builds on prior datasets, with a focus on multi-constraint conditions.
The paper tackles the challenge of LLMs handling instructions with multiple constraints by introducing WildIFEval, a dataset of 7K real user instructions with diverse constraints. Benchmarks show that models have substantial room for improvement and that performance varies with the number and type of constraints.
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
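
The analysis by number and type of constraints could be approximated along the following lines. This is only a minimal sketch, not the paper's evaluation code: the record layout, the constraint type names ("length", "format", "style"), and the fraction-of-constraints-satisfied score are all assumptions made for illustration, and the per-constraint pass/fail judgments are taken as given.

```python
from collections import defaultdict

# Hypothetical data: one record per instruction, each holding
# (constraint_type, satisfied) judgments. The type names are placeholders,
# not the eight classes defined in the paper.
records = [
    {"id": "a", "constraints": [("length", True), ("format", False)]},
    {"id": "b", "constraints": [("style", True)]},
    {"id": "c", "constraints": [("length", False), ("format", True), ("style", True)]},
]


def fraction_satisfied(constraints):
    """Share of an instruction's constraints judged as satisfied (assumed metric)."""
    return sum(ok for _, ok in constraints) / len(constraints)


def score_by_constraint_count(records):
    """Mean per-instruction score, grouped by how many constraints it has."""
    buckets = defaultdict(list)
    for record in records:
        constraints = record["constraints"]
        if not constraints:  # skip instructions without constraints
            continue
        buckets[len(constraints)].append(fraction_satisfied(constraints))
    return {n: sum(scores) / len(scores) for n, scores in sorted(buckets.items())}


def score_by_constraint_type(records):
    """Share of satisfied constraints, grouped by constraint type."""
    hits = defaultdict(list)
    for record in records:
        for constraint_type, ok in record["constraints"]:
            hits[constraint_type].append(ok)
    return {t: sum(v) / len(v) for t, v in sorted(hits.items())}


if __name__ == "__main__":
    print(score_by_constraint_count(records))
    print(score_by_constraint_type(records))
```

Grouping by constraint count averages per-instruction scores, while grouping by type pools individual constraints; the two views answer different questions, so they are kept as separate functions.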