LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models
This work addresses the unreliability of current methods for evaluating LLM instruction following, a problem that matters to researchers and developers seeking to improve model controllability. The contribution is incremental in that it builds on existing programmatic benchmarks.
To evaluate how well large language models follow complex lexical instructions, the authors introduce LexInstructEval, a benchmark and framework that uses a formal grammar to generate diverse datasets and enable objective verification. The dataset and evaluation tools are publicly released.
The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical <Procedure, Relation, Value> triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
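As an illustration only, the triplet representation and the programmatic check described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the released implementation: the procedure names (`word_count`, `sentence_count`), the relation symbols, the `Constraint` class, and the `verify` helper are all hypothetical and are not taken from the LexInstructEval tools.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Hypothetical procedures: each maps a model response to a measurable integer.
PROCEDURES: Dict[str, Callable[[str], int]] = {
    "word_count": lambda text: len(re.findall(r"\b\w+\b", text)),
    "sentence_count": lambda text: len(re.findall(r"[^.!?]+[.!?]", text)),
}

# Hypothetical relations: each compares the measured value with the target value.
RELATIONS: Dict[str, Callable[[int, int], bool]] = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Constraint:
    """One <Procedure, Relation, Value> triplet (assumed representation)."""

    procedure: str
    relation: str
    value: int

    def check(self, response: str) -> bool:
        # Apply the procedure to the response, then test the relation.
        measured = PROCEDURES[self.procedure](response)
        return RELATIONS[self.relation](measured, self.value)


def verify(response: str, constraints: Iterable[Constraint]) -> bool:
    """A compositional instruction passes only if every triplet passes."""
    return all(c.check(response) for c in constraints)


response = "Cats sleep a lot. Dogs bark loudly."
instruction = [
    Constraint("sentence_count", "==", 2),
    Constraint("word_count", "<=", 10),
]
print(verify(response, instruction))
```

Because each check is a deterministic function of the response text, the verdict needs no human rater or judge model, which is the property the abstract attributes to the programmatic engine.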