CL, AI, LG | Nov 14, 2023

Instruction-Following Evaluation for Large Language Models

DeepMind
arXiv:2311.07911v1 | 917 citations | h-index: 29 | Has Code
Originality: Synthesis-oriented
AI Analysis

IFEval gives LLM developers and researchers a reproducible evaluation method, though the contribution is incremental because it builds on existing evaluation challenges.

The paper tackles the lack of standardized evaluation for large language models' instruction-following abilities by introducing IFEval, a benchmark using verifiable instructions, and reports results for two widely available LLMs.

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
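To make the idea of a "verifiable instruction" concrete, here is a minimal sketch of how the two examples quoted in the abstract could be checked programmatically. It is not the IFEval implementation from the linked repository; the function names, the regex-based word splitting, and the case-insensitive keyword match are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def word_count_exceeds(response: str, min_words: int) -> bool:
    """Hypothetical check for "write in more than N words"."""
    # Assumption: a "word" is any run of word characters.
    words = re.findall(r"\b\w+\b", response)
    return len(words) > min_words


def keyword_at_least(response: str, keyword: str, min_count: int) -> bool:
    """Hypothetical check for "mention the keyword K at least N times"."""
    # Assumption: whole-word, case-insensitive matching.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= min_count


def follows_all(response: str, checks: List[Callable[[str], bool]]) -> bool:
    """A prompt may carry several instructions; all must be satisfied."""
    return all(check(response) for check in checks)


response = "AI is everywhere. AI helps. I like AI."
checks = [
    lambda r: keyword_at_least(r, "AI", 3),
    lambda r: word_count_exceeds(r, 400),
]
print(follows_all(response, checks))  # the keyword check passes, the length check fails
```

Because each check is a deterministic function of the response text, the same response always gets the same verdict, which is the property that distinguishes this style of evaluation from human or LLM-based judging.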

Code Implementations (4 repos)
