CL · AI · Dec 17, 2024

LMUnit: Fine-grained Evaluation with Natural Language Unit Tests

arXiv:2412.13091v1 · 24 citations · h-index: 28
Originality: Incremental advance
AI Analysis

This paper addresses the cost and noise of human evaluation for language models in critical workflows, offering an incremental improvement over existing automated metrics.

The paper introduces natural language unit tests, which decompose response quality into testable criteria, and LMUnit, a scoring model that achieves state-of-the-art performance on the FLASK and BigGenBench benchmarks.

As language models become integral to critical workflows, assessing their behavior remains a fundamental challenge -- human evaluation is costly and noisy, while automated metrics provide only coarse, difficult-to-interpret signals. We introduce natural language unit tests, a paradigm that decomposes response quality into explicit, testable criteria, along with a unified scoring model, LMUnit, which combines multi-objective training across preferences, direct ratings, and natural language rationales. Through controlled human studies, we show this paradigm significantly improves inter-annotator agreement and enables more effective LLM development workflows. LMUnit achieves state-of-the-art performance on evaluation benchmarks (FLASK, BigGenBench) and competitive results on RewardBench. These results validate both our proposed paradigm and scoring model, suggesting a promising path forward for language model evaluation and development.
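To make the idea of decomposing response quality into explicit, testable criteria concrete, here is a minimal sketch. It is not LMUnit's actual interface: the names `UnitTest`, `evaluate`, and `toy_scorer`, the per-test weights, and the 1 to 5 score range are all assumptions for illustration. The scorer is a trivial placeholder standing in for a learned scoring model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class UnitTest:
    """One natural language criterion (hypothetical structure)."""
    criterion: str
    weight: float = 1.0  # assumed: relative importance of this criterion

# Assumed scorer signature: (query, response, criterion) -> score in [1, 5].
Scorer = Callable[[str, str, str], float]

def evaluate(query: str, response: str,
             tests: List[UnitTest], scorer: Scorer) -> Dict[str, object]:
    """Score a response against each unit test, then take a weighted mean."""
    per_test = {t.criterion: scorer(query, response, t.criterion) for t in tests}
    total_weight = sum(t.weight for t in tests)
    overall = sum(t.weight * per_test[t.criterion] for t in tests) / total_weight
    return {"per_test": per_test, "overall": overall}

def toy_scorer(query: str, response: str, criterion: str) -> float:
    """Placeholder scorer; a real system would call a trained model here."""
    if "concise" in criterion.lower():
        # Crude proxy: short responses count as concise.
        return 5.0 if len(response.split()) <= 20 else 1.0
    return 3.0  # neutral score for criteria this toy cannot judge

tests = [
    UnitTest("Is the response concise?"),
    UnitTest("Does the response answer the question?", weight=2.0),
]
result = evaluate("What is 2 + 2?", "4", tests, toy_scorer)
print(result["per_test"])
print(result["overall"])
```

Keeping the per-test scores alongside the aggregate is the point of the decomposition: a low overall score can be traced to the specific criterion that failed, rather than being a single opaque number.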

