Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering
This work addresses the need for cost-effective, human-aligned metrics for evaluating LLMs in question answering, offering an incremental improvement: off-the-shelf NLI scoring augmented with a simple lexical-match flag.
The paper tackles the problem of evaluating large language models (LLMs) in question answering by revisiting Natural Language Inference (NLI) scoring. It finds that NLI scoring augmented by a simple lexical-match flag matches GPT-4o's accuracy (89.9%) on long-form QA while requiring orders of magnitude fewer parameters. It also introduces DIVER-QA, a 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs, to test how well metrics align with human judgment, and shows that NLI-based evaluation remains competitive.
Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative, off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag, and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA while requiring orders of magnitude fewer parameters. To rigorously test how well these metrics align with human judgment, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive, and we offer DIVER-QA as an open resource for future metric research.
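As a rough illustration of what "NLI scoring augmented by a simple lexical-match flag" could look like, the following is a minimal sketch. The abstract does not specify how the two signals are combined, so everything here is an assumption: the function names (normalize, lexical_match, judge_answer), the OR-combination of the two signals, the 0.5 entailment threshold, the choice of candidate as premise and reference as hypothesis, and the nli_fn interface, which stands in for any off-the-shelf NLI classifier that returns label probabilities.

```python
import re
import string
from typing import Callable, Dict, Iterable

# Hypothetical interface: (premise, hypothesis) -> probabilities per NLI label.
NliFn = Callable[[str, str], Dict[str, float]]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def lexical_match(candidate: str, references: Iterable[str]) -> bool:
    """Flag: True if any normalized reference appears in the candidate
    as a whole-token span (padding with spaces avoids partial-word hits)."""
    cand = f" {normalize(candidate)} "
    for ref in references:
        ref_norm = normalize(ref)
        if ref_norm and f" {ref_norm} " in cand:
            return True
    return False


def judge_answer(
    question: str,
    candidate: str,
    references: Iterable[str],
    nli_fn: NliFn,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> bool:
    """Accept the candidate if the lexical flag fires OR the NLI model
    says the candidate entails a reference answer (assumed combination)."""
    refs = list(references)
    if lexical_match(candidate, refs):
        return True
    premise = f"{question} {candidate}"
    for ref in refs:
        hypothesis = f"{question} {ref}"
        scores = nli_fn(premise, hypothesis)
        if scores.get("entailment", 0.0) >= threshold:
            return True
    return False
```

In practice, nli_fn would wrap a pretrained NLI classifier; the lexical flag acts as a cheap shortcut for short answers that appear verbatim in a long-form response, while the NLI score covers paraphrases.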