CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering
This work addresses the problem of unreliable automated evaluation for QA researchers and practitioners by offering a more efficient and better-aligned method, though the contribution is incremental because it builds on existing classifier-based approaches.
The paper tackles the misalignment between automated answer equivalence evaluation and human judgments in open-domain question answering, particularly for verbose answers from large language models. It introduces CFMatch, a lightweight classifier-based matching method under 1 MB that judges answer correctness according to expert rules.
Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for more verbose, free-form answers from large language models (LLMs). There are two challenges: a lack of data and the size of the models. LLM-based scorers can correlate better with human judges, but this task has only been tested on limited QA datasets, and even when such scorers are available, updating the model is difficult because LLMs are large and often expensive. We rectify both of these issues by providing clear and consistent guidelines for evaluating AE in machine QA, adopted from professional human QA contests. We also introduce a combination of standard evaluation and a more efficient, robust, and lightweight discriminative AE classifier-based matching method (CFMatch, smaller than 1 MB), trained and validated to evaluate answer correctness more accurately in accordance with the adopted expert AE rules, which are more aligned with human judgments.
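To make the idea of combining standard evaluation with a classifier-based matcher concrete, the following is a minimal sketch, not the CFMatch implementation. It assumes that "standard evaluation" means normalized exact match and token-level F1, and that the classifier is an arbitrary callable supplied by the user. The function names, the order of the checks, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re
import string
from collections import Counter
from typing import Callable, List, Optional


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(reference: str, candidate: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize(reference) == normalize(candidate)


def token_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    ref, cand = normalize(reference), normalize(candidate)
    if not ref or not cand:
        return float(ref == cand)
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_equivalent(
    reference: str,
    candidate: str,
    classifier: Optional[Callable[[str, str], bool]] = None,
    f1_threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Accept exact matches; otherwise defer to the classifier if one is given,
    and fall back to a token-F1 threshold when it is not."""
    if exact_match(reference, candidate):
        return True
    if classifier is not None:
        return classifier(reference, candidate)
    return token_f1(reference, candidate) >= f1_threshold
```

In this sketch, a verbose answer such as "The answer is Barack Obama" fails exact match against "Barack Obama" but is still accepted by the token-F1 fallback; a trained classifier passed as `classifier` would replace that fallback decision.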