Improving Model Factuality with Fine-grained Critique-based Evaluator
This work addresses factual errors in language model outputs, which matter to users who rely on accurate information; it is an incremental improvement over existing methods.
The paper tackles factual errors in language models by training a factuality evaluator, FenCE, that provides claim-level feedback, and by using that evaluator in a training framework to improve generator factuality. The data augmentation used to train the evaluator raises its accuracy by 2.9% on LLM-AggreFact, and the framework raises the FActScore factuality rate of Llama2-7B-chat and Llama3-8B-chat by 16.86% and 14.45%, respectively.
Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Toward this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, use FenCE to revise and score each response without introducing lesser-known facts, and train the generator to prefer highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve the factuality rate of Llama2-7B-chat and Llama3-8B-chat by 16.86% and 14.45% on FActScore, respectively, outperforming state-of-the-art factuality finetuning methods by 8.83% and 6.96%.
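The generate, revise, score, and prefer loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Judge` callable is a hypothetical stand-in for FenCE, revision is simplified to deleting claims the judge rejects (which at least never introduces new facts), the score is assumed to be the fraction of supported claims, and the pairing rule (best revised response versus worst original response) is one plausible choice among many.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for the evaluator: maps a claim to (is_supported, critique).
Judge = Callable[[str], Tuple[bool, str]]


@dataclass
class ScoredResponse:
    claims: List[str]  # claims kept after revision
    score: float       # fraction of the original claims judged supported


def revise_and_score(claims: List[str], judge: Judge) -> ScoredResponse:
    """Drop claims the judge rejects and score the original response.

    Assumption: revision only deletes claims, so no new facts are added.
    """
    kept = [claim for claim in claims if judge(claim)[0]]
    score = len(kept) / len(claims) if claims else 0.0
    return ScoredResponse(claims=kept, score=score)


def build_preference_pair(
    candidates: List[List[str]], judge: Judge
) -> Tuple[str, str]:
    """Return (chosen, rejected) text for preference training.

    chosen   = revised version of the highest-scoring candidate
    rejected = unrevised version of the lowest-scoring candidate
    """
    scored = [revise_and_score(claims, judge) for claims in candidates]
    best = max(range(len(candidates)), key=lambda i: scored[i].score)
    worst = min(range(len(candidates)), key=lambda i: scored[i].score)
    chosen = " ".join(scored[best].claims)
    rejected = " ".join(candidates[worst])
    return chosen, rejected
```

In practice the judge would be a model call that also consults retrieved source documents, and the resulting pairs would feed a preference-optimization objective; both are outside the scope of this sketch.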