MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
This work addresses the efficiency of grounded fact-checking for NLP tasks such as retrieval-augmented generation and summarization. It offers a significant cost reduction, but the contribution is incremental because it builds on existing fact-checking methods.
The paper tackles the high computational cost of fact-checking LLM outputs against grounding documents. It introduces MiniCheck, a set of small fact-checking models trained on synthetic data constructed with GPT-4. The best variant, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size on the new LLM-AggreFact benchmark and reaches GPT-4 accuracy at 400x lower cost.
Recognizing whether LLM output can be grounded in evidence is central to many NLP tasks: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking verify each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to a model to check a single response. In this work, we show how to build small fact-checking models that have GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
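To make the task concrete, the following is a minimal Python sketch of checking each sentence of a response against a grounding document. It is an illustration under stated assumptions, not the released MiniCheck code: the function names, the chunking by word count, the max-over-chunks aggregation, and the 0.5 threshold are all hypothetical choices, and `toy_support_score` is a lexical-overlap stand-in for a trained checker, which would replace it in practice.

```python
from typing import Callable, List


def chunk_document(document: str, max_words: int = 50) -> List[str]:
    """Split a document into consecutive chunks of at most max_words words."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _tokens(text: str) -> set:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = {t.strip(".,;:!?").lower() for t in text.split()}
    stripped.discard("")
    return stripped


def toy_support_score(chunk: str, claim: str) -> float:
    """Hypothetical stand-in for a trained checker.

    Returns the fraction of claim tokens that also occur in the chunk.
    A real system would return a model's support probability instead.
    """
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(chunk)) / len(claim_tokens)


def is_grounded(
    document: str,
    claim: str,
    scorer: Callable[[str, str], float] = toy_support_score,
    threshold: float = 0.5,  # assumed decision threshold
    max_words: int = 50,
) -> bool:
    """Label a claim as grounded if its best-scoring chunk reaches the threshold."""
    chunks = chunk_document(document, max_words)
    if not chunks:
        return False
    best = max(scorer(chunk, claim) for chunk in chunks)
    return best >= threshold


def check_response(document: str, sentences: List[str]) -> List[bool]:
    """Check every sentence of a response; one scorer call per (chunk, sentence) pair."""
    return [is_grounded(document, sentence) for sentence in sentences]
```

The number of scorer calls grows with the number of sentences times the number of chunks, which is why the cost of each individual call dominates: swapping an LLM call for a small model in the `scorer` slot changes the per-call cost without changing the loop.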