FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
FActBench provides a domain-specific benchmark for improving fact-checking of medical LLM output, though the contribution is incremental because it builds on existing fact-checking techniques.
The authors tackled the problem of evaluating factuality in LLM-generated medical text by creating FActBench, a benchmark covering four tasks and six models, and found that unanimous voting of two fact-checking techniques best correlates with expert evaluation.
Large Language Models (LLMs) tend to struggle when dealing with specialized domains. While all aspects of evaluation are important, factuality is the most critical. Reliable fact-checking tools and data sources are likewise essential for hallucination mitigation. We address these issues by providing FActBench, a comprehensive fact-checking benchmark covering four generation tasks and six state-of-the-art LLMs for the medical domain. We use two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI). Our experiments show that the fact-checking scores obtained through unanimous voting of both techniques correlate best with domain expert evaluation.
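To make the unanimous-voting idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a generated text has already been split into individual facts and that each fact carries a boolean verdict from a CoT-based checker and from an NLI-based checker. The names `FactVerdict`, `unanimous_vote`, and `factuality_score` are hypothetical, and scoring a text as the fraction of supported facts is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactVerdict:
    """Verdicts for one extracted fact (hypothetical structure)."""
    fact: str
    cot_supported: bool  # verdict from a CoT-prompted checker (assumed given)
    nli_supported: bool  # verdict from an NLI-based checker (assumed given)


def unanimous_vote(verdict: FactVerdict) -> bool:
    """A fact counts as supported only if both techniques agree it is."""
    return verdict.cot_supported and verdict.nli_supported


def factuality_score(verdicts: List[FactVerdict]) -> float:
    """Fraction of facts supported under unanimous voting (assumed metric)."""
    if not verdicts:
        return 0.0
    supported = sum(unanimous_vote(v) for v in verdicts)
    return supported / len(verdicts)


# Placeholder verdicts, not data from the benchmark.
example = [
    FactVerdict("fact A", cot_supported=True, nli_supported=True),
    FactVerdict("fact B", cot_supported=True, nli_supported=False),
    FactVerdict("fact C", cot_supported=False, nli_supported=False),
    FactVerdict("fact D", cot_supported=True, nli_supported=True),
]
score = factuality_score(example)
```

Under this sketch, a fact on which the two checkers disagree is treated as unsupported, which makes unanimous voting the stricter of the possible ways to combine the two verdicts.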