Core: Robust Factual Precision with Informative Sub-Claim Identification
This work addresses the challenge of reliable factual assessment for LLM applications, though the contribution is incremental because it builds on existing decomposition-based metrics.
The paper tackles the vulnerability of factual precision evaluation for large language models to manipulation by redundant subclaims, and introduces Core, a subclaim selection component that improves robustness across knowledge domains.
Hallucinations pose a challenge to the application of large language models (LLMs), thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore, can be manipulated by adding obvious or repetitive subclaims to artificially inflate scores. This observation motivates our new customizable plug-and-play subclaim selection component called Core, which filters individual subclaims according to their uniqueness and informativeness. We show that many popular factual precision metrics augmented by Core are substantially more robust on a wide range of knowledge domains. We release an evaluation framework supporting easy and modular use of Core and various decomposition strategies, and we recommend its adoption by the community. We also release an expansion of the FActScore biography dataset to facilitate further studies of decomposition-based factual precision evaluation.
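The score inflation described above can be illustrated with a toy example. The sketch below is not the Core algorithm and not the FActScore implementation: it assumes factual precision is simply the fraction of subclaims judged supported, and it stands in for subclaim selection with a naive exact-match deduplication. The function names `precision` and `dedupe` and the example claims are hypothetical, and the sketch addresses only repetition, not informativeness.

```python
def precision(subclaims):
    """Fraction of subclaims marked as supported; 0.0 for an empty list.

    Each subclaim is a (text, supported) pair. The `supported` flag is
    assumed to come from some external verifier.
    """
    if not subclaims:
        return 0.0
    return sum(1 for _, supported in subclaims if supported) / len(subclaims)


def dedupe(subclaims):
    """Keep the first occurrence of each subclaim.

    Naive stand-in for subclaim selection: compares lowercased,
    whitespace-normalized text only, so paraphrases are NOT caught.
    """
    seen = set()
    kept = []
    for text, supported in subclaims:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append((text, supported))
    return kept


# One supported and one unsupported subclaim.
honest = [
    ("Marie Curie was born in Warsaw.", True),
    ("Marie Curie won three Nobel Prizes.", False),
]

# The same output padded with three repeats of the supported subclaim.
padded = honest + [honest[0]] * 3

print(precision(honest))          # score without padding
print(precision(padded))          # score inflated by repetition
print(precision(dedupe(padded)))  # score after removing repeats
```

In this toy setting, repetition raises the score without adding any information, and removing duplicates before scoring restores the original value. Handling paraphrased or trivially true subclaims would need more than string matching.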