CL Nov 22, 2022

HaRiM$^+$: Evaluating Summary Quality with Hallucination Risk

Seonil Son, Junsoo Park, Jeong-in Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee

arXiv:2211.12118v224.3297 citations h-index: 10 Has Code

Originality: Incremental advance

AI Analysis

This work addresses summary-quality evaluation for researchers and developers in natural language processing. It is an incremental advance, building on prior methods for hallucination risk measurement.

The authors address the problem of measuring factual inconsistency in generated summaries by proposing HaRiM+, a reference-free metric that estimates hallucination risk from the token likelihoods of an off-the-shelf summarization model. It achieves state-of-the-art correlation with human judgment on the FRANK, QAGS, and SummEval annotation sets.

One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested by Miao et al. (2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation with human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summaries.
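
To make the idea of scoring a summary from token likelihoods concrete, here is a minimal sketch. The abstract does not state the formula, so everything specific below is an assumption made for illustration: the per-token risk term `(1 - p_s2s) * (1 - (p_s2s - p_lm))`, the combination of average log-likelihood with a weighted risk penalty, the default weight `lam=7.0`, and the function names `harim_risk` and `harim_plus`. The inputs are assumed to be probabilities already extracted from a summarization model: `p_s2s` for each summary token given the source, and `p_lm` for the same token without the source.

```python
import math
from typing import Sequence


def _validate(p_s2s: Sequence[float], p_lm: Sequence[float]) -> None:
    # Both sequences must align token by token and hold valid probabilities.
    if len(p_s2s) == 0 or len(p_s2s) != len(p_lm):
        raise ValueError("p_s2s and p_lm must be non-empty and of equal length")
    for p in list(p_s2s) + list(p_lm):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")


def harim_risk(p_s2s: Sequence[float], p_lm: Sequence[float]) -> float:
    """Average per-token hallucination risk (assumed form, see lead-in).

    Risk is high when the model is unsure of a token (low p_s2s) and when
    the source adds little over the source-free prediction (small margin).
    """
    _validate(p_s2s, p_lm)
    total = 0.0
    for ps, pl in zip(p_s2s, p_lm):
        margin = ps - pl  # how much the source document supports this token
        total += (1.0 - ps) * (1.0 - margin)
    return total / len(p_s2s)


def harim_plus(p_s2s: Sequence[float], p_lm: Sequence[float],
               lam: float = 7.0) -> float:
    """Average log-likelihood minus a weighted risk penalty (assumed form).

    Higher scores indicate a summary that is better supported by the source.
    """
    _validate(p_s2s, p_lm)
    avg_log_lik = sum(math.log(p) for p in p_s2s) / len(p_s2s)
    return avg_log_lik - lam * harim_risk(p_s2s, p_lm)


if __name__ == "__main__":
    # Hypothetical token probabilities, not taken from any real model run.
    grounded = harim_plus([0.9, 0.8], [0.2, 0.1])
    ungrounded = harim_plus([0.5, 0.4], [0.6, 0.5])
    print(grounded > ungrounded)
```

In practice the two probability lists would come from two forward passes of the same summarization model, one with the source document and one with an empty source, which is consistent with the abstract's claim that no extra training or auxiliary modules are needed.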
