CL AI LG · Feb 28, 2025

Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models

arXiv:2502.20647v1 · 2 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating summary consistency for NLP researchers and practitioners, but it is incremental as it applies existing methods to a new dataset and introduces a meta-evaluation score.

The paper evaluates the consistency of news article summaries generated by several summarization methods (TextRank, BART, Mistral-7B-Instruct, and GPT-3.5-Turbo) using ROUGE, BERTScore, and LLM-powered evaluation. It finds that every method produced summaries more consistent with the source text than the reference summaries in the XL-Sum dataset.

Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Large Language Models (LLMs) have shown remarkable promise in generating fluent abstractive summaries, but they can produce hallucinated details not grounded in the source text. Regardless of how a summary is generated, high-quality automated evaluation remains an open area of investigation. This paper explores text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. The generated summaries are evaluated using traditional metrics such as the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score and the Bidirectional Encoder Representations from Transformers (BERT) score, as well as LLM-powered evaluation methods that directly assess a generated summary's consistency with the source text. We introduce a meta-evaluation score that directly assesses the performance of the LLM evaluation system (prompt + model). We find that all summarization models produce consistent summaries when tested on the XL-Sum dataset, exceeding the consistency of the reference summaries.
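For readers unfamiliar with the traditional metrics named in the abstract, the following is a minimal sketch of ROUGE-1 (unigram overlap) between a generated summary and a reference summary. It is an illustration only, not the paper's evaluation code: the function name `rouge1` and the lowercase whitespace tokenization are assumptions, and standard ROUGE implementations typically add stemming and more careful tokenization.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Illustrative ROUGE-1: clipped unigram overlap between two texts.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())

    # max(..., 1) avoids division by zero for empty inputs.
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge1("the cat sat", "the cat sat on the mat")
    print(scores)
```

Because ROUGE compares a summary against a reference rather than against the source article, it measures overlap with the reference and not factual consistency with the source, which is the gap the LLM-powered evaluation methods in the abstract are meant to address.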
