CL · Jul 15, 2025

Real-World Summarization: When Evaluation Reaches Its Limits

Patrícia Schmidtová, Ondřej Dušek, Saad Mahamood

arXiv:2507.11508v1 · 4.92 citations · h-index: 12 · EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses evaluation challenges for businesses that use LLM-generated summaries, but its contribution is incremental: it focuses on a single domain and on existing methods.

The paper tackles the problem of evaluating faithfulness in LLM-generated hotel highlights. It finds that simpler metrics such as word overlap correlate well with human judgments (Spearman rank correlation of 0.63) and often outperform more complex methods on out-of-domain data, while LLMs are unreliable as evaluators because they tend to severely under- or over-annotate.

We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
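To make the reported comparison concrete, the sketch below shows one way a word-overlap score and its Spearman rank correlation with human ratings could be computed. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the exact overlap metric, so `overlap_precision` (the share of highlight tokens found in the source text) is a hypothetical stand-in, and all function names are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_precision(highlight, source):
    """Share of highlight tokens that also occur in the source (clipped counts).

    Hypothetical stand-in for a word-overlap metric; the paper's exact
    metric is not given in the abstract.
    """
    h_counts = Counter(tokenize(highlight))
    s_counts = Counter(tokenize(source))
    total = sum(h_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, s_counts[t]) for t, c in h_counts.items())
    return matched / total


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5
```

In use, one would score each highlight against its source text with `overlap_precision`, collect the matching human faithfulness ratings, and pass both lists to `spearman`. A value near 0.63 would indicate agreement of the strength the abstract reports.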
