CL AI · Sep 30, 2024

UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs

Yuho Lee, Taewon Yun, Jason Cai, Hang Su, Hwanjun Song

Amazon

arXiv:2409.19898v2 · 17.132 citations · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better summarization evaluation tools for researchers and developers, though it is incremental as it builds on existing benchmark concepts.

The authors address the limited scope and coarse granularity of existing summarization evaluation benchmarks by creating UniSumEval, a unified, fine-grained, multi-dimensional benchmark. They use it to benchmark nine recent language models as summarizers and to compare state-of-the-art automated evaluators.

Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and struggle with subjective and coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input contexts (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation, both to identify input texts that are likely to induce hallucination and to help human annotators reduce the difficulty of fine-grained annotation tasks. With UniSumEval, we benchmark nine recent language models as summarizers, offering insights into their performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of SOTA automated summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.
