CL | Mar 2, 2024

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

arXiv:2403.01061v3 | 35 citations | h-index: 32 | TACL
Originality: Incremental advance
AI Analysis

This work evaluates LLMs on summarizing unseen short stories supplied directly by their authors, and highlights the limitations of automatic metrics for judging summary quality.

The study evaluated GPT-4, Claude-2.1, and LLama-2-70B on summarizing short stories with nuanced subtext, finding that all models made faithfulness mistakes in over 50% of summaries and struggled with specificity and interpretation.

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

Code Implementations: 2 repos
