CL · AI · Dec 29, 2025

Not too long do read: Evaluating LLM-generated extreme scientific summaries

arXiv:2512.23206v1 · h-index: 1 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of evaluating LLM summarization for science communication, but it is incremental as it focuses on dataset creation and basic analysis without major methodological breakthroughs.

The paper tackles the lack of a high-quality dataset for evaluating LLM-generated scientific summaries by introducing BiomedTLDR, a novel dataset of researcher-authored summaries, and finds that LLMs tend to be more extractive and less abstractive than humans, with some models successfully producing human-like summaries.

High-quality extreme scientific summaries (TLDRs) facilitate effective science communication. How well do large language models (LLMs) generate them, and how do LLM-generated summaries differ from those written by human experts? The lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and the evaluation of LLMs' summarization ability. To address this gap, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries of scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs on generating TLDRs from abstracts. Our analysis reveals that, although some models successfully produce human-like summaries, LLMs generally show a greater affinity for the original text's lexical choices and rhetorical structures, and hence tend to be more extractive than abstractive compared with humans. Our code and datasets are available at https://github.com/netknowledge/LLM_summarization (Lyu and Ke, 2025).
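To make the extractive-versus-abstractive distinction concrete, here is a minimal Python sketch of one common proxy: the share of a summary's n-grams that do not appear in the source abstract. This is an illustrative assumption, not the paper's metric, which this page does not specify. The function names (`tokens`, `ngrams`, `novel_ngram_ratio`) and the example strings are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(toks, n):
    """Return the set of unique n-grams (as tuples) in a token list."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def novel_ngram_ratio(source, summary, n=2):
    """Fraction of the summary's unique n-grams that are absent from the source.

    Values near 0 suggest a more extractive summary (wording copied from the
    source); values near 1 suggest a more abstractive one. This is a rough
    proxy assumed for illustration, not the measure used in the paper.
    """
    summary_ngrams = ngrams(tokens(summary), n)
    if not summary_ngrams:
        return 0.0  # summary too short to contain any n-gram
    source_ngrams = ngrams(tokens(source), n)
    return len(summary_ngrams - source_ngrams) / len(summary_ngrams)


# Hypothetical example texts, made up for illustration.
abstract = "We propose a novel dataset of researcher-authored summaries."
extractive = "We propose a novel dataset."
abstractive = "Authors release expert-written TLDRs."

print(novel_ngram_ratio(abstract, extractive))
print(novel_ngram_ratio(abstract, abstractive))
```

Under this proxy, a summary that reuses the abstract's phrasing scores low and a reworded one scores high. Comparing the scores of human-written and LLM-generated TLDRs for the same abstract is one way such a gap could be quantified.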

