How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description
For bibliometric researchers, this work provides evidence that hybrid workflows combining algorithmic structure with LLM-generated descriptions are more reliable than LLMs alone.
The paper evaluates whether bibliometric structure improves LLM-assisted synthesis by comparing six pipelines for generating cluster descriptions. Results show that LLMs produce descriptions semantically close to human-written ones but are unreliable when inferring bibliometric structure from scratch; performance improves when bibliometric algorithms define the clusters and LLMs interpret them.
Large language models (LLMs) can support scientific literature synthesis, but remain prone to hallucinated references, uneven coverage, and weakly grounded thematic organization. We evaluate whether bibliometric structure improves LLM-assisted synthesis by comparing six pipelines for generating cluster descriptions under different levels of evidence and structure. Using 100 published bibliometric analyses, we reconstruct Scopus corpora, extract human-written cluster descriptions, and assess outputs by human alignment, semantic coverage, clustering quality, graph quality, and reference grounding. Results show that LLMs produce descriptions semantically close to human-written ones, but are unreliable when asked to infer bibliometric structure from scratch. Performance improves when bibliometric algorithms define the clusters and the LLM interprets them. Overall, LLM-assisted bibliometric synthesis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs generate readable descriptions.
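The hybrid workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's method: connected components over citation links stand in for whatever bibliometric clustering algorithm a real analysis would use, `build_prompt` is a hypothetical helper that only assembles text (no LLM is called), and `grounding_rate` is one plausible way to quantify reference grounding, not necessarily the metric the authors compute.

```python
from collections import defaultdict


def connected_components(edges):
    """Group paper IDs into clusters using union-find over citation links.

    Stand-in for a real bibliometric clustering algorithm (assumption).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def build_prompt(cluster, titles):
    """Assemble the text an LLM would be asked to interpret (hypothetical wording)."""
    lines = [f"- [{pid}] {titles[pid]}" for pid in cluster]
    header = "Describe the shared theme of these papers, citing only the listed IDs:"
    return "\n".join([header] + lines)


def grounding_rate(cited_ids, cluster):
    """Fraction of cited IDs that belong to the cluster; None if nothing is cited."""
    cited = list(cited_ids)
    if not cited:
        return None
    members = set(cluster)
    return sum(1 for pid in cited if pid in members) / len(cited)


# Toy data: the algorithm, not the LLM, decides which papers belong together.
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
titles = {
    "p1": "Co-citation analysis of topic A",
    "p2": "Survey of topic A",
    "p3": "Methods for topic A",
    "p4": "Topic B overview",
    "p5": "Topic B case study",
}
clusters = connected_components(edges)
prompt = build_prompt(clusters[0], titles)
```

The design point is the division of labor: cluster membership comes from a deterministic, auditable step, and the LLM's output can then be checked against that membership, for example by passing the IDs it cites to `grounding_rate`.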