CL CY Dec 22, 2025

Counterfactual LLM-based Framework for Measuring Rhetorical Style

arXiv:2512.19908v1 4.9

Originality: Incremental advance

AI Analysis

This paper addresses a problem that matters to the scientific community: distinguishing hype from substance in AI research. The advance is incremental because it builds on existing LLM and statistical modeling techniques.

The authors tackle the problem of quantifying rhetorical style in machine learning papers independently of content by introducing a counterfactual LLM-based framework. They find that visionary framing predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. They also report a sharp rise in rhetorical strength after 2023, which they attribute largely to the adoption of LLM-based writing assistance.

The rise of AI has fueled growing concerns about "hype" in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley-Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary framing significantly predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide empirical evidence showing that this increase is largely driven by the adoption of LLM-based writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments to measure and improve scientific evaluation.
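The aggregation step mentioned in the abstract, turning pairwise judge outcomes into a single strength score per writing, can be illustrated with a minimal Bradley-Terry fit. The sketch below is not the authors' implementation: it uses the standard minorization-maximization (MM) update, and the persona names (`plain`, `balanced`, `visionary`), the win counts, and the function name `fit_bradley_terry` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    then rescales so the strengths sum to the number of items.
    """
    wins = defaultdict(float)         # total wins per item
    pair_counts = defaultdict(float)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))

    items = sorted(items)
    strength = {i: 1.0 for i in items}  # start from equal strengths

    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]

        # Strengths are only identified up to scale, so fix the scale.
        total = sum(updated.values())
        updated = {i: v * len(items) / total for i, v in updated.items()}

        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical judge outcomes: each tuple is (text judged stronger, other text).
comparisons = (
    [("visionary", "plain")] * 9 + [("plain", "visionary")] * 1
    + [("visionary", "balanced")] * 7 + [("balanced", "visionary")] * 3
    + [("balanced", "plain")] * 8 + [("plain", "balanced")] * 2
)

strengths = fit_bradley_terry(comparisons)
for name, value in sorted(strengths.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Under the Bradley-Terry model, the probability that writing i is judged stronger than writing j is p_i / (p_i + p_j), so the fitted strengths place all writings on one scale even though the judge only ever sees pairs. The MM update needs every item to win at least once to keep its strength positive; a production analysis would more likely rely on a regularized or library-based fit.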

