CL AIJun 16, 2025

AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns

arXiv:2506.13172v2h-index: 1

Originality Incremental advance

AI Analysis

This work addresses the need for reliable automated analysis of academic texts, but it is incremental as it builds on existing prompting methods and highlights model-specific dependencies.

The researchers tackled the problem of analyzing scholarly manuscripts for unsubstantiated claims and ambiguous pronouns using structured prompts with LLMs, finding that model performance varied significantly by task and context, with success rates ranging from 0% to 100% depending on conditions.

We present and evaluate a suite of proof-of-concept (PoC), structured workflow prompts designed to elicit human-like hierarchical reasoning while guiding Large Language Models (LLMs) in the high-level semantic and linguistic analysis of scholarly manuscripts. The prompts target two non-trivial analytical tasks within academic summaries (abstracts and conclusions): identifying unsubstantiated claims (informational integrity) and flagging semantically confusing ambiguous pronoun references (linguistic clarity). We conducted a systematic, multi-run evaluation on two frontier models (Gemini Pro 2.5 Pro and ChatGPT Plus o3) under varied context conditions. Our results for the informational integrity task reveal a significant divergence in model performance: while both models successfully identified an unsubstantiated head of a noun phrase (95% success), ChatGPT consistently failed (0% success) to identify an unsubstantiated adjectival modifier that Gemini correctly flagged (95% success), raising a question regarding the potential influence of the target's syntactic role. For the linguistic analysis task, both models performed well (80-90% success) with full manuscript context. Surprisingly, in a summary-only setting, Gemini's performance was substantially degraded, while ChatGPT achieved a perfect (100%) success rate. Our findings suggest that while structured prompting is a viable methodology for complex textual analysis, prompt performance may be highly dependent on the interplay between the model, task type, and context, highlighting the need for rigorous, model-specific testing.

View on arXiv PDF

Similar