CR · Mar 30

Misleading Large Language Models used (or misused) in Scientific Peer-Reviewing via Hidden Prompt-Injection Attacks

Matteo Gioele Collu, Umberto Salviati, Roberto Confalonieri, Mauro Conti, Giovanni Apruzzese

arXiv:2508.20863 · 69.53 citations · h-index: 7

Predicted impact: top 35% in CR · last 90 days · Originality: Incremental advance

AI Analysis

This work highlights a security vulnerability for a scientific community that increasingly relies on LLMs for peer review. The threat models are formalized, although the attack is demonstrated in controlled settings.

The authors demonstrate that hidden prompt injection attacks can reliably mislead LLMs used in scientific peer review, achieving high success rates across different reviewing prompts, LLM systems, and papers. They propose and evaluate methods to reduce detectability of such attacks.

Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations -- not all of which imply malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Using a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect an "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce the detectability of adversarial prompts under automated content checks.
