CRCLJul 31, 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

arXiv:2507.23453v11 citationsh-index: 7IJCNLP-AACL
Originality Incremental advance
AI Analysis

This addresses security vulnerabilities in LLM-based evaluation systems for users relying on automated assessments, but it is incremental as it builds on existing evaluation methods.

The paper tackles the problem of defending LLM-based evaluation systems against blind prompt injection attacks by proposing a framework that combines Standard Evaluation with Counterfactual Evaluation, which re-evaluates submissions against false ground-truth answers to detect attacks. Experiments show this approach significantly improves security with minimal performance trade-offs, though specific numerical results are not provided.

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes