LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
This work addresses benchmark contamination, a concern for both researchers and practitioners, but it is incremental: it builds on prior concerns about memorization and surface-form shortcuts.
The study addresses benchmark scores in Large Language Models that are inflated by memorization. It evaluates models on paraphrased versions of benchmark questions and finds a non-trivial accuracy drop, which suggests surface-form brittleness.
Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near-duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces a multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial drop in accuracy from original to paraphrased items, consistent with prior concerns about contamination and brittle surface-form shortcuts.
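
The accuracy gap described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`extract_choice`, `accuracy`, `paraphrase_gap`), the strict single-letter parsing rule, and the default label set `ABCD` are all assumptions made for illustration. Outputs that do not conform to the expected multiple-choice format are counted as incorrect here, which is one possible way to enforce the format.

```python
import re
from typing import Optional, Sequence


def extract_choice(output: str, labels: str = "ABCD") -> Optional[str]:
    """Parse a model output that should contain only a choice letter.

    Hypothetical strict rule: accept "B", "(B)", or "B." with optional
    surrounding whitespace; anything else returns None.
    """
    match = re.fullmatch(rf"\s*\(?([{labels}])\)?\.?\s*", output)
    return match.group(1) if match else None


def accuracy(outputs: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of outputs whose parsed letter equals the gold label.

    Unparseable outputs (None) never match a gold label, so they count as wrong.
    """
    if len(outputs) != len(gold) or not gold:
        raise ValueError("outputs and gold must be non-empty and equal in length")
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


def paraphrase_gap(
    original_outputs: Sequence[str],
    paraphrased_outputs: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Accuracy on original items minus accuracy on their paraphrases.

    Assumes both output lists are aligned item by item with the same gold labels.
    A positive value means the model did worse on the paraphrased items.
    """
    return accuracy(original_outputs, gold) - accuracy(paraphrased_outputs, gold)


# Toy data, invented purely to show the call pattern.
gold_labels = ["A", "B", "C", "D"]
original = ["A", "(B)", "C.", "D"]
paraphrased = ["A", "B", "I think C", "A"]
gap = paraphrase_gap(original, paraphrased, gold_labels)
```

Computing the gap on aligned item pairs with shared gold labels keeps the comparison attributable to the rewording alone, provided the paraphrases preserve the meaning of the question.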