LG · CL · Jul 11, 2025

One Token to Fool LLM-as-a-Judge

arXiv:2507.08794v2 · 51 citations · h-index: 19 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a critical reliability problem for researchers and practitioners who rely on LLM judges for evaluation and training, for example in Reinforcement Learning with Verifiable Rewards (RLVR). The advance is incremental because it builds on existing reward model frameworks.

The paper tackles the vulnerability of large language models (LLMs) used as automated judges, showing that superficial inputs like non-word symbols or generic reasoning openers can consistently elicit false positive rewards, affecting models including GPT-o1 and Claude-4. It proposes a data augmentation strategy using truncated outputs as adversarial examples, resulting in Master Reward Models that achieve state-of-the-art robustness against these attacks while maintaining high performance.

Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term "master keys", such as non-word symbols (e.g., ":" or ".") or generic reasoning openers (e.g., "Thought process:" or "Let's solve this problem step by step."), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these "master key" attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
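
To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes that "truncated model outputs" means keeping only the opening sentence of a response, and that a judge can be modelled as a callable returning True when it accepts a response. The function names, the dictionary fields (`question`, `reference`, `response`, `label`), and the `judge` signature are all hypothetical and chosen for illustration only.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

# Example "master key" strings quoted in the abstract.
MASTER_KEYS = [":", ".", "Thought process:", "Let's solve this problem step by step."]


def truncate_to_first_sentence(response: str) -> str:
    """Keep only the opening sentence of a response.

    Assumption: a sentence ends at '.', ':', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.:!?])\s+", response.strip(), maxsplit=1)
    return parts[0]


def augment_with_truncated_negatives(examples: List[Dict]) -> List[Dict]:
    """Return the original examples plus truncated copies labelled as negatives.

    Each example is a dict with hypothetical keys:
    'question', 'reference', 'response', 'label' (1 = correct, 0 = incorrect).
    """
    augmented = list(examples)
    for ex in examples:
        opener = truncate_to_first_sentence(ex["response"])
        # Skip responses that are already a single sentence: truncating them
        # would leave a full answer, which should not be marked as negative.
        if opener and opener != ex["response"].strip():
            augmented.append({**ex, "response": opener, "label": 0})
    return augmented


def false_positive_rate(
    judge: Callable[[str, str, str], bool],
    items: Iterable[Tuple[str, str]],
    key: str,
) -> float:
    """Fraction of (question, reference) pairs for which the judge accepts `key`."""
    items = list(items)
    if not items:
        return 0.0
    hits = sum(1 for question, reference in items if judge(question, reference, key))
    return hits / len(items)
```

A judge that only checks for a reasoning opener would score a false positive rate of 1.0 on "Let's solve this problem step by step.", while a judge that requires the reference answer to appear in the response would score 0.0. Training on the augmented set is intended to push a reward model toward the second behaviour.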

