CL LG · May 15

Judge Circuits

Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

arXiv:2605.16023 · 95.5

AI Analysis

For researchers using LLM-as-a-judge, this work provides a mechanistic explanation of format-induced inconsistency, showing that benchmark comparisons across formats may be confounded by formatting artifacts.

The paper finds that LLM judges assign inconsistent scores across output formats (e.g., a 1-5 rating vs. a True/False label). The judgment itself is computed in a shared Latent Evaluator sub-graph in the mid-to-late MLPs; the inconsistency arises when that signal is mapped through fragile, format-specific terminal branches. Cross-format reliability comparisons therefore partially measure formatter geometry rather than evaluation quality.
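To make the notion of format-induced inconsistency concrete, here is a minimal sketch of one way to quantify it at the input-output level: collapse each 1-5 rating to a boolean and count how often it matches the True/False verdict for the same item. The function names, the threshold of 3, and the sample data are all hypothetical and are not taken from the paper.

```python
def rating_to_label(rating, threshold=3):
    """Collapse a 1-5 rating into a boolean verdict.

    The threshold is an assumption made for illustration only.
    """
    return rating > threshold


def format_agreement(ratings, labels, threshold=3):
    """Fraction of items where the binarized rating matches the True/False label."""
    if len(ratings) != len(labels):
        raise ValueError("ratings and labels must have the same length")
    if not ratings:
        raise ValueError("need at least one item")
    matches = sum(
        rating_to_label(r, threshold) == label
        for r, label in zip(ratings, labels)
    )
    return matches / len(ratings)


# Hypothetical outputs from one judge on the same five items,
# asked once for a 1-5 rating and once for a True/False label.
ratings = [5, 4, 2, 3, 1]
labels = [True, False, False, True, False]

agreement = format_agreement(ratings, labels)
print(f"cross-format agreement: {agreement:.2f}")
```

An agreement below 1.0 shows that the two formats disagree on some items, but it does not say where the disagreement comes from. That is the gap the paper's circuit-level analysis addresses.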

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.
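The zero-ablation step mentioned in the abstract can be illustrated with a toy residual stack. This is only a sketch under strong assumptions: the "model" is a handful of random ReLU blocks, whole MLP outputs are zeroed rather than the individual edges that PEAP attributes, and every name here (`make_mlp`, `forward`, the chosen layer indices) is hypothetical.

```python
import random


def make_mlp(dim, rng):
    """Random square weight matrix standing in for one MLP block (toy only)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


def mlp_forward(weights, x):
    """Compute ReLU(W x) for a single vector x."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def forward(layers, x, ablate=frozenset()):
    """Run a residual stream through the layers.

    Each layer adds its MLP output to the stream. Layers whose index is in
    `ablate` are zero-ablated: their contribution is replaced by zeros.
    """
    stream = list(x)
    for i, weights in enumerate(layers):
        if i in ablate:
            out = [0.0] * len(stream)
        else:
            out = mlp_forward(weights, stream)
        stream = [s + o for s, o in zip(stream, out)]
    return stream


rng = random.Random(0)
layers = [make_mlp(4, rng) for _ in range(6)]
x = [1.0, 0.5, -0.5, 2.0]

clean = forward(layers, x)
# Zero-ablate two "mid-to-late" blocks (indices chosen arbitrarily here).
ablated = forward(layers, x, ablate={3, 4})
# Ablating every block leaves the input untouched, a quick sanity check.
all_ablated = forward(layers, x, ablate=set(range(6)))

# L1 distance between clean and ablated outputs as a crude effect size.
effect = sum(abs(c - a) for c, a in zip(clean, ablated))
print("effect of ablating layers 3 and 4:", effect)
```

In a real transformer the same idea is usually implemented by hooking the MLP module and overwriting its output. Comparing the effect on judgment prompts against the effect on factual-recall prompts is the kind of contrast the abstract describes as collapsing judgment while preserving world knowledge.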
