CL, AI · Apr 22

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

arXiv:2604.20726 · 63.3 · Has Code
AI Analysis

This is an incremental improvement for researchers and practitioners using LLM-as-a-Judge in legal AI, focusing on prompt optimization and judge selection.

This work asks whether automatically optimizing task prompts raises LLM-as-a-Judge scores on free text legal question answering. It finds that automatic optimization consistently outperforms human-centered prompt design, and that feedback from a lenient judge yields higher and more consistent gains than feedback from a strict judge.

This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than prompts optimized with strict feedback transfer to lenient judges. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate that algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026-llm-judge-gaming.
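
The optimize-then-transfer setup described in the abstract can be illustrated with a small sketch. This is not the authors' code and not the ProTeGi implementation: the LLM-driven critique and rewrite steps are replaced by a fixed list of candidate suffixes, and the task model and both judges (`toy_task_model`, `judge_a`, `judge_b`) are hypothetical string-matching stubs. Their scoring rules are arbitrary, so the numbers they produce say nothing about the paper's results. Only the control flow is meant to carry over: beam search over prompts scored by one judge, followed by scoring each optimized prompt under every judge.

```python
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str], float]  # maps a model answer to a score in [0, 1]


def toy_task_model(prompt: str, question: str) -> str:
    """Hypothetical stand-in for a task model: the answer reacts to prompt keywords."""
    parts = ["Answer to: " + question]
    if "reasoning" in prompt:
        parts.append("because ...")
    if "cite" in prompt:
        parts.append("see Art. 1")
    return " ".join(parts)


def judge_a(answer: str) -> float:
    """Hypothetical judge stub with an arbitrary scoring rule."""
    return 1.0 if "because" in answer else 0.5


def judge_b(answer: str) -> float:
    """Hypothetical judge stub with a different arbitrary scoring rule."""
    return 1.0 if ("because" in answer and "Art." in answer) else 0.0


def mean_score(prompt: str, questions: List[str], judge: Judge) -> float:
    """Average judge score of the task model's answers under a given prompt."""
    return sum(judge(toy_task_model(prompt, q)) for q in questions) / len(questions)


def propose_edits(prompt: str) -> List[str]:
    """Stand-in for the LLM step that turns judge feedback into prompt rewrites."""
    suffixes = [
        " Explain your reasoning.",
        " Always cite the governing provision.",
        " Be concise.",
    ]
    return [prompt + s for s in suffixes if s not in prompt]


def optimize(seed: str, questions: List[str], judge: Judge,
             rounds: int = 2, beam_width: int = 2) -> str:
    """Beam search over prompt candidates, keeping the best-scoring ones each round."""
    beam = [seed]
    for _ in range(rounds):
        candidates = beam + [c for p in beam for c in propose_edits(p)]
        candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        candidates.sort(key=lambda p: mean_score(p, questions, judge), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


def transfer_matrix(prompts: Dict[str, str], judges: Dict[str, Judge],
                    questions: List[str]) -> Dict[Tuple[str, str], float]:
    """Score each optimized prompt under every judge: (optimized_with, scored_by) -> score."""
    return {
        (src, tgt): mean_score(prompt, questions, judge)
        for src, prompt in prompts.items()
        for tgt, judge in judges.items()
    }


if __name__ == "__main__":
    seed = "Answer the legal question."
    questions = ["Is the contract void?", "Who bears the risk?"]
    judges = {"a": judge_a, "b": judge_b}
    optimized = {name: optimize(seed, questions, j) for name, j in judges.items()}
    for key, score in transfer_matrix(optimized, judges, questions).items():
        print(key, score)
```

In a real run, `toy_task_model` and the judge stubs would be replaced by calls to the task models and judge models, and `propose_edits` by a model that rewrites the prompt from the judge's written feedback. The off-diagonal entries of the transfer matrix are the cross-judge transfer scores.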

