MA CL Jun 4

MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

Ali Keramati, Shiyuan Zhou, Sharad Mehrotra, Mark Warschauer

arXiv:2606.06754 13.9

Originality: Incremental advance

AI Analysis

Provides a training-free method for reliable automated essay scoring, addressing bias and instability in LLM-as-judge approaches.

MADRAG combines multi-agent debate with retrieval-augmented generation to score analytic essays without training, outperforming prompt-based baselines and approaching supervised system performance.

We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.
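The Advocate, Skeptic, and Judge flow described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the retrieval method, the number of debate rounds, or the prompts. All names here (`Exemplar`, `jaccard`, `retrieve`, `score_essay`, and the stub agents) are hypothetical, token-overlap similarity stands in for whatever retriever the paper uses, and the agents are plain functions where a real system would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Exemplar:
    """A previously scored essay (hypothetical structure)."""
    text: str
    score: int


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for the paper's retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(essay: str, pool: List[Exemplar], k: int = 2) -> List[Exemplar]:
    """Return the k scored exemplars most similar to the essay."""
    return sorted(pool, key=lambda e: jaccard(essay, e.text), reverse=True)[:k]


def score_essay(
    essay: str,
    advocate: Callable[[str], str],
    skeptic: Callable[[str, str], str],
    judge: Callable[[str, str, str, List[Exemplar]], int],
    pool: List[Exemplar],
    k: int = 2,
) -> int:
    """One debate round (an assumption), then a retrieval-grounded verdict."""
    strengths = advocate(essay)               # Advocate argues for the essay
    weaknesses = skeptic(essay, strengths)    # Skeptic critiques it
    exemplars = retrieve(essay, pool, k)      # only the Judge sees exemplars
    return judge(essay, strengths, weaknesses, exemplars)


# Stub agents for illustration only; real agents would be LLM calls.
def stub_advocate(essay: str) -> str:
    return "Cites evidence for its claim."


def stub_skeptic(essay: str, strengths: str) -> str:
    return "Reasoning linking evidence to claim is thin."


def stub_judge(essay: str, strengths: str, weaknesses: str,
               exemplars: List[Exemplar]) -> int:
    # Calibrate against the retrieved exemplars by averaging their scores.
    return round(sum(e.score for e in exemplars) / len(exemplars))


pool = [
    Exemplar("the author uses evidence to support the claim", 3),
    Exemplar("i like dogs", 1),
    Exemplar("the claim is supported with evidence and reasoning", 5),
]
essay = "the author supports the claim with evidence"
result = score_essay(essay, stub_advocate, stub_skeptic, stub_judge, pool)
print(result)
```

Keeping retrieval on the Judge's side only mirrors the abstract's statement that the Judge is the agent augmented with exemplars. Whether the Skeptic sees the Advocate's argument, as it does here, is a design choice made for this sketch.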
