AI · Apr 4

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

Yulong He, Ivan Smirnov, Dmitry Fedrushkov, Sergey Kovalchuk, Ilya Revin

arXiv:2604.03742 · 51.2 · h-index: 1 · Has Code

Predicted impact top 71% in AI · last 90 days

Originality: Incremental advance

AI Analysis

For researchers and practitioners needing reliable LLM evaluation, this work provides a structured, uncertainty-aware method that improves judgment consistency and outperforms direct scoring.

The authors adapt Analytic Hierarchy Process (AHP) and propose a confidence-aware Fuzzy AHP (FAHP) for LLM evaluation, showing that both crisp and fuzzy AHP outperform direct scoring on JudgeBench, with FAHP improving stability. Their hybrid framework DualJudge achieves state-of-the-art performance.

Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.
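To make the AHP and FAHP steps named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It uses textbook AHP (row geometric mean weights, Saaty's consistency ratio) and Buckley's geometric mean method for the fuzzy case. The three criteria, the example matrices, the `max_ratio` parameter, and the multiplicative rule that widens each triangular fuzzy number as confidence drops are all assumptions made for illustration.

```python
import math

# Saaty's random consistency index for matrix sizes 1..5.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Crisp priority weights via the row geometric mean approximation."""
    n = len(matrix)
    row_gm = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(row_gm)
    return [g / total for g in row_gm]


def consistency_ratio(matrix, weights):
    """Saaty's CR = CI / RI, with lambda_max estimated from (A w)_i / w_i."""
    n = len(matrix)
    if RANDOM_INDEX[n] == 0.0:
        return 0.0
    lam = sum(
        sum(matrix[i][j] * weights[j] for j in range(n)) / weights[i]
        for i in range(n)
    ) / n
    return ((lam - n) / (n - 1)) / RANDOM_INDEX[n]


def fuzzify(value, confidence, max_ratio=2.0):
    """Triangular fuzzy number (l, m, u) around a crisp judgment.

    Assumed rule (hypothetical, not from the paper): the spread is
    multiplicative, so confidence 1.0 gives a crisp number and lower
    confidence widens the triangle, while reciprocals stay consistent.
    """
    ratio = 1.0 + (1.0 - confidence) * (max_ratio - 1.0)
    return (value / ratio, value, value * ratio)


def fahp_weights(matrix, confidence):
    """Buckley's method: fuzzy row geometric means, then centroid defuzzification."""
    n = len(matrix)
    fuzzy_gm = []
    for i in range(n):
        cells = [fuzzify(matrix[i][j], confidence[i][j]) for j in range(n)]
        fuzzy_gm.append(
            tuple(math.prod(c[k] for c in cells) ** (1.0 / n) for k in range(3))
        )
    sum_l, sum_m, sum_u = (sum(g[k] for g in fuzzy_gm) for k in range(3))
    # Fuzzy weight of row i is (l / sum_u, m / sum_m, u / sum_l); take its centroid.
    crisp = [(l / sum_u + m / sum_m + u / sum_l) / 3.0 for l, m, u in fuzzy_gm]
    total = sum(crisp)
    return [c / total for c in crisp]


# Hypothetical pairwise comparisons over three illustrative criteria
# (correctness, reasoning, clarity) on Saaty's 1-9 scale.
criteria_matrix = [
    [1.0, 3.0, 5.0],
    [1.0 / 3.0, 1.0, 2.0],
    [1.0 / 5.0, 1.0 / 2.0, 1.0],
]
# Hypothetical judge confidence per comparison; the diagonal is 1.0 so that
# self-comparisons stay crisp.
confidence_matrix = [
    [1.0, 0.9, 0.4],
    [0.9, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]

weights = ahp_weights(criteria_matrix)
cr = consistency_ratio(criteria_matrix, weights)
fuzzy_weights = fahp_weights(criteria_matrix, confidence_matrix)

print("crisp weights:", [round(w, 3) for w in weights])
print("consistency ratio:", round(cr, 4))
print("fuzzy weights:", [round(w, 3) for w in fuzzy_weights])
```

A consistency ratio below 0.1 is the conventional threshold for accepting a comparison matrix. A fusion scheme in the spirit of DualJudge could use such a ratio to decide how much to trust the structured output relative to a direct score, though the exact weighting the paper uses is not specified here.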
