CL, IR | Dec 7, 2024

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

arXiv:2412.05579v2 | 480 citations | h-index: 19 | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper organizes existing knowledge on LLM-based evaluation methods, which is useful for researchers and practitioners in AI and NLP, but as a survey its contribution is incremental.

This paper surveys the use of Large Language Models (LLMs) as evaluators of natural language responses, covering functionality, methodology, applications, meta-evaluation, and limitations to provide insights for research and practice.

The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its effectiveness, its ability to generalize across tasks, and its interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-judges and introducing their functionality (Why use LLM judges?). We then address the methodology for constructing an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.

Code Implementations: 1 repo