AI · CL · Nov 25, 2024

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng

arXiv:2411.16594v750.7542 citations · h-index: 16 · Has Code · EMNLP

Originality: Synthesis-oriented

AI Analysis

The survey gives AI/NLP researchers and practitioners a comprehensive overview of the field, but its contribution is incremental: it synthesizes existing work rather than presenting new methods or results.

This paper surveys the 'LLM-as-a-judge' paradigm, which uses large language models for scoring, ranking, or selection in AI and NLP evaluation, addressing challenges in open-ended scenarios.

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Traditional methods, usually matching-based or small model-based, often fall short in open-ended and dynamic scenarios. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection for various machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to review this evolving field. We first provide the definition from both input and output perspectives. Then we introduce a systematic taxonomy to explore LLM-as-a-judge along three dimensions: what to judge, how to judge, and how to benchmark. Finally, we also highlight key challenges and promising future directions for this emerging area. More resources on LLM-as-a-judge are on the website: https://llm-as-a-judge.github.io and https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge.
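The three output forms named in the abstract (scoring, ranking, selection) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Judge` interface, the function names, and especially `toy_judge`, which replaces a prompted LLM call with a simple word-overlap heuristic so that the example runs without any model or API.

```python
from typing import Callable, List

# Hypothetical interface: a judge maps (instruction, candidate) to a score.
# In a real setup this would wrap a prompted LLM call.
Judge = Callable[[str, str], float]


def toy_judge(instruction: str, candidate: str) -> float:
    """Stand-in for an LLM: fraction of instruction words the candidate covers."""
    wanted = set(instruction.lower().split())
    found = set(candidate.lower().split())
    return len(wanted & found) / len(wanted) if wanted else 0.0


def score(judge: Judge, instruction: str, candidate: str) -> float:
    """Scoring: assign a numeric value to a single candidate."""
    return judge(instruction, candidate)


def rank(judge: Judge, instruction: str, candidates: List[str]) -> List[str]:
    """Ranking: order all candidates from best to worst."""
    return sorted(candidates, key=lambda c: judge(instruction, c), reverse=True)


def select(judge: Judge, instruction: str, candidates: List[str]) -> str:
    """Selection: return only the single best candidate."""
    return max(candidates, key=lambda c: judge(instruction, c))


if __name__ == "__main__":
    instruction = "name the capital of france"
    candidates = [
        "i like cheese",
        "the capital",
        "paris is the capital of france",
    ]
    print(score(toy_judge, instruction, candidates[2]))
    print(rank(toy_judge, instruction, candidates))
    print(select(toy_judge, instruction, candidates))
```

Keeping the judge behind a single callable means the same three wrappers would work unchanged if `toy_judge` were swapped for a function that prompts an actual model; the survey's own taxonomy of judging methods is considerably richer than this sketch.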
