CL | AI | Dec 10, 2024

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation

Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh

CMU

arXiv:2412.10424v3 | 18 citations | h-index: 48 | Has Code | ACL

Originality: Highly original

AI Analysis

This work addresses the need among researchers and practitioners for more realistic and comprehensive LLM evaluation, though it is incremental in that it builds on existing LLM-as-a-Judge methods.

The paper introduces LLM-as-an-Interviewer, a dynamic evaluation framework in which an LLM interviewer conducts multi-turn interactions with the evaluated model, giving feedback on its responses and asking follow-up questions. Applied to six models on the MATH and DepthQA tasks, the framework yields insights into aspects of performance such as adaptability to feedback, and it addresses limitations of conventional evaluation methods.

We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large language models (LLMs). This approach leverages multi-turn interactions where the LLM interviewer actively provides feedback on responses and poses follow-up questions to the evaluated LLM. At the start of the interview, the LLM interviewer dynamically modifies datasets to generate initial questions, mitigating data contamination. We apply the LLM-as-an-Interviewer framework to evaluate six models on the MATH and DepthQA tasks. Our results show that the framework effectively provides insights into LLM performance, including the quality of initial responses, adaptability to feedback, and ability to address follow-up queries like clarification or additional knowledge requests. The framework also addresses key limitations of conventional methods like LLM-as-a-Judge, including verbosity bias and inconsistency across runs. Finally, we propose the Interview Report, which aggregates insights from the interview process, providing examples and a comprehensive analysis of the LLM's strengths and weaknesses. This report offers a detailed snapshot of the model's real-world applicability. The code for our framework is publicly available at https://github.com/interview-eval/.
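The interview flow described in the abstract (modify a seed question, collect an initial response, give feedback on incorrect answers, then pose a follow-up) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the API of the released code. Every name here (`run_interview`, `Turn`, `summarize`, and the `toy_*` stubs) is hypothetical, and the interviewer and evaluated model are replaced by deterministic stand-in functions so the control flow can be run without any LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    kind: str       # "initial", "feedback", or "follow_up"
    prompt: str
    response: str
    correct: bool


def run_interview(
    seed_question: str,
    modify: Callable[[str], str],              # interviewer: rewrite the seed question
    candidate: Callable[[str], str],           # evaluated model
    grade: Callable[[str, str], bool],         # interviewer: judge (question, response)
    give_feedback: Callable[[str, str], str],  # interviewer: feedback prompt
    follow_up: Callable[[str, str], str],      # interviewer: follow-up question
    max_feedback_turns: int = 2,               # assumed budget, not from the paper
) -> List[Turn]:
    turns: List[Turn] = []
    # Modify the dataset question first, so the model never sees the seed verbatim.
    question = modify(seed_question)
    prompt, kind = question, "initial"
    for _ in range(max_feedback_turns + 1):
        response = candidate(prompt)
        ok = grade(question, response)
        turns.append(Turn(kind, prompt, response, ok))
        if ok:
            break
        # On a wrong answer, the interviewer gives feedback and asks again.
        prompt, kind = give_feedback(question, response), "feedback"
    if turns[-1].correct:
        # Once the main question is solved, probe further with a follow-up.
        fq = follow_up(question, turns[-1].response)
        fr = candidate(fq)
        turns.append(Turn("follow_up", fq, fr, grade(fq, fr)))
    return turns


def summarize(turns: List[Turn]) -> dict:
    """Aggregate one interview into a few report-style fields (illustrative only)."""
    main = [t for t in turns if t.kind != "follow_up"]
    follow = [t for t in turns if t.kind == "follow_up"]
    follow_ok: Optional[bool] = follow[0].correct if follow else None
    return {
        "initial_correct": main[0].correct,
        "solved_after_feedback": (not main[0].correct) and main[-1].correct,
        "follow_up_correct": follow_ok,
    }


# Deterministic stand-ins for the interviewer and the evaluated model.
REFERENCE = {"What is 4 + 5?": "9", "Now what is 4 + 5 + 1?": "10"}
CANNED = {
    "What is 4 + 5?": "8",  # the toy model gets the first attempt wrong
    "Not quite; recheck the addition. What is 4 + 5?": "9",
    "Now what is 4 + 5 + 1?": "10",
}


def toy_modify(q: str) -> str:
    return q.replace("2 + 3", "4 + 5")


def toy_candidate(prompt: str) -> str:
    return CANNED.get(prompt, "")


def toy_grade(question: str, response: str) -> bool:
    return REFERENCE.get(question) == response


def toy_feedback(question: str, response: str) -> str:
    return f"Not quite; recheck the addition. {question}"


def toy_follow_up(question: str, response: str) -> str:
    return "Now what is 4 + 5 + 1?"
```

Calling `run_interview("What is 2 + 3?", toy_modify, toy_candidate, toy_grade, toy_feedback, toy_follow_up)` produces three turns with these stubs: a failed initial attempt, a corrected answer after feedback, and a follow-up. Passing the result to `summarize` separates initial quality, adaptability to feedback, and follow-up handling, the three aspects the abstract names.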
