CL · Oct 14, 2024

Large Language Models Are Active Critics in NLG Evaluation

arXiv:2410.10724v24.24 citations · h-index: 6

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of adapting NLG evaluation to diverse scenarios for developers and users, though the advance is incremental because it builds on existing LLM-based evaluation methods.

The paper tackles the problem of rigid LLM-based NLG evaluation by introducing Active-Critic, which adapts to diverse tasks using limited data and achieves superior alignment with human judgments across multiple tasks.

The conventional paradigm of using large language models (LLMs) for natural language generation (NLG) evaluation relies on pre-defined task definitions and evaluation criteria, positioning LLMs as "passive critics" that strictly follow developer-provided guidelines. However, human evaluators often apply implicit criteria, and their expectations in practice can vary widely based on specific end-user needs. Consequently, these rigid evaluation methods struggle to adapt to diverse scenarios without extensive prompt customization. To address this, we introduce Active-Critic, a novel LLM-based evaluator that transforms LLMs into "active critics" capable of adapting to diverse NLG tasks using limited example data. Active-Critic consists of two stages: (1) self-inferring the target NLG task and relevant evaluation criteria, and (2) dynamically optimizing prompts to produce human-aligned scores along with detailed justifications. Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria, enabling it to achieve superior alignment with human judgments across multiple tasks.
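The two stages described in the abstract can be illustrated with a rough Python sketch. This is not the authors' implementation: the `LLM` callable, the `Example` record, the prompt wording, and every function name below are hypothetical. In particular, the abstract does not say how prompts are optimized, so stage 2 is replaced here with a simple stand-in that tries several candidate demonstration sets and keeps the one whose scores correlate best (Pearson) with human scores on a small held-out set.

```python
import math
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Example:
    source: str         # input given to the NLG system
    output: str         # text produced by the NLG system
    human_score: float  # human rating of that text


def format_examples(examples: Sequence[Example]) -> str:
    return "\n\n".join(
        f"Input: {e.source}\nOutput: {e.output}\nHuman score: {e.human_score}"
        for e in examples
    )


def infer_task_and_criteria(llm: LLM, examples: Sequence[Example]) -> Tuple[str, List[str]]:
    """Stage 1 (sketch): ask the LLM to infer the task and criteria from examples."""
    shown = format_examples(examples)
    task = llm(f"Describe the NLG task shown in these examples.\n\n{shown}").strip()
    raw = llm(f"List evaluation criteria, one per line, for this task: {task}\n\n{shown}")
    criteria = [ln.strip("-* \t") for ln in raw.splitlines() if ln.strip("-* \t")]
    return task, criteria


def build_prompt(task: str, criteria: Sequence[str], demos: Sequence[Example],
                 source: str, output: str) -> str:
    return (
        f"Task: {task}\nCriteria: {', '.join(criteria)}\n\n"
        f"{format_examples(demos)}\n\n"
        f"Input: {source}\nOutput: {output}\n"
        "Reply as 'Score: <number>' followed by a justification."
    )


def parse_score(reply: str) -> float:
    # Assumed reply format; returns 0.0 when no score is found.
    m = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", reply)
    return float(m.group(1)) if m else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0  # 0.0 if either side is constant


def select_demos(llm: LLM, task: str, criteria: Sequence[str],
                 demo_sets: Sequence[Sequence[Example]],
                 dev: Sequence[Example]) -> Tuple[Optional[Sequence[Example]], float]:
    """Stage 2 stand-in: keep the demo set that best matches human scores on dev."""
    best, best_r = None, -2.0
    for demos in demo_sets:
        preds = [parse_score(llm(build_prompt(task, criteria, demos, e.source, e.output)))
                 for e in dev]
        r = pearson(preds, [e.human_score for e in dev])
        if r > best_r:
            best, best_r = demos, r
    return best, best_r
```

Injecting the LLM as a plain callable keeps the sketch independent of any particular model API and makes it easy to exercise with a stub.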
