AICLSep 25, 2024

AXCEL: Automated eXplainable Consistency Evaluation using LLMs

arXiv:2409.16984v125 citationsh-index: 19Has Code
Originality Incremental advance
AI Analysis

It addresses the problem of explainable and generalizable consistency evaluation for LLM users, though it is incremental as it builds on existing prompt-based methods.

The paper tackles the challenge of evaluating text consistency in LLM-generated responses by introducing AXCEL, a prompt-based metric that provides explanations and outperforms SOTA metrics by 8.7% in summarization, 6.2% in free text generation, and 29.4% in data-to-text tasks.

Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric which can be adopted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open source LLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes