AI CL | Sep 25, 2024

AXCEL: Automated eXplainable Consistency Evaluation using LLMs

P Aditya Sreekar, Sahil Verma, Suransh Chopra, Sarik Ghazarian, Abhishek Persad, Narayanan Sadagopan

arXiv:2409.16984v122.025 citations | h-index: 19 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the problem of explainable and generalizable consistency evaluation for LLM users, though the advance is incremental because it builds on existing prompt-based methods.

The paper tackles the challenge of evaluating text consistency in LLM-generated responses by introducing AXCEL, a prompt-based metric that provides explanations and outperforms SOTA metrics by 8.7% in summarization, 6.2% in free text generation, and 29.4% in data-to-text tasks.

Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric that explains its consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric that can be adapted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt-based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open-source LLMs.
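As a rough illustration of what a prompt-based, explainable consistency metric of this kind might look like, here is a minimal Python sketch. It is not the AXCEL prompt or implementation: the prompt wording, the 1 to 5 score range, the JSON answer format, and the names `PROMPT_TEMPLATE`, `ConsistencyResult`, `evaluate_consistency`, and `fake_llm` are all assumptions made for this example. The one idea it takes from the abstract is that a single task-agnostic prompt receives a source text and a derived text and returns a score together with reasoning and the inconsistent spans.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical task-agnostic prompt; NOT the prompt used in the paper.
PROMPT_TEMPLATE = """You are checking whether a derived text is consistent with its source.

Source text:
{source}

Derived text:
{derived}

List every span of the derived text that the source does not support, explain why,
and give a consistency score from 1 (inconsistent) to 5 (fully consistent).
Answer as a JSON object with keys "score", "reasoning", and "inconsistent_spans"."""


@dataclass
class ConsistencyResult:
    score: int                     # assumed 1-5 scale
    reasoning: str                 # free-text explanation from the evaluator
    inconsistent_spans: List[str]  # spans of the derived text flagged as unsupported


def evaluate_consistency(
    source: str, derived: str, llm: Callable[[str], str]
) -> ConsistencyResult:
    """Fill the shared prompt, call the evaluator, and parse its JSON answer."""
    prompt = PROMPT_TEMPLATE.format(source=source, derived=derived)
    data = json.loads(llm(prompt))
    return ConsistencyResult(
        score=int(data["score"]),
        reasoning=str(data["reasoning"]),
        inconsistent_spans=[str(s) for s in data.get("inconsistent_spans", [])],
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned answer for the demo."""
    return json.dumps(
        {
            "score": 2,
            "reasoning": "The source says Tuesday, the derived text says Wednesday.",
            "inconsistent_spans": ["on Wednesday"],
        }
    )


result = evaluate_consistency(
    source="The meeting is on Tuesday at 3 pm.",
    derived="The meeting is on Wednesday.",
    llm=fake_llm,
)
print(result.score, result.inconsistent_spans)
```

Because the evaluator is passed in as a plain callable, the same function could be pointed at a summary and its article, a free-text answer and its reference, or a description and its source table rendered as text, without editing the prompt. A real deployment would also need to handle model answers that are not valid JSON, which this sketch does not do.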
