CL, Feb 10

SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick

arXiv:2602.10017v1 | 0.6 | h-index: 6

Originality: Incremental advance

AI Analysis

This paper addresses the need for better LLM evaluation frameworks in high-stakes applications such as natural hazard response. It is an incremental advance: it builds on existing evaluation methods by adding new dimensions.

The authors address the evaluation of LLM outputs in high-stakes, domain-specific settings by proposing a multi-dimensional, reference-free framework that assesses specificity, robustness, relevance, and context utilization. Using a curated dataset of 1,412 question-answer pairs and human evaluation, they show that no single metric suffices to capture answer quality.

Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
