CL · AI · Jun 23, 2025

Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics

arXiv:2506.18387v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

The paper addresses metric selection for evaluating causal explanations in medical reports. The contribution is incremental: it compares existing metrics rather than introducing new ones.

The study compared six metrics for evaluating causal explanations in diagnostic reports. GPT-Black showed the strongest discriminative power for identifying coherent and clinically valid narratives, while similarity-based metrics diverged from clinical reasoning quality.

This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics (BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment) across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflects task-specific priorities, and the other assigns equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of LLM-based evaluation for tasks requiring interpretability and causal reasoning.
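To make the two weighting strategies concrete, the following is a minimal sketch of how per-metric scores might be combined into a single score under equal weights versus task-specific weights. The abstract does not give the aggregation formula, the weight values, or any per-report scores, so everything here is an assumption for illustration: the normalized weighted mean, the function names, the weights in `TASK_WEIGHTS`, and the placeholder scores in `REPORT_A` and `REPORT_B` are all hypothetical and are not results from the paper.

```python
from typing import Dict, Iterable

# The six metrics named in the abstract (the string keys are our own labels).
METRICS = [
    "BERTScore", "CosineSimilarity", "BioSentVec",
    "GPT-White", "GPT-Black", "Expert",
]


def equal_weights(metrics: Iterable[str]) -> Dict[str, float]:
    """Assign the same weight to every metric."""
    return {m: 1.0 for m in metrics}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of metric scores; weights are normalized by their sum.

    ASSUMPTION: a normalized weighted mean is one plausible aggregation;
    the paper's actual formula is not stated in the abstract.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for metrics: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[m] * w for m, w in weights.items()) / total


# HYPOTHETICAL task-specific weights that favor LLM-based and expert judgments.
TASK_WEIGHTS = {
    "BERTScore": 1.0, "CosineSimilarity": 1.0, "BioSentVec": 1.0,
    "GPT-White": 2.0, "GPT-Black": 3.0, "Expert": 3.0,
}

# PLACEHOLDER scores in [0, 1], invented for illustration only.
# Report A: high surface similarity, weaker causal reasoning.
REPORT_A = {
    "BERTScore": 0.9, "CosineSimilarity": 0.9, "BioSentVec": 0.9,
    "GPT-White": 0.5, "GPT-Black": 0.5, "Expert": 0.5,
}
# Report B: lower surface similarity, stronger causal reasoning.
REPORT_B = {
    "BERTScore": 0.6, "CosineSimilarity": 0.6, "BioSentVec": 0.6,
    "GPT-White": 0.7, "GPT-Black": 0.7, "Expert": 0.7,
}

if __name__ == "__main__":
    for name, w in [("equal", equal_weights(METRICS)), ("task", TASK_WEIGHTS)]:
        a = weighted_score(REPORT_A, w)
        b = weighted_score(REPORT_B, w)
        print(f"{name:>5} weights: A={a:.3f}  B={b:.3f}")
```

With these placeholder numbers, report A ranks higher under equal weights while report B ranks higher under the task-specific weights, which illustrates why the choice of weighting can change the evaluation outcome.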

