Inference to the Best Explanation in Large Language Models
This work addresses the need for interpretable and efficient evaluation of LLM explanations, which is crucial for improving trust and verification in AI applications. Its contribution is incremental, since it builds on existing philosophical concepts.
The paper tackles the problem of evaluating explanations generated by large language models (LLMs). It proposes IBE-Eval, a framework that estimates explanation plausibility from logical and linguistic features. IBE-Eval selects the best causal explanation with up to 77% accuracy, roughly 27% above random and roughly 17% above a GPT 3.5-as-a-Judge baseline.
While Large Language Models (LLMs) have found success in real-world applications, their underlying explanatory process is still poorly understood. This paper proposes IBE-Eval, a framework inspired by philosophical accounts of Inference to the Best Explanation (IBE), to advance the interpretation and evaluation of LLMs' explanations. IBE-Eval estimates the plausibility of natural language explanations through a combination of explicit logical and linguistic features: consistency, parsimony, coherence, and uncertainty. Extensive experiments are conducted on Causal Question Answering (CQA), where IBE-Eval is tasked with selecting the most plausible causal explanation among competing ones generated by LLMs (i.e., GPT 3.5 and Llama 2). The experiments reveal that IBE-Eval can identify the best explanation with up to 77\% accuracy ($\approx 27\%$ above random), improving upon a GPT 3.5-as-a-Judge baseline ($\approx +17\%$) while being intrinsically more efficient and interpretable. Additional analyses suggest that, despite model-specific variances, LLM-generated explanations tend to conform to IBE criteria and that IBE-Eval is significantly correlated with human judgment, opening up opportunities for future development of automated explanation verification tools.
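To make the selection step concrete, the following is a minimal sketch of how competing explanations could be ranked once the four features have been scored. The abstract does not say how the features are extracted or combined, so everything here is an assumption made for illustration: the weighted-sum scoring rule, the weight values and their signs, the [0, 1] feature scale, and the names `ExplanationFeatures`, `plausibility`, and `select_best`.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ExplanationFeatures:
    """Hypothetical per-explanation feature scores, each assumed to lie in [0, 1]."""
    consistency: float  # assumed: 1.0 if the explanation is logically consistent
    parsimony: float    # assumed: higher means a simpler explanation
    coherence: float    # assumed: higher means the steps fit together better
    uncertainty: float  # assumed: higher means more hedged language


# Illustrative weights only; these are not taken from the paper.
# The negative sign on uncertainty is an assumption of this sketch.
WEIGHTS = {
    "consistency": 1.0,
    "parsimony": 0.5,
    "coherence": 0.5,
    "uncertainty": -0.5,
}


def plausibility(features: ExplanationFeatures) -> float:
    """Combine the feature scores into one number with a weighted sum."""
    return sum(weight * getattr(features, name) for name, weight in WEIGHTS.items())


def select_best(candidates: Mapping[str, ExplanationFeatures]) -> str:
    """Return the label of the candidate explanation with the highest score."""
    return max(candidates, key=lambda label: plausibility(candidates[label]))


# Two made-up competing explanations for the same causal question.
candidates = {
    "A": ExplanationFeatures(consistency=1.0, parsimony=0.8, coherence=0.7, uncertainty=0.1),
    "B": ExplanationFeatures(consistency=1.0, parsimony=0.4, coherence=0.5, uncertainty=0.6),
}
best = select_best(candidates)
```

With these made-up scores, explanation A is selected because it is assumed to be simpler, more coherent, and less hedged than B. A scoring rule of this kind stays inspectable, since each feature's contribution to the final choice can be read off directly.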