CL AI | Apr 17, 2025

Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

Sudesh Ramesh Bhagat, Ibne Farabi Shihab, Anuj Sharma

arXiv:2504.13068v24.93 citations | h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses how to evaluate models for safety-critical NLP tasks: it shows that accuracy alone is insufficient and argues that model assessment should also measure agreement with human experts.

This study found an inverse relationship between deep learning model accuracy and expert agreement in classifying crash narratives, with large language models showing stronger expert alignment despite lower accuracy.

This study investigates the relationship between deep learning (DL) model accuracy and expert agreement in classifying crash narratives. We evaluate five DL models -- including BERT variants, USE, and a zero-shot classifier -- against expert labels and narratives, and extend the analysis to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our findings reveal an inverse relationship: models with higher technical accuracy often show lower agreement with human experts, while LLMs demonstrate stronger expert alignment despite lower accuracy. We use Cohen's Kappa and Principal Component Analysis (PCA) to quantify and visualize model-expert agreement, and employ SHAP analysis to explain misclassifications. Results show that expert-aligned models rely more on contextual and temporal cues than location-specific keywords. These findings suggest that accuracy alone is insufficient for safety-critical NLP tasks. We argue for incorporating expert agreement into model evaluation frameworks and highlight the potential of LLMs as interpretable tools in crash analysis pipelines.
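To make the distinction between accuracy and expert agreement concrete, here is a minimal sketch of how the two measures can diverge. It is an illustration only, not the authors' pipeline: the labels are invented toy data ("I" for intersection-related, "N" for not), and the function names `accuracy` and `cohens_kappa` are hypothetical helpers. Accuracy is computed against the recorded dataset labels, while Cohen's Kappa measures chance-corrected agreement between the model and an expert reviewer.

```python
from collections import Counter


def accuracy(predicted, reference):
    """Fraction of items where the prediction equals the reference label."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(rater_a)
    # Observed agreement: share of items on which both raters match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (invented for illustration, not taken from the paper).
dataset_labels = ["I", "I", "I", "I", "N", "N", "N", "N"]  # recorded labels
expert_labels = ["I", "I", "N", "N", "N", "N", "I", "N"]   # expert review
model_pred = ["I", "I", "I", "I", "N", "N", "N", "I"]      # model output

print("accuracy vs. dataset labels:", accuracy(model_pred, dataset_labels))
print("kappa vs. expert labels:   ", cohens_kappa(model_pred, expert_labels))
```

In this toy setup the model matches the recorded labels on most items yet agrees with the expert only slightly better than chance, which is the kind of gap the abstract describes.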
