CL · HC · Apr 25

VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

Yurui Xiang, Xingyi Mao, Rui Sheng, Zixin Chen, Zelin Zang, Yuyang Wu, Haipeng Zeng, Huamin Qu, Yushi Sun, Yanna Lin

arXiv:2604.23356 · 83.2

Predicted impact: top 58% in CL (last 90 days) · Originality: incremental advance

AI Analysis

For developers of medical LLMs, this system addresses the challenge of debugging clinically grounded reasoning without requiring deep medical expertise.

VeriLLMed is a visual analytics system that uses knowledge graphs to help developers debug medical LLMs by identifying three recurring error types: relation errors, branch errors, and missing errors. Expert evaluation indicates that it helps developers detect clinically implausible reasoning and generate actionable insights for model improvement.

Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
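The comparison between a model's reasoning path and a knowledge graph-grounded reference path can be illustrated with a minimal sketch. The abstract names the three error classes but does not define them, so the definitions below are assumptions made only for illustration: a relation error is a model step that links the same two entities as a reference step but with a different relation, a branch error is a model step that leaves a reference entity toward an entity the reference does not reach from it, and a missing error is a reference step with no counterpart in the model path. The `Step` type, the `classify_errors` function, and the toy entities are hypothetical and are not taken from VeriLLMed.

```python
from typing import NamedTuple


class Step(NamedTuple):
    """One hop in a reasoning path (hypothetical representation)."""
    head: str
    relation: str
    tail: str


def classify_errors(model_path, reference_path):
    """Compare a model path with a reference path.

    Assumed definitions (not from the paper):
      relation: same (head, tail) as a reference step, different relation
      branch:   starts at a reference head but ends at a non-reference tail
      missing:  reference step whose (head, tail) pair the model never states
    """
    ref_relation = {(s.head, s.tail): s.relation for s in reference_path}
    ref_heads = {s.head for s in reference_path}
    model_pairs = {(s.head, s.tail) for s in model_path}

    errors = {"relation": [], "branch": [], "missing": []}
    for step in model_path:
        pair = (step.head, step.tail)
        if pair in ref_relation:
            if step.relation != ref_relation[pair]:
                errors["relation"].append(step)
        elif step.head in ref_heads:
            errors["branch"].append(step)
    for step in reference_path:
        if (step.head, step.tail) not in model_pairs:
            errors["missing"].append(step)
    return errors


# Toy data, made up for illustration only.
reference_path = [
    Step("fever", "symptom_of", "influenza"),
    Step("cough", "symptom_of", "influenza"),
    Step("influenza", "treated_by", "oseltamivir"),
]
model_path = [
    Step("fever", "causes", "influenza"),
    Step("influenza", "treated_by", "amoxicillin"),
]

errors = classify_errors(model_path, reference_path)
for kind, steps in errors.items():
    print(kind, [tuple(s) for s in steps])
```

Under these assumed definitions, a single divergence can surface twice: the model's wrong treatment step is reported as a branch error, and the reference treatment step it replaced is reported as a missing error. Whether VeriLLMed counts such cases once or twice is not stated in the abstract.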
