CV | Apr 10, 2024

Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding

Ke Zou, Yang Bai, Bo Liu, Yidi Chen, Zhihao Chen, Yang Zhou, Xuedong Yuan, Meng Wang, Xiaojing Shen, Xiaochun Cao, Yih Chung Tham, Huazhu Fu

arXiv:2404.06798v313.512 citations | h-index: 105 | IEEE Trans Pattern Anal Mach Intell

Originality: Incremental advance

AI Analysis

The paper addresses clinician workload and trust in medical image analysis, though the advance appears incremental because it builds on existing multimodal large language models.

The paper targets the manual extraction of key phrases from medical reports for image analysis. It introduces Medical Report Grounding (MRG), an end-to-end task that identifies diagnostic phrases and their corresponding grounding boxes, and proposes uMedGround, a framework that the authors report outperforms state-of-the-art methods.

Medical phrase grounding is crucial for identifying relevant regions in medical images based on phrase queries, facilitating accurate image analysis and diagnosis. However, current methods rely on manual extraction of key phrases from medical reports, reducing efficiency and increasing the workload for clinicians. Additionally, the lack of model confidence estimation limits clinical trust and usability. In this paper, we introduce a novel task called Medical Report Grounding (MRG), which aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. To address this challenge, we propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases by embedding a unique token, <BOX>, into the vocabulary to enhance detection capabilities. A vision encoder-decoder processes the embedded token and input image to generate grounding boxes. Critically, uMedGround incorporates an uncertainty-aware prediction model, significantly improving the robustness and reliability of grounding predictions. Experimental results demonstrate that uMedGround outperforms state-of-the-art medical phrase grounding methods and fine-tuned large visual-language models, validating its effectiveness and reliability. This study represents a pioneering exploration of the MRG task, marking the first-ever endeavor in this domain. Additionally, we demonstrate the applicability of uMedGround in medical visual question answering and class-based localization tasks, where it highlights visual evidence aligned with key diagnostic phrases, supporting clinicians in interpreting various types of textual inputs, including free-text reports, visual question answering queries, and class labels.
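As a rough illustration of the pipeline the abstract describes, the sketch below shows how phrases marked by the `<BOX>` token might be pulled from generated text, paired with predicted boxes, and filtered by an uncertainty score. Only the `<BOX>` token comes from the abstract. The output format (a phrase immediately followed by `<BOX>`), the `Grounding` record, the helper names, the box coordinates, and the uncertainty threshold are all assumptions made for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass

BOX_TOKEN = "<BOX>"  # special token named in the abstract


@dataclass
class Grounding:
    """Hypothetical record pairing a phrase with a box and an uncertainty."""
    phrase: str
    box: tuple          # assumed (x1, y1, x2, y2) in normalized coordinates
    uncertainty: float  # assumed score in [0, 1]; higher means less certain


def extract_phrases(generated):
    """Return the clause that immediately precedes each <BOX> token.

    Assumes the model emits "<phrase> <BOX>"; the real output format
    is not specified in the abstract.
    """
    segments = generated.split(BOX_TOKEN)
    phrases = []
    # The last segment follows the final token, so it has no box.
    for segment in segments[:-1]:
        clause = re.split(r"[.;,]", segment)[-1].strip()
        if clause:
            phrases.append(clause)
    return phrases


def pair_and_filter(phrases, boxes, uncertainties, max_uncertainty=0.5):
    """Pair phrases with boxes and keep only sufficiently certain ones."""
    if not (len(phrases) == len(boxes) == len(uncertainties)):
        raise ValueError("phrases, boxes, and uncertainties must align")
    results = []
    for phrase, box, u in zip(phrases, boxes, uncertainties):
        if u <= max_uncertainty:
            results.append(Grounding(phrase, box, u))
    return results
```

In this sketch the boxes and uncertainty values would come from the vision encoder-decoder and the uncertainty-aware prediction model; here they are simply passed in as lists, so the example shows only the bookkeeping around the `<BOX>` token, not the model itself.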
