CV | AI | Jan 13

GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards

arXiv:2601.08183v2 | h-index: 35
Originality: Incremental advance
AI Analysis

This work assesses the clinical utility of MLLMs in gastroenterology through a comprehensive benchmark and highlights critical limitations in spatial grounding and hallucination. The findings are significant for medical AI applications, but the advance is incremental because it builds on existing evaluation frameworks.

The paper evaluates multimodal large language models (MLLMs) in gastrointestinal endoscopy against clinical standards. Top models such as Gemini-3-Pro outperformed trainees in diagnostic reasoning (Macro-F1 0.641 vs. 0.492) and rivaled junior endoscopists (0.727). They were held back, however, by a spatial grounding bottleneck (human mIoU >0.506 vs. 0.345 for the best model) and by a fluency-accuracy paradox: model-generated reports were more readable than human reports but less factually correct.

Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. This study aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted: human lesion localization (mIoU >0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
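For readers unfamiliar with the two quantitative metrics named in the abstract, the following is a minimal sketch of how Macro-F1 and mIoU are commonly computed. It is not the GI-Bench evaluation code. The function names are hypothetical, the lesion labels in the usage note are made up, and the sketch assumes that localization is scored on axis-aligned bounding boxes in `(x1, y1, x2, y2)` form, which the abstract does not specify (segmentation masks would be scored analogously).

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list.

    Assumption: a class with no true or predicted samples is never scored,
    and a class with zero true positives contributes an F1 of 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, true_boxes):
    """Average IoU over paired predictions and references (one box per image)."""
    pairs = list(zip(pred_boxes, true_boxes))
    if not pairs:
        return 0.0
    return sum(box_iou(p, t) for p, t in pairs) / len(pairs)
```

As a usage illustration with invented data, `macro_f1(["polyp", "polyp", "ulcer", "ulcer"], ["polyp", "ulcer", "ulcer", "ulcer"])` averages a per-class F1 of 2/3 for "polyp" and 4/5 for "ulcer". Because every class carries equal weight, rare lesion categories influence Macro-F1 as much as common ones, which is presumably why it suits a benchmark with 20 fine-grained categories.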

