CV · Mar 1

GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

arXiv:2603.01108v1 · h-index: 39 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for clinically realistic evaluation of vision-language models in surgical AI systems. It is an incremental advance: it builds on existing segmentation benchmarks by adding language conditioning.

The authors tackled the problem of surgical tool segmentation by introducing GroundedSurg, a benchmark for language-conditioned, instance-level grounding, which revealed substantial performance gaps in modern models across diverse surgical procedures.

Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, capabilities that current evaluation paradigms do not capture. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables a systematic evaluation of vision-language models in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation models and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg
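To make the instance structure concrete, the sketch below models one benchmark sample (image, description, bounding box, point anchors, target mask) and scores a predicted mask with intersection-over-union. This is an illustration only: the class name `GroundingSample`, its field names, the file path, and the choice of IoU as the metric are assumptions, not the schema or evaluation protocol of the GroundedSurg repository.

```python
from dataclasses import dataclass
from typing import List, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


@dataclass
class GroundingSample:
    """Hypothetical record for one language-conditioned grounding instance."""
    image_path: str                   # assumed field: path to the surgical frame
    description: str                  # natural-language reference to ONE instrument
    bbox: Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_max, y_max), inclusive
    points: List[Tuple[int, int]]     # point-level anchors as (x, y)
    mask: Mask                        # ground-truth mask of the referred instrument


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of identical shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; avoid dividing by zero.
    return inter / union if union else 1.0


def points_inside_bbox(sample: GroundingSample) -> bool:
    """Sanity check: every point anchor should fall inside the bounding box."""
    x_min, y_min, x_max, y_max = sample.bbox
    return all(x_min <= x <= x_max and y_min <= y <= y_max
               for x, y in sample.points)


# Toy 4x4 example: the referred instrument covers a 2x2 block.
sample = GroundingSample(
    image_path="frames/example.png",  # hypothetical path
    description="the forceps grasping tissue on the left",
    bbox=(1, 1, 2, 2),
    points=[(1, 1), (2, 2)],
    mask=[[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]],
)

# A prediction that spills one column to the right of the target.
prediction = [[0, 0, 0, 0],
              [0, 1, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 0, 0]]

iou = mask_iou(prediction, sample.mask)
```

Because each description targets a single instrument, the score is computed against that one instance's mask rather than against every instrument of the same class, which is what separates this setting from category-level segmentation.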

