SURGIVID: Annotation-Efficient Surgical Video Object Discovery
This work addresses the need for efficient surgical video analysis in medical settings by reducing annotation demands, but it is incremental: it builds on existing self-supervised and weakly supervised methods.
The paper tackles pixel-wise localization of tools and anatomical structures in surgical videos, a task that typically requires extensive annotation. It proposes an annotation-efficient framework that combines self-supervised object discovery with minimal supervision. With only 36 annotation labels, the framework reaches localization performance comparable to fully-supervised models, and using surgical phase labels as weak labels yields a ~2% improvement in tool localization.
Surgical scenes convey crucial information about the quality of surgery. Pixel-wise localization of tools and anatomical structures is the first step towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully-supervised methods, which are annotation-hungry and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation-efficient framework for the semantic segmentation of surgical scenes. We employ image-based self-supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine-tuning step. Our unsupervised setup, reinforced with only 36 annotation labels, achieves localization performance comparable to fully-supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to a $\sim 2\%$ improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.
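To make the idea of image-based self-supervised object discovery more concrete, the following is a minimal sketch of one common family of such methods: spectral bipartition of a patch-affinity graph built from self-supervised features. The abstract does not state which discovery algorithm is used, so this is an illustrative assumption and not the authors' method. The function name `discover_salient_mask`, the threshold `tau`, and the "smaller partition is the object" heuristic are all hypothetical choices made for this example.

```python
import numpy as np


def discover_salient_mask(features, grid_hw, tau=0.2, eps=1e-5):
    """Split patches into salient object vs. background (illustrative only).

    features: (N, D) array of patch embeddings, assumed to come from some
              self-supervised backbone (hypothetical input).
    grid_hw:  (H, W) patch grid with H * W == N.
    """
    h, w = grid_hw
    # Cosine similarity between L2-normalised patch embeddings.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    # Binarised affinity graph: strong edge where similarity exceeds tau,
    # a tiny weight eps elsewhere so the graph stays connected.
    affinity = np.where(sim > tau, 1.0, eps)
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalised Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    lap = np.eye(len(f)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (Fiedler vector).
    _, vecs = np.linalg.eigh(lap)
    fiedler = vecs[:, 1] * d_inv_sqrt
    # Bipartition the patches around the mean of the Fiedler vector.
    partition = fiedler > fiedler.mean()
    # Assumed heuristic: the salient object covers the smaller partition.
    if partition.sum() > partition.size / 2:
        partition = ~partition
    return partition.reshape(h, w)
```

In a pipeline like the one the abstract describes, masks of this kind would serve only as initial proposals. The refinement with a handful of annotation labels or with surgical phase labels is a separate step that this sketch does not cover.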