CV | AI | Jul 29, 2024

Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets

arXiv:2407.19714v1 | 4 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses surgical scene understanding, a building block for intelligent systems in medical interventions. It is an incremental improvement with domain-specific impact.

The paper tackles semantic segmentation in surgical scenes by proposing SurgDepth, a multi-modal RGB-D fusion framework built on Vision Transformers. It reports state-of-the-art results, including an IoU of 0.86 on EndoVis2022, outperforming the previous best method by at least 4%.

Surgical scene understanding is a key technical component for enabling intelligent and context-aware systems that can transform various aspects of surgical interventions. In this work, we focus on the semantic segmentation task, propose a simple yet effective multi-modal (RGB and depth) training framework called SurgDepth, and show state-of-the-art (SOTA) results on all publicly available datasets applicable to this task. Previous approaches either fine-tune SOTA segmentation models trained on natural images, or encode RGB or RGB-D information using RGB-only pre-trained backbones. In contrast, SurgDepth, which is built on top of Vision Transformers (ViTs), is designed to encode both RGB and depth information through a simple fusion mechanism. We conduct extensive experiments on benchmark datasets including EndoVis2022, AutoLapro, LapI2I and EndoVis2017 to verify the efficacy of SurgDepth. Specifically, SurgDepth achieves a new SOTA IoU of 0.86 on the EndoVis2022 SAR-RARP50 challenge and outperforms the current best method by at least 4%, using a shallow and compute-efficient decoder consisting of ConvNeXt blocks.
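The headline number above is an IoU (intersection over union) score. As a rough illustration of what that metric measures, here is a minimal sketch in plain Python. The function names `per_class_iou` and `mean_iou` are hypothetical. Averaging over classes and skipping classes absent from both masks are assumptions, since the summary does not state the exact evaluation protocol used for EndoVis2022.

```python
from typing import Dict, Sequence


def per_class_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> Dict[int, float]:
    """Return IoU per class for two label masks of identical shape.

    Assumption: classes that appear in neither mask are left out of the
    result rather than scored as 0 or 1.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: counts once toward intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    return {
        c: intersection[c] / union[c]
        for c in range(num_classes)
        if union[c] > 0
    }


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Average the per-class IoU over the classes that are present."""
    scores = per_class_iou(pred, target, num_classes)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy 2x3 masks with three classes (illustrative values, not real data).
example_pred = [[0, 0, 1], [1, 1, 2]]
example_target = [[0, 1, 1], [1, 1, 2]]
example_scores = per_class_iou(example_pred, example_target, 3)
# example_scores == {0: 0.5, 1: 0.75, 2: 1.0}
```

On the toy masks, one mislabelled pixel lowers the IoU of both classes involved (0 and 1), which is why IoU is stricter than plain pixel accuracy for small structures such as surgical instruments.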

