Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2
This work addresses evaluation bias in surgical video AI. Reducing that bias is crucial for developing reliable real-time segmentation tools for surgeons and other medical professionals.
The study investigated how inconsistent annotation densities and frame rate sampling in surgical video datasets affect the evaluation of zero-shot segmentation models such as SAM2. It found that sparse sampling can make performance look misleadingly better because of a smoothing effect, whereas evaluation under real-time streaming conditions shows that higher frame rates provide superior segmentation stability for dynamic objects.
Real-time video segmentation is a promising opportunity for AI-assisted surgery, offering intraoperative guidance by identifying tools and anatomical structures. Despite growing interest in surgical video segmentation, annotation protocols vary widely across datasets: some provide dense, frame-by-frame labels, while others rely on sparse annotations sampled at low frame rates such as 1 FPS. In this study, we investigate how such inconsistencies in annotation density and frame rate sampling influence the evaluation of zero-shot segmentation models, using SAM2 on cholecystectomy procedures as a case study. Surprisingly, we find that under conventional sparse evaluation settings, lower frame rates can appear to outperform higher ones due to a smoothing effect that conceals temporal inconsistencies. However, when assessed under real-time streaming conditions, higher frame rates yield superior segmentation stability, particularly for dynamic objects like surgical graspers. To understand how these differences align with human perception, we conducted a survey among surgeons, nurses, and machine learning engineers and found that participants consistently preferred high-FPS segmentation overlays. This preference reinforces the importance of evaluating every frame in real-time applications rather than relying on sparse sampling strategies. Our findings highlight the risk of evaluation bias introduced by inconsistent dataset protocols and call attention to the need for temporally fair benchmarking in surgical video AI.
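
The gap between sparse and every-frame evaluation can be illustrated with a minimal Python sketch. It is not the evaluation protocol used in the study: the function names, the 25 FPS video rate, and the per-frame IoU values are all assumptions made up for illustration. It only shows how scoring a subset of frames sampled at 1 FPS can miss a short segmentation failure that every-frame scoring would catch.

```python
from typing import List, Sequence


def iou(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Intersection over union of two flat binary masks (1 = object pixel)."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0


def sparse_indices(n_frames: int, video_fps: int, eval_fps: int) -> List[int]:
    """Indices of the frames kept when scoring at eval_fps instead of video_fps."""
    if eval_fps <= 0 or eval_fps > video_fps:
        raise ValueError("eval_fps must be in the range 1..video_fps")
    stride = video_fps // eval_fps
    return list(range(0, n_frames, stride))


def mean_score(scores: Sequence[float], indices: Sequence[int]) -> float:
    """Average score over the selected frame indices."""
    return sum(scores[i] for i in indices) / len(indices)


def mean_abs_change(scores: Sequence[float]) -> float:
    """Mean absolute change between consecutive scores (a crude instability proxy)."""
    if len(scores) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical per-frame IoU for 2 seconds of 25 FPS video (toy numbers,
# not results from the study): stable tracking with a brief failure
# on frames 10..14, which fall between the 1 FPS sample points.
VIDEO_FPS = 25
frame_iou = [0.9] * 50
frame_iou[10:15] = [0.2] * 5

dense = list(range(len(frame_iou)))                       # score every frame
sparse = sparse_indices(len(frame_iou), VIDEO_FPS, 1)     # score at 1 FPS

dense_mean = mean_score(frame_iou, dense)
sparse_mean = mean_score(frame_iou, sparse)
dense_instability = mean_abs_change(frame_iou)
sparse_instability = mean_abs_change([frame_iou[i] for i in sparse])

print(f"every-frame mean IoU: {dense_mean:.3f}, instability: {dense_instability:.3f}")
print(f"1 FPS mean IoU:       {sparse_mean:.3f}, instability: {sparse_instability:.3f}")
```

In this toy setting the 1 FPS score is higher and shows no instability, simply because the failing frames are never sampled. Real masks would be scored per frame with a function like `iou` before the averaging step.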