Revisiting 3D Medical Scribble Supervision: Benchmarking Beyond Cardiac Segmentation
This work addresses misleading performance claims in scribble-supervised 3D medical image segmentation, offering researchers and practitioners incremental improvements to benchmarking and evaluation standards.
The paper tackles overfitting and poor generalization in 3D medical scribble supervision methods, which are predominantly tested on cardiac datasets. It introduces ScribbleBench, a benchmark spanning seven diverse datasets, and finds that simpler approaches such as nnU-Net with a partial loss outperform specialized methods.
Scribble supervision has emerged as a promising approach for reducing annotation costs in 3D medical segmentation by leveraging sparse annotations instead of voxel-wise labels. While existing methods report strong performance, a closer analysis reveals that the majority of research is confined to the cardiac domain, predominantly using the ACDC and MSCMR datasets. This over-specialization has resulted in severe overfitting, misleading claims of performance improvements, and a lack of generalization across broader segmentation tasks. In this work, we formulate a set of key requirements for practical scribble supervision and introduce ScribbleBench, a comprehensive benchmark spanning seven diverse medical imaging datasets, to systematically evaluate how well methods fulfill these requirements. We uncover a general failure of methods to generalize across tasks and find that many widely used novelties degrade performance outside the cardiac domain, whereas simpler, overlooked approaches achieve superior generalization. Finally, we draw attention to a strong yet overlooked baseline, nnU-Net coupled with a partial loss, which consistently outperforms specialized methods across a diverse range of tasks. By identifying fundamental limitations in existing research and establishing a new benchmark-driven evaluation standard, this work aims to steer scribble supervision toward more practical, robust, and generalizable methodologies for medical image segmentation.
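To make the idea of a partial loss concrete: in scribble supervision it is commonly understood as a segmentation loss evaluated only on the voxels that carry a scribble label, with all unannotated voxels left out of the average. The following is a minimal NumPy sketch of a partial cross-entropy under that reading. It is not the paper's or nnU-Net's implementation; the function name `partial_cross_entropy`, the `UNLABELED = -1` marker, and the `(C, D, H, W)` array layout are assumptions made for illustration.

```python
import numpy as np

# Hypothetical marker for voxels that carry no scribble annotation.
UNLABELED = -1


def partial_cross_entropy(logits, scribbles, unlabeled=UNLABELED):
    """Mean cross-entropy over scribble-annotated voxels only.

    logits:    float array of shape (C, D, H, W), raw class scores.
    scribbles: int array of shape (D, H, W), holding a class index
               for annotated voxels and `unlabeled` everywhere else.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))

    labeled = scribbles != unlabeled
    if not labeled.any():
        # No annotated voxel in this volume: nothing to supervise.
        return 0.0

    # Use class 0 as a placeholder index for unlabeled voxels so the
    # gather below is valid; those entries are masked out afterwards.
    safe_targets = np.where(labeled, scribbles, 0)
    picked = np.take_along_axis(log_probs, safe_targets[None], axis=0)[0]

    # Average the negative log-likelihood over annotated voxels only.
    return float(-picked[labeled].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(3, 4, 8, 8))       # 3 classes, small toy volume
    scribbles = np.full((4, 8, 8), UNLABELED)
    scribbles[1, 2, :] = 0                        # a background stroke
    scribbles[2, 5, 2:6] = 2                      # a foreground stroke
    print(partial_cross_entropy(logits, scribbles))
```

Because unannotated voxels never enter the average, predictions there receive no gradient from this term, which is why such a loss can be attached to a standard fully supervised pipeline without further changes to the architecture.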