nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation
This paper addresses validation standards for researchers in 3D medical image segmentation and highlights an innovation bias toward novel architectures.
The study finds that many recent claims of superior performance over U-Net baselines in 3D medical image segmentation fail under rigorous validation, and that state-of-the-art results come from combining CNN-based U-Net models with the nnU-Net framework and scaling the models to modern hardware resources.
The release of nnU-Net marked a paradigm shift in 3D medical image segmentation, demonstrating that a properly configured U-Net architecture could still achieve state-of-the-art results. Despite this, the pursuit of novel architectures, and the respective claims of superior performance over the U-Net baseline, continued. In this study, we demonstrate that many of these recent claims fail to hold up when scrutinized for common validation shortcomings, such as the use of inadequate baselines, insufficient datasets, and neglected computational resources. By meticulously avoiding these pitfalls, we conduct a thorough and comprehensive benchmarking of current segmentation methods including CNN-based, Transformer-based, and Mamba-based approaches. In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources. These results indicate an ongoing innovation bias towards novel architectures in the field and underscore the need for more stringent validation standards in the quest for scientific progress.