Can segmentation models be trained on fully synthetic data?
Training on synthetic data addresses data scarcity and limited variability in medical imaging, and makes model training possible without access to real patient data. The contribution is incremental, however, since it builds on existing generative methods.
The authors tackle the problem of limited medical image data for training segmentation models by proposing brainSPADE, a model that generates fully synthetic brain MRI images with controllable labels and styles. Segmentation models trained on this synthetic data achieve performance comparable to that of models trained on real data.
In order to achieve good performance and generalisability, medical image segmentation models should be trained on sizeable datasets with sufficient variability. Due to ethics and governance restrictions, and the costs associated with labelling data, scientific development is often stifled, with models trained and tested on limited data. Data augmentation is often used to artificially increase the variability in the data distribution and improve model generalisability. Recent works have explored deep generative models for image synthesis, as such an approach would enable the generation of an effectively infinite amount of varied data, addressing the generalisability and data access problems. However, many proposed solutions limit the user's control over what is generated. In this work, we propose brainSPADE, a model which combines a synthetic diffusion-based label generator with a semantic image generator. Our model can produce fully synthetic brain labels on-demand, with or without pathology of interest, and then generate a corresponding MRI image of an arbitrary guided style. Experiments show that brainSPADE synthetic data can be used to train segmentation models with performance comparable to that of models trained on real data.
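The two-stage idea in the abstract (first sample a label map, with or without pathology, then render an image of a chosen style from it) can be illustrated with a toy sketch. This is not the brainSPADE implementation: `sample_label_map` and `render_image` are hypothetical stand-ins for the diffusion-based label generator and the semantic image generator, and the label set, style names and intensity values are assumptions made purely for illustration.

```python
import random

# Illustrative label ids (assumed, not taken from the paper).
LESION = 4
HEALTHY_LABELS = [0, 1, 2, 3]  # e.g. background, CSF, grey matter, white matter

# Toy mean intensity per label for each style; the numbers are made up.
STYLE_INTENSITY = {
    "T1": {0: 0.0, 1: 0.2, 2: 0.5, 3: 0.8, 4: 0.4},
    "FLAIR": {0: 0.0, 1: 0.1, 2: 0.6, 3: 0.5, 4: 0.9},
}


def sample_label_map(size, with_pathology, rng):
    """Hypothetical stand-in for the label generator.

    Returns a size x size grid of label ids; a real generator would
    produce anatomically plausible maps rather than random ones.
    """
    labels = [[rng.choice(HEALTHY_LABELS) for _ in range(size)] for _ in range(size)]
    if with_pathology:
        # Place a single lesion pixel at a random location.
        labels[rng.randrange(size)][rng.randrange(size)] = LESION
    return labels


def render_image(labels, style, rng, noise=0.02):
    """Hypothetical stand-in for the semantic image generator.

    Maps each label to a style-dependent intensity plus a little noise.
    """
    table = STYLE_INTENSITY[style]
    return [[table[label] + rng.uniform(-noise, noise) for label in row] for row in labels]


def make_synthetic_pair(size=8, with_pathology=True, style="FLAIR", seed=0):
    """Produce one (image, labels) training pair from the two stages."""
    rng = random.Random(seed)
    labels = sample_label_map(size, with_pathology, rng)
    image = render_image(labels, style, rng)
    return image, labels
```

Because the label map is sampled before the image is rendered, the same seed with a different `style` yields the same labels with a different appearance, which mirrors the separation of content (labels) and style that the abstract describes. Pairs produced this way would then serve as the training set for a segmentation model.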