CV · AI · Jul 31, 2024

Synthetic Simplicity: Unveiling Bias in Medical Data Augmentation

arXiv:2407.21674v1 · 1 citation · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses a critical bias problem for researchers and practitioners using synthetic data in data-scarce fields like medical imaging, though it is incremental as it builds on known issues of spurious correlations.

The study investigated how synthetic data in medical imaging can lead to poor deployment performance due to models exploiting spurious correlations between data source and task labels, demonstrating this issue in digit classification and cardiac view classification tasks.

Synthetic data is becoming increasingly integral in data-scarce fields such as medical imaging, serving as a substitute for real data. However, its inherent statistical characteristics can significantly impact downstream tasks, potentially compromising deployment performance. In this study, we empirically investigate this issue and uncover a critical phenomenon: downstream neural networks often exploit spurious distinctions between real and synthetic data when there is a strong correlation between the data source and the task label. This exploitation manifests as simplicity bias, where models overly rely on superficial features rather than genuine task-related complexities. Through principled experiments, we demonstrate that the source of data (real vs. synthetic) can introduce spurious correlating factors leading to poor performance during deployment when the correlation is absent. We first demonstrate this vulnerability on a digit classification task, where the model spuriously utilizes the source of data instead of the digit to provide an inference. We provide further evidence of this phenomenon in a medical imaging problem related to cardiac view classification in echocardiograms, particularly distinguishing between 2-chamber and 4-chamber views. Given the increasing role of utilizing synthetic datasets, we hope that our experiments serve as effective guidelines for the utilization of synthetic datasets in model training.
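The mechanism described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's experimental setup: it swaps the neural network for a plain logistic regression, and every name and number in it (`make_data`, `train_logreg`, the feature means and noise levels) is hypothetical. Each sample has a noisy feature that genuinely predicts the label and a clean feature that encodes only whether the sample is real or synthetic. When source and label coincide during training, the classifier tends to lean on the source feature, and its accuracy drops once that correlation is absent at deployment.

```python
import numpy as np

def make_data(n, source_matches_label, rng):
    """Toy samples: one weak task feature and one clean 'source' artifact feature."""
    y = rng.integers(0, 2, size=n)  # task label (0 or 1)
    # Source flag: 1 = synthetic, 0 = real (hypothetical encoding).
    source = y.copy() if source_matches_label else rng.integers(0, 2, size=n)
    # Noisy but genuinely label-dependent feature.
    x_task = 0.5 * (2 * y - 1) + rng.normal(0.0, 1.0, size=n)
    # Nearly noise-free feature that only reveals real vs. synthetic.
    x_source = 1.0 * (2 * source - 1) + rng.normal(0.0, 0.1, size=n)
    return np.column_stack([x_task, x_source]), y

def train_logreg(X, y, lr=0.5, steps=500):
    """Logistic regression fitted with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        grad = p - y                            # gradient of log-loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, X, y):
    return float((((X @ w + b) > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)

# Training set where the data source coincides with the label.
X_train, y_train = make_data(4000, source_matches_label=True, rng=rng)
# Deployment set where the source is independent of the label.
X_deploy, y_deploy = make_data(4000, source_matches_label=False, rng=rng)
# Control training set where the source is already independent of the label.
X_ctrl, y_ctrl = make_data(4000, source_matches_label=False, rng=rng)

w, b = train_logreg(X_train, y_train)
w_ctrl, b_ctrl = train_logreg(X_ctrl, y_ctrl)

train_acc = accuracy(w, b, X_train, y_train)
deploy_acc = accuracy(w, b, X_deploy, y_deploy)
ctrl_deploy_acc = accuracy(w_ctrl, b_ctrl, X_deploy, y_deploy)

print("weights [task, source], shortcut-trained model:", w)
print("train accuracy, shortcut-trained model:", train_acc)
print("deploy accuracy, shortcut-trained model:", deploy_acc)
print("deploy accuracy, control model:", ctrl_deploy_acc)
```

The control model is included because the gap between the two deployment accuracies isolates the effect of the source/label correlation from the inherent difficulty of the toy task.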

