IV · CV · MED-PH · Feb 4, 2025

Style transfer as data augmentation: evaluating unpaired image-to-image translation models in mammography

arXiv:2502.02475v1 · 1 citation · h-index: 6 · EMBC
Originality: Synthesis-oriented
AI Analysis

This work addresses poor generalizability in breast cancer detection models for medical imaging. Its contribution is incremental, since it focuses on evaluation methods rather than introducing new models.

The paper tackles the evaluation of unpaired image-to-image translation models for style transfer in mammography. It highlights key considerations when choosing metrics and analyzes CycleGAN and SynDiff models across three datasets, showing that multiple metrics are needed for a comprehensive assessment.

Several studies indicate that deep learning models can learn to detect breast cancer from mammograms (X-ray images of the breasts). However, overfitting and poor generalisability prevent their routine use in the clinic. Models trained on data from one patient population may not perform well on another because of differences in their data domains, which arise from variations in scanning technology or patient characteristics. Data augmentation techniques can improve generalisability by altering existing examples to expand the diversity of feature representations in the training data. Image-to-image translation models are one approach capable of imposing the characteristic feature representations (i.e. style) of images from one dataset onto another. However, evaluating model performance is non-trivial, particularly in the absence of ground truths (a common reality in medical imaging). Here, we describe some key aspects that should be considered when evaluating style transfer algorithms, highlighting the advantages and disadvantages of popular metrics and important factors to be mindful of when implementing them in practice. We consider two types of generative models: a cycle-consistent generative adversarial network (CycleGAN) and a diffusion-based SynDiff model. We train both models to perform unpaired image-to-image translation across three mammography datasets. We highlight that undesirable aspects of model performance may determine the suitability of some metrics, and we also provide some analysis indicating the extent to which various metrics assess unique aspects of model performance. We emphasise the need to use several metrics for a comprehensive assessment of model performance.
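The abstract mentions analysing the extent to which various metrics assess unique aspects of model performance. One generic way to probe this, sketched here as an assumption and not necessarily the authors' procedure, is to score every translated image with each metric and compute pairwise Spearman rank correlations: metrics that are strongly correlated (in either direction) largely duplicate each other, while weakly correlated ones may capture different aspects. The metric names (`metric_a`, `metric_b`, `metric_c`) and scores below are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def metric_agreement(scores):
    """Map each pair of metric names to the Spearman correlation of their scores.

    scores: dict of metric name -> list of per-image scores, all lists in the
    same image order.
    """
    return {
        (a, b): spearman(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical per-image scores for five translated mammograms.
scores = {
    "metric_a": [0.91, 0.85, 0.78, 0.88, 0.70],
    "metric_b": [0.10, 0.14, 0.22, 0.12, 0.30],
    "metric_c": [0.5, 0.2, 0.9, 0.4, 0.6],
}
agreement = metric_agreement(scores)
for pair, rho in agreement.items():
    print(pair, round(rho, 3))
```

A correlation close to -1, as for `metric_a` and `metric_b` here, can simply mean one metric is higher-is-better and the other lower-is-better, so it is the magnitude that indicates redundancy. Rank correlation is used rather than Pearson on raw scores because image-quality metrics often live on different, non-linear scales.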
