LG CV Nov 26, 2023

How much data do I need? A case study on medical data

arXiv:2311.15331v1, 3.83 citations, h-index: 16

Originality: Synthesis-oriented

AI Analysis

This work addresses data efficiency and transfer learning strategies for practitioners in medical and general deep learning contexts, though it is incremental as it tests existing methods on new datasets.

The study challenges two common assumptions in deep learning by analyzing the impact of data quantity and transfer learning on performance. It finds that more data can lead to diminishing returns and that transfer learning can sometimes worsen results: source datasets that appear highly similar to the target Chest dataset can transfer worse than more dissimilar ones.

Collecting data to train a Deep Learning network is costly in terms of effort and resources. In many cases, especially in a medical context, it may also have detrimental impacts, such as requiring invasive medical procedures or processes which could in themselves cause medical harm. However, Deep Learning is seen as a data-hungry method. Here, we look at two commonly held adages: i) more data gives better results and ii) transfer learning will aid you when you don't have enough data. These are widely assumed to be true and are used as evidence when choosing how to solve a problem with Deep Learning. We evaluate six medical datasets and six general datasets, training a ResNet18 network on varying subsets of these datasets to evaluate 'more data gives better results'. We take eleven of these datasets as the sources for Transfer Learning on subsets of the twelfth dataset (Chest) in order to determine whether Transfer Learning is universally beneficial. We go further to see whether multi-stage Transfer Learning provides a consistent benefit. Our analysis shows that the real situation is more complex than these simple adages: more data could lead to a case of diminishing returns, and an incorrect choice of dataset for transfer learning can lead to worse performance, with datasets which we would consider highly similar to the Chest dataset giving worse results than datasets which are more dissimilar. Multi-stage transfer learning likewise reveals complex relationships between datasets.
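
The "varying subsets" step can be illustrated with a minimal sketch. The abstract does not say how subsets are drawn, so the approach here is an assumption: class-stratified, nested sampling, where each larger subset contains the smaller ones. The function name `stratified_subsets`, the fractions, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_subsets(labels, fractions, seed=0):
    """Return {fraction: sorted list of sample indices}, nested and stratified.

    Assumption: subsets are drawn per class so each keeps the full
    dataset's class proportions; the paper may sample differently.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Shuffle once per class so a larger fraction always extends a smaller one.
    for indices in by_class.values():
        rng.shuffle(indices)

    subsets = {}
    for frac in sorted(fractions):
        if not 0 < frac <= 1:
            raise ValueError(f"fraction must be in (0, 1], got {frac}")
        chosen = []
        for indices in by_class.values():
            # Keep at least one sample per class, even at tiny fractions.
            keep = max(1, round(frac * len(indices)))
            chosen.extend(indices[:keep])
        subsets[frac] = sorted(chosen)
    return subsets

# Hypothetical toy labels: 60 "normal" and 40 "abnormal" images.
labels = ["normal"] * 60 + ["abnormal"] * 40
subsets = stratified_subsets(labels, fractions=[0.1, 0.5, 1.0])
sizes = {frac: len(idx) for frac, idx in subsets.items()}
```

In an experiment of this kind, each index list would select the training images for a freshly initialized network, and test accuracy would be compared across subset sizes to see where returns begin to diminish.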
