CV, LG | Aug 20, 2021

Is it Time to Replace CNNs with Transformers for Medical Images?

arXiv:2108.09038v1 | 203 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the choice of model architecture facing medical imaging practitioners, but its contribution is incremental: it compares existing methods on standard datasets.

The study investigated whether vision transformers (ViTs) can replace CNNs for medical image diagnosis, finding that ViTs pretrained with self-supervision outperform CNNs, while off-the-shelf ViTs are on par with CNNs when pretrained on ImageNet.

Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis. Recently, vision transformers (ViTs) have appeared as a competitive alternative to CNNs, yielding similar levels of performance while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore whether it is time to move to transformer-based models or if we should keep working with CNNs - can we trivially switch to transformers? If so, what are the advantages and drawbacks of switching to ViTs for medical image diagnosis? We consider these questions in a series of experiments on three mainstream medical image datasets. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers using default hyperparameters are on par with CNNs when pretrained on ImageNet, and outperform their CNN counterparts when pretrained using self-supervision.

Code Implementations: 1 repo
