CV · AI · Apr 6

I Can't Believe TTA Is Not Better: When Test-Time Augmentation Hurts Medical Image Classification

arXiv:2604.09697 · 0.7 · h-index: 1
AI Analysis

This paper warns medical imaging practitioners that TTA should not be used as a default post-hoc improvement without validation on the specific model-dataset combination.

Test-time augmentation (TTA) degrades classification accuracy in nearly every model-dataset combination tested on MedMNIST v2, with drops of up to 31.6 percentage points, contrary to common belief. The paper attributes the degradation to distribution shift between augmented and training-time inputs, amplified by a batch normalization statistics mismatch.

Test-time augmentation (TTA)--aggregating predictions over multiple augmented copies of a test input--is widely assumed to improve classification accuracy, particularly in medical imaging where it is routinely deployed in production systems and competition solutions. We present a systematic empirical study challenging this assumption across three MedMNIST v2 benchmarks and four architectures spanning three orders of magnitude in parameter count (21K to 11M). Our principal finding is that TTA with standard augmentation pipelines consistently degrades accuracy relative to single-pass inference, with drops as severe as 31.6 percentage points for ResNet-18 on pathology images. This degradation affects all architectures, including convolutional models, and worsens with more augmented views. The sole exception is ResNet-18 on dermatology images, which gains a modest +1.6%. We identify the distribution shift between augmented and training-time inputs--amplified by batch normalization statistics mismatch--as the primary mechanism. Our ablation studies show that augmentation strategy matters critically: intensity-only augmentations preserve more performance than geometric transforms, and including the original unaugmented image partially mitigates but does not eliminate the accuracy drop. These findings serve as a cautionary note for practitioners: TTA should not be applied as a default post-hoc improvement but must be validated on the specific model-dataset combination.
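To make the comparison in the abstract concrete, here is a minimal NumPy sketch of TTA as "average the class probabilities over several augmented views", together with the single-pass versus TTA accuracy check the paper recommends. It is not the authors' code. The names `tta_predict`, `compare_tta`, `toy_logits`, and `hflip` are hypothetical, and the toy classifier and horizontal-flip augmentation are assumptions chosen only so the example runs without a trained model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def tta_predict(predict_logits, images, augmentations, include_original=True):
    """Average class probabilities over augmented views of each image.

    predict_logits: callable mapping a batch (N, H, W) to logits (N, C).
    augmentations: list of callables, each mapping a batch to a batch.
    """
    views = [images] if include_original else []
    views += [aug(images) for aug in augmentations]
    if not views:
        raise ValueError("need at least one view to aggregate")
    return np.mean([softmax(predict_logits(v)) for v in views], axis=0)

def accuracy(probs, labels):
    """Fraction of samples whose arg-max class matches the label."""
    return float((probs.argmax(axis=-1) == labels).mean())

def compare_tta(predict_logits, images, labels, augmentations,
                include_original=True):
    """Report single-pass and TTA accuracy on the same held-out data."""
    single = accuracy(softmax(predict_logits(images)), labels)
    tta = accuracy(
        tta_predict(predict_logits, images, augmentations, include_original),
        labels,
    )
    return {"single_pass": single, "tta": tta,
            "delta_pp": 100.0 * (tta - single)}

# --- Toy stand-ins (assumptions, not from the paper) ---------------------
def toy_logits(batch):
    """Orientation-sensitive 'model': class 1 if the left half is brighter."""
    left = batch[:, :, :2].mean(axis=(1, 2))
    right = batch[:, :, 2:].mean(axis=(1, 2))
    return np.stack([right, left], axis=-1)

def hflip(batch):
    """Horizontal flip of every image in the batch."""
    return batch[:, :, ::-1]

images = np.zeros((2, 4, 4))
images[0, :, :2] = 1.0   # left half bright  -> class 1
images[1, :, 2:] = 1.0   # right half bright -> class 0
labels = np.array([1, 0])

result = compare_tta(toy_logits, images, labels, [hflip])
print(result)
```

In this toy setup the label depends on left/right orientation, so averaging the original and flipped views cancels the very cue the classifier relies on, and TTA cannot beat single-pass inference. For a real model, the same `compare_tta` check would be run on a validation split with the actual augmentation pipeline before enabling TTA, and `include_original` toggles the mitigation the abstract mentions.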
