CV · AI · LG · Sep 29, 2025

Convolutional Neural Nets vs Vision Transformers: A SpaceNet Case Study with Balanced vs Imbalanced Regimes

arXiv:2510.03297v1
Originality: Synthesis-oriented
AI Analysis

This work provides a controlled benchmark for practitioners choosing between CNNs and Vision Transformers in remote sensing, but its contribution is incremental: it applies existing methods to new data regimes.

The study compares EfficientNet-B0 and ViT-Base on SpaceNet under imbalanced and balanced label regimes. Both models reach 93% test accuracy on the imbalanced split; on the balanced split, EfficientNet-B0 reaches 99% and ViT-Base remains competitive. The CNN retains an efficiency advantage in model size and latency.

We present a controlled comparison of a convolutional neural network (EfficientNet-B0) and a Vision Transformer (ViT-Base) on SpaceNet under two label-distribution regimes: a naturally imbalanced five-class split and a balanced-resampled split with 700 images per class (70:20:10 train/val/test). With matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and a 40-epoch budget on a single NVIDIA P100, we report accuracy, macro-F1, balanced accuracy, per-class recall, and deployment metrics (model size and latency). On the imbalanced split, EfficientNet-B0 reaches 93% test accuracy with strong macro-F1 and lower latency; ViT-Base is competitive at 93% with a larger parameter count and runtime. On the balanced split, both models are strong; EfficientNet-B0 reaches 99% while ViT-Base remains competitive, indicating that balancing narrows architecture gaps while CNNs retain an efficiency edge. We release manifests, logs, and per-image predictions to support reproducibility.
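The abstract reports macro-F1, balanced accuracy, and per-class recall alongside plain accuracy. The sketch below illustrates why these metrics matter under class imbalance. It is a minimal, dependency-free illustration, not the authors' evaluation code; the function names and the two-class toy labels are hypothetical and unrelated to the SpaceNet data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred, labels):
    """Recall per class: correct predictions / true samples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Assumption: a class with no true samples is scored 0.0.
    return {c: hits[c] / support[c] if support[c] else 0.0 for c in labels}


def per_class_precision(y_true, y_pred, labels):
    """Precision per class: correct predictions / predictions of that class."""
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Assumption: a class that is never predicted is scored 0.0.
    return {c: hits[c] / predicted[c] if predicted[c] else 0.0 for c in labels}


def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall."""
    recall = per_class_recall(y_true, y_pred, labels)
    return sum(recall.values()) / len(labels)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    recall = per_class_recall(y_true, y_pred, labels)
    precision = per_class_precision(y_true, y_pred, labels)
    f1_scores = []
    for c in labels:
        p, r = precision[c], recall[c]
        f1_scores.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return sum(f1_scores) / len(labels)


# Hypothetical toy data: 8 samples of class "a", 2 of class "b",
# and a degenerate classifier that always predicts "a".
labels = ["a", "b"]
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

print("accuracy:", accuracy(y_true, y_pred))
print("per-class recall:", per_class_recall(y_true, y_pred, labels))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred, labels))
print("macro-F1:", macro_f1(y_true, y_pred, labels))
```

In this toy case the always-predict-majority classifier scores well on plain accuracy while its balanced accuracy and macro-F1 stay low, because the minority class has zero recall. This is the reason to look at the class-averaged metrics, and not accuracy alone, when comparing models on the imbalanced split.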

