When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT
For practitioners in volumetric medical imaging, this work provides a resource-performance frontier and a failure-mode taxonomy. The results are preliminary: confidence intervals are wide, and the evidence does not support definitive superiority claims.
This paper examines how input dimensionality (2D, 2.5D, 3D) affects CNNs and Vision Transformers in lung CT classification. In our comparison, the 2.5D CNN offers the most favorable balance of discrimination and stability (ROC-AUC 0.682, 95% CI [0.546, 0.799]), while 3D CNNs suffer from threshold instability and transformers produce degenerate predictions.
Three-dimensional models are widely assumed to be preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify the added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as labeling every case positive. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. Within these limits, our results suggest that for class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.
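To make the reported quantities concrete, the following is a minimal sketch of how a ROC-AUC with a 95% confidence interval and a check for degenerate predictions could be computed. It is illustrative only: the text above does not state how the interval was obtained, so the percentile bootstrap is an assumption, as are the function names `roc_auc`, `bootstrap_ci`, and `is_degenerate`, the 0.5 decision threshold, and the toy data.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC (assumed method, not the paper's)."""
    point = roc_auc(labels, scores)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return point, stats[lo_idx], stats[hi_idx]


def is_degenerate(scores, threshold=0.5):
    """True when thresholding yields a single class, e.g. all-positive predictions."""
    preds = [s >= threshold for s in scores]
    return all(preds) or not any(preds)


# Toy data, not taken from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc, ci_low, ci_high = bootstrap_ci(toy_labels, toy_scores, n_boot=200)
```

With small or imbalanced test sets the bootstrap interval is wide, which is the situation the abstract describes. A model can also score a moderate ROC-AUC while still being degenerate at a fixed threshold, which is why the two checks are kept separate here.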