CVMay 18

A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition

Edwin Arkel Rios, Augusto Christian Surya, Oswin Gosal, Fernando Mikael, Mary Madeline Nicole, Kisoon Jang, Bo-Cheng Lai, Min-Chun Hu

arXiv:2605.187009.4Has Code

AI Analysis

For practitioners of fine-grained image recognition, this work provides a comprehensive analysis of training and evaluation settings, enabling more cost-effective model deployment.

This study investigates accuracy vs cost trade-offs in fine-grained image recognition across 2000+ experiments, finding that data-aware augmentations during training alone can achieve high accuracy without costly inference crops, reducing inference costs significantly.

Prior work on fine-grained image recognition (FGIR) has established the importance of the backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: \url{https://github.com/arkel23/FGIR-Backbones}

View on arXiv PDF Code

Similar