SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data
This work provides a more robust data augmentation technique for fine-grained recognition, which is crucial for tasks requiring subtle visual distinctions, benefiting researchers and practitioners in computer vision.
This paper addresses label noise in data mixing augmentation for fine-grained recognition. Existing methods usually mix labels according to pixel proportions, which misrepresents the content of a mixed image when the discriminative information lies in small regions. The authors propose SnapMix, a scheme that uses class activation maps (CAMs) to estimate the intrinsic semantic composition of a mixed image and generate its target label accordingly. SnapMix outperforms existing mixing-based approaches across various datasets and network depths.
Data mixing augmentation has proved effective in training deep models. Recent methods mix labels mainly based on the mixture proportion of image pixels. As the main discriminative information of a fine-grained image usually resides in subtle regions, methods along this line are prone to heavy label noise in fine-grained recognition. In this paper, we propose a novel scheme, termed Semantically Proportional Mixing (SnapMix), which exploits class activation maps (CAMs) to lessen the label noise in augmenting fine-grained data. SnapMix generates the target label for a mixed image by estimating its intrinsic semantic composition. This allows for asymmetric mixing operations and ensures semantic correspondence between synthetic images and target labels. Experiments show that our method consistently outperforms existing mixing-based approaches on various datasets and under different network depths. Furthermore, by incorporating mid-level features, the proposed SnapMix achieves top-level performance, demonstrating its potential to serve as a solid baseline for fine-grained recognition. Our code is available at https://github.com/Shaoli-Huang/SnapMix.git.
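
To make the idea of a semantically proportional label concrete, the sketch below computes label weights from two CAMs rather than from pixel areas. It is a minimal illustration under stated assumptions, not the released implementation: the function names (`semantic_percent_map`, `semantic_label_weights`), the box convention `(y0, y1, x0, x1)`, and the specific rule (normalize each CAM to sum to 1, then sum it over the kept or pasted region) are one plausible reading of "estimating intrinsic semantic composition". The CAMs are assumed to be precomputed, non-negative 2-D arrays.

```python
import numpy as np


def semantic_percent_map(cam):
    """Normalize a CAM so its entries sum to 1 (assumed normalization)."""
    cam = np.clip(np.asarray(cam, dtype=float), 0.0, None)
    total = cam.sum()
    if total == 0.0:
        # No activation anywhere: fall back to a uniform map.
        return np.full(cam.shape, 1.0 / cam.size)
    return cam / total


def semantic_label_weights(cam_a, cam_b, box_a, box_b):
    """Label weights when region box_b of image B is pasted over box_a of image A.

    Boxes are half-open (y0, y1, x0, x1) index ranges (hypothetical convention).
    The two boxes may differ in position and size (asymmetric mixing), so the
    returned weights need not sum to 1.
    """
    spm_a = semantic_percent_map(cam_a)
    spm_b = semantic_percent_map(cam_b)
    ya0, ya1, xa0, xa1 = box_a
    yb0, yb1, xb0, xb1 = box_b
    # Semantic share of A that survives after its box is overwritten.
    rho_a = 1.0 - spm_a[ya0:ya1, xa0:xa1].sum()
    # Semantic share of B carried by the patch that is pasted in.
    rho_b = spm_b[yb0:yb1, xb0:xb1].sum()
    return float(rho_a), float(rho_b)


# Toy example: all of A's activation sits in its top-left quadrant,
# while B's activation is spread uniformly.
cam_a = np.zeros((4, 4))
cam_a[:2, :2] = 1.0
cam_b = np.ones((4, 4))

# Overwrite A's bottom-right quadrant with B's top-left quadrant.
rho_a, rho_b = semantic_label_weights(cam_a, cam_b, (2, 4, 2, 4), (0, 2, 0, 2))
print(rho_a, rho_b)
```

In this toy case a pixel-proportion rule would assign weights 0.75 and 0.25, whereas the CAM-based rule keeps the full weight for image A because the overwritten region held none of its activation. The mixed target would then combine the two class labels with these weights, for example by weighting the two per-class loss terms.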