DiffProb: Data Pruning for Face Recognition
DiffProb addresses efficiency and privacy concerns for face recognition researchers and practitioners by reducing reliance on massive datasets, though the contribution is incremental in that it applies data pruning to a specific domain.
The paper tackles the high computational and storage costs of face recognition training by introducing DiffProb, a data pruning method that removes redundant and mislabeled samples, achieving up to 50% dataset reduction while maintaining or improving verification accuracy on benchmarks such as LFW and IJB-C.
Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to computational training cost and data storage, as well as potential privacy concerns around managing large face datasets. This paper presents DiffProb, the first data pruning approach for face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes those with identical or close prediction probability values, as they likely reinforce the same decision boundaries and thus contribute little new information. We further enhance this process with an auxiliary cleaning mechanism that eliminates mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or, in some settings, even improving verification accuracy. Additionally, we demonstrate DiffProb's robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
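The pruning idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the exact closeness criterion, the threshold, or the cleaning rule, so the following are assumptions made for illustration: each sample is scored by the probability a pretrained model assigns to its labeled identity, a sample is pruned when its score lies within a hypothetical `threshold` of the last kept sample of the same identity, and a sample is treated as suspected mislabeled when the model's top prediction disagrees with its label. The function names are likewise hypothetical.

```python
import numpy as np


def prune_close_probabilities(gt_probs, labels, threshold):
    """Return sorted indices of samples kept after per-identity pruning.

    gt_probs[i] is the probability a pretrained model assigns to the labeled
    identity of sample i. The fixed `threshold` is an assumption of this
    sketch, not a value taken from the paper.
    """
    gt_probs = np.asarray(gt_probs, dtype=float)
    labels = np.asarray(labels)
    keep = []
    for identity in np.unique(labels):
        idx = np.flatnonzero(labels == identity)
        # Sort this identity's samples by their ground-truth-class probability.
        order = idx[np.argsort(gt_probs[idx], kind="stable")]
        last_kept = None
        for i in order:
            # Keep a sample only if it is far enough from the last kept one.
            if last_kept is None or gt_probs[i] - gt_probs[last_kept] > threshold:
                keep.append(int(i))
                last_kept = i
    return sorted(keep)


def drop_suspected_mislabeled(pred_labels, labels):
    """Return indices whose predicted identity matches the label.

    Assumption: a disagreement between prediction and label marks a sample
    as suspected mislabeled. The paper's cleaning rule may differ.
    """
    pred_labels = np.asarray(pred_labels)
    labels = np.asarray(labels)
    return np.flatnonzero(pred_labels == labels).tolist()
```

For example, with `gt_probs = [0.90, 0.91, 0.50, 0.905, 0.30, 0.31]`, `labels = [0, 0, 0, 0, 1, 1]`, and `threshold = 0.05`, the sketch keeps samples 0, 2, and 4: the three samples of identity 0 near 0.90 collapse to one representative, as do the two samples of identity 1 near 0.30.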