Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
For safety-critical hand detection applications facing distribution shifts due to accessories like gloves, the paper demonstrates that careful training schedules can extract practical benefit from synthetic data.
The paper investigates whether generative inpainting of hand accessories (gloves, tattoos, etc.) can improve hand detection in occupational safety settings. Results show that a two-stage training schedule (real+synthetic pre-training then real-only fine-tuning) increases mAP@0.5 over the real-only baseline and reduces the out-of-distribution gap for gloved hands, while a three-stage schedule achieves the highest mAP@0.5:0.95.
Generated (synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. They therefore under-represent the variation in hand appearance introduced by gloves and other personal protective equipment, tattoos, and jewelry, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, which edits only the hand region of a real photograph to introduce accessories, can close this gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report mean average precision (mAP) under two overlap criteria (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage schedule, which trains on real ∪ synthetic data and then fine-tunes the resulting weights on real-only data at a lower learning rate, increases mAP@0.5 over the real-only baseline on the standard real test set and reduces the out-of-distribution gap on real gloved hands. A three-stage schedule preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage schedules extract substantial real-deployment benefit from inpainted accessory data.
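The two-stage schedule described in the abstract can be sketched as a chain of training stages in which each stage starts from the weights produced by the previous one. The following is a minimal illustration, not the authors' code: the `Stage` type, the `run_schedule` helper, the dataset labels, and both learning-rate values are hypothetical, and `dry_run_train` is a stand-in that only records what would be trained. In practice the training callback would wrap an actual detector training call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    """One stage of a training schedule (hypothetical structure)."""
    name: str
    datasets: Sequence[str]  # data pools used in this stage
    lr: float                # initial learning rate (assumed value)


# Assumed two-stage schedule: real + synthetic first, then real-only
# fine-tuning at a lower learning rate. The rates are placeholders,
# not values reported in the paper.
TWO_STAGE = [
    Stage("pretrain", ("real", "synthetic"), lr=1e-2),
    Stage("finetune", ("real",), lr=1e-3),
]

# A training callback takes the current weights and a stage,
# and returns the weights produced by training that stage.
TrainFn = Callable[[str, Stage], str]


def run_schedule(stages: Sequence[Stage], train_fn: TrainFn, init_weights: str) -> str:
    """Run stages in order, feeding each stage's output weights to the next."""
    weights = init_weights
    for stage in stages:
        weights = train_fn(weights, stage)
    return weights


def dry_run_train(weights: str, stage: Stage) -> str:
    """Stand-in for real training: appends a description of the stage."""
    data = "+".join(stage.datasets)
    return f"{weights}->{stage.name}[{data}@lr={stage.lr:g}]"


if __name__ == "__main__":
    print(run_schedule(TWO_STAGE, dry_run_train, "yolov8n.pt"))
```

Keeping the schedule as data makes it straightforward to describe further regimes, such as a three-stage variant, by adding entries to the list while reusing the same runner and evaluation code across seeds.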