Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
The paper addresses synthesis biases that arise when models are trained on synthetic data and offers a direct method for improving robustness, although the contribution is incremental because it builds on existing synthetic-data approaches.
The paper tackles the problem of models learning spurious correlations from synthetic data. It proposes a framework that uses provenance information to guide input gradients, suppressing reliance on non-target regions and promoting discriminative representations for target regions. Experiments show effectiveness across weakly supervised object localization, spatio-temporal action localization, and image classification.
Learning from synthetic data has attracted attention as an effective way to increase the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly, through the diversification of training samples, and do not explicitly teach the model which regions of the input space truly contribute to discrimination. Consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information as an auxiliary supervisory signal. This information is obtained during the training data synthesis process and indicates whether each region of the input space originates from the target object; it is used to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed according to the target and non-target regions identified during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This reduces the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.
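The gradient decomposition described in the abstract can be illustrated with a toy example. The abstract does not give the paper's loss, so everything below is an assumption made for illustration: a logistic model whose input gradient has a closed form, a binary provenance mask (1 for target, 0 for non-target), a squared-norm penalty on the non-target part of the gradient, and the hypothetical names `input_gradient`, `split_by_provenance`, `guidance_penalty`, `guided_loss`, and `lam`. It is a minimal sketch of the general idea, not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Logistic model: probability of the positive class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    # Closed-form gradient of binary cross-entropy with respect to the input x.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def split_by_provenance(grad, mask):
    # mask[i] == 1: element i comes from the target object; 0: it does not.
    target = [g * m for g, m in zip(grad, mask)]
    non_target = [g * (1 - m) for g, m in zip(grad, mask)]
    return target, non_target

def guidance_penalty(grad, mask):
    # Assumed penalty: squared L2 norm of the non-target gradient component.
    _, non_target = split_by_provenance(grad, mask)
    return sum(g * g for g in non_target)

def guided_loss(w, b, x, y, mask, lam=1.0):
    # Task loss plus the weighted penalty on non-target input gradients.
    p = predict(w, b, x)
    ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return ce + lam * guidance_penalty(input_gradient(w, b, x, y), mask)

# Toy synthetic sample: the first two inputs are target, the last two are not.
x = [1.0, 2.0, 0.5, -1.0]
mask = [1, 1, 0, 0]
w_bg = [0.1, 0.1, 2.0, -2.0]  # relies mostly on non-target inputs
w_fg = [1.0, 1.0, 0.0, 0.0]   # relies only on target inputs

print(guidance_penalty(input_gradient(w_bg, 0.0, x, 1), mask))
print(guidance_penalty(input_gradient(w_fg, 0.0, x, 1), mask))
```

In this toy setting the penalty vanishes for the weights that ignore non-target inputs and is positive for the weights that depend on them, so minimizing `guided_loss` would push a model toward the former. With a deep network the input gradient would come from automatic differentiation rather than a closed form.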