Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier
For researchers in audio event detection, this work shows that post-hoc attribution methods can localize sound events without temporal labels, though performance is still below supervised approaches.
Integrated gradients applied to a sound classifier trained without temporal supervision achieves temporal event detection with a mean IoU of 0.39, a frame-level F1 of 0.52, and a Pointing Game accuracy of 82.6%, approaching the IoU and F1 of weakly and strongly supervised framewise CNNs while trailing them in Pointing Game accuracy.
Gradient-based attribution methods can highlight input regions important for neural network predictions, but their effectiveness for temporal sound event detection in audio classification has not been systematically evaluated. This paper assesses whether integrated gradients (IG) can temporally detect sound events when applied to a classifier trained without temporal supervision. We use synthetic polyphonic audio with ground-truth timestamps to measure alignment between IG attributions and event boundaries. On a 10-class domestic sound dataset, IG achieves a mean Intersection over Union (IoU) of 0.39, a frame-level F1 of 0.52, and a Pointing Game (PG) accuracy of 82.6\%. For comparison, a framewise CNN trained with weak supervision (FW-WS, clip-level training labels) achieves 0.42 IoU, 0.55 F1, and 97.3\% PG, while a strongly supervised variant (FW-SS, frame-level training labels) reaches 0.45 IoU, 0.58 F1, and 97.9\% PG. Overall, these results suggest that post-hoc IG captures meaningful temporal activity patterns of sound events, with IoU and F1 approaching those of models that explicitly produce frame-level predictions, although its PG accuracy remains roughly 15 points lower. All methods substantially outperform random and energy-based baselines.
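To make the three reported metrics concrete, the following is a minimal sketch of how temporal IoU, frame-level F1, and a Pointing Game hit could be computed for a single event from a 1-D attribution curve and a binary ground-truth mask. The function names, the min-max normalization, and the 0.5 threshold are illustrative assumptions, not the paper's actual evaluation protocol, and the toy arrays are made up for the example.

```python
import numpy as np

def binarize(attribution, threshold=0.5):
    """Min-max normalize a 1-D attribution curve, then threshold it.
    The normalization and the 0.5 threshold are assumptions for illustration."""
    a = np.asarray(attribution, dtype=float)
    span = a.max() - a.min()
    if span == 0:
        return np.zeros(a.shape, dtype=bool)
    return (a - a.min()) / span >= threshold

def temporal_iou(pred, truth):
    """Intersection over Union between predicted and true active frames."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0

def frame_f1(pred, truth):
    """Frame-level F1: each time frame is one binary decision."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def pointing_game_hit(attribution, truth):
    """Hit if the frame with maximal attribution lies inside the true event."""
    truth = np.asarray(truth, dtype=bool)
    return bool(truth[int(np.argmax(attribution))])

# Toy example: 8 frames, with the event active on frames 2 to 5 (hypothetical data).
attribution = np.array([0.0, 0.1, 0.9, 1.0, 0.6, 0.2, 0.0, 0.0])
truth = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)

pred = binarize(attribution)
iou = temporal_iou(pred, truth)
f1 = frame_f1(pred, truth)
hit = pointing_game_hit(attribution, truth)
print(iou, f1, hit)
```

In this toy case the thresholded attribution covers three of the four active frames, so IoU and F1 fall below 1 even though the Pointing Game registers a hit. This shows how a method can score well on Pointing Game accuracy while its boundary alignment stays moderate.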