An Explainable Proxy Model for Multilabel Audio Segmentation
This work provides an explainable AI solution for audio indexing applications, though the contribution is incremental, as it adapts existing methods for transparency.
The paper tackles the problem of multilabel audio segmentation for speech, music, noise, and overlapped speech detection by proposing an explainable proxy model based on non-negative matrix factorization, achieving performance similar to that of a pre-trained black-box model on two datasets.
Audio signal segmentation is a key task for automatic audio indexing. It consists of detecting the boundaries of class-homogeneous segments in the signal. In many applications, explainable AI is vital for the transparency of decisions made with machine learning. In this paper, we propose an explainable multilabel segmentation model that solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD) simultaneously. This proxy uses non-negative matrix factorization (NMF) to map the embedding used for the segmentation to the frequency domain. Experiments conducted on two datasets show performance similar to that of the pre-trained black-box model while showing strong explainability features. Specifically, the frequency bins used for the decision can be easily identified at both the segment level (local explanations) and the global level (class prototypes).
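To make the role of NMF more concrete, the sketch below factorizes a toy non-negative spectrogram-like matrix into a spectral dictionary and per-frame activations, then reads off the frequency bins that dominate one frame. It is only an illustration of generic NMF, not the proxy model described above: the function name `nmf`, the multiplicative-update solver, the toy data, and the way dominant bins are selected are all assumptions made for this example.

```python
import numpy as np

def nmf(V, n_components, n_iter=500, eps=1e-9, seed=0):
    """Approximate non-negative V (freq_bins x frames) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius loss.
    This solver is an assumption; the source does not specify one.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps    # spectral dictionary
    H = rng.random((n_components, n_frames)) + eps  # per-frame activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (hypothetical): two spectral patterns active in different frames.
rng = np.random.default_rng(1)
true_W = np.zeros((16, 2))
true_W[2:5, 0] = 1.0     # low-frequency pattern
true_W[10:14, 1] = 1.0   # high-frequency pattern
true_H = np.zeros((2, 20))
true_H[0, :10] = 1.0     # first pattern active in frames 0..9
true_H[1, 10:] = 1.0     # second pattern active in frames 10..19
V = true_W @ true_H + 0.01 * rng.random((16, 20))

W, H = nmf(V, n_components=2)

# Relative reconstruction error of the factorization.
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# A simple "local explanation" for one frame: project its activations
# back to the frequency domain and keep the strongest bins.
frame = 3
frame_spectrum = W @ H[:, frame]
top_bins = set(np.argsort(frame_spectrum)[-3:].tolist())
```

In this toy setting, the columns of `W` play the role of frequency-domain components, and `W @ H[:, frame]` shows which bins a given frame relies on. How the paper ties activations to class decisions and builds class prototypes is not detailed in the abstract, so that part is deliberately left out.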