ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
The work addresses inefficiencies in audio representation learning for tasks such as classification and keyword spotting, though the contribution appears incremental because it adapts an existing attention mechanism.
The paper tackles irrelevant attention allocation in Transformers for audio self-supervised learning by introducing a differential attention mechanism, and reports state-of-the-art results such as 49.0% mAP on AS-2M and 98.3% accuracy on SPC-2.
In recent audio self-supervised representation learning, the standard Transformer architecture has become the predominant approach, yet its attention mechanism often assigns part of its attention weight to irrelevant information, which can impair the model's discriminative ability. To address this, we introduce a differential attention mechanism that mitigates ineffective attention allocation by combining dual-softmax operations with appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS-20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
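The abstract does not spell out the exact formulation, so the following is only a minimal single-head sketch of the general idea of a dual-softmax differential attention: two softmax attention maps are computed from separate query/key projections, and the second is subtracted from the first after scaling by a differential coefficient. The function name `differential_attention`, the weight shapes, the fixed scalar `lam`, and the random inputs standing in for spectrogram patch embeddings are all assumptions made for illustration, not details taken from ASDA itself.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(x, w_q, w_k, w_v, lam):
    """Single-head differential attention (illustrative sketch).

    x:        (n_tokens, d_model) input embeddings
    w_q, w_k: (d_model, 2 * d_head) projections, split into two halves
    w_v:      (d_model, d_v) value projection
    lam:      scalar differential coefficient (assumed fixed here)
    """
    q1, q2 = np.split(x @ w_q, 2, axis=-1)
    k1, k2 = np.split(x @ w_k, 2, axis=-1)
    v = x @ w_v
    scale = 1.0 / np.sqrt(q1.shape[-1])

    # Two independent softmax attention maps.
    a1 = softmax(q1 @ k1.T * scale)
    a2 = softmax(q2 @ k2.T * scale)

    # Subtracting the second map cancels attention mass shared by both.
    weights = a1 - lam * a2
    return weights @ v, weights


# Hypothetical toy input: 6 patch tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 8
x = rng.normal(size=(n_tokens, d_model))
w_q = rng.normal(size=(d_model, 2 * d_head))
w_k = rng.normal(size=(d_model, 2 * d_head))
w_v = rng.normal(size=(d_model, d_head))

out, weights = differential_attention(x, w_q, w_k, w_v, lam=0.5)
```

In this sketch each row of `weights` sums to `1 - lam` rather than 1, and individual entries can be negative, which is how attention that both maps place on the same irrelevant tokens gets suppressed. Setting `lam=0` recovers ordinary softmax attention on the first query/key pair.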