Understanding Self-Attention of Self-Supervised Audio Transformers
This work addresses the limited interpretability of self-supervised audio transformers, a concern for researchers and practitioners in speech processing. Its contribution is incremental: it analyzes existing models rather than introducing a new paradigm.
The paper studies how self-attention works in self-supervised audio transformers, which are widely used in speech applications such as automatic speech recognition (ASR). It develops four analysis strategies: categorization of attention, a visualization tool for multi-head self-attention, importance ranking to identify critical attention, and refinement techniques to improve model performance.
Self-supervised Audio Transformers (SAT) have enabled great success in many downstream speech applications such as ASR, but how they work has not yet been widely explored. In this work, we present multiple strategies for analyzing the attention mechanisms in SAT. We categorize attentions into explainable categories and find that each category has its own functionality. We provide a visualization tool for understanding multi-head self-attention, importance ranking strategies for identifying critical attention, and attention refinement techniques to improve model performance.
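As a rough illustration of what analyzing attention per head can look like, the sketch below computes scaled dot-product attention maps for several heads and summarizes each head with one number. It is not the paper's method: the function names `attention_maps` and `mean_attention_distance`, the random projection weights, and the distance statistic itself are assumptions made for this example. The statistic is the expected distance between a query frame and the frames it attends to, so a small value suggests a head that looks at nearby frames and a large value suggests a head that spreads attention across the utterance.

```python
import numpy as np

def attention_maps(x, w_q, w_k):
    """Return per-head attention weights of shape (heads, T, T).

    x:   (T, d_model) input frames
    w_q: (heads, d_model, d_head) query projections (hypothetical weights)
    w_k: (heads, d_model, d_head) key projections (hypothetical weights)
    """
    q = np.einsum("td,hdk->htk", x, w_q)
    k = np.einsum("td,hdk->htk", x, w_k)
    # Scaled dot-product scores between every query frame and key frame.
    scores = np.einsum("htk,hsk->hts", q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def mean_attention_distance(attn):
    """Expected |query index - key index| per head, averaged over queries.

    This is an illustrative statistic, not one taken from the paper.
    """
    t = attn.shape[-1]
    idx = np.arange(t)
    dist = np.abs(idx[:, None] - idx[None, :])       # (T, T)
    return (attn * dist).sum(axis=-1).mean(axis=-1)  # (heads,)

# Toy input: random frames and random projections stand in for a real model.
rng = np.random.default_rng(0)
T, d_model, heads, d_head = 50, 16, 4, 8
x = rng.standard_normal((T, d_model))
w_q = rng.standard_normal((heads, d_model, d_head))
w_k = rng.standard_normal((heads, d_model, d_head))

attn = attention_maps(x, w_q, w_k)
print(mean_attention_distance(attn))  # one value per head
```

With a trained model, the attention maps would come from the model's own layers rather than from random weights, and sorting heads by a statistic of this kind is one simple way to group them before inspecting them visually.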