SD CL AS

Mar 3, 2023

DWFormer: Dynamic Window transFormer for Speech Emotion Recognition

Shuaiqi Chen, Xiaofen Xing, Weibin Zhang, Weidong Chen, Xiangmin Xu

arXiv:2303.01694v1

10.631 citations

h-index: 37

Has Code

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific problem in speech emotion recognition for human-computer interaction, with incremental improvements over existing transformer-based models.

The authors address the problem of precisely locating important temporal regions at varying scales in speech emotion recognition. They propose DWFormer, a dynamic window transformer architecture, which outperforms previous state-of-the-art methods on the IEMOCAP and MELD datasets.

Speech emotion recognition is crucial to human-computer interaction. The temporal regions that represent different emotions are scattered locally across different parts of the speech. Moreover, the temporal scales of important information may vary over a large range within and across speech segments. Although transformer-based models have made progress in this field, existing models cannot precisely locate important regions at different temporal scales. To address this issue, we propose Dynamic Window transFormer (DWFormer), a new architecture that leverages temporal importance by dynamically splitting samples into windows. A self-attention mechanism is applied within windows to capture temporally important information locally in a fine-grained way. Cross-window information interaction is also taken into account for global communication. DWFormer is evaluated on both the IEMOCAP and the MELD datasets. Experimental results show that the proposed model achieves better performance than the previous state-of-the-art methods.
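The windowing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how importance is computed or how window boundaries are chosen. The per-frame `importance` scores, the fixed `threshold`, the identity Q/K/V projections, and the mean-pooled window tokens below are all assumptions made only for illustration.

```python
import math

def split_into_windows(importance, threshold):
    """Group consecutive frames on the same side of `threshold` into
    (start, end) windows. The thresholding rule is an assumption."""
    if not importance:
        return []
    windows, start = [], 0
    for t in range(1, len(importance)):
        if (importance[t] >= threshold) != (importance[t - 1] >= threshold):
            windows.append((start, t))
            start = t
    windows.append((start, len(importance)))
    return windows

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product attention with identity
    projections (Q = K = V = frames), a simplification."""
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)])
    return out

def windowed_attention(frames, importance, threshold):
    """Apply self-attention independently inside each dynamic window."""
    windows = split_into_windows(importance, threshold)
    out = []
    for start, end in windows:
        out.extend(self_attention(frames[start:end]))
    return windows, out

def cross_window_attention(frames, windows):
    """Mean-pool each window to one token (an assumption), then attend
    across those tokens so windows can exchange information."""
    d = len(frames[0])
    tokens = [[sum(f[j] for f in frames[s:e]) / (e - s) for j in range(d)]
              for s, e in windows]
    return self_attention(tokens)

# Hypothetical 6-frame utterance with 2-dimensional features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [1.0, 0.0]]
importance = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
windows, local_out = windowed_attention(frames, importance, threshold=0.5)
global_out = cross_window_attention(local_out, windows)
```

Here the windows have different lengths because their boundaries follow the importance scores instead of a fixed stride. That is the property the abstract attributes to dynamic windows, though the model's actual splitting rule may differ.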
