Multi-Scale Self-Attention for Text Classification
This work targets text classification when training data is limited. The contribution appears incremental, since it builds on the existing Transformer architecture.
The paper tackles the problem of capturing features at different scales in text classification by introducing multi-scale structure into self-attention modules. The resulting Multi-Scale Transformer is reported to outperform the standard Transformer consistently and significantly on small and moderate-size datasets, in experiments covering 21 datasets.
In this paper, we introduce multi-scale structure, as prior knowledge, into self-attention modules. We propose a Multi-Scale Transformer that uses multi-scale multi-head self-attention to capture features at different scales. Based on a linguistic perspective and an analysis of a Transformer pre-trained on a huge corpus (BERT), we further design a strategy to control the scale distribution for each layer. Results on three different kinds of tasks (21 datasets) show that our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate-size datasets.
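To make the idea of multi-scale multi-head self-attention concrete, here is a minimal sketch in plain Python. It assumes that a "scale" is a local window size and that each head only attends to positions inside its own window; the abstract does not spell out this mechanism, so the masking rule and the names `window_mask`, `scaled_local_attention`, and `multi_scale_attention` are illustrative assumptions, not the authors' implementation. The per-layer scale distribution strategy is not modeled here.

```python
import math

def window_mask(seq_len, scale):
    """Assumed rule: position i may attend to j when |i - j| <= scale // 2."""
    half = scale // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]

def scaled_local_attention(q, k, v, scale):
    """One attention head restricted to a local window of the given scale.

    q, k, v are lists of float vectors, one vector per token.
    """
    seq_len, d = len(q), len(q[0])
    mask = window_mask(seq_len, scale)
    out = []
    for i in range(seq_len):
        # Only positions inside the window take part in the softmax.
        js = [j for j in range(seq_len) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in js]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * v[j][c] for e, j in zip(exps, js))
            for c in range(len(v[0]))
        ])
    return out

def multi_scale_attention(heads, scales):
    """Run one head per scale and concatenate the outputs per token.

    heads is a list of (q, k, v) triples, one per head (hypothetical layout).
    """
    per_head = [scaled_local_attention(q, k, v, s) for (q, k, v), s in zip(heads, scales)]
    seq_len = len(per_head[0])
    return [[x for head in per_head for x in head[i]] for i in range(seq_len)]

if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Two heads sharing toy projections: one token-level (scale 1), one wider (scale 3).
    out = multi_scale_attention([(q, q, v), (q, q, v)], scales=[1, 3])
    print(len(out), len(out[0]))
```

Under this assumption, a head with scale 1 sees only its own token and returns the value vectors unchanged, while a head whose window covers the whole sequence behaves like standard self-attention. Mixing several window sizes in one layer is what lets different heads specialize in different ranges of context.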