NE · Apr 13

Winner-Take-All Spiking Transformer for Language Modeling

Chenlin Zhou, Sihang Guo, Jiaqi Wang, Dongyang Ma, Kaiwei Che, Baiyu Chen, Qingyan Meng, Zhengyu Ma, Yonghong Tian

arXiv:2604.11321 · 76.0 · h-index: 8

Predicted impact: top 3% in NE · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the high energy cost and neuromorphic deployment challenges of softmax-based spiking self-attention in language modeling, offering a softmax-free alternative aimed at energy-efficient AI.

The authors introduce Winner-Take-All mechanisms into spiking transformers to create softmax-free, spike-driven self-attention modules, achieving competitive performance on 16 NLP datasets while reducing energy costs compared to softmax-based spiking transformers.

Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient property of Spiking Neural Networks (SNNs), have achieved impressive results in neuromorphic and vision tasks and attracted increasing attention. However, existing directly trained spiking transformers primarily focus on vision tasks. For language modeling with spiking transformers, convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment. To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Based on them, we design a WTA-based Encoder-only Spiking Transformer (WE-Spikingformer) for masked language modeling and a WTA-based Decoder-only Spiking Transformer (WD-Spikingformer) for causal language modeling, systematically exploring softmax-free, spike-driven Transformer architectures trained end-to-end for natural language processing tasks. Extensive experiments on 16 datasets spanning natural language understanding, question answering, and commonsense reasoning validate the effectiveness of our approach and highlight the promise of spiking transformers for general language modeling and energy-efficient artificial intelligence.
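
The abstract does not spell out how WSSA or CWSSA are computed, so the following is only a rough sketch of the general idea of replacing softmax with a winner-take-all selection. It assumes binary spike matrices for queries, keys, and values, and a hard per-query argmax with ties going to the lowest index; the function name `wta_attention` and this selection rule are illustrative assumptions, not the paper's definitions.

```python
def wta_attention(q, k, v, causal=False):
    """Hard winner-take-all attention over binary spike matrices.

    q, k: lists of rows of 0/1 spikes with a shared feature width.
    v: list of rows of 0/1 spikes, one row per key position.
    Illustrative only: the selection rule is an assumption, not the
    paper's WSSA/CWSSA definition.
    """
    out = []
    for i, q_row in enumerate(q):
        # With causal masking, position i may only attend to positions 0..i.
        limit = i + 1 if causal else len(k)
        # Spike overlap counts: integer dot products of binary vectors.
        scores = [sum(a * b for a, b in zip(q_row, k[j])) for j in range(limit)]
        # Winner-take-all: keep the single best key (first index on ties).
        winner = scores.index(max(scores))
        # The output copies the winner's value row, so it stays binary.
        out.append(list(v[winner]))
    return out


# Toy spike trains: 3 positions, feature width 3 for q/k and 2 for v.
q = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
k = [[1, 0, 0], [0, 1, 1], [1, 0, 1]]
v = [[1, 0], [0, 1], [1, 1]]

full = wta_attention(q, k, v)                  # encoder-style, no mask
causal = wta_attention(q, k, v, causal=True)   # decoder-style, causal mask
```

In this toy version the scores are integer counts and the output is a copy of one binary row, so no exponentials, divisions, or floating-point weights are needed. That is the kind of property that makes softmax-free attention attractive for spike-driven hardware, though the actual modules in the paper may differ substantially.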
