CL LG · Nov 17, 2023

Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

Vukasin Bozic, Danilo Dordevic, Daniele Coppola, Joseph Thommes, Sidak Pal Singh

ETH Zurich

arXiv:2311.10642v4 · 24 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the complexity of Transformers for sequence-to-sequence tasks, offering a potential streamlining method, though it appears incremental.

The paper tackles the problem of simplifying Transformers by replacing attention layers with shallow feed-forward networks, achieving performance rivaling the original on the IWSLT2017 dataset.

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
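
The distillation setup described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the teacher is a single-head scaled dot-product self-attention layer, the student is a one-hidden-layer ReLU network that reads the whole fixed-length sequence flattened into one vector, and the distillation objective is the mean squared error between the two outputs. The function names, the toy dimensions, and the choice of loss are all hypothetical; the abstract specifies none of them.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, wq, wk, wv):
    """Teacher (assumed form): single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))  # square root of the query/key dimension
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)


def shallow_ffn(x, w1, b1, w2, b2):
    """Student (assumed form): flatten the sequence, one ReLU hidden layer, reshape back."""
    seq_len, d = len(x), len(x[0])
    flat = [v for row in x for v in row]
    hidden = [max(0.0, sum(f * w for f, w in zip(flat, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    out = [sum(h * w for h, w in zip(hidden, col)) + b
           for col, b in zip(zip(*w2), b2)]
    return [out[i * d:(i + 1) * d] for i in range(seq_len)]


def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs (assumed objective)."""
    diffs = [(s - t) ** 2
             for srow, trow in zip(student_out, teacher_out)
             for s, t in zip(srow, trow)]
    return sum(diffs) / len(diffs)


def rand_matrix(rows, cols, rng):
    """Small random matrix used for toy weights and inputs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only.
SEQ_LEN, D_MODEL, HIDDEN = 4, 3, 8
rng = random.Random(0)

x = rand_matrix(SEQ_LEN, D_MODEL, rng)
wq, wk, wv = (rand_matrix(D_MODEL, D_MODEL, rng) for _ in range(3))
w1, b1 = rand_matrix(SEQ_LEN * D_MODEL, HIDDEN, rng), [0.0] * HIDDEN
w2, b2 = rand_matrix(HIDDEN, SEQ_LEN * D_MODEL, rng), [0.0] * (SEQ_LEN * D_MODEL)

teacher_out = self_attention(x, wq, wk, wv)    # target produced by the frozen teacher
student_out = shallow_ffn(x, w1, b1, w2, b2)   # untrained student prediction
loss = distillation_loss(student_out, teacher_out)
print("distillation loss before training:", loss)
```

In an actual training run, this loss would be minimized over many input sequences by gradient descent on the student's weights while the teacher stays fixed; the sketch stops at computing the loss once to keep the data flow visible.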
