CL · AI · Sep 4, 2023

One Wide Feedforward is All You Need

arXiv:2309.01826v2 · 141 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the efficiency and accuracy of Transformer models. It appears incremental because it modifies an existing architecture rather than introducing a new paradigm.

The paper addresses the redundancy of feedforward networks (FFNs) in Transformers. Removing the FFNs from the decoder layers and sharing a single FFN across the encoder reduces the parameter count with only a modest drop in accuracy. Widening the shared FFN until the model is back at its original size then yields substantial gains in both accuracy and latency over Transformer Big.

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
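
The parameter arithmetic behind the abstract can be sketched roughly as follows. This is a back-of-the-envelope illustration, not the authors' code. It assumes the standard Transformer Big dimensions (model width 1024, FFN hidden width 4096, 6 encoder and 6 decoder layers), none of which are stated in the text above, and it counts FFN weight matrices only, ignoring biases, attention, and embeddings. All function names are hypothetical.

```python
# Rough FFN parameter counts (weights only, biases ignored).
# Assumed Transformer Big dimensions; not stated in the summary above.
D_MODEL = 1024      # model (embedding) width
D_FF = 4096         # hidden width of each FFN
ENC_LAYERS = 6
DEC_LAYERS = 6


def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights of one FFN: d_model -> d_ff and d_ff -> d_model projections."""
    return 2 * d_model * d_ff


def baseline_ffn_params() -> int:
    """One independent FFN in every encoder and decoder layer."""
    return (ENC_LAYERS + DEC_LAYERS) * ffn_params(D_MODEL, D_FF)


def shared_ffn_params() -> int:
    """Decoder FFNs removed; a single FFN shared by all encoder layers."""
    return ffn_params(D_MODEL, D_FF)


def widened_d_ff() -> int:
    """Hidden width at which one shared FFN matches the baseline FFN budget."""
    return baseline_ffn_params() // (2 * D_MODEL)


if __name__ == "__main__":
    print("baseline FFN weights:", baseline_ffn_params())
    print("shared FFN weights:  ", shared_ffn_params())
    print("widened hidden size: ", widened_d_ff())
```

Under these assumptions, sharing one FFN cuts the FFN weight count by a factor of twelve, and restoring the original budget means making the single FFN twelve times wider. The sketch says nothing about accuracy or latency, which the paper measures empirically.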

