LG | Oct 9, 2022

Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

arXiv:2210.04243v1306 citations | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the efficiency bottleneck of large language models in real-time applications. It appears incremental, however, since it builds on prior work on attention substitutes.

The paper targets the cost of token generation in autoregressive Transformers, which grows with sequence length. It proposes decaying fast weights as a simpler alternative to existing kernel-based methods, achieving O(1) time and memory complexity per generated token while retaining 99% of GPT-2's attention performance and showing competitive results on WikiText-103.

Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
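The contrast between O(T) attention and an O(1) recurrent update can be illustrated with a small sketch. This is not the paper's exact formulation: the per-dimension `decay` vector, the plain outer-product write, and the function names `decaying_fast_weight_step`, `generate`, and `reference` are assumptions chosen for illustration. The point is only that a fixed-size memory matrix, decayed and updated once per token, reproduces a decayed sum over the whole history without revisiting past tokens.

```python
import numpy as np


def decaying_fast_weight_step(state, q, k, v, decay):
    """One generation step (illustrative, not the paper's exact rule).

    state: (d_k, d_v) memory; q, k, decay: (d_k,); v: (d_v,).
    """
    # Decay each row of the old memory, then write the new key-value outer product.
    state = decay[:, None] * state + np.outer(k, v)
    # Read out with the query; the cost does not depend on how many tokens came before.
    out = q @ state
    return state, out


def generate(qs, ks, vs, decay):
    """Run the recurrence over a sequence, keeping only a fixed-size state."""
    state = np.zeros((ks.shape[1], vs.shape[1]))
    outs = []
    for q, k, v in zip(qs, ks, vs):
        state, out = decaying_fast_weight_step(state, q, k, v, decay)
        outs.append(out)
    return np.stack(outs), state


def reference(qs, ks, vs, decay):
    """Same result computed as an explicit sum over all past tokens (O(T) per step)."""
    outs = []
    for t in range(len(qs)):
        acc = np.zeros(vs.shape[1])
        for i in range(t + 1):
            weight = decay ** (t - i)  # decay applied once per elapsed step
            acc += (qs[t] * weight * ks[i]).sum() * vs[i]
        outs.append(acc)
    return np.stack(outs)


rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
qs, ks, vs = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
decay = rng.uniform(0.5, 0.99, size=d_k)  # hypothetical fixed decay rates

outs, final_state = generate(qs, ks, vs, decay)
```

Here `final_state` stays at shape `(d_k, d_v)` regardless of `T`, which is where the constant per-token time and memory come from, while `reference` has to loop over every earlier token at each step.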

Code Implementations: 1 repo