LG · Sep 30, 2021

Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems

arXiv:2109.15142v3 · 31 citations
Originality: Incremental advance
AI Analysis

This work addresses the high parameter count and computational cost of Transformers, a critical issue for scaling and efficiency in machine learning. The advance is incremental, since it builds on existing dynamical-system analogies.

The authors reduce the parameter count and computational complexity of Transformers by approximating the self-attention and feed-forward components using insights from multi-particle dynamical systems. The resulting model, TransEvolve, performs comparably to the original Transformer on encoder-decoder tasks and outperforms it on encoder-only tasks.

The Transformer and its variants have been proven to be efficient sequence learners in many different domains. Despite their staggering success, a critical issue has been the enormous number of parameters that must be trained (ranging from $10^7$ to $10^{11}$) along with the quadratic complexity of dot-product attention. In this work, we investigate the problem of approximating the two central components of the Transformer -- multi-head self-attention and point-wise feed-forward transformation, with reduced parameter space and computational complexity. We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations. Taking advantage of an analogy between Transformer stages and the evolution of a dynamical system of multiple interacting particles, we formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers. We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks. We observe that the degree of approximation (or inversely, the degree of parameter reduction) has different effects on the performance, depending on the task. While in the encoder-decoder regime, TransEvolve delivers performances comparable to the original Transformer, in encoder-only tasks it consistently outperforms Transformer along with several subsequent variants.
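
To make the analogy in the abstract concrete, the sketch below treats each token vector as a "particle" and a residual attention sublayer as one forward-Euler step of an ODE, x <- x + dt * F(x). It is an illustration of the general idea only, not the TransEvolve scheme itself: the function names (`attention_weights`, `interaction`, `euler_step`, `evolve`) are hypothetical, and the attention is simplified (single head, no learned projections). It also shows where the quadratic cost comes from: the weight matrix has one entry per pair of tokens.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x):
    """Return the n x n matrix of scaled dot-product attention weights.

    Assumption for illustration: queries and keys are the raw vectors
    (single head, no learned projections). The n x n size of this
    matrix is the source of the quadratic cost in sequence length.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(a * b for a, b in zip(xi, xj)) for xj in x])
        for xi in x
    ]


def interaction(x):
    """Interaction term F(x): attention-weighted sum over all particles."""
    w = attention_weights(x)
    n, d = len(x), len(x[0])
    return [
        [sum(w[i][j] * x[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


def euler_step(x, dt=1.0):
    """One forward-Euler step x <- x + dt * F(x).

    With dt = 1 this has the same form as a residual connection
    around an attention sublayer.
    """
    f = interaction(x)
    return [
        [xk + dt * fk for xk, fk in zip(xi, fi)]
        for xi, fi in zip(x, f)
    ]


def evolve(x, steps, dt=1.0):
    """Stack `steps` layers, i.e. integrate the system for `steps` time steps."""
    for _ in range(steps):
        x = euler_step(x, dt)
    return x


particles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, dimension 2
weights = attention_weights(particles)
print(len(weights), len(weights[0]))  # 3 3: one weight per pair of tokens
final_state = evolve(particles, steps=2, dt=0.5)
```

In this sketch every call to `euler_step` recomputes the full n x n weight matrix. The abstract describes TransEvolve as a temporal evolution scheme that bypasses this costly dot-product attention over multiple stacked layers; how it does so is specified in the paper, not here.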

Code Implementations: 1 repo
