LG | Feb 14, 2024

Transformers, parallel computation, and logarithmic depth

arXiv:2402.09268v1 | 69 citations | h-index: 26 | ICML
Originality: Highly original
AI Analysis

The paper identifies parallelism as a key property that distinguishes transformers from other neural sequence models in terms of computational efficiency.

The paper shows that transformers with a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation (MPC). As a consequence, logarithmic depth suffices for transformers to solve basic computational tasks that several other neural sequence models and sub-quadratic transformer approximations cannot solve efficiently.

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
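To give some intuition for why a logarithmic number of parallel rounds can be enough, here is a minimal sketch of pointer doubling, a standard parallel technique. It is an illustration only and is not taken from the paper: the task (following k pointers in a chain) and the names `hop_sequential` and `hop_doubling` are assumptions chosen for the example. Each round updates every position at once, which is the kind of all-to-all step that a communication round or a self-attention layer provides.

```python
def hop_sequential(nxt, start, k):
    """Follow k pointers one at a time: k dependent steps."""
    node = start
    for _ in range(k):
        node = nxt[node]
    return node


def hop_doubling(nxt, k):
    """Return (result, rounds); result[i] is the node reached from i after k hops.

    Hypothetical illustration: each round squares the jump table for all
    positions simultaneously, so only about log2(k) rounds are needed.
    """
    n = len(nxt)
    result = list(range(n))  # 0 hops: every node maps to itself
    jump = list(nxt)         # jump[i] = node reached from i after 2**r hops
    rounds = 0
    while k > 0:
        if k & 1:
            # Fold the current power-of-two jump into the answer.
            result = [jump[result[i]] for i in range(n)]
        k >>= 1
        if k > 0:
            # One parallel round: double the hop length at every position.
            jump = [jump[jump[i]] for i in range(n)]
            rounds += 1
    return result, rounds


if __name__ == "__main__":
    chain = [(i + 1) % 8 for i in range(8)]  # a cycle 0 -> 1 -> ... -> 7 -> 0
    reached, used = hop_doubling(chain, 5)
    print(reached[0], used)
```

The sequential version needs k dependent steps, while the doubling version needs a number of rounds that grows only with the logarithm of k, at the cost of every position doing work in every round.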
