Saturated Transformers are Constant-Depth Threshold Circuits
This work analyzes the circuit complexity of transformers, offering NLP researchers theoretical insight into practical attention mechanisms. It is incremental in that it builds on prior analyses of hard-attention transformers.
The paper studies the theoretical power of transformers with saturated attention. It shows that they overcome the known limitations of hard-attention transformers and that, with floating-point values, they can be simulated by constant-depth threshold circuits, which establishes $\mathsf{TC}^0$ as an upper bound on the formal languages they recognize.
Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al., 2021). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove that saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class $\mathsf{TC}^0$ as an upper bound on the formal languages they recognize.
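To make the distinction between the two attention variants concrete, the following is a minimal sketch, not code from the paper. It assumes the usual reading of saturated attention as a uniform distribution over all positions that attain the maximum score, which is what softmax approaches as the scores are scaled up. The function names, the leftmost tie-breaking rule for hard attention, and the example scores are all illustrative assumptions. A threshold gate, the basic unit of a threshold circuit, is included for comparison with AND/OR gates.

```python
import math

def softmax(scores, scale=1.0):
    # Numerically stable softmax over scaled scores.
    m = max(scores)
    exps = [math.exp(scale * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_attention(scores):
    # All weight on a single maximal position.
    # Leftmost tie-breaking is an assumption made for this sketch.
    best = scores.index(max(scores))
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def saturated_attention(scores):
    # Weight spread uniformly over every position tied for the maximum.
    m = max(scores)
    tied = [s == m for s in scores]
    k = sum(tied)
    return [1.0 / k if t else 0.0 for t in tied]

def threshold_gate(bits, k):
    # Outputs 1 iff at least k of the input bits are 1.
    # AND is the case k = len(bits); OR is the case k = 1.
    return int(sum(bits) >= k)

# Hypothetical attention scores with a tie at positions 0 and 2.
scores = [2.0, 0.5, 2.0, 1.0]
hard = hard_attention(scores)        # [1.0, 0.0, 0.0, 0.0]
sat = saturated_attention(scores)    # [0.5, 0.0, 0.5, 0.0]
sharp = softmax(scores, scale=50.0)  # numerically close to `sat`
```

Because saturated attention can average over several tied positions, a head can compute quantities such as the fraction of positions that satisfy some property, which a head attending to a single position cannot. Threshold gates play the analogous role on the circuit side, since they compare a count of inputs against a bound.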