Softmax Transformers are Turing-Complete
This resolves a foundational theoretical question in AI about the computational limits of widely used transformer architectures, with implications for understanding their reasoning capabilities.
The paper tackles the open problem of whether softmax attention Chain-of-Thought transformers are Turing-complete, proving that length-generalizable versions are Turing-complete for letter-bounded languages and with extensions for arbitrary languages, and empirically validates this with transformers trained on complex arithmetic reasoning tasks.
Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax attention Chain-of-Thought (CoT) transformers are Turing-complete. In this paper, we prove a stronger result that length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which correspond to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show this is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers for languages requiring complex (non-linear) arithmetic reasoning.