Parity, Sensitivity, and Transformers
This work addresses a foundational problem in machine learning by clarifying the capabilities of transformers, which is incremental as it builds on prior constructions but introduces more realistic constraints.
The paper tackled the problem of understanding the computational limits of transformers by investigating whether they can solve the PARITY problem, and presented a new transformer construction that solves PARITY with practical features like softmax and length-independent positional encoding, while also proving a lower bound that one-layer, one-head transformers cannot solve it.
The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY -- or more generally -- which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY -- by showing that it cannot be done with only one layer and one head.