Simulating Hard Attention Using Soft Attention
This work addresses theoretical questions about the expressiveness of transformer models in computational linguistics, clarifying when soft attention can reproduce the behavior of hard attention. The contribution is incremental in that it builds on existing attention mechanisms.
The paper investigates conditions under which transformers with soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. It shows that soft-attention transformers can compute languages defined in variants of linear temporal logic, using either unbounded positional embeddings or temperature scaling, and that temperature scaling also suffices to simulate general hard-attention transformers.
We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several subclasses of languages recognized by hard-attention transformers, which can be defined in variants of linear temporal logic. We demonstrate how soft-attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate general hard-attention transformers, using a temperature that depends on the minimum gap between the maximum attention scores and other attention scores.
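The role of the score gap can be illustrated numerically. The following is a minimal sketch, not the paper's construction: it compares temperature-scaled softmax against average-hard attention (uniform weight on the maximal scores) for a single attention row. The function names and the temperature rule `gap / log((n - 1) / eps)` are assumptions introduced here for illustration; the rule follows from an elementary bound on the softmax mass outside the maximal positions and is not taken from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature (numerically stabilized)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_hard_attention(scores):
    """Uniform weight on every position attaining the maximum score."""
    m = max(scores)
    winners = [s == m for s in scores]
    k = sum(winners)
    return [1.0 / k if w else 0.0 for w in winners]


def min_gap(scores):
    """Gap between the maximum score and the largest non-maximal score."""
    m = max(scores)
    others = [s for s in scores if s != m]
    return m - max(others) if others else math.inf


def temperature_for_error(scores, eps):
    """Illustrative rule (an assumption of this sketch, requires 0 < eps < 1):
    with n positions and gap g, the softmax mass on non-maximal positions is
    at most (n - 1) * exp(-g / tau), so tau = g / log((n - 1) / eps) keeps
    that mass at or below eps."""
    gap = min_gap(scores)
    if math.isinf(gap):
        return 1.0  # all scores tie: softmax is already uniform
    return gap / math.log((len(scores) - 1) / eps)


scores = [1.0, 3.0, 3.0, 2.5]
eps = 1e-3
tau = temperature_for_error(scores, eps)
soft = softmax(scores, tau)
hard = average_hard_attention(scores)
stray_mass = sum(w for w, h in zip(soft, hard) if h == 0.0)
print(f"tau={tau:.4f}, stray mass={stray_mass:.2e}")
```

In this sketch a smaller gap forces a smaller temperature, which matches the qualitative dependence described in the abstract: the closer the runner-up scores are to the maximum, the more the scores must be sharpened before softmax concentrates on the maximal positions.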