Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages
This work addresses the theoretical limitations of Transformers in modeling natural language, which is hypothesized to be mildly context-sensitive. The contribution is incremental, however, as it builds on prior work on regular and context-free languages.
The study tests Transformers' ability to learn mildly context-sensitive languages and finds that they generalize well to unseen in-distribution data but extrapolate to longer strings less well than LSTMs. Analyses show that the learned attention patterns model dependency relations and exhibit counting behavior.
Although Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to consider their implications for modeling natural language, which is hypothesized to be mildly context-sensitive. We test Transformers' ability to learn mildly context-sensitive languages of varying complexities, and find that they generalize well to unseen in-distribution data, but that their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations model dependency relations and demonstrate counting behavior, which may have helped the models solve the languages.
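To make the distinction between in-distribution generalization and extrapolation to longer strings concrete, the following is a minimal sketch of how such a split could be built. It assumes, purely for illustration, two textbook examples of mildly context-sensitive languages: the copy language ww and the cross-serial pattern a^n b^m c^n d^m. The text above does not name the languages actually tested, and all function names and length thresholds here are hypothetical.

```python
import random


def copy_language(n, alphabet="ab", rng=random):
    """Return a string ww with |w| = n (illustrative copy language)."""
    w = "".join(rng.choice(alphabet) for _ in range(n))
    return w + w


def crossing_dependency(n, m):
    """Return a^n b^m c^n d^m, an illustrative cross-serial dependency pattern."""
    return "a" * n + "b" * m + "c" * n + "d" * m


def is_copy(s):
    """Check membership in the copy language: s must equal ww for some w."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half] == s[half:]


def split_by_length(max_train_n, max_test_n, seed=0):
    """Build a short-string training set and a strictly longer extrapolation set.

    The thresholds are assumptions for this sketch, not the study's settings.
    """
    rng = random.Random(seed)
    train = [copy_language(n, rng=rng) for n in range(1, max_train_n + 1)]
    extrapolation = [
        copy_language(n, rng=rng)
        for n in range(max_train_n + 1, max_test_n + 1)
    ]
    return train, extrapolation


if __name__ == "__main__":
    print(crossing_dependency(2, 3))  # aabbbccddd
    train, extrapolation = split_by_length(max_train_n=3, max_test_n=6)
    print(train, extrapolation)
```

Under this setup, an in-distribution test set would contain unseen strings whose lengths fall within the training range, while the extrapolation set contains only strings longer than any seen in training.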