How Can Self-Attention Networks Recognize Dyck-n Languages?
The work addresses the challenge of learning hierarchical structures in formal languages, a problem relevant to natural language processing. The contribution is incremental, since it builds on prior work on self-attention and language recognition.
The paper tackles the problem of recognizing Dyck-n languages with self-attention networks. It shows that the variant with a starting symbol (SA+) reaches 58.82% accuracy on D2 for long sequences and generalizes better than the variant without one (SA-), which breaks down completely on long sequences.
We focus on the recognition of Dyck-n ($\mathcal{D}_n$) languages with self-attention (SA) networks, which has been deemed a difficult task for these networks. We compare the performance of two variants of SA, one with a starting symbol (SA$^+$) and one without (SA$^-$). Our results show that SA$^+$ is able to generalize to longer sequences and deeper dependencies. For $\mathcal{D}_2$, we find that SA$^-$ completely breaks down on long sequences, whereas the accuracy of SA$^+$ is 58.82$\%$. We find attention maps learned by SA$^+$ to be amenable to interpretation and compatible with a stack-based language recognizer. Surprisingly, the performance of SA networks is on par with LSTMs, which provides evidence of the ability of SA to learn hierarchies without recursion.
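For reference, the kind of stack-based recognizer that the attention maps are compared against can be sketched as follows. This is a minimal illustration of membership in $\mathcal{D}_n$, not the networks or code used in the paper; the function name `is_dyck` and the choice of bracket characters for $\mathcal{D}_2$ are assumptions made for the example.

```python
def is_dyck(sequence, pairs=(("(", ")"), ("[", "]"))):
    """Return True if `sequence` is a well-nested string over the given bracket pairs.

    `pairs` lists the n (opener, closer) types; two pairs correspond to D_2.
    The name and default bracket characters are illustrative assumptions.
    """
    closer_to_opener = {closer: opener for opener, closer in pairs}
    openers = {opener for opener, _ in pairs}
    stack = []
    for symbol in sequence:
        if symbol in openers:
            # An opening bracket is pushed and waits for its matching closer.
            stack.append(symbol)
        elif symbol in closer_to_opener:
            # A closing bracket must match the most recent unmatched opener.
            if not stack or stack.pop() != closer_to_opener[symbol]:
                return False
        else:
            # Any symbol outside the bracket alphabet is rejected.
            return False
    # The string is accepted only if every opener has been closed.
    return not stack


print(is_dyck("([])[]"))  # well nested
print(is_dyck("([)]"))    # crossing brackets, not in D_2
```

The stack height at any position equals the nesting depth, which is why longer sequences and deeper dependencies are the two axes along which generalization is measured.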