CL | Oct 3, 2023

Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

arXiv:2310.01749v2 | 16 citations | h-index: 4
AI Analysis

This paper addresses a fundamental problem in natural language processing: it enhances transformers' ability to handle syntactic structures without syntactic supervision. The contribution is incremental, however, in that it builds on existing attention mechanisms.

The paper addresses the inability of standard attention in transformers to model hierarchical patterns of arbitrary nesting depth by proposing stack attention, an attention operator that incorporates stacks and is motivated by their theoretical connection to context-free languages (CFLs). Transformers with stack attention performed strongly on challenging CFLs and improved natural language modeling under a constrained parameter budget; the paper also reports results on machine translation.

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong results on a CFL with theoretically maximal parsing difficulty. We also show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.
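To make "hierarchical patterns of arbitrary nesting depth" concrete, here is a minimal sketch of a hand-written stack recognizer for a simple context-free language: well-nested strings over two bracket types. It is only an illustration of why a stack suits this kind of pattern; it is not the paper's stack attention, which is a differentiable operator learned inside a transformer. The names `is_well_nested`, `PAIRS`, and `OPENERS` are hypothetical and chosen for this example.

```python
# Illustrative only: a classical stack recognizer, NOT the paper's
# differentiable stack attention. All names here are hypothetical.

PAIRS = {")": "(", "]": "["}   # closing bracket -> matching opening bracket
OPENERS = set(PAIRS.values())  # {"(", "["}


def is_well_nested(s: str) -> bool:
    """Return True if s is a well-nested string over the brackets ()[]."""
    stack = []
    for ch in s:
        if ch in OPENERS:
            # Push: remember the opener until its partner arrives.
            stack.append(ch)
        elif ch in PAIRS:
            # Pop: the most recent unmatched opener must match this closer.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            # Symbol outside the bracket alphabet.
            return False
    # Accept only if every opener was matched.
    return not stack


if __name__ == "__main__":
    for example in ["([()])", "([)]", "((", ""]:
        print(repr(example), is_well_nested(example))
```

The stack grows with the nesting depth of the input, so the recognizer handles any depth with the same few lines of logic. Standard attention has no comparable built-in mechanism, which is the gap the abstract describes.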

Code Implementations: 1 repo