CL, LG, NE
Feb 3, 2017

Structured Attention Networks

arXiv:1702.00887v3
492 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more expressive attention mechanisms in deep learning for NLP tasks. It offers a method for modeling structural biases while keeping end-to-end training, and it is rated an incremental advance because it builds on existing attention frameworks.

The authors incorporate richer structural dependencies into attention networks without losing end-to-end trainability by using graphical models, such as linear-chain CRFs and graph-based parsers, as neural network layers. They report that structured attention networks outperform baseline attention models on tree transduction, neural machine translation, question answering, and natural language inference, though no specific numerical gains are given.

Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.
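
To make the contrast in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It compares standard softmax attention with one possible linear-chain CRF attention layer. The sketch assumes a binary latent variable z_i per source position, pairwise log-potentials shared across adjacent positions, and a context vector formed by weighting each input with the marginal p(z_i = 1). The function names (`softmax_attention`, `chain_marginals`, `structured_attention`) and the potentials in the usage lines are hypothetical, chosen only for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def softmax_attention(scores, values):
    # Standard soft selection: a single categorical distribution over positions.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values, w

def chain_marginals(unary, pairwise):
    """Forward-backward on a binary linear-chain CRF (illustrative assumption).

    unary:    (n, 2) log-potentials for z_i in {0, 1}
    pairwise: (2, 2) log-potentials for (z_i, z_{i+1}), shared across positions
    Returns the marginals p(z_i = 1) as an array of shape (n,).
    """
    n = unary.shape[0]
    alpha = np.zeros((n, 2))
    beta = np.zeros((n, 2))
    alpha[0] = unary[0]
    for i in range(1, n):
        # alpha[i, b] = unary[i, b] + logsumexp_a(alpha[i-1, a] + pairwise[a, b])
        alpha[i] = unary[i] + logsumexp(alpha[i - 1][:, None] + pairwise, axis=0)
    for i in range(n - 2, -1, -1):
        # beta[i, a] = logsumexp_b(pairwise[a, b] + unary[i+1, b] + beta[i+1, b])
        beta[i] = logsumexp(pairwise + (unary[i + 1] + beta[i + 1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)  # log partition function
    return np.exp(alpha + beta - log_z)[:, 1]

def structured_attention(unary, pairwise, values):
    # Context vector: inputs weighted by the CRF marginals p(z_i = 1).
    p = chain_marginals(unary, pairwise)
    return p @ values, p

# Usage with made-up numbers: 5 positions, 3-dimensional inputs.
rng = np.random.default_rng(0)
values = rng.normal(size=(5, 3))
scores = rng.normal(size=5)
unary = np.stack([np.zeros(5), scores], axis=1)   # score only the z_i = 1 state
pairwise = np.array([[1.0, 0.0], [0.0, 1.0]])     # favors equal neighboring labels
ctx_soft, w = softmax_attention(scores, values)
ctx_crf, p = structured_attention(unary, pairwise, values)
```

Unlike the softmax weights `w`, the marginals `p` need not sum to one, so several adjacent positions can be selected together. That is one way to read the abstract's phrase "attending to partial segmentations". Every step is differentiable, so an automatic-differentiation framework could in principle train such a layer end to end.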

