CL | LG | Nov 11, 2019

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

arXiv:1911.04070v1 | 85 citations
Originality: Incremental advance
AI Analysis

The paper addresses the computational bottleneck that NLP researchers and practitioners face when working with long documents. It offers a more efficient alternative to standard self-attention, though it is an incremental improvement over existing sparse attention methods.

The paper tackles the quadratic complexity of Transformer self-attention, which limits its use on long texts. It proposes BP-Transformer, which applies fine-to-coarse attention over multi-scale spans produced by binary partitioning and yields O(k·n log(n/k)) connections. Experiments on text classification, machine translation, and language modeling show better performance on long text than previous self-attention models.

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose BP-Transformer (BPT for short). BPT yields $O(k\cdot n\log (n/k))$ connections, where $k$ is a hyperparameter that controls the density of attention. BPT strikes a good balance between computational complexity and model capacity. A series of experiments on text classification, machine translation and language modeling shows that BPT outperforms previous self-attention models on long text. Our code, hyperparameters and CUDA kernels for sparse attention are available in PyTorch.
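
To make the scaling claim concrete, here is a minimal Python sketch, not the authors' implementation. It does two things: it recursively halves a sequence into multi-scale spans, which is one plain reading of "binary partitioning", and it compares the $n^2$ connections of dense self-attention with the $k\cdot n\log (n/k)$ growth term from the abstract. The function names are hypothetical, the logarithm base of 2 is an assumption, and the constant factors hidden by the big-O notation are ignored, so the numbers indicate growth rates only.

```python
import math


def binary_partition(start, end):
    """Return every span in the binary tree over the half-open range [start, end).

    Hypothetical helper: splits each span at its midpoint until spans hold one token.
    """
    spans = [(start, end)]
    if end - start > 1:
        mid = (start + end) // 2
        spans += binary_partition(start, mid)
        spans += binary_partition(mid, end)
    return spans


def dense_connections(n):
    """Connections in standard self-attention: every token attends to every token."""
    return n * n


def bpt_connection_estimate(n, k):
    """Growth term k * n * log2(n / k) from the abstract.

    Assumptions: base-2 logarithm and no constant factors, so this is an
    order-of-magnitude estimate rather than an exact count.
    """
    return k * n * math.log2(n / k)


if __name__ == "__main__":
    # An 8-token sequence yields spans of length 8, 4, 2 and 1.
    print(len(binary_partition(0, 8)), "spans for n = 8")

    n, k = 1024, 4
    print("dense:", dense_connections(n))
    print("BPT estimate:", bpt_connection_estimate(n, k))
```

Under these assumptions, a larger $k$ gives denser attention at higher cost, while the gap to dense attention widens as $n$ grows.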

Code Implementations: 2 repos
