CL · Apr 14, 2022

Revisiting Transformer-based Models for Long Document Classification

arXiv:2204.06683v2313 citations · h-index: 31
Originality: Synthesis-oriented
AI Analysis

The paper addresses the computational inefficiency of vanilla Transformers on multi-page documents and offers practical guidance for real-world applications.

The paper compares sparse attention and hierarchical encoding methods for Transformer-based long document classification and finds that processing longer text improves performance across four datasets.

The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., size of local attention window, use of global attention) and hierarchical (e.g., document splitting strategy) transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice on applying Transformer-based models to long document classification tasks.
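The two families named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_segments` and `sparse_attention_mask` are hypothetical, whitespace tokens stand in for subword tokens, and fixed-length splitting with optional overlap is only one possible document splitting strategy. The first function shows how a hierarchical model might cut a long document into segments that are encoded separately; the second builds a boolean mask combining a local attention window with a few global positions, which is the general idea behind sparse attention.

```python
from typing import Iterable, List


def split_into_segments(tokens: List[str], segment_len: int,
                        overlap: int = 0) -> List[List[str]]:
    """Split a token sequence into fixed-length segments (hypothetical helper).

    Consecutive segments share `overlap` tokens. The last segment may be
    shorter than `segment_len`.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    stride = segment_len - overlap
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + segment_len])
        # Stop once a segment has reached the end of the document.
        if start + segment_len >= len(tokens):
            break
    return segments


def sparse_attention_mask(seq_len: int, window: int,
                          global_positions: Iterable[int] = ()) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Assumption for this sketch: each position sees `window // 2` neighbours
    on either side, and global positions attend to, and are attended by,
    every position.
    """
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_local = abs(i - j) <= half
            is_global = i in global_set or j in global_set
            row.append(is_local or is_global)
        mask.append(row)
    return mask


if __name__ == "__main__":
    doc = [f"t{k}" for k in range(10)]
    for segment in split_into_segments(doc, segment_len=4, overlap=1):
        print(segment)

    mask = sparse_attention_mask(seq_len=6, window=2, global_positions=[0])
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In a hierarchical model, each segment would be passed through an encoder and the segment representations aggregated into a document representation. With a sparse mask, the number of allowed attention pairs per position is bounded by the window size plus the number of global positions, instead of growing with the full sequence length.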

Code Implementations: 1 repo