LG · CL · ML · Jun 5, 2020

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

arXiv:2006.03236v1 · 265 citations · Has Code
Originality: Highly original
AI Analysis

This paper addresses the need for more efficient and scalable language-processing architectures, particularly for tasks that require only a single-vector representation of the sequence. The approach is incremental in that it builds on the standard Transformer.

The paper tackles the problem of computational inefficiency in language models by proposing Funnel-Transformer, which compresses token-level representations to reduce redundancy, resulting in improved performance on tasks like text classification and language understanding with comparable or fewer FLOPs.

With the success of language pretraining, it is highly desirable to develop more efficient architectures with good scalability that can exploit the abundant unlabeled data at a lower cost. To improve efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level representation, especially for tasks that only require a single-vector representation of the sequence. With this intuition, we propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the FLOPs saved by length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at https://github.com/laiguokun/Funnel-Transformer.
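The length-reduction idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the sequence is compressed or expanded, so strided mean pooling, upsampling by repetition, and the rough per-layer cost formula are all assumptions made for illustration, and the function names (`mean_pool`, `upsample_repeat`, `layer_cost`, `encoder_cost`) are hypothetical. Attention itself is omitted; only the change in sequence length and its effect on an approximate operation count are shown.

```python
from typing import List

Vector = List[float]


def mean_pool(hidden: List[Vector], stride: int = 2) -> List[Vector]:
    """Shorten a sequence by averaging each window of `stride` vectors (assumed pooling)."""
    pooled = []
    for start in range(0, len(hidden), stride):
        window = hidden[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def upsample_repeat(pooled: List[Vector], stride: int, target_len: int) -> List[Vector]:
    """Expand a pooled sequence to `target_len` by repeating each vector (assumed decoder step)."""
    expanded = [list(v) for v in pooled for _ in range(stride)]
    return expanded[:target_len]


def layer_cost(seq_len: int, d_model: int) -> int:
    """Rough multiply-accumulate count for one Transformer layer (illustrative only).

    Attention scores plus weighted sum: about 2 * T^2 * D.
    Q/K/V/output projections plus a 4*D feed-forward: about 12 * T * D^2.
    """
    return 2 * seq_len ** 2 * d_model + 12 * seq_len * d_model ** 2


def encoder_cost(seq_len: int, d_model: int, layers_per_block: List[int], stride: int = 2) -> int:
    """Sum layer costs over blocks, shrinking the length by `stride` after each block."""
    total = 0
    length = seq_len
    for n_layers in layers_per_block:
        total += n_layers * layer_cost(length, d_model)
        length = -(-length // stride)  # ceiling division
    return total


# Four token vectors are compressed to two, then expanded back to four.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled = mean_pool(hidden, stride=2)
restored = upsample_repeat(pooled, stride=2, target_len=len(hidden))

# A single full-length block of 12 layers versus three blocks of 6 layers each.
standard = encoder_cost(512, 768, [12])
funnel_deeper = encoder_cost(512, 768, [6, 6, 6])
print(pooled, restored, standard, funnel_deeper)
```

Under this simplified cost model, the 18-layer, three-block configuration costs less than the 12-layer full-length one, which is the sense in which saved FLOPs can be re-invested in depth. The layer counts and dimensions are arbitrary example values, not results from the paper.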

Code Implementations: 3 repos
