LG | ML | Jun 18, 2020

I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths

arXiv:2006.10220v2
AI Analysis

This paper addresses a limitation of Transformer models on tasks that require inductive generalization, such as algorithmic procedures. The contribution is incremental: it modifies an existing architecture rather than introducing a new one.

The paper tackles the problem that Transformer models perform poorly on algorithmic tasks requiring generalization to unseen input lengths. It proposes I-BERT, which replaces positional encodings with a recurrent layer and achieves state-of-the-art results on four out of five tasks.

Self-attention has emerged as a vital component of state-of-the-art sequence-to-sequence models for natural language processing in recent years, brought to the forefront by pre-trained bi-directional Transformer models. Its effectiveness is partly due to its non-sequential architecture, which promotes scalability and parallelism but limits the model to inputs of a bounded length. In particular, such architectures perform poorly on algorithmic tasks, where the model must learn a procedure which generalizes to input lengths unseen in training, a capability we refer to as inductive generalization. Identifying the computational limits of existing self-attention mechanisms, we propose I-BERT, a bi-directional Transformer that replaces positional encodings with a recurrent layer. The model inductively generalizes on a variety of algorithmic tasks where state-of-the-art Transformer models fail to do so. We also test our method on masked language modeling tasks where training and validation sets are partitioned to verify inductive generalization. Out of three algorithmic and two natural language inductive generalization tasks, I-BERT achieves state-of-the-art results on four tasks.
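To make the architectural idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the recurrence is a fixed, untrained tanh update and the attention uses identity projections, both chosen only for illustration, and every function name is hypothetical. It shows two things: self-attention without positional information cannot distinguish token order, and a bi-directional recurrent pass placed before attention injects order without any length-dependent position table.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Toy single-head self-attention with identity Q/K/V projections."""
    out = []
    for q in xs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in xs])
        out.append([sum(w * v[d] for w, v in zip(weights, xs))
                    for d in range(len(q))])
    return out

def recurrent_pass(xs, decay=0.5):
    """Assumed stand-in for a learned recurrent layer:
    h_t = tanh(x_t + decay * h_{t-1}), with h_{-1} = 0."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [math.tanh(xi + decay * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states

def bidirectional_recurrence(xs):
    """Concatenate a left-to-right and a right-to-left pass per token."""
    fwd = recurrent_pass(xs)
    bwd = recurrent_pass(xs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def encode(xs):
    """Recurrence first (order information), then attention (no positions)."""
    return self_attention(bidirectional_recurrence(xs))

def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
swapped = [tokens[1], tokens[0], tokens[2]]  # first two tokens exchanged

# Attention alone: the token [1, 0] gets the same output wherever it sits.
order_blind = close(self_attention(tokens)[0], self_attention(swapped)[1])

# With the recurrent pass in front, the same token's output depends on order.
order_aware = not close(encode(tokens)[0], encode(swapped)[1], tol=1e-6)

# No position table is involved, so a much longer input needs no changes.
long_output = encode(tokens * 20)

print(order_blind, order_aware, len(long_output))
```

Because the recurrence is applied step by step, the same parameters serve any input length, which is the property the abstract contrasts with bounded-length positional encodings. Whether a trained model actually generalizes to longer inputs is an empirical question that this sketch does not address.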

Code Implementations: 1 repo
