CL LG Jun 15, 2023

Block-State Transformers

Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin

MILA

arXiv:2306.09539v4 7.630 citations h-index: 54

Originality: Incremental advance

AI Analysis

This work targets both efficiency and performance in language modeling, offering an incremental hybrid approach.

The authors address the gap between state space models (SSMs) and Transformers in language modeling by proposing the Block-State Transformer (BST), a hybrid layer that combines an SSM sublayer for long-range context with a Block Transformer sublayer for short-term sequence representation. BST outperforms similar Transformer-based architectures on perplexity and, when model parallelization is employed, runs more than ten times faster at the layer level than the Block-Recurrent Transformer.

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformers in language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, fully parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
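The hybrid layer described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names (`ssm_scan`, `bst_layer_sketch`), the diagonal SSM parameters `a`, `b`, `c`, the use of the last `num_ctx` SSM outputs before each block as context states, and the absence of learned query/key/value projections, multiple heads, and normalization. The SSM is run here as a sequential loop for readability; a parallel implementation would compute the same outputs differently.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ssm_scan(x, a, b, c):
    """Diagonal linear SSM (illustrative): h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # state carries information from all earlier steps
        y[t] = c * h
    return y

def bst_layer_sketch(x, a, b, c, block_len, num_ctx):
    """Hypothetical single-head hybrid layer: SSM for long range, block attention for short range."""
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    assert seq_len % block_len == 0 and 0 < num_ctx <= block_len
    ctx_seq = ssm_scan(x, a, b, c)                      # long-range summary per position
    causal = np.tril(np.ones((block_len, block_len), dtype=bool))
    out = np.empty_like(x)
    for start in range(0, seq_len, block_len):
        blk = x[start:start + block_len]
        # Short-range: causal self-attention restricted to the current block.
        scores = np.where(causal, blk @ blk.T / np.sqrt(dim), -np.inf)
        attn_out = softmax(scores) @ blk
        if start > 0:
            # Long-range: cross-attend to SSM outputs that precede the block
            # (assumed choice of context states; the first block has none).
            ctx = ctx_seq[start - num_ctx:start]
            attn_out = attn_out + softmax(blk @ ctx.T / np.sqrt(dim)) @ ctx
        out[start:start + block_len] = blk + attn_out   # residual connection
    return out

# Usage: 16 tokens of width 4, split into 4 blocks of 4 tokens each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 4))
decay = np.full(4, 0.9)        # assumed per-channel state decay
ones = np.ones(4)
result = bst_layer_sketch(tokens, decay, ones, ones, block_len=4, num_ctx=2)
print(result.shape)
```

Because the context states for every block come from a single pass of the SSM over the whole input, the per-block attention computations in this sketch do not depend on each other's outputs, which is the property that makes block-level parallelism possible.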
