CL · Sep 5, 2019

Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network

arXiv:1909.02279v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the inference bottleneck of Transformer-based machine translation models, offering a practical speed improvement with incremental methodological innovation.

The paper tackles the slow inference speed of Transformers in machine translation by proposing a hybrid network that combines a self-attention encoder with an RNN decoder. The hybrid decodes 4 times faster than the Transformer, and knowledge distillation keeps its translation quality comparable.

Because of its highly parallelizable architecture, the Transformer is faster to train than RNN-based models and is widely used in machine translation. At inference time, however, each output word requires the hidden states of all previously generated words, which limits parallelization and makes the Transformer much slower than RNN-based models. In this paper, we systematically analyze the time cost of the different components of both the Transformer and the RNN-based model. Based on this analysis, we propose a hybrid network of self-attention and RNN structures in which the highly parallelizable self-attention serves as the encoder and the simpler RNN structure serves as the decoder. Our hybrid network decodes 4 times faster than the Transformer. In addition, with the help of knowledge distillation, it achieves translation quality comparable to the original Transformer.
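The decoding asymmetry described in the abstract can be illustrated with a toy sketch. It is not the paper's model or its timing analysis: the function names, the unparameterized single-head attention, and the fixed 0.5 weights in the recurrent step are all assumptions made for illustration. It only shows that a self-attention decoder step reads every cached state, while a recurrent step reads one.

```python
import math


def attention_step(query, cache):
    """One toy self-attention decoding step (hypothetical, no learned weights).

    The new position attends over every state in the cache, so the work per
    step grows with the number of tokens generated so far.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * state[i] for w, state in zip(weights, cache)) / total
        for i in range(len(query))
    ]


def rnn_step(x, hidden):
    """One toy recurrent step (hypothetical fixed weights of 0.5).

    Only the previous hidden state is needed, so the work per step is constant.
    """
    return [math.tanh(0.5 * xi + 0.5 * hi) for xi, hi in zip(x, hidden)]


def states_read(num_steps):
    """Count the hidden states read while decoding num_steps tokens."""
    attention_reads = sum(range(1, num_steps + 1))  # step t reads t cached states
    rnn_reads = num_steps  # each step reads exactly one state
    return attention_reads, rnn_reads


def decode(num_steps, dim=4):
    """Run both toy decoders side by side on stand-in embeddings."""
    cache, hidden, attended = [], [0.0] * dim, None
    for t in range(num_steps):
        x = [math.sin(t + i) for i in range(dim)]  # stand-in for a token embedding
        cache.append(x)
        attended = attention_step(x, cache)
        hidden = rnn_step(x, hidden)
    return attended, hidden, len(cache)
```

Calling `states_read(10)` gives 55 reads for the attention decoder against 10 for the recurrent one. The count is only a rough proxy: the 4-times speedup reported in the abstract comes from the authors' measurements of full models, which this sketch does not reproduce.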

