LG · CV · Dec 9, 2023

TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

arXiv:2312.05605v1 · 1 citation · h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient sequence models for researchers and practitioners handling long sequences, though it is incremental as it builds directly on MEGA.

The paper tackles scalable sequence processing by proposing TCNCA, which replaces the linear recurrence in MEGA with a temporal convolutional network, reducing the computational complexity from O(L log L) to O(L) in the sequence length L. On EnWik8, TCNCA reaches a lower loss than MEGA with a 1.37x/1.24x faster forward/backward pass during training. On LRA, it achieves similar accuracy with an average 1.28x inference speed-up, and on associative recall it remains superior or competitive. Measured in isolation, the dilated convolutions are up to 7.07x faster in the forward pass than the FFT-based recurrence for sequences up to 131k.
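To make the O(L log L) baseline concrete, here is a minimal sketch of how a linear recurrence can be evaluated in parallel as a long convolution via the FFT. It assumes a scalar exponential moving average with a single hypothetical decay parameter `alpha`. MEGA's actual operator is more elaborate, so the function names and parameters below are illustrative only.

```python
import numpy as np

def ema_recurrent(x, alpha):
    """Sequential evaluation: y[t] = alpha * x[t] + (1 - alpha) * y[t-1], with y[-1] = 0."""
    y = np.zeros(len(x))
    prev = 0.0
    for t, x_t in enumerate(x):
        prev = alpha * x_t + (1.0 - alpha) * prev
        y[t] = prev
    return y

def ema_fft(x, alpha):
    """Same output, computed as a causal convolution with a length-L kernel via the FFT."""
    L = len(x)
    # Unrolling the recurrence gives the kernel k[j] = alpha * (1 - alpha)**j.
    kernel = alpha * (1.0 - alpha) ** np.arange(L)
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(kernel, n), n)
    return y[:L]  # keep only the causal part aligned with the input
```

The FFT route costs O(L log L) because the kernel spans the whole sequence. A fixed-size convolution kernel avoids that cost, which is the trade-off the summary above describes.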

MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(L \log L)$, with $L$ being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$. The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, and a synthetic reasoning benchmark, associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with $1.37\times$/$1.24\times$ faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster than the FFT-based parallelized recurrence on GPUs, making them a scalable candidate for handling very large sequence lengths: they are up to $7.07\times$/$2.86\times$ faster in the forward/backward pass for sequences up to 131k. Furthermore, on LRA, TCNCA achieves, on average, a $1.28\times$ speed-up during inference with accuracy similar to MEGA's. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA on a range of sequence lengths and vocabulary sizes.
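As an illustration of why dilated convolutions scale linearly in $L$ while still covering long contexts, the sketch below stacks generic causal dilated convolutions with exponentially growing dilation. It is not the paper's specific network: the kernel size, dilation base, depth, and all-ones weights are assumptions chosen only to make the receptive field visible.

```python
import numpy as np

def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x as zero before t = 0."""
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift < len(x):
            y[shift:] += w * x[:len(x) - shift]
    return y

def tcn_stack(x, kernel_size, base, depth):
    """Apply `depth` layers with dilations base**0, base**1, ..., base**(depth - 1).

    Weights are fixed to 1 for illustration; a trained network would learn them.
    """
    weights = np.ones(kernel_size)
    for layer in range(depth):
        x = dilated_causal_conv(x, weights, base ** layer)
    return x

def receptive_field(kernel_size, base, depth):
    """Number of past-and-present inputs that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(base ** layer for layer in range(depth))

# Feed a unit impulse: the span of nonzero outputs equals the receptive field.
impulse = np.zeros(64)
impulse[0] = 1.0
response = tcn_stack(impulse, kernel_size=3, base=2, depth=4)
print(receptive_field(3, 2, 4), np.flatnonzero(response)[-1] + 1)
```

Each layer performs `kernel_size * L` multiply-adds, so the whole stack costs a number of operations proportional to `depth * kernel_size * L`. For a fixed network this is linear in $L$, while the receptive field grows exponentially with depth.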

