CV · Dec 20, 2023

Cached Transformers: Improving Transformers with Differentiable Memory Cache

Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

Tencent

arXiv:2312.12742v15.07 citations · h-index: 44 · AAAI

Originality: Highly original

AI Analysis

This work addresses the challenge of long-range dependencies in Transformers for applications in language and vision, representing a novel method rather than an incremental improvement.

The paper tackles the problem of limited receptive fields in Transformers by introducing Cached Transformer with GRC attention, which uses a differentiable memory cache to attend to past and current tokens, achieving significant advancements in six language and vision tasks.

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention attends to both past and current tokens, which enlarges the receptive field of attention and lets the model explore long-range dependencies. By using a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks: language modeling, machine translation, ListOPs, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and can be applied to a broader range of settings.
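The abstract describes the mechanism only in words, so the following is a minimal, dependency-free sketch of the general idea rather than the authors' implementation. It assumes a sigmoid gate that interpolates between the old cache and a chunk-averaged summary of the current tokens, and a second sigmoid weight that mixes cache attention with ordinary self-attention. The names `update_cache`, `cached_attention`, `gate_logit`, and `mix_logit` are hypothetical, and learned projections are omitted. In a real model the two logits would be learned parameters inside an autodiff framework, which is what would make the cache differentiable.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def update_cache(cache, tokens, gate_logit):
    """Gated cache update (assumed form): (1 - g) * old + g * summary."""
    g = sigmoid(gate_logit)
    # Assumption: summarize the current tokens down to the cache length by
    # averaging equal-sized chunks; len(tokens) must be a multiple of len(cache).
    chunk = len(tokens) // len(cache)
    summary = []
    for i in range(len(cache)):
        group = tokens[i * chunk:(i + 1) * chunk]
        summary.append([sum(col) / len(group) for col in zip(*group)])
    return [[(1.0 - g) * c + g * s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(cache, summary)]


def cached_attention(tokens, cache, mix_logit):
    """Mix attention over the cache (past) with self-attention (current)."""
    self_out = attend(tokens, tokens, tokens)
    mem_out = attend(tokens, cache, cache)
    lam = sigmoid(mix_logit)  # assumed scalar mixing weight
    return [[lam * m + (1.0 - lam) * s for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(mem_out, self_out)]


# Toy usage: four 2-d tokens and a two-slot cache that starts empty.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
cache = [[0.0, 0.0], [0.0, 0.0]]
cache = update_cache(cache, tokens, gate_logit=0.0)
output = cached_attention(tokens, cache, mix_logit=0.0)
```

In this sketch the cache has a fixed number of slots, so the extra attention cost per step stays constant no matter how many earlier batches of tokens have been folded into it. That is one plausible reading of how a cache can widen the receptive field cheaply.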
