CV · AI · LG · Jun 6, 2024

Vision-LSTM: xLSTM as Generic Vision Backbone

arXiv:2406.04303v3 · 100 citations
Originality: Synthesis-oriented
AI Analysis

This work proposes a potential alternative to Transformers for vision tasks, but it appears incremental as it adapts an existing method to a new domain without demonstrating broad SOTA results.

The authors seek an alternative to Transformers as generic vision backbones by adapting xLSTM, a scalable LSTM variant, into Vision-LSTM (ViL) for computer vision. Their experiments indicate promise for use as a new backbone.

Transformers are widely used as generic backbones in computer vision, despite initially being introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
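The alternating traversal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `patchify`, `causal_mean_block`, and `alternating_stack` are hypothetical names, and the block is a simple causal running mean standing in for an xLSTM block (the actual blocks use exponential gating and a matrix memory, and details such as residual connections and normalization are omitted here). Only the direction-flipping logic is the point.

```python
from typing import Callable, List

Token = List[float]
Sequence = List[Token]


def patchify(image: List[List[float]], patch: int) -> Sequence:
    """Split an H x W grayscale image into flattened patch tokens.

    Tokens are emitted row by row, so the sequence runs from the
    top-left patch to the bottom-right patch.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens: Sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return tokens


def causal_mean_block(seq: Sequence) -> Sequence:
    """ASSUMPTION: placeholder for an xLSTM block, not the real thing.

    Output t is the mean of inputs 0..t, so each position only sees
    tokens that came earlier in the traversal order.
    """
    running = [0.0] * len(seq[0])
    out: Sequence = []
    for count, token in enumerate(seq, start=1):
        running = [r + x for r, x in zip(running, token)]
        out.append([r / count for r in running])
    return out


def alternating_stack(
    tokens: Sequence, blocks: List[Callable[[Sequence], Sequence]]
) -> Sequence:
    """Apply blocks in order, flipping the traversal for even blocks.

    Blocks are numbered from 1: odd blocks read the sequence as given
    (top to bottom), even blocks read it reversed (bottom to top) and
    the result is flipped back so token positions stay aligned.
    """
    seq = tokens
    for index, block in enumerate(blocks, start=1):
        if index % 2 == 1:
            seq = block(seq)
        else:
            seq = block(seq[::-1])[::-1]
    return seq


if __name__ == "__main__":
    image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    tokens = patchify(image, patch=2)  # 4 tokens of 4 values each
    features = alternating_stack(tokens, [causal_mean_block, causal_mean_block])
    print(len(features), len(features[0]))
```

Because each placeholder block is causal in its own traversal direction, stacking one forward and one backward block lets every output position depend on every input patch. This is one plausible reading of why alternating directions is useful for a non-causal domain such as images.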

Code Implementations: 2 repos
