CVAIMay 26

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

arXiv:2605.2768679.11 citationsh-index: 6
Predicted impact top 28% in CV · last 90 daysOriginality Incremental advance
AI Analysis

For video understanding and occlusion-sensitive reasoning, this provides a lightweight, plug-in module to augment Transformers with persistent spatial memory, though the gains are incremental over existing methods.

Transformers lack persistent spatial memory, hindering long-horizon video tasks. Tensor Memory introduces a fixed-size recurrent 3D memory tensor that decouples state capacity from sequence length, achieving strong results on video benchmarks while integrating seamlessly into existing Transformers.

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes