CV · AI · Nov 25, 2025

MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

arXiv:2511.19963v1
Originality: Incremental advance
AI Analysis

This work addresses a fundamental limitation in computer vision by enabling size-agnostic processing, which could benefit applications requiring flexible image inputs, though it appears incremental as it builds on existing Mamba-based methods.

The paper tackles the problem of building a visual encoder that is agnostic to input size, a key feature of human vision. It proposes MambaEye, a causal sequential encoder that combines a Mamba2 backbone with relative move embedding. The model performs robustly across image resolutions, including 1536^2 on ImageNet-1K classification, with time and memory complexity linear in the number of patches.

Despite decades of progress, a truly input-size agnostic visual encoder, a fundamental characteristic of human vision, has remained elusive. We address this limitation by proposing MambaEye, a novel, causal sequential encoder that leverages a pure Mamba2 backbone for its low complexity and causal processing. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.
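
The idea of encoding the spatial shift between consecutive patches can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `raster_scan` and `relative_moves`, the (row, column) patch coordinates, and the choice of a zero move for the first patch are all assumptions made for illustration. The sketch only shows why a sequence of relative moves is unchanged when the same scan path is translated across the image, which is the inductive bias the abstract describes.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, col) index of a patch; an illustrative convention


def raster_scan(rows: int, cols: int) -> List[Coord]:
    """Hypothetical scan pattern: visit patches row by row, left to right."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def relative_moves(path: List[Coord]) -> List[Coord]:
    """Return the (d_row, d_col) shift from each patch to the next.

    The first patch has no predecessor, so it is assigned (0, 0) here.
    That choice is an assumption of this sketch, not taken from the paper.
    """
    moves: List[Coord] = [(0, 0)]
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        moves.append((r1 - r0, c1 - c0))
    return moves


# A 2x3 patch grid scanned in raster order.
path = raster_scan(2, 3)
moves = relative_moves(path)

# The same path translated by (5, 7) patches yields identical moves,
# because only differences between consecutive positions are encoded.
shifted = [(r + 5, c + 7) for r, c in path]
assert relative_moves(shifted) == moves

print(moves)
```

Because the encoder would see only shifts such as `(0, 1)` or `(1, -2)` rather than absolute positions, the same move vocabulary applies to any grid size or scan order. How the paper maps these shifts to embedding vectors is not specified in the abstract.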

