NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures
This work addresses change detection in remote sensing imagery, offering a competitive alternative to State Space Models with comparable inference latency. The contribution is incremental, as it builds on existing modern vision architectures.
The paper tackles remote sensing change detection by proposing NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder, a deformable attention-based temporal fusion module, and a Mask2Former decoder. It achieves the best results among the evaluated methods on LEVIR-CD, WHU-CD, and CDD, improving over recent Mamba-based baselines in both F1 score and IoU.
State Space Models (SSMs) have recently gained traction in remote sensing change detection (CD) for their favorable scaling properties. In this paper, we explore the potential of modern convolutional and attention-based architectures as a competitive alternative. We propose NeXt2Former-CD, an end-to-end framework that integrates a Siamese ConvNeXt encoder initialized with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. This design is intended to better tolerate residual co-registration noise and small object-level spatial shifts, as well as semantic ambiguity in bi-temporal imagery. Experiments on LEVIR-CD, WHU-CD, and CDD datasets show that our method achieves the best results among the evaluated methods, improving over recent Mamba-based baselines in both F1 score and IoU. Furthermore, despite a larger parameter count, our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.
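The abstract reports its gains in F1 score and IoU. As a minimal sketch of how these two metrics are commonly computed for binary change masks, the snippet below counts true positives, false positives, and false negatives for the "change" class. The function name `change_metrics`, the nested-list mask format, and the convention of returning 0.0 when no change pixel is present are assumptions made for illustration; the paper's actual evaluation code is not shown in the source.

```python
from typing import Sequence, Tuple


def change_metrics(pred: Sequence[Sequence[int]],
                   target: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (F1, IoU) for the 'change' class (label 1) of two binary masks.

    Hypothetical helper for illustration; masks are nested sequences of 0/1.
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1  # change predicted and present
            elif p == 1 and t == 0:
                fp += 1  # change predicted but absent
            elif p == 0 and t == 1:
                fn += 1  # change missed
    # Assumed convention: both metrics are 0.0 when no pixel is predicted
    # or labeled as change, which avoids division by zero.
    f1_denominator = 2 * tp + fp + fn
    iou_denominator = tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    iou = tp / iou_denominator if iou_denominator else 0.0
    return f1, iou


# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
f1, iou = change_metrics(pred, target)
print(f"F1={f1:.4f} IoU={iou:.4f}")
```

For a single class the two metrics are monotonically related, F1 = 2·IoU / (1 + IoU), so a method that improves one on a given dataset also improves the other when both are computed from the same pixel counts.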