CV | AI | Feb 11, 2025

SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer

arXiv:2502.07216v1 | 7 citations | h-index: 25 | MM
Originality: Incremental advance
AI Analysis

The paper addresses inaccurate and inefficient object detection in HRW shots, a domain-specific advance for computer vision applications.

The paper tackles object detection in high-resolution wide (HRW) shots, which pose challenges such as extreme sparsity and huge scale changes. It proposes SparseFormer, a model-agnostic sparse vision transformer that improves detection accuracy by up to 5.8% and speed by up to 3x over state-of-the-art methods on the PANDA and DOTA-v1.0 benchmarks.

Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, making existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the object detection gap between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
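To make the idea of merging detections from sliced windows concrete, here is a minimal sketch of the baseline that a cross-slice suppression step improves on. It is not the paper's C-NMS, whose details the abstract does not give. It is ordinary greedy NMS applied after mapping per-slice boxes into full-image coordinates. The names `to_global`, `iou`, `merge_slices`, the `(x1, y1, x2, y2, score)` box layout, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2, score).
Box = Tuple[float, float, float, float, float]
Offset = Tuple[float, float]


def to_global(boxes: List[Box], offset: Offset) -> List[Box]:
    """Shift slice-local boxes by the slice's top-left corner in the full image."""
    ox, oy = offset
    return [(x1 + ox, y1 + oy, x2 + ox, y2 + oy, s) for x1, y1, x2, y2, s in boxes]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_slices(slice_dets: List[Tuple[Offset, List[Box]]],
                 iou_thr: float = 0.5) -> List[Box]:
    """Plain greedy NMS over detections pooled from all slices (assumed baseline)."""
    pooled: List[Box] = []
    for offset, boxes in slice_dets:
        pooled.extend(to_global(boxes, offset))
    # Visit boxes from highest to lowest score; keep a box only if it does not
    # overlap an already kept box by iou_thr or more.
    pooled.sort(key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for b in pooled:
        if all(iou(b, k) < iou_thr for k in kept):
            kept.append(b)
    return kept


# Two overlapping slices see the same object near their shared border;
# the second slice also contains one separate object.
demo = [
    ((0.0, 0.0), [(90.0, 10.0, 130.0, 50.0, 0.9)]),
    ((100.0, 0.0), [(-8.0, 12.0, 30.0, 50.0, 0.6), (50.0, 60.0, 80.0, 100.0, 0.8)]),
]
merged = merge_slices(demo)
```

In this toy input the duplicate detection from the second slice is suppressed and two boxes remain. A plain merge like this can mishandle objects truncated at slice borders, which is the kind of noise the abstract says C-NMS is designed to handle.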

