CV | Oct 9, 2023

SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation

arXiv:2310.05920v4 | h-index: 67 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient and scalable object detection and segmentation models in computer vision, though it appears incremental because it builds on existing transformer-based detectors.

The paper tackles object detection and segmentation by proposing SimPLR, a plain transformer architecture that uses single-scale features with scale-aware attention instead of multi-scale designs. The resulting model achieves consistently better accuracy and faster runtime than state-of-the-art multi-scale and single-scale alternatives, and it scales better with higher-capacity models and more pre-training data.

The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector, 'SimPLR', whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with bigger capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, and panoptic segmentation. Code is released at https://github.com/kienduynguyen/SimPLR.
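
To make the idea of moving the multi-scale bias into attention more concrete, here is a toy sketch in plain Python. It is not the SimPLR implementation and does not follow the released code. It assumes one simple reading of "scale-aware attention": a query at one location of a single-scale feature map attends over summaries of windows of several sizes centred on that location. The function names (`window_mean`, `scale_aware_attention`) and the `radii` parameter are hypothetical and chosen only for illustration.

```python
import math


def window_mean(feat, cy, cx, radius):
    """Average the feature vectors in a (2r+1) x (2r+1) window, clipped to the map."""
    height, width = len(feat), len(feat[0])
    channels = len(feat[0][0])
    total = [0.0] * channels
    count = 0
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            for c in range(channels):
                total[c] += feat[y][x][c]
            count += 1
    return [t / count for t in total]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def scale_aware_attention(feat, cy, cx, radii=(0, 1, 2)):
    """Toy attention over window sizes on ONE feature map (illustrative assumption).

    feat is a nested list of shape [H][W][C]. The query is the feature at
    (cy, cx); each key/value is the mean feature of a window of a given radius.
    Returns the attended feature and the per-scale attention weights.
    """
    query = feat[cy][cx]
    dim = len(query)
    keys = [window_mean(feat, cy, cx, r) for r in radii]
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * key[c] for w, key in zip(weights, keys)) for c in range(dim)
    ]
    return output, weights


# Usage: a 5 x 5 map with 2 channels, queried at the centre.
feature_map = [[[float(y), float(x)] for x in range(5)] for y in range(5)]
attended, scale_weights = scale_aware_attention(feature_map, 2, 2)
```

The point of the sketch is only that scale selection happens inside the attention weights, so no feature pyramid is built. A real detector would learn the projections and window geometry rather than use fixed mean pooling.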

