CV · Feb 28, 2025

RTGen: Real-Time Generative Detection Transformer

arXiv:2502.20622v2
Originality: Incremental advance
AI Analysis

This work addresses slow inference in open-vocabulary object detection for real-time applications and represents an incremental improvement in efficiency.

The paper tackles the latency and structural redundancy of generative object detectors by proposing RTGen, a real-time model with a Region-Language Decoder (RL-Decoder) and non-autoregressive, DAG-based category naming. RTGen-R34 reaches 131.3 FPS on T4 GPUs, over 270x faster than GenerateU.
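To put the reported throughput in per-image terms, the short calculation below converts 131.3 FPS into latency and derives the bound that "over 270x faster" implies for the baseline. It is a back-of-the-envelope sketch: it assumes the speedup is a plain throughput ratio measured under comparable conditions, and the variable names are illustrative only.

```python
def per_image_latency_ms(fps: float) -> float:
    """Convert throughput in frames per second to milliseconds per image."""
    return 1000.0 / fps


rtgen_fps = 131.3            # reported for RTGen-R34 on T4 GPUs
speedup_lower_bound = 270.0  # "over 270x faster" is treated as a lower bound

rtgen_latency_ms = per_image_latency_ms(rtgen_fps)

# If RTGen is more than 270x faster, the baseline runs below this throughput
# and therefore above this per-image latency.
baseline_fps_upper_bound = rtgen_fps / speedup_lower_bound
baseline_latency_lower_bound_ms = per_image_latency_ms(baseline_fps_upper_bound)

print(f"RTGen-R34: {rtgen_latency_ms:.2f} ms per image")
print(f"Baseline:  under {baseline_fps_upper_bound:.3f} FPS, "
      f"over {baseline_latency_lower_bound_ms:.0f} ms per image")
```

Under these assumptions, RTGen-R34 spends roughly 7.6 ms per image, while the baseline would need more than about two seconds.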

Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270x faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.
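The abstract does not spell out how a name is read off the DAG, so the following is only a generic sketch of greedy DAG decoding, not RTGen's actual procedure. It assumes the model has already produced, in one parallel pass, a token distribution for every vertex and a transition distribution between vertices; the function name, the toy vocabulary, and all probabilities are hypothetical.

```python
from typing import List, Sequence


def greedy_dag_decode(
    token_probs: Sequence[Sequence[float]],
    transition_probs: Sequence[Sequence[float]],
    vocab: Sequence[str],
    eos: str = "<eos>",
) -> List[str]:
    """Follow the most probable path through a DAG of decoder vertices.

    token_probs[i][k] is the probability that vertex i emits vocab[k].
    transition_probs[i][j] is the probability of moving from vertex i to
    vertex j; only j > i is considered, which keeps the graph acyclic.
    """
    num_vertices = len(token_probs)
    tokens: List[str] = []
    vertex = 0
    while True:
        # Emit the most probable token at the current vertex.
        best_token = max(range(len(vocab)), key=lambda k: token_probs[vertex][k])
        if vocab[best_token] == eos:
            break
        tokens.append(vocab[best_token])
        # Jump to the most probable later vertex; vertices may be skipped.
        successors = range(vertex + 1, num_vertices)
        if not successors:
            break
        vertex = max(successors, key=lambda j: transition_probs[vertex][j])
    return tokens


# Toy example with four vertices (all numbers are made up for illustration).
vocab = ["<eos>", "traffic", "light", "sign"]
token_probs = [
    [0.00, 0.90, 0.05, 0.05],  # vertex 0 prefers "traffic"
    [0.10, 0.10, 0.20, 0.60],  # vertex 1 prefers "sign" (skipped below)
    [0.05, 0.05, 0.80, 0.10],  # vertex 2 prefers "light"
    [0.90, 0.05, 0.03, 0.02],  # vertex 3 prefers "<eos>"
]
transition_probs = [
    [0.0, 0.2, 0.7, 0.1],  # from vertex 0, vertex 2 is most likely
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],  # from vertex 2, only vertex 3 remains
    [0.0, 0.0, 0.0, 0.0],
]

name = greedy_dag_decode(token_probs, transition_probs, vocab)
print(" ".join(name))
```

Because every distribution is available up front, the loop only performs cheap lookups. In this kind of scheme the network itself is not called once per generated token, which is the property that separates it from autoregressive naming.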

