CV · Nov 24, 2025

DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

arXiv:2511.18814v1
Originality: Incremental advance
AI Analysis

This work addresses reliable 4D object perception for real-world applications such as autonomous systems. It is an incremental advance: it builds on existing open-set detection methods, adding a new dataset and model.

The paper tackles 4D object detection in streaming video by introducing DetAny4D, an end-to-end framework that predicts 3D bounding boxes directly from sequential inputs. It achieves competitive detection accuracy and significantly improves temporal stability, reducing jitter and inconsistency.

Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

