CV | GR | Jul 3, 2025

PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

arXiv:2507.02393v1 | 1 citation | h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses two obstacles to scalable monocular 3D object detection: high annotation costs and 2D-to-3D ambiguity. The advance is incremental because it builds on existing pseudo-labeling methods.

The paper tackles the problem of data scarcity in monocular 3D object detection by proposing a pseudo-labeling framework that uses video data to aggregate pseudo-LiDARs across frames via object tracking, achieving reliable accuracy and strong scalability without requiring multi-view setups or additional sensors.

Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
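To make the aggregation idea concrete, here is a minimal sketch of merging one object's pseudo-LiDAR points from two frames. It is an illustration under stated assumptions, not the authors' implementation: it assumes known camera intrinsics `K`, a per-frame depth map from some monocular depth estimator, and 3D positions of the same tracked object points in both frames. The rigid alignment uses the Kabsch algorithm, which is swapped in here for illustration; the abstract does not say how the paper aligns tracked points. The function names `backproject`, `kabsch`, and `aggregate` are hypothetical.

```python
import numpy as np


def backproject(depth, K, mask=None):
    """Lift pixels with depth into 3D camera-frame points (a pseudo-LiDAR).

    depth: (H, W) array of metric depths; K: (3, 3) pinhole intrinsics.
    mask:  optional (H, W) boolean array selecting the object's pixels.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v = row index, u = column index
    if mask is None:
        mask = depth > 0               # assumption: non-positive depth is invalid
    z = depth[mask]
    x = (u[mask] - K[0, 2]) * z / K[0, 0]
    y = (v[mask] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def aggregate(ref_points, other_points, other_tracked, ref_tracked):
    """Move another frame's object points into the reference frame and merge.

    other_tracked / ref_tracked: (N, 3) positions of the same tracked object
    points in the other frame and in the reference frame (N >= 3).
    """
    R, t = kabsch(other_tracked, ref_tracked)
    return np.vstack([ref_points, other_points @ R.T + t])
```

Because the transform is estimated from the object's own tracked points, the same code covers static and moving objects as long as they are rigid, and no camera pose is needed. Real depth estimates and tracks are noisy, so a practical version would likely add outlier rejection (for example RANSAC around `kabsch`), which this sketch omits.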

