CV | Sep 9, 2025

MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

Saad Lahlali, Alexandre Fournier Montgieux, Nicolas Granger, Hervé Le Borgne, Quoc Cuong Pham

arXiv:2509.07507v1 | 3.61 citations | h-index: 18 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the annotation bottleneck for 3D object detection in autonomous driving, offering a practical solution with incremental improvements over existing weakly supervised methods.

The paper tackles the cost of annotating 3D data for object detection by proposing MVAT, a weakly supervised method that uses 2D box annotations together with temporal multi-view data to reduce projection ambiguities and improve accuracy. It reports state-of-the-art weakly supervised performance on the nuScenes and Waymo Open datasets and narrows the gap with fully supervised methods.

Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities, since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility from a single viewpoint makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations that are as dense and complete as possible. A Teacher-Student distillation paradigm is employed: the Teacher network learns from single viewpoints, but its targets are derived from temporally aggregated static objects. The Teacher then generates high-quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available in our public repository: https://github.com/CEA-LIST/MVAT
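To make the idea of a multi-view 2D projection loss concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the box parameterization (center, size, yaw about the vertical axis in a KITTI-style frame with x right, y down, z forward), the 3x4 projection matrices and the L1 distance are all assumptions chosen for illustration. The sketch also assumes every box corner lies in front of each camera.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Return the 8 corners (8, 3) of a 3D box rotated by `yaw` about the y-axis.

    Assumed convention (hypothetical): x right, y down, z forward,
    size = (length along x, height along y, width along z) at yaw = 0.
    """
    l, h, w = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * w / 2
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return (rot_y @ np.stack([x, y, z])).T + np.asarray(center, dtype=float)

def project_to_2d_box(corners, P):
    """Project corners with a 3x4 matrix P and return the tight box [x1, y1, x2, y2]."""
    homo = np.hstack([corners, np.ones((len(corners), 1))])
    uvw = homo @ P.T
    uv = uvw[:, :2] / uvw[:, 2:3]  # perspective divide; assumes positive depth
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])

def multi_view_projection_loss(center, size, yaw, views):
    """Mean L1 gap between projected and annotated 2D boxes over all views.

    `views` is a list of (P, box2d) pairs, one per camera or time step.
    """
    corners = box_corners(center, size, yaw)
    errors = [
        np.abs(project_to_2d_box(corners, P) - np.asarray(box2d, dtype=float)).mean()
        for P, box2d in views
    ]
    return float(np.mean(errors))

# Toy usage: two views of the same box, with 2D boxes generated from the true pose.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])
true_pose = dict(center=(0.0, 1.0, 10.0), size=(4.0, 1.5, 2.0), yaw=0.3)
true_corners = box_corners(**true_pose)
views = [(P, project_to_2d_box(true_corners, P)) for P in (P1, P2)]

loss_true = multi_view_projection_loss(views=views, **true_pose)
loss_wrong = multi_view_projection_loss((0.0, 1.0, 10.0), (4.0, 1.5, 2.0), 0.0, views)
```

In this toy setup the loss vanishes for the pose that generated the 2D boxes and is positive for a box with a different heading. Summing the error over several views is what lets additional viewpoints constrain poses that a single 2D box leaves ambiguous.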
