CVJan 20Code
Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous DrivingAlexandre Justo Miro, Ludvig af Klinteberg, Bogdan Timus et al.
Accurate ground truth annotations are critical to supervised learning and evaluating the performance of autonomous vehicle systems. These vehicles are typically equipped with active sensors, such as LiDAR, which scan the environment in predefined patterns. 3D box annotation based on data from such sensors is challenging in dynamic scenarios, where objects are observed at different timestamps, hence different positions. Without proper handling of this phenomenon, systematic errors are prone to being introduced in the box annotations. Our work is the first to discover such annotation errors in widely used, publicly available datasets. Through our novel offline estimation method, we correct the annotations so that they follow physically feasible trajectories and achieve spatial and temporal consistency with the sensor data. For the first time, we define metrics for this problem; and we evaluate our method on the Argoverse 2, MAN TruckScenes, and our proprietary datasets. Our approach increases the quality of box annotations by more than 17% in these datasets. Furthermore, we quantify the annotation errors in them and find that the original annotations are misplaced by up to 2.5 m, with highly dynamic objects being the most affected. Finally, we test the impact of the errors in benchmarking and find that the impact is larger than the improvements that state-of-the-art methods typically achieve with respect to the previous state-of-the-art methods; showing that accurate annotations are essential for correct interpretation of performance. Our code is available at https://github.com/alexandre-justo-miro/annotation-correction-3D-boxes.
CVOct 21, 2025Code
BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal PretrainingAjinkya Khoche, Gergő László Nagy, Maciej Wozniak et al.
Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, however it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real-world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets -- consisting of a point cloud, image, and text description -- mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5\% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27\%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3\% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on https://github.com/kesu1/BlendCLIP.
ROSep 10, 2019
Visual Area Coverage with Attitude-Dependent Camera Footprints by Particle HarvestingSina Sharif Mansouri, Pantelis Sopasakis, George Georgoulas et al.
In aerial visual area coverage missions, the camera footprint changes over time based on the camera position and orientation -- a fact that complicates the whole process of coverage and path planning. This article proposes a solution to the problem of visual coverage by filling the target area with a set of randomly distributed particles and harvesting them by camera footprints. This way, high coverage is obtained at a low computational cost. In this approach, the path planner considers six degrees of freedom (DoF) for the camera movement and commands thrust and attitude references to a lower layer controller, while maximizing the covered area and coverage quality. The proposed method requires a priori information of the boundaries of the target area and can handle areas of very complex and highly non-convex geometry. The effectiveness of the approach is demonstrated in multiple simulations in terms of computational efficiency and coverage.