6.1 | CV | Jun 4
VZCrash: A Large-Scale IMU Dataset of Ego-Vehicle Crashes
Tommaso Bianconcini, Henrique Piñeiro Monteagudo, Aurel Pjetri et al.
We introduce VZCrash, the largest publicly available dataset of real-world vehicle collision data featuring Inertial Measurement Unit (IMU) telemetry. The dataset contains more than 31,000 validated crashes and 158,000 negative samples, including hard cases and distractors. Each sample includes acceleration and angular velocity at 100 Hz, and GPS speed at 1 Hz. Events in VZCrash were captured by devices installed on a fleet of 73,010 commercial vehicles of different sizes driving in the United States over the span of several years. We also present an extensive experimental study enabled by the volume of the dataset. We first benchmark several different approaches, from a simple threshold-based heuristic to state-of-the-art deep learning models. Then, we present an experiment demonstrating the importance of scaling data to train high-quality crash detection models, and we show that scale is especially important when these models need to be deployed into a real-world environment.
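The "simple threshold-based heuristic" mentioned as a baseline can be pictured with a minimal sketch. The abstract does not specify the rule, so the function name `detect_crash`, the 2.5 g threshold, the three-sample persistence window, and the use of horizontal acceleration only are all illustrative assumptions, not the authors' baseline.

```python
import math


def detect_crash(accel_samples, threshold_g=2.5, min_consecutive=3):
    """Flag a window as a crash if horizontal acceleration stays high.

    accel_samples: iterable of (ax, ay, az) tuples in g, assumed sampled
    at 100 Hz as in the dataset description. The threshold and the
    persistence requirement are hypothetical values chosen for illustration.
    """
    run_length = 0
    for ax, ay, _az in accel_samples:
        # Use only the horizontal plane so gravity on the z axis is ignored.
        horizontal_magnitude = math.hypot(ax, ay)
        if horizontal_magnitude > threshold_g:
            run_length += 1
            if run_length >= min_consecutive:
                return True
        else:
            run_length = 0
    return False


# One second of quiet driving, then a 50 ms burst above the threshold.
quiet = [(0.0, 0.0, 1.0)] * 100
burst = [(3.0, 0.5, 1.0)] * 5
print(detect_crash(quiet))          # no sustained spike
print(detect_crash(quiet + burst))  # sustained spike
```

Requiring several consecutive samples is one common way to reject single-sample spikes such as potholes or sensor glitches; a learned model would replace this hand-set rule.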
53.1 | CV | May 21 | Code
Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis
Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare, high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.
CV | Sep 26, 2024
ViewpointDepth: A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts
Aurel Pjetri, Stefano Caprasecca, Leonardo Taccari et al.
Monocular depth estimation is a critical task for autonomous driving and many other computer vision applications. While significant progress has been made in this field, the effects of viewpoint shifts on depth estimation models remain largely underexplored. This paper introduces a novel dataset and evaluation methodology to quantify the impact of different camera positions and orientations on monocular depth estimation performance. We propose a ground truth strategy based on homography estimation and object detection, eliminating the need for expensive LIDAR sensors. We collect a diverse dataset of road scenes from multiple viewpoints and use it to assess the robustness of a modern depth estimation model to geometric shifts. After assessing the validity of our strategy on a public dataset, we provide valuable insights into the limitations of current models and highlight the importance of considering viewpoint variations in real-world applications.
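One plausible reading of a ground truth "based on homography estimation and object detection" is that the road-contact point of a detected object is mapped onto the ground plane through a homography, and its distance is read off there. The abstract gives no details, so the sketch below is an assumption throughout: the helper names, the choice of the bounding box's bottom-center as the contact point, and the example matrix `H` are hypothetical.

```python
import math


def apply_homography(H, u, v):
    """Map pixel (u, v) through a 3x3 homography H given as nested lists."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get plane coordinates.
    return x / w, y / w


def ground_distance(H, box):
    """Distance (in the plane's units) to a detected object's contact point.

    box: (x_min, y_min, x_max, y_max) in pixels. The bottom-center of the
    box is assumed to touch the road, which is an illustrative choice.
    """
    x_min, _y_min, x_max, y_max = box
    u = (x_min + x_max) / 2.0
    v = y_max
    gx, gy = apply_homography(H, u, v)
    return math.hypot(gx, gy)


# Hypothetical homography: a pure scaling of 0.01 m per pixel.
H = [[0.01, 0.0, 0.0],
     [0.0, 0.01, 0.0],
     [0.0, 0.0, 1.0]]
print(ground_distance(H, (200, 100, 400, 400)))
```

In practice the homography would be estimated from known correspondences between image points and road-plane points, and it is only valid for points that actually lie on that plane.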
CV | Feb 20, 2025
RendBEV: Semantic Novel View Synthesis for Self-Supervised Bird's Eye View Segmentation
Henrique Piñeiro Monteagudo, Leonardo Taccari, Aurel Pjetri et al.
Bird's Eye View (BEV) semantic maps have recently garnered a lot of attention as a useful representation of the environment to tackle assisted and autonomous driving tasks. However, most of the existing work focuses on the fully supervised setting, training networks on large annotated datasets. In this work, we present RendBEV, a new method for the self-supervised training of BEV semantic segmentation networks, leveraging differentiable volumetric rendering to receive supervision from semantic perspective views computed by a 2D semantic segmentation model. Our method enables zero-shot BEV semantic segmentation, and already delivers competitive results in this challenging setting. When used as pretraining to then fine-tune on labeled BEV ground-truth, our method significantly boosts performance in low-annotation regimes, and sets a new state of the art when fine-tuning on all available labels.
CV | Feb 6, 2025
An object detection approach for lane change and overtake detection from motion profiles
Andrea Benericetti, Niccolò Bellaccini, Henrique Piñeiro Monteagudo et al.
In the application domain of fleet management and driver monitoring, it is very challenging to obtain relevant driving events and activities from dashcam footage while minimizing the amount of information stored and analyzed. In this paper, we address the identification of overtake and lane change maneuvers with a novel object detection approach applied to motion profiles, a compact representation that condenses driving video footage into a single image. To train and test our model, we created an internal dataset of motion profile images obtained from a heterogeneous set of dashcam videos, manually labeled with overtake and lane change maneuvers by the ego-vehicle. In addition to a standard object-detection approach, we show how the inclusion of CoordConvolution layers further improves the model performance, in terms of mAP and F1 score, yielding state-of-the-art performance when compared to other baselines from the literature. The extremely low computational requirements of the proposed solution make it especially suitable to run on-device.
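The idea behind coordinate-aware convolution layers can be illustrated with a small sketch: before a convolution, two extra channels holding each pixel's normalized x and y position are appended to the input, so the filters can condition on location. The abstract does not describe the authors' implementation; the function name `add_coord_channels`, the nested-list layout `[channel][row][col]`, and the [-1, 1] normalization are assumptions made for illustration.

```python
def add_coord_channels(feature_map):
    """Append normalized x and y coordinate channels to a feature map.

    feature_map: nested lists shaped [channels][height][width].
    Returns a new list shaped [channels + 2][height][width], where the two
    extra channels hold column and row positions scaled to [-1, 1].
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    def normalize(index, size):
        # A single row or column has no extent, so place it at the center.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    x_channel = [[normalize(col, width) for col in range(width)]
                 for _ in range(height)]
    y_channel = [[normalize(row, height) for _ in range(width)]
                 for row in range(height)]
    # List concatenation leaves the caller's feature_map unmodified.
    return feature_map + [x_channel, y_channel]


# A single-channel 2x3 input gains two coordinate channels.
image = [[[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]]
augmented = add_coord_channels(image)
print(len(augmented))  # number of channels after augmentation
```

A convolution applied to the augmented tensor sees position as an ordinary input feature. For motion profiles, where the horizontal and temporal axes carry distinct meaning, location awareness of this kind could plausibly help, which is consistent with the improvement the abstract reports.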