CVMar 6
FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation ModelsAndrew Caunes, Thierry Chateau, Vincent Fremont
Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence- aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
CVFeb 19, 2025
Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X CollaborationKatie Z Luo, Minh-Quan Dao, Zhenzhen Liu et al.
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals dataset is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. Dataset website is available at https://mixedsignalsdataset.cs.cornell.edu/.
CVMay 21, 2025
Multi-View Projection for Unsupervised Domain Adaptation in 3D Semantic SegmentationAndrew Caunes, Thierry Chateau, Vincent Fremont
3D semantic segmentation is essential for autonomous driving and road infrastructure analysis, but state-of-the-art 3D models suffer from severe domain shift when applied across datasets. We propose a multi-view projection framework for unsupervised domain adaptation (UDA). Our method aligns LiDAR scans into coherent 3D scenes and renders them from multiple virtual camera poses to generate large-scale synthetic 2D datasets (PC2D) in various modalities. An ensemble of 2D segmentation models is trained on these modalities, and during inference, hundreds of views per scene are processed, with logits back-projected to 3D using an occlusion-aware voting scheme to produce point-wise labels. These labels can be used directly or to fine-tune a 3D segmentation model in the target domain. We evaluate our approach in both Real-to-Real and Simulation-to-Real UDA, achieving state-of-the-art performance in the Real-to-Real setting. Furthermore, we show that our framework enables segmentation of rare classes, leveraging only 2D annotations for those classes while relying on 3D annotations for others in the source domain.
CVJan 28, 2022
3D-FlowNet: Event-based optical flow estimation with 3D representationHaixin Sun, Minh-Quan Dao, Vincent Fremont
Event-based cameras can overpass frame-based cameras limitations for important tasks such as high-speed motion detection during self-driving cars navigation in low illumination conditions. The event cameras' high temporal resolution and high dynamic range, allow them to work in fast motion and extreme light scenarios. However, conventional computer vision methods, such as Deep Neural Networks, are not well adapted to work with event data as they are asynchronous and discrete. Moreover, the traditional 2D-encoding representation methods for event data, sacrifice the time resolution. In this paper, we first improve the 2D-encoding representation by expanding it into three dimensions to better preserve the temporal distribution of the events. We then propose 3D-FlowNet, a novel network architecture that can process the 3D input representation and output optical flow estimations according to the new encoding methods. A self-supervised training strategy is adopted to compensate the lack of labeled datasets for the event-based camera. Finally, the proposed network is trained and evaluated with the Multi-Vehicle Stereo Event Camera (MVSEC) dataset. The results show that our 3D-FlowNet outperforms state-of-the-art approaches with less training epoch (30 compared to 100 of Spike-FlowNet).
ROJul 24, 2017
A Loosely-Coupled Approach for Metric Scale Estimation in Monocular Vision-Inertial SystemsAriane Spaenlehauer, Vincent Fremont, Y. Ahmet Sekercioglu et al.
In monocular vision systems, lack of knowledge about metric distances caused by the inherent scale ambiguity can be a strong limitation for some applications. We offer a method for fusing inertial measurements with monocular odometry or tracking to estimate metric distances in inertial-monocular systems and to increase the rate of pose estimates. As we performed the fusion in a loosely-coupled manner, each input block can be easily replaced with one's preference, which makes our method quite flexible. We experimented our method using the ORB-SLAM algorithm for the monocular tracking input and Euler forward integration to process the inertial measurements. We chose sets of data recorded on UAVs to design a suitable system for flying robots.
ROJul 19, 2017
Relative Pose Based Redundancy Removal: Collaborative RGB-D Data Transmission in Mobile Visual Sensor NetworksXiaoqin Wang, Y. Ahmet Sekercioglu, Tom Drummond et al.
The Relative Pose based Redundancy Removal(RPRR) scheme is presented, which has been designed for mobile RGB-D sensor networks operating under bandwidth-constrained operational scenarios. Participating sensor nodes detect the redundant visual and depth information to prevent their transmission leading to a significant improvement in wireless channel usage efficiency and power savings. Experimental results show that wireless channel utilization is improved by 250% and battery consumption is halved when the RPRR scheme is used instead of sending the sensor images independently.
HCSep 27, 2012
DAARIA: Driver Assistance by Augmented Reality for Intelligent AutomobilePaul George, Indira Thouvenin, Vincent Fremont et al.
Taking into account the drivers' state is a major challenge for designing new advanced driver assistance systems. In this paper we present a driver assistance system strongly coupled to the user. DAARIA 1 stands for Driver Assistance by Augmented Reality for Intelligent Automobile. It is an augmented reality interface powered by several sensors. The detection has two goals: one is the position of obstacles and the quantification of the danger represented by them. The other is the driver's behavior. A suitable visualization metaphor allows the driver to perceive at any time the location of the relevant hazards while keeping his eyes on the road. First results show that our method could be applied to a vehicle but also to aerospace, fluvial or maritime navigation.