CVNov 20, 2022
IC3D: Image-Conditioned 3D Diffusion for Shape GenerationCristian Sbrolli, Paolo Cudrano, Matteo Frosi et al.
In recent years, Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated exceptional performance in various 2D generative tasks. Following this success, DDPMs have been extended to 3D shape generation, surpassing previous methodologies in this domain. While many of these models are unconditional, some have explored the potential of using guidance from different modalities. In particular, image guidance for 3D generation has been explored through the utilization of CLIP embeddings. However, these embeddings are designed to align images and text, and do not necessarily capture the specific details needed for shape generation. To address this limitation and enhance image-guided 3D DDPMs with augmented 3D understanding, we introduce CISP (Contrastive Image-Shape Pre-training), obtaining a well-structured image-shape joint embedding space. Building upon CISP, we then introduce IC3D, a DDPM that harnesses CISP's guidance for 3D shape generation from single-view images. This generative diffusion model outperforms existing benchmarks in both quality and diversity of generated 3D shapes. Moreover, despite IC3D's generative nature, its generated shapes are preferred by human evaluators over a competitive single-view 3D reconstruction model. These properties contribute to a coherent embedding space, enabling latent interpolation and conditioned generation also from out-of-distribution images. We find IC3D able to generate coherent and diverse completions also when presented with occluded views, rendering it applicable in controlled real-world scenarios.
ROSep 13, 2023
RadarLCD: Learnable Radar-based Loop Closure Detection PipelineMirko Usuelli, Matteo Frosi, Paolo Cudrano et al.
Loop Closure Detection (LCD) is an essential task in robotics and computer vision, serving as a fundamental component for various applications across diverse domains. These applications encompass object recognition, image retrieval, and video analysis. LCD consists in identifying whether a robot has returned to a previously visited location, referred to as a loop, and then estimating the related roto-translation with respect to the analyzed location. Despite the numerous advantages of radar sensors, such as their ability to operate under diverse weather conditions and provide a wider range of view compared to other commonly used sensors (e.g., cameras or LiDARs), integrating radar data remains an arduous task due to intrinsic noise and distortion. To address this challenge, this research introduces RadarLCD, a novel supervised deep learning pipeline specifically designed for Loop Closure Detection using the FMCW Radar (Frequency Modulated Continuous Wave) sensor. RadarLCD, a learning-based LCD methodology explicitly designed for radar systems, makes a significant contribution by leveraging the pre-trained HERO (Hybrid Estimation Radar Odometry) model. Being originally developed for radar odometry, HERO's features are used to select key points crucial for LCD tasks. The methodology undergoes evaluation across a variety of FMCW Radar dataset scenes, and it is compared to state-of-the-art systems such as Scan Context for Place Recognition and ICP for Loop Closure. The results demonstrate that RadarLCD surpasses the alternatives in multiple aspects of Loop Closure Detection.
51.0ROApr 20
EmbodiedLGR: Integrating Lightweight Graph Representation and Retrieval for Semantic-Spatial Memory in Robotic AgentsPaolo Riva, Leonardo Gargani, Matteo Frosi et al.
As the world of agentic artificial intelligence applied to robotics evolves, the need for agents capable of building and retrieving memories and observations efficiently is increasing. Robots operating in complex environments must build memory structures to enable useful human-robot interactions by leveraging the mnemonic representation of the current operating context. People interacting with robots may expect the embodied agent to provide information about locations, events, or objects, which requires the agent to provide precise answers within human-like inference times to be perceived as responsive. We propose the Embodied Light Graph Retrieval Agent (EmbodiedLGR-Agent), a visual-language model (VLM)-driven agent architecture that constructs dense and efficient representations of robot operating environments. EmbodiedLGR-Agent directly addresses the need for an efficient memory representation of the environment by providing a hybrid building-retrieval approach built on parameter-efficient VLMs that store low-level information about objects and their positions in a semantic graph, while retaining high-level descriptions of the observed scenes with a traditional retrieval-augmented architecture. EmbodiedLGR-Agent is evaluated on the popular NaVQA dataset, achieving state-of-the-art performance in inference and querying times for embodied agents, while retaining competitive accuracy on the global task relative to the current state-of-the-art approaches. Moreover, EmbodiedLGR-Agent was successfully deployed on a physical robot, showing practical utility in real-world contexts through human-robot interaction, while running the visual-language model and the building-retrieval pipeline locally.
42.2CVMar 14
SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric RefinementYiran Guo, Simone Mentasti, Xiaofeng Jin et al.
3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation that causes "feature bleeding" at object boundaries , and the "cold start" instability where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To perfectly inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55$\%$ and a semantic mIoU of 49.89$\%$, surpassing the previous best method, EmbodiedOcc++, by 3.65$\%$ and 3.69$\%$ respectively. In challenging embodied prediction tasks, our model reaches 55.72$\%$ SC-IoU and 46.22$\%$ mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
CVApr 27, 2025
OpenFusion++: An Open-vocabulary Real-time Scene Understanding SystemXiaofeng Jin, Matteo Frosi, Matteo Matteucci
Real-time open-vocabulary scene understanding is essential for efficient 3D perception in applications such as vision-language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.
CVApr 27, 2025
Rendering Anywhere You See: Renderability Field-guided Gaussian SplattingXiaofeng Jin, Yan Fang, Matteo Frosi et al.
Scene view synthesis, which generates novel views from limited perspectives, is increasingly vital for applications like virtual reality, augmented reality, and robotics. Unlike object-based tasks, such as generating 360° views of a car, scene view synthesis handles entire environments where non-uniform observations pose unique challenges for stable rendering quality. To address this issue, we propose a novel approach: renderability field-guided gaussian splatting (RF-GS). This method quantifies input inhomogeneity through a renderability field, guiding pseudo-view sampling to enhanced visual consistency. To ensure the quality of wide-baseline pseudo-views, we train an image restoration model to map point projections to visible-light styles. Additionally, our validated hybrid data optimization strategy effectively fuses information of pseudo-view angles and source view textures. Comparative experiments on simulated and real-world data show that our method outperforms existing approaches in rendering stability.
ROSep 12, 2021
ART-SLAM: Accurate Real-Time 6DoF LiDAR SLAMMatteo Frosi, Matteo Matteucci
Real-time six degree-of-freedom pose estimation with ground vehicles represents a relevant and well studied topic in robotics, due to its many applications, such as autonomous driving and 3D mapping. Although some systems exist already, they are either not accurate or they struggle in real-time setting. In this paper, we propose a fast, accurate and modular LiDAR SLAM system for both batch and online estimation. We first apply downsampling and outlier removal, to filter out noise and reduce the size of the input point clouds. Filtered clouds are then used for pose tracking and floor detection, to ground-optimize the estimated trajectory. The availability of a pre-tracker, working in parallel with the filtering process, allows to obtain pre-computed odometries, to be used as aids when performing tracking. Efficient loop closure and pose optimization, achieved through a g2o pose graph, are the last steps of the proposed SLAM pipeline. We compare the performance of our system with state-of-the-art point cloud based methods, LOAM, LeGO-LOAM, A-LOAM, LeGO-LOAM-BOR and HDL, and show that the proposed system achieves equal or better accuracy and can easily handle even cases without loops. The comparison is done evaluating the estimated trajectory displacement using the KITTI and RADIATE datasets.