Gonzalo Ferrer

h-index: 15

24 papers

175 citations

Novelty: 40%

AI Score: 48

Ranked #29,492 of 194,257 authors (top 15%); #10,556 in CV (top 18%)

24 Papers

3.7 | CV | Aug 2, 2022 | Code

T4DT: Tensorizing Time for Learning Temporal 3D Visual Data

Mikhail Usvyatsov, Rafael Ballester-Rippoll, Lina Bashaeva et al.

Unlike 2D raster images, there is no single dominant representation for 3D visual data processing. Different formats like point clouds, meshes, or implicit functions each have their strengths and weaknesses. Still, grid representations such as signed distance functions have attractive properties also in 3D. In particular, they offer constant-time random access and are eminently suitable for modern machine learning. Unfortunately, the storage size of a grid grows exponentially with its dimension. Hence they often exceed memory limits even at moderate resolution. This work proposes using low-rank tensor formats, including the Tucker, tensor train, and quantics tensor train decompositions, to compress time-varying 3D data. Our method iteratively computes, voxelizes, and compresses each frame's truncated signed distance function and applies tensor rank truncation to condense all frames into a single, compressed tensor that represents the entire 4D scene. We show that low-rank tensor compression is extremely compact to store and query time-varying signed distance functions. It significantly reduces the memory footprint of 4D scenes while remarkably preserving their geometric quality. Unlike existing, iterative learning-based approaches like DeepSDF and NeRF, our method uses a closed-form algorithm with theoretical guarantees.
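
As a rough illustration of the storage argument in the abstract above, the sketch below counts the entries of a dense 4D grid and of a tensor-train (TT) representation of the same shape. The grid shape, the TT ranks, and the helper names are hypothetical values chosen for illustration, not figures or code from the paper.

```python
from math import prod


def dense_size(shape):
    """Number of entries in a full grid with the given shape."""
    return prod(shape)


def tt_size(shape, ranks):
    """Number of entries stored by the cores of a tensor train.

    `ranks` holds the internal TT ranks (len(shape) - 1 values);
    the two boundary ranks are fixed to 1.
    """
    if len(ranks) != len(shape) - 1:
        raise ValueError("need len(shape) - 1 internal ranks")
    full = [1, *ranks, 1]
    # Core k has shape (r_{k-1}, n_k, r_k).
    return sum(full[k] * n * full[k + 1] for k, n in enumerate(shape))


# Hypothetical scene: 256^3 voxels over 64 frames, all TT ranks equal to 32.
shape = (256, 256, 256, 64)
ranks = (32, 32, 32)

dense_entries = dense_size(shape)
tt_entries = tt_size(shape, ranks)
print(dense_entries, tt_entries, dense_entries / tt_entries)
```

For these assumed values the dense grid holds 1,073,741,824 entries and the TT cores hold 534,528, roughly 2000 times fewer. The ranks actually needed, and the geometric quality they preserve, depend on the data.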

4.0 | RO | Jun 29, 2022

Conditioned Human Trajectory Prediction using Iterative Attention Blocks

Aleksey Postnikov, Aleksander Gamayunov, Gonzalo Ferrer

Human motion prediction is key to understanding social environments, with direct applications in robotics, surveillance, etc. We present a simple yet effective pedestrian trajectory prediction model that predicts pedestrian positions in urban-like environments, conditioned on the environment: the map and surrounding agents. Our model is a neural architecture that can run several layers of attention blocks and transformers in an iterative sequential fashion, allowing it to capture the important features in the environment that improve prediction. We show that without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models, which makes our approach easily extendable and configurable, depending on the data available. We report results similar to those of SoTA models on publicly available and extensively used datasets with the unimodal prediction metrics ADE and FDE.
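
The ADE and FDE metrics named in the abstract above are commonly defined as the average and the final Euclidean displacement between a predicted and a ground-truth trajectory. The following is a minimal sketch under that common definition; the function name and the sample trajectories are assumptions for illustration and are not taken from the paper.

```python
from math import hypot


def ade_fde(predicted, ground_truth):
    """Return (ADE, FDE) for two equally long lists of (x, y) points."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    # Per-step Euclidean distance between prediction and ground truth.
    errors = [
        hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(errors) / len(errors), errors[-1]


# Hypothetical three-step trajectories.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(predicted, ground_truth)
print(ade, fde)  # 1.0 2.0
```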

12.2 | RO | Apr 19

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

Timur Akhtyamov, Mohamad Al Mdfaa, Javier Antonio Ramirez Benavides et al.

Data-driven navigation algorithms are critically dependent on large-scale, high-quality real-world data collection for successful training and robust performance in realistic and uncontrolled conditions. To enhance the growing family of navigation-related real-world datasets, we introduce EgoWalk - a dataset of 50 hours of human navigation in a diverse set of indoor and outdoor environments, seasons, and locations. Along with the raw and Imitation Learning-ready data, we introduce several pipelines to automatically create subsidiary datasets for other navigation-related tasks, namely natural language goal annotations and traversability segmentation masks. Diversity studies, use cases, and benchmarks for the proposed dataset are provided to demonstrate its practical applicability. We openly release all data processing pipelines and the description of the hardware platform used for data collection to support future research and development in robot navigation systems.

6.5 | CV | Apr 21, 2022

SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis

Anastasiia Kornilova, Marsel Faizullin, Konstantin Pakulev et al.

We present a dataset of 1000 video sequences of human portraits recorded in real and uncontrolled conditions by using a handheld smartphone accompanied by an external high-quality depth camera. The collected dataset contains 200 people captured in different poses and locations and its main purpose is to bridge the gap between raw measurements obtained from a smartphone and downstream applications, such as state estimation, 3D reconstruction, view synthesis, etc. The sensors employed in data collection are the smartphone's camera and Inertial Measurement Unit (IMU), and an external Azure Kinect DK depth camera that is software-synchronized with sub-millisecond precision to the smartphone system. During the recording, the smartphone flash is used to provide a periodic secondary source of lighting. An accurate mask of the foremost person is provided, as well as its impact on the camera alignment accuracy. For evaluation purposes, we compare multiple state-of-the-art camera alignment methods by using a Motion Capture system. We provide a smartphone visual-inertial benchmark for portrait capturing, where we report results for multiple methods and motivate further use of the provided trajectories, available in the dataset, in view synthesis and 3D reconstruction tasks.

2.6 | CV | Apr 12, 2022

EVOPS Benchmark: Evaluation of Plane Segmentation from RGBD and LiDAR Data

Anastasiia Kornilova, Dmitrii Iarosh, Denis Kukushkin et al.

This paper provides the EVOPS dataset for plane segmentation from 3D data, both from RGBD images and LiDAR point clouds. We have designed two annotation methodologies (RGBD and LiDAR) running on well-known and widely used datasets for SLAM evaluation and we have provided a complete set of benchmarking tools including point, plane, and segmentation metrics. The data includes a total of 10k RGBD and 7k LiDAR frames over different selected scenes, which consist of high-quality segmented planes. The experiments report the quality of SOTA methods for RGBD plane segmentation on our annotated data. We have also provided a learnable baseline for plane segmentation in LiDAR point clouds. All labeled data and benchmark tools used have been made publicly available at https://evops.netlify.app/.

1.9 | RO | Jan 18, 2023

DDPEN: Trajectory Optimisation With Sub Goal Generation Model

Aleksander Gamayunov, Aleksey Postnikov, Gonzalo Ferrer

Differential dynamic programming (DDP) is a widely used and powerful trajectory optimization technique; however, due to its internal structure, it is not exempt from local minima. In this paper, we present Differential Dynamic Programming with Escape Network (DDPEN) - a novel approach to avoid DDP local minima by adding a term to the optimization criterion that points in the direction the robot should move in order to escape local minima. To produce these directions, we propose a deep model that takes as input the map of the environment in the form of a costmap together with the desired goal position. The model produces possible future directions that lead to the goal while avoiding local minima, and it can run in real time. The model is trained on a synthetic dataset and the overall system is evaluated in the Gazebo simulator. In this work we show that our proposed method avoids local minima of the trajectory optimization algorithm and successfully executes a 278 m long trajectory with various convex and nonconvex obstacles.

2.8 | CV | Mar 9, 2023 | Code

EVOLIN Benchmark: Evaluation of Line Detection and Association

Kirill Ivanov, Gonzalo Ferrer, Anastasiia Kornilova

Lines are interesting geometrical features commonly seen in indoor and urban environments. A complete benchmark is missing in which one can evaluate lines from a sequential stream of images at all stages: line detection, line association, and pose error. To fill this gap, we present a complete and exhaustive benchmark for visual lines in a SLAM front-end, both for RGB and RGBD, by providing a plethora of complementary metrics. We have also labelled data from well-known SLAM datasets in order to have poses and accurately annotated lines all in one place. In particular, we have evaluated 17 line detection algorithms, 5 line association methods, and the resultant pose error for aligning a pair of frames with several combinations of detector and association method. We have packaged all methods and evaluation metrics and made them publicly available on the web page https://prime-slam.github.io/evolin/.

3.9 | CV | Mar 9, 2023

Dominating Set Database Selection for Visual Place Recognition

Anastasiia Kornilova, Ivan Moskalenko, Timofei Pushkin et al.

This paper presents an approach for creating a visual place recognition (VPR) database for localization in indoor environments from RGBD scanning sequences. The proposed approach is formulated as a minimization problem in terms of a dominating set algorithm on a graph constructed from spatial information, and is referred to as DominatingSet. Our algorithm shows better scene coverage in comparison to other methodologies that are used for database creation. Also, we demonstrate that using DominatingSet, the database can be 250 to 1400 times smaller than the original scanning sequence while maintaining a recall rate of more than 80% on testing sequences. We evaluated our algorithm on 7-scenes and BundleFusion datasets and an additionally recorded sequence in a highly repetitive office setting. In addition, the database selection can produce weakly-supervised labels for fine-tuning neural place recognition algorithms to particular settings, further improving their accuracy. The paper also presents a fully automated pipeline for VPR database creation from RGBD scanning sequences, as well as a set of metrics for VPR database evaluation. The code and released data are available on our web page -- https://prime-slam.github.io/place-recognition-db/
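
To make the dominating-set idea in the abstract above concrete, here is a minimal sketch of the standard greedy approximation for a small dominating set. It is not necessarily the algorithm used in the paper. The graph is a hypothetical example in which vertices stand for frames and an edge means two frames cover overlapping parts of the scene.

```python
def greedy_dominating_set(adjacency):
    """Greedily pick vertices until every vertex is chosen or has a chosen neighbor.

    `adjacency` maps each vertex to the set of its neighbors.
    """
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # Pick the vertex whose closed neighborhood covers the most uncovered vertices.
        best = max(
            adjacency,
            key=lambda v: len(({v} | adjacency[v]) & uncovered),
        )
        chosen.append(best)
        uncovered -= {best} | adjacency[best]
    return chosen


# Hypothetical chain of five frames, each overlapping only with its neighbors.
frame_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
database_frames = greedy_dominating_set(frame_graph)
print(database_frames)  # [1, 3]
```

In this toy case two of the five frames are enough for every frame to be either in the database or adjacent to a database frame.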

5.9 | CV | Jul 3, 2023 | Code

NeSS-ST: Detecting Good and Stable Keypoints with a Neural Stability Score and the Shi-Tomasi Detector

Konstantin Pakulev, Alexander Vakhitov, Gonzalo Ferrer

Learning a feature point detector presents a challenge both due to the ambiguity of the definition of a keypoint and, correspondingly, the need for specially prepared ground truth labels for such points. In our work, we address both of these issues by utilizing a combination of a hand-crafted Shi-Tomasi detector, a specially designed metric that assesses the quality of keypoints, the stability score (SS), and a neural network. We build on the principled and localized keypoints provided by the Shi-Tomasi detector and learn the neural network to select good feature points via the stability score. The neural network incorporates the knowledge from the training targets in the form of the neural stability score (NeSS). Therefore, our method is named NeSS-ST since it combines the Shi-Tomasi detector and the properties of the neural stability score. It only requires sets of images for training without dataset pre-labeling or the need for reconstructed correspondence labels. We evaluate NeSS-ST on HPatches, ScanNet, MegaDepth and IMC-PT demonstrating state-of-the-art performance and good generalization on downstream tasks.
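
For context on the hand-crafted component mentioned in the abstract above, the Shi-Tomasi detector scores a pixel by the smaller eigenvalue of the 2x2 structure tensor built from image gradients in a window around it. The sketch below computes only that classical score from a list of gradient samples. The function name and inputs are assumptions for illustration, and the neural stability score of the paper is not reproduced here.

```python
from math import sqrt


def shi_tomasi_score(gradients):
    """Smaller eigenvalue of the structure tensor of a window.

    `gradients` is a list of (gx, gy) image gradients inside the window.
    """
    a = sum(gx * gx for gx, _ in gradients)  # sum of gx^2
    b = sum(gx * gy for gx, gy in gradients)  # sum of gx * gy
    c = sum(gy * gy for _, gy in gradients)  # sum of gy^2
    # Closed-form minimum eigenvalue of the symmetric matrix [[a, b], [b, c]].
    return (a + c) / 2.0 - sqrt(((a - c) / 2.0) ** 2 + b * b)


# An edge (gradients along one direction only) scores zero,
# while a corner (gradients in two directions) scores higher.
edge_score = shi_tomasi_score([(1.0, 0.0), (2.0, 0.0)])
corner_score = shi_tomasi_score([(1.0, 0.0), (0.0, 1.0)])
print(edge_score, corner_score)  # 0.0 1.0
```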

1.5 | CV | Feb 9

Thegra: Graph-based SLAM for Thermal Imagery

Anastasiia Kornilova, Ivan Moskalenko, Arabella Gromova et al.

Thermal imaging provides a practical sensing modality for visual SLAM in visually degraded environments such as low illumination, smoke, or adverse weather. However, thermal imagery often exhibits low texture, low contrast, and high noise, complicating feature-based SLAM. In this work, we propose a sparse monocular graph-based SLAM system for thermal imagery that leverages general-purpose learned features -- the SuperPoint detector and LightGlue matcher, trained on large-scale visible-spectrum data to improve cross-domain generalization. To adapt these components to thermal data, we introduce a preprocessing pipeline to enhance input suitability and modify core SLAM modules to handle sparse and outlier-prone feature matches. We further incorporate keypoint confidence scores from SuperPoint into a confidence-weighted factor graph to improve estimation robustness. Evaluations on public thermal datasets demonstrate that the proposed system achieves reliable performance without requiring dataset-specific training or fine-tuning a desired feature detector, given the scarcity of quality thermal data. Code will be made available upon publication.

9.6 | CV | Jun 2, 2024 | Code

Visual place recognition for aerial imagery: A survey

Ivan Moskalenko, Anastasiia Kornilova, Gonzalo Ferrer

Aerial imagery and its direct application to visual localization is an essential problem for many Robotics and Computer Vision tasks. While Global Navigation Satellite Systems (GNSS) are the standard default solution for solving the aerial localization problem, they are subject to a number of limitations, such as signal instability or solution unreliability, that make this option less desirable. Consequently, visual geolocalization is emerging as a viable alternative. However, adapting the Visual Place Recognition (VPR) task to aerial imagery presents significant challenges, including weather variations and repetitive patterns. Current VPR reviews largely neglect the specific context of aerial data. This paper introduces a methodology tailored for evaluating VPR techniques specifically in the domain of aerial imagery, providing a comprehensive assessment of various methods and their performance. We not only compare various VPR methods, but also demonstrate the importance of selecting appropriate zoom and overlap levels when constructing map tiles to achieve maximum efficiency of VPR algorithms in the case of aerial imagery. The code is available on our GitHub repository -- https://github.com/prime-slam/aero-vloc.

2.6 | CV | Nov 5, 2021 | Code

SmartDepthSync: Open Source Synchronized Video Recording System of Smartphone RGB and Depth Camera Range Image Frames with Sub-millisecond Precision

Marsel Faizullin, Anastasiia Kornilova, Azat Akhmetyanov et al.

Nowadays, smartphones can produce a synchronized (synced) stream of high-quality data, including RGB images, inertial measurements, and other data. Therefore, smartphones are becoming appealing sensor systems in the robotics community. Unfortunately, there is still the need for external supporting sensing hardware, such as a depth camera precisely synced with the smartphone sensors. In this paper, we propose a hardware-software recording system that presents a heterogeneous structure and contains a smartphone and an external depth camera for recording visual, depth, and inertial data that are mutually synchronized. The system is synced at the time and the frame levels: every RGB image frame from the smartphone camera is exposed at the same moment of time with a depth camera frame with sub-millisecond precision. We provide a method and a tool for sync performance evaluation that can be applied to any pair of depth and RGB cameras. Our system could be replicated, modified, or extended by employing our open-sourced materials.

7.3 | RO | Aug 3, 2021 | Code

Comparison of modern open-source visual SLAM approaches

Dinar Sharafutdinov, Mark Griguletskii, Pavel Kopanev et al.

SLAM is one of the most fundamental areas of research in robotics and computer vision. State-of-the-art solutions have advanced significantly in terms of accuracy and stability. Unfortunately, not all the approaches are available as open-source solutions and free to use. The results of some of them are difficult to reproduce, and there is a lack of comparison on common datasets. In our work, we make a comparative analysis of state-of-the-art open-source methods. We assess the algorithms based on accuracy, computational performance, robustness, and fault tolerance. Moreover, we present a comparison of datasets as well as an analysis of algorithms from a practical point of view. The findings of the work raise several crucial questions for SLAM researchers.

10.4 | RO | Jul 6, 2021 | Code

Open-Source LiDAR Time Synchronization System by Mimicking GNSS-clock

Marsel Faizullin, Anastasiia Kornilova, Gonzalo Ferrer

Data fusion algorithms that employ LiDAR measurements, such as Visual-LiDAR, LiDAR-Inertial, or Multiple LiDAR Odometry and simultaneous localization and mapping (SLAM), rely on precise timestamping schemes that grant synchronicity to data from LiDAR and other sensors. Poor synchronization performance, due to an incorrect timestamping procedure, may negatively affect the algorithms' state estimation results. To provide highly accurate and precise synchronization between the sensors, we introduce an open-source hardware-software system for time synchronization between LiDAR and other sensors that exploits a dedicated hardware LiDAR time synchronization interface by providing an emulated GNSS clock to this interface; no physical GNSS receiver is needed. The emulator is based on a general-purpose microcontroller and, due to concise hardware and software architecture, can be easily modified or extended for synchronization of sets of different sensors such as cameras, inertial measurement units (IMUs), wheel encoders, other LiDARs, etc. In the paper, we provide an example of such a system with synchronized LiDAR and IMU sensors. We conducted an evaluation of the sensor synchronization accuracy and precision, and report 1 microsecond performance. We compared our results with timestamping provided by ROS software and by a LiDAR inner clocking scheme to underline clear advantages over these two baseline methods.

3.7 | CV | May 3, 2024

Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models

Mohamad Al Mdfaa, Raghad Salameh, Geesara Kulathunga et al.

In robotics and computer vision, semantic mapping remains a critical challenge for machines to comprehend complex environments. Traditional panoptic mapping approaches are constrained by fixed labels, limiting their ability to handle novel objects. We present Unified Promptable Panoptic Mapping (UPPM), which leverages foundation models for dynamic labeling without additional training. UPPM is evaluated across three comprehensive levels: Segmentation-to-Map, Map-to-Map, and Segmentation-to-Segmentation. Results demonstrate UPPM attains exceptional geometry reconstruction accuracy (0.61cm on the Flat dataset), the highest panoptic quality (0.414), and better performance compared to state-of-the-art segmentation methods. Furthermore, ablation studies validate the contributions of unified semantics, custom NMS, and blurry frame filtering, with the custom NMS improving the completion ratio by 8.27% on the Flat dataset. UPPM demonstrates effective scene reconstruction with rich semantic labeling across diverse datasets.

3.2 | RO | Oct 1, 2025

VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs

Mohamad Al Mdfaa, Svetlana Lukina, Timur Akhtyamov et al.

Vision-language models (VLMs) have shown potential for robot navigation but encounter fundamental limitations: they lack persistent scene memory, offer limited spatial reasoning, and do not scale effectively with video duration for real-time application. We present VL-KnG, a Visual Scene Understanding system that tackles these challenges using spatiotemporal knowledge graph construction and computationally efficient query processing for navigation goal identification. Our approach processes video sequences in chunks utilizing modern VLMs, creates persistent knowledge graphs that maintain object identity over time, and enables explainable spatial reasoning through queryable graph structures. We also introduce WalkieKnowledge, a new benchmark with about 200 manually annotated questions across 8 diverse trajectories spanning approximately 100 minutes of video data, enabling fair comparison between structured approaches and general-purpose VLMs. Real-world deployment on a differential drive robot demonstrates practical applicability, with our method achieving 77.27% success rate and 76.92% answer accuracy, matching Gemini 2.5 Pro performance while providing explainable reasoning supported by the knowledge graph, computational efficiency for real-time deployment across different tasks, such as localization, navigation and planning. Code and dataset will be released after acceptance.

7.3 | RO | Dec 8, 2021

Transformer based trajectory prediction

Aleksey Postnikov, Aleksander Gamayunov, Gonzalo Ferrer

To plan a safe and efficient route, an autonomous vehicle should anticipate future motions of other agents around it. Motion prediction is an extremely challenging task which has recently gained significant attention from the research community. In this work, we present a simple and yet strong baseline for uncertainty-aware motion prediction based purely on transformer neural networks, which has shown its effectiveness in conditions of domain change. While being easy to implement, the proposed approach achieves competitive performance and ranks 1st on the 2021 Shifts Vehicle Motion Prediction Competition.

6.5 | CV | Sep 7, 2021

CovarianceNet: Conditional Generative Model for Correct Covariance Prediction in Human Motion Prediction

Aleksey Postnikov, Aleksander Gamayunov, Gonzalo Ferrer

The correct characterization of uncertainty when predicting human motion is as important as the accuracy of this prediction. We present a new method to correctly predict the uncertainty associated with the predicted distribution of future trajectories. Our approach, CovarianceNet, is based on a Conditional Generative Model with Gaussian latent variables in order to predict the parameters of a bi-variate Gaussian distribution. The combination of CovarianceNet with a motion prediction model results in a hybrid approach that outputs a uni-modal distribution. We will show how some state-of-the-art methods in motion prediction become overconfident when predicting uncertainty, according to our proposed metric and validated on the ETH dataset (Pellegrini et al., 2009). CovarianceNet correctly predicts uncertainty, which makes our method suitable for applications that use predicted distributions, e.g., planning or decision making.
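
One common way to see why an overconfident covariance is penalized is the negative log-likelihood (NLL) of the observed position under the predicted bi-variate Gaussian. The sketch below uses that generic measure as an assumption. It is not the metric proposed in the paper, and the parameter names are illustrative.

```python
from math import log, pi, sqrt


def bivariate_gaussian_nll(error, sigma_x, sigma_y, rho):
    """NLL of a 2D prediction error under a bi-variate Gaussian.

    `error` is (dx, dy) = observed position minus predicted mean;
    `rho` is the correlation coefficient, with -1 < rho < 1.
    """
    dx, dy = error
    one_minus_rho2 = 1.0 - rho * rho
    # Log of the normalization constant 2*pi*sigma_x*sigma_y*sqrt(1 - rho^2).
    log_norm = log(2.0 * pi * sigma_x * sigma_y * sqrt(one_minus_rho2))
    # Mahalanobis-type quadratic form of the error.
    quad = (
        (dx / sigma_x) ** 2
        + (dy / sigma_y) ** 2
        - 2.0 * rho * dx * dy / (sigma_x * sigma_y)
    )
    return log_norm + quad / (2.0 * one_minus_rho2)


# The same 2 m error is penalized far more under an overconfident
# (too small) covariance than under a wider one.
calibrated = bivariate_gaussian_nll((2.0, 0.0), 1.0, 1.0, 0.0)
overconfident = bivariate_gaussian_nll((2.0, 0.0), 0.1, 0.1, 0.0)
print(calibrated < overconfident)  # True
```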

3.0 | RO | Jul 6, 2021

Best Axes Composition: Multiple Gyroscopes IMU Sensor Fusion to Reduce Systematic Error

Marsel Faizullin, Gonzalo Ferrer

In this paper, we propose an algorithm to combine multiple cheap Inertial Measurement Unit (IMU) sensors to calculate 3D-orientations accurately. Our approach takes into account the inherent and non-negligible systematic error in the gyroscope model and provides a solution based on the error observed during previous instants of time. Our algorithm, the Best Axes Composition (BAC), chooses dynamically the most fitted axes among IMUs to improve the estimation performance. We compare our approach with a probabilistic Multiple IMU (MIMU) approach, and we validate our algorithm in our collected dataset. As a result, it only takes as few as 2 IMUs to significantly improve accuracy, while other MIMU approaches need a higher number of sensors to achieve the same results.

2.6 | CV | Jul 2, 2021

Sub-millisecond Video Synchronization of Multiple Android Smartphones

Azat Akhmetyanov, Anastasiia Kornilova, Marsel Faizullin et al.

This paper addresses the problem of building an affordable, easy-to-set-up synchronized multi-view camera system, which is in demand for many Computer Vision and Robotics applications in high-dynamic environments. In our work, we propose a solution for this problem -- a publicly available Android application for synchronized video recording on multiple smartphones with sub-millisecond accuracy. We present a generalized mathematical model of timestamping for Android smartphones and prove its applicability on 47 different physical devices. Also, we estimate the time drift parameter for those smartphones, which is less than 1.2 msec per minute for most of the considered devices, which makes a smartphone camera system a worthy analog of professional multi-view systems. Finally, we demonstrate the performance of the Android app on a camera system built from Android smartphones, quantitatively on a setup with lights and qualitatively on a panorama stitching task.

7.3 | RO | Jun 21, 2021 | Code

Be your own Benchmark: No-Reference Trajectory Metric on Registered Point Clouds

Anastasiia Kornilova, Gonzalo Ferrer

This paper addresses the problem of assessing trajectory quality in conditions when no ground truth poses are available or when their accuracy is not enough for the specific task - for example, small-scale mapping in outdoor scenes. In our work, we propose a no-reference metric, Mutually Orthogonal Metric (MOM), that estimates the quality of the map from registered point clouds via the trajectory poses. MOM strongly correlates with full-reference trajectory metric Relative Pose Error, making it a trajectory benchmarking tool on setups where 3D sensing technologies are employed. We provide a mathematical foundation for such correlation and confirm it statistically in synthetic environments. Furthermore, since our metric uses a subset of points from mutually orthogonal surfaces, we provide an algorithm for the extraction of such subset and evaluate its performance in synthetic CARLA environment and on KITTI dataset. The code of the proposed metric is publicly available as pip-package.

7.9 | CV | Dec 17, 2020

Relightable 3D Head Portraits from a Smartphone Video

Artem Sevastopolsky, Savva Ignatiev, Gonzalo Ferrer et al.

In this work, a system for creating a relightable 3D portrait of a human head is presented. Our neural pipeline operates on a sequence of frames captured by a smartphone camera with the flash blinking (flash-no flash sequence). A coarse point cloud reconstructed via structure-from-motion software and multi-view denoising is then used as a geometric proxy. Afterwards, a deep rendering network is trained to regress dense albedo, normals, and environmental lighting maps for arbitrary new viewpoints. Effectively, the proxy geometry and the rendering network constitute a relightable 3D portrait model, that can be synthesized from an arbitrary viewpoint and under arbitrary lighting, e.g. directional light, point light, or an environment map. The model is fitted to the sequence of frames with human face-specific priors that enforce the plausibility of albedo-lighting decomposition and operates at the interactive frame rate. We evaluate the performance of the method under varying lighting conditions and at the extrapolated viewpoints and compare with existing relighting methods.

2.2 | RO | Nov 1, 2020

Random Fourier Features based SLAM

Yermek Kapushev, Anastasia Kishkun, Gonzalo Ferrer et al.

This work is dedicated to simultaneous continuous-time trajectory estimation and mapping based on Gaussian Processes (GP). State-of-the-art GP-based models for Simultaneous Localization and Mapping (SLAM) are computationally efficient but can only be used with a restricted class of kernel functions. This paper provides an algorithm based on GP with Random Fourier Features (RFF) approximation for SLAM without any constraints. The advantages of RFF for continuous-time SLAM are that we can consider a broader class of kernels and, at the same time, maintain computational complexity at a reasonably low level by operating in the Fourier space of features. The accuracy-speed trade-off can be controlled by the number of features. Our experimental results on synthetic and real-world benchmarks demonstrate the cases in which our approach provides better results compared to the current state-of-the-art.
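
As background for the abstract above, the sketch below shows the generic Random Fourier Features construction for the squared-exponential (RBF) kernel: the dot product of two random feature vectors approximates the kernel value, and more features give a closer approximation. This is a textbook illustration with assumed names and parameters, not the SLAM algorithm of the paper.

```python
import math
import random


def make_rff(dim, num_features, length_scale, seed=0):
    """Return a feature map z(x) such that z(x) . z(y) approximates the RBF kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, 1 / length_scale^2).
    weights = [
        [rng.gauss(0.0, 1.0 / length_scale) for _ in range(dim)]
        for _ in range(num_features)
    ]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, phases)
        ]

    return features


def rbf_kernel(x, y, length_scale):
    """Exact squared-exponential kernel exp(-||x - y||^2 / (2 * length_scale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


features = make_rff(dim=2, num_features=5000, length_scale=1.0)
x, y = (0.3, -0.2), (1.0, 0.5)
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = rbf_kernel(x, y, 1.0)
print(approx, exact)
```

The number of features is the knob behind the accuracy-speed trade-off mentioned in the abstract: the approximation error shrinks roughly with the inverse square root of the feature count, while the cost of each feature vector grows linearly.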

4.2 | CV | Sep 9, 2020

HSFM-Σnn: Combining a Feedforward Motion Prediction Network and Covariance Prediction

A. Postnikov, A. Gamayunov, G. Ferrer

In this paper, we propose a new method for motion prediction: HSFM-Σnn. Our proposed method combines two different approaches: a feedforward network whose layers are model-based transition functions using the HSFM, and a Neural Network (NN) on each of these layers for covariance prediction. We compare our method with classical methods for covariance estimation, showing their limitations. We also compare it with a learning-based approach, social-LSTM, showing that our method is more precise and efficient.