Rafael Barea

CV
h-index2
5papers
29citations
Novelty51%
AI Score42

5 Papers

ROSep 6, 2023Code
Efficient Baselines for Motion Prediction in Autonomous Driving

Carlos Gómez-Huélamo, Marcos V. Conde, Rafael Barea et al.

Motion Prediction (MP) of multiple surroundings agents is a crucial task in arbitrarily complex environments, from simple robots to Autonomous Driving Stacks (ADS). Current techniques tackle this problem using end-to-end pipelines, where the input data is usually a rendered top-view of the physical information and the past trajectories of the most relevant agents; leveraging this information is a must to obtain optimal performance. In that sense, a reliable ADS must produce reasonable predictions on time. However, despite many approaches use simple ConvNets and LSTMs to obtain the social latent features, State-Of-The-Art (SOTA) models might be too complex for real-time applications when using both sources of information (map and past trajectories) as well as little interpretable, specially considering the physical information. Moreover, the performance of such models highly depends on the number of available inputs for each particular traffic scenario, which are expensive to obtain, particularly, annotated High-Definition (HD) maps. In this work, we propose several efficient baselines for the well-known Argoverse 1 Motion Forecasting Benchmark. We aim to develop compact models using SOTA techniques for MP, including attention mechanisms and GNNs. Our lightweight models use standard social information and interpretable map information such as points from the driveable area and plausible centerlines by means of a novel preprocessing step based on kinematic constraints, in opposition to black-box CNN-based or too-complex graphs methods for map encoding, to generate plausible multimodal trajectories achieving up-to-pair accuracy with less operations and parameters than other SOTA methods. Our code is publicly available at https://github.com/Cram3r95/mapfe4mp .

CVSep 26, 2022
Exploring Attention GAN for Vehicle Motion Prediction

Carlos Gómez-Huélamo, Marcos V. Conde, Miguel Ortiz et al.

The design of a safe and reliable Autonomous Driving stack (ADS) is one of the most challenging tasks of our era. These ADS are expected to be driven in highly dynamic environments with full autonomy, and a reliability greater than human beings. In that sense, to efficiently and safely navigate through arbitrarily complex traffic scenarios, ADS must have the ability to forecast the future trajectories of surrounding actors. Current state-of-the-art models are typically based on Recurrent, Graph and Convolutional networks, achieving noticeable results in the context of vehicle prediction. In this paper we explore the influence of attention in generative models for motion prediction, considering both physical and social context to compute the most plausible trajectories. We first encode the past trajectories using a LSTM network, which serves as input to a Multi-Head Self-Attention module that computes the social context. On the other hand, we formulate a weighted interpolation to calculate the velocity and orientation in the last observation frame in order to calculate acceptable target points, extracted from the driveable of the HDMap information, which represents our physical context. Finally, the input of our generator is a white noise vector sampled from a multivariate normal distribution while the social and physical context are its conditions, in order to predict plausible trajectories. We validate our method using the Argoverse Motion Forecasting Benchmark 1.1, achieving competitive unimodal results.

CVJul 18, 2024
PO-GUISE+: Pose and object guided transformer token selection for efficient driver action recognition

Ricardo Pizarro, Roberto Valle, Rafael Barea et al.

We address the task of identifying distracted driving by analyzing in-car videos using efficient transformers. Although transformer models have achieved outstanding performance in human action recognition tasks, their high computational costs limit their application onboard a vehicle. We introduce POGUISE+, a multi-task video transformer that, given an input clip, predicts the distracted driving action, the driver's pose, and the interacting object. Our enhanced features for token selection are specifically adapted to driver actions by leveraging information about object interaction and the driver's pose. With POGUISE+, we significantly reduce the model's computational demands while maintaining or improving baseline accuracy across various computational budgets. Additionally, to evaluate our model's performance in real-world scenarios, we have developed benchmarks on a Jetson computing platform, demonstrating its effectiveness across different configurations and computational budgets. Our model outperforms current state-of-the-art results on the Drive&Act, 100-Driver, and 3MDAD datasets, while having superior efficiency compared to existing video transformer-based methods.

CVNov 11, 2024Code
Fast and Efficient Transformer-based Method for Bird's Eye View Instance Prediction

Miguel Antunes-García, Luis M. Bergasa, Santiago Montiel-Marín et al.

Accurate object detection and prediction are critical to ensure the safety and efficiency of self-driving architectures. Predicting object trajectories and occupancy enables autonomous vehicles to anticipate movements and make decisions with future information, increasing their adaptability and reducing the risk of accidents. Current State-Of-The-Art (SOTA) approaches often isolate the detection, tracking, and prediction stages, which can lead to significant prediction errors due to accumulated inaccuracies between stages. Recent advances have improved the feature representation of multi-camera perception systems through Bird's-Eye View (BEV) transformations, boosting the development of end-to-end systems capable of predicting environmental elements directly from vehicle sensor data. These systems, however, often suffer from high processing times and number of parameters, creating challenges for real-world deployment. To address these issues, this paper introduces a novel BEV instance prediction architecture based on a simplified paradigm that relies only on instance segmentation and flow prediction. The proposed system prioritizes speed, aiming at reduced parameter counts and inference times compared to existing SOTA architectures, thanks to the incorporation of an efficient transformer-based architecture. Furthermore, the implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1. Code and trained models are available at https://github.com/miguelag99/Efficient-Instance-Prediction

CVApr 3
BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

Miguel Antunes-García, Santiago Montiel-Marín, Fabio Sánchez-García et al.

A robust awareness of how dynamic scenes evolve is essential for Autonomous Driving systems, as they must accurately detect, track, and predict the behaviour of surrounding obstacles. Traditional perception pipelines that rely on modular architectures tend to suffer from cumulative errors and latency. Instance Prediction models provide a unified solution, performing Bird's-Eye-View segmentation and motion estimation across current and future frames using information directly obtained from different sensors. However, a key challenge in these models lies in the effective processing of the dense spatial and temporal information inherent in dynamic driving environments. This level of complexity demands architectures capable of capturing fine-grained motion patterns and long-range dependencies without compromising real-time performance. We introduce BEVPredFormer, a novel camera-only architecture for BEV instance prediction that uses attention-based temporal processing to improve temporal and spatial comprehension of the scene and relies on an attention-based 3D projection of the camera information. BEVPredFormer employs a recurrent-free design that incorporates gated transformer layers, divided spatio-temporal attention mechanisms, and multi-scale head tasks. Additionally, we incorporate a difference-guided feature extraction module that enhances temporal representations. Extensive ablation studies validate the effectiveness of each architectural component. When evaluated on the nuScenes dataset, BEVPredFormer was on par or surpassed State-Of-The-Art methods, highlighting its potential for robust and efficient Autonomous Driving perception.