13.9 CV Jun 1
Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar et al.
Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, even though roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
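To make the idea of cross-attention fusion with a gate concrete, here is a minimal sketch in plain Python. It is not the authors' architecture, which the abstract does not specify: the single attention head, the absence of learned projections, the scalar sigmoid gate, and the names `gated_cross_attention`, `gate_w`, and `gate_b` are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gated_cross_attention(video, hand, gate_w, gate_b=0.0):
    """Fuse hand features into video features (illustrative only).

    video: list of T query vectors of dimension d (video-text features).
    hand:  list of S key/value vectors of dimension d (hand features).
    gate_w: hypothetical gate weights of length 2 * d.
    Returns T vectors: video[t] + g_t * attended_hand[t].
    """
    d = len(video[0])
    fused = []
    for q in video:
        # Scaled dot-product attention of one video step over all hand steps.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in hand])
        attended = [sum(w * h[i] for w, h in zip(weights, hand)) for i in range(d)]
        # Scalar gate computed from the concatenation [q; attended].
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w, q + attended) + gate_b)))
        # Residual fusion: the gate decides how much hand evidence is added.
        fused.append([qi + g * ai for qi, ai in zip(q, attended)])
    return fused


video = [[1.0, 0.0], [0.0, 1.0]]
hand = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = gated_cross_attention(video, hand, gate_w=[0.0] * 4)
closed = gated_cross_attention(video, hand, gate_w=[0.0] * 4, gate_b=-50.0)
```

With a strongly negative gate bias the gate closes and the output falls back to the video features, which is the behaviour one would expect for queries where hand motion carries no useful cue.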
OC Aug 8, 2014
Optimal polygonal L1 linearization and fast interpolation of nonlinear systems
Guillermo Gallego, Daniel Berjón, Narciso García
The analysis of complex nonlinear systems is often carried out using simpler piecewise linear representations of them. A principled and practical technique is proposed to linearize and evaluate arbitrary continuous nonlinear functions using polygonal (continuous piecewise linear) models under the L1 norm. A thorough error analysis is developed to guide an optimal design of two kinds of polygonal approximations in the asymptotic case of a large budget of evaluation subintervals N. The method allows the user to obtain the level of linearization (N) for a target approximation error and vice versa. It is suitable for, but not limited to, an efficient implementation in modern Graphics Processing Units (GPUs), allowing real-time performance of computationally demanding applications. The quality and efficiency of the technique have been measured in detail on two nonlinear functions that are widely used in many areas of scientific computing and are expensive to evaluate.
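As a rough illustration of what a polygonal model and its L1 error look like, the sketch below builds a linear interpolant on N uniform subintervals and estimates the L1 error numerically. It is not the optimal design from the paper: the uniform breakpoints, the Gaussian test function, the interval [-4, 4], and the helper names are assumptions chosen for the example.

```python
import math


def polygonal_interpolant(f, a, b, n):
    """Continuous piecewise linear interpolant of f on [a, b], n uniform subintervals."""
    h = (b - a) / n
    knots = [a + i * h for i in range(n + 1)]
    values = [f(x) for x in knots]

    def p(x):
        # Locate the subinterval, clamping so that x == b uses the last one.
        i = min(max(int((x - a) / h), 0), n - 1)
        t = (x - knots[i]) / h
        return (1.0 - t) * values[i] + t * values[i + 1]

    return p


def l1_error(f, p, a, b, samples=20000):
    """Midpoint-rule estimate of the integral of |f - p| over [a, b]."""
    dx = (b - a) / samples
    total = 0.0
    for k in range(samples):
        x = a + (k + 0.5) * dx
        total += abs(f(x) - p(x))
    return total * dx


def gaussian(x):
    return math.exp(-0.5 * x * x)


# L1 error for increasing budgets of subintervals.
errors = {
    n: l1_error(gaussian, polygonal_interpolant(gaussian, -4.0, 4.0, n), -4.0, 4.0)
    for n in (8, 16, 32)
}
```

For a smooth function, doubling N should cut the error of a linear interpolant by roughly a factor of four, which is the kind of relationship between N and the approximation error that the abstract refers to.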
26.1 RO Apr 18
NaviFormer: A Deep Reinforcement Learning Transformer-like Model to Holistically Solve the Navigation Problem
Daniel Fuertes, Andrea Cavallaro, Carlos R. del-Blanco et al.
Path planning is usually solved by addressing either the (high-level) route planning problem (waypoint sequencing to achieve the final goal) or the (low-level) path planning problem (trajectory prediction between two waypoints avoiding collisions). However, real-world problems usually require simultaneous solutions to the route and path planning subproblems with a holistic and efficient approach. In this paper, we introduce NaviFormer, a deep reinforcement learning model based on a Transformer architecture that solves the global navigation problem by predicting both high-level routes and low-level trajectories. To evaluate NaviFormer, several experiments have been conducted, including comparisons with other algorithms. Results show competitive accuracy from NaviFormer since it can understand the constraints and difficulties of each subproblem and act accordingly to improve performance. Moreover, its superior computation speed proves its suitability for real-time missions.
20.7 RO Apr 18
Multi-stage Planning for Multi-target Surveillance using Aircrafts Equipped with Synthetic Aperture Radars Aware of Target Visibility
Daniel Fuertes, Carlos R. del-Blanco, Fernando Jaureguizar et al.
Generating trajectories for synthetic aperture radar (SAR)-equipped aircraft poses significant challenges due to terrain constraints and the need for straight-flight segments to ensure high-quality imaging. Related works usually focus on trajectory optimization for predefined straight-flight segments that do not adapt to the target visibility, which depends on the 3D terrain and aircraft orientation. In addition, this assumption does not scale well for the multi-target problem, where multiple straight-flight segments that maximize target visibility must be defined for real-time operations. For this purpose, this paper presents a multi-stage planning system. First, the waypoint sequencing to visit all the targets is estimated. Second, straight-flight segments maximizing target visibility according to the 3D terrain are predicted using a novel neural network trained with deep reinforcement learning. Finally, the segments are connected to create a trajectory via optimization that imposes 3D Dubins curves. Evaluations demonstrate the robustness of the system for SAR missions, since it ensures both real-time performance and high-quality multi-target SAR image acquisition that is aware of 3D terrain and target visibility.
AI Nov 30, 2023
TOP-Former: A Multi-Agent Transformer Approach for the Team Orienteering Problem
Daniel Fuertes, Carlos R. del-Blanco, Fernando Jaureguizar et al.
Route planning for a fleet of vehicles is an important task in applications such as package delivery, surveillance, or transportation, often integrated within larger Intelligent Transportation Systems (ITS). This problem is commonly formulated as a Vehicle Routing Problem (VRP) known as the Team Orienteering Problem (TOP). Existing solvers for this problem primarily rely on either linear programming, which provides accurate solutions but requires computation times that grow with the size of the problem, or heuristic methods, which typically find suboptimal solutions in a shorter time. In this paper, we introduce TOP-Former, a multi-agent route planning neural network designed to efficiently and accurately solve the Team Orienteering Problem. The proposed algorithm is based on a centralized Transformer neural network capable of learning to encode the scenario (modeled as a graph) and analyze the complete context of all agents to deliver fast, precise, and collaborative solutions. Unlike other neural network-based approaches that adopt a more local perspective, TOP-Former is trained to understand the global situation of the vehicle fleet and generate solutions that maximize long-term expected returns. Extensive experiments demonstrate that the presented system outperforms most state-of-the-art methods in terms of both accuracy and computation speed.
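For readers unfamiliar with the Team Orienteering Problem, the sketch below shows the problem structure with a simple greedy heuristic: each vehicle starts and ends at a depot, has a travel budget, and the fleet tries to maximize the total reward of visited nodes, each of which can be collected at most once. This is a baseline of the heuristic kind the abstract mentions, not TOP-Former itself; the function `greedy_top`, the reward-per-distance rule, and the toy instance are assumptions for illustration.

```python
import math


def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def greedy_top(depot, nodes, rewards, n_vehicles, budget):
    """Greedy heuristic for a toy Team Orienteering instance.

    Each vehicle repeatedly moves to the unvisited node with the best
    reward-per-distance ratio among those it can reach and still return
    from to the depot within its travel budget.
    """
    unvisited = set(range(len(nodes)))
    routes, total = [], 0.0
    for _ in range(n_vehicles):
        pos, used, route = depot, 0.0, []
        while True:
            best, best_score = None, 0.0
            for j in sorted(unvisited):
                step = dist(pos, nodes[j])
                # Skip nodes that would leave no budget to get back to the depot.
                if used + step + dist(nodes[j], depot) > budget:
                    continue
                score = rewards[j] / (step + 1e-9)
                if score > best_score:
                    best, best_score = j, score
            if best is None:
                break
            used += dist(pos, nodes[best])
            pos = nodes[best]
            route.append(best)
            unvisited.remove(best)
            total += rewards[best]
        routes.append(route)
    return routes, total


nodes = [(1, 0), (2, 0), (0, 1), (0, 3), (5, 5)]
rewards = [1, 3, 1, 2, 10]
routes, total = greedy_top((0, 0), nodes, rewards, n_vehicles=2, budget=6.5)
```

On this toy instance the high-reward node at (5, 5) is out of reach for every vehicle, so the greedy rule settles for the nearby nodes. Myopic choices like this are what a learned, globally aware policy aims to improve on.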
CV Apr 30, 2025
AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis
Enmin Zhong, Carlos R. del-Blanco, Daniel Berjón et al.
Recently, there has been a surge of interest in applying deep learning techniques to animal behavior recognition, particularly leveraging pre-trained visual language models, such as CLIP, due to their remarkable generalization capacity across various downstream tasks. However, adapting these models to the specific domain of animal behavior recognition presents two significant challenges: integrating motion information and devising an effective temporal modeling scheme. In this paper, we propose AnimalMotionCLIP to address these challenges by interleaving video frames and optical flow information in the CLIP framework. Additionally, several temporal modeling schemes using an aggregation of classifiers are proposed and compared: dense, semi-dense, and sparse. As a result, fine temporal actions can be correctly recognized, which is of vital importance in animal behavior analysis. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.
CV Aug 14, 2021
Soccer line mark segmentation and classification with stochastic watershed transform
Daniel Berjón, Carlos Cuevas, Narciso García
Augmented reality applications are beginning to change the way sports are broadcast, providing richer experiences and valuable insights to fans. The first step of augmented reality systems is camera calibration, possibly based on detecting the line markings of the playing field. Most existing proposals for line detection rely on edge detection and the Hough transform, but radial distortion and extraneous edges cause inaccurate or spurious detections of line markings. We propose a novel strategy to automatically and accurately segment and classify line markings. First, line points are segmented using a stochastic watershed transform that is robust to radial distortions, since it makes no assumptions about line straightness, and is unaffected by the presence of players or the ball. The line points are then linked to primitive structures (straight lines and ellipses) using a very efficient procedure that makes no assumptions about the number of primitives that appear in each image. The strategy has been tested on a new, public database composed of 60 annotated images from matches in five stadiums. The results show that the proposed strategy is more robust and accurate than existing approaches, achieving successful line mark detection even in challenging conditions.
MM Mar 3, 2021
Methodology to Assess Quality, Presence, Empathy, Attitude, and Attention in 360-degree Videos for Immersive Communications
Marta Orduna, Pablo Pérez, Jesús Gutiérrez et al.
This paper analyzes the joint assessment of quality, spatial and social presence, empathy, attitude, and attention in three conditions: (A) visualizing and rating the quality of contents in a Head-Mounted Display (HMD), (B) visualizing the contents in an HMD, and (C) visualizing the contents in an HMD where participants can see their hands and take notes. The experiment simulates an immersive communication where participants attend conversations of different genres and from different acquisition perspectives in the context of international experiences. Video quality is evaluated with the Single-Stimulus Discrete Quality Evaluation (SSDQE) methodology. Spatial and social presence are evaluated with questionnaires adapted from the literature. Initial empathy is assessed with the Interpersonal Reactivity Index (IRI), and a questionnaire is designed to evaluate attitude. Attention is evaluated with 3 questions that had pass/fail answers. A total of 54 participants were evenly distributed among conditions A, B, and C, taking into account their international experience backgrounds, to obtain a diverse sample of participants. The results from the subjective test validate the proposed methodology in VR communications, showing that video quality experiments can be adapted to conditions imposed by experiments focused on the evaluation of socioemotional features in terms of long-duration contents, actor and observer acquisition perspectives, and genre. In addition, the positive results related to the sense of presence imply that technology can be relevant in the analyzed use case. The acquisition perspective greatly influences social presence, and all the contents have a positive impact on participants' attitude towards international experiences. The annotated dataset obtained from the experiment, the Student Experiences Around the World dataset (SEAW-dataset), is made publicly available.
CV Jul 1, 2020
FVV Live: A real-time free-viewpoint video system with consumer electronics hardware
Pablo Carballeira, Carlos Carmona, César Díaz et al.
FVV Live is a novel end-to-end free-viewpoint video system, designed for low cost and real-time operation, based on off-the-shelf components. The system has been designed to yield high-quality free-viewpoint video using consumer-grade cameras and hardware, which enables low deployment costs and easy installation for immersive event-broadcasting or videoconferencing. The paper describes the architecture of the system, including acquisition and encoding of multiview plus depth data in several capture servers and virtual view synthesis on an edge server. All the blocks of the system have been designed to overcome the limitations imposed by hardware and network, which impact directly on the accuracy of depth data and thus on the quality of virtual view synthesis. The design of FVV Live allows for an arbitrary number of cameras and capture servers, and the results presented in this paper correspond to an implementation with nine stereo-based depth cameras. FVV Live presents low motion-to-photon and end-to-end delays, which enables seamless free-viewpoint navigation and bilateral immersive communications. Moreover, the visual quality of FVV Live has been assessed through subjective assessment with satisfactory results, and additional comparative tests show that it is preferred over state-of-the-art DIBR alternatives.
CV Jun 30, 2020
FVV Live: Real-Time, Low-Cost, Free Viewpoint Video
Daniel Berjón, Pablo Carballeira, Julián Cabrera et al.
FVV Live is a novel real-time, low-latency, end-to-end free viewpoint system including capture, transmission, synthesis on an edge server and visualization and control on a mobile terminal. The system has been specially designed for low-cost and real-time operation, only using off-the-shelf components.
MM May 9, 2019
Methodology for accurately assessing the quality perceived by users on 360VR contents
Lara Muñoz, César Díaz, Marta Orduna et al.
To properly evaluate the performance of 360VR-specific encoding and transmission schemes, and particularly of the solutions based on viewport adaptation, it is necessary to consider not only the bandwidth saved, but also the quality of the portion of the scene actually seen by users over time. With this motivation, we propose a robust yet flexible methodology for accurately assessing the quality within the viewport throughout the visualization session. This procedure is based on a complete analysis of the geometric relations involved. Moreover, the designed methodology allows for both offline and online usage thanks to the use of different approximations. In this way, our methodology can be used to properly evaluate the implemented strategy regardless of the approach taken, yielding a fairer comparison between approaches.
MM Jan 18, 2019
Video Multimethod Assessment Fusion (VMAF) on 360VR contents
Marta Orduna, César Díaz, Lara Muñoz et al.
This paper describes the subjective experiments and subsequent analysis carried out to validate the application of one of the most robust and influential video quality metrics, Video Multimethod Assessment Fusion (VMAF), to 360VR contents. VMAF is a full-reference metric initially designed to work with traditional 2D contents. Hence, a priori, it cannot be assumed to be compatible with the particularities of the scenario where omnidirectional content is visualized using a Head-Mounted Display (HMD). Therefore, through a complete set of tests, we prove that this metric can be successfully used without any specific training or adjustments to obtain the quality of 360VR sequences actually perceived by users.
OC Oct 10, 2015
Optimal Piecewise Linear Function Approximation for GPU-based Applications
Daniel Berjón, Guillermo Gallego, Carlos Cuevas et al.
Many computer vision and human-computer interaction applications developed in recent years need to evaluate complex and continuous mathematical functions as an essential step toward proper operation. However, rigorous evaluation of this kind of function often implies a very high computational cost, unacceptable in real-time applications. To alleviate this problem, functions are commonly approximated by simpler piecewise-polynomial representations. Following this idea, we propose a novel, efficient, and practical technique to evaluate complex and continuous functions using a nearly optimal design of two types of piecewise linear approximations in the case of a large budget of evaluation subintervals. To this end, we develop a thorough error analysis that yields asymptotically tight bounds to accurately quantify the approximation performance of both representations. It provides an improvement upon previous error estimates and allows the user to control the trade-off between the approximation error and the number of evaluation subintervals. To guarantee real-time operation, the method is suitable for, but not limited to, an efficient implementation in modern Graphics Processing Units (GPUs), where it outperforms previous alternative approaches by exploiting the fixed-function interpolation routines present in their texture units. The proposed technique is a perfect match for any application requiring the evaluation of continuous functions. We have measured in detail its quality and efficiency on several functions and, in particular, on the Gaussian function, because it is extensively used in many areas of computer vision and cybernetics and is expensive to evaluate.
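The trade-off between approximation error and the number of subintervals can be illustrated with the classical textbook bound for linear interpolation on a uniform grid, max|f - p| <= (h^2 / 8) * max|f''|, where h is the subinterval width. The sketch below uses that bound to pick N for a target error and then checks the result numerically. It does not reproduce the paper's tighter asymptotic bounds or its nearly optimal (non-uniform) design; the function names, the interval [-4, 4], and the target error are assumptions for illustration.

```python
import math


def subintervals_for_error(a, b, max_abs_f2, eps):
    """Smallest uniform N such that (h^2 / 8) * max|f''| <= eps, with h = (b - a) / N."""
    return max(1, math.ceil((b - a) * math.sqrt(max_abs_f2 / (8.0 * eps))))


def max_interp_error(f, a, b, n, samples_per_cell=50):
    """Largest sampled gap between f and its uniform linear interpolant."""
    h = (b - a) / n
    worst = 0.0
    for i in range(n):
        x0 = a + i * h
        f0, f1 = f(x0), f(x0 + h)
        for k in range(1, samples_per_cell):
            t = k / samples_per_cell
            approx = (1.0 - t) * f0 + t * f1
            worst = max(worst, abs(f(x0 + t * h) - approx))
    return worst


def gaussian(x):
    return math.exp(-0.5 * x * x)


# For exp(-x^2 / 2), f''(x) = (x^2 - 1) * exp(-x^2 / 2), so max|f''| = 1 at x = 0.
eps = 1e-4
n = subintervals_for_error(-4.0, 4.0, 1.0, eps)
measured = max_interp_error(gaussian, -4.0, 4.0, n)
```

Because the bound is driven by the worst-case curvature, a uniform grid spends many subintervals where the function is nearly flat. Adapting the breakpoints to the function is one way an optimized design can reach the same error with a smaller N.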