Hugo Proença

CV
h-index: 65
36 papers
717 citations
Novelty: 38%
AI Score: 55

36 Papers

CV · Oct 12, 2022 · Code
Deep Learning for Iris Recognition: A Survey

Kien Nguyen, Hugo Proença, Fernando Alonso-Fernandez

In this survey, we provide a comprehensive review of more than 200 papers, technical reports, and GitHub repositories published over the last 10 years on the recent developments of deep learning techniques for iris recognition, covering broad topics on algorithm designs, open-source tools, open challenges, and emerging research. First, we conduct a comprehensive analysis of deep learning techniques developed for two main sub-tasks in iris biometrics: segmentation and recognition. Second, we focus on deep learning techniques for the robustness of iris recognition systems against presentation attacks and via human-machine pairing. Third, we delve deep into deep learning techniques for forensic application, especially in post-mortem iris recognition. Fourth, we review open-source resources and tools in deep learning techniques for iris recognition. Finally, we highlight the technical challenges, emerging research trends, and outlook for the future of deep learning in iris recognition.

CV · Sep 24, 2022 · Code
Face Super-Resolution Using Stochastic Differential Equations

Marcelo dos Santos, Rayson Laroca, Rafael O. Ribeiro et al.

Diffusion models have proven effective for various applications such as image, audio, and graph generation. Other important applications are image super-resolution and the solution of inverse problems. More recently, some works have used stochastic differential equations (SDEs) to generalize diffusion models to continuous time. In this work, we introduce SDEs to generate super-resolution face images. To the best of our knowledge, this is the first time SDEs have been used for such an application. The proposed method achieves better peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and consistency than existing super-resolution methods based on diffusion models. In particular, we also assess the potential application of this method for the face recognition task. A generic facial feature extractor is used to compare the super-resolution images with the ground truth, and superior results were obtained compared with other methods. Our code is publicly available at https://github.com/marcelowds/sr-sde
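
For readers unfamiliar with the PSNR metric named in this abstract, the following is a minimal sketch of its standard definition, 10·log10(MAX² / MSE). The function name `psnr`, the nested-list image format, and the toy pixel values are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists (rows of pixel intensities); max_value is the
    largest possible intensity (255 for 8-bit images).
    """
    ref = [p for row in reference for p in row]
    est = [p for row in estimate for p in row]
    if len(ref) != len(est) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 3x3 "ground truth" and "super-resolved" patches (made-up values).
hr = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
sr = [[54, 55, 60], [59, 77, 61], [75, 62, 60]]
print(round(psnr(hr, sr), 2))
```

A higher PSNR means the reconstruction is closer to the reference in a pixel-wise sense, which is why it is usually reported together with a perceptual measure such as SSIM.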

CV · Mar 9, 2023 · Code
WASD: A Wilder Active Speaker Detection Dataset

Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio et al.

Current Active Speaker Detection (ASD) models achieve great results on AVA-ActiveSpeaker (AVA), using only sound and facial features. Although this approach is applicable in movie setups (AVA), it is not suited for less constrained conditions. To demonstrate this limitation, we propose a Wilder Active Speaker Detection (WASD) dataset, with increased difficulty by targeting the two key components of current ASD: audio and face. Grouped into 5 categories, ranging from optimal conditions to surveillance settings, WASD contains incremental challenges for ASD with tactical impairment of audio and face data. We select state-of-the-art models and assess their performance in two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face are specifically degraded). The results show that: 1) AVA-trained models maintain state-of-the-art performance in the WASD Easy group while underperforming in the Hard one, which shows 2) the similarity between AVA and Easy data; and 3) training on WASD does not improve model performance to AVA levels, particularly for audio impairment and surveillance settings. This shows that AVA does not prepare models for wild ASD and that current approaches are ill-equipped to deal with such conditions. The proposed dataset also contains body data annotations to provide a new source for ASD, and is available at https://github.com/Tiago-Roxo/WASD.

CV · Dec 28, 2022
Periocular Biometrics: A Modality for Unconstrained Scenarios

Fernando Alonso-Fernandez, Josef Bigun, Julian Fierrez et al.

Periocular refers to the externally visible region of the face that surrounds the eye socket. This feature-rich area can provide accurate identification in unconstrained or uncooperative scenarios, where the iris or face modalities may not offer sufficient biometric cues due to factors such as partial occlusion or high subject-to-camera distance. The COVID-19 pandemic has further highlighted its importance, as the ocular region remained the only visible facial area even in controlled settings due to the widespread use of masks. This paper discusses the state of the art in periocular biometrics, presenting an overall framework encompassing its most significant research aspects, which include: (a) ocular definition, acquisition, and detection; (b) identity recognition, including combination with other modalities and use of various spectra; and (c) ocular soft-biometric analysis. Finally, we conclude by addressing current challenges and proposing future directions.

CV · Dec 11, 2024 · Code
ASDnB: Merging Face with Body Cues For Robust Active Speaker Detection

Tiago Roxo, Joana C. Costa, Pedro Inácio et al.

State-of-the-art Active Speaker Detection (ASD) approaches mainly use audio and facial features as input. However, the main hypothesis in this paper is that body dynamics is also highly correlated to "speaking" (and "listening") actions and should be particularly useful in wild conditions (e.g., surveillance settings), where face cannot be reliably accessed. We propose ASDnB, a model that singularly integrates face with body information by merging the inputs at different steps of feature extraction. Our approach splits 3D convolution into 2D and 1D to reduce computation cost without loss of performance, and is trained with adaptive weight feature importance for improved complement of face with body data. Our experiments show that ASDnB achieves state-of-the-art results in the benchmark dataset (AVA-ActiveSpeaker), in the challenging data of WASD, and in cross-domain settings using Columbia. This way, ASDnB can perform in multiple settings, which is positively regarded as a strong baseline for robust ASD models (code available at https://github.com/Tiago-Roxo/ASDnB).
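
To give a rough sense of why splitting a 3D convolution into a 2D and a 1D convolution reduces cost, the sketch below counts the weights of each variant. The channel widths, kernel sizes, and the intermediate width `c_mid` are assumptions chosen for illustration and are not taken from the ASDnB architecture.

```python
def conv3d_weights(c_in, c_out, k_t, k_s):
    """Weights in one full 3D convolution with a k_t x k_s x k_s kernel (bias ignored)."""
    return c_in * c_out * k_t * k_s * k_s

def factorized_weights(c_in, c_mid, c_out, k_t, k_s):
    """Weights in a 2D spatial convolution (1 x k_s x k_s) followed by
    a 1D temporal convolution (k_t x 1 x 1), bias ignored."""
    spatial = c_in * c_mid * k_s * k_s
    temporal = c_mid * c_out * k_t
    return spatial + temporal

# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel.
full = conv3d_weights(64, 64, 3, 3)
split = factorized_weights(64, 64, 64, 3, 3)
print(full, split, full / split)
```

Under these assumed sizes the factorized layer needs fewer than half the weights of the full 3D layer; the actual saving depends on the kernel sizes and on the intermediate width.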

CV · Dec 6, 2024 · Code
BIAS: A Body-based Interpretable Active Speaker Approach

Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio et al.

State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models, and is available at https://github.com/Tiago-Roxo/BIAS.

CV · Jan 28
FD-MAD: Frequency-Domain Residual Analysis for Face Morphing Attack Detection

Diogo J. Paulo, Hugo Proença, João C. Neves

Face morphing attacks present a significant threat to face recognition systems used in electronic identity enrolment and border control, particularly in single-image morphing attack detection (S-MAD) scenarios where no trusted reference is available. In spite of the vast amount of research on this problem, morph detection systems struggle in cross-dataset scenarios. To address this problem, we introduce a region-aware frequency-based morph detection strategy that drastically improves over strong baseline methods in challenging cross-dataset and cross-morph settings using a lightweight approach. Having observed the separability of bona fide and morph samples in the frequency domain of different facial parts, our approach 1) introduces the concept of residual frequency domain, where the frequency of the signal is decoupled from the natural spectral decay to easily discriminate between morph and bona fide data; 2) additionally, we reason in a global and local manner by combining the evidence from different facial regions in a Markov Random Field, which infers a globally consistent decision. The proposed method, trained exclusively on the synthetic morphing attack detection development dataset (SMDD), is evaluated in challenging cross-dataset and cross-morph settings on FRLL-Morph and MAD22 sets. Our approach achieves an average equal error rate (EER) of 1.85% on FRLL-Morph and ranks second on MAD22 with an average EER of 6.12%, while also obtaining a good bona fide presentation classification error rate (BPCER) at a low attack presentation classification error rate (APCER) using only spectral features. These findings indicate that Fourier-domain residual modeling with structured regional fusion offers a competitive alternative to deep S-MAD architectures.
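
The error rates quoted in this abstract (APCER, BPCER, EER) can be illustrated with a small sketch. It assumes that a higher score means "more likely a morph", sweeps the observed scores as candidate thresholds, and reports the point where the two error rates are closest. The function names and the toy scores are hypothetical and unrelated to the paper's data or code.

```python
def error_rates(bona_fide, attacks, threshold):
    """Return (APCER, BPCER) when scores >= threshold are flagged as attacks."""
    # APCER: attack presentations wrongly accepted as bona fide.
    apcer = sum(s < threshold for s in attacks) / len(attacks)
    # BPCER: bona fide presentations wrongly flagged as attacks.
    bpcer = sum(s >= threshold for s in bona_fide) / len(bona_fide)
    return apcer, bpcer

def equal_error_rate(bona_fide, attacks):
    """Approximate the EER by scanning every observed score as a threshold."""
    candidates = sorted(set(bona_fide) | set(attacks))

    def gap(t):
        apcer, bpcer = error_rates(bona_fide, attacks, t)
        return abs(apcer - bpcer)

    best = min(candidates, key=gap)
    apcer, bpcer = error_rates(bona_fide, attacks, best)
    return (apcer + bpcer) / 2.0, best

# Made-up detector scores for illustration only.
bona_fide_scores = [0.05, 0.10, 0.20, 0.30, 0.60]
attack_scores = [0.25, 0.70, 0.80, 0.90, 0.95]
eer, thr = equal_error_rate(bona_fide_scores, attack_scores)
print(eer, thr)
```

With finite score sets the two rates rarely cross exactly, so averaging them at the closest threshold is a common approximation; interpolating between neighbouring thresholds is an alternative.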

CV · Oct 1, 2025 · Code
ZQBA: Zero Query Black-box Adversarial Attack

Joana C. Costa, Tiago Roxo, Hugo Proença et al.

Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.

CV · Feb 27, 2025 · Code
LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks

Joana C. Costa, Tiago Roxo, Hugo Proença et al.

State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at https://github.com/Joana-Cabral/LISArD.
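
The idea of pushing a cross-correlation matrix of clean and perturbed embeddings toward a diagonal matrix can be sketched as follows. This is an assumption-laden illustration in the style of redundancy-reduction losses: the identity matrix as the target, the `off_diag_weight` value, and all function names are hypothetical, and the classification term trained alongside it is omitted. It should not be read as the LISArD implementation.

```python
import math

def standardize(batch):
    """Zero-mean, unit-variance normalization of each embedding dimension."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    # Fall back to 1.0 when a dimension is constant, to avoid division by zero.
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch) / n) or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in batch]

def cross_correlation(clean, perturbed):
    """d x d cross-correlation matrix between two batches of embeddings."""
    zc, zp = standardize(clean), standardize(perturbed)
    n, d = len(zc), len(zc[0])
    return [[sum(zc[k][i] * zp[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]

def similarity_loss(clean, perturbed, off_diag_weight=0.005):
    """Penalize deviation of the cross-correlation matrix from the identity."""
    c = cross_correlation(clean, perturbed)
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag

# Toy 2-D embeddings for a batch of four images (made-up values).
clean_emb = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
swapped_emb = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
print(similarity_loss(clean_emb, clean_emb), similarity_loss(clean_emb, swapped_emb))
```

When the perturbed embeddings match the clean ones the loss is zero; when the embedding dimensions are scrambled, as in the second call, the loss grows, which is the behaviour such a similarity term is meant to penalize.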

CV · Jan 29
Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification

Kailash A. Hambarde, Hugo Proença

Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.

CV · Oct 21, 2021 · Code
Generative Adversarial Graph Convolutional Networks for Human Action Synthesis

Bruno Degardin, João Neves, Vasco Lopes et al.

Synthesising the spatial and temporal dynamics of the human body skeleton remains a challenging task, not only in terms of the quality of the generated shapes, but also of their diversity, particularly to synthesise realistic body movements of a specific action (action conditioning). In this paper, we propose Kinetic-GAN, a novel architecture that leverages the benefits of Generative Adversarial Networks and Graph Convolutional Networks to synthesise the kinetics of the human body. The proposed adversarial architecture can condition up to 120 different actions over local and global body movements while improving sample quality and diversity through latent space disentanglement and stochastic variations. Our experiments were carried out in three well-known datasets, where Kinetic-GAN notably surpasses the state-of-the-art methods in terms of distribution quality metrics while having the ability to synthesise more than one order of magnitude regarding the number of different actions. Our code and models are publicly available at https://github.com/DegardinBruno/Kinetic-GAN.

CV · Jul 8, 2020 · Code
The UU-Net: Reversible Face De-Identification for Visual Surveillance Video Footage

Hugo Proença

We propose a reversible face de-identification method for low-resolution video data, where landmark-based techniques cannot be reliably used. Our solution is able to generate a photo-realistic de-identified stream that meets the data protection regulations and can be publicly released under minimal privacy constraints. Notably, such a stream encapsulates all the information required to later reconstruct the original scene, which is useful for scenarios, such as crime investigation, where the identification of the subjects is of utmost importance. We describe a learning process that jointly optimizes two main components: 1) a public module, which receives the raw data and generates the de-identified stream, where the ID information is surrogated in a photo-realistic and seamless way; and 2) a private module, designed for legal/security authorities, which analyses the public stream and reconstructs the original scene, disclosing the actual IDs of all the subjects in the scene. The proposed solution is landmarks-free and uses a conditional generative adversarial network to generate synthetic faces that preserve pose, lighting, background information and even facial expressions. Also, we enable full control over the set of soft facial attributes that should be preserved between the raw and de-identified data, which broadens the range of applications for this solution. Our experiments were conducted on three different visual surveillance datasets (BIODI, MARS and P-DESTRE) and showed highly encouraging results. The source code is available at https://github.com/hugomcp/uu-net.

CV · Apr 2, 2020 · Code
An Attention-Based Deep Learning Model for Multiple Pedestrian Attributes Recognition

Ehsan Yaghoubi, Diana Borza, João Neves et al.

The automatic characterization of pedestrians in surveillance footage is a tough challenge, particularly when the data is extremely diverse with cluttered backgrounds, and subjects are captured from varying distances, under multiple poses, with partial occlusion. Having observed that the state-of-the-art performance is still unsatisfactory, this paper provides a novel solution to the problem, with two-fold contributions: 1) considering the strong semantic correlation between the different full-body attributes, we propose a multi-task deep model that uses an element-wise multiplication layer to extract more comprehensive feature representations. In practice, this layer serves as a filter to remove irrelevant background features, and is particularly important to handle complex, cluttered data; and 2) we introduce a weighted-sum term to the loss function that not only relativizes the contribution of each task (kind of attribute) but also is crucial for performance improvement in multiple-attribute inference settings. Our experiments were performed on two well-known datasets (RAP and PETA) and point to the superiority of the proposed method with respect to the state-of-the-art. The code is available at https://github.com/Ehsan-Yaghoubi/MAN-PAR-.

LG · Jan 30, 2020 · Code
Person Re-identification: Implicitly Defining the Receptive Fields of Deep Learning Classification Frameworks

Ehsan Yaghoubi, Diana Borza, Aruna Kumar et al.

The receptive fields of deep learning classification models determine the regions of the input data that have the most significance for providing correct decisions. The primary way to learn such receptive fields is to train the models upon masked data, which helps the networks to ignore any unwanted regions, but has two major drawbacks: 1) it often yields edge-sensitive decision processes; and 2) it considerably increases the computational cost of the inference phase. This paper describes a solution for implicitly driving the inference of the networks' receptive fields, by creating synthetic learning data composed of interchanged segments that should be a priori important/irrelevant for the network decision. In practice, we use a segmentation module to distinguish between the foreground (important)/background (irrelevant) parts of each learning instance, and randomly swap segments between image pairs, while keeping the class label exclusively consistent with the label of the deemed important segments. This strategy typically drives the networks to early convergence and appropriate solutions, where the identity and clutter descriptions are not correlated. Moreover, this data augmentation solution has various interesting properties: 1) it is parameter-free; 2) it fully preserves the label information; and 3) it is compatible with the typical data augmentation techniques. In the empirical validation, we considered the person re-identification problem and evaluated the effectiveness of the proposed solution in the well-known Richly Annotated Pedestrian (RAP) dataset for two different settings (upper-body and full-body), observing highly competitive results over the state-of-the-art. Under a reproducible research paradigm, both the code and the empirical evaluation protocol are available at https://github.com/Ehsan-Yaghoubi/reid-strong-baseline.

LG · Aug 10, 2024
A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Hamed Gholipour, Farid Bozorgnia, Kailash Hambarde et al.

The Laplacian learning method is a well-established technique in classical graph-based semi-supervised learning, but its potential in the quantum domain remains largely unexplored. This study investigates the performance of the Laplacian-based Quantum Semi-Supervised Learning (QSSL) method across four benchmark datasets -- Iris, Wine, Breast Cancer Wisconsin, and Heart Disease. Further analysis explores the impact of increasing qubit counts, revealing that adding more qubits to a quantum system does not always improve performance. The effectiveness of additional qubits depends on the quantum algorithm and how well it matches the dataset. Additionally, we examine the effects of varying entangling layers on entanglement entropy and test accuracy. The performance of Laplacian learning is highly dependent on the number of entangling layers, with optimal configurations varying across different datasets. Typically, moderate levels of entanglement offer the best balance between model complexity and generalization capabilities. These observations highlight the crucial need for precise hyperparameter tuning tailored to each dataset to achieve optimal performance in Laplacian learning methods.
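
As background for the classical technique this abstract builds on, here is a minimal sketch of graph-based Laplacian learning (harmonic label propagation): labeled nodes keep their values and each unlabeled node is repeatedly set to the weighted average of its neighbours. It covers only the classical method, not the quantum variant studied in the paper; the function name, the toy graph, and the iteration count are assumptions.

```python
def laplacian_learning(weights, labels, iterations=500):
    """Harmonic label propagation on a weighted graph.

    weights: symmetric adjacency matrix (list of lists, zero diagonal).
    labels:  dict mapping labeled node index -> known value (e.g. 0.0 or 1.0).
    Returns one value per node; unlabeled nodes converge to the weighted
    average of their neighbours, which is the harmonic solution.
    """
    n = len(weights)
    f = [labels.get(i, 0.5) for i in range(n)]  # neutral start for unlabeled nodes
    for _ in range(iterations):
        for i in range(n):
            if i in labels:
                continue  # labeled nodes stay clamped to their known value
            degree = sum(weights[i])
            if degree > 0:
                f[i] = sum(weights[i][j] * f[j] for j in range(n)) / degree
    return f

# Toy path graph 0 - 1 - 2 - 3 with the two end nodes labeled.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = laplacian_learning(path, {0: 0.0, 3: 1.0})
print([round(s, 3) for s in scores])
```

On this path graph the unlabeled nodes settle at evenly spaced values between the two labels, and thresholding the scores at 0.5 would assign each node to the nearer labeled end.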

CV · Apr 30
Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa et al.

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, and this variability is not captured in binary settings. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake origin is not in the data source but in its content, can semantic mismatch be assessed by the state-of-the-art? This paper proposes a new evaluation setup, extending the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities with the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the FakeAVCeleb dataset, highlighting the limitations of existing approaches when faced with semantic mismatch data. We further introduce three RARV-SMM variants that expose distinct architectural vulnerabilities as audio-visual divergence increases. We also propose a semantic reinforcement strategy that incorporates the semantic mismatch class and ImageBind embeddings to improve DeepFake detection in both our proposed and state-of-the-art settings, on FakeAVCeleb and LAV-DF, paving the way to more realistic DeepFake detectors. The source code and data are available at https://github.com/.

CV · Jan 5
SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting

Sara Inácio, Hugo Proença, João C. Neves

The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene's hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.

CV · May 7, 2025
DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition

Kailash A. Hambarde, Nzakiese Mbongo, Pavan Kumar MP et al.

Person reidentification (ReID) technology has been considered to perform relatively well under controlled, ground-level conditions, but it breaks down when deployed in challenging real-world settings. Evidently, this is due to extreme data variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts from clothing or session drifts. Moreover, the publicly available datasets do not realistically incorporate such kinds and magnitudes of variability, which limits the progress of this technology. This paper introduces DetReIDX, a large-scale aerial-ground person dataset that was explicitly designed as a stress test for ReID under real-world conditions. DetReIDX is a multi-session set that includes over 13 million bounding boxes from 509 identities, collected in seven university campuses from three continents, with drone altitudes between 5.8 and 120 meters. More importantly, as a key novelty, DetReIDX subjects were recorded in (at least) two sessions on different days, with changes in clothing, daylight and location, making it suitable to actually evaluate long-term person ReID. In addition, data were annotated with 16 soft biometric attributes and multitask labels for detection, tracking, ReID, and action recognition. In order to provide empirical evidence of the usefulness of DetReIDX, we considered the specific tasks of human detection and ReID, where SOTA methods catastrophically degrade in performance (up to 80% in detection accuracy and over 70% in Rank-1 ReID) when exposed to DetReIDX's conditions. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/

CV · Jan 30, 2025
Human Re-ID Meets LVLMs: What can we expect?

Kailash Hambarde, Pranita Samale, Hugo Proença

Large vision-language models (LVLMs) have been regarded as a breakthrough advance in an astounding variety of tasks, from content generation to virtual assistants and multimodal search or retrieval. However, for many of these applications, the performance of these methods has been widely criticized, particularly when compared with state-of-the-art methods and technologies in each specific domain. In this work, we compare the performance of the leading large vision-language models in the human re-identification task, using as a baseline the performance attained by state-of-the-art AI models specifically designed for this problem. We compare the results of ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max to a baseline ReID PersonViT model, using the well-known Market1501 dataset. Our evaluation pipeline includes the dataset curation, prompt engineering, and metric selection to assess the models' performance. Results are analyzed from many different perspectives: similarity scores, classification accuracy, and classification metrics, including precision, recall, F1 score, and area under curve (AUC). Our results confirm the strengths of LVLMs, but also their severe limitations that often lead to catastrophic answers and should be the scope of further research. As a concluding remark, we speculate about some further research that should fuse traditional and LVLMs to combine the strengths from both families of techniques and achieve solid improvements in performance.

CV · Dec 6, 2024
How to Squeeze An Explanation Out of Your Model

Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio et al.

Deep learning models are widely used nowadays for their reliability in performing various tasks. However, they do not typically provide the reasoning behind their decisions, which is a significant drawback, particularly for more sensitive areas such as biometrics, security and healthcare. The most commonly used approaches to provide interpretability create visual attention heatmaps of regions of interest on an image based on model gradient backpropagation. Although this is a viable approach, current methods are targeted toward image settings and default/standard deep learning models, meaning that they require significant adaptations to work on video/multi-modal settings and custom architectures. This paper proposes an approach for interpretability that is model-agnostic, based on a novel use of the Squeeze and Excitation (SE) block that creates visual attention heatmaps. By including an SE block prior to the classification layer of any model, we are able to retrieve the most influential features via SE vector manipulation, one of the key components of the SE block. Our results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings, namely biometrics of facial features with CelebA and behavioral biometrics using Active Speaker Detection datasets. Furthermore, our proposal does not compromise model performance toward the original task, and has competitive results with current interpretability approaches in state-of-the-art object datasets, highlighting its robustness to perform in varying data aside from the biometric context.

CV · Mar 11, 2024
Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework

Henrique Jesus, Hugo Proença

Large vision models based on deep learning architectures have been consistently advancing the state-of-the-art in biometric recognition. However, three weaknesses are commonly reported for such approaches: 1) their extreme demands in terms of learning data; 2) the difficulties in generalising between different domains; and 3) the lack of interpretability/explainability, with biometrics being of particular interest, as it is important to provide evidence able to be used for forensics/legal purposes (e.g., in courts). To the best of our knowledge, this paper describes the first recognition framework/strategy that aims at addressing the three weaknesses simultaneously. At first, it relies exclusively on synthetic samples for learning purposes. Instead of requiring a large amount and variety of samples for each subject, the idea is to exclusively enroll a 3D point cloud per identity. Then, using generative strategies, we synthesize a very large (potentially infinite) number of samples, containing all the desired covariates (poses, clothing, distances, perspectives, lighting, occlusions,...). Depending on the synthesizing method used, it is possible to adapt precisely to different kinds of domains, which accounts for generalization purposes. Such data are then used to learn a model that performs local registration between image pairs, establishing positive correspondences between body parts that are the key, not only to recognition (according to cardinality and distribution), but also to provide an interpretable description of the response (e.g.: "both samples are from the same person, as they have similar facial shape, hair color and legs thickness").

CVJan 4
VReID-XFD: Video-based Person Re-identification at Extreme Far Distance Challenge Results

Kailash A. Hambarde, Hugo Proença, Md Rashidunnabi et al.

Person re-identification (ReID) across aerial and ground views at extreme far distances introduces a distinct operating regime where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the appearance-based assumptions of existing ReID systems. To study this regime, we introduce VReID-XFD, a video-based benchmark and community challenge for extreme far-distance (XFD) aerial-to-ground person re-identification. VReID-XFD is derived from the DetReIDX dataset and comprises 371 identities, 11,288 tracklets, and 11.75 million frames, captured across altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The benchmark supports aerial-to-aerial, aerial-to-ground, and ground-to-aerial evaluation under strict identity-disjoint splits, with rich physical metadata. The VReID-XFD-25 Challenge attracted 10 teams with hundreds of submissions. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best-performing SAS-PReID method achieves only 43.93 percent mAP in the aerial-to-ground setting. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/ .

CVNov 20, 2025
StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Diogo J. Paulo, João Martins, Hugo Proença et al.

Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.

CVJul 25, 2025
Bias Analysis for Synthetic Face Detection: A Case Study of the Impact of Facial Attributes

Asmae Lamsaf, Lucia Cascone, Hugo Proença et al.

Bias analysis for synthetic face detection is bound to become a critical topic in the coming years. Although many detection models have been developed and several datasets have been released to reliably identify synthetic content, one crucial aspect has been largely overlooked: these models and training datasets can be biased, leading to failures in detection for certain demographic groups and raising significant social, legal, and ethical issues. In this work, we introduce an evaluation framework to contribute to the analysis of bias of synthetic face detectors with respect to several facial attributes. This framework exploits synthetic data generation, with evenly distributed attribute labels, to mitigate any skew in the data that could otherwise influence the outcomes of bias analysis. We build on the proposed framework to provide an extensive case study of the bias level of five state-of-the-art detectors in synthetic datasets with 25 controlled facial attributes. While the results confirm that, in general, synthetic face detectors are biased towards the presence/absence of specific facial attributes, our study also sheds light on the origins of the observed bias through the analysis of the correlations with the balancing of facial attributes in the training sets of the detectors, and the analysis of the detectors' activation maps in image pairs with controlled attribute modifications.

CVJun 28, 2025
AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results

Kien Nguyen, Clinton Fookes, Sridha Sridharan et al.

Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset's complexity. For additional details, please refer to the official website at https://agvpreid25.github.io.

CVMay 18, 2023
How Deep Learning Sees the World: A Survey on Adversarial Attacks & Defenses

Joana C. Costa, Tiago Roxo, Hugo Proença et al.

Deep Learning is currently used to perform multiple tasks, such as object recognition, face recognition, and natural language processing. However, Deep Neural Networks (DNNs) are vulnerable to perturbations that alter the network prediction (adversarial examples), raising concerns regarding their usage in critical areas, such as self-driving vehicles, malware detection, and healthcare. This paper compiles the most recent adversarial attacks, grouped by the attacker capacity, and modern defenses clustered by protection strategies. We also present the new advances regarding Vision Transformers, summarize the datasets and metrics used in the context of adversarial settings, and compare the state-of-the-art results under different attacks, finishing with the identification of open issues.

CVJul 14, 2021
YinYang-Net: Complementing Face and Body Information for Wild Gender Recognition

Tiago Roxo, Hugo Proença

Soft biometrics inference in surveillance scenarios is a topic of interest for various applications, particularly in security-related areas. However, soft biometric analysis is not extensively reported in wild conditions. In particular, previous works on gender recognition report their results in face datasets, with relatively good image quality and frontal poses. Given the uncertainty of the availability of the facial region in wild conditions, we consider that these methods are not adequate for surveillance settings. To overcome these limitations, we: 1) present frontal and wild face versions of three well-known surveillance datasets; and 2) propose YinYang-Net (YY-Net), a model that effectively and dynamically complements facial and body information, which makes it suitable for gender recognition in wild conditions. The frontal and wild face datasets derive from widely used Pedestrian Attribute Recognition (PAR) sets (PETA, PA-100K, and RAP), using a pose-based approach to filter the frontal samples and facial regions. This approach retrieves the facial region of images with varying image/subject conditions, where state-of-the-art face detectors often fail. YY-Net combines facial and body information through a learnable fusion matrix and a channel-attention sub-network, focusing on the most influential body parts according to the specific image/subject features. We compare it with five PAR methods, consistently obtaining state-of-the-art results on gender recognition, and reducing the prediction errors by up to 24% in frontal samples. The announced PAR dataset versions and YY-Net serve as the basis for wild soft biometrics classification and are available at https://github.com/Tiago-Roxo.

CVMay 14, 2021
REGINA - Reasoning Graph Convolutional Networks in Human Action Recognition

Bruno Degardin, Vasco Lopes, Hugo Proença

It is known that the kinematics of the human body skeleton reveals valuable information in action recognition. Recently, modeling skeletons as spatio-temporal graphs with Graph Convolutional Networks (GCNs) has been reported to solidly advance the state-of-the-art performance. However, GCN-based approaches exclusively learn from raw skeleton data, and are expected to extract the inherent structural information on their own. This paper describes REGINA, introducing a novel way to REasoning Graph convolutional networks IN Human Action recognition. The rationale is to provide the GCNs with additional knowledge about the skeleton data, obtained by handcrafted features, in order to facilitate the learning process, while guaranteeing that it remains fully trainable in an end-to-end manner. The challenge is to capture complementary information over the dynamics between consecutive frames, which is the key information extracted by state-of-the-art GCN techniques. Moreover, the proposed strategy can be easily integrated into existing GCN-based methods, which we also regard positively. Our experiments were carried out on well-known action recognition datasets and allowed us to conclude that REGINA contributes to solid improvements in performance when incorporated into other GCN-based approaches, without any other adjustment to the original method. For reproducibility, the REGINA code and all the experiments carried out will be publicly available at https://github.com/DegardinBruno.

CVMay 12, 2021
Is Gender "In-the-Wild" Inference Really a Solved Problem?

Tiago Roxo, Hugo Proença

Soft biometrics analysis is seen as an important research topic, given its relevance to various applications. However, even though it is frequently seen as a solved task, it can still be very hard to perform in wild conditions, under varying image conditions, uncooperative poses, and occlusions. Considering the gender trait as our topic of study, we report an extensive analysis of the feasibility of its inference regarding image (resolution, luminosity, and blurriness) and subject-based features (face and body keypoints confidence). Using three state-of-the-art datasets (PETA, PA-100K, RAP) and five Person Attribute Recognition models, we correlate feature analysis with gender inference accuracy using the Shapley value, enabling us to perceive the importance of each image/subject-based feature. Furthermore, we analyze face-based gender inference and assess the pose effect on it. Our results suggest that: 1) image-based features are more influential for low-quality data; 2) an increase in image quality translates into higher subject-based feature importance; 3) face-based gender inference accuracy correlates with image quality increase; and 4) subjects' frontal pose promotes an implicit attention towards the face. The reported results are seen as a basis for subsequent developments of inference approaches in uncontrolled outdoor environments, which typically correspond to visual surveillance conditions.
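
To make the role of the Shapley value in this analysis concrete, the sketch below computes exact Shapley values for three image-based features under a toy value function. The feature names come from the abstract, but the accuracy figures in `toy_accuracy` are invented for illustration and are not results from the paper; how the authors estimate the values is also not stated here.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a coalition value function `value(frozenset)`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when it joins coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_accuracy(subset):
    """Hypothetical accuracy of a gender classifier given a feature subset."""
    gains = {"resolution": 0.10, "luminosity": 0.04, "blurriness": 0.06}
    acc = 0.60 + sum(gains[f] for f in subset)
    # Toy interaction: resolution and blurriness help more when used together
    if {"resolution", "blurriness"} <= subset:
        acc += 0.02
    return acc

features = ["resolution", "luminosity", "blurriness"]
phi = shapley_values(features, toy_accuracy)
```

The interaction bonus is split evenly between the two features that produce it, and the values sum to the gap between using all features and using none, which is the property that makes Shapley values attractive for ranking feature importance.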

CVJun 19, 2020
A Symbolic Temporal Pooling method for Video-based Person Re-Identification

S V Aruna Kumar, Ehsan Yaghoubi, Hugo Proença

In video-based person re-identification, both the spatial and temporal features are known to provide orthogonal cues to effective representations. Such representations are currently typically obtained by aggregating the frame-level features using max/avg pooling, at different points of the models. However, such operations also decrease the amount of discriminating information available, which is particularly hazardous in case of poor separability between the different classes. To alleviate this problem, this paper introduces a symbolic temporal pooling method, where frame-level features are represented in a distribution-valued symbolic form, obtained by fitting an Empirical Cumulative Distribution Function (ECDF) to each feature. Also, considering that the original triplet loss formulation cannot be applied directly to this kind of representation, we introduce a symbolic triplet loss function that infers the similarity between two symbolic objects. Having carried out an extensive empirical evaluation of the proposed solution against the state-of-the-art, on four well-known datasets (MARS, iLIDS-VID, PRID2011 and P-DESTRE), the observed results point to consistent improvements in performance over the previous best-performing techniques.
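
The contrast between max/avg pooling and a distribution-valued summary can be sketched as follows. Each feature's frame-level values are summarised by a handful of quantiles of its ECDF rather than by a single number. The fixed-quantile encoding and the distance function below are assumptions chosen for illustration; the abstract does not describe the exact symbolic representation or the symbolic triplet loss.

```python
import numpy as np

def ecdf(samples):
    """Return sorted values and ECDF heights F(x_i) = i / n for one feature."""
    x = np.sort(np.asarray(samples, dtype=float))
    heights = np.arange(1, x.size + 1) / x.size
    return x, heights

def symbolic_temporal_pool(frame_features, n_quantiles=5):
    """Summarise a (T, D) tracklet as a (D, n_quantiles) array of quantiles.

    Reading the ECDF at fixed probabilities keeps the spread of each feature
    over time, which max or average pooling would collapse to one value.
    """
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return np.quantile(frame_features, probs, axis=0).T

def symbolic_distance(a, b):
    """Hypothetical dissimilarity: mean absolute gap between quantile summaries."""
    return float(np.abs(a - b).mean())

rng = np.random.default_rng(0)
tracklet_a = rng.normal(0.0, 1.0, size=(30, 4))  # 30 frames, 4 features
tracklet_b = rng.normal(0.0, 3.0, size=(30, 4))  # same mean, wider spread
pooled_a = symbolic_temporal_pool(tracklet_a)
pooled_b = symbolic_temporal_pool(tracklet_b)
gap = symbolic_distance(pooled_a, pooled_b)
```

The two synthetic tracklets share the same expected mean, so average pooling would struggle to separate them, while the quantile summaries differ because their spreads differ.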

CVApr 6, 2020
The P-DESTRE: A Fully Annotated Dataset for Pedestrian Detection, Tracking, Re-Identification and Search from Aerial Devices

S. V. Aruna Kumar, Ehsan Yaghoubi, Abhijit Das et al.

Over the last decades, the world has been witnessing growing threats to the security in urban spaces, which has augmented the relevance given to visual surveillance solutions able to detect, track and identify persons of interest in crowds. In particular, unmanned aerial vehicles (UAVs) are a potential tool for this kind of analysis, as they provide a cheap way for data collection, cover large and difficult-to-reach areas, while reducing human staff demands. In this context, all the available datasets are exclusively suitable for the pedestrian re-identification problem, in which the multi-camera views per ID are taken on a single day, and allow the use of clothing appearance features for identification purposes. Accordingly, the main contributions of this paper are two-fold: 1) we announce the UAV-based P-DESTRE dataset, which is the first of its kind to provide consistent ID annotations across multiple days, making it suitable for the extremely challenging problem of person search, i.e., where no clothing information can be reliably used. Apart from this feature, the P-DESTRE annotations enable research on UAV-based pedestrian detection, tracking, re-identification and soft biometric solutions; and 2) we compare the results attained by state-of-the-art pedestrian detection, tracking, re-identification and search techniques in well-known surveillance datasets, to the effectiveness obtained by the same techniques in the P-DESTRE data. Such a comparison makes it possible to identify the most problematic data degradation factors of UAV-based data for each task, and can be used as a baseline for subsequent advances in this kind of technology. The dataset and the full details of the empirical evaluation carried out are freely available at http://p-destre.di.ubi.pt/.

CVFeb 26, 2020
A Quadruplet Loss for Enforcing Semantically Coherent Embeddings in Multi-output Classification Problems

Hugo Proença, Ehsan Yaghoubi, Pendar Alirezazadeh

This paper describes one objective function for learning semantically coherent feature embeddings in multi-output classification problems, i.e., when the response variables have dimension higher than one. In particular, we consider the problems of identity retrieval and soft biometrics labelling in visual surveillance environments, which have been attracting growing interest. Inspired by the triplet loss [34] function, we propose a generalization that: 1) defines a metric that considers the number of agreeing labels between pairs of elements; and 2) disregards the notion of anchor, replacing d(A1, A2) < d(A1, B) by d(A, B) < d(C, D), for A, B, C, D distance constraints, according to the number of agreeing labels between pairs. Like the triplet loss formulation, our proposal also privileges small distances between positive pairs, but at the same time explicitly enforces that the distance between other pairs corresponds directly to their similarity in terms of agreeing labels. This yields feature embeddings with a strong correspondence between the class centroids and their semantic descriptions, i.e., where elements are closer to others that share some of their labels than to elements with fully disjoint label membership. As a practical effect, the proposed loss can be seen as particularly suitable for performing joint coarse (soft label) + fine (ID) inference, based on simple rules such as k-neighbours, which is a novelty with respect to previous related loss functions. Also, in contrast to its triplet counterpart, the proposed loss is agnostic with regard to any demanding criteria for mining learning instances (such as the semi-hard pairs). Our experiments were carried out on five different datasets (BIODI, LFW, IJB-A, Megaface and PETA) and validate our assumptions, showing highly promising results.
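
The anchor-free constraint d(A, B) < d(C, D) can be sketched as a hinge loss over one quadruplet of samples, where the pair sharing more labels is required to be closer in the embedding space. The margin value, the squared Euclidean distance, and the function names are assumptions modelled on the usual triplet loss; the abstract does not give the exact formulation used in the paper.

```python
import numpy as np

def agreeing_labels(y1, y2):
    """Number of positions at which two multi-output label vectors agree."""
    return int(np.sum(np.asarray(y1) == np.asarray(y2)))

def quadruplet_loss(emb, labels, quad, margin=0.2):
    """Hinge penalty when the more similar pair is not closer by `margin`.

    `quad` holds four sample indices (a, b, c, d) defining pairs (a, b), (c, d).
    """
    a, b, c, d = quad
    s_ab = agreeing_labels(labels[a], labels[b])
    s_cd = agreeing_labels(labels[c], labels[d])
    if s_ab == s_cd:
        return 0.0  # equally similar pairs impose no ordering
    d_ab = float(np.sum((emb[a] - emb[b]) ** 2))
    d_cd = float(np.sum((emb[c] - emb[d]) ** 2))
    # The pair with more agreeing labels must end up with the smaller distance
    near, far = (d_ab, d_cd) if s_ab > s_cd else (d_cd, d_ab)
    return max(0.0, near - far + margin)

# Toy labels: (ID, gender, age group); all values here are illustrative
labels = [(1, 0, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1)]
good_emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [3.0, 3.0]])
bad_emb = good_emb[::-1].copy()  # same points, assigned in reverse order

loss_good = quadruplet_loss(good_emb, labels, (0, 1, 2, 3))
loss_bad = quadruplet_loss(bad_emb, labels, (0, 1, 2, 3))
```

In `good_emb` the pair with three agreeing labels is already much closer than the pair with none, so the loss is zero; in `bad_emb` the ordering is violated and the loss is positive.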

CVFeb 10, 2020
Unconstrained Periocular Recognition: Using Generative Deep Learning Frameworks for Attribute Normalization

Luiz A. Zanlorensi, Hugo Proença, David Menotti

Ocular biometric systems working in unconstrained environments usually face the problem of small within-class compactness caused by the multiple factors that jointly degrade the quality of the obtained data. In this work, we propose an attribute normalization strategy based on deep learning generative frameworks, that reduces the variability of the samples used in pairwise comparisons, without reducing their discriminability. The proposed method can be seen as a preprocessing step that contributes to data regularization and improves the recognition accuracy, being fully agnostic to the recognition strategy used. As proof of concept, we consider the "eyeglasses" and "gaze" factors, comparing the levels of performance of five different recognition methods with/without using the proposed normalization strategy. Also, we introduce a new dataset for unconstrained periocular recognition, composed of images acquired by mobile devices, particularly suited to perceive the impact of "wearing eyeglasses" in recognition effectiveness. Our experiments were performed on two different datasets, and support the usefulness of our attribute normalization scheme to improve the recognition performance.

CVNov 21, 2019
Deep Representations for Cross-spectral Ocular Biometrics

Luiz A. Zanlorensi, Diego R. Lucio, Alceu S. Britto et al.

One of the major challenges in ocular biometrics is the cross-spectral scenario, i.e., how to match images acquired in different wavelengths (typically visible (VIS) against near-infrared (NIR)). This article designs and extensively evaluates cross-spectral ocular verification methods, for both the closed and open-world settings, using well-known deep learning representations based on the iris and periocular regions. Using as inputs the bounding boxes of non-normalized iris/periocular regions, we fine-tune Convolutional Neural Network (CNN) models (based either on VGG16 or ResNet-50 architectures), originally trained for face recognition. Based on the experiments carried out on two publicly available cross-spectral ocular databases, we report results for intra-spectral and cross-spectral scenarios, with the best performance being observed when fusing ResNet-50 deep representations from both the periocular and iris regions. When compared to the state-of-the-art, we observed that the proposed solution consistently reduces the Equal Error Rate (EER) values by 90% / 93% / 96% and 61% / 77% / 83% in the cross-spectral scenario on the PolyU Bi-spectral and Cross-eye-cross-spectral datasets. Lastly, we evaluate the effect that the "deepness" factor of feature representations has on recognition effectiveness, and - based on a subjective analysis of the most problematic pairwise comparisons - we point out further directions for this field of research.
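
Since the reported gains are expressed as reductions in EER, a short sketch of how an EER is typically estimated from verification scores may help. The function below scans candidate thresholds and returns the point where the false acceptance and false rejection rates are closest; `relative_reduction` reflects one reading of "reduces the EER by 90%" as a relative reduction, which is an assumption. The scores are made up for illustration and are not data from the paper.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores (higher means more similar)."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # A comparison is accepted when its score is >= the threshold
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    best = np.argmin(np.abs(far - frr))
    return float((far[best] + frr[best]) / 2.0)

def relative_reduction(baseline_eer, new_eer):
    """Percentage by which `new_eer` improves on `baseline_eer`."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Illustrative scores only
separable = equal_error_rate([0.8, 0.9], [0.1, 0.2])
overlapping = equal_error_rate([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.6, 0.4])
```

With only a few scores the two error curves may not cross exactly, which is why the sketch averages the two rates at the closest threshold rather than interpolating.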

CVNov 13, 2019
GANprintR: Improved Fakes and Evaluation of the State of the Art in Face Manipulation Detection

João C. Neves, Ruben Tolosana, Ruben Vera-Rodriguez et al.

The availability of large-scale facial databases, together with the remarkable progress of deep learning technologies, in particular Generative Adversarial Networks (GANs), has led to the generation of extremely realistic fake facial content, raising obvious concerns about the potential for misuse. Such concerns have fostered research on manipulation detection methods that, contrary to humans, have already achieved astonishing results in various scenarios. In this study, we focus on the synthesis of entire facial images, which is a specific type of facial manipulation. The main contributions of this study are four-fold: i) a novel strategy to remove GAN "fingerprints" from synthetic fake images based on autoencoders is described, in order to spoof facial manipulation detection systems while keeping the visual quality of the resulting images; ii) an in-depth analysis of the recent literature in facial manipulation detection; iii) a complete experimental assessment of this type of facial manipulation, considering the state-of-the-art fake detection systems (based on holistic deep networks, steganalysis, and local artifacts), highlighting how challenging this task is in unconstrained scenarios; and finally iv) we announce a novel public database, named iFakeFaceDB, resulting from the application of our proposed GAN-fingerprint Removal approach (GANprintR) to already very realistic synthetic fake images. The results obtained in our empirical evaluation show that additional efforts are required to develop robust facial manipulation detection systems against unseen conditions and spoof techniques, such as the one proposed in this study.

CVJan 5, 2019
Forensic shoe-print identification: a brief survey

Imad Rida, Lunke Fei, Hugo Proença et al.

As an advanced research topic in forensic science, automatic shoe-print identification has been extensively studied in the last two decades, since shoe marks are the clues most frequently left at a crime scene. Hence, these impressions provide pertinent evidence for the proper progress of investigations in order to identify the potential criminals. The main goal of this survey is to provide a cohesive overview of the research carried out in forensic shoe-print identification and its basic background. Apart from defining the problem and describing the phases that typically compose the processing chain of shoe-print identification, we provide a summary/comparison of the state-of-the-art approaches, in order to guide the neophyte and help to advance the research topic. This is done by introducing simple and basic taxonomies as well as summaries of the state-of-the-art performance. Lastly, we discuss the current open problems and challenges in this research topic, and point out promising directions in this field.