Kevin Hernandez-Diaz

h-index41

18papers

173citations

Novelty31%

AI Score36

Ranked #121,270 of 201,326 authors (top 60%)#38,622 in CV (top 66%)

18 Papers

CVMar 11, 2022Code

LFW-Beautified: A Dataset of Face Images with Beautification and Augmented Reality Filters

Pontus Hedman, Vasilios Skepetzis, Kevin Hernandez-Diaz et al.

Selfie images enjoy huge popularity in social media. The same platforms centered around sharing this type of images offer filters to beautify them or incorporate augmented reality effects. Studies suggests that filtered images attract more views and engagement. Selfie images are also in increasing use in security applications due to mobiles becoming data hubs for many transactions. Also, video conference applications, boomed during the pandemic, include such filters. Such filters may destroy biometric features that would allow person recognition or even detection of the face itself, even if such commodity applications are not necessarily used to compromise facial systems. This could also affect subsequent investigations like crimes in social media, where automatic analysis is usually necessary given the amount of information posted in social sites or stored in devices or cloud repositories. To help in counteracting such issues, we contribute with a database of facial images that includes several manipulations. It includes image enhancement filters (which mostly modify contrast and lightning) and augmented reality filters that incorporate items like animal noses or glasses. Additionally, images with sunglasses are processed with a reconstruction network trained to learn to reverse such modifications. This is because obfuscating the eye region has been observed in the literature to have the highest impact on the accuracy of face detection or recognition. We start from the popular Labeled Faces in the Wild (LFW) database, to which we apply different modifications, generating 8 datasets. Each dataset contains 4,324 images of size 64 x 64, with a total of 34,592 images. The use of a public and widely employed face dataset allows for replication and comparison. The created database is available at https://github.com/HalmstadUniversityBiometrics/LFW-Beautified

CVDec 9, 2022

Visual Detection of Personal Protective Equipment and Safety Gear on Industry Workers

Jonathan Karlsson, Fredrik Strand, Josef Bigun et al.

Workplace injuries are common in today's society due to a lack of adequately worn safety equipment. A system that only admits appropriately equipped personnel can be created to improve working conditions. The goal is thus to develop a system that will improve workers' safety using a camera that will detect the usage of Personal Protective Equipment (PPE). To this end, we collected and labeled appropriate data from several public sources, which have been used to train and evaluate several models based on the popular YOLOv4 object detector. Our focus, driven by a collaborating industrial partner, is to implement our system into an entry control point where workers must present themselves to obtain access to a restricted area. Combined with facial identity recognition, the system would ensure that only authorized people wearing appropriate equipment are granted access. A novelty of this work is that we increase the number of classes to five objects (hardhat, safety vest, safety gloves, safety glasses, and hearing protection), whereas most existing works only focus on one or two classes, usually hardhats or vests. The AI model developed provides good detection accuracy at a distance of 3 and 5 meters in the collaborative environment where we aim at operating (mAP of 99/89%, respectively). The small size of some objects or the potential occlusion by body parts have been identified as potential factors that are detrimental to accuracy, which we have counteracted via data augmentation and cropping of the body before applying PPE detection.

CVJul 25, 2023

An Explainable Model-Agnostic Algorithm for CNN-based Biometrics Verification

Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose M. Buades et al.

This paper describes an adaptation of the Local Interpretable Model-Agnostic Explanations (LIME) AI method to operate under a biometric verification setting. LIME was initially proposed for networks with the same output classes used for training, and it employs the softmax probability to determine which regions of the image contribute the most to classification. However, in a verification setting, the classes to be recognized have not been seen during training. In addition, instead of using the softmax output, face descriptors are usually obtained from a layer before the classification layer. The model is adapted to achieve explainability via cosine similarity between feature vectors of perturbated versions of the input image. The method is showcased for face biometrics with two CNN models based on MobileNetv2 and ResNet50.

CVJul 11, 2023

One-Shot Learning for Periocular Recognition: Exploring the Effect of Domain Adaptation and Data Bias on Deep Representations

Kevin Hernandez-Diaz, Fernando Alonso-Fernandez, Josef Bigun

One weakness of machine-learning algorithms is the need to train the models for a new task. This presents a specific challenge for biometric recognition due to the dynamic nature of databases and, in some instances, the reliance on subject collaboration for data collection. In this paper, we investigate the behavior of deep representations in widely used CNN models under extreme data scarcity for One-Shot periocular recognition, a biometric recognition task. We analyze the outputs of CNN layers as identity-representing feature vectors. We examine the impact of Domain Adaptation on the network layers' output for unseen data and evaluate the method's robustness concerning data normalization and generalization of the best-performing layer. We improved state-of-the-art results that made use of networks trained with biometric datasets with millions of images and fine-tuned for the target periocular dataset by utilizing out-of-the-box CNNs trained for the ImageNet Recognition Challenge and standard computer vision algorithms. For example, for the Cross-Eyed dataset, we could reduce the EER by 67% and 79% (from 1.70% and 3.41% to 0.56% and 0.71%) in the Close-World and Open-World protocols, respectively, for the periocular case. We also demonstrate that traditional algorithms like SIFT can outperform CNNs in situations with limited data or scenarios where the network has not been trained with the test classes like the Open-World mode. SIFT alone was able to reduce the EER by 64% and 71.6% (from 1.7% and 3.41% to 0.6% and 0.97%) for Cross-Eyed in the Close-World and Open-World protocols, respectively, and a reduction of 4.6% (from 3.94% to 3.76%) in the PolyU database for the Open-World and single biometric case.

CVDec 9, 2022

Image-Based Fire Detection in Industrial Environments with YOLOv4

Otto Zell, Joel Pålsson, Kevin Hernandez-Diaz et al.

Fires have destructive power when they break out and affect their surroundings on a devastatingly large scale. The best way to minimize their damage is to detect the fire as quickly as possible before it has a chance to grow. Accordingly, this work looks into the potential of AI to detect and recognize fires and reduce detection time using object detection on an image stream. Object detection has made giant leaps in speed and accuracy over the last six years, making real-time detection feasible. To our end, we collected and labeled appropriate data from several public sources, which have been used to train and evaluate several models based on the popular YOLOv4 object detector. Our focus, driven by a collaborating industrial partner, is to implement our system in an industrial warehouse setting, which is characterized by high ceilings. A drawback of traditional smoke detectors in this setup is that the smoke has to rise to a sufficient height. The AI models brought forward in this research managed to outperform these detectors by a significant amount of time, providing precious anticipation that could help to minimize the effects of fires further.

CVJul 28, 2024

Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition

Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Prayag Tiwari et al.

We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition. These architectures have demonstrated significant success in various computer vision tasks beyond the ones for which they were designed. This work builds on our previous study using off-the-shelf Convolutional Neural Network (CNN) and extends it to include the more recently proposed Vision Transformers (ViT). Despite being trained for generic object classification, middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images. We also demonstrate that CNNs and ViTs are highly complementary since their combination results in boosted accuracy. In addition, we show that a small portion of these pre-trained models can achieve good accuracy, resulting in thinner models with fewer parameters, suitable for resource-limited environments such as mobiles. This efficiency improves if traditional handcrafted features are added as well.

CVDec 9, 2022

Synthetic Data for Object Classification in Industrial Applications

August Baaz, Yonan Yonan, Kevin Hernandez-Diaz et al.

One of the biggest challenges in machine learning is data collection. Training data is an important part since it determines how the model will behave. In object classification, capturing a large number of images per object and in different conditions is not always possible and can be very time-consuming and tedious. Accordingly, this work explores the creation of artificial images using a game engine to cope with limited data in the training dataset. We combine real and synthetic data to train the object classification engine, a strategy that has shown to be beneficial to increase confidence in the decisions made by the classifier, which is often critical in industrial setups. To combine real and synthetic data, we first train the classifier on a massive amount of synthetic data, and then we fine-tune it on real images. Another important result is that the amount of real images needed for fine-tuning is not very high, reaching top accuracy with just 12 or 24 images per class. This substantially reduces the requirements of capturing a great amount of real data.

CVJul 20, 2023

SqueezerFaceNet: Reducing a Small Face Recognition CNN Even More Via Filter Pruning

Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose Maria Buades Rubio et al.

The widespread use of mobile devices for various digital services has created a need for reliable and real-time person authentication. In this context, facial recognition technologies have emerged as a dependable method for verifying users due to the prevalence of cameras in mobile devices and their integration into everyday applications. The rapid advancement of deep Convolutional Neural Networks (CNNs) has led to numerous face verification architectures. However, these models are often large and impractical for mobile applications, reaching sizes of hundreds of megabytes with millions of parameters. We address this issue by developing SqueezerFaceNet, a light face recognition network which less than 1M parameters. This is achieved by applying a network pruning method based on Taylor scores, where filters with small importance scores are removed iteratively. Starting from an already small network (of 1.24M) based on SqueezeNet, we show that it can be further reduced (up to 40%) without an appreciable loss in performance. To the best of our knowledge, we are the first to evaluate network pruning methods for the task of face recognition.

CVOct 30, 2025

Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping

Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose Maria Buades Rubio et al.

We focus on ocular biometrics, specifically the periocular region (the area around the eye), which offers high discrimination and minimal acquisition constraints. We evaluate three Convolutional Neural Network architectures of varying depth and complexity to assess their effectiveness for periocular recognition. The networks are trained on 1,907,572 ocular crops extracted from the large-scale VGGFace2 database. This significantly contrasts with existing works, which typically rely on small-scale periocular datasets for training having only a few thousand images. Experiments are conducted with ocular images from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images, and the UFPR-Periocular database, which consists of selfies captured via mobile devices with user guidance on the screen. Due to the uncontrolled conditions of VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from 9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In contrast, UFPR-Periocular yields significantly better performance (EERs of 1-2%), thanks to higher image quality and more consistent acquisition protocols. To the best of our knowledge, these are the lowest reported EERs on the UFPR dataset to date.

CVJul 14, 2025

FGSSNet: Feature-Guided Semantic Segmentation of Real World Floorplans

Hugo Norrby, Gabriel Färm, Kevin Hernandez-Diaz et al.

We introduce FGSSNet, a novel multi-headed feature-guided semantic segmentation (FGSS) architecture designed to improve the generalization ability of wall segmentation on floorplans. FGSSNet features a U-Net segmentation backbone with a multi-headed dedicated feature extractor used to extract domain-specific feature maps which are injected into the latent space of U-Net to guide the segmentation process. This dedicated feature extractor is trained as an encoder-decoder with selected wall patches, representative of the walls present in the input floorplan, to produce a compressed latent representation of wall patches while jointly trained to predict the wall width. In doing so, we expect that the feature extractor encodes texture and width features of wall patches that are useful to guide the wall segmentation process. Our experiments show increased performance by the use of such injected features in comparison to the vanilla U-Net, highlighting the validity of the proposed approach.

CVApr 24, 2024

Understanding and Improving CNNs with Complex Structure Tensor: A Biometrics Study

Kevin Hernandez-Diaz, Josef Bigun, Fernando Alonso-Fernandez

Our study provides evidence that CNNs struggle to effectively extract orientation features. We show that the use of Complex Structure Tensor, which contains compact orientation features with certainties, as input to CNNs consistently improves identification accuracy compared to using grayscale inputs alone. Experiments also demonstrated that our inputs, which were provided by mini complex conv-nets, combined with reduced CNN sizes, outperformed full-fledged, prevailing CNN architectures. This suggests that the upfront use of orientation features in CNNs, a strategy seen in mammalian vision, not only mitigates their limitations but also enhances their explainability and relevance to thin-clients. Experiments were done on publicly available data sets comprising periocular images for biometric identification and verification (Close and Open World) using 6 State of the Art CNN architectures. We reduced SOA Equal Error Rate (EER) on the PolyU dataset by 5-26% depending on data and scenario.

CVOct 17, 2021

On the Effect of Selfie Beautification Filters on Face Detection and Recognition

Pontus Hedman, Vasilios Skepetzis, Kevin Hernandez-Diaz et al.

Beautification and augmented reality filters are very popular in applications that use selfie images captured with smartphones or personal devices. However, they can distort or modify biometric features, severely affecting the capability of recognizing individuals' identity or even detecting the face. Accordingly, we address the effect of such filters on the accuracy of automated face detection and recognition. The social media image filters studied either modify the image contrast or illumination or occlude parts of the face with for example artificial glasses or animal noses. We observe that the effect of some of these filters is harmful both to face detection and identity recognition, specially if they obfuscate the eye or (to a lesser extent) the nose. To counteract such effect, we develop a method to reconstruct the applied manipulation with a modified version of the U-NET segmentation network. This is observed to contribute to a better face detection and recognition accuracy. From a recognition perspective, we employ distance measures and trained machine learning algorithms applied to features extracted using a ResNet-34 network trained to recognize faces. We also evaluate if incorporating filtered images to the training set of machine learning approaches are beneficial for identity recognition. Our results show good recognition when filters do not occlude important landmarks, specially the eyes (identification accuracy >99%, EER<2%). The combined effect of the proposed approaches also allow to mitigate the effect produced by filters that occlude parts of the face, achieving an identification accuracy of >92% with the majority of perturbations evaluated, and an EER <8%. Although there is room for improvement, when neither U-NET reconstruction nor training with filtered images is applied, the accuracy with filters that severely occlude the eye is <72% (identification) and >12% (EER)

HCJul 16, 2021

In-Bed Person Monitoring Using Thermal Infrared Sensors

Elias Josse, Amanda Nerborg, Kevin Hernandez-Diaz et al.

The world is expecting an aging population and shortage of healthcare professionals. This poses the problem of providing a safe and dignified life for the elderly. Technological solutions involving cameras can contribute to safety, comfort and efficient emergency responses, but they are invasive of privacy. We use 'Griddy', a prototype with a Panasonic Grid-EYE, a low-resolution infrared thermopile array sensor, which offers more privacy. Mounted over a bed, it can determine if the user is on the bed or not without human interaction. For this purpose, two datasets were captured, one (480 images) under constant conditions, and a second one (200 images) under different variations such as use of a duvet, sleeping with a pet, or increased room temperature. We test three machine learning algorithms: Support Vector Machines (SVM), k-Nearest Neighbors (k-NN) and Neural Network (NN). With 10-fold cross validation, the highest accuracy in the main dataset is for both SVM and k-NN (99%). The results with variable data show a lower reliability under certain circumstances, highlighting the need of extra work to meet the challenge of variations in the environment.

CVAug 26, 2020

Cross-Spectral Periocular Recognition with Conditional Adversarial Networks

Kevin Hernandez-Diaz, Fernando Alonso-Fernandez, Josef Bigun

This work addresses the challenge of comparing periocular images captured in different spectra, which is known to produce significant drops in performance in comparison to operating in the same spectrum. We propose the use of Conditional Generative Adversarial Networks, trained to con-vert periocular images between visible and near-infrared spectra, so that biometric verification is carried out in the same spectrum. The proposed setup allows the use of existing feature methods typically optimized to operate in a single spectrum. Recognition experiments are done using a number of off-the-shelf periocular comparators based both on hand-crafted features and CNN descriptors. Using the Hong Kong Polytechnic University Cross-Spectral Iris Images Database (PolyU) as benchmark dataset, our experiments show that cross-spectral performance is substantially improved if both images are converted to the same spectrum, in comparison to matching features extracted from images in different spectra. In addition to this, we fine-tune a CNN based on the ResNet50 architecture, obtaining a cross-spectral periocular performance of EER=1%, and GAR>99% @ FAR=1%, which is comparable to the state-of-the-art with the PolyU database.

CLJul 31, 2020

Writer Identification Using Microblogging Texts for Social Media Forensics

Fernando Alonso-Fernandez, Nicole Mariah Sharon Belvisi, Kevin Hernandez-Diaz et al.

Establishing authorship of online texts is fundamental to combat cybercrimes. Unfortunately, text length is limited on some platforms, making the challenge harder. We aim at identifying the authorship of Twitter messages limited to 140 characters. We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes. We use two databases with 93 and 3957 authors, respectively. We test varying sized author sets and varying amounts of training/test texts per author. Performance is further improved by feature combination via automatic selection. With a large number of training Tweets (>500), a good accuracy (Rank-5>80%) is achievable with only a few dozens of test Tweets, even with several thousands of authors. With smaller sample sizes (10-20 training Tweets), the search space can be diminished by 9-15% while keeping a high chance that the correct author is retrieved among the candidates. In such cases, automatic attribution can provide significant time savings to experts in suspect search. For completeness, we report verification results. With few training/test Tweets, the EER is above 20-25%, which is reduced to < 15% if hundreds of training Tweets are available. We also quantify the computational complexity and time permanence of the employed features.

CVJul 16, 2020

SqueezeFacePoseNet: Lightweight Face Verification Across Different Poses for Mobile Platforms

Fernando Alonso-Fernandez, Javier Barrachina, Kevin Hernandez-Diaz et al.

Virtual applications through mobile platforms are one of the most critical and ever-growing fields in AI, where ubiquitous and real-time person authentication has become critical after the breakthrough of all services provided via mobile devices. In this context, face verification technologies can provide reliable and robust user authentication, given the availability of cameras in these devices, as well as their widespread use in everyday applications. The rapid development of deep Convolutional Neural Networks has resulted in many accurate face verification architectures. However, their typical size (hundreds of megabytes) makes them infeasible to be incorporated in downloadable mobile applications where the entire file typically may not exceed 100 Mb. Accordingly, we address the challenge of developing a lightweight face recognition network of just a few megabytes that can operate with sufficient accuracy in comparison to much larger models. The network also should be able to operate under different poses, given the variability naturally observed in uncontrolled environments where mobile devices are typically used. In this paper, we adapt the lightweight SqueezeNet model, of just 4.4MB, to effectively provide cross-pose face recognition. After trained on the MS-Celeb-1M and VGGFace2 databases, our model achieves an EER of 1.23% on the difficult frontal vs. profile comparison, and0.54% on profile vs. profile images. Under less extreme variations involving frontal images in any of the enrolment/query images pair, EER is pushed down to<0.3%, and the FRR at FAR=0.1%to less than 1%. This makes our light model suitable for face recognition where at least acquisition of the enrolment image can be controlled. At the cost of a slight degradation in performance, we also test an even lighter model (of just 2.5MB) where regular convolutions are replaced with depth-wise separable convolutions.

CVApr 23, 2020

Cloud-Based Face and Speech Recognition for Access Control Applications

Nathalie Tkauc, Thao Tran, Kevin Hernandez-Diaz et al.

This paper describes the implementation of a system to recognize employees and visitors wanting to gain access to a physical office through face images and speech-to-text recognition. The system helps employees to unlock the entrance door via face recognition without the need of tag-keys or cards. To prevent spoofing attacks and increase security, a randomly generated code is sent to the employee, who then has to type it into the screen. On the other hand, visitors and delivery persons are provided with a speech-to-text service where they utter the name of the employee that they want to meet, and the system then sends a notification to the right employee automatically. The hardware of the system is constituted by two Raspberry Pi, a 7-inch LCD-touch display, a camera, and a sound card with a microphone and speaker. To carry out face recognition and speech-to-text conversion, the cloud-based platforms Amazon Web Services and the Google Speech-to-Text API service are used respectively. The two-step face authentication mechanism for employees provides an increased level of security and protection against spoofing attacks without the need of carrying key-tags or access cards, while disturbances by visitors or couriers are minimized by notifying their arrival to the right employee, without disturbing other co-workers by means of ring-bells.

CVSep 17, 2018

Periocular Recognition Using CNN Features Off-the-Shelf

Kevin Hernandez-Diaz, Fernando Alonso-Fernandez, Josef Bigun

Periocular refers to the region around the eye, including sclera, eyelids, lashes, brows and skin. With a surprisingly high discrimination ability, it is the ocular modality requiring the least constrained acquisition. Here, we apply existing pre-trained architectures, proposed in the context of the ImageNet Large Scale Visual Recognition Challenge, to the task of periocular recognition. These have proven to be very successful for many other computer vision tasks apart from the detection and classification tasks for which they were designed. Experiments are done with a database of periocular images captured with a digital camera. We demonstrate that these off-the-shelf CNN features can effectively recognize individuals based on periocular images, despite being trained to classify generic objects. Compared against reference periocular features, they show an EER reduction of up to ~40%, with the fusion of CNN and traditional features providing additional improvements.