Danilo Avola

h-index: 25

14 papers

220 citations

Novelty: 38%

AI Score: 42

Ranked #58,054 of 194,257 authors (top 30%); #20,076 in CV (top 34%)

14 Papers

3.3 | LG | Mar 18, 2022

Analyzing EEG Data with Machine and Deep Learning: A Benchmark

Danilo Avola, Marco Cascio, Luigi Cinque et al.

Nowadays, machine and deep learning techniques are widely used in different areas, ranging from economics to biology. In general, these techniques can be used in two ways: trying to adapt well-known models and architectures to the available data, or designing custom architectures. In both cases, to speed up the research process, it is useful to know which types of models work best for a specific problem and/or data type. Focusing on EEG signal analysis, and for the first time in the literature, this paper proposes a benchmark of machine and deep learning for EEG signal classification. For our experiments we used the four most widespread models, i.e., multilayer perceptron, convolutional neural network, long short-term memory, and gated recurrent unit, highlighting which one can be a good starting point for developing EEG classification models.

11.3 | CV | Jul 2, 2024 | Code

Semantically Guided Representation Learning For Action Anticipation

Anxhelo Diko, Danilo Avola, Bardh Prenkaj et al.

Action anticipation is the task of forecasting future activity from a partially observed sequence of events. However, this task is exposed to intrinsic future uncertainty and the difficulty of reasoning upon interconnected actions. Unlike previous works that focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their semantic interconnectivity based on prototypical action patterns and contextual co-occurrences. To this end, we propose the novel Semantically Guided Representation Learning (S-GEAR) framework. S-GEAR learns visual action prototypes and leverages language models to structure their relationship, inducing semanticity. To gather insights on S-GEAR's effectiveness, we test it on four action anticipation benchmarks, obtaining improved results compared to previous works: +3.5, +2.7, and +3.5 absolute points on Top-1 Accuracy on Epic-Kitchens 55, EGTEA Gaze+ and 50 Salads, respectively, and +0.8 on Top-5 Recall on Epic-Kitchens 100. We further observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes. Finally, S-GEAR opens new research frontiers in anticipation tasks by demonstrating the intricate impact of action semantic interconnectivity.

8.7 | CV | Feb 17, 2024 | Code

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

Anxhelo Diko, Danilo Avola, Marco Cascio et al.

The self-attention mechanism of the Vision Transformer (ViT) is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify elements within an image and increase the accuracy and robustness of vision-based recognition systems. Following this rationale, we propose a novel residual attention learning method for improving ViT-based architectures, increasing their visual feature diversity and model robustness. In this way, the proposed network can capture and preserve significant low-level features, providing more details about the elements within the scene being analyzed. The effectiveness and robustness of the presented method are evaluated on five image classification benchmarks, including ImageNet1k, CIFAR10, CIFAR100, Oxford Flowers-102, and Oxford-IIIT Pet, achieving improved performance. Additionally, experiments on the COCO2017 dataset show that the devised approach discovers and incorporates semantic and spatial relationships for object detection and instance segmentation when implemented into spatial-aware transformer models.

7.1 | LG | Jul 17, 2025

Transformer-Based Person Identification via Wi-Fi CSI Amplitude and Phase Perturbations

Danilo Avola, Andrea Bernardini, Francesco Danese et al.

Wi-Fi sensing is gaining momentum as a non-intrusive and privacy-preserving alternative to vision-based systems for human identification. However, person identification through wireless signals, particularly without user motion, remains largely unexplored. Most prior wireless-based approaches rely on movement patterns, such as walking gait, to extract biometric cues. In contrast, we propose a transformer-based method that identifies individuals from Channel State Information (CSI) recorded while the subject remains stationary. CSI captures fine-grained amplitude and phase distortions induced by the unique interaction between the human body and the radio signal. To support evaluation, we introduce a dataset acquired with ESP32 devices in a controlled indoor environment, featuring six participants observed across multiple orientations. A tailored preprocessing pipeline, including outlier removal, smoothing, and phase calibration, enhances signal quality. Our dual-branch transformer architecture processes amplitude and phase modalities separately and achieves 99.82% classification accuracy, outperforming convolutional and multilayer perceptron baselines. These results demonstrate the discriminative potential of CSI perturbations, highlighting their capacity to encode biometric traits in a consistent manner. They further confirm the viability of passive, device-free person identification using low-cost commodity Wi-Fi hardware in real-world settings.
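
The amplitude and phase modalities mentioned above can be illustrated with a minimal sketch. It assumes CSI arrives as one complex value per subcarrier and uses a linear detrend across subcarrier indices as the phase calibration step. That detrend is a common sanitization technique chosen here for illustration; the abstract does not say which calibration the authors use, and all function names are hypothetical.

```python
import cmath
import math


def amplitude_and_phase(csi):
    """Split complex CSI samples (one per subcarrier) into amplitude and raw phase."""
    amplitude = [abs(h) for h in csi]
    phase = [cmath.phase(h) for h in csi]  # each value lies in [-pi, pi]
    return amplitude, phase


def unwrap(phase):
    """Remove 2*pi jumps between neighbouring subcarriers."""
    out = [phase[0]]
    for p in phase[1:]:
        delta = p - out[-1]
        # Fold the step back into [-pi, pi] before accumulating it.
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


def calibrate_phase(phase):
    """Assumed calibration: subtract the least-squares line fitted over subcarrier index."""
    unwrapped = unwrap(phase)
    n = len(unwrapped)
    mean_i = (n - 1) / 2
    mean_p = sum(unwrapped) / n
    var_i = sum((i - mean_i) ** 2 for i in range(n))
    slope = sum((i - mean_i) * (p - mean_p) for i, p in enumerate(unwrapped)) / var_i
    return [p - (slope * (i - mean_i) + mean_p) for i, p in enumerate(unwrapped)]
```

In a dual-branch design such as the one described, the amplitude list and the calibrated phase list would each feed their own branch; a purely linear phase ramp (for example, one caused by a timing offset) is reduced to approximately zero by this step.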

7.1 | LG | Apr 29, 2025

Digital Shielding for Cross-Domain Wi-Fi Signal Adaptation using Relativistic Average Generative Adversarial Network

Danilo Avola, Federica Bruni, Gian Luca Foresti et al.

Wi-Fi sensing uses radio-frequency signals from Wi-Fi devices to analyze environments, enabling tasks such as tracking people, detecting intrusions, and recognizing gestures. The rise of this technology is driven by the IEEE 802.11bf standard and growing demand for tools that can ensure privacy and operate through obstacles. However, the performance of Wi-Fi sensing is heavily influenced by environmental conditions, especially when extracting spatial and temporal features from the surrounding scene. A key challenge is achieving robust generalization across domains, ensuring stable performance even when the sensing environment changes significantly. This paper introduces a novel deep learning model for cross-domain adaptation of Wi-Fi signals, inspired by physical signal shielding. The model uses a Relativistic average Generative Adversarial Network (RaGAN) with Bidirectional Long Short-Term Memory (Bi-LSTM) architectures for both the generator and discriminator. To simulate physical shielding, an acrylic box lined with electromagnetic shielding fabric was constructed, mimicking a Faraday cage. Wi-Fi signal spectra were collected from various materials both inside (domain-free) and outside (domain-dependent) the box to train the model. A multi-class Support Vector Machine (SVM) was trained on domain-free spectra and tested on signals denoised by the RaGAN. The system achieved 96% accuracy and demonstrated strong material discrimination capabilities, offering potential for use in security applications to identify concealed objects based on their composition.
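
For readers unfamiliar with the relativistic average formulation, the sketch below computes the standard relativistic average GAN losses from raw (pre-sigmoid) critic scores. It shows the general objective only: the Bi-LSTM generator and discriminator are not modeled, the function names are hypothetical, and the abstract does not confirm that the authors use exactly this variant of the loss.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def _mean(values):
    return sum(values) / len(values)


def ragan_losses(real_scores, fake_scores, eps=1e-12):
    """Return (discriminator_loss, generator_loss) for one batch of critic scores.

    Each sample is judged relative to the average score of the opposite set.
    """
    mean_real = _mean(real_scores)
    mean_fake = _mean(fake_scores)
    d_real = [_sigmoid(r - mean_fake) for r in real_scores]
    d_fake = [_sigmoid(f - mean_real) for f in fake_scores]
    d_loss = (-_mean([math.log(p + eps) for p in d_real])
              - _mean([math.log(1.0 - p + eps) for p in d_fake]))
    g_loss = (-_mean([math.log(p + eps) for p in d_fake])
              - _mean([math.log(1.0 - p + eps) for p in d_real]))
    return d_loss, g_loss
```

Unlike a standard GAN, the generator loss here also depends on the real samples, since real scores are pushed down relative to the average fake score.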

3.6 | CV | Jul 18, 2025

Augmented Reality in Cultural Heritage: A Dual-Model Pipeline for 3D Artwork Reconstruction

Daniele Pannone, Alessia Castronovo, Maurizio Mancini et al.

This paper presents an innovative augmented reality pipeline tailored for museum environments, aimed at recognizing artworks and generating accurate 3D models from single images. By integrating two complementary pre-trained depth estimation models, i.e., GLPN for capturing global scene structure and Depth-Anything for detailed local reconstruction, the proposed approach produces optimized depth maps that effectively represent complex artistic features. These maps are then converted into high-quality point clouds and meshes, enabling the creation of immersive AR experiences. The methodology leverages state-of-the-art neural network architectures and advanced computer vision techniques to overcome challenges posed by irregular contours and variable textures in artworks. Experimental results demonstrate significant improvements in reconstruction accuracy and visual realism, making the system a highly robust tool for museums seeking to enhance visitor engagement through interactive digital content.
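
The step from depth maps to point clouds can be sketched as follows. The abstract does not specify how the two depth maps are combined, so the min-max normalization and weighted average below are assumptions made purely for illustration, as are the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) and every function name.

```python
def normalize(depth):
    """Min-max normalize a depth map (list of rows) to the range [0, 1]."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(d - lo) / span for d in row] for row in depth]


def fuse(depth_global, depth_local, weight_local=0.5):
    """Hypothetical fusion: per-pixel weighted average of two normalized maps."""
    a, b = normalize(depth_global), normalize(depth_local)
    return [
        [(1.0 - weight_local) * x + weight_local * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project each pixel (u, v) with depth z through a pinhole camera model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Monocular depth estimators generally produce relative rather than metric depth, so a point cloud obtained this way is defined only up to scale unless additional calibration is applied.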

4.1 | LG | Jul 18, 2025

Bi-GRU Based Deception Detection using EEG Signals

Danilo Avola, Muhammad Yasir Bilal, Emad Emam et al.

Deception detection is a significant challenge in fields such as security, psychology, and forensics. This study presents a deep learning approach for classifying deceptive and truthful behavior using ElectroEncephaloGram (EEG) signals from the Bag-of-Lies dataset, a multimodal corpus designed for naturalistic, casual deception scenarios. A Bidirectional Gated Recurrent Unit (Bi-GRU) neural network was trained to perform binary classification of EEG samples. The model achieved a test accuracy of 97%, along with high precision, recall, and F1-scores across both classes. These results demonstrate the effectiveness of using bidirectional temporal modeling for EEG-based deception detection and suggest potential for real-time applications and future exploration of advanced neural architectures.

3.6 | CV | Jul 17, 2025

WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding

Danilo Avola, Emad Emam, Dario Montagnini et al.

Person Re-Identification is a key and challenging task in video surveillance. While traditional methods rely on visual data, issues like poor lighting, occlusion, and suboptimal angles often hinder performance. To address these challenges, we introduce WhoFi, a novel pipeline that utilizes Wi-Fi signals for person re-identification. Biometric features are extracted from Channel State Information (CSI) and processed through a modular Deep Neural Network (DNN) featuring a Transformer-based encoder. The network is trained using an in-batch negative loss function to learn robust and generalizable biometric signatures. Experiments on the NTU-Fi dataset show that our approach achieves competitive results compared to state-of-the-art methods, confirming its effectiveness in identifying individuals via Wi-Fi signals.
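
An in-batch negative loss of the kind mentioned above is commonly written as a cross-entropy over a similarity matrix, where the i-th query signature should match the i-th gallery signature and every other gallery entry in the batch acts as a negative. The sketch below follows that common form; the cosine similarity, the `temperature` parameter, and the function names are assumptions, not details taken from the abstract.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def in_batch_negative_loss(queries, gallery, temperature=0.1):
    """Mean cross-entropy where gallery[i] is the positive for queries[i].

    All other gallery signatures in the batch serve as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    g = [l2_normalize(v) for v in gallery]
    total = 0.0
    for i, query in enumerate(q):
        logits = [sum(a * b for a, b in zip(query, cand)) / temperature for cand in g]
        peak = max(logits)  # log-sum-exp trick for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)
```

The loss is small when each signature is closest to its own counterpart and grows when signatures of different people are confused, which is the behavior a re-identification encoder is trained to achieve.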

3.6 | CV | Apr 29, 2025

Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study

Daniele Pannone, Danilo Avola

This paper introduces a deep learning framework for generating point clouds from WiFi Channel State Information data. We employ a two-stage autoencoder approach: a PointNet autoencoder with convolutional layers for point cloud generation, and a Convolutional Neural Network autoencoder to map CSI data to a matching latent space. By aligning these latent spaces, our method enables accurate environmental point cloud reconstruction from WiFi data. Experimental results validate the effectiveness of our approach, highlighting its potential for wireless sensing and environmental mapping applications.

2.4 | IV | Oct 6, 2021 | Code

Study on Transfer Learning Capabilities for Pneumonia Classification in Chest-X-Rays Image

Danilo Avola, Andrea Bacciu, Luigi Cinque et al.

Over the last year, the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and its variants have highlighted the importance of screening tools with high diagnostic accuracy for new illnesses such as COVID-19. In that regard, deep learning approaches have proven to be effective solutions for pneumonia classification, especially when considering chest X-ray images. However, this lung infection can also be caused by other viral, bacterial, or fungal pathogens. Consequently, efforts are being directed toward distinguishing the infection source to help clinicians diagnose the correct disease origin. Following this trend, this study further explores the effectiveness of established neural network architectures on the pneumonia classification task through the transfer learning paradigm. To present a comprehensive comparison, 12 well-known ImageNet pre-trained models were fine-tuned and used to discriminate among chest X-rays of healthy people and those showing pneumonia symptoms derived from either a viral (i.e., generic or SARS-CoV-2) or bacterial source. Furthermore, since a common public collection distinguishing between such categories is currently not available, two distinct datasets of chest X-ray images, describing the aforementioned sources, were combined and employed to evaluate the various architectures. The experiments were performed using a total of 6330 images split between train, validation, and test sets. For all models, common classification metrics were computed (e.g., precision, F1-score), and most architectures obtained significant performance, reaching, among others, up to 84.46% average F1-score when discriminating the 4 identified classes. Moreover, confusion matrices and activation maps computed via the Grad-CAM algorithm were also reported to present an informed discussion of the networks' classifications.
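
An average F1-score over several classes can be derived directly from a confusion matrix. The sketch below computes the unweighted (macro) average; whether the reported figure is a macro or a weighted average is not stated in the abstract, so macro averaging is an assumption here, and the function name is hypothetical.

```python
def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(confusion[k]) - tp                       # truly k, predicted another class
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / n
```

Macro averaging gives every class equal weight regardless of its size, which matters when combined datasets contain unequal numbers of images per class.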

1.4 | CV | Oct 6, 2021

SIRe-Networks: Convolutional Neural Networks Architectural Extension for Information Preservation via Skip/Residual Connections and Interlaced Auto-Encoders

Danilo Avola, Luigi Cinque, Alessio Fagioli et al.

Improving existing neural network architectures can involve several design choices such as manipulating the loss functions, employing a diverse learning strategy, exploiting gradient evolution at training time, optimizing the network hyper-parameters, or increasing the architecture depth. The latter approach is a straightforward solution, since it directly enhances the representation capabilities of a network; however, the increased depth generally incurs the well-known vanishing gradient problem. In this paper, borrowing from different methods addressing this issue, we introduce an interlaced multi-task learning strategy, termed SIRe, to reduce the vanishing gradient in relation to the object classification task. The presented methodology directly improves a convolutional neural network (CNN) by preserving information from the input image through interlaced auto-encoders (AEs), and further refines the base network architecture by means of skip and residual connections. To validate the presented methodology, a simple CNN and various implementations of famous networks are extended via the SIRe strategy and extensively tested on five collections, i.e., MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and Caltech-256, where the SIRe-extended architectures achieve significantly increased performance across all models and datasets, thus confirming the effectiveness of the presented approach.

4.7 | CV | Sep 28, 2021

3D Hand Pose and Shape Estimation from RGB Images for Keypoint-Based Hand Gesture Recognition

Danilo Avola, Luigi Cinque, Alessio Fagioli et al.

Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimations can be computed from single RGB images, especially when a multi-task learning approach is used to force the system to consider the shape of the hand when its pose is determined. However, depending on the method used to represent the hand, the performance can drop considerably in real-life tasks, suggesting that stable descriptions are required to achieve satisfactory results. In this paper, we present a keypoint-based end-to-end framework for 3D hand and pose estimation and successfully apply it to the task of hand gesture recognition as a study case. Specifically, after a pre-processing step in which the images are normalized, the proposed pipeline uses a multi-task semantic feature extractor generating 2D heatmaps and hand silhouettes from RGB images, a viewpoint encoder to predict the hand and camera view parameters, a stable hand estimator to produce the 3D hand pose and shape, and a loss function to guide all of the components jointly during the learning phase. Tests were performed on a 3D pose and shape estimation benchmark dataset to assess the proposed framework, which obtained state-of-the-art performance. Our system was also evaluated on two hand-gesture recognition benchmark datasets and significantly outperformed other keypoint-based approaches, indicating that it is an effective solution that is able to generate stable 3D estimates for hand pose and shape.

11.4 | IV | May 28, 2020

Multimodal Feature Fusion and Knowledge-Driven Learning via Experts Consult for Thyroid Nodule Classification

Danilo Avola, Luigi Cinque, Alessio Fagioli et al.

Computer-aided diagnosis (CAD) is becoming a prominent approach to assist clinicians across multiple fields. These automated systems take advantage of various computer vision (CV) procedures, as well as artificial intelligence (AI) techniques, to formulate a diagnosis from a given image, e.g., computed tomography and ultrasound. Advances in both areas (CV and AI) are enabling ever-increasing performance of CAD systems, which can ultimately avoid the need for invasive procedures such as fine-needle aspiration. In this study, a novel end-to-end knowledge-driven classification framework is presented. The system focuses on multimodal data generated by thyroid ultrasonography, and acts as a CAD system by classifying thyroid nodules into benign and malignant categories. Specifically, the proposed system leverages cues provided by an ensemble of experts to guide the learning phase of a densely connected convolutional network (DenseNet). The ensemble is composed of various networks pretrained on ImageNet, including AlexNet, ResNet, VGG, and others. The previously computed multimodal feature parameters are used to create ultrasonography domain experts via transfer learning, which also decreases the number of samples required for training. To validate the proposed method, extensive experiments were performed, providing detailed performance figures for both the experts ensemble and the knowledge-driven DenseNet. As demonstrated by the results, the proposed system achieves relevant performance in terms of qualitative metrics for the thyroid nodule classification task, making it a valuable asset when formulating a diagnosis.

1.7 | CV | Apr 5, 2017

The UMCD Dataset

Danilo Avola, Gian Luca Foresti, Niki Martinel et al.

In recent years, technological improvements in low-cost, small-scale Unmanned Aerial Vehicles (UAVs) have promoted their ever-increasing use in different tasks. In particular, small-scale UAVs are useful in low-altitude tasks for which common UAVs cannot be adopted, such as recurrent comprehensive views of wide environments, frequent monitoring of military areas, and real-time classification of static and moving entities (e.g., people, cars). These tasks can be supported by mosaicking and change detection algorithms performed at low altitude. Currently, public datasets for testing these algorithms are not available. This paper presents the UMCD dataset, the first collection of geo-referenced video sequences acquired at low altitude for mosaicking and change detection purposes. Five reference scenarios are also reported.