Wassim Hamidouche

CV
h-index49
43papers
804citations
Novelty40%
AI Score56

43 Papers

IVJul 5, 2022Code
Transformer based Models for Unsupervised Anomaly Segmentation in Brain MR Images

Ahmed Ghorbel, Ahmed Aldahdooh, Shadi Albarqouni et al. · eth-zurich

The quality of patient care associated with diagnostic radiology is proportionate to a physician workload. Segmentation is a fundamental limiting precursor to both diagnostic and therapeutic procedures. Advances in machine learning (ML) aim to increase diagnostic efficiency by replacing a single application with generalized algorithms. The goal of unsupervised anomaly detection (UAD) is to identify potential anomalous regions unseen during training, where convolutional neural network (CNN) based autoencoders (AEs) and variational autoencoders (VAEs) are considered a de facto approach for reconstruction based-anomaly segmentation. The restricted receptive field in CNNs limits the CNN to model the global context. Hence, if the anomalous regions cover large parts of the image, the CNN-based AEs are not capable of bringing a semantic understanding of the image. Meanwhile, vision transformers (ViTs) have emerged as a competitive alternative to CNNs. It relies on the self-attention mechanism that can relate image patches to each other. We investigate in this paper Transformer capabilities in building AEs for the reconstruction-based UAD task to reconstruct a coherent and more realistic image. We focus on anomaly segmentation for brain magnetic resonance imaging (MRI) and present five Transformer-based models while enabling segmentation performance comparable to or superior to state-of-the-art (SOTA) models. The source code is made publicly available on GitHub: https://github.com/ahmedgh970/Transformers_Unsupervised_Anomaly_Segmentation.git.

92.6CLMay 4Code
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Hao Yu, Tianyi Xu, Michael A. Hedderich et al.

Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present \texttt{AfriqueLLM}, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).

CVJul 19, 2023
NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Xiaohong Liu, Xiongkuo Min, Wei Sun et al. · eth-zurich

This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance.

CVAug 22, 2024Code
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient mul-titask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language along with soft prompt-tuning strategy to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47% while also obtaining noteworthy performance metrics, with an average F1-score of 92.64% and ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at https://github.com/Mamadou-Keita/FIDAVL.

CLJan 15Code
BYOL: Bring Your Own Language Into LLMs

Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor et al.

Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol .

69.5CLMay 31
TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

Victor Akinode, Senyu Li, Wassim Hamidouche et al.

Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African contexts followed by human translation, human-curated prompts validated through interactions with GPT-5.2, and code-switched prompts combining English and African languages, isolating the effect of language, cultural grounding, and prompt evasiveness on model safety. Across closed and open models, prompting in African languages reduces refusal relative to English, with culturally adapted prompts leading to least refusal. The evaluation also surfaces two structural limitations: model comprehension failures and reduced LLM-as-a-judge reliability in LRLs. To capture the first, we introduce Deflection alongside Refused and Jailbroken; to assess the second, we validate outputs with human annotations, showing that judge-human agreement drops in lower-resource languages and less commonly supported scripts.

CVOct 1, 2022
Evaluation of Pre-Trained CNN Models for Geographic Fake Image Detection

Sid Ahmed Fezza, Mohammed Yasser Ouis, Bachir Kaddar et al.

Thanks to the remarkable advances in generative adversarial networks (GANs), it is becoming increasingly easy to generate/manipulate images. The existing works have mainly focused on deepfake in face images and videos. However, we are currently witnessing the emergence of fake satellite images, which can be misleading or even threatening to national security. Consequently, there is an urgent need to develop detection methods capable of distinguishing between real and fake satellite images. To advance the field, in this paper, we explore the suitability of several convolutional neural network (CNN) architectures for fake satellite image detection. Specifically, we benchmark four CNN models by conducting extensive experiments to evaluate their performance and robustness against various image distortions. This work allows the establishment of new baselines and may be useful for the development of CNN-based methods for fake satellite image detection.

CVJul 12, 2023
AICT: An Adaptive Image Compression Transformer

Ahmed Ghorbel, Wassim Hamidouche, Luce Morin

Motivated by the efficiency investigation of the Tranformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, as first, with a more straightforward yet effective Tranformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Current methods that still rely on ConvNet-based entropy coding are limited in long-range modeling dependencies due to their local connectivity and an increasing number of architectural biases and priors. On the contrary, the proposed ICT can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract more compact latent representation while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed adaptive image compression transformer (AICT) framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM.

CVJul 12, 2023
ConvNeXt-ChARM: ConvNeXt-based Transform for Efficient Neural Image Compression

Ahmed Ghorbel, Wassim Hamidouche, Luce Morin

Over the last few years, neural image compression has gained wide attention from research and industry, yielding promising end-to-end deep neural codecs outperforming their conventional counterparts in rate-distortion performance. Despite significant advancement, current methods, including attention-based transform coding, still need to be improved in reducing the coding rate while preserving the reconstruction fidelity, especially in non-homogeneous textured image areas. Those models also require more parameters and a higher decoding time. To tackle the above challenges, we propose ConvNeXt-ChARM, an efficient ConvNeXt-based transform coding framework, paired with a compute-efficient channel-wise auto-regressive prior to capturing both global and local contexts from the hyper and quantized latent representations. The proposed architecture can be optimized end-to-end to fully exploit the context information and extract compact latent representation while reconstructing higher-quality images. Experimental results on four widely-used datasets showed that ConvNeXt-ChARM brings consistent and significant BD-rate (PSNR) reductions estimated on average to 5.24% and 1.22% over the versatile video coding (VVC) reference encoder (VTM-18.0) and the state-of-the-art learned image compression method SwinT-ChARM, respectively. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the next generation ConvNet, namely ConvNeXt, and Swin Transformer.

LGJun 5, 2022
Federated Adversarial Training with Transformers

Ahmed Aldahdooh, Wassim Hamidouche, Olivier Déforges

Federated learning (FL) has emerged to enable global model training over distributed clients' data while preserving its privacy. However, the global trained model is vulnerable to the evasion attacks especially, the adversarial examples (AEs), carefully crafted samples to yield false classification. Adversarial training (AT) is found to be the most promising approach against evasion attacks and it is widely studied for convolutional neural network (CNN). Recently, vision transformers have been found to be effective in many computer vision tasks. To the best of the authors' knowledge, there is no work that studied the feasibility of AT in a FL process for vision transformers. This paper investigates such feasibility with different federated model aggregation methods and different vision transformer models with different tokenization and classification head techniques. In order to improve the robust accuracy of the models with the not independent and identically distributed (Non-IID), we propose an extension to FedAvg aggregation method, called FedWAvg. By measuring the similarities between the last layer of the global model and the last layer of the client updates, FedWAvg calculates the weights to aggregate the local models updates. The experiments show that FedWAvg improves the robust accuracy when compared with other state-of-the-art aggregation methods.

CVJul 5, 2023
Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient Neural Image Compression

Ahmed Ghorbel, Wassim Hamidouche, Luce Morin

Recently, the performance of neural image compression (NIC) has steadily improved thanks to the last line of study, reaching or outperforming state-of-the-art conventional codecs. Despite significant progress, current NIC methods still rely on ConvNet-based entropy coding, limited in modeling long-range dependencies due to their local connectivity and the increasing number of architectural biases and priors, resulting in complex underperforming models with high decoding latency. Motivated by the efficiency investigation of the Tranformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, as first, with a more straightforward yet effective Tranformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Through the proposed ICT, we can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre-/post-processor to accurately extract more compact latent codes while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the adaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.

CLNov 4, 2025
AI Diffusion in Low Resource Language Countries

Amit Misra, Syed Waqas Zamir, Wassim Hamidouche et al.

Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.

CVApr 2, 2024Code
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. While this technological progress has captured significant interest, it has also raised concerns about the potential difficulty in distinguishing real images from their synthetic counterparts. This paper takes inspiration from the potent convergence capabilities between vision and language, coupled with the zero-shot nature of vision-language models (VLMs). We introduce an innovative method called Bi-LORA that leverages VLMs, combined with low-rank adaptation (LORA) tuning techniques, to enhance the precision of synthetic image detection for unseen model-generated images. The pivotal conceptual shift in our methodology revolves around reframing binary classification as an image captioning task, leveraging the distinctive capabilities of cutting-edge VLM, notably bootstrapping language image pre-training (BLIP2). Rigorous and comprehensive experiments are conducted to validate the effectiveness of our proposed approach, particularly in detecting unseen diffusion-generated images from unknown diffusion-based generative models during training, showcasing robustness to noise, and demonstrating generalization capabilities to GANs. The obtained results showcase an impressive average accuracy of 93.41% in synthetic image detection on unseen generation models. The code and models associated with this research can be publicly accessed at https://github.com/Mamadou-Keita/VLM-DETECT.

CVApr 3, 2024Code
Harnessing the Power of Large Vision Language Models for Synthetic Image Detection

Mamadou Keita, Wassim Hamidouche, Hassen Bougueffa et al.

In recent years, the emergence of models capable of generating images from text has attracted considerable interest, offering the possibility of creating realistic images from text descriptions. Yet these advances have also raised concerns about the potential misuse of these images, including the creation of misleading content such as fake news and propaganda. This study investigates the effectiveness of using advanced vision-language models (VLMs) for synthetic image identification. Specifically, the focus is on tuning state-of-the-art image captioning models for synthetic image detection. By harnessing the robust understanding capabilities of large VLMs, the aim is to distinguish authentic images from synthetic images produced by diffusion-based models. This study contributes to the advancement of synthetic image detection by exploiting the capabilities of visual language models such as BLIP-2 and ViTGPT2. By tailoring image captioning models, we address the challenges associated with the potential misuse of synthetic images in real-world applications. Results described in this paper highlight the promising role of VLMs in the field of synthetic image detection, outperforming conventional image-based detection techniques. Code and models can be found at https://github.com/Mamadou-Keita/VLM-DETECT.

CVAug 11, 2025Code
Designing Object Detection Models for TinyML: Foundations, Comparative Analysis, Challenges, and Emerging Solutions

Christophe EL Zeinaty, Wassim Hamidouche, Glenn Herrou et al.

Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/christophezei/Optimizing-Object-Detection-Models-for-TinyML-A-Comprehensive-Survey.

CVApr 28, 2025Code
DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.

19.0CVMay 14
Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

IVJul 11, 2025Code
VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models

Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza et al.

Recent years have witnessed remarkable progress in developing Vision-Language Models (VLMs) capable of processing both textual and visual inputs. These models have demonstrated impressive performance, leading to their widespread adoption in various applications. However, this widespread raises serious concerns regarding user privacy, particularly when models inadvertently process or expose private visual information. In this work, we frame the preservation of privacy in VLMs as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Region Of Interests (ROIs) in an image, effectively preventing VLMs from accessing sensitive content while preserving the semantic integrity of the remaining image. Unlike conventional adversarial attacks that often disrupt the entire image, our method maintains high coherence in unmasked areas. Experimental results across three state-of-the-art VLMs namely LLaVA, Instruct-BLIP, and BLIP2-T5 demonstrate up to 98% reduction in detecting targeted ROIs, while maintaining global image semantics intact, as confirmed by high similarity scores between clean and adversarial outputs. We believe that this work contributes to a more privacy conscious use of multimodal models and offers a practical tool for further research, with the source code publicly available at: https://github.com/hbrachemi/Vlm_defense-attack.

CRMar 6, 2025Code
Energy-Latency Attacks: A New Adversarial Threat to Deep Learning

Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza et al.

The growing computational demand for deep neural networks ( DNNs) has raised concerns about their energy consumption and carbon footprint, particularly as the size and complexity of the models continue to increase. To address these challenges, energy-efficient hardware and custom accelerators have become essential. Additionally, adaptable DNN s are being developed to dynamically balance performance and efficiency. The use of these strategies became more common to enable sustainable AI deployment. However, these efficiency-focused designs may also introduce vulnerabilities, as attackers can potentially exploit them to increase latency and energy usage by triggering their worst-case-performance scenarios. This new type of attack, called energy-latency attacks, has recently gained significant research attention, focusing on the vulnerability of DNN s to this emerging attack paradigm, which can trigger denial-of-service ( DoS) attacks. This paper provides a comprehensive overview of current research on energy-latency attacks, categorizing them using the established taxonomy for traditional adversarial attacks. We explore different metrics used to measure the success of these attacks and provide an analysis and comparison of existing attack strategies. We also analyze existing defense mechanisms and highlight current challenges and potential areas for future research in this developing field. The GitHub page for this work can be accessed at https://github.com/hbrachemi/Survey_energy_attacks/

CVJan 14, 2025Code
Energy Backdoor Attack to Deep Neural Networks

Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza et al.

The rise of deep learning (DL) has increased computing complexity and energy use, prompting the adoption of application specific integrated circuits (ASICs) for energy-efficient edge and mobile deployment. However, recent studies have demonstrated the vulnerability of these accelerators to energy attacks. Despite the development of various inference time energy attacks in prior research, backdoor energy attacks remain unexplored. In this paper, we design an innovative energy backdoor attack against deep neural networks (DNNs) operating on sparsity-based accelerators. Our attack is carried out in two distinct phases: backdoor injection and backdoor stealthiness. Experimental results using ResNet-18 and MobileNet-V2 models trained on CIFAR-10 and Tiny ImageNet datasets show the effectiveness of our proposed attack in increasing energy consumption on trigger samples while preserving the model's performance for clean/regular inputs. This demonstrates the vulnerability of DNNs to energy backdoor attacks. The source code of our attack is available at: https://github.com/hbrachemi/energy_backdoor.

CVJul 12, 2021Code
Detect and Defense Against Adversarial Examples in Deep Learning using Natural Scene Statistics and Adaptive Denoising

Anouar Kherchouche, Sid Ahmed Fezza, Wassim Hamidouche

Despite the enormous performance of deepneural networks (DNNs), recent studies have shown theirvulnerability to adversarial examples (AEs), i.e., care-fully perturbed inputs designed to fool the targetedDNN. Currently, the literature is rich with many ef-fective attacks to craft such AEs. Meanwhile, many de-fenses strategies have been developed to mitigate thisvulnerability. However, these latter showed their effec-tiveness against specific attacks and does not general-ize well to different attacks. In this paper, we proposea framework for defending DNN classifier against ad-versarial samples. The proposed method is based on atwo-stage framework involving a separate detector anda denoising block. The detector aims to detect AEs bycharacterizing them through the use of natural scenestatistic (NSS), where we demonstrate that these statis-tical features are altered by the presence of adversarialperturbations. The denoiser is based on block matching3D (BM3D) filter fed by an optimum threshold valueestimated by a convolutional neural network (CNN) toproject back the samples detected as AEs into theirdata manifold. We conducted a complete evaluation onthree standard datasets namely MNIST, CIFAR-10 andTiny-ImageNet. The experimental results show that theproposed defense method outperforms the state-of-the-art defense techniques by improving the robustnessagainst a set of attacks under black-box, gray-box and white-box settings. The source code is available at: https://github.com/kherchouche-anouar/2DAE

CVMay 24, 2021Code
SHD360: A Benchmark Dataset for Salient Human Detection in 360° Videos

Yi Zhang, Lu Zhang, Kang Wang et al.

Salient human detection (SHD) in dynamic 360° immersive videos is of great importance for various applications such as robotics, inter-human and human-object interaction in augmented reality. However, 360° video SHD has been seldom discussed in the computer vision community due to a lack of datasets with large-scale omnidirectional videos and rich annotations. To this end, we propose SHD360, the first 360° video SHD dataset which contains various real-life daily scenes. Since so far there is no method proposed for 360° image/video SHD, we systematically benchmark 11 representative state-of-the-art salient object detection (SOD) approaches on our SHD360, and explore key issues derived from extensive experimenting results. We hope our proposed dataset and benchmark could serve as a good starting point for advancing human-centric researches towards 360° panoramic data. The dataset is available at https://github.com/PanoAsh/SHD360.

CVApr 28, 2021Code
Learning Synergistic Attention for Light Field Salient Object Detection

Yi Zhang, Geng Chen, Qian Chen et al.

We propose a novel Synergistic Attention Network (SA-Net) to address the light field salient object detection by establishing a synergistic effect between multi-modal features with advanced attention mechanisms. Our SA-Net exploits the rich information of focal stacks via 3D convolutional neural networks, decodes the high-level features of multi-modal light field data with two cascaded synergistic attention modules, and predicts the saliency map using an effective feature fusion module in a progressive manner. Extensive experiments on three widely-used benchmark datasets show that our SA-Net outperforms 28 state-of-the-art models, sufficiently demonstrating its effectiveness and superiority. Our code is available at https://github.com/PanoAsh/SA-Net.

CVJan 22, 2020Code
A Fixation-based 360° Benchmark Dataset for Salient Object Detection

Yi Zhang, Lu Zhang, Wassim Hamidouche et al.

Fixation prediction (FP) in panoramic contents has been widely investigated along with the booming trend of virtual reality (VR) applications. However, another issue within the field of visual saliency, salient object detection (SOD), has been seldom explored in 360° (or omnidirectional) images due to the lack of datasets representative of real scenes with pixel-level annotations. Toward this end, we collect 107 equirectangular panoramas with challenging scenes and multiple object classes. Based on the consistency between FP and explicit saliency judgements, we further manually annotate 1,165 salient objects over the collected images with precise masks under the guidance of real human eye fixation maps. Six state-of-the-art SOD models are then benchmarked on the proposed fixation-based 360° image dataset (F-360iSOD), by applying a multiple cubic projection-based fine-tuning method. Experimental results show a limitation of the current methods when used for SOD in panoramic images, which indicates the proposed dataset is challenging. Key issues for 360° SOD is also discussed. The proposed dataset is available at https://github.com/PanoAsh/F-360iSOD.

CLApr 2, 2024
Generative AI for Immersive Communication: The Next Frontier in Internet-of-Senses Through 6G

Nassim Sehad, Lina Bariah, Wassim Hamidouche et al.

Over the past two decades, the Internet-of-Things (IoT) has become a transformative concept, and as we approach 2030, a new paradigm known as the Internet of Senses (IoS) is emerging. Unlike conventional Virtual Reality (VR), IoS seeks to provide multi-sensory experiences, acknowledging that in our physical reality, our perception extends far beyond just sight and sound; it encompasses a range of senses. This article explores the existing technologies driving immersive multi-sensory media, delving into their capabilities and potential applications. This exploration includes a comparative analysis between conventional immersive media streaming and a proposed use case that leverages semantic communication empowered by generative Artificial Intelligence (AI). The focal point of this analysis is the substantial reduction in bandwidth consumption by 99.93% in the proposed scheme. Through this comparison, we aim to underscore the practical applications of generative AI for immersive media. Concurrently addressing major challenges in this field, such as temporal synchronization of multiple media, ensuring high throughput, minimizing the End-to-End (E2E) latency, and robustness to low bandwidth while outlining future trajectories.

NIMar 6, 2025
Large-Scale AI in Telecom: Charting the Roadmap for Innovation, Scalability, and Enhanced Digital Experiences

Adnan Shahid, Adrian Kliks, Ahmed Al-Tahmeesschi et al.

This white paper discusses the role of large-scale AI in the telecommunications industry, with a specific focus on the potential of generative AI to revolutionize network functions and user experiences, especially in the context of 6G systems. It highlights the development and deployment of Large Telecom Models (LTMs), which are tailored AI models designed to address the complex challenges faced by modern telecom networks. The paper covers a wide range of topics, from the architecture and deployment strategies of LTMs to their applications in network management, resource allocation, and optimization. It also explores the regulatory, ethical, and standardization considerations for LTMs, offering insights into their future integration into telecom infrastructure. The goal is to provide a comprehensive roadmap for the adoption of LTMs to enhance scalability, performance, and user-centric innovation in telecom networks.

IVFeb 28, 2024
NERV++: An Enhanced Implicit Neural Video Representation

Ahmed Ghorbel, Wassim Hamidouche, Luce Morin

Neural fields, also known as implicit neural representations (INRs), have shown a remarkable capability of representing, generating, and manipulating various data types, allowing for continuous data reconstruction at a low memory footprint. Though promising, INRs applied to video compression still need to improve their rate-distortion performance by a large margin, and require a huge number of parameters and long training iterations to capture high-frequency details, limiting their wider applicability. Resolving this problem remains a quite challenging task, which would make INRs more accessible in compression tasks. We take a step towards resolving these shortcomings by introducing neural representations for videos NeRV++, an enhanced implicit neural video representation, as more straightforward yet effective enhancement over the original NeRV decoder architecture, featuring separable conv2d residual blocks (SCRBs) that sandwiches the upsampling block (UB), and a bilinear interpolation skip layer for improved feature representation. NeRV++ allows videos to be directly represented as a function approximated by a neural network, and significantly enhance the representation capacity beyond current INR-based video codecs. We evaluate our method on UVG, MCL JVC, and Bunny datasets, achieving competitive results for video compression with INRs. This achievement narrows the gap to autoencoder-based video coding, marking a significant stride in INR-based video compression research.

LGApr 13, 2025
Can LLMs Revolutionize the Design of Explainable and Efficient TinyML Models?

Christophe El Zeinaty, Wassim Hamidouche, Glenn Herrou et al.

This paper introduces a novel framework for designing efficient neural network architectures specifically tailored to tiny machine learning (TinyML) platforms. By leveraging large language models (LLMs) for neural architecture search (NAS), a vision transformer (ViT)-based knowledge distillation (KD) strategy, and an explainability module, the approach strikes an optimal balance between accuracy, computational efficiency, and memory usage. The LLM-guided search explores a hierarchical search space, refining candidate architectures through Pareto optimization based on accuracy, multiply-accumulate operations (MACs), and memory metrics. The best-performing architectures are further fine-tuned using logits-based KD with a pre-trained ViT-B/16 model, which enhances generalization without increasing model size. Evaluated on the CIFAR-100 dataset and deployed on an STM32H7 microcontroller (MCU), the three proposed models, LMaNet-Elite, LMaNet-Core, and QwNet-Core, achieve accuracy scores of 74.50%, 74.20% and 73.00%, respectively. All three models surpass current state-of-the-art (SOTA) models, such as MCUNet-in3/in4 (69.62% / 72.86%) and XiNet (72.27%), while maintaining a low computational cost of less than 100 million MACs and adhering to the stringent 320 KB static random-access memory (SRAM) constraint. These results demonstrate the efficiency and performance of the proposed framework for TinyML platforms, underscoring the potential of combining LLM-driven search, Pareto optimization, KD, and explainability to develop accurate, efficient, and interpretable models. This approach opens new possibilities in NAS, enabling the design of efficient architectures specifically suited for TinyML.

CVAug 5, 2025
RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.

IVFeb 1, 2022
CAESR: Conditional Autoencoder and Super-Resolution for Learned Spatial Scalability

Charles Bonnineau, Wassim Hamidouche, Jean-François Travers et al.

In this paper, we present CAESR, an hybrid learning-based coding approach for spatial scalability based on the versatile video coding (VVC) standard. Our framework considers a low-resolution signal encoded with VVC intra-mode as a base-layer (BL), and a deep conditional autoencoder with hyperprior (AE-HP) as an enhancement-layer (EL) model. The EL encoder takes as inputs both the upscaled BL reconstruction and the original image. Our approach relies on conditional coding that learns the optimal mixture of the source and the upscaled BL image, enabling better performance than residual coding. On the decoder side, a super-resolution (SR) module is used to recover high-resolution details and invert the conditional coding process. Experimental results have shown that our solution is competitive with the VVC full-resolution intra coding while being scalable.

CVJul 7, 2021
Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition

Mouath Aouayeb, Wassim Hamidouche, Catherine Soladie et al.

As various databases of facial expressions have been made accessible over the last few decades, the Facial Expression Recognition (FER) task has gotten a lot of interest. The multiple sources of the available databases raised several challenges for facial recognition task. These challenges are usually addressed by Convolution Neural Network (CNN) architectures. Different from CNN models, a Transformer model based on attention mechanism has been presented recently to address vision tasks. One of the major issue with Transformers is the need of a large data for training, while most FER databases are limited compared to other vision applications. Therefore, we propose in this paper to learn a vision Transformer jointly with a Squeeze and Excitation (SE) block for FER task. The proposed method is evaluated on different publicly available FER databases including CK+, JAFFE,RAF-DB and SFEW. Experiments demonstrate that our model outperforms state-of-the-art methods on CK+ and SFEW and achieves competitive results on JAFFE and RAF-DB.

CVJun 7, 2021
Reveal of Vision Transformers Robustness against Adversarial Attacks

Ahmed Aldahdooh, Wassim Hamidouche, Olivier Deforges

The major part of the vanilla vision transformer (ViT) is the attention block that brings the power of mimicking the global context of the input image. For better performance, ViT needs large-scale training data. To overcome this data hunger limitation, many ViT-based networks, or hybrid-ViT, have been proposed to include local context during the training. The robustness of ViTs and its variants against adversarial attacks has not been widely investigated in the literature like CNNs. This work studies the robustness of ViT variants 1) against different Lp-based adversarial attacks in comparison with CNNs, 2) under adversarial examples (AEs) after applying preprocessing defense methods and 3) under the adaptive attacks using expectation over transformation (EOT) framework. To that end, we run a set of experiments on 1000 images from ImageNet-1k and then provide an analysis that reveals that vanilla ViT or hybrid-ViT are more robust than CNNs. For instance, we found that 1) Vanilla ViTs or hybrid-ViTs are more robust than CNNs under Lp-based attacks and under adaptive attacks. 2) Unlike hybrid-ViTs, Vanilla ViTs are not responding to preprocessing defenses that mainly reduce the high frequency components. Furthermore, feature maps, attention maps, and Grad-CAM visualization jointly with image quality measures, and perturbations' energy spectrum are provided for an insight understanding of attention-based models.

IVMay 7, 2021
NTIRE 2021 Challenge on Perceptual Image Quality Assessment

Jinjin Gu, Haoming Cai, Chao Dong et al.

This paper reports on the NTIRE 2021 challenge on perceptual image quality assessment (IQA), held in conjunction with the New Trends in Image Restoration and Enhancement workshop (NTIRE) workshop at CVPR 2021. As a new type of image processing technology, perceptual image processing algorithms based on Generative Adversarial Networks (GAN) have produced images with more realistic textures. These output images have completely different characteristics from traditional distortions, thus pose a new challenge for IQA methods to evaluate their visual quality. In comparison with previous IQA challenges, the training and testing datasets in this challenge include the outputs of perceptual image processing algorithms and the corresponding subjective scores. Thus they can be used to develop and evaluate IQA methods on GAN-based distortions. The challenge has 270 registered participants in total. In the final testing stage, 13 participating teams submitted their models and fact sheets. Almost all of them have achieved much better results than existing IQA methods, while the winning method can demonstrate state-of-the-art performance.

CVMay 3, 2021
CMA-Net: A Cascaded Mutual Attention Network for Light Field Salient Object Detection

Yi Zhang, Lu Zhang, Wassim Hamidouche et al.

In the past few years, numerous deep learning methods have been proposed to address the task of segmenting salient objects from RGB images. However, these approaches depending on single modality fail to achieve the state-of-the-art performance on widely used light field salient object detection (SOD) datasets, which collect large-scale natural images and provide multiple modalities such as multi-view, micro-lens images and depth maps. Most recently proposed light field SOD methods have acquired improving detecting accuracy, yet still predict rough objects' structures and perform slow inference speed. To this end, we propose CMA-Net, which consists of two novel cascaded mutual attention modules aiming at fusing the high level features from the modalities of all-in-focus and depth. Our proposed CMA-Net outperforms 30 SOD methods on two widely applied light field benchmark datasets. Besides, the proposed CMA-Net is able to inference at the speed of 53 fps. Extensive quantitative and qualitative experiments illustrate both the effectiveness and efficiency of our CMA-Net.

CVMay 1, 2021
Adversarial Example Detection for DNN Models: A Review and Experimental Comparison

Ahmed Aldahdooh, Wassim Hamidouche, Sid Ahmed Fezza et al.

Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications have to draw their path to success deployment once they have the capability to overcome safety-critical challenges. Among these challenges are the defense against or/and the detection of the adversarial examples (AEs). Adversaries can carefully craft small, often imperceptible, noise called perturbations to be added to the clean image to generate the AE. The aim of AE is to fool the DL model which makes it a potential risk for DL applications. Many test-time evasion attacks and countermeasures,i.e., defense or detection methods, are proposed in the literature. Moreover, few reviews and surveys were published and theoretically showed the taxonomy of the threats and the countermeasure methods with little focus in AE detection methods. In this paper, we focus on image classification task and attempt to provide a survey for detection methods of test-time evasion attacks on neural network classifiers. A detailed discussion for such methods is provided with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also provide potential challenges and future perspectives for this research direction.

NEApr 19, 2021
Conditional Coding and Variable Bitrate for Practical Learned Video Coding

Théo Ladune, Pierrick Philippe, Wassim Hamidouche et al.

This paper introduces a practical learned video codec. Conditional coding and quantization gain vectors are used to provide flexibility to a single encoder/decoder pair, which is able to compress video sequences at a variable bitrate. The flexibility is leveraged at test time by choosing the rate and GOP structure to optimize a rate-distortion cost. Using the CLIC21 video test conditions, the proposed approach shows performance on par with HEVC.

CVApr 16, 2021
Multitask Learning for VVC Quality Enhancement and Super-Resolution

Charles Bonnineau, Wassim Hamidouche, Jean-Francois Travers et al.

The latest video coding standard, called versatile video coding (VVC), includes several novel and refined coding tools at different levels of the coding chain. These tools bring significant coding gains with respect to the previous standard, high efficiency video coding (HEVC). However, the encoder may still introduce visible coding artifacts, mainly caused by coding decisions applied to adjust the bitrate to the available bandwidth. Hence, pre and post-processing techniques are generally added to the coding pipeline to improve the quality of the decoded video. These methods have recently shown outstanding results compared to traditional approaches, thanks to the recent advances in deep learning. Generally, multiple neural networks are trained independently to perform different tasks, thus omitting to benefit from the redundancy that exists between the models. In this paper, we investigate a learning-based solution as a post-processing step to enhance the decoded VVC video quality. Our method relies on multitask learning to perform both quality enhancement and super-resolution using a single shared network optimized for multiple degradation levels. The proposed solution enables a good performance in both mitigating coding artifacts and super-resolution with fewer network parameters compared to traditional specialized architectures.

CRMar 9, 2021
Revisiting Model's Uncertainty and Confidences for Adversarial Example Detection

Ahmed Aldahdooh, Wassim Hamidouche, Olivier Déforges

Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations that are crafted to generate Adversarial Examples(AEs). The AEs are imperceptible to humans and cause DNN to misclassify them. Many defense and detection techniques have been proposed. Model's confidences and Dropout, as a popular way to estimate the model's uncertainty, have been used for AE detection but they showed limited success against black- and gray-box attacks. Moreover, the state-of-the-art detection techniques have been designed for specific attacks or broken by others, need knowledge about the attacks, are not consistent, increase model parameters overhead, are time-consuming, or have latency in inference time. To trade off these factors, we revisit the model's uncertainty and confidences and propose a novel unsupervised ensemble AE detection mechanism that 1) uses the uncertainty method called SelectiveNet, 2) processes model layers outputs, i.e.feature maps, to generate new confidence probabilities. The detection method is called Selective and Feature based Adversarial Detection (SFAD). Experimental results show that the proposed approach achieves better performance against black- and gray-box attacks than the state-of-the-art methods and achieves comparable performance against white-box attacks. Moreover, results show that SFAD is fully robust against High Confidence Attacks (HCAs) for MNIST and partially robust for CIFAR10 datasets.

CRMar 6, 2021
Selective Encryption of the Versatile Video Coding Standard

Guillaume Gautier, Mousa FarajAllah, Wassim Hamidouche et al.

Versatile video coding (VVC) is the next generation video coding standard developed by the joint video experts team (JVET) and released in July 2020. VVC introduces several new coding tools providing a significant coding gain over the high efficiency video coding (HEVC) standard. It is well known that increasing the coding efficiency adds more dependencies in the video bitstream making format-compliant encryption with the standard more challenging. In this paper we tackle the problem of selective encryption of the VVC standard in format-compliant and constant bitrate. These two constraints ensure that the encrypted bitstream can be decoded by any VVC decoder while the bitrate remains unchanged by the encryption. The selective encryption of all possible VVC syntax elements is investigated. A new algorithm is proposed to encrypt in format-compliant and constant bitrate the transform coefficients (TCs) together with other syntax elements at the level of the entropy encoder. The proposed solution was integrated and assessed under the VVC reference software model version 6.0. Experimental results showed that the encryption drastically decreases the video quality while the encryption is robust against several types of attacks. The encryption space is estimated in the range of 15% to 26% of the bitstream size resulting in a lightweight encryption process. The web page of this work is available at https://gugautie.github.io/sevvc/.

IVAug 6, 2020
Optical Flow and Mode Selection for Learning-based Video Coding

Théo Ladune, Pierrick Philippe, Wassim Hamidouche et al.

This paper introduces a new method for inter-frame coding based on two complementary autoencoders: MOFNet and CodecNet. MOFNet aims at computing and conveying the Optical Flow and a pixel-wise coding Mode selection. The optical flow is used to perform a prediction of the frame to code. The coding mode selection enables competition between direct copy of the prediction or transmission through CodecNet. The proposed coding scheme is assessed under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding conditions, where it is shown to perform on par with the state-of-the-art video codec ITU/MPEG HEVC. Moreover, the possibility of copying the prediction enables to learn the optical flow in an end-to-end fashion i.e. without relying on pre-training and/or a dedicated loss term.

NEJul 6, 2020
ModeNet: Mode Selection Network For Learned Video Coding

Théo Ladune, Pierrick Philippe, Wassim Hamidouche et al.

In this paper, a mode selection network (ModeNet) is proposed to enhance deep learning-based video compression. Inspired by traditional video coding, ModeNet purpose is to enable competition among several coding modes. The proposed ModeNet learns and conveys a pixel-wise partitioning of the frame, used to assign each pixel to the most suited coding mode. ModeNet is trained alongside the different coding modes to minimize a rate-distortion cost. It is a flexible component which can be generalized to other systems to allow competition between different coding tools. Mod-eNet interest is studied on a P-frame coding task, where it is used to design a method for coding a frame given its prediction. ModeNet-based systems achieve compelling performance when evaluated under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding track conditions.

IVFeb 21, 2020
Binary Probability Model for Learning Based Image Compression

Théo Ladune, Pierrick Philippe, Wassim Hamidouche et al.

In this paper, we propose to enhance learned image compression systems with a richer probability model for the latent variables. Previous works model the latents with a Gaussian or a Laplace distribution. Inspired by binary arithmetic coding , we propose to signal the latents with three binary values and one integer, with different probability models. A relaxation method is designed to perform gradient-based training. The richer probability model results in a better entropy coding leading to lower rate. Experiments under the Challenge on Learned Image Compression (CLIC) test conditions demonstrate that this method achieves 18% rate saving compared to Gaussian or Laplace models.

LGJun 1, 2019
Perceptual Evaluation of Adversarial Attacks for CNN-based Image Classification

Sid Ahmed Fezza, Yassine Bakhti, Wassim Hamidouche et al.

Deep neural networks (DNNs) have recently achieved state-of-the-art performance and provide significant progress in many machine learning tasks, such as image classification, speech processing, natural language processing, etc. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. For instance, in the image classification domain, adding small imperceptible perturbations to the input image is sufficient to fool the DNN and to cause misclassification. The perturbed image, called \textit{adversarial example}, should be visually as close as possible to the original image. However, all the works proposed in the literature for generating adversarial examples have used the $L_{p}$ norms ($L_{0}$, $L_{2}$ and $L_{\infty}$) as distance metrics to quantify the similarity between the original image and the adversarial example. Nonetheless, the $L_{p}$ norms do not correlate with human judgment, making them not suitable to reliably assess the perceptual similarity/fidelity of adversarial examples. In this paper, we present a database for visual fidelity assessment of adversarial examples. We describe the creation of the database and evaluate the performance of fifteen state-of-the-art full-reference (FR) image fidelity assessment metrics that could substitute $L_{p}$ norms. The database as well as subjective scores are publicly available to help designing new metrics for adversarial examples and to facilitate future research works.