Abdenour Hadid

CV
h-index: 54
40 papers
788 citations
Novelty: 46%
AI Score: 56

40 Papers

CV | Aug 22, 2024 | Code
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient multitask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language along with a soft prompt-tuning strategy to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47% while also obtaining noteworthy performance metrics, with an average F1-score of 92.64% and ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at https://github.com/Mamadou-Keita/FIDAVL.

LG | Sep 9, 2022
Knowledge-based Deep Learning for Modeling Chaotic Systems

Zakaria Elabid, Tanujit Chakraborty, Abdenour Hadid

Deep Learning has received increased attention due to its unbeatable success in many fields, such as computer vision, natural language processing, recommendation systems, and most recently in simulating multiphysics problems and predicting nonlinear dynamical systems. However, modeling and forecasting the dynamics of chaotic systems remains an open research problem since training deep learning models requires big data, which is not always available in many cases. Such deep learners can be trained from additional information obtained from simulated results and by enforcing the physical laws of the chaotic systems. This paper considers extreme events and their dynamics and proposes elegant models based on deep neural networks, called knowledge-based deep learning (KDL). Our proposed KDL can learn the complex patterns governing chaotic systems by jointly training on real and simulated data directly from the dynamics and their differential equations. This knowledge is transferred to model and forecast real-world chaotic events exhibiting extreme behavior. We validate the efficiency of our model by assessing it on three real-world benchmark datasets: El Nino sea surface temperature, San Juan Dengue viral infection, and Bjørnøya daily precipitation, all governed by extreme events' dynamics. Using prior knowledge of extreme events and physics-based loss functions to lead the neural network learning, we ensure physically consistent, generalizable, and accurate forecasting, even in a small data regime.

LG | Sep 8, 2022
W-Transformers : A Wavelet-based Transformer Framework for Univariate Time Series Forecasting

Lena Sasal, Tanujit Chakraborty, Abdenour Hadid

Deep learning with transformers has recently achieved considerable success in many vital areas such as natural language processing, computer vision, anomaly detection, and recommendation systems, among many others. Among the several merits of transformers, the ability to capture long-range temporal dependencies and interactions is desirable for time series forecasting, leading to their progress in various time series applications. In this paper, we build a transformer model for non-stationary time series. The problem is challenging yet crucially important. We present a novel framework for univariate time series representation learning based on a wavelet-based transformer encoder architecture and call it W-Transformer. The proposed W-Transformers apply a maximal overlap discrete wavelet transformation (MODWT) to the time series data and build local transformers on the decomposed datasets to capture the nonstationarity and long-range nonlinear dependencies in the time series. Evaluating our framework on several publicly available benchmark time series datasets from various domains and with diverse characteristics, we demonstrate that it performs, on average, significantly better than the baseline forecasters for short-term and long-term forecasting, even for datasets that consist of only a few hundred training samples.

CV | Oct 1, 2022
Evaluation of Pre-Trained CNN Models for Geographic Fake Image Detection

Sid Ahmed Fezza, Mohammed Yasser Ouis, Bachir Kaddar et al.

Thanks to the remarkable advances in generative adversarial networks (GANs), it is becoming increasingly easy to generate/manipulate images. The existing works have mainly focused on deepfake in face images and videos. However, we are currently witnessing the emergence of fake satellite images, which can be misleading or even threatening to national security. Consequently, there is an urgent need to develop detection methods capable of distinguishing between real and fake satellite images. To advance the field, in this paper, we explore the suitability of several convolutional neural network (CNN) architectures for fake satellite image detection. Specifically, we benchmark four CNN models by conducting extensive experiments to evaluate their performance and robustness against various image distortions. This work allows the establishment of new baselines and may be useful for the development of CNN-based methods for fake satellite image detection.

LG | Aug 29, 2024 | Code
TempoKGAT: A Novel Graph Attention Network Approach for Temporal Graph Analysis

Lena Sasal, Daniel Busby, Abdenour Hadid

Graph neural networks (GNN) have shown significant capabilities in handling structured data, yet their application to dynamic, temporal data remains limited. This paper presents a new type of graph attention network, called TempoKGAT, which combines time-decaying weight and a selective neighbor aggregation mechanism on the spatial domain, which helps uncover latent patterns in the graph data. In this approach, a top-k neighbor selection based on the edge weights is introduced to represent the evolving features of the graph data. We evaluated the performance of our TempoKGAT on multiple datasets from the traffic, energy, and health sectors involving spatio-temporal data. We compared the performance of our approach to several state-of-the-art methods found in the literature on several open-source datasets. Our method shows superior accuracy on all datasets. These results indicate that TempoKGAT builds on existing methodologies to optimize prediction accuracy and provide new insights into model interpretation in temporal contexts.
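
The abstract does not give formulas, so the following is only a rough sketch of what "top-k neighbor selection based on the edge weights" and a "time-decaying weight" could look like in isolation. The function names, the exponential decay form, and the `decay_rate` parameter are assumptions made for illustration, not the TempoKGAT implementation.

```python
import math


def top_k_neighbors(edge_weights, k):
    """Return the ids of the k neighbors with the largest edge weights.

    edge_weights: dict mapping neighbor id -> edge weight (hypothetical layout).
    """
    ranked = sorted(edge_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [neighbor for neighbor, _ in ranked[:k]]


def time_decayed_average(history, decay_rate):
    """Weighted average of past values, ordered oldest to newest.

    Each value gets weight exp(-decay_rate * age), where the newest value
    has age 0. The exponential form is an assumption for this sketch.
    """
    if not history:
        raise ValueError("history must not be empty")
    n = len(history)
    weights = [math.exp(-decay_rate * (n - 1 - i)) for i in range(n)]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, history)) / total


# Keep only the two strongest neighbors of a node.
selected = top_k_neighbors({"a": 0.2, "b": 0.9, "c": 0.5}, k=2)

# With decay_rate = 0 every step is weighted equally (plain mean).
plain_mean = time_decayed_average([0.0, 2.0, 4.0], decay_rate=0.0)

# With a positive decay_rate, recent values dominate the average.
recent_biased = time_decayed_average([0.0, 2.0, 4.0], decay_rate=1.0)
```

In a full model, the selected neighbors and decayed weights would feed an attention-based aggregation step; that part is omitted here because the abstract does not specify it.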

ML | Apr 1, 2022
Probabilistic AutoRegressive Neural Networks for Accurate Long-range Forecasting

Madhurima Panja, Tanujit Chakraborty, Uttam Kumar et al.

Forecasting time series data is a critical area of research with applications spanning from stock prices to early epidemic prediction. While numerous statistical and machine learning methods have been proposed, real-life prediction problems often require hybrid solutions that bridge classical forecasting approaches and modern neural network models. In this study, we introduce the Probabilistic AutoRegressive Neural Networks (PARNN), capable of handling complex time series data exhibiting non-stationarity, nonlinearity, non-seasonality, long-range dependence, and chaotic patterns. PARNN is constructed by improving autoregressive neural networks (ARNN) using autoregressive integrated moving average (ARIMA) feedback error, combining the explainability, scalability, and "white-box-like" prediction behavior of both models. Notably, the PARNN model provides uncertainty quantification through prediction intervals, setting it apart from advanced deep learning tools. Through comprehensive computational experiments, we evaluate the performance of PARNN against standard statistical, machine learning, and deep learning models, including Transformers, NBeats, and DeepAR. Diverse real-world datasets from macroeconomics, tourism, epidemiology, and other domains are employed for short-term, medium-term, and long-term forecasting evaluations. Our results demonstrate the superiority of PARNN across various forecast horizons, surpassing the state-of-the-art forecasters. The proposed PARNN model offers a valuable hybrid solution for accurate long-range forecasting. By effectively capturing the complexities present in time series data, it outperforms existing methods in terms of accuracy and reliability. The ability to quantify uncertainty through prediction intervals further enhances the model's usefulness in decision-making processes.

50.9 | CV | Apr 4 | Code
SPARK-IL: Spectral Retrieval-Augmented RAG for Knowledge-driven Deepfake Detection via Incremental Learning

Hessen Bougueffa Eutamene, Abdellah Zakaria Sellam, Abdelmalik Taleb-Ahmed et al.

Detecting AI-generated images remains a significant challenge because detectors trained on specific generators often fail to generalize to unseen models; however, while pixel-level artifacts vary across models, frequency-domain signatures exhibit greater consistency, providing a promising foundation for cross-generator detection. To address this, we propose SPARK-IL, a retrieval-augmented framework that combines dual-path spectral analysis with incremental learning by utilizing a partially frozen ViT-L/14 encoder for semantic representations alongside a parallel path for raw RGB pixel embeddings. Both paths undergo multi-band Fourier decomposition into four frequency bands, which are individually processed by Kolmogorov-Arnold Networks (KAN) with mixture-of-experts for band-specific transformations before the resulting spectral embeddings are fused via cross-attention with residual connections. During inference, this fused embedding retrieves the k nearest labeled signatures from a Milvus database using cosine similarity to facilitate predictions via majority voting, while an incremental learning strategy expands the database and employs elastic weight consolidation to preserve previously learned transformations. Evaluated on the UniversalFakeDetect benchmark across 19 generative models (including GANs, face-swapping, and diffusion methods), SPARK-IL achieves a 94.6% mean accuracy, with the code to be publicly released at https://github.com/HessenUPHF/SPARK-IL.
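
The retrieval step described in the abstract (k nearest labeled signatures by cosine similarity, then majority voting) can be sketched in a few lines. This is a minimal in-memory illustration: the list-based `database`, the function names, and the toy 2-D embeddings are assumptions, and no vector database such as Milvus is involved.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def knn_majority_vote(query, database, k=3):
    """Predict a label from the k stored embeddings most similar to query.

    database: list of (embedding, label) pairs (hypothetical layout).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(
        database,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy labeled signatures; real embeddings would be high-dimensional.
database = [
    ([1.0, 0.0], "real"),
    ([0.9, 0.1], "real"),
    ([0.0, 1.0], "fake"),
    ([0.1, 0.9], "fake"),
    ([0.2, 0.8], "fake"),
]

prediction = knn_majority_vote([0.05, 0.95], database, k=3)
```

Under this scheme, adding a new generator amounts to appending its labeled signatures to the database, which is one plausible reading of how the incremental strategy "expands the database".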

CV | Mar 1
VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification

Abdellah Zakaria Sellam, Fadi Abdeladhim Zidi, Salah Eddine Bekhouche et al.

Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.

CV | Apr 2, 2024 | Code
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. While this technological progress has captured significant interest, it has also raised concerns about the potential difficulty in distinguishing real images from their synthetic counterparts. This paper takes inspiration from the potent convergence capabilities between vision and language, coupled with the zero-shot nature of vision-language models (VLMs). We introduce an innovative method called Bi-LORA that leverages VLMs, combined with low-rank adaptation (LORA) tuning techniques, to enhance the precision of synthetic image detection for unseen model-generated images. The pivotal conceptual shift in our methodology revolves around reframing binary classification as an image captioning task, leveraging the distinctive capabilities of a cutting-edge VLM, notably bootstrapping language image pre-training (BLIP2). Rigorous and comprehensive experiments are conducted to validate the effectiveness of our proposed approach, particularly in detecting diffusion-generated images from diffusion-based generative models unseen during training, showcasing robustness to noise, and demonstrating generalization capabilities to GANs. The obtained results showcase an impressive average accuracy of 93.41% in synthetic image detection on unseen generation models. The code and models associated with this research can be publicly accessed at https://github.com/Mamadou-Keita/VLM-DETECT.

CV | Sep 5, 2024
Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression

Ibtissam Saadi, Douglas W. Cunningham, Taleb-ahmed Abdelmalik et al.

Existing methods for driver facial expression recognition (DFER) are often computationally intensive, rendering them unsuitable for real-time applications. In this work, we introduce a novel transfer learning-based dual architecture, named ShuffViT-DFER, which combines computational efficiency and accuracy. This is achieved by harnessing the strengths of two lightweight and efficient models, a convolutional neural network (CNN) and a vision transformer (ViT). We fuse the extracted features to enhance the performance of the model in accurately recognizing the facial expressions of the driver. Our experimental results on two public benchmark datasets, KMU-FED and KDEF, highlight the validity of our proposed method for real-time application with superior performance when compared to state-of-the-art methods.

15.5 | CV | Mar 16
Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

Salah Eddine Bekhouche, Hichem Telli, Azeddine Benlamoudi et al.

Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels, saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the disagreements between what is said, how it sounds, and what the face shows. We present ConflictAwareAH, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features (element-wise absolute differences between modality embeddings) serve as bidirectional cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary text-guided late fusion strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches 0.694 Macro F1 on the labelled test split and 0.715 on the private leaderboard, outperforming published multimodal baselines by over 10 points, all on a single GPU in under 25 minutes of training.
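
The pairwise conflict features mentioned in the abstract are element-wise absolute differences between modality embeddings, which can be illustrated as follows. The sketch assumes the three embeddings have already been projected to a common dimension and that the three pairwise differences are concatenated; the function name and both of those choices are assumptions, not details confirmed by the abstract.

```python
def conflict_features(video, audio, text):
    """Concatenate |video-audio|, |video-text| and |audio-text| element-wise.

    All three embeddings must share one dimension (assumed to be ensured
    by an earlier projection step that is not shown here).
    """
    if not (len(video) == len(audio) == len(text)):
        raise ValueError("embeddings must have the same dimension")

    def abs_diff(a, b):
        return [abs(x - y) for x, y in zip(a, b)]

    return abs_diff(video, audio) + abs_diff(video, text) + abs_diff(audio, text)


# Toy 3-D embeddings: video and audio disagree strongly on the second entry.
video = [0.5, -1.0, 2.0]
audio = [0.5, 1.0, 2.0]
text = [0.0, 1.0, 2.5]

features = conflict_features(video, audio, text)
```

Large entries in `features` mark dimensions where modalities disagree, while entries near zero indicate agreement, matching the "bidirectional cue" reading in the abstract.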

CV | Apr 3, 2024 | Code
Harnessing the Power of Large Vision Language Models for Synthetic Image Detection

Mamadou Keita, Wassim Hamidouche, Hassen Bougueffa et al.

In recent years, the emergence of models capable of generating images from text has attracted considerable interest, offering the possibility of creating realistic images from text descriptions. Yet these advances have also raised concerns about the potential misuse of these images, including the creation of misleading content such as fake news and propaganda. This study investigates the effectiveness of using advanced vision-language models (VLMs) for synthetic image identification. Specifically, the focus is on tuning state-of-the-art image captioning models for synthetic image detection. By harnessing the robust understanding capabilities of large VLMs, the aim is to distinguish authentic images from synthetic images produced by diffusion-based models. This study contributes to the advancement of synthetic image detection by exploiting the capabilities of visual language models such as BLIP-2 and ViTGPT2. By tailoring image captioning models, we address the challenges associated with the potential misuse of synthetic images in real-world applications. Results described in this paper highlight the promising role of VLMs in the field of synthetic image detection, outperforming conventional image-based detection techniques. Code and models can be found at https://github.com/Mamadou-Keita/VLM-DETECT.

CV | May 14, 2025 | Code
Recent Advances in Medical Imaging Segmentation: A Survey

Fares Bougourzi, Abdenour Hadid

Medical imaging is a cornerstone of modern healthcare, driving advancements in diagnosis, treatment planning, and patient care. Among its various tasks, segmentation remains one of the most challenging problems due to factors such as data accessibility, annotation complexity, structural variability, variation in medical imaging modalities, and privacy constraints. Despite recent progress, achieving robust generalization and domain adaptation remains a significant hurdle, particularly given the resource-intensive nature of some proposed models and their reliance on domain expertise. This survey explores cutting-edge advancements in medical image segmentation, focusing on methodologies such as Generative AI, Few-Shot Learning, Foundation Models, and Universal Models. These approaches offer promising solutions to longstanding challenges. We provide a comprehensive overview of the theoretical foundations, state-of-the-art techniques, and recent applications of these methods. Finally, we discuss inherent limitations, unresolved issues, and future research directions aimed at enhancing the practicality and accessibility of segmentation models in medical imaging. We are maintaining a GitHub repository at https://github.com/faresbougourzi/Awesome-DL-for-Medical-Imaging-Segmentation to continue tracking and updating innovations in this field.

LG | Aug 29, 2024
TG-PhyNN: An Enhanced Physically-Aware Graph Neural Network framework for forecasting Spatio-Temporal Data

Zakaria Elabid, Lena Sasal, Daniel Busby et al.

Accurately forecasting dynamic processes on graphs, such as traffic flow or disease spread, remains a challenge. While Graph Neural Networks (GNNs) excel at modeling and forecasting spatio-temporal data, they often lack the ability to directly incorporate underlying physical laws. This work presents TG-PhyNN, a novel Temporal Graph Physics-Informed Neural Network framework. TG-PhyNN leverages the power of GNNs for graph-based modeling while simultaneously incorporating physical constraints as a guiding principle during training. This is achieved through a two-step prediction strategy that enables the calculation of physical equation derivatives within the GNN architecture. Our findings demonstrate that TG-PhyNN significantly outperforms traditional forecasting models (e.g., GRU, LSTM, GAT) on real-world spatio-temporal datasets like PedalMe (traffic flow), COVID-19 spread, and Chickenpox outbreaks. These datasets are all governed by well-defined physical principles, which TG-PhyNN effectively exploits to offer more reliable and accurate forecasts in various domains where physical processes govern the dynamics of data. This paves the way for improved forecasting in areas like traffic flow prediction, disease outbreak prediction, and potentially other fields where physics plays a crucial role.

CV | Apr 28, 2025 | Code
DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.
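
For readers unfamiliar with the triplet loss mentioned in the abstract, the standard margin-based form is max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below uses Euclidean distance and a margin of 1.0; both are assumptions chosen for illustration, since the abstract does not state which distance or margin DeeCLIP uses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss for a single triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]

# Well-separated triplet: positive at distance 1, negative at distance 5.
easy = triplet_loss(anchor, [0.0, 1.0], [3.0, 4.0])

# Violating triplet: positive at distance 5, negative at distance 1.
hard = triplet_loss(anchor, [3.0, 4.0], [0.0, 1.0])
```

In a detection setting, the anchor and positive would be embeddings of the same class (for example, two real images) and the negative an embedding of the other class.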

19.0 | CV | May 14
Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

CL | Jul 31, 2025 | Code
Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab et al.

Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available at https://github.com/Bekhouche/APR.

CV | Jul 21, 2025 | Code
SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging

Salah Eddine Bekhouche, Gaby Maroun, Fadi Dornaika et al.

Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves the generation quality at reduced inference steps and maintains the flexibility of standard diffusion models. Our method is evaluated on three benchmark datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is made publicly available at https://github.com/Bekhouche/SegDT.

CV | Mar 21, 2025 | Code
PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition

Ibtissam Saadi, Abdenour Hadid, Douglas W. Cunningham et al.

Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER, significantly reducing trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP.

HC | Nov 1, 2019 | Code
Towards Robust Deep Neural Networks for Affect and Depression Recognition from Speech

Alice Othmani, Daoud Kadoch, Kamil Bentounes et al.

Intelligent monitoring systems and affective computing applications have emerged in recent years to enhance healthcare. Examples of these applications include assessment of affective states such as Major Depressive Disorder (MDD). MDD describes the constant expression of certain emotions: negative emotions (low Valence) and lack of interest (low Arousal). High-performing intelligent systems would enhance MDD diagnosis in its early stages. In this paper, we present a new deep neural network architecture, called EmoAudioNet, for emotion and depression recognition from speech. Deep EmoAudioNet learns from the time-frequency representation of the audio signal and the visual representation of its spectrum of frequencies. Our model shows very promising results in predicting affect and depression. It performs comparably to or better than the state-of-the-art methods according to several evaluation metrics on the RECOLA and DAIC-WOZ datasets in predicting arousal, valence, and depression. Code of EmoAudioNet is publicly available on GitHub: https://github.com/AliceOTHMANI/EmoAudioNet

20.9 | CV | Apr 21
RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

Ahmed Marouane Djouama, Abir Belaala, Abdellah Zakaria Sellam et al.

Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
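
The Dice figures quoted above refer to the Dice coefficient, 2|A ∩ B| / (|A| + |B|), computed between a predicted and a reference mask. The following sketch computes it for flat binary masks; the function name and the choice to return 1.0 when both masks are empty are conventions assumed for this illustration, and the paper's exact evaluation protocol (per class, per volume, averaging) is not specified in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice coefficient between two flat binary masks of 0/1 integers."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / total


# One of the two predicted foreground pixels overlaps the single true pixel.
partial = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])

# Identical masks give a Dice coefficient of 1.0.
perfect = dice_coefficient([0, 1, 1, 0], [0, 1, 1, 0])
```

A "mean Dice" score is then typically an average of this quantity over classes or cases.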

LG | Apr 4, 2024
Knowledge-Based Convolutional Neural Network for the Simulation and Prediction of Two-Phase Darcy Flows

Zakaria Elabid, Daniel Busby, Abdenour Hadid

Physics-informed neural networks (PINNs) have gained significant prominence as a powerful tool in the field of scientific computing and simulations. Their ability to seamlessly integrate physical principles into deep learning architectures has revolutionized the approaches to solving complex problems in physics and engineering. However, a persistent challenge faced by mainstream PINNs lies in their handling of discontinuous input data, leading to inaccuracies in predictions. This study addresses these challenges by incorporating the discretized forms of the governing equations into the PINN framework. We propose to combine the power of neural networks with the dynamics imposed by the discretized differential equations. By discretizing the governing equations, the PINN learns to account for the discontinuities and accurately capture the underlying relationships between inputs and outputs, improving the accuracy compared to traditional interpolation techniques. Moreover, by leveraging the power of neural networks, the computational cost associated with numerical simulations is substantially reduced. We evaluate our model on a large-scale dataset for the prediction of pressure and saturation fields demonstrating high accuracies compared to non-physically aware models.

11.5CVMar 13
Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

Fares Bougourzi, Fadi Dornaika, Abdenour Hadid

Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.

CVAug 30, 2025
C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Car Damage Detection

Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche et al.

Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, is difficult even for human experts to perform reliably. While DiffusionDet has advanced the state-of-the-art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this fundamental limitation by introducing Context-Aware Fusion (CAF), which leverages cross-attention mechanisms to integrate global scene context with local proposal features directly. The global context is generated using a separate dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level understanding. Our framework significantly enhances the generative detection paradigm by enabling each object proposal to attend to comprehensive environmental information. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.

CLAug 30, 2025
CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Hichem Telli et al.

Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.

CVAug 5, 2025
RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene et al.

In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
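The retrieval step described above (embed the query, fetch the most similar database images, and pass them along with the query) can be sketched as follows. This is an illustration only: the two-dimensional embeddings, the image identifiers, and the function names are hypothetical, and cosine similarity is an assumed similarity measure. The real system uses a fine-tuned CLIP encoder and forwards the images to a VLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, database, k=2):
    """Return the k (image_id, embedding) pairs most similar to the query."""
    ranked = sorted(
        database,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return ranked[:k]


# Toy database of (image_id, embedding) pairs; values are made up.
database = [
    ("img_a", [1.0, 0.0]),
    ("img_b", [0.9, 0.1]),
    ("img_c", [0.0, 1.0]),
    ("img_d", [0.1, 0.9]),
]
query = [0.2, 0.8]

neighbours = retrieve_top_k(query, database, k=2)
# The enriched input is the query plus its retrieved neighbours.
enriched_input = {"query": "query_image", "retrieved": [n[0] for n in neighbours]}
print(enriched_input)
```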

CVJul 27, 2025
SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection

Mohammed-En-Nadhir Zighem, Abdenour Hadid

Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.

CVJan 27, 2024
Face to Cartoon Incremental Super-Resolution using Knowledge Distillation

Trinetra Devkatte, Shiv Ram Dubey, Satish Kumar Singh et al.

Facial super-resolution/hallucination is an important area of research that seeks to enhance low-resolution facial images for a variety of applications. While Generative Adversarial Networks (GANs) have shown promise in this area, their ability to adapt to new, unseen data remains a challenge. This paper addresses this problem by proposing an incremental super-resolution using GANs with knowledge distillation (ISR-KD) for face to cartoon. Previous research in this area has not investigated incremental learning, which is critical for real-world applications where new data is continually being generated. The proposed ISR-KD aims to develop a novel unified framework for facial super-resolution that can handle different settings, including different types of faces such as cartoon face and various levels of detail. To achieve this, a GAN-based super-resolution network was pre-trained on the CelebA dataset and then incrementally trained on the iCartoonFace dataset, using knowledge distillation to retain performance on the CelebA test set while improving the performance on iCartoonFace test set. Our experiments demonstrate the effectiveness of knowledge distillation in incrementally adding capability to the model for cartoon face super-resolution while retaining the learned knowledge for facial hallucination tasks in GANs.

GEO-PHJan 25, 2024
When Geoscience Meets Generative AI and Large Language Models: Foundations, Trends, and Future Challenges

Abdenour Hadid, Tanujit Chakraborty, Daniel Busby

Generative Artificial Intelligence (GAI) represents an emerging field that promises the creation of synthetic data and outputs in different modalities. GAI has recently shown impressive results across a large spectrum of applications, ranging from biology, medicine, education, and legislation to computer science and finance. As one strives for enhanced safety, efficiency, and sustainability, generative AI indeed emerges as a key differentiator and promises a paradigm shift in the field. This paper explores the potential applications of generative AI and large language models in geoscience. Recent developments in machine learning and deep learning have made generative models useful for tackling diverse prediction, simulation, and multi-criteria decision-making problems related to geoscience and Earth system dynamics. This survey discusses several GAI models that have been used in geoscience, comprising generative adversarial networks (GANs), physics-informed neural networks (PINNs), and generative pre-trained transformer (GPT)-based structures. These tools have helped the geoscience community in several applications, including (but not limited to) data generation/augmentation, super-resolution, panchromatic sharpening, haze removal, restoration, and land surface changing. Some challenges remain, such as ensuring physical interpretability, preventing nefarious use, and establishing trustworthiness. Beyond that, GAI models show promise for the geoscience community, especially in supporting work on climate change, urban science, atmospheric science, marine science, and planetary science, through their ability to perform data-driven modeling and uncertainty quantification.

CVDec 12, 2023
Pain Analysis using Adaptive Hierarchical Spatiotemporal Dynamic Imaging

Issam Serraoui, Eric Granger, Abdenour Hadid et al.

Automatic pain intensity estimation plays a pivotal role in healthcare and medical fields. While many methods have been developed to gauge human pain using behavioral or physiological indicators, facial expressions have emerged as a prominent tool for this purpose. Nevertheless, the dependence on labeled data for these techniques often renders them expensive and time-consuming. To tackle this, we introduce the Adaptive Hierarchical Spatio-temporal Dynamic Image (AHDI) technique. AHDI encodes spatiotemporal changes in facial videos into a singular RGB image, permitting the application of simpler 2D deep models for video representation. Within this framework, we employ a residual network to derive generalized facial representations. These representations are optimized for two tasks: estimating pain intensity and differentiating between genuine and simulated pain expressions. For the former, a regression model is trained using the extracted representations, while for the latter, a binary classifier identifies genuine versus feigned pain displays. Testing our method on two widely-used pain datasets, we observed encouraging results for both tasks. On the UNBC database, we achieved an MSE of 0.27 outperforming the SOTA which had an MSE of 0.40. On the BioVid dataset, our model achieved an accuracy of 89.76%, which is an improvement of 5.37% over the SOTA accuracy. Most notably, for distinguishing genuine from simulated pain, our accuracy stands at 94.03%, marking a substantial improvement of 8.98%. Our methodology not only minimizes the need for extensive labeled data but also augments the precision of pain evaluations, facilitating superior pain management.

MLDec 10, 2023
Skew-Probabilistic Neural Networks for Learning from Imbalanced Data

Shraddha M. Naik, Tanujit Chakraborty, Madhurima Panja et al.

Real-world datasets often exhibit imbalanced data distribution, where certain class levels are severely underrepresented. In such cases, traditional pattern classifiers have shown a bias towards the majority class, impeding accurate predictions for the minority class. This paper introduces an imbalanced data-oriented classifier using probabilistic neural networks (PNN) with a skew-normal kernel function to address this major challenge. PNN is known for providing probabilistic outputs, enabling quantification of prediction confidence, interpretability, and the ability to handle limited data. By leveraging the skew-normal distribution, which offers increased flexibility, particularly for imbalanced and non-symmetric data, our proposed Skew-Probabilistic Neural Networks (SkewPNN) can better represent underlying class densities. Hyperparameter fine-tuning is imperative to optimize the performance of the proposed approach on imbalanced datasets. To this end, we employ a population-based heuristic algorithm, the Bat optimization algorithm, to explore the hyperparameter space effectively. We also prove the statistical consistency of the density estimates, suggesting that the true distribution will be approached smoothly as the sample size increases. Theoretical analysis of the computational complexity of the proposed SkewPNN and BA-SkewPNN is also provided. Numerical simulations have been conducted on different synthetic datasets, comparing various benchmark-imbalanced learners. Real-data analysis on several datasets shows that SkewPNN and BA-SkewPNN substantially outperform most state-of-the-art machine-learning methods for both balanced and imbalanced datasets (binary and multi-class categories) in most experimental settings.
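As a rough illustration of a probabilistic neural network with a skew-normal kernel, the sketch below scores each class by the average kernel response of its training points and picks the highest. It is a one-dimensional simplification with hypothetical names. The bandwidth `sigma` and skewness `alpha` are fixed by hand here, whereas the paper tunes hyperparameters with the Bat algorithm, and how the kernel extends to multivariate features is not stated in the abstract.

```python
import math


def skew_normal_kernel(u, alpha):
    """Standard skew-normal density 2 * phi(u) * Phi(alpha * u)."""
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * u / math.sqrt(2.0)))
    return 2.0 * phi * big_phi


def class_density(x, samples, sigma, alpha):
    """Parzen-style density estimate at x from one class's training samples."""
    total = sum(skew_normal_kernel((x - s) / sigma, alpha) for s in samples)
    return total / (len(samples) * sigma)


def predict(x, training_data, sigma=0.5, alpha=2.0):
    """Return the label whose estimated class density at x is largest."""
    return max(
        training_data,
        key=lambda label: class_density(x, training_data[label], sigma, alpha),
    )


# Toy imbalanced data: two minority samples, four majority samples.
training_data = {
    "minority": [0.0, 0.2],
    "majority": [3.0, 3.1, 2.9, 3.2],
}

print(predict(0.1, training_data))
print(predict(3.05, training_data))
```

Setting `alpha = 0` recovers the ordinary Gaussian kernel, so the skewness parameter is the only added degree of freedom in this sketch.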

CVJun 1, 2020
Multi-view Deep Features for Robust Facial Kinship Verification

Oualid Laiadi, Abdelmalik Ouamane, Abdelhamid Benakcha et al.

Automatic kinship verification from facial images is an emerging research topic in the machine learning community. In this paper, we propose an effective facial feature extraction model based on multi-view deep features. We use four pre-trained deep learning models with eight feature layers (the FC6 and FC7 layers of each of the VGG-F, VGG-M, VGG-S and VGG-Face models) to train the proposed Multilinear Side-Information based Discriminant Analysis integrating Within Class Covariance Normalization (MSIDA+WCCN) method. Furthermore, we show how integrating WCCN into metric learning methods improves on the Simple Scoring Cosine similarity (SSC) method. We note that we used the SSC method in the RFIW'20 competition with the concatenation of the eight deep features. The integration of WCCN into the metric learning methods reduces the effect of the intra-class variations introduced by the deep feature weights. We evaluate our proposed method on two kinship benchmarks, the KinFaceW-I and KinFaceW-II databases, using four parent-child relations (Father-Son, Father-Daughter, Mother-Son and Mother-Daughter). The proposed MSIDA+WCCN method improves on the SSC method by 12.80% and 14.65% on the KinFaceW-I and KinFaceW-II databases, respectively. The results compare favorably with those of some modern methods, including those that rely on deep learning.

CVJul 12, 2019
AVD: Adversarial Video Distillation

Mohammad Tavakolian, Mohammad Sabokrou, Abdenour Hadid

In this paper, we present a simple yet efficient approach for video representation, called Adversarial Video Distillation (AVD). The key idea is to represent videos by compressing them in the form of realistic images, which can be used in a variety of video-based scene analysis applications. Representing a video as a single image enables us to address the problem of video analysis by image analysis techniques. To this end, we exploit a 3D convolutional encoder-decoder network to encode the input video as an image by minimizing the reconstruction error. Furthermore, weak supervision by an adversarial training procedure is imposed on the output of the encoder to generate semantically realistic images. The encoder learns to extract semantically meaningful representations from a given input video by mapping the 3D input into a 2D latent representation. The obtained representation can be simply used as the input of deep models pre-trained on images for video classification. We evaluated the effectiveness of our proposed method for video-based activity recognition on three standard and challenging benchmark datasets, i.e. UCF101, HMDB51, and Kinetics. The experimental results demonstrate that AVD achieves interesting performance, outperforming the state-of-the-art methods for video classification.

CVJan 5, 2019
Forensic shoe-print identification: a brief survey

Imad Rida, Lunke Fei, Hugo Proença et al.

As an advanced research topic in forensic science, automatic shoe-print identification has been extensively studied in the last two decades, since shoe marks are the clues most frequently left at a crime scene. Hence, these impressions provide pertinent evidence for the proper progress of investigations in order to identify potential criminals. The main goal of this survey is to provide a cohesive overview of the research carried out in forensic shoe-print identification and its basic background. Apart from defining the problem and describing the phases that typically compose the processing chain of shoe-print identification, we provide a summary and comparison of the state-of-the-art approaches, in order to guide the neophyte and help to advance the research topic. This is done by introducing simple and basic taxonomies as well as summaries of the state-of-the-art performance. Lastly, we discuss the current open problems and challenges in this research topic and point out promising directions in this field.

CVJul 22, 2018
Deep Discriminative Model for Video Classification

Mohammad Tavakolian, Abdenour Hadid

This paper presents a new deep learning approach for video-based scene classification. We design a Heterogeneous Deep Discriminative Model (HDDM) whose parameters are initialized by performing an unsupervised pre-training in a layer-wise fashion using Gaussian Restricted Boltzmann Machines (GRBM). In order to avoid the redundancy of adjacent frames, we extract spatiotemporal variation patterns within frames and represent them sparsely using Sparse Cubic Symmetrical Pattern (SCSP). Then, a pre-initialized HDDM is separately trained using the videos of each class to learn class-specific models. According to the minimum reconstruction error from the learnt class-specific models, a weighted voting strategy is employed for the classification. The performance of the proposed method is extensively evaluated on two action recognition datasets (UCF101 and Hollywood II) and three dynamic texture and dynamic scene datasets (DynTex, YUPENN, and Maryland). The experimental results and comparisons against state-of-the-art methods demonstrate that the proposed method consistently achieves superior performance on all datasets.

CVJun 18, 2018
Deep Spatiotemporal Representation of the Face for Automatic Pain Intensity Estimation

Mohammad Tavakolian, Abdenour Hadid

Automatic pain intensity assessment has a high value in disease diagnosis applications. Inspired by the fact that many diseases and brain disorders can interrupt normal facial expression formation, we aim to develop a computational model for automatic pain intensity assessment from spontaneous and micro facial variations. For this purpose, we propose a 3D deep architecture for dynamic facial video representation. The proposed model is built by stacking several convolutional modules where each module encompasses a 3D convolution kernel with a fixed temporal depth, several parallel 3D convolutional kernels with different temporal depths, and an average pooling layer. Deploying variable temporal depths in the proposed architecture allows the model to effectively capture a wide range of spatiotemporal variations on the faces. Extensive experiments on the UNBC-McMaster Shoulder Pain Expression Archive database show that our proposed model yields promising performance compared to the state-of-the-art in automatic pain intensity estimation.

CVAug 14, 2017
Kinship Verification from Videos using Spatio-Temporal Texture Features and Deep Learning

Elhocine Boutellaa, Miguel Bordallo López, Samy Ait-Aoudia et al.

Automatic kinship verification using facial images is a relatively new and challenging research problem in computer vision. It consists of automatically predicting whether two persons have a biological kin relation by examining their facial attributes. While most of the existing works extract shallow handcrafted features from still face images, we approach this problem from a spatio-temporal point of view and explore the use of both shallow texture features and deep features for characterizing faces. Promising results, especially those of deep features, are obtained on the benchmark UvA-NEMO Smile database. Our extensive experiments also show the superiority of using videos over still images, hence pointing out the important role of facial dynamics in kinship verification. Furthermore, the fusion of the two types of features (i.e. shallow spatio-temporal texture features and deep features) shows significant performance improvements compared to state-of-the-art methods.

CVJan 31, 2016
Unsupervised Deep Hashing for Large-scale Visual Search

Zhaoqiang Xia, Xiaoyi Feng, Jinye Peng et al.

Learning-based hashing plays a pivotal role in large-scale visual search. However, most existing hashing algorithms tend to learn shallow models that do not seek representative binary codes. In this paper, we propose a novel hashing approach based on unsupervised deep learning to hierarchically transform features into hash codes. Within the heterogeneous deep hashing framework, the autoencoder layers with specific constraints are considered to model the nonlinear mapping between features and binary codes. Then, a Restricted Boltzmann Machine (RBM) layer with constraints is utilized to reduce the dimension in the Hamming space. Extensive experiments on the problem of visual search demonstrate the competitiveness of our proposed approach compared to the state-of-the-art.
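For readers unfamiliar with how binary hash codes are used at search time, the sketch below ranks database items by Hamming distance to a query code. It does not reproduce the paper's learned mapping: `binarize` simply thresholds hypothetical feature values at zero, standing in for the autoencoder and RBM layers, and all names and numbers are illustrative.

```python
def binarize(features):
    """Threshold each feature at zero and pack the resulting bits into an int."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(code_a, code_b):
    """Number of bit positions in which two codes differ."""
    return bin(code_a ^ code_b).count("1")


def hamming_search(query_code, database_codes, k=1):
    """Return the names of the k database items closest in Hamming space."""
    ranked = sorted(
        database_codes.items(),
        key=lambda pair: hamming_distance(query_code, pair[1]),
    )
    return [name for name, _ in ranked[:k]]


# Toy 4-dimensional features; real systems use learned, longer codes.
database_codes = {
    "cat": binarize([0.4, -0.2, 1.5, -0.3]),
    "car": binarize([-1.0, 0.7, -0.2, 0.9]),
    "dog": binarize([0.3, -0.1, 0.8, 0.2]),
}
query_code = binarize([0.6, -0.4, 1.1, -0.9])

print(hamming_search(query_code, database_codes, k=2))
```

Because the comparison is an XOR followed by a bit count, search over compact codes is much cheaper than comparing real-valued features, which is the practical motivation for hashing.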

CVJan 8, 2016
Facial age estimation using BSIF and LBP

Salah Eddine Bekhouche, Abdelkrim Ouafi, Abdelmalik Taleb-Ahmed et al.

Human face aging is an irreversible process causing changes in human face characteristics such as hair whitening, muscle drop and wrinkles. Due to the importance of human face aging in biometric systems, age estimation has become an attractive area for researchers. This paper presents a novel method to estimate age from face images, using binarized statistical image features (BSIF) and local binary pattern (LBP) histograms as features, with support vector regression (SVR) and kernel ridge regression (KRR) as regressors. We applied our method to the FG-NET and PAL datasets. Our proposed method outperforms state-of-the-art methods when using the whole PAL database.

CVNov 19, 2015
Face anti-spoofing based on color texture analysis

Zinelabidine Boulkenafet, Jukka Komulainen, Abdenour Hadid

Research on face spoofing detection has mainly been focused on analyzing the luminance of the face images, hence discarding the chrominance information which can be useful for discriminating fake faces from genuine ones. In this work, we propose a new face anti-spoofing method based on color texture analysis. We analyze the joint color-texture information from the luminance and the chrominance channels using a color local binary pattern descriptor. More specifically, the feature histograms are extracted from each image band separately. Extensive experiments on two benchmark datasets, namely the CASIA face anti-spoofing and Replay-Attack databases, showed excellent results compared to the state-of-the-art. Most importantly, our inter-database evaluation shows that the proposed approach has very promising generalization capabilities.
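The per-band histogram extraction described above can be sketched as follows. Several details are assumptions rather than facts from the abstract: the YCbCr color space (the abstract only says luminance and chrominance), the basic 8-neighbour 3x3 LBP operator, and all function names. The image is represented as nested lists of RGB tuples to keep the example dependency-free.

```python
# Clockwise 8-neighbourhood, starting at the top-left pixel.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion (the variant used by JPEG)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def lbp_histogram(channel):
    """Normalized 256-bin histogram of basic LBP codes over interior pixels."""
    height, width = len(channel), len(channel[0])
    hist = [0.0] * 256
    count = 0
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            centre = channel[row][col]
            code = 0
            for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
                if channel[row + dr][col + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1.0
            count += 1
    return [h / count for h in hist]


def color_lbp_descriptor(image):
    """Concatenate one LBP histogram per band (Y, Cb, Cr); needs >= 3x3 pixels."""
    converted = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in image]
    descriptor = []
    for band in range(3):
        channel = [[pixel[band] for pixel in row] for row in converted]
        descriptor.extend(lbp_histogram(channel))
    return descriptor


# A flat 4x4 image: every neighbour equals the centre, so every code is 255.
flat_image = [[(10, 20, 30)] * 4 for _ in range(4)]
descriptor = color_lbp_descriptor(flat_image)
print(len(descriptor))
```

The resulting 768-dimensional vector would then be fed to a classifier that separates genuine faces from spoofed ones; that stage is omitted here.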