Mohammed E. Fouda

AR
h-index39
25papers
196citations
Novelty39%
AI Score51

25 Papers

LGAug 11, 2022
Mixed-Precision Neural Networks: A Survey

Mariam Rakka, Mohammed E. Fouda, Pramod Khargonekar et al.

Mixed-precision Deep Neural Networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when the resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easily found, especially with the abundance of models, datasets, and quantization techniques that creates an enormous search space. In order to tackle this difficulty, a body of literature has emerged recently, and several frameworks that achieved promising accuracy results have been proposed. In this paper, we start by summarizing the quantization techniques used generally in literature. Then, we present a thorough survey of the mixed-precision frameworks, categorized according to their optimization techniques such as reinforcement learning and quantization techniques like deterministic rounding. Furthermore, the advantages and shortcomings of each framework are discussed, where we present a juxtaposition. We finally give guidelines for future mixed-precision frameworks.

ARMar 4Code
Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators

Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil et al.

Software-hardware co-design is essential for optimizing in-memory computing (IMC) hardware accelerators for neural networks. However, most existing optimization frameworks target a single workload, leading to highly specialized hardware designs that do not generalize well across models and applications. In contrast, practical deployment scenarios require a single IMC platform that can efficiently support multiple neural network workloads. This work presents a joint hardware-workload co-optimization framework based on an optimized evolutionary algorithm for designing generalized IMC accelerator architectures. By explicitly capturing cross-workload trade-offs rather than optimizing for a single model, the proposed approach significantly reduces the performance gap between workload-specific and generalized IMC designs. The framework is evaluated on both RRAM- and SRAM-based IMC architectures, demonstrating strong robustness and adaptability across diverse design scenarios. Compared to baseline methods, the optimized designs achieve energy-delay-area product (EDAP) reductions of up to 76.2% and 95.5% when optimizing across a small set (4 workloads) and a large set (9 workloads), respectively. The source code of the framework is available at https://github.com/OlgaKrestinskaya/JointHardwareWorkloadOptimizationIMC.

ARDec 28, 2022
Thermal Heating in ReRAM Crossbar Arrays: Challenges and Solutions

Kamilya Smagulova, Mohammed E. Fouda, Ahmed Eltawil

The higher speed, scalability and parallelism offered by ReRAM crossbar arrays foster development of ReRAM-based next generation AI accelerators. At the same time, sensitivity of ReRAM to temperature variations decreases R_on/Roff ratio and negatively affects the achieved accuracy and reliability of the hardware. Various works on temperature-aware optimization and remapping in ReRAM crossbar arrays reported up to 58\% improvement in accuracy and 2.39$\times$ ReRAM lifetime enhancement. This paper classifies the challenges caused by thermal heat, starting from constraints in ReRAM cells' dimensions and characteristics to their placement in the architecture. In addition, it reviews available solutions designed to mitigate the impact of these challenges, including emerging temperature-resilient DNN training methods. Our work also provides a summary of the techniques and their advantages and limitations.

CVAug 9, 2024
A Recurrent YOLOv8-based framework for Event-Based Object Detection

Diego A. Silva, Kamilya Smagulova, Ahmed Elsheikh et al.

Object detection is crucial in various cutting-edge applications, such as autonomous vehicles and advanced robotics systems, primarily relying on data from conventional frame-based RGB sensors. However, these sensors often struggle with issues like motion blur and poor performance in challenging lighting conditions. In response to these challenges, event-based cameras have emerged as an innovative paradigm. These cameras, mimicking the human eye, demonstrate superior performance in environments with fast motion and extreme lighting conditions while consuming less power. This study introduces ReYOLOv8, an advanced object detection framework that enhances a leading frame-based detection system with spatiotemporal modeling capabilities. We implemented a low-latency, memory-efficient method for encoding event data to boost the system's performance. We also developed a novel data augmentation technique tailored to leverage the unique attributes of event data, thus improving detection accuracy. Our models outperformed all comparable approaches in the GEN1 dataset, focusing on automotive applications, achieving mean Average Precision (mAP) improvements of 5%, 2.8%, and 2.5% across nano, small, and medium scales, respectively.These enhancements were achieved while reducing the number of trainable parameters by an average of 4.43% and maintaining real-time processing speeds between 9.2ms and 15.5ms. On the PEDRo dataset, which targets robotics applications, our models showed mAP improvements ranging from 9% to 18%, with 14.5x and 3.8x smaller models and an average speed enhancement of 1.67x.

60.7ARApr 13
EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

Jinane Bazzi, Mariam Rakka, Fadi Kurdahi et al.

The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) reveal that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. These results establish EdgeCIM as a compelling solution towards real-time, energy-efficient edge-scale SLM inference.

ARApr 12, 2022
DT2CAM: A Decision Tree to Content Addressable Memory Framework

Mariam Rakka, Mohammed E. Fouda, Rouwaida Kanj et al.

Decision trees are considered one of the most powerful tools for data classification. Accelerating the decision tree search is crucial for on-the-edge applications that have limited power and latency budget. In this paper, we propose a Content Addressable Memory (CAM) Compiler for Decision Tree (DT) inference acceleration. We propose a novel "adaptive-precision" scheme that results in a compact implementation and enables an efficient bijective mapping to Ternary Content Addressable Memories while maintaining high inference accuracies. In addition, a Resistive-CAM (ReCAM) functional synthesizer is developed for mapping the decision tree to the ReCAM and performing functional simulations for energy, latency, and accuracy evaluations. We study the decision tree accuracy under hardware non-idealities including device defects, manufacturing variability, and input encoding noise. We test our framework on various DT datasets including \textit{Give Me Some Credit}, \textit{Titanic}, and \textit{COVID-19}. Our results reveal up to {42.4\%} energy savings and up to 17.8x better energy-delay-area product compared to the state-of-art hardware accelerators, and up to 333 million decisions per sec for the pipelined implementation.

CRSep 20, 2023
AudioFool: Fast, Universal and synchronization-free Cross-Domain Attack on Speech Recognition

Mohamad Fakih, Rouwaida Kanj, Fadi Kurdahi et al.

Automatic Speech Recognition systems have been shown to be vulnerable to adversarial attacks that manipulate the command executed on the device. Recent research has focused on exploring methods to create such attacks, however, some issues relating to Over-The-Air (OTA) attacks have not been properly addressed. In our work, we examine the needed properties of robust attacks compatible with the OTA model, and we design a method of generating attacks with arbitrary such desired properties, namely the invariance to synchronization, and the robustness to filtering: this allows a Denial-of-Service (DoS) attack against ASR systems. We achieve these characteristics by constructing attacks in a modified frequency domain through an inverse Fourier transform. We evaluate our method on standard keyword classification tasks and analyze it in OTA, and we analyze the properties of the cross-domain attacks to explain the efficiency of the approach.

AISep 24, 2024
A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran et al.

Large Language Models (LLMs) have shown promise in various domains, including healthcare, with significant potential to transform mental health applications by enabling scalable and accessible solutions. This study aims to provide a comprehensive evaluation of 33 LLMs, ranging from 2 billion to 405+ billion parameters, in performing key mental health tasks using social media data across six datasets. To our knowledge, this represents the largest-scale systematic evaluation of modern LLMs for mental health applications. Models such as GPT-4, Llama 3, Claude, Gemma, Gemini, and Phi-3 were assessed for their zero-shot (ZS) and few-shot (FS) capabilities across three tasks: binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, achieving accuracies up to 85% on certain datasets, while FS learning notably enhanced disorder severity evaluations, reducing the Mean Absolute Error (MAE) by 1.3 points for the Phi-3-mini model. Recent models, such as Llama 3.1 405b, demonstrated exceptional psychiatric knowledge assessment accuracy at 91.2%, while prompt engineering played a crucial role in improving performance across tasks. However, the ethical constraints imposed by many LLM providers limit their ability to respond to sensitive queries, hampering comprehensive performance evaluations. This work highlights both the capabilities and limitations of LLMs in mental health contexts, offering valuable insights for future applications in psychiatry.

AISep 30, 2025Code
CIMNAS: A Joint Framework for Compute-In-Memory-Aware Neural Architecture Search

Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil et al.

To maximize hardware efficiency and performance accuracy in Compute-In-Memory (CIM)-based neural network accelerators for Artificial Intelligence (AI) applications, co-optimizing both software and hardware design parameters is essential. Manual tuning is impractical due to the vast number of parameters and their complex interdependencies. To effectively automate the design and optimization of CIM-based neural network accelerators, hardware-aware neural architecture search (HW-NAS) techniques can be applied. This work introduces CIMNAS, a joint model-quantization-hardware optimization framework for CIM architectures. CIMNAS simultaneously searches across software parameters, quantization policies, and a broad range of hardware parameters, incorporating device-, circuit-, and architecture-level co-optimizations. CIMNAS experiments were conducted over a search space of 9.9x10^85 potential parameter combinations with the MobileNet model as a baseline and RRAM-based CIM architecture. Evaluated on the ImageNet dataset, CIMNAS achieved a reduction in energy-delay-area product (EDAP) ranging from 90.1x to 104.5x, an improvement in TOPS/W between 4.68x and 4.82x, and an enhancement in TOPS/mm^2 from 11.3x to 12.78x relative to various baselines, all while maintaining an accuracy of 73.81%. The adaptability and robustness of CIMNAS are demonstrated by extending the framework to support the SRAM-based ResNet50 architecture, achieving up to an 819.5x reduction in EDAP. Unlike other state-of-the-art methods, CIMNAS achieves EDAP-focused optimization without any accuracy loss, generating diverse software-hardware parameter combinations for high-performance CIM-based neural network designs. The source code of CIMNAS is available at https://github.com/OlgaKrestinskaya/CIMNAS.

CLSep 7, 2025Code
Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry

Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy et al.

Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBiloLLM, MedGemma) using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.

CLDec 9, 2024
Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance

Abdelrahman A. Ali, Aya E. Fouda, Radwa J. Hanafy et al.

Mental health disorders are increasingly prevalent worldwide, creating an urgent need for innovative tools to support early diagnosis and intervention. This study explores the potential of Large Language Models (LLMs) in multimodal mental health diagnostics, specifically for detecting depression and Post Traumatic Stress Disorder through text and audio modalities. Using the E-DAIC dataset, we compare text and audio modalities to investigate whether LLMs can perform equally well or better with audio inputs. We further examine the integration of both modalities to determine if this can enhance diagnostic accuracy, which generally results in improved performance metrics. Our analysis specifically utilizes custom-formulated metrics; Modal Superiority Score and Disagreement Resolvement Score to evaluate how combined modalities influence model performance. The Gemini 1.5 Pro model achieves the highest scores in binary depression classification when using the combined modality, with an F1 score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full dataset. These results represent an increase of 3.1% over its performance with the text modality and 2.7% over the audio modality, highlighting the effectiveness of integrating modalities to enhance diagnostic accuracy. Notably, all results are obtained in zero-shot inferring, highlighting the robustness of the models without requiring task-specific fine-tuning. To explore the impact of different configurations on model performance, we conduct binary, severity, and multiclass tasks using both zero-shot and few-shot prompts, examining the effects of prompt variations on performance. The results reveal that models such as Gemini 1.5 Pro in text and audio modalities, and GPT-4o mini in the text modality, often surpass other models in balanced accuracy and F1 scores across multiple tasks.

CLJan 12, 2025
A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context

Noureldin Zahran, Aya E. Fouda, Radwa J. Hanafy et al.

Mental health disorders pose a growing public health concern in the Arab world, emphasizing the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates 8 LLMs, including general multi-lingual models, as well as bi-lingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores mainly due to reduced instruction following, with our structured prompt outperforming a less structured variant on multi-class datasets, with an average difference of 14.5\%. While language influence on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo showed superior performance in mean absolute error for severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains observed for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.

AROct 22, 2024
Towards Efficient IMC Accelerator Design Through Joint Hardware-Workload Co-optimization

Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil et al.

Designing generalized in-memory computing (IMC) hardware that efficiently supports a variety of workloads requires extensive design space exploration, which is infeasible to perform manually. Optimizing hardware individually for each workload or solely for the largest workload often fails to yield the most efficient generalized solutions. To address this, we propose a joint hardware-workload optimization framework that identifies optimised IMC chip architecture parameters, enabling more efficient, workload-flexible hardware. We show that joint optimization achieves 36%, 36%, 20%, and 69% better energy-latency-area scores for VGG16, ResNet18, AlexNet, and MobileNetV3, respectively, compared to the separate architecture parameters search optimizing for a single largest workload. Additionally, we quantify the performance trade-offs and losses of the resulting generalized IMC hardware compared to workload-specific IMC designs.

AIDec 5, 2024
Automated Multi-Label Annotation for Mental Health Illnesses Using Large Language Models

Abdelrahaman A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda

The growing prevalence and complexity of mental health disorders present significant challenges for accurate diagnosis and treatment, particularly in understanding the interplay between co-occurring conditions. Mental health disorders, such as depression and Anxiety, often co-occur, yet current datasets derived from social media posts typically focus on single-disorder labels, limiting their utility in comprehensive diagnostic analyses. This paper addresses this critical gap by proposing a novel methodology for cleaning, sampling, labeling, and combining data to create versatile multi-label datasets. Our approach introduces a synthetic labeling technique to transform single-label datasets into multi-label annotations, capturing the complexity of overlapping mental health conditions. To achieve this, two single-label datasets are first merged into a foundational multi-label dataset, enabling realistic analyses of co-occurring diagnoses. We then design and evaluate various prompting strategies for large language models (LLMs), ranging from single-label predictions to unrestricted prompts capable of detecting any present disorders. After rigorously assessing multiple LLMs and prompt configurations, the optimal combinations are identified and applied to label six additional single-disorder datasets from RMHD. The result is SPAADE-DR, a robust, multi-label dataset encompassing diverse mental health conditions. This research demonstrates the transformative potential of LLM-driven synthetic labeling in advancing mental health diagnostics from social media data, paving the way for more nuanced, data-driven insights into mental health care.

ARNov 26, 2024
SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors

Mariam Rakka, Jinhao Li, Guohao Dai et al.

Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.

ARNov 3, 2024
BF-IMNA: A Bit Fluid In-Memory Neural Architecture for Neural Network Acceleration

Mariam Rakka, Rachid Karami, Ahmed M. Eltawil et al.

Mixed-precision quantization works Neural Networks (NNs) are gaining traction for their efficient realization on the hardware leading to higher throughput and lower energy. In-Memory Computing (IMC) accelerator architectures are offered as alternatives to traditional architectures relying on a data-centric computational paradigm, diminishing the memory wall problem, and scoring high throughput and energy efficiency. These accelerators can support static fixed-precision but are not flexible to support mixed-precision NNs. In this paper, we present BF-IMNA, a bit fluid IMC accelerator for end-to-end Convolutional NN (CNN) inference that is capable of static and dynamic mixed-precision without any hardware reconfiguration overhead at run-time. At the heart of BF-IMNA are Associative Processors (APs), which are bit-serial word-parallel Single Instruction, Multiple Data (SIMD)-like engines. We report the performance of end-to-end inference of ImageNet on AlexNet, VGG16, and ResNet50 on BF-IMNA for different technologies (eNVM and NVM), mixed-precision configurations, and supply voltages. To demonstrate bit fluidity, we implement HAWQ-V3's per-layer mixed-precision configurations for ResNet18 on BF-IMNA using different latency budgets, and results reveal a trade-off between accuracy and Energy-Delay Product (EDP): On one hand, mixed-precision with a high latency constraint achieves the closest accuracy to fixed-precision INT8 and reports a high (worse) EDP compared to fixed-precision INT4. On the other hand, with a low latency constraint, BF-IMNA reports the closest EDP to fixed-precision INT4, with a higher degradation in accuracy compared to fixed-precision INT8. We also show that BF-IMNA with fixed-precision configuration still delivers performance that is comparable to current state-of-the-art accelerators: BF-IMNA achieves $20\%$ higher energy efficiency and $2\%$ higher throughput.

CRJul 4, 2025
On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani et al.

The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline), and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. While attacks readily achieve high success (>80% Attack Success Rate, ASR) on FP16 models, within an attack budget of 25 perturbations, FP8 and INT8 models exhibit ASRs below 20% and 50%, respectively. Increasing the perturbation budget up to 150 bit-flips, FP8 models maintained ASR below 65%, demonstrating some resilience compared to INT8 and INT4 models that have high ASR. In addition, analysis of perturbation locations revealed differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Besides, jailbreaks induced in FP16 models were highly transferable to subsequent FP8/INT8 quantization (<5% ASR difference), though INT4 significantly reduced transferred ASR (avg. 35% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter manipulation jailbreaks, vulnerabilities can still persist, especially through post-attack quantization.

ASApr 2, 2025
Leveraging Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment

Abdelrahaman A. Hassan, Abdelrahman A. Ali, Aya E. Fouda et al.

The increasing global prevalence of mental disorders, such as depression and PTSD, requires objective and scalable diagnostic tools. Traditional clinical assessments often face limitations in accessibility, objectivity, and consistency. This paper investigates the potential of multimodal machine learning to address these challenges, leveraging the complementary information available in text, audio, and video data. Our approach involves a comprehensive analysis of various data preprocessing techniques, including novel chunking and utterance-based formatting strategies. We systematically evaluate a range of state-of-the-art embedding models for each modality and employ Convolutional Neural Networks (CNNs) and Bidirectional LSTM Networks (BiLSTMs) for feature extraction. We explore data-level, feature-level, and decision-level fusion techniques, including a novel integration of Large Language Model (LLM) predictions. We also investigate the impact of replacing Multilayer Perceptron classifiers with Support Vector Machines. We extend our analysis to severity prediction using PHQ-8 and PCL-C scores and multi-class classification (considering co-occurring conditions). Our results demonstrate that utterance-based chunking significantly improves performance, particularly for text and audio modalities. Decision-level fusion, incorporating LLM predictions, achieves the highest accuracy, with a balanced accuracy of 94.8% for depression and 96.2% for PTSD detection. The combination of CNN-BiLSTM architectures with utterance-level chunking, coupled with the integration of external LLM, provides a powerful and nuanced approach to the detection and assessment of mental health conditions. Our findings highlight the potential of MMML for developing more accurate, accessible, and personalized mental healthcare tools.

CVDec 27, 2024
Chimera: A Block-Based Neural Architecture Search Framework for Event-Based Object Detection

Diego A. Silva, Ahmed Elsheikh, Kamilya Smagulova et al.

Event-based cameras are sensors that simulate the human eye, offering advantages such as high-speed robustness and low power consumption. Established Deep Learning techniques have shown effectiveness in processing event data. Chimera is a Block-Based Neural Architecture Search (NAS) framework specifically designed for Event-Based Object Detection, aiming to create a systematic approach for adapting RGB-domain processing methods to the event domain. The Chimera design space is constructed from various macroblocks, including Attention blocks, Convolutions, State Space Models, and MLP-mixer-based architectures, which provide a valuable trade-off between local and global processing capabilities, as well as varying levels of complexity. The results on the PErson Detection in Robotics (PEDRo) dataset demonstrated performance levels comparable to leading state-of-the-art models, alongside an average parameter reduction of 1.6 times.

LGOct 19, 2025
Mixed-Precision Quantization for Language Models: Techniques and Prospects

Mariam Rakka, Marios Fournarakis, Olga Krestinskaya et al.

The rapid scaling of language models (LMs) has resulted in unprecedented computational, memory, and energy requirements, making their training and deployment increasingly unsustainable. Quantization has emerged as an essential compression technique to reduce model size, alleviate memory bottlenecks, and accelerate inference. However, while uniform low-bit quantization (e.g., INT8, INT4) provides significant efficiency gains, it can degrade accuracy in sensitive components of transformer-based LMs. Mixed-precision quantization offers a promising alternative by selectively allocating precision across layers or within tensors to balance efficiency and accuracy. This survey provides a comprehensive overview of Mixed-Precision quantization frameworks for LMs (MXPLMs). We first review quantization fundamentals, including uniform and non-uniform quantizers, quantization granularity, and methods widely used in post-training quantization. We then categorize and compare recent MXPLM frameworks according to their bit allocation strategies and precision configurations across weights, activations, and key-value caches. A comparative analysis highlights differences in perplexity, zero-shot task performance, and deployment trade-offs. Furthermore, we contrast MXPLMs with earlier mixed-precision quantization methods for deep neural networks, identifying strategies that transfer and those that face challenges in the LM setting. Finally, we summarize open issues and future directions, including hardware-aware design, activation quantization, and scalable optimization methods for billion-parameter models. By consolidating recent advances, this work serves as a reference for understanding the current landscape and research prospects of mixed-precision quantization for large-scale language models.

LGApr 16, 2024
Network architecture search of X-ray based scientific applications

Adarsha Balaji, Ramyad Hadidi, Gregory Kollmer et al.

X-ray and electron diffraction-based microscopy use bragg peak detection and ptychography to perform 3-D imaging at an atomic resolution. Typically, these techniques are implemented using computationally complex tasks such as a Psuedo-Voigt function or solving a complex inverse problem. Recently, the use of deep neural networks has improved the existing state-of-the-art approaches. However, the design and development of the neural network models depends on time and labor intensive tuning of the model by application experts. To that end, we propose a hyperparameter (HPS) and neural architecture search (NAS) approach to automate the design and optimization of the neural network models for model size, energy consumption and throughput. We demonstrate the improved performance of the auto-tuned models when compared to the manually tuned BraggNN and PtychoNN benchmark. We study and demonstrate the importance of the exploring the search space of tunable hyperparameters in enhancing the performance of bragg peak detection and ptychographic reconstruction. Our NAS and HPS of (1) BraggNN achieves a 31.03\% improvement in bragg peak detection accuracy with a 87.57\% reduction in model size, and (2) PtychoNN achieves a 16.77\% improvement in model accuracy and a 12.82\% reduction in model size when compared to the baseline PtychoNN model. When inferred on the Orin-AGX platform, the optimized Braggnn and Ptychonn models demonstrate a 10.51\% and 9.47\% reduction in inference latency and a 44.18\% and 15.34\% reduction in energy consumption when compared to their respective baselines, when inferred in the Orin-AGX edge platform.

NEDec 13, 2021
Efficient Training of Spiking Neural Networks with Temporally-Truncated Local Backpropagation through Time

Wenzhe Guo, Mohammed E. Fouda, Ahmed M. Eltawil et al.

Directly training spiking neural networks (SNNs) has remained challenging due to complex neural dynamics and intrinsic non-differentiability in firing functions. The well-known backpropagation through time (BPTT) algorithm proposed to train SNNs suffers from large memory footprint and prohibits backward and update unlocking, making it impossible to exploit the potential of locally-supervised training methods. This work proposes an efficient and direct training algorithm for SNNs that integrates a locally-supervised training method with a temporally-truncated BPTT algorithm. The proposed algorithm explores both temporal and spatial locality in BPTT and contributes to significant reduction in computational cost including GPU memory utilization, main memory access and arithmetic operations. We thoroughly explore the design space concerning temporal truncation length and local training block size and benchmark their impact on classification accuracy of different networks running different types of tasks. The results reveal that temporal truncation has a negative effect on the accuracy of classifying frame-based datasets, but leads to improvement in accuracy on dynamic-vision-sensor (DVS) recorded datasets. In spite of resulting information loss, local training is capable of alleviating overfitting. The combined effect of temporal truncation and local training can lead to the slowdown of accuracy drop and even improvement in accuracy. In addition, training deep SNNs models such as AlexNet classifying CIFAR10-DVS dataset leads to 7.26% increase in accuracy, 89.94% reduction in GPU memory, 10.79% reduction in memory access, and 99.64% reduction in MAC operations compared to the standard end-to-end BPTT.

CROct 21, 2021
CAPTIVE: Constrained Adversarial Perturbations to Thwart IC Reverse Engineering

Amir Hosein Afandizadeh Zargari, Marzieh AshrafiAmiri, Minjun Seo et al.

Reverse engineering (RE) in Integrated Circuits (IC) is a process in which one will attempt to extract the internals of an IC, extract the circuit structure, and determine the gate-level information of an IC. In general, RE process can be done for validation as well as intellectual property (IP) stealing intentions. In addition, RE also facilitates different illicit activities such as insertion of hardware Trojan, pirate, or counterfeit a design, or develop an attack. In this work, we propose an approach to introduce cognitive perturbations, with the aid of adversarial machine learning, to the IC layout that could prevent the RE process from succeeding. We first construct a layer-by-layer image dataset of 45nm predictive technology. With this dataset, we propose a conventional neural network model called RecoG-Net to recognize the logic gates, which is the first step in RE. RecoG-Net is successfully to recognize the gates with more than 99.7% accuracy. Our thwarting approach utilizes the concept of the adversarial attack generation algorithms to generate perturbation. Unlike traditional adversarial attacks in machine learning, the perturbation generation needs to be highly constrained to meet the fab rules such as Design Rule Checking (DRC) Layout vs. Schematic (LVS) checks. Hence, we propose CAPTIVE as an constrained perturbation generation satisfying the DRC. The experiments shows that the accuracy of reverse engineering using machine learning techniques can decrease from 100% to approximately 30% based on the adversary generator.

ARSep 8, 2021
Resistive Neural Hardware Accelerators

Kamilya Smagulova, Mohammed E. Fouda, Fadi Kurdahi et al.

Deep Neural Networks (DNNs), as a subset of Machine Learning (ML) techniques, entail that real-world data can be learned and that decisions can be made in real-time. However, their wide adoption is hindered by a number of software and hardware limitations. The existing general-purpose hardware platforms used to accelerate DNNs are facing new challenges associated with the growing amount of data and are exponentially increasing the complexity of computations. An emerging non-volatile memory (NVM) devices and processing-in-memory (PIM) paradigm is creating a new hardware architecture generation with increased computing and storage capabilities. In particular, the shift towards ReRAM-based in-memory computing has great potential in the implementation of area and power efficient inference and in training large-scale neural network architectures. These can accelerate the process of the IoT-enabled AI technologies entering our daily life. In this survey, we review the state-of-the-art ReRAM-based DNN many-core accelerators, and their superiority compared to CMOS counterparts was shown. The review covers different aspects of hardware and software realization of DNN accelerators, their present limitations, and future prospectives. In particular, comparison of the accelerators shows the need for the introduction of new performance metrics and benchmarking standards. In addition, the major concerns regarding the efficient design of accelerators include a lack of accuracy in simulation tools for software and hardware co-design.

NENov 21, 2020
On-Chip Error-triggered Learning of Multi-layer Memristive Spiking Neural Networks

Melika Payvand, Mohammed E. Fouda, Fadi Kurdahi et al.

Recent breakthroughs in neuromorphic computing show that local forms of gradient descent learning are compatible with Spiking Neural Networks (SNNs) and synaptic plasticity. Although SNNs can be scalably implemented using neuromorphic VLSI, an architecture that can learn using gradient-descent in situ is still missing. In this paper, we propose a local, gradient-based, error-triggered learning algorithm with online ternary weight updates. The proposed algorithm enables online training of multi-layer SNNs with memristive neuromorphic hardware showing a small loss in the performance compared with the state of the art. We also propose a hardware architecture based on memristive crossbar arrays to perform the required vector-matrix multiplications. The necessary peripheral circuitry including pre-synaptic, post-synaptic and write circuits required for online training, have been designed in the sub-threshold regime for power saving with a standard 180 nm CMOS process.