Jason D. Bakos

h-index13

5papers

9citations

Novelty51%

AI Score39

Ranked #80,281 of 194,257 authors (top 41%)#294 in AR (top 46%)

5 Papers

6.3CRMay 9

Hardware-Accelerated Line-Rate Bitstream Screening for Secure FPGA Reconfiguration

Rye Stahle-Smith, Carter Antley, Jason D. Bakos et al.

As Field-Programmable Gate Arrays (FPGAs) scale in multi-tenant cloud and edge-AI environments, the configuration bitstream has become a critical, yet opaque, security boundary. Existing hardware Trojan detection methods often rely on trusted design artifacts or computationally intensive reverse-engineering, introducing prohibitive latencies in dynamic, "just-in-time" reconfiguration workflows. This paper presents BLADEI (Bitstream-Level Abnormality Detection for Embedded Inference), a bitstream-level security framework designed for deployment-time screening of FPGA configurations without requiring source code, netlists, or vendor-specific tooling. BLADEI introduces a hybrid architecture that combines multi-scale byte-sequence learning with compact statistical representations to detect anomalous configurations directly from raw bitstreams. We implement the framework on a Xilinx PYNQ-Z1 system, demonstrating an end-to-end cloud-to-edge pipeline that enforces security prior to FPGA configuration. Evaluating across 1,383 bitstreams, BLADEI achieves a macro F1-score of 0.91. However, our systems-level characterization reveals a "preprocessing wall": software-based feature extraction accounts for 92% of the total 16.4-second latency, while model inference requires only 1.4 seconds. To address this bottleneck, we propose a streaming hardware-accelerated feature extraction engine designed for the FPGA programmable logic (PL). The evaluation shows that PL-based streaming engine can reduce feature-extraction latency to the millisecond range. This work positions bitstream-level screening as a first-class primitive and demonstrates that hardware-accelerated preprocessing is the key enabler for securing next-generation reconfigurable custom computing machines at line rate.

2.3ARSep 21, 2024

FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs

Ehsan Kabir, Md. Arafat Kabir, Austin R. J. Downey et al.

Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are limited hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes \textit{FAMOUS}, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing Ultrascale+ FPGAs. Experimental results are presented that show that it can attain a maximum throughput, number of parallel attention heads, embedding dimension and tile size of 328 (giga operations/second (GOPS)), 8, 768 and 64 respectively on the U55C. Furthermore, it is 3.28$\times$ and 2.6$\times$ faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU respectively. It is also 1.3$\times$ faster than the fastest state-of-the-art FPGA-based accelerator.

1.2ARSep 21, 2024

ProTEA: Programmable Transformer Encoder Acceleration on FPGA

Ehsan Kabir, Jason D. Bakos, David Andrews et al.

Transformer neural networks (TNN) have been widely utilized on a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV). Their widespread adoption has been primarily driven by the exceptional performance of their multi-head self-attention block used to extract key features from sequential data. The multi-head self-attention block is followed by feedforward neural networks, which play a crucial role in introducing non-linearity to assist the model in learning complex patterns. Despite the popularity of TNNs, there has been limited numbers of hardware accelerators targeting these two critical blocks. Most prior works have concentrated on sparse architectures that are not flexible for popular TNN variants. This paper introduces \textit{ProTEA}, a runtime programmable accelerator tailored for the dense computations of most of state-of-the-art transformer encoders. \textit{ProTEA} is designed to reduce latency by maximizing parallelism. We introduce an efficient tiling of large matrices that can distribute memory and computing resources across different hardware components within the FPGA. We provide run time evaluations of \textit{ProTEA} on a Xilinx Alveo U55C high-performance data center accelerator card. Experimental results demonstrate that \textit{ProTEA} can host a wide range of popular transformer networks and achieve near optimal performance with a tile size of 64 in the multi-head self-attention block and 6 in the feedforward networks block when configured with 8 parallel attention heads, 12 layers, and an embedding dimension of 768 on the U55C. Comparative results are provided showing \textit{ProTEA} is 2.5$\times$ faster than an NVIDIA Titan XP GPU. Results also show that it achieves 1.3 -- 2.8$\times$ speed up compared with current state-of-the-art custom designed FPGA accelerators.

2.3ARNov 27, 2024Code

A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs

Ehsan Kabir, Jason D. Bakos, David Andrews et al.

Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.

1.2ETMay 28, 2025

CrossNAS: A Cross-Layer Neural Architecture Search Framework for PIM Systems

Md Hasibul Amin, Mohammadreza Mohammadi, Jason D. Bakos et al.

In this paper, we propose the CrossNAS framework, an automated approach for exploring a vast, multidimensional search space that spans various design abstraction layers-circuits, architecture, and systems-to optimize the deployment of machine learning workloads on analog processing-in-memory (PIM) systems. CrossNAS leverages the single-path one-shot weight-sharing strategy combined with the evolutionary search for the first time in the context of PIM system mapping and optimization. CrossNAS sets a new benchmark for PIM neural architecture search (NAS), outperforming previous methods in both accuracy and energy efficiency while maintaining comparable or shorter search times.