X. Sharon Hu

CV
h-index40
17papers
234citations
Novelty52%
AI Score47

17 Papers

CVSep 4, 2022
Data-Driven Deep Supervision for Skin Lesion Classification

Suraj Mishra, Yizhe Zhang, Li Zhang et al.

Automatic classification of pigmented, non-pigmented, and depigmented non-melanocytic skin lesions have garnered lots of attention in recent years. However, imaging variations in skin texture, lesion shape, depigmentation contrast, lighting condition, etc. hinder robust feature extraction, affecting classification accuracy. In this paper, we propose a new deep neural network that exploits input data for robust feature extraction. Specifically, we analyze the convolutional network's behavior (field-of-view) to find the location of deep supervision for improved feature extraction. To achieve this, first, we perform activation mapping to generate an object mask, highlighting the input regions most critical for classification output generation. Then the network layer whose layer-wise effective receptive field matches the approximated object shape in the object mask is selected as our focus for deep supervision. Utilizing different types of convolutional feature extractors and classifiers on three melanoma detection datasets and two vitiligo detection datasets, we verify the effectiveness of our new method.

ETApr 15, 2022
Experimentally realized memristive memory augmented neural network

Ruibin Mao, Bo Wen, Yahui Zhao et al.

Lifelong on-device learning is a key challenge for machine intelligence, and this requires learning from few, often single, samples. Memory augmented neural network has been proposed to achieve the goal, but the memory module has to be stored in an off-chip memory due to its size. Therefore the practical use has been heavily limited. Previous works on emerging memory-based implementation have difficulties in scaling up because different modules with various structures are difficult to integrate on the same chip and the small sense margin of the content addressable memory for the memory module heavily limited the degree of mismatch calculation. In this work, we implement the entire memory augmented neural network architecture in a fully integrated memristive crossbar platform and achieve an accuracy that closely matches standard software on digital hardware for the Omniglot dataset. The successful demonstration is supported by implementing new functions in crossbars in addition to widely reported matrix multiplications. For example, the locality-sensitive hashing operation is implemented in crossbar arrays by exploiting the intrinsic stochasticity of memristor devices. Besides, the content-addressable memory module is realized in crossbars, which also supports the degree of mismatches. Simulations based on experimentally validated models show such an implementation can be efficiently scaled up for one-shot learning on the Mini-ImageNet dataset. The successful demonstration paves the way for practical on-device lifelong learning and opens possibilities for novel attention-based algorithms not possible in conventional hardware.

32.3ETApr 6
Probabilistic Tree Inference Enabled by FDSOI Ferroelectric FETs

Pengyu Ren, Xingtian Wang, Boyang Cheng et al.

Artificial intelligence applications in autonomous driving, medical diagnostics, and financial systems increasingly demand machine learning models that can provide robust uncertainty quantification, interpretability, and noise resilience. Bayesian decision trees (BDTs) are attractive for these tasks because they combine probabilistic reasoning, interpretable decision-making, and robustness to noise. However, existing hardware implementations of BDTs based on CPUs and GPUs are limited by memory bottlenecks and irregular processing patterns, while multi-platform solutions exploiting analog content-addressable memory (ACAM) and Gaussian random number generators (GRNGs) introduce integration complexity and energy overheads. Here we report a monolithic FDSOI-FeFET hardware platform that natively supports both ACAM and GRNG functionalities. The ferroelectric polarization of FeFETs enables compact, energy-efficient multi-bit storage for ACAM, and band-to-band tunneling in the gate-to-drain overlap region and subsequent hole storage in the floating body provides a high-quality entropy source for GRNG. System-level evaluations demonstrate that the proposed architecture provides robust uncertainty estimation, interpretability, and noise tolerance with high energy efficiency. Under both dataset noise and device variations, it achieves over 40% higher classification accuracy on MNIST compared to conventional decision trees. Moreover, it delivers more than two orders of magnitude speedup over CPU and GPU baselines and over four orders of magnitude improvement in energy efficiency, making it a scalable solution for deploying BDTs in resource-constrained and safety-critical environments.

ARJul 16, 2022
Associative Memory Based Experience Replay for Deep Reinforcement Learning

Mengyuan Li, Arman Kazemi, Ann Franchesca Laguna et al.

Experience replay is an essential component in deep reinforcement learning (DRL), which stores the experiences and generates experiences for the agent to learn in real time. Recently, prioritized experience replay (PER) has been proven to be powerful and widely deployed in DRL agents. However, implementing PER on traditional CPU or GPU architectures incurs significant latency overhead due to its frequent and irregular memory accesses. This paper proposes a hardware-software co-design approach to design an associative memory (AM) based PER, AMPER, with an AM-friendly priority sampling operation. AMPER replaces the widely-used time-costly tree-traversal-based priority sampling in PER while preserving the learning performance. Further, we design an in-memory computing hardware architecture based on AM to support AMPER by leveraging parallel in-memory search operations. AMPER shows comparable learning performance while achieving 55x to 270x latency improvement when running on the proposed hardware compared to the state-of-the-art PER running on GPU.

SDNov 21, 2024
Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

Ruiyang Qin, Dancheng Liu, Gelei Xu et al.

The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50\%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.

CVOct 10, 2025
Cell Instance Segmentation: The Devil Is in the Boundaries

Peixian Liang, Yifan Ding, Yizhe Zhang et al.

State-of-the-art (SOTA) methods for cell instance segmentation are based on deep learning (DL) semantic segmentation approaches, focusing on distinguishing foreground pixels from background pixels. In order to identify cell instances from foreground pixels (e.g., pixel clustering), most methods decompose instance information into pixel-wise objectives, such as distances to foreground-background boundaries (distance maps), heat gradients with the center point as heat source (heat diffusion maps), and distances from the center point to foreground-background boundaries with fixed angles (star-shaped polygons). However, pixel-wise objectives may lose significant geometric properties of the cell instances, such as shape, curvature, and convexity, which require a collection of pixels to represent. To address this challenge, we present a novel pixel clustering method, called Ceb (for Cell boundaries), to leverage cell boundary features and labels to divide foreground pixels into cell instances. Starting with probability maps generated from semantic segmentation, Ceb first extracts potential foreground-foreground boundaries with a revised Watershed algorithm. For each boundary candidate, a boundary feature representation (called boundary signature) is constructed by sampling pixels from the current foreground-foreground boundary as well as the neighboring background-foreground boundaries. Next, a boundary classifier is used to predict its binary boundary label based on the corresponding boundary signature. Finally, cell instances are obtained by dividing or merging neighboring regions based on the predicted boundary labels. Extensive experiments on six datasets demonstrate that Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps. Moreover, Ceb achieves highly competitive performance compared to SOTA cell instance segmentation methods.

LGJul 16, 2025
Trustworthy Tree-based Machine Learning by $MoS_2$ Flash-based Analog CAM with Inherent Soft Boundaries

Bo Wen, Guoyun Gao, Zhicheng Xu et al.

The rapid advancement of artificial intelligence has raised concerns regarding its trustworthiness, especially in terms of interpretability and robustness. Tree-based models like Random Forest and XGBoost excel in interpretability and accuracy for tabular data, but scaling them remains computationally expensive due to poor data locality and high data dependence. Previous efforts to accelerate these models with analog content addressable memory (CAM) have struggled, due to the fact that the difficult-to-implement sharp decision boundaries are highly susceptible to device variations, which leads to poor hardware performance and vulnerability to adversarial attacks. This work presents a novel hardware-software co-design approach using $MoS_2$ Flash-based analog CAM with inherent soft boundaries, enabling efficient inference with soft tree-based models. Our soft tree model inference experiments on $MoS_2$ analog CAM arrays show this method achieves exceptional robustness against device variation and adversarial attacks while achieving state-of-the-art accuracy. Specifically, our fabricated analog CAM arrays achieve $96\%$ accuracy on Wisconsin Diagnostic Breast Cancer (WDBC) database, while maintaining decision explainability. Our experimentally calibrated model validated only a $0.6\%$ accuracy drop on the MNIST dataset under $10\%$ device threshold variation, compared to a $45.3\%$ drop for traditional decision trees. This work paves the way for specialized hardware that enhances AI's trustworthiness and efficiency.

AROct 22, 2024
A 10.60 $μ$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Yifan Qin, Zhenge Jia, Zheyu Yan et al.

This paper proposes an ultra-low power, mixed-bit-width sparse convolutional neural network (CNN) accelerator to accelerate ventricular arrhythmia (VA) detection. The chip achieves 50% sparsity in a quantized 1D CNN using a sparse processing element (SPE) architecture. Measurement on the prototype chip TSMC 40nm CMOS low-power (LP) process for the VA classification task demonstrates that it consumes 10.60 $μ$W of power while achieving a performance of 150 GOPS and a diagnostic accuracy of 99.95%. The computation power density is only 0.57 $μ$W/mm$^2$, which is 14.23X smaller than state-of-the-art works, making it highly suitable for implantable and wearable medical devices.

CVJan 14, 2024
Efficient approximation of Earth Mover's Distance Based on Nearest Neighbor Search

Guangyu Meng, Ruyu Zhou, Liu Liu et al.

Earth Mover's Distance (EMD) is an important similarity measure between two distributions, used in computer vision and many other application domains. However, its exact calculation is computationally and memory intensive, which hinders its scalability and applicability for large-scale problems. Various approximate EMD algorithms have been proposed to reduce computational costs, but they suffer lower accuracy and may require additional memory usage or manual parameter tuning. In this paper, we present a novel approach, NNS-EMD, to approximate EMD using Nearest Neighbor Search (NNS), in order to achieve high accuracy, low time complexity, and high memory efficiency. The NNS operation reduces the number of data points compared in each NNS iteration and offers opportunities for parallel processing. We further accelerate NNS-EMD via vectorization on GPU, which is especially beneficial for large datasets. We compare NNS-EMD with both the exact EMD and state-of-the-art approximate EMD algorithms on image classification and retrieval tasks. We also apply NNS-EMD to calculate transport mapping and realize color transfer between images. NNS-EMD can be 44x to 135x faster than the exact EMD implementation, and achieves superior accuracy, speedup, and memory efficiency over existing approximate EMD methods.

IVJul 6, 2021
Image Complexity Guided Network Compression for Biomedical Image Segmentation

Suraj Mishra, Danny Z. Chen, X. Sharon Hu

Compression is a standard procedure for making convolutional neural networks (CNNs) adhere to some specific computing resource constraints. However, searching for a compressed architecture typically involves a series of time-consuming training/validation experiments to determine a good compromise between network size and performance accuracy. To address this, we propose an image complexity-guided network compression technique for biomedical image segmentation. Given any resource constraints, our framework utilizes data complexity and network architecture to quickly estimate a compressed model which does not require network training. Specifically, we map the dataset complexity to the target network accuracy degradation caused by compression. Such mapping enables us to predict the final accuracy for different network sizes, based on the computed dataset complexity. Thus, one may choose a solution that meets both the network size and segmentation accuracy requirements. Finally, the mapping is used to determine the convolutional layer-wise multiplicative factor for generating a compressed network. We conduct experiments using 5 datasets, employing 3 commonly-used CNN architectures for biomedical image segmentation as representative networks. Our proposed framework is shown to be effective for generating compressed segmentation networks, retaining up to $\approx 95\%$ of the full-sized network segmentation accuracy, and at the same time, utilizing $\approx 32x$ fewer network trainable weights (average reduction) of the full-sized networks.

CVApr 17, 2021
Objective-Dependent Uncertainty Driven Retinal Vessel Segmentation

Suraj Mishra, Danny Z. Chen, X. Sharon Hu

From diagnosing neovascular diseases to detecting white matter lesions, accurate tiny vessel segmentation in fundus images is critical. Promising results for accurate vessel segmentation have been known. However, their effectiveness in segmenting tiny vessels is still limited. In this paper, we study retinal vessel segmentation by incorporating tiny vessel segmentation into our framework for the overall accurate vessel segmentation. To achieve this, we propose a new deep convolutional neural network (CNN) which divides vessel segmentation into two separate objectives. Specifically, we consider the overall accurate vessel segmentation and tiny vessel segmentation as two individual objectives. Then, by exploiting the objective-dependent (homoscedastic) uncertainty, we enable the network to learn both objectives simultaneously. Further, to improve the individual objectives, we propose: (a) a vessel weight map based auxiliary loss for enhancing tiny vessel connectivity (i.e., improving tiny vessel segmentation), and (b) an enhanced encoder-decoder architecture for improved localization (i.e., for accurate vessel segmentation). Using 3 public retinal vessel segmentation datasets (CHASE_DB1, DRIVE, and STARE), we verify the superiority of our proposed framework in segmenting tiny vessels (8.3% average improvement in sensitivity) while achieving better area under the receiver operating characteristic curve (AUC) compared to state-of-the-art methods.

ETNov 13, 2020
In-Memory Nearest Neighbor Search with FeFET Multi-Bit Content-Addressable Memories

Arman Kazemi, Mohammad Mehdi Sharifi, Ann Franchesca Laguna et al.

Nearest neighbor (NN) search is an essential operation in many applications, such as one/few-shot learning and image classification. As such, fast and low-energy hardware support for accurate NN search is highly desirable. Ternary content-addressable memories (TCAMs) have been proposed to accelerate NN search for few-shot learning tasks by implementing $L_\infty$ and Hamming distance metrics, but they cannot achieve software-comparable accuracies. This paper proposes a novel distance function that can be natively evaluated with multi-bit content-addressable memories (MCAMs) based on ferroelectric FETs (FeFETs) to perform a single-step, in-memory NN search. Moreover, this approach achieves accuracies comparable to floating-point precision implementations in software for NN classification and one/few-shot learning tasks. As an example, the proposed method achieves a 98.34% accuracy for a 5-way, 5-shot classification task for the Omniglot dataset (only 0.8% lower than software-based implementations) with a 3-bit MCAM. This represents a 13% accuracy improvement over state-of-the-art TCAM-based implementations at iso-energy and iso-delay. The presented distance function is resilient to the effects of FeFET device-to-device variations. Furthermore, this work experimentally demonstrates a 2-bit implementation of FeFET MCAM using AND arrays from GLOBALFOUNDRIES to further validate proof of concept.

ETMay 29, 2019
Nonvolatile Spintronic Memory Cells for Neural Networks

Andrew W. Stephan, Qiuwen Lou, Michael Niemier et al.

A new spintronic nonvolatile memory cell analogous to 1T DRAM with non-destructive read is proposed. The cells can be used as neural computing units. A dual-circuit neural network architecture is proposed to leverage these devices against the complex operations involved in convolutional networks. Simulations based on HSPICE and Matlab were performed to study the performance of this architecture when classifying images as well as the effect of varying the size and stability of the nanomagnets. The spintronic cells outperform a purely charge-based implementation of the same network, consuming about 100 pJ total per image processed.

ETFeb 28, 2019
Application-level Studies of Cellular Neural Network-based Hardware Accelerators

Qiuwen Lou, Indranil Palit, Tang Li et al.

As cost and performance benefits associated with Moore's Law scaling slow, researchers are studying alternative architectures (e.g., based on analog and/or spiking circuits) and/or computational models (e.g., convolutional and recurrent neural networks) to perform application-level tasks faster, more energy efficiently, and/or more accurately. We investigate cellular neural network (CeNN)-based co-processors at the application-level for these metrics. While it is well-known that CeNNs can be well-suited for spatio-temporal information processing, few (if any) studies have quantified the energy/delay/accuracy of a CeNN-friendly algorithm and compared the CeNN-based approach to the best von Neumann algorithm at the application level. We present an evaluation framework for such studies. As a case study, a CeNN-friendly target-tracking algorithm was developed and mapped to an array architecture developed in conjunction with the algorithm. We compare the energy, delay, and accuracy of our architecture/algorithm (assuming all overheads) to the most accurate von Neumann algorithm (Struck). Von Neumann CPU data is measured on an Intel i5 chip. The CeNN approach is capable of matching the accuracy of Struck, and can offer approximately 1000x improvements in energy-delay product.

CVJan 6, 2019
CC-Net: Image Complexity Guided Network Compression for Biomedical Image Segmentation

Suraj Mishra, Peixian Liang, Adam Czajka et al.

Convolutional neural networks (CNNs) for biomedical image analysis are often of very large size, resulting in high memory requirement and high latency of operations. Searching for an acceptable compressed representation of the base CNN for a specific imaging application typically involves a series of time-consuming training/validation experiments to achieve a good compromise between network size and accuracy. To address this challenge, we propose CC-Net, a new image complexity-guided CNN compression scheme for biomedical image segmentation. Given a CNN model, CC-Net predicts the final accuracy of networks of different sizes based on the average image complexity computed from the training data. It then selects a multiplicative factor for producing a desired network with acceptable network accuracy and size. Experiments show that CC-Net is effective for generating compressed segmentation networks, retaining up to 95% of the base network segmentation accuracy and utilizing only 0.1% of trainable parameters of the full-sized networks in the best case.

CVOct 30, 2018
A mixed signal architecture for convolutional neural networks

Qiuwen Lou, Chenyun Pan, John McGuiness et al.

Deep neural network (DNN) accelerators with improved energy and delay are desirable for meeting the requirements of hardware targeted for IoT and edge computing systems. Convolutional neural networks (CoNNs) belong to one of the most popular types of DNN architectures. This paper presents the design and evaluation of an accelerator for CoNNs. The system-level architecture is based on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i) the implementation of different layers, including convolution, ReLU, and pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly layers to reduce computational overheads typically associated with a CoNN, (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed signal domain, and (iv) design space exploration that identifies what CeNN-based algorithm and architectural features fare best compared to existing algorithms and architectures when evaluated over common datasets -- MNIST and CIFAR-10. Notably, the proposed approach can lead to 8.7$\times$ improvements in energy-delay product (EDP) per digit classification for the MNIST dataset at iso-accuracy when compared with the state-of-the-art DNN engine, while our approach could offer 4.3$\times$ improvements in EDP when compared to other network implementations for the CIFAR-10 dataset.

CVSep 1, 2018
DAC-SDC Low Power Object Detection Challenge for UAV Applications

Xiaowei Xu, Xinyi Zhang, Bei Yu et al.

The 55th Design Automation Conference (DAC) held its first System Design Contest (SDC) in 2018. SDC'18 features a lower power object detection challenge (LPODC) on designing and implementing novel algorithms based object detection in images taken from unmanned aerial vehicles (UAV). The dataset includes 95 categories and 150k images, and the hardware platforms include Nvidia's TX2 and Xilinx's PYNQ Z1. DAC-SDC'18 attracted more than 110 entries from 12 countries. This paper presents in detail the dataset and evaluation procedure. It further discusses the methods developed by some of the entries as well as representative results. The paper concludes with directions for future improvements.