Wendong Mao

h-index: 36
Papers: 11
Citations: 44
Novelty: 50%
AI Score: 42

11 Papers

LG · Nov 4, 2022
An Efficient FPGA-based Accelerator for Deep Forest

Mingyu Zhu, Jiapeng Luo, Wendong Mao et al.

Deep Forest is a prominent machine learning algorithm known for its high accuracy in forecasting. Compared with deep neural networks, Deep Forest has almost no multiplication operations and performs better on small datasets. However, due to its deep structure and large number of forests, it suffers from heavy computation and memory consumption. In this paper, an efficient hardware accelerator is proposed for Deep Forest models, which is also the first work to implement Deep Forest on FPGA. First, a carefully designed node computing unit (NCU) is introduced to improve inference speed. Second, based on the NCU, an efficient architecture and an adaptive dataflow are proposed to alleviate node computing imbalance in the classification process. Moreover, an optimized storage scheme improves hardware utilization and power efficiency. The proposed design is implemented on an Intel Stratix V FPGA board and evaluated on two typical datasets, ADULT and Face Mask Detection. The experimental results show that the proposed design achieves around 40x speedup over a 40-core high-performance x86 CPU.

CV · Aug 16, 2023
S2R: Exploring a Double-Win Transformer-Based Framework for Ideal and Blind Super-Resolution

Minghao She, Wendong Mao, Huihong Shi et al.

Deep learning-based methods have demonstrated impressive performance on ideal super-resolution (SR) datasets, but most of them incur dramatic performance drops when directly applied to real-world SR reconstruction tasks with unpredictable blur kernels. To tackle this issue, blind SR methods have been proposed to improve visual results on random blur kernels, but these in turn produce unsatisfactory reconstructions on ideal low-resolution images. In this paper, we propose a double-win framework for ideal and blind SR tasks, named S2R, which includes a lightweight transformer-based SR model (S2R transformer) and a novel coarse-to-fine training strategy, and which achieves excellent visual results under both ideal and random fuzzy conditions. At the algorithm level, the S2R transformer combines efficient and lightweight blocks to enhance the representation ability of extracted features with a relatively low number of parameters. For the training strategy, a coarse-level learning process is first performed to improve the generalization of the network with the help of a large-scale external dataset, and then a fast fine-tuning process is developed to transfer the pre-trained model to real-world SR tasks by mining the internal features of the image. Experimental results show that the proposed S2R outperforms other single-image SR models in the ideal SR condition with only 578K parameters. Meanwhile, it achieves better visual results than regular blind SR models in blind fuzzy conditions with only 10 gradient updates, which improves convergence speed by 300 times and significantly accelerates the transfer-learning process in real-world situations.

CV · May 6, 2024 · Code
Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

Huihong Shi, Haikuo Shao, Wendong Mao et al.

Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly Softmax, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with standard ViTs, we focus our attention towards the quantization and acceleration for efficient ViTs, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a tailored post-training quantization engine taking the unique activation distributions of Softmax-free efficient ViTs into full consideration, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently prove the effectiveness of our Trio-ViT framework. Particularly, we can gain up to 3.6x, 5.0x, and 7.3x FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as 6.0x, 1.5x, and 2.1x DSP efficiency. Codes are available at https://github.com/shihuihong214/Trio-ViT.
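
The low cost of Softmax-free linear attention mentioned above comes from matrix associativity: without Softmax, (Q K^T) V can be regrouped as Q (K^T V), so the N x N score matrix is never formed. The sketch below illustrates only that regrouping. The ReLU feature map, the omission of the normalization term, and all function names are assumptions made for illustration; this is not the Trio-ViT quantization engine or accelerator.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def relu(m):
    # Assumed feature map; replaces Softmax so the product stays linear.
    return [[max(0, x) for x in row] for row in m]


def quadratic_attention(q, k, v):
    # Forms the (N x N) score matrix first, so cost grows with N^2.
    return matmul(matmul(relu(q), transpose(relu(k))), v)


def linear_attention(q, k, v):
    # Forms the (d x d) summary K^T V first, so cost grows linearly with N.
    return matmul(relu(q), matmul(transpose(relu(k)), v))


# N = 3 tokens, d = 2 channels; integers keep the comparison exact.
q = [[1, -2], [0, 3], [2, 1]]
k = [[2, 0], [-1, 1], [1, 4]]
v = [[1, 0], [0, 1], [2, 2]]
assert quadratic_attention(q, k, v) == linear_attention(q, k, v)
```

Both orderings give the same result; only the size of the intermediate matrix differs, which is what makes the second form attractive when N is much larger than d.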

CV · Mar 22, 2018 · Code
BSD-GAN: Branched Generative Adversarial Network for Scale-Disentangled Representation Learning and Image Synthesis

Zili Yi, Zhiqin Chen, Hao Cai et al.

We introduce BSD-GAN, a novel multi-branch and scale-disentangled training method which enables unconditional Generative Adversarial Networks (GANs) to learn image representations at multiple scales, benefiting a wide range of generation and editing tasks. The key feature of BSD-GAN is that it is trained in multiple branches, progressively covering both the breadth and depth of the network, as resolutions of the training images increase to reveal finer-scale features. Specifically, each noise vector, as input to the generator network of BSD-GAN, is deliberately split into several sub-vectors, each corresponding to, and trained to learn, image representations at a particular scale. During training, we progressively "de-freeze" the sub-vectors, one at a time, as a new set of higher-resolution images is employed for training and more network layers are added. A consequence of such an explicit sub-vector designation is that we can directly manipulate and even combine latent (sub-vector) codes which model different feature scales. Extensive experiments demonstrate the effectiveness of our training method in scale-disentangled learning of image representations and synthesis of novel image contents, without any extra labels and without compromising the quality of the synthesized high-resolution images. We further demonstrate several image generation and manipulation applications enabled or improved by BSD-GAN. Source codes are available at https://github.com/duxingren14/BSD-GAN.

AR · Mar 29, 2024
An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT

Haikuo Shao, Huihong Shi, Wendong Mao et al.

Vision Transformers (ViTs) have achieved significant success in computer vision. However, their intensive computations and massive memory footprint challenge ViTs' deployment on embedded devices, calling for efficient ViTs. Among them, EfficientViT, the state-of-the-art one, features a Convolution-Transformer hybrid architecture, enhancing both accuracy and hardware efficiency. Unfortunately, existing accelerators cannot fully exploit the hardware benefits of EfficientViT due to its unique architecture. In this paper, we propose an FPGA-based accelerator for EfficientViT to advance the hardware efficiency frontier of ViTs. Specifically, we design a reconfigurable architecture to efficiently support various operation types, including lightweight convolutions and attention, boosting hardware utilization. Additionally, we present a time-multiplexed and pipelined dataflow to facilitate both intra- and inter-layer fusions, reducing off-chip data access costs. Experimental results show that our accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200MHz on the Xilinx ZCU102 FPGA, which significantly outperforms prior works.

CV · Sep 17, 2025
Deep Lookup Network

Yulan Guo, Longguang Wang, Wendong Mao et al.

Convolutional neural networks are built from a massive number of operations of different types and are highly computationally intensive. Among these operations, multiplication has higher computational complexity and usually requires more energy and longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper, we introduce a generic and efficient lookup operation which can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. It is demonstrated that our lookup networks can benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance compared with vanilla convolutional networks. Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).
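
The idea of replacing a weight-activation multiplication with a table lookup can be sketched as follows. This is a minimal illustration under stated assumptions: uniform quantization, a fixed table filled with plain products, and every name here (`quantize`, `build_table`, `lookup_dot`) are hypothetical. The paper describes tables that are learned in a differentiable manner, which this sketch does not attempt.

```python
def quantize(value, levels, lo, hi):
    """Return the index of the nearest of `levels` evenly spaced points in [lo, hi]."""
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * (levels - 1))


def dequantize(index, levels, lo, hi):
    """Return the real value represented by a level index."""
    return lo + index * (hi - lo) / (levels - 1)


def build_table(levels, w_range, a_range):
    # table[i][j] is the response for weight level i and activation level j.
    # Assumption: filled with products; a learned table could hold any values.
    return [[dequantize(i, levels, *w_range) * dequantize(j, levels, *a_range)
             for j in range(levels)]
            for i in range(levels)]


def lookup_dot(weights, activations, table, levels, w_range, a_range):
    """Dot product computed with table reads and additions only."""
    total = 0.0
    for w, a in zip(weights, activations):
        total += table[quantize(w, levels, *w_range)][quantize(a, levels, *a_range)]
    return total


levels, w_range, a_range = 5, (-1.0, 1.0), (0.0, 1.0)
table = build_table(levels, w_range, a_range)
result = lookup_dot([0.5, -1.0, 1.0], [1.0, 0.5, 0.25], table, levels, w_range, a_range)
```

At inference time the inner loop performs no multiplications between weights and activations; the table size (here 5 x 5) trades memory for precision.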

CV · Sep 7, 2025
StripDet: Strip Attention-Based Lightweight 3D Object Detection from Point Cloud

Weichao Wang, Wendong Mao, Zhongfeng Wang

The deployment of high-accuracy 3D object detection models from point cloud remains a significant challenge due to their substantial computational and memory requirements. To address this, we introduce StripDet, a novel lightweight framework designed for on-device efficiency. First, we propose the novel Strip Attention Block (SAB), a highly efficient module designed to capture long-range spatial dependencies. By decomposing standard 2D convolutions into asymmetric strip convolutions, SAB efficiently extracts directional features while reducing computational complexity from quadratic to linear. Second, we design a hardware-friendly hierarchical backbone that integrates SAB with depthwise separable convolutions and a simple multiscale fusion strategy, achieving end-to-end efficiency. Extensive experiments on the KITTI dataset validate StripDet's superiority. With only 0.65M parameters, our model achieves a 79.97% mAP for car detection, surpassing the baseline PointPillars with a 7x parameter reduction. Furthermore, StripDet outperforms recent lightweight and knowledge distillation-based methods, achieving a superior accuracy-efficiency trade-off while establishing itself as a practical solution for real-world 3D detection on edge devices.
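
The saving from strip convolutions can be illustrated with a small sketch: a k x k kernel has k^2 weights, while a 1 x k strip followed by a k x 1 strip has 2k. For a kernel that factors into a column times a row, the two strips reproduce the full kernel exactly. The function names and the Sobel-like example kernel are assumptions chosen for illustration, and this is not the Strip Attention Block itself; in general a strip pair covers only a restricted family of 2D kernels.

```python
def correlate2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding, on lists of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def strip_pair(image, row_strip, col_strip):
    """Apply a 1 x k strip, then a k x 1 strip."""
    horizontal = correlate2d_valid(image, [row_strip])
    return correlate2d_valid(horizontal, [[c] for c in col_strip])


row_strip = [1, 0, -1]   # 1 x 3
col_strip = [1, 2, 1]    # 3 x 1
# The equivalent 3 x 3 kernel is the outer product of the two strips.
full_kernel = [[c * r for r in row_strip] for c in col_strip]

image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
full_out = correlate2d_valid(image, full_kernel)
strip_out = strip_pair(image, row_strip, col_strip)
assert strip_out == full_out

k = len(row_strip)
weights_full, weights_strips = k * k, 2 * k   # 9 versus 6 for k = 3
```

The gap widens with kernel size: for k = 7 the counts are 49 versus 14, which is the quadratic-to-linear reduction in kernel size described above.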

CV · Jul 13, 2025
A Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search

Wendong Mao, Mingfan Zhao, Jianfeng Guan et al.

Deformable Attention Transformers (DAT) have shown remarkable performance in computer vision tasks by adaptively focusing on informative image regions. However, their data-dependent sampling mechanism introduces irregular memory access patterns, posing significant challenges for efficient hardware deployment. Existing acceleration methods either incur high hardware overhead or compromise model accuracy. To address these issues, this paper proposes a hardware-friendly optimization framework for DAT. First, a neural architecture search (NAS)-based method with a new slicing strategy is proposed to automatically divide the input feature into uniform patches during the inference process, avoiding memory conflicts without modifying the model architecture. The method explores the optimal slice configuration by jointly optimizing hardware cost and inference accuracy. Second, an FPGA-based verification system is designed to test the performance of this framework on edge-side hardware. Algorithm experiments on the ImageNet-1K dataset demonstrate that our hardware-friendly framework incurs only a 0.2% accuracy drop compared to the baseline DAT. Hardware experiments on a Xilinx FPGA show that the proposed method reduces DRAM access times to 18% of those of existing DAT acceleration methods.

GR · Apr 8, 2025
CDM-QTA: Quantized Training Acceleration for Efficient LoRA Fine-Tuning of Diffusion Model

Jinming Lu, Minghao She, Wendong Mao et al.

Fine-tuning large diffusion models for custom applications demands substantial power and time, which poses significant challenges for efficient implementation on mobile devices. In this paper, we develop a novel training accelerator specifically for Low-Rank Adaptation (LoRA) of diffusion models, aiming to streamline the process and reduce computational complexity. By leveraging a fully quantized training scheme for LoRA fine-tuning, we achieve substantial reductions in memory usage and power consumption while maintaining high model fidelity. The proposed accelerator features flexible dataflow, enabling high utilization for irregular and variable tensor shapes during the LoRA process. Experimental results show up to 1.81x training speedup and 5.50x energy efficiency improvements compared to the baseline, with minimal impact on image generation quality.
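
For readers unfamiliar with LoRA, the following sketch shows the standard low-rank update y = W x + (alpha / r) B (A x), where W is frozen and only the small matrices A and B are trained. It is a floating-point illustration only: the quantized training scheme and the accelerator dataflow from the paper are not modeled, and the function names are hypothetical.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Frozen base layer plus a scaled rank-r correction."""
    r = len(a)                        # rank = number of rows in A
    base = matvec(w, x)               # W x, shape d_out
    update = matvec(b, matvec(a, x))  # B (A x), shape d_out
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Parameters in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


w = [[1, 0, 0], [0, 1, 0]]   # frozen 2 x 3 weight
a = [[1, 1, 1]]              # 1 x 3, rank r = 1
b = [[0.5], [0.0]]           # 2 x 1
x = [1, 2, 3]
y = lora_forward(w, a, b, x, alpha=2)

# With B initialized to zero the adapter leaves the base output unchanged.
y_init = lora_forward(w, a, [[0.0], [0.0]], x, alpha=2)
```

For a square 768 x 768 layer, `trainable_params(768, 768, 4)` gives 6144 adapter weights against 589824 in the full matrix, which is why LoRA tensors are small and irregular in shape compared with the base model.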

LG · Oct 2, 2018
Multi-scale Convolution Aggregation and Stochastic Feature Reuse for DenseNets

Mingjie Wang, Jun Zhou, Wendong Mao et al.

Recently, Convolutional Neural Networks (CNNs) have achieved huge success in numerous vision tasks. In particular, DenseNets have demonstrated that feature reuse via dense skip connections can effectively alleviate the difficulty of training very deep networks and that reusing features generated by the initial layers in all subsequent layers has a strong impact on performance. To feed even richer information into the network, a novel adaptive Multi-scale Convolution Aggregation module is presented in this paper. Composed of layers for multi-scale convolutions, trainable cross-scale aggregation, maxout, and concatenation, this module is highly non-linear and can boost the accuracy of DenseNet while using far fewer parameters. In addition, due to high model complexity, a network with extremely dense feature reuse is prone to overfitting. To address this problem, a regularization method named Stochastic Feature Reuse is also presented. By randomly dropping a set of feature maps to be reused for each mini-batch during the training phase, this regularization method reduces training costs and prevents co-adaptation. Experimental results on the CIFAR-10, CIFAR-100 and SVHN benchmarks demonstrate the effectiveness of the proposed methods.
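
The sampling step behind Stochastic Feature Reuse can be sketched as below: for each mini-batch, a random subset of the earlier feature maps is kept for reuse and the rest are dropped. The independent per-map keep probability, the rule of always keeping at least one map, and the function name are assumptions made for illustration; the abstract does not specify how the dropped set is chosen.

```python
import random


def sample_reused_features(num_earlier_maps, keep_prob, rng):
    """Return indices of earlier feature maps to reuse for one mini-batch."""
    kept = [i for i in range(num_earlier_maps) if rng.random() < keep_prob]
    if not kept:
        # Assumption: never drop everything, so the layer always has input.
        kept = [rng.randrange(num_earlier_maps)]
    return kept


rng = random.Random(0)  # seeded for reproducibility
for step in range(3):
    reused = sample_reused_features(num_earlier_maps=16, keep_prob=0.5, rng=rng)
    # A dense layer would concatenate only the maps listed in `reused` here.
```

At test time all feature maps would be reused, as with dropout-style regularizers; that behavior is likewise assumed rather than taken from the abstract.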

CV · Oct 2, 2018
Semi-dense Stereo Matching using Dual CNNs

Wendong Mao, Mingjie Wang, Jun Zhou et al.

A robust solution for semi-dense stereo matching is presented. It utilizes two CNN models for computing stereo matching cost and performing confidence-based filtering, respectively. Compared to existing CNN-based matching cost generation approaches, our method feeds additional global information into the network so that the learned model can better handle challenging cases, such as lighting changes and lack of texture. By utilizing non-parametric transforms, our method is also more self-reliant than most existing semi-dense stereo approaches, which rely heavily on parameter tuning. The experimental results on the Middlebury Stereo dataset demonstrate that the proposed approach outperforms state-of-the-art semi-dense stereo approaches.