Zhenhua Yu

CV
h-index5
6papers
21citations
Novelty48%
AI Score33

6 Papers

LGSep 8, 2022
CWP: Instance complexity weighted channel-wise soft masks for network pruning

Jiapeng Wang, Ming Ma, Zhenhua Yu

Existing differentiable channel pruning methods often attach scaling factors or masks behind channels to prune filters with less importance, and implicitly assume uniform contribution of input samples to filter importance. Specifically, the effects of instance complexity on pruning performance are not yet fully investigated in static network pruning. In this paper, we propose a simple yet effective differentiable network pruning method CWP based on instance complexity weighted filter importance scores. We define instance complexity related weight for each instance by giving higher weights to hard instances, and measure the weighted sum of instance-specific soft masks to model non-uniform contribution of different inputs, which encourages hard instances to dominate the pruning process and the model performance to be well preserved. In addition, we introduce a regularizer to maximize polarization of the masks, such that a sweet spot can be easily found to identify the filters to be pruned. Performance evaluations on various network architectures and datasets demonstrate CWP has advantages over the state-of-the-arts in pruning large networks. For instance, CWP improves the accuracy of ResNet56 on CIFAR-10 dataset by 0.32% aftering removing 64.11% FLOPs, and prunes 87.75% FLOPs of ResNet50 on ImageNet dataset with only 0.93% Top-1 accuracy loss.

ROApr 20, 2025
An LLM-enabled Multi-Agent Autonomous Mechatronics Design Framework

Zeyu Wang, Frank P. -W. Lo, Qian Chen et al.

Existing LLM-enabled multi-agent frameworks are predominantly limited to digital or simulated environments and confined to narrowly focused knowledge domain, constraining their applicability to complex engineering tasks that require the design of physical embodiment, cross-disciplinary integration, and constraint-aware reasoning. This work proposes a multi-agent autonomous mechatronics design framework, integrating expertise across mechanical design, optimization, electronics, and software engineering to autonomously generate functional prototypes with minimal direct human design input. Operating primarily through a language-driven workflow, the framework incorporates structured human feedback to ensure robust performance under real-world constraints. To validate its capabilities, the framework is applied to a real-world challenge involving autonomous water-quality monitoring and sampling, where traditional methods are labor-intensive and ecologically disruptive. Leveraging the proposed system, a fully functional autonomous vessel was developed with optimized propulsion, cost-effective electronics, and advanced control. The design process was carried out by specialized agents, including a high-level planning agent responsible for problem abstraction and dedicated agents for structural, electronics, control, and software development. This approach demonstrates the potential of LLM-based multi-agent systems to automate real-world engineering workflows and reduce reliance on extensive domain expertise.

DCSep 3, 2025
FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs

Haijun Zhang, Jinxiang Wang, Zhenhua Yu et al.

Large language models (LLMs) have made a profound impact across various fields due to their advanced capabilities. However, training these models at unprecedented scales requires extensive AI accelerator clusters and sophisticated parallelism strategies, which pose significant challenges in maintaining system reliability over prolonged training periods. A major concern is the substantial loss of training time caused by inevitable hardware and software failures. To address these challenges, we present FlashRecovery, a fast and low-cost failure recovery system comprising three core modules: (1) Active and real-time failure detection. This module performs continuous training state monitoring, enabling immediate identification of hardware and software failures within seconds, thus ensuring rapid incident response; (2) Scale-independent task restart. By employing different recovery strategies for normal and faulty nodes, combined with an optimized communication group reconstruction protocol, our approach ensures that the recovery time remains nearly constant, regardless of cluster scale; (3) Checkpoint-free recovery within one step. Our novel recovery mechanism enables single-step restoration, completely eliminating dependence on traditional checkpointing methods and their associated overhead. Collectively, these innovations enable FlashRecovery to achieve optimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO), substantially improving the reliability and efficiency of long-duration LLM training. Experimental results demonstrate that FlashRecovery system can achieve training restoration on training cluster with 4, 800 devices in 150 seconds. We also verify that the time required for failure recovery is nearly consistent for different scales of training tasks.

CVJan 6, 2025
PARF-Net: integrating pixel-wise adaptive receptive fields into hybrid Transformer-CNN network for medical image segmentation

Xu Ma, Mengsheng Chen, Junhui Zhang et al.

Convolutional neural networks (CNNs) excel in local feature extraction while Transformers are superior in processing global semantic information. By leveraging the strengths of both, hybrid Transformer-CNN networks have become the major architectures in medical image segmentation tasks. However, existing hybrid methods still suffer deficient learning of local semantic features due to the fixed receptive fields of convolutions, and also fall short in effectively integrating local and long-range dependencies. To address these issues, we develop a new method PARF-Net to integrate convolutions of Pixel-wise Adaptive Receptive Fields (Conv-PARF) into hybrid Network for medical image segmentation. The Conv-PARF is introduced to cope with inter-pixel semantic differences and dynamically adjust convolutional receptive fields for each pixel, thus providing distinguishable features to disentangle the lesions with varying shapes and scales from the background. The features derived from the Conv-PARF layers are further processed using hybrid Transformer-CNN blocks under a lightweight manner, to effectively capture local and long-range dependencies, thus boosting the segmentation performance. By assessing PARF-Net on four widely used medical image datasets including MoNuSeg, GlaS, DSB2018 and multi-organ Synapse, we showcase the advantages of our method over the state-of-the-arts. For instance, PARF-Net achieves 84.27% mean Dice on the Synapse dataset, surpassing existing methods by a large margin.

ROAug 4, 2021
A Method to use Nonlinear Dynamics in a Whisker Sensor for Terrain Identification by Mobile Robots

Zhenhua Yu, S. M. Hadi Sadati, Hasitha Wegiriya et al.

This paper shows analytical and experimental evidence of using the vibration dynamics of a compliant whisker for accurate terrain classification during steady state motion of a mobile robot. A Hall effect sensor was used to measure whisker vibrations due to perturbations from the ground. Analytical results predict that the whisker vibrations will have a dominant frequency at the vertical perturbation frequency of the mobile robot sandwiched by two other less dominant but distinct frequency components. These frequency components may come from bifurcation of vibration frequency due to nonlinear interaction dynamics at steady state. Experimental results also exhibit distinct dominant frequency components unique to the speed of the robot and the terrain roughness. This nonlinear dynamic feature is used in a deep multi-layer perceptron neural network to classify terrains. We achieved 85.6\% prediction success rate for seven flat terrain surfaces with different textures.

CVDec 16, 2019
Towards Omni-Supervised Face Alignment for Large Scale Unlabeled Videos

Congcong Zhu, Hao Liu, Zhenhua Yu et al.

In this paper, we propose a spatial-temporal relational reasoning networks (STRRN) approach to investigate the problem of omni-supervised face alignment in videos. Unlike existing fully supervised methods which rely on numerous annotations by hand, our learner exploits large scale unlabeled videos plus available labeled data to generate auxiliary plausible training annotations. Motivated by the fact that neighbouring facial landmarks are usually correlated and coherent across consecutive frames, our approach automatically reasons about discriminative spatial-temporal relationships among landmarks for stable face tracking. Specifically, we carefully develop an interpretable and efficient network module, which disentangles facial geometry relationship for every static frame and simultaneously enforces the bi-directional cycle-consistency across adjacent frames, thus allowing the modeling of intrinsic spatial-temporal relations from raw face sequences. Extensive experimental results demonstrate that our approach surpasses the performance of most fully supervised state-of-the-arts.