Pengxiang Xu

OC
h-index2
5papers
23citations
Novelty47%
AI Score44

5 Papers

24.9DCMay 6
SGEMM-cube: Precision-Recovery FP32 GEMM Approximation on Ascend NPUs with FP16 Matrix Engines

Weicheng Xue, Baisong Xu, Kai Yang et al.

Modern AI accelerators provide high-throughput low-precision matrix engines, but their support for FP32 GEMM is often limited or inefficient. This work presents SGEMM-cube, a precision-recovery FP32 GEMM approximation on Ascend NPUs using FP16 Cube units. Rather than claiming bit-exact FP32 approximation, SGEMM-cube targets near-FP32 accuracy for inputs whose magnitudes are representable within the FP16 dynamic range. The method follows a two-component FP32-to-FP16 splitting strategy related to Ozaki-style and Ootomo-style schemes: each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, and the matrix product is reconstructed from the dominant high-high and high-low terms while omitting the low-low term. The main contribution of this paper is not a new splitting paradigm, but an architecture-specific realization and analysis of this precision-recovery scheme on Ascend NPUs. We analyze the effects of round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend execution model, and clarify the range and accuracy limitations of the approach. We further adapt standard high-performance GEMM techniques, including L1-aware blocking and double-buffered pipelining, to the software-managed memory hierarchy of Ascend NPUs. Experiments on Ascend 910A show that SGEMM-cube recovers substantially higher accuracy than native FP16 GEMM and approaches FP32 SGEMM accuracy for moderate-range inputs, while achieving up to 65.3 TFLOP/s, corresponding to 77\% of the FP32-equivalent peak defined by the three-GEMM decomposition cost. These results demonstrate that FP32-accuracy GEMM approximation can be made practical on FP16-only NPU matrix engines, provided that its range, error, and implementation constraints are explicitly managed.

LGOct 28, 2025Code
Calibrating and Rotating: A Unified Framework for Weight Conditioning in PEFT

Da Chang, Peng Xue, Yu Li et al.

Parameter-Efficient Fine-Tuning (PEFT) methods are crucial for adapting large pre-trained models. Among these, LoRA is considered a foundational approach. Building on this, the influential DoRA method enhances performance by decomposing weight updates into magnitude and direction. However, its underlying mechanism remains unclear, and it introduces significant computational overhead. In this work, we first identify that DoRA's success stems from its capacity to increase the singular value entropy of the weight update matrix, which promotes a more uniform update distribution akin to full fine-tuning. We then reformulate DoRA into a mathematically equivalent and more efficient matrix form, revealing it as a learnable weight conditioning method. Based on this insight, we propose a unified framework for designing advanced PEFT methods by exploring two orthogonal dimensions: the architectural placement and the transformation type of the conditioning matrix. Within this framework, we introduce two novel methods: (1) \textbf{Pre-Diag}, which applies a diagonal conditioning matrix before the LoRA update to efficiently calibrate the pre-trained weights, thereby enhancing performance while reducing training time; and (2) \textbf{S}kewed \textbf{O}rthogonal \textbf{R}otation \textbf{A}daptation (\textbf{SORA}), which employs a parameter-efficient orthogonal rotation to perform a more powerful, norm-preserving transformation of the feature space. Extensive experiments on natural language understanding and generation tasks demonstrate that our proposed methods achieve superior performance and efficiency compared to both LoRA and DoRA. The code is available at https://github.com/MaeChd/SORA.

OCJun 14, 2021
NG+ : A Multi-Step Matrix-Product Natural Gradient Method for Deep Learning

Minghan Yang, Dong Xu, Qiwen Cui et al.

In this paper, a novel second-order method called NG+ is proposed. By following the rule ``the shape of the gradient equals the shape of the parameter", we define a generalized fisher information matrix (GFIM) using the products of gradients in the matrix form rather than the traditional vectorization. Then, our generalized natural gradient direction is simply the inverse of the GFIM multiplies the gradient in the matrix form. Moreover, the GFIM and its inverse keeps the same for multiple steps so that the computational cost can be controlled and is comparable with the first-order methods. A global convergence is established under some mild conditions and a regret bound is also given for the online learning setting. Numerical results on image classification with ResNet50, quantum chemistry modeling with Schnet, neural machine translation with Transformer and recommendation system with DLRM illustrate that GN+ is competitive with the state-of-the-art methods.

CVMar 1, 2021
DST: Data Selection and joint Training for Learning with Noisy Labels

Yi Wei, Xue Mei, Xin Liu et al.

Training a deep neural network heavily relies on a large amount of training data with accurate annotations. To alleviate this problem, various methods have been proposed to annotate the data automatically. However, automatically generating annotations will inevitably yields noisy labels. In this paper, we propose a Data Selection and joint Training (DST) method to automatically select training samples with accurate annotations. Specifically, DST fits a mixture model according to the original annotation as well as the predicted label for each training sample, and the mixture model is utilized to dynamically divide the training dataset into a correctly labeled dataset, a correctly predicted set and a wrong dataset. Then, DST is trained with these datasets in a supervised manner. Due to confirmation bias problem, we train the two networks alternately, and each network is tasked to establish the data division to teach another network. For each iteration, the correctly labeled and predicted labels are reweighted respectively by the probabilities from the mixture model, and a uniform distribution is used to generate the probabilities of the wrong samples. Experiments on CIFAR-10, CIFAR-100 and Clothing1M demonstrate that DST is the comparable or superior to the state-of-the-art methods.

OCJun 10, 2020
Sketchy Empirical Natural Gradient Methods for Deep Learning

Minghan Yang, Dong Xu, Zaiwen Wen et al.

In this paper, we develop an efficient sketchy empirical natural gradient method (SENG) for large-scale deep learning problems. The empirical Fisher information matrix is usually low-rank since the sampling is only practical on a small amount of data at each iteration. Although the corresponding natural gradient direction lies in a small subspace, both the computational cost and memory requirement are still not tractable due to the high dimensionality. We design randomized techniques for different neural network structures to resolve these challenges. For layers with a reasonable dimension, sketching can be performed on a regularized least squares subproblem. Otherwise, since the gradient is a vectorization of the product between two matrices, we apply sketching on the low-rank approximations of these matrices to compute the most expensive parts. A distributed version of SENG is also developed for extremely large-scale applications. Global convergence to stationary points is established under some mild assumptions and a fast linear convergence is analyzed under the neural tangent kernel (NTK) case. Extensive experiments on convolutional neural networks show the competitiveness of SENG compared with the state-of-the-art methods. On the task ResNet50 with ImageNet-1k, SENG achieves 75.9\% Top-1 testing accuracy within 41 epochs. Experiments on the distributed large-batch training show that the scaling efficiency is quite reasonable.