Yuwei Fan

LG
h-index: 24
21 papers
287 citations
Novelty: 43%
AI Score: 53

21 Papers

NA · Feb 25, 2019
A multiscale neural network based on hierarchical nested bases

Yuwei Fan, Jordi Feliu-Faba, Lin Lin et al.

In recent years, deep learning has led to impressive results in many fields. In this paper, we introduce a multi-scale artificial neural network for high-dimensional nonlinear maps based on the idea of hierarchical nested bases in the fast multipole method and the $\mathcal{H}^2$-matrices. This approach allows us to efficiently approximate discretized nonlinear maps arising from partial differential equations or integral equations. It also naturally extends our recent work based on the generalization of hierarchical matrices [Fan et al. arXiv:1807.01883] but with a reduced number of parameters. In particular, the number of parameters of the neural network grows linearly with the dimension of the parameter space of the discretized PDE. We demonstrate the properties of the architecture by approximating the solution maps of the nonlinear Schrödinger equation, the radiative transfer equation, and the Kohn-Sham map.

NA · Jan 29, 2018
Fast algorithms for integral formulations of steady-state radiative transfer equation

Yuwei Fan, Jing An, Lexing Ying

We investigate integral formulations and fast algorithms for the steady-state radiative transfer equation with isotropic and anisotropic scattering. When the scattering term is a smooth convolution on the unit sphere, a model reduction step in the angular domain using the Fourier transformation in 2D and the spherical harmonic transformation in 3D significantly reduces the number of degrees of freedom. The resulting Fourier coefficients or spherical harmonic coefficients satisfy a Fredholm integral equation of the second kind. We study the uniqueness of the equation and prove an a priori estimate. For a homogeneous medium, the integral equation can be solved efficiently using the FFT and iterative methods. For an inhomogeneous medium, the recursive skeletonization factorization method is applied instead. Numerical simulations demonstrate the efficiency of the proposed algorithms in both homogeneous and inhomogeneous cases and for both transport and diffusion regimes.

NA · Nov 19, 2018
Filtered Hyperbolic Moment Method for the Vlasov Equation

Yana Di, Yuwei Fan, Zhenzhong Kou et al.

In this paper, we investigate the effect of the filter for the hyperbolic moment equations (HME) [15] of the Vlasov-Poisson equations and propose a novel quasi time-consistent filter to suppress the numerical recurrence effect. By taking the properties of HME into consideration, the filter preserves many physical properties of HME, including Galilean invariance and the conservation of mass, momentum and energy. We present two viewpoints, a collisional viewpoint and a dissipative viewpoint, to dissect the filter, and show that the filtered hyperbolic moment method can be treated as a solver of the Vlasov equation. Numerical simulations of linear Landau damping and the two-stream instability demonstrate the effectiveness of the filter in restraining recurrence arising from particle streaming. Both the analysis and the numerical results indicate that the filtered HME can capture the evolution of the Vlasov equation, even when phase mixing and filamentation are dominant.

NA · Apr 18, 2017
Resolving Knudsen Layer by High Order Moment Expansion

Yuwei Fan, Jun Li, Ruo Li et al.

We model the Knudsen layer in Kramers' problem by a linearized high-order hyperbolic moment system. Due to the hyperbolicity, the boundary conditions of the moment system are properly reduced from the kinetic boundary condition. For Kramers' problem, we give the analytical solutions of the moment systems. As the order of the moment model increases, the solutions approach the solution of the linearized BGK kinetic equation. The velocity profile in the Knudsen layer is captured with improved accuracy for a wide range of accommodation coefficients.

NA · Jul 3, 2018
An entropic Fourier method for the Boltzmann equation

Zhenning Cai, Yuwei Fan, Lexing Ying

We propose an entropic Fourier method for the numerical discretization of the Boltzmann collision operator. The method, which is obtained by modifying a Fourier Galerkin method to match the form of the discrete velocity method, can be viewed both as a discrete velocity method and as a Fourier method. As a discrete velocity method, it preserves the positivity of the solution and satisfies a discrete version of the H-theorem. As a Fourier method, it allows one to readily apply the FFT-based fast algorithms. A second-order convergence rate is validated by numerical experiments.

MATH-PH · Jan 21, 2017
13-Moment System with Global Hyperbolicity for Quantum Gas

Yana Di, Yuwei Fan, Ruo Li

We point out that the quantum Grad's 13-moment system [R. Yano, Physica A: Statistical Mechanics and its Applications, 416:231-241, 2014] lacks global hyperbolicity, and even worse, the thermodynamic equilibrium is not an interior point of the hyperbolicity region of the system. To remedy this problem, by fully considering Grad's expansion, we split the expansion into the equilibrium part and the non-equilibrium part, and propose a regularization for the system with the help of the new theory developed in [Z. Cai et al., SIAM J. Appl. Math., 75(5):2001-2023, 2015, Y. Fan, J. Stat. Phys., 161(4), 2015]. This provides a new model which is hyperbolic for all admissible thermodynamic states, and meanwhile preserves the approximation accuracy of the original system. It should be noted that this procedure is not a trivial application of the theory in [Z. Cai et al., SIAM J. Appl. Math., 75(5):2001-2023, 2015, Y. Fan, J. Stat. Phys., 161(4), 2015].

ML · May 13
Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Lingchao Zheng, Yuwei Fan, Jun Li et al.

Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step (converting low-bit weights back to high precision for matrix multiplication) has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight/KV dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision conversion to multi-scale approximation, avoiding INT8-to-BF16 weight conversion before GEMM. We instantiate MSD for two weight formats and derive tight error bounds for each. For INT8 weights (W4A16), two-pass INT8 decomposition achieves near 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields near 6.6 effective bits with an error bound of 1/64 per block, surpassing single-pass MXFP8 (5.24 bits) while maintaining the same effective GEMM compute time. We further derive closed-form latency and HBM traffic models showing that MSD avoids the Vector-Cube pipeline stall caused by dequantization and reduces KV cache HBM traffic by up to 2.5 times in attention. Numerical simulations on matrix multiplication and Flash Attention kernels confirm that MSD does not degrade accuracy compared to dequantization baselines, and in many settings achieves lower L2 error.
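
A minimal sketch of the activation-decomposition idea described above, assuming plain Python floats in place of BF16 and a symmetric per-vector scale (neither choice is taken from the paper). The helper names `quantize_int8`, `two_pass_decompose`, and `int_dot`, and all numeric values, are hypothetical. The activation vector is split into a coarse INT8 component and an INT8 residual, so both passes multiply integers against already-quantized weights and no weight is converted back to high precision:

```python
def quantize_int8(values, scale):
    """Round values/scale to integers clipped to the symmetric INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def two_pass_decompose(x):
    """Split x into coarse and residual INT8 parts: x ~= s1*q1 + s2*q2."""
    s1 = max(abs(v) for v in x) / 127 or 1.0  # fall back to 1.0 if x is all zeros
    q1 = quantize_int8(x, s1)
    residual = [v - s1 * q for v, q in zip(x, q1)]
    s2 = max(abs(r) for r in residual) / 127 or 1.0
    q2 = quantize_int8(residual, s2)
    return (s1, q1), (s2, q2)


def int_dot(a, b):
    """Integer dot product, standing in for a low-precision GEMM."""
    return sum(p * q for p, q in zip(a, b))


# Hypothetical toy data: high-precision activations and INT8-quantized weights.
x = [0.731, -1.204, 0.058, 2.917]
w_q = [12, -7, 101, -45]
w_scale = 0.02

(s1, q1), (s2, q2) = two_pass_decompose(x)

# Two integer dot products, combined with scalar scales afterwards.
one_pass = w_scale * s1 * int_dot(q1, w_q)
approx = w_scale * (s1 * int_dot(q1, w_q) + s2 * int_dot(q2, w_q))

# Reference: dequantize the weights first, then multiply in high precision.
exact = sum(v * w_scale * w for v, w in zip(x, w_q))

print(abs(one_pass - exact), abs(approx - exact))
```

With these toy values the two-pass result lands much closer to the reference than the single pass does. The effective-bit figures and error bounds quoted in the abstract are the paper's own and are not reproduced by this sketch.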

ML · May 13
AIS: Adaptive Importance Sampling for Quantized RL

Jiajun Zhou, Wei Shao, Lingchao Zheng et al.

Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5x to 2.76x rollout speedup of FP8.
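
The interpolation described above can be written compactly. The notation here is assumed for illustration and is not taken from the paper: $g_{\mathrm{unc}}$ is the uncorrected gradient estimate, $g_{\mathrm{IS}}$ the fully importance-weighted one, and $\lambda_b$ the per-batch mixing coefficient.

$$
\hat{g}_b = (1 - \lambda_b)\, g_{\mathrm{unc}} + \lambda_b\, g_{\mathrm{IS}}, \qquad \lambda_b \in [0, 1].
$$

Setting $\lambda_b = 0$ leaves the mismatch untouched, and $\lambda_b = 1$ applies the full importance-sampling correction. How the three diagnostics are combined into $\lambda_b$ is not specified in the abstract.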

LG · Sep 24, 2025 · Code
AMLA: MUL by ADD in FlashAttention Rescaling

Qichen Liao, Chengqiu Hu, Fangzheng Miao et al.

Multi-head Latent Attention (MLA) significantly reduces KVCache memory usage in Large Language Models while introducing substantial computational overhead and intermediate variable expansion. This poses challenges for efficient hardware implementation -- especially during the decode phase. This paper introduces Ascend MLA (AMLA), a high-performance kernel specifically optimized for Huawei's Ascend NPUs. AMLA is built on two core innovations: (1) A novel FlashAttention-based algorithm that replaces floating-point multiplications with integer additions for output block rescaling, leveraging binary correspondence between FP32 and INT32 representations; (2) A Preload Pipeline strategy with hierarchical tiling that maximizes FLOPS utilization: the Preload Pipeline achieves Cube-bound performance, while hierarchical tiling overlaps data movement and computation within the Cube core. Experiments show that on Ascend 910 NPUs (integrated in CloudMatrix384), AMLA achieves up to 614 TFLOPS, reaching 86.8% of the theoretical maximum FLOPS, outperforming the state-of-the-art open-source FlashMLA implementation, whose FLOPS utilization is up to 66.7% on NVIDIA H800 SXM5. The AMLA kernel has been integrated into Huawei's CANN and will be released soon.
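
A small sketch of the FP32/INT32 bit-level correspondence that makes "multiply by add" possible. It assumes the rescaling factor is an exact power of two, $2^k$, which is an assumption made here for illustration; the abstract does not say how AMLA arranges its scaling factors, and this is not the AMLA kernel. The function names are hypothetical. Because the IEEE 754 single-precision exponent occupies bits 23 to 30, adding `k << 23` to the integer view of a normal float multiplies it by $2^k$:

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret x (rounded to float32) as its 32-bit unsigned integer pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit integer pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def scale_by_pow2(x: float, k: int) -> float:
    """Compute x * 2**k with one integer addition on the exponent field.

    Valid only when x is a normal, nonzero float32 and the result stays in
    the normal range; zeros, subnormals, infinities, NaNs, and overflow are
    not handled in this sketch.
    """
    return bits_to_float(float_to_bits(x) + (k << 23))


print(scale_by_pow2(3.0, 2))    # 3.0 * 4
print(scale_by_pow2(-1.5, -1))  # -1.5 / 2
```

The sign and mantissa bits are untouched, so the result is exact whenever the stated range conditions hold.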

DC · May 7
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

Tianci Bu, Yuan Lyu, Zixi Chen et al.

Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100 ms decode budget over hundreds of waiting requests and tens of workers. We present BalanceRoute, a family of practical online routing algorithms that target this bottleneck. The first, BR-0, requires no prediction infrastructure and uses a piecewise-linear F-score that captures the sharp asymmetry between admissions that fill safe margin and those that overflow into the envelope; a two-stage decomposition keeps per-step cost compatible with millisecond-scale scheduling. The second, BR-H, generalizes BR-0 with a short, constant lookahead $H$ and a lightweight termination-classifier interface, extending the F-score to a horizon-discounted form. We deploy BalanceRoute on a 144-NPU cluster and evaluate against vLLM baselines on both a proprietary production trace and the public Azure-2024 trace. Across both workloads, BalanceRoute substantially reduces average DP imbalance and improves end-to-end serving throughput.
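
To make the barrier effect concrete, here is a minimal sketch under simple assumptions: each worker's step time is proportional to a scalar load (for example, tokens in flight), and every worker waits for the slowest one. The function names and the greedy baseline router are hypothetical and are not BalanceRoute, BR-0, or BR-H; the F-score is not specified in the abstract and is not modeled here.

```python
def step_latency(loads):
    """Barrier latency of one decode step: set by the most heavily loaded worker."""
    return max(loads)


def wasted_fraction(loads):
    """Fraction of total worker-time spent idle at the barrier: 1 - mean/max."""
    peak = max(loads)
    if peak == 0:
        return 0.0
    return 1.0 - (sum(loads) / len(loads)) / peak


def route_least_loaded(loads, request_load):
    """Naive greedy baseline: send the request to the currently lightest worker."""
    target = min(range(len(loads)), key=loads.__getitem__)
    loads[target] += request_load
    return target


balanced = [10, 10, 10, 10]
skewed = [20, 10, 10, 0]          # same total work, uneven placement
print(step_latency(balanced), wasted_fraction(balanced))
print(step_latency(skewed), wasted_fraction(skewed))
```

Both placements carry the same total load, but the skewed one doubles the step latency and leaves half of the aggregate worker-time idle. A greedy router like the one above ignores the stickiness and load growth that the abstract identifies as the hard part of the problem.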

LG · Jan 29
Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

Chendong Song, Meixuan Wang, Hang Zhou et al.

Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary (token context grows and requests are continuously replenished with random lengths) while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.

CL · Nov 7, 2025
LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Wei Shao, Lingchao Zheng, Pengyu Wang et al.

Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model architectures, and system frameworks, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods accelerate processing through text segmentation and multi-process tokenization, but they suffer from inconsistent results due to boundary artifacts that occur after merging. To address this, we propose LoPT, a novel Lossless Parallel Tokenization framework that ensures output identical to standard sequential tokenization. Our approach employs character-position-based matching and dynamic chunk length adjustment to align and merge tokenized segments accurately. Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization. We also provide a theoretical proof of consistency and comprehensive analytical studies to validate the robustness of our method.
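
The boundary artifacts mentioned above can be reproduced with a toy example. This sketch assumes a hypothetical six-entry vocabulary and a greedy longest-match tokenizer, which is far simpler than the subword tokenizers used by real LLMs. It shows only the problem (naive chunking changes the token sequence), not LoPT's character-position matching or chunk-length adjustment, which the abstract does not detail.

```python
VOCAB = {"a", "b", "c", "ab", "ca", "abc"}  # hypothetical toy vocabulary


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization, scanning left to right."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


def naive_parallel(text, chunk_len):
    """Tokenize fixed-size chunks independently and concatenate the results."""
    out = []
    for start in range(0, len(text), chunk_len):
        out.extend(tokenize(text[start:start + chunk_len]))
    return out


text = "abcab"
print(tokenize(text))           # sequential reference
print(naive_parallel(text, 2))  # chunked: a cut inside "abc" changes the tokens
```

Both outputs concatenate back to the same string, yet the token sequences differ because a chunk boundary falls inside a token the sequential pass would have produced. A lossless scheme has to detect and repair exactly these disagreements.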

CL · Oct 4, 2025
Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models

Canhui Wu, Qiong Cao, Chang Li et al.

Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking." Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce Step Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by 69.7%.

LG · Nov 30, 2024
Fine-Tuning Pre-trained Large Time Series Models for Prediction of Wind Turbine SCADA Data

Yuwei Fan, Tao Song, Chenlong Feng et al.

The remarkable achievements of large models in the fields of natural language processing (NLP) and computer vision (CV) have sparked interest in their application to time series forecasting within industrial contexts. This paper explores the application of a pre-trained large time series model, Timer, which was initially trained on a wide range of time series data from multiple domains, to the prediction of Supervisory Control and Data Acquisition (SCADA) data collected from wind turbines. The model was fine-tuned on SCADA datasets sourced from two wind farms, which exhibited differing characteristics, and its accuracy was subsequently evaluated. Additionally, the impact of data volume was studied to evaluate the few-shot ability of Timer. Finally, an application study on one-turbine fine-tuning for whole-plant prediction was implemented, where both few-shot and cross-turbine generalization capacities are required. The results reveal that the pre-trained large model does not consistently outperform other baseline models in prediction accuracy, whether or not data is abundant, but demonstrates superior performance in the application study. This result underscores the distinctive advantages of the pre-trained large time series model in facilitating swift deployment.

AI · Jun 1, 2024
Domain-specific ReAct for physics-integrated iterative modeling: A case study of LLM agents for gas path analysis of gas turbines

Tao Song, Yuwei Fan, Chenlong Feng et al.

This study explores the application of large language models (LLMs) with callable tools in the energy and power engineering domain, focusing on gas path analysis of gas turbines. We developed a dual-agent tool-calling process to integrate expert knowledge, predefined tools, and LLM reasoning. We evaluated various LLMs, including Llama3, Qwen1.5 and GPT. Smaller models struggled with tool usage and parameter extraction, while larger models demonstrated favorable capabilities. All models faced challenges with complex, multi-component problems. Based on the test results, we infer that LLMs with nearly 100 billion parameters could meet professional scenario requirements with fine-tuning and advanced prompt design. Continued development is likely to enhance their accuracy and effectiveness, paving the way for more robust AI-driven solutions.

COMP-PH · Nov 27, 2019
Solving Inverse Wave Scattering with Deep Learning

Yuwei Fan, Lexing Ying

This paper proposes a neural network approach for solving two classical problems in two-dimensional inverse wave scattering: the far field pattern problem and seismic imaging. The mathematical problem of inverse wave scattering is to recover the scatterer field of a medium based on the boundary measurement of the scattered wave from the medium, which is high-dimensional and nonlinear. For the far field pattern problem under the circular experimental setup, a perturbative analysis shows that the forward map can be approximated by a vectorized convolution operator in the angular direction. Motivated by this and filtered back-projection, we propose an effective neural network architecture for the inverse map using the recently introduced BCR-Net along with the standard convolution layers. Analogously for the seismic imaging problem, we propose a similar neural network architecture under the rectangular domain setup with a depth-dependent background velocity. Numerical results demonstrate the efficiency of the proposed neural networks.

NA · Nov 25, 2019
Solving Traveltime Tomography with Deep Learning

Yuwei Fan, Lexing Ying

This paper introduces a neural network approach for solving two-dimensional traveltime tomography (TT) problems based on the eikonal equation. The mathematical problem of TT is to recover the slowness field of a medium based on the boundary measurement of the traveltimes of waves going through the medium. This inverse map is high-dimensional and nonlinear. For the circular tomography geometry, a perturbative analysis shows that the forward map can be approximated by a vectorized convolution operator in the angular direction. Motivated by this and filtered back-projection, we propose an effective neural network architecture for the inverse map using the recently proposed BCR-Net, with weights learned from training datasets. Numerical results demonstrate the efficiency of the proposed neural networks.

COMP-PH · Oct 10, 2019
Solving Optical Tomography with Deep Learning

Yuwei Fan, Lexing Ying

This paper presents a neural network approach for solving two-dimensional optical tomography (OT) problems based on the radiative transfer equation. The mathematical problem of OT is to recover the optical properties of an object based on the albedo operator that is accessible from boundary measurements. Both the forward map from the optical properties to the albedo operator and the inverse map are high-dimensional and nonlinear. For the circular tomography geometry, a perturbative analysis shows that the forward map can be approximated by a vectorized convolution operator in the angular direction. Motivated by this, we propose effective neural network architectures for the forward and inverse maps based on convolution layers, with weights learned from training datasets. Numerical results demonstrate the efficiency of the proposed neural networks.

NA · Jun 16, 2019
Meta-learning Pseudo-differential Operators with Deep Neural Networks

Jordi Feliu-Faba, Yuwei Fan, Lexing Ying

This paper introduces a meta-learning approach for parameterized pseudo-differential operators with deep neural networks. With the help of the nonstandard wavelet form, the pseudo-differential operators can be approximated in a compressed form with a collection of vectors. The nonlinear map from the parameter to this collection of vectors and the wavelet transform are learned together from a small number of matrix-vector multiplications of the pseudo-differential operator. Numerical results for Green's functions of elliptic partial differential equations and the radiative transfer equations demonstrate the efficiency and accuracy of the proposed approach.

COMP-PH · Jun 6, 2019
Solving Electrical Impedance Tomography with Deep Learning

Yuwei Fan, Lexing Ying

This paper introduces a new approach for solving electrical impedance tomography (EIT) problems using deep neural networks. The mathematical problem of EIT is to invert the electrical conductivity from the Dirichlet-to-Neumann (DtN) map. Both the forward map from the electrical conductivity to the DtN map and the inverse map are high-dimensional and nonlinear. Motivated by the linear perturbative analysis of the forward map and based on a numerically low-rank property, we propose compact neural network architectures for the forward and inverse maps for both 2D and 3D problems. Numerical results demonstrate the efficiency of the proposed neural networks.

NA · Oct 20, 2018
BCR-Net: a neural network based on the nonstandard wavelet form

Yuwei Fan, Cindy Orozco Bohorquez, Lexing Ying

This paper proposes a novel neural network architecture inspired by the nonstandard form proposed by Beylkin, Coifman, and Rokhlin in [Communications on Pure and Applied Mathematics, 44(2), 141-183]. The nonstandard form is a highly effective wavelet-based compression scheme for linear integral operators. In this work, we first represent the matrix-vector product algorithm of the nonstandard form as a linear neural network where every scale of the multiresolution computation is carried out by a locally connected linear sub-network. In order to address nonlinear problems, we propose an extension, called BCR-Net, by replacing each linear sub-network with a deeper and more powerful nonlinear one. Numerical results demonstrate the efficiency of the new architecture by approximating nonlinear maps that arise in homogenization theory and stochastic computation.