CVMay 20
Rethinking Cross-Layer Information Routing in Diffusion TransformersChao Xu, Maohua Li, Qirui Li et al.
Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
NEJun 21, 2022
On the Intrinsic Structures of Spiking Neural NetworksShao-Qun Zhang, Jia-Yi Chen, Jin-Hui Wu et al.
Recent years have emerged a surge of interest in SNNs owing to their remarkable potential to handle time-dependent and event-driven data. The performance of SNNs hinges not only on selecting an apposite architecture and fine-tuning connection weights, similar to conventional ANNs, but also on the meticulous configuration of intrinsic structures within spiking computations. However, there has been a dearth of comprehensive studies examining the impact of intrinsic structures. Consequently, developers often find it challenging to apply a standardized configuration of SNNs across diverse datasets or tasks. This work delves deep into the intrinsic structures of SNNs. Initially, we unveil two pivotal components of intrinsic structures: the integration operation and firing-reset mechanism, by elucidating their influence on the expressivity of SNNs. Furthermore, we draw two key conclusions: the membrane time hyper-parameter is intimately linked to the eigenvalues of the integration operation, dictating the functional topology of spiking dynamics, and various hyper-parameters of the firing-reset mechanism govern the overall firing capacity of an SNN, mitigating the injection ratio or sampling density of input data. These findings elucidate why the efficacy of SNNs hinges heavily on the configuration of intrinsic structures and lead to a recommendation that enhancing the adaptability of these structures contributes to improving the overall performance and applicability of SNNs. Inspired by this recognition, we propose two feasible approaches to enhance SNN learning. These involve leveraging self-connection architectures and employing stochastic spiking neurons to augment the adaptability of the integration operation and firing-reset mechanism, respectively. We verify the effectiveness of the proposed methods from perspectives of theory and practice.
LGOct 27, 2022
On the Approximation and Complexity of Deep Neural Networks to Invariant FunctionsGao Zhang, Jin-Hui Wu, Shao-Qun Zhang
Recent years have witnessed a hot wave of deep neural networks in various domains; however, it is not yet well understood theoretically. A theoretical characterization of deep neural networks should point out their approximation ability and complexity, i.e., showing which architecture and size are sufficient to handle the concerned tasks. This work takes one step on this direction by theoretically studying the approximation and complexity of deep neural networks to invariant functions. We first prove that the invariant functions can be universally approximated by deep neural networks. Then we show that a broad range of invariant functions can be asymptotically approximated by various types of neural network models that includes the complex-valued neural networks, convolutional neural networks, and Bayesian neural networks using a polynomial number of parameters or optimization iterations. We also provide a feasible application that connects the parameter estimation and forecasting of high-resolution signals with our theoretical conclusions. The empirical results obtained on simulation experiments demonstrate the effectiveness of our method.
QMJun 15, 2024Code
Horizon-wise Learning Paradigm Promotes Gene Splicing IdentificationQi-Jie Li, Qian Sun, Shao-Qun Zhang
Identifying gene splicing is a core and significant task confronted in modern collaboration between artificial intelligence and bioinformatics. Past decades have witnessed great efforts on this concern, such as the bio-plausible splicing pattern AT-CG and the famous SpliceAI. In this paper, we propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation, improving accuracy and efficiency. The experiments conducted on the real-world Human dataset show that our proposed H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20\%. The source code is available from this link.
NEApr 26
Generalization Bounds of Spiking Neural Networks via Rademacher ComplexityShao-Qun Zhang, Zhi-Hua Zhou
Spiking Neural Networks (SNNs) have garnered increasing attention as one of bio-inspired models due to their great potential in neuromorphic computing and sparse computation. Many practical algorithms and techniques have been developed; however, theoretical understandings of the generalization, that is, the extent to which SNNs perform well on unseen data, are far from clear. Recent advances disclosed an excitation-dependent and architecture-related generalization bound such that the Rademacher complexity of SNNs with stochastic firing can be upper bounded by an exponential function relative to the excitation probability and the architecture depth. In this paper, we theoretically investigate the generalization bounds of SNNs with several integration-and-fire schemes via Rademacher complexity. We recognize that the empirical Rademacher complexity of SNNs is close to the SNN configurations, which is exponential to the network depth and the maximum time duration of received spike sequences, superlinear and subquadratic to the network width, polynomial to the parameter norm, inverse-linear to the number of training samples, and independent of the computations within spiking neurons, achieving a more precise rate than conventional studies. Our theoretical results may support the scope of SNN theories and shed some insight into the development of SNNs.
LGNov 5, 2025
Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural KernelsYong-Ming Tian, Shuang Liang, Shao-Qun Zhang et al.
While deep learning has achieved remarkable success across a wide range of applications, its theoretical understanding of representation learning remains limited. Deep neural kernels provide a principled framework to interpret over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived by gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been predominantly confined to the infinite-width regime, while overlooking the representational role of network depth. To address this gap, we propose a depth-induced NTK kernel based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectrum properties of the proposed kernel, which stabilizes the kernel dynamics and mitigates degeneration. Experimental results further underscore the effectiveness of our proposed method. Our findings significantly extend the existing landscape of the neural kernel theory and provide an in-depth understanding of deep learning and the scaling law.
LGApr 10
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware DistillationShu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng et al.
Recent years have witnessed an increasing interest in deploying LLMs on resource-constrained devices, among which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ) that calibrates quantized parameters on a small dataset without retraining but suffers from severe performance degradation below 4-bit, Quantization-Aware Training (QAT) that searches low-bit parameters using surrogate gradients but demands substantial computational resources, and Quantization-Aware Distillation that integrates QAT with knowledge transfer from a full-precision teacher but manually selects features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for the fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor with 1.88-bit surpasses all contenders with the 3-bit precision, especially outperforms the leading 2-bit PTQ methods by 11.3 points, within a 4-10$\times$ lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit width; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.
LGMar 26, 2024
A Unified Kernel for Neural Network LearningShao-Qun Zhang, Zong-Yi Chen, Yong-Ming Tian et al.
Past decades have witnessed a great interest in the distinction and connection between neural network learning and kernel learning. Recent advancements have made theoretical progress in connecting infinite-wide neural networks and Gaussian processes. Two predominant approaches have emerged: the Neural Network Gaussian Process (NNGP) and the Neural Tangent Kernel (NTK). The former, rooted in Bayesian inference, represents a zero-order kernel, while the latter, grounded in the tangent space of gradient descents, is a first-order kernel. In this paper, we present the Unified Neural Kernel (UNK), which {is induced by the inner product of produced variables and characterizes the learning dynamics of neural networks with gradient descents and parameter initialization.} The proposed UNK kernel maintains the limiting properties of both NNGP and NTK, exhibiting behaviors akin to NTK with a finite learning step and converging to NNGP as the learning step approaches infinity. Besides, we also theoretically characterize the uniform tightness and learning convergence of the UNK kernel, providing comprehensive insights into this unified kernel. Experimental results underscore the effectiveness of our proposed method.
CVOct 23, 2025
TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled KnowledgeShu-Hao Zhang, Wei-Cheng Tang, Chen Wu et al.
Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose the TernaryCLIP, a lightweight computational framework that converts connection weights of both vision and text encoders of CLIP into the ternary format, instead of full-precision or floating ones. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computations. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99\% ternarized weights with 1.58-bit representation, 16.98 $\times$ compression ratio, 2.3 $\times$ inference acceleration, 16 $\times$ storage reduction, 10 $\times$ memory optimization, and 60\% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
LGMay 19, 2025
Theoretical Investigation on Inductive Bias of Isolation ForestQin-Cheng Zheng, Shao-Qun Zhang, Shen-Huan Lyu et al.
Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector, primarily owing to its remarkable runtime efficiency and superior performance in large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest's success remains unclear. This paper focuses on the inductive bias of iForest, which theoretically elucidates under what circumstances and to what extent iForest works well. The key is to formulate the growth process of iForest, where the split dimensions and split values are randomly selected. We model the growth process of iForest as a random walk, enabling us to derive the expected depth function, which is the outcome of iForest, using transition probabilities. The case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to $k$-Nearest Neighbor. Our study provides a theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.
LGNov 11, 2021
Theoretical Exploration of Flexible Transmitter ModelJin-Hui Wu, Shao-Qun Zhang, Yuan Jiang et al.
Neural network models generally involve two important components, i.e., network architecture and neuron model. Although there are abundant studies about network architectures, only a few neuron models have been developed, such as the MP neuron model developed in 1943 and the spiking neuron model developed in the 1950s. Recently, a new bio-plausible neuron model, Flexible Transmitter (FT) model, has been proposed. It exhibits promising behaviors, particularly on temporal-spatial signals, even when simply embedded into the common feedforward network architecture. This paper attempts to understand the properties of the FT network (FTNet) theoretically. Under mild assumptions, we show that: i) FTNet is a universal approximator; ii) the approximation complexity of FTNet can be exponentially smaller than those of commonly-used real-valued neural networks with feedforward/recurrent architectures and is of the same order in the worst case; iii) any local minimum of FTNet is the global minimum, implying that it is possible to identify global minima by local search algorithms.
MLNov 8, 2021
ARISE: ApeRIodic SEmi-parametric Process for Efficient Markets without Periodogram and Gaussianity AssumptionsShao-Qun Zhang, Zhi-Hua Zhou
Mimicking and learning the long-term memory of efficient markets is a fundamental problem in the interaction between machine learning and financial economics to sequential data. Despite the prominence of this issue, current treatments either remain largely limited to heuristic techniques or rely significantly on periodogram or Gaussianty assumptions. In this paper, we present the ApeRIodic SEmi-parametric (ARISE) process for investigating efficient markets. The ARISE process is formulated as an infinite-sum function of some known processes and employs the aperiodic spectrum estimation to determine the key hyper-parameters, thus possessing the power and potential of modeling the price data with long-term memory, non-stationarity, and aperiodic spectrum. We further theoretically show that the ARISE process has the mean-square convergence, consistency, and asymptotic normality without periodogram and Gaussianity assumptions. In practice, we apply the ARISE process to identify the efficiency of real-world markets. Besides, we also provide two alternative ARISE applications: studying the long-term memorability of various machine-learning models and developing a latent state-space model for inference and forecasting of time series. The numerical experiments confirm the superiority of our proposed approaches.
LGSep 30, 2021
LIFE: Learning Individual Features for Multivariate Time Series Prediction with Missing ValuesZhao-Yu Zhang, Shao-Qun Zhang, Yuan Jiang et al.
Multivariate time series (MTS) prediction is ubiquitous in real-world fields, but MTS data often contains missing values. In recent years, there has been an increasing interest in using end-to-end models to handle MTS with missing values. To generate features for prediction, existing methods either merge all input dimensions of MTS or tackle each input dimension independently. However, both approaches are hard to perform well because the former usually produce many unreliable features and the latter lacks correlated information. In this paper, we propose a Learning Individual Features (LIFE) framework, which provides a new paradigm for MTS prediction with missing values. LIFE generates reliable features for prediction by using the correlated dimensions as auxiliary information and suppressing the interference from uncorrelated dimensions with missing values. Experiments on three real-world data sets verify the superiority of LIFE to existing state-of-the-art models.
LGAug 29, 2021
Neural Network Gaussian Processes by Increasing DepthShao-Qun Zhang, Fei Wang, Feng-Lei Fan
Recent years have witnessed an increasing interest in the correspondence between infinitely wide networks and Gaussian processes. Despite the effectiveness and elegance of the current neural network Gaussian process theory, to the best of our knowledge, all the neural network Gaussian processes are essentially induced by increasing width. However, in the era of deep learning, what concerns us more regarding a neural network is its depth as well as how depth impacts the behaviors of a network. Inspired by a width-depth symmetry consideration, we use a shortcut network to show that increasing the depth of a neural network can also give rise to a Gaussian process, which is a valuable addition to the existing theory and contributes to revealing the true picture of deep learning. Beyond the proposed Gaussian process by depth, we theoretically characterize its uniform tightness property and the smallest eigenvalue of the Gaussian process kernel. These characterizations can not only enhance our understanding of the proposed depth-induced Gaussian process but also pave the way for future applications. Lastly, we examine the performance of the proposed Gaussian process by regression experiments on two benchmark data sets.
LGAug 15, 2021
Towards Understanding Theoretical Advantages of Complex-Reaction NetworksShao-Qun Zhang, Wei Gao, Zhi-Hua Zhou
Complex-valued neural networks have attracted increasing attention in recent years, while it remains open on the advantages of complex-valued neural networks in comparison with real-valued networks. This work takes one step on this direction by introducing the \emph{complex-reaction network} with fully-connected feed-forward architecture. We prove the universal approximation property for complex-reaction networks, and show that a class of radial functions can be approximated by a complex-reaction network using the polynomial number of parameters, whereas real-valued networks need at least exponential parameters to reach the same approximation level. For empirical risk minimization, our theoretical result shows that the critical point set of complex-reaction networks is a proper subset of that of real-valued networks, which may show some insights on finding the optimal solutions more easily for complex-reaction networks.
NEApr 8, 2020
Flexible Transmitter NetworkShao-Qun Zhang, Zhi-Hua Zhou
Current neural networks are mostly built upon the MP model, which usually formulates the neuron as executing an activation function on the real-valued weighted aggregation of signals received from other neurons. In this paper, we propose the Flexible Transmitter (FT) model, a novel bio-plausible neuron model with flexible synaptic plasticity. The FT model employs a pair of parameters to model the transmitters between neurons and puts up a neuron-exclusive variable to record the regulated neurotrophin density, which leads to the formulation of the FT model as a two-variable two-valued function, taking the commonly-used MP neuron model as its special case. This modeling manner makes the FT model not only biologically more realistic, but also capable of handling complicated data, even time series. To exhibit its power and potential, we present the Flexible Transmitter Network (FTNet), which is built on the most common fully-connected feed-forward architecture taking the FT model as the basic building block. FTNet allows gradient calculation and can be implemented by an improved back-propagation algorithm in the complex-valued domain. Experiments on a board range of tasks show the superiority of the proposed FTNet. This study provides an alternative basic building block in neural networks and exhibits the feasibility of developing artificial neural networks with neuronal plasticity.
NESep 18, 2019
Bifurcation Spiking Neural NetworkShao-Qun Zhang, Zhao-Yu Zhang, Zhi-Hua Zhou
Spiking neural networks (SNNs) has attracted much attention due to its great potential of modeling time-dependent signals. The firing rate of spiking neurons is decided by control rate which is fixed manually in advance, and thus, whether the firing rate is adequate for modeling actual time series relies on fortune. Though it is demanded to have an adaptive control rate, it is a non-trivial task because the control rate and the connection weights learned during the training process are usually entangled. In this paper, we show that the firing rate is related to the eigenvalue of the spike generation function. Inspired by this insight, by enabling the spike generation function to have adaptable eigenvalues rather than parametric control rates, we develop the Bifurcation Spiking Neural Network (BSNN), which has an adaptive firing rate and is insensitive to the setting of control rates. Experiments validate the effectiveness of BSNN on a broad range of tasks, showing that BSNN achieves superior performance to existing SNNs and is robust to the setting of control rates.