Qianxiao Li

h-index1

8papers

43citations

Novelty54%

AI Score47

Ranked #33,039 of 194,257 authors (top 17%)#7,797 in LG (top 19%)

8 Papers

17.0LGFeb 27, 2023

A Brief Survey on the Approximation Theory for Sequence Modelling

Haotian Jiang, Qianxiao Li, Zhong Li et al.

We survey current developments in the approximation theory of sequence modelling in machine learning. Particular emphasis is placed on classifying existing results for various model architectures through the lens of classical approximation paradigms, and the insights one can gain from these results. We also outline some future research directions towards building a theory of sequence modelling.

6.4LGSep 29, 2024

Unifying back-propagation and forward-forward algorithms through model predictive control

Lianhai Ren, Qianxiao Li

We introduce a Model Predictive Control (MPC) framework for training deep neural networks, systematically unifying the Back-Propagation (BP) and Forward-Forward (FF) algorithms. At the same time, it gives rise to a range of intermediate training algorithms with varying look-forward horizons, leading to a performance-efficiency trade-off. We perform a precise analysis of this trade-off on a deep linear network, where the qualitative conclusions carry over to general networks. Based on our analysis, we propose a principled method to choose the optimization horizon based on given objectives and model specifications. Numerical results on various models and tasks demonstrate the versatility of our method.

7.1LGOct 8, 2025

The Effect of Attention Head Count on Transformer Approximation

Penghao Yu, Haotian Jiang, Zeyu Bao et al.

Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $ε$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/ε^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.

4.1LGOct 4, 2025

Allocation of Parameters in Transformers

Ruoxi Yu, Haotian Jiang, Jingpu Cheng et al.

Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters -- mainly attention heads and head dimensions -- should be allocated across layers to balance expressivity and efficiency. We first provide mathematical analysis on the role of early layers in information extraction from an approximation perspective, with a theoretical characterization on the trade-off between the number of heads and head dimension under a fixed parameter budget. In addition, we uncover and prove the \emph{saturation} behavior of softmax activations: Continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformers' layers, shedding light on theoretically-grounded model efficiency of Transformer-based architectures.

9.4LGJun 30, 2025

A unified framework for establishing the universal approximation of transformer-type architectures

Jingpu Cheng, Ting Lin, Zuowei Shen et al.

We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.

4.1LGJun 24, 2025

The Effect of Depth on the Expressivity of Deep Linear State-Space Models

Zeyu Bao, Penghao Yu, Haotian Jiang et al.

Deep state-space models (SSMs) have gained increasing popularity in sequence modelling. While there are numerous theoretical investigations of shallow SSMs, how the depth of the SSM affects its expressiveness remains a crucial problem. In this paper, we systematically investigate the role of depth and width in deep linear SSMs, aiming to characterize how they influence the expressive capacity of the architecture. First, we rigorously prove that in the absence of parameter constraints, increasing depth and increasing width are generally equivalent, provided that the parameter count remains within the same order of magnitude. However, under the assumption that the parameter norms are constrained, the effects of depth and width differ significantly. We show that a shallow linear SSM with large parameter norms can be represented by a deep linear SSM with smaller norms using a constructive method. In particular, this demonstrates that deep SSMs are more capable of representing targets with large norms than shallow SSMs under norm constraints. Finally, we derive upper bounds on the minimal depth required for a deep linear SSM to represent a given shallow linear SSM under constrained parameter norms. We also validate our theoretical results with numerical experiments

4.1LGFeb 2, 2025

Continuity-Preserving Convolutional Autoencoders for Learning Continuous Latent Dynamical Models from Images

Aiqing Zhu, Yuting Pan, Qianxiao Li

Continuous dynamical systems are cornerstones of many scientific and engineering disciplines. While machine learning offers powerful tools to model these systems from trajectory data, challenges arise when these trajectories are captured as images, resulting in pixel-level observations that are discrete in nature. Consequently, a naive application of a convolutional autoencoder can result in latent coordinates that are discontinuous in time. To resolve this, we propose continuity-preserving convolutional autoencoders (CpAEs) to learn continuous latent states and their corresponding continuous latent dynamical models from discrete image frames. We present a mathematical formulation for learning dynamics from image frames, which illustrates issues with previous approaches and motivates our methodology based on promoting the continuity of convolution filters, thereby preserving the continuity of the latent states. This approach enables CpAEs to produce latent states that evolve continuously with the underlying dynamics, leading to more accurate latent dynamical models. Extensive experiments across various scenarios demonstrate the effectiveness of CpAEs.

19.2LGMay 29, 2023

Approximation Rate of the Transformer Architecture for Sequence Modeling

Haotian Jiang, Qianxiao Li

The Transformer architecture is widely applied in sequence modeling applications, yet the theoretical understanding of its working principles remains limited. In this work, we investigate the approximation rate for single-layer Transformers with one head. We consider a class of non-linear relationships and identify a novel notion of complexity measures to establish an explicit Jackson-type approximation rate estimate for the Transformer. This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited for approximating. In particular, the results on approximation rates enable us to concretely analyze the differences between the Transformer and classical sequence modeling methods, such as recurrent neural networks.