Linbo Qiao

LG
h-index12
20papers
278citations
Novelty53%
AI Score56

20 Papers

LGJun 10, 2022
Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models

Zhiquan Lai, Shengwei Li, Xudong Tang et al. · pku, tencent-ai

Foundation models are becoming the dominant deep learning technologies. Pretraining a foundation model is always time-consumed due to the large scale of both the model parameter and training dataset. Besides being computing-intensive, the training process is extremely memory-intensive and communication-intensive. These features make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism and tensor model parallelism, to achieve high training efficiency. To achieve this goal, some custom software frameworks such as Megatron-LM and DeepSpeed are developed. However, current 3D parallelism frameworks still meet two issues: i) they are not transparent to model developers, which need to manually modify the model to parallelize training. ii) their utilization of computation, GPU memory and network bandwidth are not sufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model. Merak also presents the non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit available training resources, including shifted critical path pipeline schedule that brings a higher computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show Merak can speedup the training performance over the state-of-the-art 3D parallelism frameworks of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.

CLFeb 26, 2024Code
Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models

Yifu Gao, Linbo Qiao, Zhigang Kan et al.

Temporal knowledge graph question answering (TKGQA) poses a significant challenge task, due to the temporal constraints hidden in questions and the answers sought from dynamic structured knowledge. Although large language models (LLMs) have made considerable progress in their reasoning ability over structured data, their application to the TKGQA task is a relatively unexplored area. This paper first proposes a novel generative temporal knowledge graph question answering framework, GenTKGQA, which guides LLMs to answer temporal questions through two phases: Subgraph Retrieval and Answer Generation. First, we exploit LLM's intrinsic knowledge to mine temporal constraints and structural links in the questions without extra training, thus narrowing down the subgraph search space in both temporal and structural dimensions. Next, we design virtual knowledge indicators to fuse the graph neural network signals of the subgraph and the text representations of the LLM in a non-shallow way, which helps the open-source LLM deeply understand the temporal order and structural dependencies among the retrieved facts through instruction tuning. Experimental results on two widely used datasets demonstrate the superiority of our model.

CLMar 2Code
Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering

Xufei Lv, Jiahui Yang, Yifu Gao et al.

Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available on https://github.com/AT2QA-Official-Code/AT2QA-Official-Code

CLDec 7, 2025
Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs

Wanyang Hong, Zhaoning Zhang, Yi Chen et al.

Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.

67.1CLApr 9
GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

Kaiyuan Tian, Yu Tang, Gongqingjian Jiang et al.

Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97\%.

71.5LGApr 9
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

Baihui Liu, Kaiyuan Tian, Wei Wang et al.

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to their sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations potentially lead to severe model performance degradation. In this work, we introduce the concept of \emph{activation budget} as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that optimizes budget allocation coordinately at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Especially, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.

LGJan 21, 2025
A Survey on Memory-Efficient Transformer-Based Model Training in AI for Science

Kaiyuan Tian, Linbo Qiao, Baihui Liu et al.

Scientific research faces high costs and inefficiencies with traditional methods, but the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews transformer-based LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continuous expansion of model size has led to significant memory demands, hindering further development and application of LLMs for science. This survey systematically reviews and categorizes memory-efficient pre-training techniques for large-scale transformers, including algorithm-level, system-level, and hardware-software co-optimization. Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy. By bridging model efficiency and scientific application needs, we hope to provide insights for scalable and cost-effective LLM training in AI for science.

LGNov 17, 2025
ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

Zhixin Ou, Peng Liang, Jianchen Han et al.

Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing neither communication-parallelization cancellation on short sequences nor out-of-memory on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence. It first implements the modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques together, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.

CVJan 29, 2024
TFDMNet: A Novel Network Structure Combines the Time Domain and Frequency Domain Features

Hengyue Pan, Yixin Chen, Zhiliang Tian et al.

Convolutional neural network (CNN) has achieved impressive success in computer vision during the past few decades. The image convolution operation helps CNNs to get good performance on image-related tasks. However, it also has high computation complexity and hard to be parallelized. This paper proposes a novel Element-wise Multiplication Layer (EML) to replace convolution layers, which can be trained in the frequency domain. Theoretical analyses show that EMLs lower the computation complexity and easier to be parallelized. Moreover, we introduce a Weight Fixation mechanism to alleviate the problem of over-fitting, and analyze the working behavior of Batch Normalization and Dropout in the frequency domain. To get the balance between the computation complexity and memory usage, we propose a new network structure, namely Time-Frequency Domain Mixture Network (TFDMNet), which combines the advantages of both convolution layers and EMLs. Experimental results imply that TFDMNet achieves good performance on MNIST, CIFAR-10 and ImageNet databases with less number of operations comparing with corresponding CNNs.

LGMar 30, 2022
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation

Yu Tang, Chenyu Wang, Yufan Zhang et al.

The further development of deep neural networks is hampered by the limited GPU memory resource. Therefore, the optimization of GPU memory resources is highly demanded. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain:1)The efficiency of recomputation is limited for both static and dynamic methods. 2)Swapping requires offloading parameters manually, which incurs a great time cost. 3) There is no such dynamic and fine-grained method that involves tensor swapping together with tensor recomputation nowadays. To remedy the above issues, we propose a novel scheduler manager named DELTA(Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to make a reasonable dynamic runtime scheduler on the combination of tensor swapping and tensor recomputation without user oversight. In DELTA, we propose a filter algorithm to select the optimal tensors to be released out of GPU memory and present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately considered to overcome the time cost caused by swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method to a great extent but also gets comparable convergence results as the baseline with acceptable time delay. Also, DELTA gains 2.04$\times$ maximum batchsize when training ResNet-50 and 2.25$\times$ when training ResNet-101 compared with the baseline. Besides, comparisons between the swapping cost and recomputation cost in our experiments demonstrate the importance of making a reasonable dynamic scheduler on tensor swapping and tensor recomputation, which refutes the arguments in some related work that swapping should be the first and best choice.

LGJan 30, 2021
Inertial Proximal Deep Learning Alternating Minimization for Efficient Neutral Network Training

Linbo Qiao, Tao Sun, Hengyue Pan et al.

In recent years, the Deep Learning Alternating Minimization (DLAM), which is actually the alternating minimization applied to the penalty form of the deep neutral networks training, has been developed as an alternative algorithm to overcome several drawbacks of Stochastic Gradient Descent (SGD) algorithms. This work develops an improved DLAM by the well-known inertial technique, namely iPDLAM, which predicts a point by linearization of current and last iterates. To obtain further training speed, we apply a warm-up technique to the penalty parameter, that is, starting with a small initial one and increasing it in the iterations. Numerical results on real-world datasets are reported to demonstrate the efficiency of our proposed algorithm.

LGJun 10, 2020
P-ADMMiRNN: Training RNN with Stable Convergence via An Efficient and Paralleled ADMM Approach

Yu Tang, Zhigang Kan, Dequan Sun et al.

It is hard to train Recurrent Neural Network (RNN) with stable convergence and avoid gradient vanishing and exploding problems, as the weights in the recurrent unit are repeated from iteration to iteration. Moreover, RNN is sensitive to the initialization of weights and bias, which brings difficulties in training. The Alternating Direction Method of Multipliers (ADMM) has become a promising algorithm to train neural networks beyond traditional stochastic gradient algorithms with the gradient-free features and immunity to unsatisfactory conditions. However, ADMM could not be applied to train RNN directly since the state in the recurrent unit is repetitively updated over timesteps. Therefore, this work builds a new framework named ADMMiRNN upon the unfolded form of RNN to address the above challenges simultaneously. We also provide novel update rules and theoretical convergence analysis. We explicitly specify essential update rules in the iterations of ADMMiRNN with constructed approximation techniques and solutions to each sub-problem instead of vanilla ADMM. Numerical experiments are conducted on MNIST, IMDb, and text classification tasks. ADMMiRNN achieves convergent results and outperforms the compared baselines. Furthermore, ADMMiRNN trains RNN more stably without gradient vanishing or exploding than stochastic gradient algorithms. We also provide a distributed paralleled algorithm regarding ADMMiRNN, named P-ADMMiRNN, including Synchronous Parallel ADMMiRNN (SP-ADMMiRNN) and Asynchronous Parallel ADMMiRNN (AP-ADMMiRNN), which is the first to train RNN with ADMM in an asynchronous parallel manner. The source code is publicly available.

CLJun 2, 2020
Event Arguments Extraction via Dilate Gated Convolutional Neural Network with Enhanced Local Features

Zhigang Kan, Linbo Qiao, Sen Yang et al.

Event Extraction plays an important role in information-extraction to understand the world. Event extraction could be split into two subtasks: one is event trigger extraction, the other is event arguments extraction. However, the F-Score of event arguments extraction is much lower than that of event trigger extraction, i.e. in the most recent work, event trigger extraction achieves 80.7%, while event arguments extraction achieves only 58%. In pipelined structures, the difficulty of event arguments extraction lies in its lack of classification feature, and the much higher computation consumption. In this work, we proposed a novel Event Extraction approach based on multi-layer Dilate Gated Convolutional Neural Network (EE-DGCNN) which has fewer parameters. In addition, enhanced local information is incorporated into word features, to assign event arguments roles for triggers predicted by the first subtask. The numerical experiments demonstrated significant performance improvement beyond state-of-art event extraction approaches on real-world datasets. Further analysis of extraction procedure is presented, as well as experiments are conducted to analyze impact factors related to the performance improvement.

MLSep 11, 2018
An Efficient ADMM-Based Algorithm to Nonconvex Penalized Support Vector Machines

Lei Guan, Linbo Qiao, Dongsheng Li et al.

Support vector machines (SVMs) with sparsity-inducing nonconvex penalties have received considerable attentions for the characteristics of automatic classification and variable selection. However, it is quite challenging to solve the nonconvex penalized SVMs due to their nondifferentiability, nonsmoothness and nonconvexity. In this paper, we propose an efficient ADMM-based algorithm to the nonconvex penalized SVMs. The proposed algorithm covers a large class of commonly used nonconvex regularization terms including the smooth clipped absolute deviation (SCAD) penalty, minimax concave penalty (MCP), log-sum penalty (LSP) and capped-$\ell_1$ penalty. The computational complexity analysis shows that the proposed algorithm enjoys low computational cost. Moreover, the convergence of the proposed algorithm is guaranteed. Extensive experimental evaluations on five benchmark datasets demonstrate the superior performance of the proposed algorithm to other three state-of-the-art approaches.

OCJan 23, 2018
Non-ergodic Complexity of Convex Proximal Inertial Gradient Descents

Tao Sun, Linbo Qiao, Dongsheng Li

The proximal inertial gradient descent is efficient for the composite minimization and applicable for broad of machine learning problems. In this paper, we revisit the computational complexity of this algorithm and present other novel results, especially on the convergence rates of the objective function values. The non-ergodic O(1/k) rate is proved for proximal inertial gradient descent with constant stepzise when the objective function is coercive. When the objective function fails to promise coercivity, we prove the sublinear rate with diminishing inertial parameters. In the case that the objective function satisfies optimal strong convexity condition (which is much weaker than the strong convexity), the linear convergence is proved with much larger and general stepsize than previous literature. We also extend our results to the multi-block version and present the computational complexity. Both cyclic and stochastic index selection strategies are considered.

LGJan 22, 2018
On the Iteration Complexity Analysis of Stochastic Primal-Dual Hybrid Gradient Approach with High Probability

Linbo Qiao, Tianyi Lin, Qi Qin et al.

In this paper, we propose a stochastic Primal-Dual Hybrid Gradient (PDHG) approach for solving a wide spectrum of regularized stochastic minimization problems, where the regularization term is composite with a linear function. It has been recognized that solving this kind of problem is challenging since the closed-form solution of the proximal mapping associated with the regularization term is not available due to the imposed linear composition, and the per-iteration cost of computing the full gradient of the expected objective function is extremely high when the number of input data samples is considerably large. Our new approach overcomes these issues by exploring the special structure of the regularization term and sampling a few data points at each iteration. Rather than analyzing the convergence in expectation, we provide the detailed iteration complexity analysis for the cases of both uniformly and non-uniformly averaged iterates with high probability. This strongly supports the good practical performance of the proposed approach. Numerical experiments demonstrate that the efficiency of stochastic PDHG, which outperforms other competing algorithms, as expected by the high-probability convergence analysis.

CVDec 4, 2017
Learning for Disparity Estimation through Feature Constancy

Zhengfa Liang, Yiliu Feng, Yulan Guo et al.

Stereo matching algorithms usually consist of four steps, including matching cost calculation, matching cost aggregation, disparity calculation, and disparity refinement. Existing CNN-based methods only adopt CNN to solve parts of the four steps, or use different networks to deal with different steps, making them difficult to obtain the overall optimal solution. In this paper, we propose a network architecture to incorporate all steps of stereo matching. The network consists of three parts. The first part calculates the multi-scale shared features. The second part performs matching cost calculation, matching cost aggregation and disparity calculation to estimate the initial disparity using shared features. The initial disparity and the shared features are used to calculate the feature constancy that measures correctness of the correspondence between two input images. The initial disparity and the feature constancy are then fed to a sub-network to refine the initial disparity. The proposed method has been evaluated on the Scene Flow and KITTI datasets. It achieves the state-of-the-art performance on the KITTI 2012 and KITTI 2015 benchmarks while maintaining a very fast running time.

LGAug 20, 2017
Stochastic Primal-Dual Proximal ExtraGradient Descent for Compositely Regularized Optimization

Tianyi Lin, Linbo Qiao, Teng Zhang et al.

We consider a wide range of regularized stochastic minimization problems with two regularization terms, one of which is composed with a linear function. This optimization model abstracts a number of important applications in artificial intelligence and machine learning, such as fused Lasso, fused logistic regression, and a class of graph-guided regularized minimization. The computational challenges of this model are in two folds. On one hand, the closed-form solution of the proximal mapping associated with the composed regularization term or the expected objective function is not available. On the other hand, the calculation of the full gradient of the expectation in the objective is very expensive when the number of input data samples is considerably large. To address these issues, we propose a stochastic variant of extra-gradient type methods, namely \textsf{Stochastic Primal-Dual Proximal ExtraGradient descent (SPDPEG)}, and analyze its convergence property for both convex and strongly convex objectives. For general convex objectives, the uniformly average iterates generated by \textsf{SPDPEG} converge in expectation with $O(1/\sqrt{t})$ rate. While for strongly convex objectives, the uniformly and non-uniformly average iterates generated by \textsf{SPDPEG} converge with $O(\log(t)/t)$ and $O(1/t)$ rates, respectively. The order of the rate of the proposed algorithm is known to match the best convergence rate for first-order stochastic algorithms. Experiments on fused logistic regression and graph-guided regularized logistic regression problems show that the proposed algorithm performs very efficiently and consistently outperforms other competing algorithms.

CVNov 19, 2016
Semantic tracking: Single-target tracking with inter-supervised convolutional networks

Jingjing Xiao, Qiang Lan, Linbo Qiao et al.

This article presents a semantic tracker which simultaneously tracks a single target and recognises its category. In general, it is hard to design a tracking model suitable for all object categories, e.g., a rigid tracker for a car is not suitable for a deformable gymnast. Category-based trackers usually achieve superior tracking performance for the objects of that specific category, but have difficulties being generalised. Therefore, we propose a novel unified robust tracking framework which explicitly encodes both generic features and category-based features. The tracker consists of a shared convolutional network (NetS), which feeds into two parallel networks, NetC for classification and NetT for tracking. NetS is pre-trained on ImageNet to serve as a generic feature extractor across the different object categories for NetC and NetT. NetC utilises those features within fully connected layers to classify the object category. NetT has multiple branches, corresponding to multiple categories, to distinguish the tracked object from the background. Since each branch in NetT is trained by the videos of a specific category or groups of similar categories, NetT encodes category-based features for tracking. During online tracking, NetC and NetT jointly determine the target regions with the right category and foreground labels for target estimation. To improve the robustness and precision, NetC and NetT inter-supervise each other and trigger network adaptation when their outputs are ambiguous for the same image regions (i.e., when the category label contradicts the foreground/background classification). We have compared the performance of our tracker to other state-of-the-art trackers on a large-scale tracking benchmark (100 sequences)---the obtained results demonstrate the effectiveness of our proposed tracker as it outperformed other 38 state-of-the-art tracking algorithms.