LG | Apr 7, 2023 | Code
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen, Yan Sun, Zhiyuan Yu et al.
The field of deep learning has witnessed significant progress, particularly in computer vision (CV), natural language processing (NLP), and speech. The use of large-scale models trained on vast amounts of data holds immense promise for practical applications, enhancing industrial productivity and facilitating social development. Although numerous studies have explored efficient training in response to the increasing demands on computational capacity, a comprehensive summary of acceleration techniques for training deep learning models is still much anticipated. In this survey, we present a detailed review of training acceleration. We consider the fundamental update formulation and split its basic components into five main perspectives: (1) data-centric, including dataset regularization, data sampling, and data-centric curriculum learning techniques, which can significantly reduce the computational complexity of the data samples; (2) model-centric, including acceleration of basic modules, compression training, model initialization, and model-centric curriculum learning techniques, which focus on accelerating training by reducing the calculations on parameters; (3) optimization-centric, including the selection of the learning rate, the use of large batch sizes, the design of efficient objectives, and model averaging techniques, which focus on the training policy and on improving the generality of large-scale models; (4) budgeted training, including some distinctive acceleration methods for resource-constrained situations; (5) system-centric, including some efficient open-source distributed libraries and systems that provide adequate hardware support for the implementation of acceleration algorithms. By presenting this taxonomy, our survey offers a comprehensive review for understanding the general mechanisms within each component and their joint interaction.
LG | Feb 21, 2023 | Code
FedSpeed: Larger Local Interval, Less Communication Round, and Higher Generalization Accuracy
Yan Sun, Li Shen, Tiansheng Huang et al.
Federated learning is an emerging distributed machine learning framework which jointly trains a global model via a large number of local devices with data privacy protections. Its performance suffers from the non-vanishing biases introduced by inconsistent local optima and from the rugged client drifts caused by local over-fitting. In this paper, we propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems. Concretely, FedSpeed applies a prox-correction term to the current local updates to efficiently reduce the biases introduced by the prox-term, a necessary regularizer to maintain strong local consistency. Furthermore, FedSpeed merges the vanilla stochastic gradient with a perturbation computed from an extra gradient ascent step in the neighborhood, thereby alleviating the issue of local over-fitting. Our theoretical analysis indicates that the convergence rate is related to both the communication rounds $T$ and local intervals $K$, with an upper bound of $\mathcal{O}(1/T)$ when a proper local interval is set. Moreover, we conduct extensive experiments on a real-world dataset to demonstrate the efficiency of the proposed FedSpeed, which converges significantly faster than several baselines and achieves state-of-the-art (SOTA) performance in general FL experimental settings. Our code is available at https://github.com/woodenchild95/FL-Simulator.git.
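The gradient-merging idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy quadratic loss, and the names `rho` (ascent radius) and `alpha` (mixing weight) are hypothetical choices made for illustration; the prox-term and prox-correction term are omitted, so this is not the FedSpeed update itself.

```python
import math

def grad(weights, curvature):
    # Gradient of the toy loss 0.5 * sum(a_i * w_i^2) (an assumed stand-in loss).
    return [a * w for a, w in zip(curvature, weights)]

def merged_gradient(weights, curvature, rho=0.1, alpha=0.5):
    # rho and alpha are hypothetical hyperparameters for this sketch.
    g = grad(weights, curvature)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return g
    # Extra gradient ascent step: move a distance rho along the normalized gradient.
    ascent_point = [w + rho * x / norm for w, x in zip(weights, g)]
    g_ascent = grad(ascent_point, curvature)
    # Merge the vanilla gradient with the gradient at the perturbed point.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(g, g_ascent)]

merged = merged_gradient([1.0, 0.0], [2.0, 1.0])
```

On this toy problem the merged gradient is slightly larger than the vanilla one along the first coordinate, because the ascent step moves toward a region of higher loss.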
LG | Oct 4, 2023 | Code
Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models
Zihao Lin, Yan Sun, Yifan Shi et al.
With the rapid development of pre-trained models (PTMs), the efficient tuning of these models for diverse downstream applications has emerged as a pivotal research concern. Although recent investigations into prompt tuning have provided promising avenues, three salient challenges persist: (1) memory constraint: the continuous growth in the size of open-source PTMs renders fine-tuning even a fraction of their parameters challenging for many practitioners; (2) model privacy: existing PTMs often function as public API services, with their parameters inaccessible for effective or tailored fine-tuning; (3) data privacy: the fine-tuning of PTMs necessitates high-quality datasets, which are typically localized and not shared publicly. To optimally harness each local dataset while navigating memory constraints and preserving privacy, we propose Federated Black-Box Prompt Tuning (Fed-BBPT). This approach eschews reliance on parameter architectures and private dataset access, instead capitalizing on a central server that aids local users in collaboratively training a prompt generator through regular aggregation. Local users leverage API-driven learning via a zero-order optimizer, obviating the need for PTM deployment. Relative to extensive fine-tuning, Fed-BBPT sidesteps the memory challenges tied to PTM storage and fine-tuning on local machines, while tapping into comprehensive, high-quality, yet private training datasets. A thorough evaluation across 40 datasets spanning CV and NLP tasks underscores the robustness of our proposed model.
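A zero-order optimizer of the kind mentioned above updates parameters using only function evaluations, which is what makes it usable behind an API. The following is a minimal sketch of a generic two-point zeroth-order step, assuming a toy quadratic `loss` as a stand-in for the black-box score; the function names, `mu`, and `lr` are hypothetical and this is not the Fed-BBPT training procedure.

```python
import math
import random

def loss(prompt):
    # Stand-in for a black-box API score: only function values are observable.
    return sum(p * p for p in prompt)

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zeroth_order_step(prompt, rng, mu=1e-3, lr=0.1):
    # Two-point estimate of the directional derivative along a random direction u.
    u = random_unit_vector(len(prompt), rng)
    plus = [p + mu * d for p, d in zip(prompt, u)]
    minus = [p - mu * d for p, d in zip(prompt, u)]
    slope = (loss(plus) - loss(minus)) / (2.0 * mu)
    # Descend along u, scaled by the estimated slope; no gradients are needed.
    return [p - lr * slope * d for p, d in zip(prompt, u)]

rng = random.Random(0)
prompt = [1.0, -2.0, 0.5]
initial_loss = loss(prompt)
for _ in range(200):
    prompt = zeroth_order_step(prompt, rng)
final_loss = loss(prompt)
```

Each step costs two queries to the black box, so the query budget grows linearly with the number of steps.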
LG | Feb 21, 2023 | Code
Fusion of Global and Local Knowledge for Personalized Federated Learning
Tiansheng Huang, Li Shen, Yan Sun et al.
Personalized federated learning, as a variant of federated learning, trains customized models for clients using their heterogeneously distributed data. However, it remains unclear how to design personalized models that better represent both shared global knowledge and personalized patterns. To bridge the gap, in this paper we explore personalized models with low-rank and sparse decomposition. Specifically, we employ proper regularization to extract a low-rank global knowledge representation (GKR), so as to distill global knowledge into a compact representation. Subsequently, we employ a sparse component over the obtained GKR to fuse the personalized pattern into the global knowledge. As a solution, we propose a two-stage proximal-based algorithm named Federated learning with mixed Sparse and Low-Rank representation (FedSLR) to efficiently search for the mixed models. Theoretically, under proper assumptions, we show that the GKR trained by FedSLR can at least sub-linearly converge to a stationary point of the regularized problem, and that the sparse component being fused can converge to its stationary point under proper settings. Extensive experiments also demonstrate the superior empirical performance of FedSLR. Moreover, FedSLR reduces the number of parameters and lowers the down-link communication complexity, both of which are desirable for federated learning algorithms. Source code is available at https://github.com/huangtiansheng/fedslr.
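The sparse component described above is typically obtained with a proximal step. As a minimal sketch, assuming an L1 penalty on the personalized part, the proximal operator is soft-thresholding; the vectors and the threshold below are made-up illustration values, and the low-rank stage and the two-stage FedSLR algorithm are not reproduced here.

```python
def soft_threshold(values, threshold):
    # Proximal operator of threshold * ||x||_1: shrink each entry toward zero.
    out = []
    for v in values:
        if v > threshold:
            out.append(v - threshold)
        elif v < -threshold:
            out.append(v + threshold)
        else:
            out.append(0.0)
    return out

# Hypothetical flattened parameters: a shared part plus a client-specific residual.
global_part = [0.5, -1.0, 2.0, 0.0]
local_residual = [0.05, -0.4, 0.02, 0.9]

sparse_part = soft_threshold(local_residual, 0.1)
personalized = [g + s for g, s in zip(global_part, sparse_part)]
```

Entries of the residual smaller in magnitude than the threshold are zeroed out, so only a few coordinates of the personalized model differ from the shared part.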
IR | Jun 4
OneReason Technical Report
OneRec Team, Biao Yang, Boyang Ding et al.
Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.
LG | Feb 8, 2023
Improving the Model Consistency of Decentralized Federated Learning
Yifan Shi, Li Shen, Kang Wei et al.
To mitigate the privacy leakages and communication burdens of Federated Learning (FL), decentralized FL (DFL) discards the central server and each client only communicates with its neighbors in a decentralized communication network. However, existing DFL suffers from high inconsistency among local clients, which results in severe distribution shift and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or sparse communication topology. To alleviate this issue, we propose two DFL algorithms named DFedSAM and DFedSAM-MGS to improve the performance of DFL. Specifically, DFedSAM leverages gradient perturbation to generate local flat models via Sharpness Aware Minimization (SAM), which searches for models with uniformly low loss values. DFedSAM-MGS further boosts DFedSAM by adopting Multiple Gossip Steps (MGS) for better model consistency, which accelerates the aggregation of local flat models and better balances communication complexity and generalization. Theoretically, we present improved convergence rates $\mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{1}{K^{1/2}T^{3/2}(1-\lambda)^2}\big)$ and $\mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{\lambda^Q+1}{K^{1/2}T^{3/2}(1-\lambda^Q)^2}\big)$ in the non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of the gossip matrix and $Q$ is the number of MGS. Empirically, our methods can achieve competitive performance compared with CFL methods and outperform existing DFL methods.
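The effect of multiple gossip steps on model consistency can be seen in a small numerical sketch. It assumes a ring of four clients with a symmetric, doubly stochastic mixing matrix and scalar "models"; the matrix, the values, and the function names are illustration choices, and the SAM local updates of DFedSAM are left out.

```python
def gossip_round(models, weights):
    # One gossip step: every client replaces its model by a weighted neighbor average.
    n, dim = len(models), len(models[0])
    return [
        [sum(weights[i][j] * models[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]

def multiple_gossip_steps(models, weights, q):
    # Apply the mixing matrix q times in a row (Q in the abstract's notation).
    for _ in range(q):
        models = gossip_round(models, weights)
    return models

def disagreement(models):
    # Sum of squared distances of the client models to their average.
    n, dim = len(models), len(models[0])
    mean = [sum(m[k] for m in models) / n for k in range(dim)]
    return sum((m[k] - mean[k]) ** 2 for m in models for k in range(dim))

# Assumed ring topology: each client mixes with itself and its two neighbors.
ring = [
    [0.5, 0.25, 0.0, 0.25],
    [0.25, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.5, 0.25],
    [0.25, 0.0, 0.25, 0.5],
]
models = [[1.0], [2.0], [3.0], [6.0]]

after_one = multiple_gossip_steps(models, ring, 1)
after_three = multiple_gossip_steps(models, ring, 3)
```

For this particular ring matrix the second-largest eigenvalue magnitude is $\lambda = 0.5$, so $Q$ gossip steps behave like a single step with $\lambda^Q$, which is how $Q$ enters the second rate above. The average model is preserved while the disagreement shrinks with each additional step.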
CV | Aug 14, 2023
Temporal Sentence Grounding in Streaming Videos
Tian Gan, Xiao Wang, Yan Sun et al.
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source, and are always desired to be processed on-the-fly in many applications such as surveillance and live-stream analysis. Thus, TSGSV is challenging since it requires the model to infer without future frames and process long historical frames effectively, which is untouched in the early methods. To specifically address the above challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query. We conduct extensive experiments using ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of our proposed methods. A systematic ablation study also confirms their effectiveness.
LG | Feb 24, 2023
Subspace based Federated Unlearning
Guanghao Li, Li Shen, Yan Sun et al.
Federated learning (FL) enables multiple clients to train a machine learning model collaboratively without exchanging their local data. Federated unlearning is an inverse FL process that aims to remove a specified target client's contribution in FL to satisfy the user's right to be forgotten. Most existing federated unlearning algorithms require the server to store the history of the parameter updates, which is not applicable in scenarios where the server storage resource is constrained. In this paper, we propose a simple-yet-effective subspace based federated unlearning method, dubbed SFU, that lets the global model perform gradient ascent in the orthogonal space of input gradient spaces formed by other clients to eliminate the target client's contribution without requiring additional storage. Specifically, the server first collects the gradients generated from the target client after performing gradient ascent, and the input representation matrix is computed locally by the remaining clients. We also design a differential privacy method to protect the privacy of the representation matrix. Then the server merges those representation matrices to get the input gradient subspace and updates the global model in the orthogonal subspace of the input gradient subspace to complete the forgetting task with minimal model performance degradation. Experiments on MNIST, CIFAR10, and CIFAR100 show that SFU outperforms several state-of-the-art (SOTA) federated unlearning algorithms by a large margin in various settings.
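The core geometric step above, updating only in the orthogonal complement of a subspace, can be sketched as follows. This assumes the subspace is given by a few explicit direction vectors and uses Gram-Schmidt to build an orthonormal basis; the vectors and helper names are hypothetical, and the representation matrices and differential privacy mechanism of SFU are not modeled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, tol=1e-12):
    # Gram-Schmidt: build an orthonormal basis of the span of `vectors`.
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [x - coeff * y for x, y in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis

def project_out(vector, basis):
    # Remove every component of `vector` that lies in the span of `basis`.
    w = list(vector)
    for b in basis:
        coeff = dot(w, b)
        w = [x - coeff * y for x, y in zip(w, b)]
    return w

# Hypothetical directions that matter to the remaining clients.
retained_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(retained_dirs)

# Hypothetical gradient-ascent direction from the target client.
ascent_grad = [0.3, -0.7, 2.0]
update = project_out(ascent_grad, basis)
```

The projected update has no component along the retained directions, which is the intuition for why unlearning in the orthogonal subspace limits damage to what the other clients contributed.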
LG | Mar 15, 2023
Visual Prompt Based Personalized Federated Learning
Guanghao Li, Wansen Wu, Yan Sun et al.
As a popular paradigm of distributed learning, personalized federated learning (PFL) allows personalized models to improve generalization ability and robustness by utilizing knowledge from all distributed clients. Most existing PFL algorithms tackle personalization in a model-centric way, such as personalized layer partition, model regularization, and model interpolation, which all fail to take into account the data characteristics of distributed clients. In this paper, we propose a novel PFL framework for image classification tasks, dubbed pFedPT, that leverages personalized visual prompts to implicitly represent local data distribution information of clients and provides that information to the aggregation model to help with classification tasks. Specifically, in each round of pFedPT training, each client generates a local personalized prompt related to local data distribution. Then, the local model is trained on the input composed of raw data and a visual prompt to learn the distribution information contained in the prompt. During model testing, the aggregated model obtains prior knowledge of the data distributions based on the prompts, which can be seen as an adaptive fine-tuning of the aggregation model to improve model performances on different clients. Furthermore, the visual prompt can be added as an orthogonal method to implement personalization on the client for existing FL methods to boost their performance. Experiments on the CIFAR10 and CIFAR100 datasets show that pFedPT outperforms several state-of-the-art (SOTA) PFL algorithms by a large margin in various settings.
LG | Jul 30, 2023
Efficient Federated Learning via Local Adaptive Amended Optimizer with Linear Speedup
Yan Sun, Li Shen, Hao Sun et al.
Adaptive optimization has achieved notable success in distributed learning, but extending adaptive optimizers to federated learning (FL) suffers from severe inefficiency, including (i) rugged convergence due to inaccurate gradient estimation in the global adaptive optimizer; (ii) client drifts exacerbated by local over-fitting with the local adaptive optimizer. In this work, we propose a novel momentum-based algorithm that combines global gradient descent with a locally adaptive amended optimizer to tackle these difficulties. Specifically, we incorporate a locally amended technique into the adaptive optimizer, named Federated Local ADaptive Amended optimizer (FedLADA), which estimates the global average offset in the previous communication round and corrects the local offset through a momentum-like term to further improve the empirical training speed and mitigate the heterogeneous over-fitting. Theoretically, we establish the convergence rate of FedLADA with a linear speedup property in the non-convex case under partial participation settings. Moreover, we conduct extensive experiments on a real-world dataset to demonstrate the efficacy of the proposed FedLADA, which greatly reduces the communication rounds and achieves higher accuracy than several baselines.
CV | Aug 25, 2023 | Code
TriGait: Aligning and Fusing Skeleton and Silhouette Gait Data via a Tri-Branch Network
Yan Sun, Xueling Feng, Liyan Ma et al.
Gait recognition is a promising biometric technology for identification because it is non-invasive and works at long distances. However, external variations such as clothing changes and viewpoint differences pose significant challenges to gait recognition. Silhouette-based methods preserve body shape but neglect internal structure information, while skeleton-based methods preserve structure information but omit appearance. To fully exploit the complementary nature of the two modalities, a novel triple branch gait recognition framework, TriGait, is proposed in this paper. It effectively integrates features from the skeleton and silhouette data in a hybrid fusion manner, including a two-stream network to extract static and motion features from appearance, a simple yet effective module named JSA-TC to capture dependencies between all joints, and a third branch for cross-modal learning by aligning and fusing low-level features of the two modalities. Experimental results demonstrate the superiority and effectiveness of TriGait for gait recognition. The proposed method achieves a mean rank-1 accuracy of 96.0% over all conditions on the CASIA-B dataset and 94.3% accuracy for CL, significantly outperforming all the state-of-the-art methods. The source code will be available at https://github.com/feng-xueling/TriGait/.
LG | Jun 9, 2023
Understanding How Consistency Works in Federated Learning via Stage-wise Relaxed Initialization
Yan Sun, Li Shen, Dacheng Tao
Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on the heterogeneous dataset. Previous works have implicitly studied that FL suffers from the "client-drift" problem, which is caused by the inconsistent optimum across local clients. However, a solid theoretical analysis explaining the impact of this local inconsistency is still lacking. To alleviate the negative impact of "client drift" and explore its substance in FL, in this paper, we first design an efficient FL algorithm FedInit, which allows employing the personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. This relaxed initialization helps to revise the local divergence and enhance the local consistency level. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce the excess risk analysis and study the divergence term to investigate the test error of the proposed FedInit method. Our studies show that optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound in FedInit. Extensive experiments are conducted to validate this conclusion. Our proposed FedInit could achieve state-of-the-art (SOTA) results compared to several advanced benchmarks without any additional costs. Meanwhile, stage-wise relaxed initialization could also be incorporated into the current advanced algorithms to achieve higher performance in the FL paradigm.
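The relaxed initialization described above can be written as a one-line extrapolation. The sketch below assumes the form "global state plus a coefficient times (global state minus latest local state)"; the coefficient name `beta` and the example vectors are assumptions made for illustration, not values from the paper.

```python
def relaxed_init(global_state, last_local_state, beta):
    # Start local training slightly past the global state, on the side
    # opposite to where this client's previous local state ended up.
    # `beta` is an assumed relaxation coefficient; beta = 0 recovers the
    # usual initialization from the global state.
    return [g + beta * (g - l) for g, l in zip(global_state, last_local_state)]

global_state = [1.0, 2.0]
last_local_state = [2.0, 0.0]

init_state = relaxed_init(global_state, last_local_state, beta=0.5)
plain_init = relaxed_init(global_state, last_local_state, beta=0.0)
```

Because the step only needs the client's own previous state and the received global state, it adds no extra communication, which matches the "without any additional costs" claim in the abstract.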
LG | Oct 9, 2022
Nonlinear Sufficient Dimension Reduction with a Stochastic Neural Network
Siqi Liang, Yan Sun, Faming Liang
Sufficient dimension reduction is a powerful tool to extract core information hidden in the high-dimensional data and has potentially many important applications in machine learning tasks. However, the existing nonlinear sufficient dimension reduction methods often lack the scalability necessary for dealing with large-scale data. We propose a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction for large-scale data. The proposed stochastic neural network is trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm, whose convergence is rigorously studied in the paper as well. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with the existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.
LG | Feb 13, 2023
Deep Orthogonal Hypersphere Compression for Anomaly Detection
Yunhe Zhang, Yan Sun, Jinyu Cai et al.
Many well-known and effective anomaly detection methods assume that a reasonable decision boundary has a hypersphere shape, which however is difficult to obtain in practice and is not sufficiently compact, especially when the data are in high-dimensional spaces. In this paper, we first propose a novel deep anomaly detection model that improves the original hypersphere learning through an orthogonal projection layer, which ensures that the training data distribution is consistent with the hypersphere hypothesis, thereby increasing the true positive rate and decreasing the false negative rate. Moreover, we propose a bi-hypersphere compression method to obtain a hyperspherical shell that yields a more compact decision region than a hyperball, which is demonstrated theoretically and numerically. The proposed methods are not confined to common datasets such as image and tabular data, but are also extended to a more challenging but promising scenario, graph-level anomaly detection, which learns graph representation with maximum mutual information between the substructure and global structure features while exploring orthogonal single- or bi-hypersphere anomaly decision boundaries. The numerical and visualization results on benchmark datasets demonstrate the superiority of our methods in comparison to many baselines and state-of-the-art methods.
CV | Jul 29, 2023 | Code
GaitASMS: Gait Recognition by Adaptive Structured Spatial Representation and Multi-Scale Temporal Aggregation
Yan Sun, Hu Long, Xueling Feng et al.
Gait recognition is one of the most promising video-based biometric technologies. Silhouette edges and motion are the most informative features, and previous studies have explored them separately and achieved notable results. However, due to occlusions and variations in viewing angles, their gait recognition performance is often affected by the predefined spatial segmentation strategy. Moreover, traditional temporal pooling usually neglects distinctive temporal information in gait. To address the aforementioned issues, we propose a novel gait recognition framework, denoted as GaitASMS, which can effectively extract the adaptive structured spatial representations and naturally aggregate the multi-scale temporal information. The Adaptive Structured Representation Extraction Module (ASRE) separates the edge of silhouettes by using the adaptive edge mask and maximizes the representation in semantic latent space. Moreover, the Multi-Scale Temporal Aggregation Module (MSTA) achieves effective modeling of long-short-range temporal information by temporally aggregated structure. Furthermore, we propose a new data augmentation, denoted random mask, to enrich the sample space of long-term occlusion and enhance the generalization of the model. Extensive experiments conducted on two datasets demonstrate the competitive advantage of the proposed method, especially in complex scenes, i.e., BG and CL. On the CASIA-B dataset, GaitASMS achieves an average accuracy of 93.5% and outperforms the baseline on rank-1 accuracies by 3.4% and 6.3% in BG and CL, respectively. The ablation experiments demonstrate the effectiveness of ASRE and MSTA. The source code is available at https://github.com/YanSungithub/GaitASMS.
LG | Jul 25, 2022
Laplacian-based Cluster-Contractive t-SNE for High Dimensional Data Visualization
Yan Sun, Yi Han, Jicong Fan
Dimensionality reduction techniques aim at representing high-dimensional data in low-dimensional spaces to extract hidden and useful information or facilitate visual understanding and interpretation of the data. However, few of them take into consideration the potential cluster information contained implicitly in the high-dimensional data. In this paper, we propose LaptSNE, a new graph-layout nonlinear dimensionality reduction method based on t-SNE, one of the best techniques for visualizing high-dimensional data as 2D scatter plots. Specifically, LaptSNE leverages the eigenvalue information of the graph Laplacian to shrink the potential clusters in the low-dimensional embedding when learning to preserve the local and global structure from high-dimensional space to low-dimensional space. It is nontrivial to solve the proposed model because the eigenvalues of normalized symmetric Laplacian are functions of the decision variable. We provide a majorization-minimization algorithm with convergence guarantee to solve the optimization problem of LaptSNE and show how to calculate the gradient analytically, which may be of broad interest when considering optimization with Laplacian-composited objective. We evaluate our method by a formal comparison with state-of-the-art methods on seven benchmark datasets, both visually and via established quantitative measurements. The results demonstrate the superiority of our method over baselines such as t-SNE and UMAP. We also provide out-of-sample extension, large-scale extension and mini-batch extension for our LaptSNE to facilitate dimensionality reduction in various scenarios.
IR | Mar 24
Reasoning over Semantic IDs Enhances Generative Recommendation
Yingzhi He, Yan Sun, Junfei Tan et al.
Recent advances in generative recommendation have leveraged pretrained LLMs by formulating sequential recommendation as autoregressive generation over a unified token space comprising language tokens and itemic identifiers, where each item is represented by a compact sequence of discrete tokens, namely Semantic IDs (SIDs). This SID-based formulation enables efficient decoding over large-scale item corpora and provides a natural interface for LLM-based recommenders to leverage rich world knowledge. Meanwhile, breakthroughs in LLM reasoning motivate reasoning-enhanced recommendation, yet effective reasoning over SIDs remains underexplored and challenging. Itemic tokens are not natively meaningful to LLMs; moreover, recommendation-oriented SID reasoning is hard to evaluate, making high-quality supervision scarce. To address these challenges, we propose SIDReasoner, a two-stage framework that elicits reasoning over SIDs by strengthening SID--language alignment to unlock transferable LLM reasoning, rather than relying on large amounts of recommendation-specific reasoning traces. Concretely, SIDReasoner first enhances SID-language alignment via multi-task training on an enriched SID-centered corpus synthesized by a stronger teacher model, grounding itemic tokens in diverse semantic and behavioral contexts. Building on this enhanced alignment, SIDReasoner further improves recommendation reasoning through outcome-driven reinforced optimization, which guides the model toward effective reasoning trajectories without requiring explicit reasoning annotations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our reasoning-augmented SID-based generative recommendation. Beyond accuracy, the results highlight the broader potential of large reasoning models for generative recommendation, including improved interpretability and cross-domain generalization.
LG | Apr 14
Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization
Li Shen, Yan Sun, Dacheng Tao
Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on the heterogeneous dataset. Previous works have implicitly studied that FL suffers from the "client-drift" problem, which is caused by the inconsistent optimum across local clients. However, a solid theoretical analysis explaining the impact of this local inconsistency is still lacking. To alleviate the negative impact of "client drift" and explore its substance in FL, in this paper, we first propose an efficient FL algorithm FedInit, which allows employing the personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce the excess risk analysis and study the divergence term to investigate the test error in FL. Our studies show that optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound. Extensive experiments are conducted to validate its efficiency. The proposed FedInit method could achieve results comparable to several advanced benchmarks without any additional training or communication costs. Meanwhile, the stage-wise personalized relaxed initialization could also be incorporated into several current advanced algorithms to achieve higher generalization performance in the FL paradigm.
LG | Sep 27, 2024
A-FedPD: Aligning Dual-Drift is All Federated Primal-Dual Learning Needs
Yan Sun, Li Shen, Dacheng Tao
As a popular paradigm for balancing data privacy and collaborative training, federated learning (FL) is flourishing as a way to process large-scale heterogeneous datasets distributively on edge clients. Due to bandwidth limitations and security considerations, it splits the original problem into multiple subproblems to be solved in parallel, which gives primal-dual solutions great application value in FL. In this paper, we review the recent development of classical federated primal-dual methods and point out a serious common defect of such methods in non-convex scenarios, which we call "dual drift": it is caused by the dual hysteresis of clients that remain inactive for a long time under partial participation training. To address this problem, we propose a novel Aligned Federated Primal Dual (A-FedPD) method, which constructs virtual dual updates to align global consensus and local dual variables for local clients that have not participated for a long time. Meanwhile, we provide a comprehensive analysis of the optimization and generalization efficiency of the A-FedPD method on smooth non-convex objectives, which confirms its high efficiency and practicality. Extensive experiments are conducted on several classical FL setups to validate the effectiveness of our proposed method.
LG | Oct 5, 2023
Towards Understanding Generalization and Stability Gaps between Centralized and Decentralized Federated Learning
Yan Sun, Li Shen, Dacheng Tao
As two mainstream frameworks in federated learning (FL), both centralized and decentralized approaches have shown great application value in practical scenarios. However, existing studies do not provide sufficient evidence and clear guidance for analysis of which performs better in the FL community. Although decentralized methods have been proven to approach the comparable convergence of centralized with less communication, their test performance always falls short of expectations in empirical studies. To comprehensively and fairly compare their efficiency gaps in FL, in this paper, we explore their stability and generalization efficiency. Specifically, we prove that on the general smooth non-convex objectives, 1) centralized FL (CFL) always generalizes better than decentralized FL (DFL); 2) CFL achieves the best performance via adopting partial participation instead of full participation; and, 3) there is a necessary requirement for the topology in DFL to avoid performance collapse as the training scale increases. We also conduct extensive experiments on several common setups in FL to validate that our theoretical analysis is consistent with experimental phenomena and contextually valid in several general and practical scenarios.
ML Aug 16, 2024
A Confidence Interval for the $\ell_2$ Expected Calibration Error
Yan Sun, Pratik Chaudhari, Ian J. Barnett et al.
Recent advances in machine learning have significantly improved prediction accuracy in various applications. However, ensuring the calibration of probabilistic predictions remains a significant challenge. Despite efforts to enhance model calibration, the rigorous statistical evaluation of model calibration remains less explored. In this work, we develop confidence intervals for the $\ell_2$ Expected Calibration Error (ECE). We consider top-1-to-$k$ calibration, which includes both the popular notion of confidence calibration as well as full calibration. For a debiased estimator of the ECE, we show asymptotic normality, but with different convergence rates and asymptotic variances for calibrated and miscalibrated models. We develop methods to construct asymptotically valid confidence intervals for the ECE, accounting for this behavior as well as non-negativity. Our theoretical findings are supported through extensive experiments, showing that our methods produce valid confidence intervals with shorter lengths compared to those obtained by resampling-based methods.
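To make the quantity in the abstract above concrete, the following is a minimal sketch of a naive plug-in, equal-width-binned estimate of the squared $\ell_2$ calibration gap for confidence calibration. It is an illustration only and is not the debiased estimator or the confidence-interval construction from the paper; the function name `binned_l2_ece` and the choice of 10 bins are assumptions made here.

```python
import numpy as np

def binned_l2_ece(confidences, correct, n_bins=10):
    """Naive plug-in estimate of the squared l2 calibration gap (hypothetical helper).

    confidences: predicted top-1 probabilities in [0, 1]
    correct:     1 if the top-1 prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width bin on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = correct[in_bin].mean() - confidences[in_bin].mean()
        # Weight the squared gap by the fraction of samples in the bin.
        total += in_bin.mean() * gap ** 2
    return total
```

A plug-in estimate of this kind is biased upward for well-calibrated models, which is one reason a debiased estimator is of interest.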
LG Aug 28, 2024
Convergent Differential Privacy Analysis for General Federated Learning: the $f$-DP Perspective
Yan Sun, Li Shen, Dacheng Tao
Federated learning (FL) is an efficient collaborative training paradigm extensively developed with a focus on local privacy, and differential privacy (DP) is a classical approach to capture and ensure the reliability of private security. Their powerful cooperation provides a promising paradigm for large-scale private clients. As a predominant implementation, noisy perturbation has been widely studied and is theoretically proven to offer significant protection. However, existing analyses in FL-DP mostly rely on the composition theorem and cannot tightly quantify privacy leakage: the resulting bound is tight for a few communication rounds but eventually becomes arbitrarily loose and divergent. This also implies a counterintuitive judgment, suggesting that FL-DP may not provide adequate privacy support during long-term training. To further investigate the convergent privacy and reliability of the FL-DP framework, in this paper, we comprehensively evaluate the worst-case privacy of two classical methods under non-convex and smooth objectives based on the $f$-DP analysis. With the aid of the shifted interpolation technique, we prove that privacy in Noisy-FedAvg has a tight convergent bound. Moreover, with the regularization of the proxy term, privacy in Noisy-FedProx has a stable constant lower bound. Our analysis further demonstrates a solid theoretical foundation for the reliability of privacy in FL-DP. Meanwhile, our conclusions can also be losslessly converted to other classical DP analytical frameworks, e.g., $(ε,δ)$-DP and Rényi-DP (RDP).
ML Dec 31, 2025
MultiRisk: Multiple Risk Control via Iterative Score Thresholding
Sunay Joshi, Yan Sun, Hamed Hassani et al.
As generative AI systems are increasingly deployed in real-world applications, regulating multiple dimensions of model behavior has become essential. We focus on test-time filtering: a lightweight mechanism for behavior control that compares performance scores to estimated thresholds, and modifies outputs when these bounds are violated. We formalize the problem of enforcing multiple risk constraints with user-defined priorities, and introduce two efficient dynamic programming algorithms that leverage this sequential structure. The first, MULTIRISK-BASE, provides a direct finite-sample procedure for selecting thresholds, while the second, MULTIRISK, leverages data exchangeability to guarantee simultaneous control of the risks. Under mild assumptions, we show that MULTIRISK achieves nearly tight control of all constraint risks. The analysis requires an intricate iterative argument, upper bounding the risks by introducing several forms of intermediate symmetrized risk functions, and carefully lower bounding the risks by recursively counting jumps in symmetrized risk functions between appropriate risk levels. We evaluate our framework on a three-constraint Large Language Model alignment task using the PKU-SafeRLHF dataset, where the goal is to maximize helpfulness subject to multiple safety constraints, and where scores are generated by a Large Language Model judge and a perplexity filter. Our experimental results show that our algorithm can control each individual risk at close to the target level.
LG Mar 24
Kronecker-Structured Nonparametric Spatiotemporal Point Processes
Zhitong Xu, Qiwei Yuan, Yinghao Chen et al.
Events in spatiotemporal domains arise in numerous real-world applications, where uncovering event relationships and enabling accurate prediction are central challenges. Classical Poisson and Hawkes processes rely on restrictive parametric assumptions that limit their ability to capture complex interaction patterns, while recent neural point process models increase representational capacity but integrate event information in a black-box manner, hindering interpretable relationship discovery. To address these limitations, we propose a Kronecker-Structured Nonparametric Spatiotemporal Point Process (KSTPP) that enables transparent event-wise relationship discovery while retaining high modeling flexibility. We model the background intensity with a spatial Gaussian process (GP) and the influence kernel as a spatiotemporal GP, allowing rich interaction patterns including excitation, inhibition, neutrality, and time-varying effects. To enable scalable training and prediction, we adopt separable product kernels and represent the GPs on structured grids, inducing Kronecker-structured covariance matrices. Exploiting Kronecker algebra substantially reduces computational cost and allows the model to scale to large event collections. In addition, we develop a tensor-product Gauss-Legendre quadrature scheme to efficiently evaluate intractable likelihood integrals. Extensive experiments demonstrate the effectiveness of our framework.
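The cost reduction from Kronecker algebra mentioned above rests on a standard identity: a matrix-vector product with $A \otimes B$ can be computed from the small factors without ever forming the large matrix. The sketch below illustrates that generic identity with NumPy; it is not taken from the KSTPP code, and the helper name `kron_matvec` is hypothetical.

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without materializing the Kronecker product."""
    n, m = A.shape[1], B.shape[1]
    # Row-major reshape: x[i * m + j] becomes X[i, j].
    X = x.reshape(n, m)
    # (A kron B) vec(X) equals vec(A X B^T) under row-major vectorization.
    return (A @ X @ B.T).reshape(-1)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(5, 5))
x = rng.normal(size=20)
fast = kron_matvec(A, B, x)
dense = np.kron(A, B) @ x  # reference computation with the full 20 x 20 matrix
```

For grid sizes $n$ and $m$, the dense product costs on the order of $n^2 m^2$ operations, while the factored form costs on the order of $nm(n+m)$.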
NI Jan 20
Generative Intent Prediction Agentic AI empowered Edge Service Function Chain Orchestration
Yan Sun, Shaoyong Guo, Sai Huang et al.
With the development of artificial intelligence (AI), Agentic AI (AAI) based on large language models (LLMs) is gradually being applied to network management. However, in edge network environments, high user mobility and implicit service intents pose significant challenges to the passive and reactive management of traditional AAI. To address the limitations of existing approaches in handling dynamic demands and predicting users' implicit intents, in this paper we propose an edge service function chain (SFC) orchestration framework empowered by a Generative Intent Prediction Agent (GIPA). Our GIPA aims to shift the paradigm from passive execution to proactive prediction and orchestration. First, we construct a multidimensional intent space that includes functional preferences, QoS sensitivity, and resource requirements, enabling the mapping from unstructured natural language to quantifiable physical resource demands. Second, to cope with the complexity and randomness of intent sequences, we design an intent prediction model based on a Generative Diffusion Model (GDM), which reconstructs users' implicit intents from multidimensional context through a reverse denoising process. Finally, the predicted implicit intents are embedded as global prompts into the SFC orchestration model to guide the network in proactively and ahead-of-time optimizing SFC deployment strategies. Experiment results show that GIPA outperforms existing baseline methods in highly concurrent and highly dynamic scenarios.
LG Mar 27 Code
VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection
PengYu Chen, Shang Wan, Xiaohou Shi et al.
Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT-enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct large-scale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross-modal gaps or in-domain heterogeneity. In this paper, we investigate the applicability of large-scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: over-generalization and limited local perception. To address these challenges, we propose VAN-AD, a novel MAE-based framework for TSAD. To alleviate the over-generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real-world datasets demonstrate that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation metrics. We make our code and datasets available at https://github.com/PenyChen/VAN-AD.
ML Jul 31, 2024
Extended Fiducial Inference: Toward an Automated Process of Statistical Inference
Faming Liang, Sehwan Kim, Yan Sun
While fiducial inference was widely considered a big blunder by R.A. Fisher, the goal he initially set, "inferring the uncertainty of model parameters on the basis of observations", has been continually pursued by many statisticians. To this end, we develop a new statistical inference method called extended fiducial inference (EFI). The new method achieves the goal of fiducial inference by leveraging advanced statistical computing techniques while remaining scalable for big data. EFI involves jointly imputing random errors realized in observations using stochastic gradient Markov chain Monte Carlo and estimating the inverse function using a sparse deep neural network (DNN). The consistency of the sparse DNN estimator ensures that the uncertainty embedded in observations is properly propagated to model parameters through the estimated inverse function, thereby validating downstream statistical inference. Compared to frequentist and Bayesian methods, EFI offers significant advantages in parameter estimation and hypothesis testing. Specifically, EFI provides higher fidelity in parameter estimation, especially when outliers are present in the observations, and eliminates the need for theoretical reference distributions in hypothesis testing, thereby automating the statistical inference process. EFI also provides an innovative framework for semi-supervised learning.
ML Oct 5, 2023
Sparse Deep Learning for Time Series Data: Theory and Applications
Mingxuan Zhang, Yan Sun, Faming Liang
Sparse deep learning has become a popular technique for improving the performance of deep neural networks in areas such as uncertainty quantification, variable selection, and large-scale network compression. However, most existing research has focused on problems where the observations are independent and identically distributed (i.i.d.), and there has been little work on the problems where the observations are dependent, such as time series data and sequential data in natural language processing. This paper aims to address this gap by studying the theory for sparse deep learning with dependent data. We show that sparse recurrent neural networks (RNNs) can be consistently estimated, and their predictions are asymptotically normally distributed under appropriate assumptions, enabling the prediction uncertainty to be correctly quantified. Our numerical results show that sparse deep learning outperforms state-of-the-art methods, such as conformal predictions, in prediction uncertainty quantification for time series data. Furthermore, our results indicate that the proposed method can consistently identify the autoregressive order for time series data and outperform existing methods in large-scale model compression. Our proposed method has important practical implications in fields such as finance, healthcare, and energy, where both accurate point estimates and prediction uncertainty quantification are of concern.
CL Dec 30, 2025
Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
Ziqing Fan, Yuqiao Xian, Yan Sun et al.
A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
LG Feb 24
Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs
Yifei Liang, Yan Sun, Xiaochun Cao et al.
Push-Sum-based decentralized learning enables optimization over directed communication networks, where information exchange may be asymmetric. While convergence properties of such methods are well understood, their finite-iteration stability and generalization behavior remain unclear due to structural bias induced by column-stochastic mixing and asymmetric error propagation. In this work, we develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm that captures the effect of directed topology. A key technical ingredient is an imbalance-aware consistency bound for Push-Sum, which controls consensus deviation through two quantities: the stationary distribution imbalance parameter $δ$ and the spectral gap $(1-λ)$ governing mixing speed. This decomposition enables us to disentangle statistical effects from topology-induced bias. We establish finite-iteration stability and optimization guarantees for both convex objectives and non-convex objectives satisfying the Polyak-Łojasiewicz (PŁ) condition. For convex problems, SGP attains excess generalization error of order $\tilde{\mathcal{O}}\!\left(\frac{1}{\sqrt{mn}}+\frac{γ}{δ(1-λ)}+γ\right)$ under step-size schedules, and we characterize the corresponding optimal early stopping time that minimizes this bound. For PŁ objectives, we obtain convex-like optimization and generalization rates with dominant dependence proportional to $κ\!\left(1+\frac{1}{δ(1-λ)}\right)$, revealing a multiplicative coupling between problem conditioning and directed communication topology. Our analysis clarifies when Push-Sum correction is necessary compared with standard decentralized SGD and quantifies how imbalance and mixing jointly shape the best attainable learning performance.
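For readers unfamiliar with the Push-Sum correction discussed above, the following minimal sketch shows plain Push-Sum averaging (without gradients, so it is not the SGP algorithm analyzed in the paper) on a small directed graph. The 3-node mixing matrix `P` is an arbitrary example chosen here: its columns sum to one but its rows do not, so naive mixing alone would give a biased result.

```python
import numpy as np

def push_sum_average(values, P, n_iters=200):
    """Plain Push-Sum averaging with a column-stochastic mixing matrix P."""
    x = np.asarray(values, dtype=float).copy()  # numerators held by each node
    w = np.ones_like(x)                         # Push-Sum weights, start at 1
    for _ in range(n_iters):
        x = P @ x  # each node pushes shares of its numerator to out-neighbors
        w = P @ w  # and the same shares of its weight
    # The ratio removes the bias from the non-uniform stationary distribution.
    return x / w

# Column j lists how node j splits its mass among itself and its out-neighbors.
P = np.array([
    [1 / 3, 0.0, 0.5],
    [1 / 3, 0.5, 0.0],
    [1 / 3, 0.5, 0.5],
])
estimates = push_sum_average([1.0, 2.0, 6.0], P)
```

Each entry of `estimates` approaches the network-wide average of the initial values, whereas the uncorrected iterate `x` would converge to values skewed by the imbalance of the graph.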
CL Oct 21, 2025 Code
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Ling Team, Anqi Shen, Baihui Li et al.
We present Ring-1T, the first open-source, state-of-the-art thinking model with a trillion-scale parameter. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
LG Jun 15, 2025 Code
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models
Yan Sun, Qixin Zhang, Zhiyuan Yu et al.
The rapid scaling of large language models (LLMs) has made inference efficiency a primary bottleneck in practical deployment. To address this, semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights, thereby enabling hardware-friendly acceleration and reduced memory. However, existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven combinatorial learning, which incurs prohibitive training costs. To tackle these challenges, we propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through $N$-way sampling without replacement. Furthermore, to mitigate the training instability induced by the high variance of policy gradients in the extremely large combinatorial space, we propose a novel update method that introduces a moving average tracker of loss residuals instead of the vanilla loss. Finally, we conduct a comprehensive theoretical analysis and extensive experiments to validate the superior performance of MaskPro, as well as its excellent scalability in memory efficiency and exceptional robustness to data samples. Our code is available at https://github.com/woodenchild95/Maskpro.git.
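The idea of drawing a strict (N:M) mask by sampling $N$ positions without replacement from a per-group categorical distribution can be sketched as follows. This is an illustration under assumptions made here, not the MaskPro implementation: the helper name `sample_nm_mask` is hypothetical, the logits are random placeholders, and the Gumbel top-k trick is used as one standard way to sample without replacement.

```python
import numpy as np

def sample_nm_mask(logits, n, rng):
    """Sample a strict (N:M) mask with exactly n kept weights per group.

    logits: array of shape (num_groups, M), one categorical per group.
    """
    # Gumbel top-k: perturbing logits with Gumbel noise and keeping the n
    # largest entries samples n items without replacement from softmax(logits).
    noise = rng.gumbel(size=logits.shape)
    keep = np.argsort(-(logits + noise), axis=1)[:, :n]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))      # 6 groups of M = 4 weights (placeholder)
mask = sample_nm_mask(logits, 2, rng)  # 2:4 sparsity pattern
```

Note that storing one logit per weight needs memory linear in the number of weights, in contrast to enumerating all $\binom{M}{N}$ patterns per group.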
NA Apr 22, 2025 Code
Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning
Zexi Fan, Yan Sun, Shihao Yang et al.
High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Simulation-Calibrated Scientific Machine Learning (SCaSML), a physics-informed framework that dynamically refines and debiases the SciML predictions during inference by enforcing the physical laws. SCaSML derives new physical laws that quantify systematic errors and employs Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas to dynamically correct the prediction. Both numerical and theoretical analyses confirm enhanced convergence rates via compute-optimal inference methods. Our numerical experiments demonstrate that SCaSML reduces errors by 20-50% compared to the base surrogate model, establishing it as the first algorithm to refine approximated solutions to high-dimensional PDEs during inference. Code of SCaSML is available at https://github.com/Francis-Fan-create/SCaSML.
AI Feb 2 Code
Scaling-Aware Adapter for Structure-Grounded LLM Reasoning
Zihao Jing, Qiuhao Zeng, Ruiyi Fang et al.
Large language models (LLMs) are enabling reasoning over biomolecular structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric groundings requisite for mitigating structural hallucinations or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across diverse all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code is available at the project repository.
LG Nov 25, 2025 Code
RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction
PengYu Chen, Xiaohou Shi, Yuan Chang et al.
Anomaly prediction (AP) in multivariate time series (MTS) is crucial to ensure system dependability. Existing methods either focus solely on whether an anomaly is imminent without providing precise predictions for the future anomaly, or perform predictions directly on historical data, yielding predictions that are easily drowned out by the normal patterns. To address the challenges in the AP task, we propose RED-F, a novel framework comprising the Reconstruction-Elimination Model (REM) and the Dual-stream Contrastive Forecasting Model (DFM). We utilize REM to construct a baseline of normal patterns from historical data, providing a foundation for subsequent predictions of anomalies. Then DFM simultaneously predicts both the constructed normal pattern and the current window, employing a contrastive forecast that transforms the difficult AP task into a simpler, more robust task of relative trajectory comparison by computing the divergence between these two predictions. To enable the forecasting model to generate a prediction not easily obscured by normal patterns, we propose a Multi-Series Prediction (MSP) training objective to enhance its sensitivity to the current window. Extensive experiments on multiple real-world datasets demonstrate the superior capability of RED-F in anomaly prediction tasks. Our code is available at http://github.com/PenyChen/RED-F.
CL Jan 1
Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
Yan Sun, Ming Cai, Stanley Kok
As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
LG Oct 24, 2025 Code
Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
Zihao Jing, Yan Sun, Yan Yi Li et al.
Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
CV Jan 28, 2025 Code
Unsupervised Domain Adaptation with Dynamic Clustering and Contrastive Refinement for Gait Recognition
Xiaolei Liu, Yan Sun, Zhiliang Wang et al.
Gait recognition is an emerging identification technology that distinguishes individuals at long distances by analyzing individual walking patterns. Traditional techniques rely heavily on large-scale labeled datasets, which incurs high costs and significant labeling challenges. Recently, researchers have explored unsupervised gait recognition with clustering-based unsupervised domain adaptation methods and achieved notable success. However, these methods directly use pseudo-labels generated by clustering and neglect pseudo-label noise caused by domain differences, which degrades model training. To mitigate these issues, we propose a novel model called GaitDCCR, which aims to reduce the influence of noisy pseudo-labels on clustering and model training. Our approach can be divided into two main stages: clustering and training. In the clustering stage, we propose Dynamic Cluster Parameters (DCP) and Dynamic Weight Centroids (DWC) to improve the efficiency of clustering and obtain reliable cluster centroids. In the training stage, we employ the classical teacher-student structure and propose Confidence-based Pseudo-label Refinement (CPR) and Contrastive Teacher Module (CTM) to encourage noisy samples to converge towards clusters containing their true identities. Extensive experiments on public gait datasets have demonstrated that our simple and effective method significantly enhances the performance of unsupervised gait recognition, laying the foundation for its application in the real world. We will release the code at https://github.com/YanSun-github/GaitDCCR upon acceptance.
LG Oct 30, 2024 Code
WaveRoRA: Wavelet Rotary Route Attention for Multivariate Time Series Forecasting
Aobo Liang, Yan Sun, Nadra Guizani
In recent years, Transformer-based models (Transformers) have achieved significant success in multivariate time series forecasting (MTSF). However, previous works focus on extracting features either from the time domain or the frequency domain, which inadequately captures the trends and periodic characteristics. To address this issue, we propose a wavelet learning framework to model complex temporal dependencies of the time series data. The wavelet domain integrates both time and frequency information, allowing for the analysis of local characteristics of signals at different scales. Additionally, the Softmax self-attention mechanism used by Transformers has quadratic complexity, which leads to excessive computational costs when capturing long-term dependencies. Therefore, we propose a novel attention mechanism: Rotary Route Attention (RoRA). Unlike Softmax attention, RoRA utilizes rotary position embeddings to inject relative positional information to sequence tokens and introduces a small number of routing tokens $r$ to aggregate information from the $KV$ matrices and redistribute it to the $Q$ matrix, offering linear complexity. We further propose WaveRoRA, which leverages RoRA to capture inter-series dependencies in the wavelet domain. We conduct extensive experiments on eight real-world datasets. The results indicate that WaveRoRA outperforms existing state-of-the-art models while maintaining lower computational costs. Our code is available at https://github.com/Leopold2333/WaveRoRA.
CL May 9
SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
Yan Sun, Guoxia Wang, Jinle Zeng et al.
Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefit in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.
LG May 8
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
Tao Wang, Shuo Li, Yan Sun et al.
Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model-benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.
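The hit-utility definition above can be made concrete with a small sketch. It assumes a Beta posterior over each prompt's per-rollout success probability, starting from the uniform Beta(1, 1) prior the abstract mentions, and a greedy allocator for the summed utility. The function names, the Beta-Bernoulli model, and the greedy loop are illustrative assumptions, not HORA's published procedure.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def hit_utility(successes, failures, extra, prior_a=1.0, prior_b=1.0):
    """Posterior probability that at least one of `extra` new rollouts is correct.

    Assumes p ~ Beta(prior_a + successes, prior_b + failures), so that
    E[(1 - p)^extra] = B(a, b + extra) / B(a, b).
    """
    if extra <= 0:
        return 0.0
    a = prior_a + successes
    b = prior_b + failures
    log_all_miss = log_beta(a, b + extra) - log_beta(a, b)
    return 1.0 - math.exp(log_all_miss)


def allocate(counts, budget):
    """Greedily hand out `budget` rollouts to maximize the summed hit utility.

    `counts` is a list of (successes, failures) pairs, one per prompt.
    The marginal gain of one more rollout is E[p (1 - p)^m], which shrinks
    as m grows, so a greedy choice is exact for this separable objective.
    """
    alloc = [0] * len(counts)
    if not counts:
        return alloc
    for _ in range(budget):
        gains = [
            hit_utility(s, f, m + 1) - hit_utility(s, f, m)
            for (s, f), m in zip(counts, alloc)
        ]
        best = max(range(len(counts)), key=gains.__getitem__)
        alloc[best] += 1
    return alloc
```

With no observations and a uniform prior, `hit_utility(0, 0, m)` reduces to `m / (m + 1)`. The sketch implements only the literal definition quoted above; how HORA treats prompts that already have correct rollouts is not specified here.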
LG Feb 5
ZeroS: Zero-Sum Linear Attention for Efficient Transformers
Jiecheng Lu, Xu Han, Yan Sun et al.
Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
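The zero-sum property mentioned above can be checked numerically. The sketch below subtracts the uniform term $1/t$ from ordinary softmax weights over $t$ positions and shows that the residuals sum to zero and take both signs. It illustrates only that one step; the reweighting and the linear-time formulation of ZeroS are not reproduced, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_sum_residual(scores):
    """Softmax weights with the uniform 1/t component removed.

    The softmax weights sum to 1 and the t copies of 1/t also sum to 1,
    so the residuals sum to 0 and can be positive or negative.
    """
    t = len(scores)
    return [w - 1.0 / t for w in softmax(scores)]


residuals = zero_sum_residual([2.0, 0.5, -1.0, 0.0])
```

Here `residuals` contains one positive entry for the largest score and negative entries for the rest, which is what lets such weights express differences rather than only convex averages.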
LG Nov 2, 2025
Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration
Yan Sun, Jia Guo, Stanley Kok et al.
Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to the computation they require. This study investigates how simply leveraging intrinsic data properties, which are available almost for free during training, can improve data efficiency for RLVR. We propose PREPO with two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among the rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On the Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than the baselines. Beyond empirical gains, we provide theoretical and in-depth analyses explaining the underlying rationale of our method to improve the data efficiency of RLVR.
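The first component can be sketched as follows: compute each prompt's perplexity from its per-token log-probabilities under the current model, then order prompts from low to high perplexity. This is a minimal sketch under the assumption that "well-understood" corresponds to low perplexity; the function names and the simple full sort are hypothetical and are not PREPO's actual schedule.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def order_by_perplexity(prompt_logprobs):
    """Return prompt ids sorted from lowest to highest perplexity.

    `prompt_logprobs` maps a prompt id to the list of log-probabilities
    the model assigns to that prompt's tokens.
    """
    return sorted(prompt_logprobs, key=lambda pid: perplexity(prompt_logprobs[pid]))


example = {
    "hard": [math.log(0.1)] * 5,
    "easy": [math.log(0.9)] * 5,
    "medium": [math.log(0.5)] * 5,
}
curriculum = order_by_perplexity(example)
```

In this toy input every token of a prompt has the same probability, so each perplexity is simply the reciprocal of that probability.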
ML Mar 4, 2024
CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables
Jiecheng Lu, Xu Han, Yan Sun et al.
For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the deficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as the core predictor, CATS achieves state-of-the-art performance while significantly reducing complexity and parameters compared to previous multivariate models, making it an efficient and transferable MTSF solution.
LG Feb 22, 2025
A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
Zihao Lin, Samyadeep Basu, Mohammad Beigi et al.
The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and develop more efficient and reliable applications for better control. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs) - such as contrastive vision-language models, generative vision-language models, and text-to-image models - pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.
LG Apr 24, 2024
Bi-Mamba+: Bidirectional Mamba for Time Series Forecasting
Aobo Liang, Xingguo Jiang, Yan Sun et al.
Long-term time series forecasting (LTSF) provides longer insights into future trends and patterns. Over the past few years, deep learning models, especially Transformers, have achieved advanced performance in LTSF tasks. However, LTSF faces inherent challenges such as capturing long-term dependencies and sparse semantic characteristics. Recently, a new state space model (SSM) named Mamba has been proposed. With its selective capability on input data and its hardware-aware parallel computing algorithm, Mamba has shown great potential in balancing predictive performance and computational efficiency compared to Transformers. To enhance Mamba's ability to preserve historical information over a longer range, we design a novel Mamba+ block by adding a forget gate inside Mamba to selectively combine the new features with the historical features in a complementary manner. Furthermore, we apply Mamba+ both forward and backward and propose Bi-Mamba+, aiming to promote the model's ability to capture interactions among time series elements. Additionally, multivariate time series data in different scenarios may exhibit varying emphasis on intra- or inter-series dependencies. Therefore, we propose a series-relation-aware decider that controls the utilization of a channel-independent or channel-mixing tokenization strategy for specific datasets. Extensive experiments on 8 real-world datasets show that our model achieves more accurate predictions compared with state-of-the-art methods.
CV Aug 1, 2024
MPT-PAR: Mix-Parameters Transformer for Panoramic Activity Recognition
Wenqing Gan, Yan Sun, Feiran Liu et al.
The objective of the panoramic activity recognition task is to identify behaviors at various granularities within crowded and complex environments, encompassing individual actions, social group activities, and global activities. Existing methods generally use either parameter-independent modules to capture task-specific features or parameter-sharing modules to obtain common features across all tasks. However, there is often a strong interrelatedness and complementary effect between tasks of different granularities that previous methods have yet to notice. In this paper, we propose a model called MPT-PAR that considers both the unique characteristics of each task and the synergies between different tasks simultaneously, thereby maximizing the utilization of features across multi-granularity activity recognition. Furthermore, we emphasize the significance of temporal and spatial information by introducing a spatio-temporal relation-enhanced module and a scene representation learning module, which integrate the spatio-temporal context of action and global scene into the feature map of each granularity. Our method achieved an overall F1 score of 47.5% on the JRDB-PAR dataset, significantly outperforming all the state-of-the-art methods.
LG Mar 23
Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection
Qixin Zhang, Wei Huang, Yan Sun et al.
Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $α$-weakly DR-submodular or $(γ,β)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-α})\text{OPT}-ε$ or $(\frac{γ^{2}(1-e^{-(β(1-γ)+γ^2)})}{β(1-γ)+γ^2})\text{OPT}-ε$ with only $O(1/ε^{2})$ function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension (ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.
LG May 23, 2024
In-context Time Series Predictor
Jiecheng Lu, Yan Sun, Shihao Yang
Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms, and is more parameter-efficient because it does not require pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.
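The (lookback, future) pairs described above can be illustrated with a sliding window over a single series. This is a minimal sketch that assumes fixed window lengths and a configurable stride; the function name and parameters are hypothetical, and how the paper embeds such pairs into tokens is not shown.

```python
def make_pairs(series, lookback, horizon, stride=1):
    """Slice a series into (lookback, future) example pairs.

    Each pair holds `lookback` consecutive values followed by the next
    `horizon` values, so every pair is one small forecasting task.
    """
    if lookback <= 0 or horizon <= 0 or stride <= 0:
        raise ValueError("lookback, horizon and stride must be positive")
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(0, last_start + 1, stride):
        split = start + lookback
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs


pairs = make_pairs([0, 1, 2, 3, 4, 5], lookback=3, horizon=2)
```

A series shorter than `lookback + horizon` yields an empty list, since no complete pair fits.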
AI May 25, 2025
Foundations of Top-$k$ Decoding For Language Models
Georgy Noarov, Soham Mallick, Tao Wang et al.
Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider Bregman decoders obtained by minimizing a separable Bregman divergence (for both the primal and dual cases) with a sparsity-inducing $\ell_0$ regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in $k$, so that binary search provably and efficiently finds the optimal $k$. We show that top-$k$ decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
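The standard top-$k$ step described in the first sentence can be sketched in a few lines: keep the $k$ largest probabilities, zero out the rest, and re-normalize before sampling. This shows only plain top-$k$ decoding, not the Bregman decoders or the search for an optimal $k$; the function names are hypothetical.

```python
import random


def top_k_distribution(probs, k):
    """Keep the k largest probabilities and re-normalize them to sum to 1."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Indices of the k largest entries (ties broken by original order).
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    truncated = [0.0] * len(probs)
    for i in keep:
        truncated[i] = probs[i] / total
    return truncated


def sample_top_k(probs, k, rng=random):
    """Sample one token index from the truncated, re-normalized distribution."""
    dist = top_k_distribution(probs, k)
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


dist = top_k_distribution([0.5, 0.3, 0.2], k=2)
token = sample_top_k([0.5, 0.3, 0.2], k=2, rng=random.Random(0))
```

With `k=2` the third token can never be sampled, and the two survivors keep their original ratio of 5 to 3.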