Naichen Shi

LG
h-index15
15papers
264citations
Novelty56%
AI Score57

15 Papers

LGAug 20, 2022
Adam Can Converge Without Any Modification On Update Rules

Yushun Zhang, Congliang Chen, Naichen Shi et al.

Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., $(β_1, β_2)$; while practical applications often fix the problem first and then tune $(β_1, β_2)$. Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when $β_2$ is large and $β_1 < \sqrt{β_2}<1$, Adam converges to the neighborhood of critical points. The size of the neighborhood is propositional to the variance of stochastic gradients. Under an extra condition (strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $β_2$ increases, our convergence result can cover any $β_1 \in [0,1)$ including $β_1=0.9$, which is the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification on its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When $β_2$ is small, we further point out a large region of $(β_1,β_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $β_2$. These positive and negative results can provide suggestions on how to tune Adam hyperparameters.

LGJul 17, 2022
Personalized PCA: Decoupling Shared and Unique Features

Naichen Shi, Raed Al Kontar

In this paper, we tackle a significant challenge in PCA: heterogeneity. When data are collected from different sources with heterogeneous trends while still sharing some congruency, it is critical to extract shared knowledge while retaining the unique features of each source. To this end, we propose personalized PCA (PerPCA), which uses mutually orthogonal global and local principal components to encode both unique and shared features. We show that, under mild conditions, both unique and shared features can be identified and recovered by a constrained optimization problem, even if the covariance matrices are immensely different. Also, we design a fully federated algorithm inspired by distributed Stiefel gradient descent to solve the problem. The algorithm introduces a new group of operations called generalized retractions to handle orthogonality constraints, and only requires global PCs to be shared across sources. We prove the linear convergence of the algorithm under suitable assumptions. Comprehensive numerical experiments highlight PerPCA's superior performance in feature extraction and prediction from heterogeneous datasets. As a systematic approach to decouple shared and unique features from heterogeneous datasets, PerPCA finds applications in several tasks, including video segmentation, topic extraction, and feature clustering.

LGMay 20
Causal Discovery from Heteroscedastic Stochastic Dynamical Systems under Imperfect Physical Models

Jianhong Chen, Naichen Shi, Xubo Yue

Causal discovery is a data-driven paradigm for analyzing complex systems, while physics-based models, such as ordinary differential equations (ODEs), provide mechanistic structure for real-world dynamical processes. Integrating these paradigms can improve identifiability, stability, and robustness. However, real dynamical systems often exhibit cyclic interactions and nonstationarity, whereas many causal discovery methods rely on acyclicity, stationarity, or equilibrium assumptions. We propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge through stochastic differential equations (SDEs). The drift term encodes known ODE dynamics, while the diffusion term captures unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing maximum quasi-likelihood estimator with a theoretically justified stabilization technique to improve the optimization landscape. Under mild conditions, we establish causal graph recovery guarantees for both stable and unstable SDEs. We also analyze robustness of our causal graph estimate to ODE misspecification and clarify how the introduced stabilization technique balances numerical stability and statistical recoverability. Experiments on linear SDEs and nonlinear benchmarks, including Lotka-Volterra and Lorenz dynamics with acyclic and cyclic structures, show improved graph recovery and robustness over data-driven baselines. We also demonstrate practical utility on real-world epidemic data by reconstructing stochastic SIR dynamics within our causal discovery framework.

LGSep 7, 2023
Personalized Tucker Decomposition: Modeling Commonality and Peculiarity on Tensor Data

Jiuyun Hu, Naichen Shi, Raed Al Kontar et al.

We propose personalized Tucker decomposition (perTucker) to address the limitations of traditional tensor decomposition methods in capturing heterogeneity across different datasets. perTucker decomposes tensor data into shared global components and personalized local components. We introduce a mode orthogonality assumption and develop a proximal gradient regularized block coordinate descent algorithm that is guaranteed to converge to a stationary point. By learning unique and common representations across datasets, we demonstrate perTucker's effectiveness in anomaly detection, client classification, and clustering through a simulation study and two case studies on solar flare detection and tonnage signal classification.

MLMay 18
SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

Lifu Wei, Yinuo Ren, Naichen Shi et al.

Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce \texttt{URGE}, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights in previous work, \texttt{URGE} attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, \texttt{URGE} outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.

LGMay 10
DiffATS: Diffusion in Aligned Tensor Space

Jinhua Lyu, Tianmin Yu, Brian Kim et al.

Direct diffusion modeling of high-resolution spatiotemporal fields is computationally challenging. Parameter-efficient primitives address this by representing high-dimensional data with a compact set of parameters. In this paper, we construct data-dependent tensor primitives without pretrained compression autoencoders. Our construction starts from Tucker decomposition, which captures low-rank multilinear structure through a core tensor and mode-wise factors. However, Tucker factors are non-unique: the same tensor can be represented by different rotated factors, which complicates generative modeling. We address this issue with orthogonal Procrustes (OP) alignment. Specifically, we select medoid anchor matrices from the data and align the factor matrices to resolve the gauge ambiguity. This yields matrix Grassmannian primitives and tensor Grassmannian primitives that are compact, data-adaptive, and directly decodable by explicit multilinear reconstruction. Theoretically, we prove that the proposed primitive maps are homeomorphisms between low-rank tensors and their corresponding primitive spaces, certifying that the representations are non-degenerate and topologically faithful. Building on these primitives, we propose *Diffusion in Aligned Tensor Space* (DiffATS), a generative framework that trains diffusion models directly on aligned tensor primitives. Across images, videos, and PDE solutions, DiffATS achieves strong unconditional and conditional generation performance while compressing original data by $3.9\times$ to $210\times$, without relying on any pretrained deep compression autoencoders.

CLJun 11, 2025Code
Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

Haoyi Song, Ruihan Ji, Naichen Shi et al.

Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic interpretation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.

LGMar 21, 2024
Triple Component Matrix Factorization: Untangling Global, Local, and Noisy Components

Naichen Shi, Salar Fattahi, Raed Al Kontar

In this work, we study the problem of common and unique feature extraction from noisy data. When we have N observation matrices from N different and associated sources corrupted by sparse and potentially gross noise, can we recover the common and unique components from these noisy observations? This is a challenging task as the number of parameters to estimate is approximately thrice the number of observations. Despite the difficulty, we propose an intuitive alternating minimization algorithm called triple component matrix factorization (TCMF) to recover the three components exactly. TCMF is distinguished from existing works in literature thanks to two salient features. First, TCMF is a principled method to separate the three components given noisy observations provably. Second, the bulk of the computation in TCMF can be distributed. On the technical side, we formulate the problem as a constrained nonconvex nonsmooth optimization problem. Despite the intricate nature of the problem, we provide a Taylor series characterization of its solution by solving the corresponding Karush-Kuhn-Tucker conditions. Using this characterization, we can show that the alternating minimization algorithm makes significant progress at each iteration and converges into the ground truth at a linear rate. Numerical experiments in video segmentation and anomaly detection highlight the superior feature extraction abilities of TCMF.

MLOct 27, 2025
Coupled Flow Matching

Wenxi Cai, Yuheng Wang, Naichen Shi

We introduce Coupled Flow Matching (CPFM), a framework that integrates controllable dimensionality reduction and high-fidelity reconstruction. CPFM learns coupled continuous flows for both the high-dimensional data x and the low-dimensional embedding y, which enables sampling p(y|x) via a latent-space flow and p(x|y) via a data-space flow. Unlike classical dimension-reduction methods, where information discarded during compression is often difficult to recover, CPFM preserves the knowledge of residual information within the weights of a flow network. This design provides bespoke controllability: users may decide which semantic factors to retain explicitly in the latent space, while the complementary information remains recoverable through the flow network. Coupled flow matching builds on two components: (i) an extended Gromov-Wasserstein optimal transport objective that establishes a probabilistic correspondence between data and embeddings, and (ii) a dual-conditional flow-matching network that extrapolates the correspondence to the underlying space. Experiments on multiple benchmarks show that CPFM yields semantically rich embeddings and reconstructs data with higher fidelity than existing baselines.

MLOct 21, 2025
Calibrated Principal Component Regression

Yixuan Florence Wu, Yilun Zhu, Lei Cao et al.

We propose a new method for statistical inference in generalized linear models. In the overparameterized regime, Principal Component Regression (PCR) reduces variance by projecting high-dimensional data to a low-dimensional principal subspace before fitting. However, PCR incurs truncation bias whenever the true regression vector has mass outside the retained principal components (PC). To mitigate the bias, we propose Calibrated Principal Component Regression (CPCR), which first learns a low-variance prior in the PC subspace and then calibrates the model in the original feature space via a centered Tikhonov step. CPCR leverages cross-fitting and controls the truncation bias by softening PCR's hard cutoff. Theoretically, we calculate the out-of-sample risk in the random matrix regime, which shows that CPCR outperforms standard PCR when the regression signal has non-negligible components in low-variance directions. Empirically, CPCR consistently improves prediction across multiple overparameterized problems. The results highlight CPCR's stability and flexibility in modern overparameterized settings.

LGOct 6, 2025
Domain Generalization: A Tale of Two ERMs

Yilun Zhu, Naihao Deng, Naichen Shi et al.

Domain generalization (DG) is the problem of generalizing from several distributions (or domains), for which labeled training data are available, to a new test domain for which no labeled data is available. A common finding in the DG literature is that it is difficult to outperform empirical risk minimization (ERM) on the pooled training data. In this work, we argue that this finding has primarily been reported for datasets satisfying a \emph{covariate shift} assumption. When the dataset satisfies a \emph{posterior drift} assumption instead, we show that ``domain-informed ERM,'' wherein feature vectors are augmented with domain-specific information, outperforms pooling ERM. These claims are supported by a theoretical framework and experiments on language and vision tasks.

LGMay 24, 2023
Personalized Dictionary Learning for Heterogeneous Datasets

Geyu Liang, Naichen Shi, Raed Al Kontar et al.

We introduce a relevant yet challenging problem named Personalized Dictionary Learning (PerDL), where the goal is to learn sparse linear representations from heterogeneous datasets that share some commonality. In PerDL, we model each dataset's shared and unique features as global and local dictionaries. Challenges for PerDL not only are inherited from classical dictionary learning (DL), but also arise due to the unknown nature of the shared and unique features. In this paper, we rigorously formulate this problem and provide conditions under which the global and local dictionaries can be provably disentangled. Under these conditions, we provide a meta-algorithm called Personalized Matching and Averaging (PerMA) that can recover both global and local dictionaries from heterogeneous datasets. PerMA is highly efficient; it converges to the ground truth at a linear rate under suitable conditions. Moreover, it automatically borrows strength from strong learners to improve the prediction of weak learners. As a general framework for extracting global and local dictionaries, we show the application of PerDL in different learning tasks, such as training with imbalanced datasets and video surveillance.

LGNov 9, 2021
The Internet of Federated Things (IoFT): A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Raed Kontar, Naichen Shi, Xubo Yue et al.

The Internet of Things (IoT) is on the verge of a major paradigm shift. In the IoT system of the future, IoFT, the cloud will be substituted by the crowd where model training is brought to the edge, allowing IoT devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift was set into motion by the tremendous increase in computational power on IoT devices and the recent advances in decentralized and privacy-preserving model training, coined as federated learning (FL). This article provides a vision for IoFT and a systematic overview of current efforts towards realizing this vision. Specifically, we first introduce the defining characteristics of IoFT and discuss FL data-driven approaches, opportunities, and challenges that allow decentralized inference within three dimensions: (i) a global model that maximizes utility across all IoT devices, (ii) a personalized model that borrows strengths across all devices yet retains its own model, (iii) a meta-learning model that quickly adapts to new devices or learning tasks. We end by describing the vision and challenges of IoFT in reshaping different industries through the lens of domain experts. Those industries include manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

MLJul 21, 2021
Fed-ensemble: Improving Generalization through Model Ensembling in Federated Learning

Naichen Shi, Fan Lai, Raed Al Kontar et al.

In this paper we propose Fed-ensemble: a simple approach that bringsmodel ensembling to federated learning (FL). Instead of aggregating localmodels to update a single global model, Fed-ensemble uses random permutations to update a group of K models and then obtains predictions through model averaging. Fed-ensemble can be readily utilized within established FL methods and does not impose a computational overhead as it only requires one of the K models to be sent to a client in each communication round. Theoretically, we show that predictions on newdata from all K models belong to the same predictive posterior distribution under a neural tangent kernel regime. This result in turn sheds light onthe generalization advantages of model averaging. We also illustrate thatFed-ensemble has an elegant Bayesian interpretation. Empirical results show that our model has superior performance over several FL algorithms,on a wide range of data sets, and excels in heterogeneous settings often encountered in FL applications.

GTFeb 15, 2021
ScrofaZero: Mastering Trick-taking Poker Game Gongzhu by Deep Reinforcement Learning

Naichen Shi, Ruichen Li, Sun Youran

People have made remarkable progress in game AIs, especially in domain of perfect information game. However, trick-taking poker game, as a popular form of imperfect information game, has been regarded as a challenge for a long time. Since trick-taking game requires high level of not only reasoning, but also inference to excel, it can be a new milestone for imperfect information game AI. We study Gongzhu, a trick-taking game analogous to, but slightly simpler than contract bridge. Nonetheless, the strategies of Gongzhu are complex enough for both human and computer players. We train a strong Gongzhu AI ScrofaZero from \textit{tabula rasa} by deep reinforcement learning, while few previous efforts on solving trick-taking poker game utilize the representation power of neural networks. Also, we introduce new techniques for imperfect information game including stratified sampling, importance weighting, integral over equivalent class, Bayesian inference, etc. Our AI can achieve human expert level performance. The methodologies in building our program can be easily transferred into a wide range of trick-taking games.