Shihua Zhang

LG
h-index26
30papers
256citations
Novelty52%
AI Score56

30 Papers

LGJun 4
Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

Hu Tan, Kuo Gai, Shihua Zhang

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Lojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. But the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.

LGFeb 10, 2023Code
Monte Carlo Neural PDE Solver for Learning PDEs via Probabilistic Representation

Rui Zhang, Qi Meng, Rongchan Zhu et al.

In scenarios with limited available data, training the function-to-function neural PDE solver in an unsupervised manner is essential. However, the efficiency and accuracy of existing methods are constrained by the properties of numerical algorithms, such as finite difference and pseudo-spectral methods, integrated during the training stage. These methods necessitate careful spatiotemporal discretization to achieve reasonable accuracy, leading to significant computational challenges and inaccurate simulations, particularly in cases with substantial spatiotemporal variations. To address these limitations, we propose the Monte Carlo Neural PDE Solver (MCNP Solver) for training unsupervised neural solvers via the PDEs' probabilistic representation, which regards macroscopic phenomena as ensembles of random particles. Compared to other unsupervised methods, MCNP Solver naturally inherits the advantages of the Monte Carlo method, which is robust against spatiotemporal variations and can tolerate coarse step size. In simulating the trajectories of particles, we employ Heun's method for the convection process and calculate the expectation via the probability density function of neighbouring grid points during the diffusion process. These techniques enhance accuracy and circumvent the computational issues associated with Monte Carlo sampling. Our numerical experiments on convection-diffusion, Allen-Cahn, and Navier-Stokes equations demonstrate significant improvements in accuracy and efficiency compared to other unsupervised baselines. The source code will be publicly available at: https://github.com/optray/MCNP.

CVMar 27
Make Geometry Matter for Spatial Reasoning

Shihua Zhang, Qiuhong Shen, Shizun Wang et al.

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

CVMar 23, 2025Code
Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning

Xiang Fang, Shihua Zhang, Hao Zhang et al.

Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose \textbf{CorrMamba}, a \textbf{Corr}espondence filter leveraging \textbf{Mamba}'s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscured its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by $2.58$ absolute percentage points in AUC@20\textdegree, highlighting its practical superiority. Our code will be publicly available.

MLSep 1, 2021Code
Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification

Chihao Zhang, Yiling Elaine Chen, Shihua Zhang et al.

Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there lacks a principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at https://github.com/JSB-UCLA/ITCA) implements ITCA and search strategies.

LGAug 1, 2024
OTAD: An Optimal Transport-Induced Robust Model for Agnostic Adversarial Attack

Kuo Gai, Sicong Wang, Shihua Zhang

Deep neural networks (DNNs) are vulnerable to small adversarial perturbations of the inputs, posing a significant challenge to their reliability and robustness. Empirical methods such as adversarial training can defend against particular attacks but remain vulnerable to more powerful attacks. Alternatively, Lipschitz networks provide certified robustness to unseen perturbations but lack sufficient expressive power. To harness the advantages of both approaches, we design a novel two-step Optimal Transport induced Adversarial Defense (OTAD) model that can fit the training data accurately while preserving the local Lipschitz continuity. First, we train a DNN with a regularizer derived from optimal transport theory, yielding a discrete optimal transport map linking data to its features. By leveraging the map's inherent regularity, we interpolate the map by solving the convex integration problem (CIP) to guarantee the local Lipschitz property. OTAD is extensible to diverse architectures of ResNet and Transformer, making it suitable for complex data. For efficient computation, the CIP can be solved through training neural networks. OTAD opens a novel avenue for developing reliable and secure deep learning systems through the regularity of optimal transport maps. Empirical results demonstrate that OTAD can outperform other robust models on diverse datasets.

MLJul 2, 2024
Analytical Solution of a Three-layer Network with a Matrix Exponential Activation Function

Kuo Gai, Shihua Zhang

In practice, deeper networks tend to be more powerful than shallow ones, but this has not been understood theoretically. In this paper, we find the analytical solution of a three-layer network with a matrix exponential activation function, i.e., $$ f(X)=W_3\exp(W_2\exp(W_1X)), X\in \mathbb{C}^{d\times d} $$ have analytical solutions for the equations $$ Y_1=f(X_1),Y_2=f(X_2) $$ for $X_1,X_2,Y_1,Y_2$ with only invertible assumptions. Our proof shows the power of depth and the use of a non-linear activation function, since one layer network can only solve one equation,i.e.,$Y=WX$.

AIMay 4
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective

Xiayang Li, Kuo Gai, Shihua Zhang

Shortcut learning causes deep learning models to rely on non-essential features within the data. However, its formation in deep neural network training still lacks theoretical understanding. In this paper, we provide a formal definition of core and shortcut features and employ evolutionary game theory to analyze the origins of shortcut bias by modeling data samples as players and their corresponding neural tangent features as strategies, assuming the existence of core and shortcut subnetworks. We find that gradient descent (GD) and stochastic gradient descent (SGD) lead to two distinct stochastically stable states, each corresponding to a different strategy. The former primarily optimizes the shortcut subnetwork, while the latter primarily optimizes the core subnetwork. We investigate the influence of these strategies on shortcut bias through a continuous stochastic differential equation, and reveal the impact of data noise and optimization noise on the formation of shortcut bias. In brief, our work employs evolutionary game theory to characterize the dynamics of shortcut bias formation and provides a theoretical view on its mitigation.

LGMay 2, 2024
Progressive Feedforward Collapse of ResNet Training

Sicong Wang, Kuo Gai, Shihua Zhang

Neural collapse (NC) is a simple and symmetric phenomenon for deep neural networks (DNNs) at the terminal phase of training, where the last-layer features collapse to their class means and form a simplex equiangular tight frame aligning with the classifier vectors. However, the relationship of the last-layer features to the data and intermediate layers during training remains unexplored. To this end, we characterize the geometry of intermediate layers of ResNet and propose a novel conjecture, progressive feedforward collapse (PFC), claiming the degree of collapse increases during the forward propagation of DNNs. We derive a transparent model for the well-trained ResNet according to that ResNet with weight decay approximates the geodesic curve in Wasserstein space at the terminal phase. The metrics of PFC indeed monotonically decrease across depth on various datasets. We propose a new surrogate model, multilayer unconstrained feature model (MUFM), connecting intermediate layers by an optimal transport regularizer. The optimal solution of MUFM is inconsistent with NC but is more concentrated relative to the input data. Overall, this study extends NC to PFC to model the collapse phenomenon of intermediate layers and its dependence on the input data, shedding light on the theoretical understanding of ResNet in classification problems.

LGDec 24, 2024
Towards understanding how attention mechanism works in deep learning

Tianyu Ruan, Shihua Zhang

Attention mechanism has been extensively integrated within mainstream neural network architectures, such as Transformers and graph attention networks. Yet, its underlying working principles remain somewhat elusive. What is its essence? Are there any connections between it and traditional machine learning algorithms? In this study, we inspect the process of computing similarity using classic metrics and vector space properties in manifold learning, clustering, and supervised learning. We identify the key characteristics of similarity computation and information propagation in these methods and demonstrate that the self-attention mechanism in deep learning adheres to the same principles but operates more flexibly and adaptively. We decompose the self-attention mechanism into a learnable pseudo-metric function and an information propagation process based on similarity computation. We prove that the self-attention mechanism converges to a drift-diffusion process through continuous modeling provided the pseudo-metric is a transformation of a metric and certain reasonable assumptions hold. This equation could be transformed into a heat equation under a new metric. In addition, we give a first-order analysis of attention mechanism with a general pseudo-metric function. This study aids in understanding the effects and principle of attention mechanism through physical intuition. Finally, we propose a modified attention mechanism called metric-attention by leveraging the concept of metric learning to facilitate the ability to learn desired metrics more effectively. Experimental results demonstrate that it outperforms self-attention regarding training efficiency, accuracy, and robustness.

LGFeb 2, 2025
Learning-Based TSP-Solvers Tend to Be Overly Greedy

Xiayang Li, Shihua Zhang

Deep learning has shown significant potential in solving combinatorial optimization problems such as the Euclidean traveling salesman problem (TSP). However, most training and test instances for existing TSP algorithms are generated randomly from specific distributions like uniform distribution. This has led to a lack of analysis and understanding of the performance of deep learning algorithms in out-of-distribution (OOD) generalization scenarios, which has a close relationship with the worst-case performance in the combinatorial optimization field. For data-driven algorithms, the statistical properties of randomly generated datasets are critical. This study constructs a statistical measure called nearest-neighbor density to verify the asymptotic properties of randomly generated datasets and reveal the greedy behavior of learning-based solvers, i.e., always choosing the nearest neighbor nodes to construct the solution path. Based on this statistical measure, we develop interpretable data augmentation methods that rely on distribution shifts or instance perturbations and validate that the performance of the learning-based solvers degenerates much on such augmented data. Moreover, fine-tuning learning-based solvers with augmented data further enhances their generalization abilities. In short, we decipher the limitations of learning-based TSP solvers tending to be overly greedy, which may have profound implications for AI-empowered combinatorial optimization solvers.

CVMar 31, 2025
CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching

Zizhuo Li, Yifan Lu, Linfeng Tang et al.

This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. Firstly, observing that modeling context interaction over the entire coarse feature map elicits highly redundant computation due to the neighboring representation similarity of tokens, a covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores that are dynamically estimated, thereby ensuring computational efficiency while improving the representational capacity of aggregated tokens simultaneously. Secondly, considering that feature interaction with massive non-covisible areas is distracting, which may degrade feature distinctiveness, a covisibility-assisted attention mechanism is deployed to selectively suppress irrelevant message broadcast from non-covisible reduced tokens, resulting in robust and compact attention to relevant rather than all ones. Thirdly, we find that at the fine-level stage, current methods adjust only the target view's keypoints to subpixel level, while those in the source view remain restricted at the coarse level and thus not informative enough, detrimental to keypoint location-sensitive usages. A simple yet potent fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level, attaining attractive performance improvement. Thorough experimentation across an array of public benchmarks affirms CoMatch's promising accuracy, efficiency, and generalizability.

CVJun 5, 2025
Deep Learning Reforms Image Matching: A Survey and Outlook

Shihua Zhang, Zizhuo Li, Kaining Zhang et al.

Image matching, which establishes correspondences between two-view images to recover 3D structure and camera geometry, serves as a cornerstone in computer vision and underpins a wide range of applications, including visual localization, 3D reconstruction, and simultaneous localization and mapping (SLAM). Traditional pipelines composed of ``detector-descriptor, feature matcher, outlier filter, and geometric estimator'' falter in challenging scenarios. Recent deep-learning advances have significantly boosted both robustness and accuracy. This survey adopts a unique perspective by comprehensively reviewing how deep learning has incrementally transformed the classical image matching pipeline. Our taxonomy highly aligns with the traditional pipeline in two key aspects: i) the replacement of individual steps in the traditional pipeline with learnable alternatives, including learnable detector-descriptor, outlier filter, and geometric estimator; and ii) the merging of multiple steps into end-to-end learnable modules, encompassing middle-end sparse matcher, end-to-end semi-dense/dense matcher, and pose regressor. We first examine the design principles, advantages, and limitations of both aspects, and then benchmark representative methods on relative pose recovery, homography estimation, and visual localization tasks. Finally, we discuss open challenges and outline promising directions for future research. By systematically categorizing and evaluating deep learning-driven strategies, this survey offers a clear overview of the evolving image matching landscape and highlights key avenues for further innovation.

LGOct 2, 2025
Graph Generation with Spectral Geodesic Flow Matching

Xikun Huang, Tianyu Ruan, Chihao Zhang et al.

Graph generation is a fundamental task with wide applications in modeling complex systems. Although existing methods align the spectrum or degree profile of the target graph, they often ignore the geometry induced by eigenvectors and the global structure of the graph. In this work, we propose Spectral Geodesic Flow Matching (SFMG), a novel framework that uses spectral eigenmaps to embed both input and target graphs into continuous Riemannian manifolds. We then define geodesic flows between embeddings and match distributions along these flows to generate output graphs. Our method yields several advantages: (i) captures geometric structure beyond eigenvalues, (ii) supports flexible generation of diverse graphs, and (iii) scales efficiently. Empirically, SFMG matches the performance of state-of-the-art approaches on graphlet, degree, and spectral metrics across diverse benchmarks. In particular, it achieves up to 30$\times$ speedup over diffusion-based models, offering a substantial advantage in scalability and training efficiency. We also demonstrate its ability to generalize to unseen graph scales. Overall, SFMG provides a new approach to graph synthesis by integrating spectral geometry with flow matching.

LGSep 24, 2025
Feature Dynamics as Implicit Data Augmentation: A Depth-Decomposed View on Deep Neural Network Generalization

Tianyu Ruan, Kuo Gai, Shihua Zhang

Why do deep networks generalize well? In contrast to classical generalization theory, we approach this fundamental question by examining not only inputs and outputs, but the evolution of internal features. Our study suggests a phenomenon of temporal consistency that predictions remain stable when shallow features from earlier checkpoints combine with deeper features from later ones. This stability is not a trivial convergence artifact. It acts as a form of implicit, structured augmentation that supports generalization. We show that temporal consistency extends to unseen and corrupted data, but collapses when semantic structure is destroyed (e.g., random labels). Statistical tests further reveal that SGD injects anisotropic noise aligned with a few principal directions, reinforcing its role as a source of structured variability. Together, these findings suggest a conceptual perspective that links feature dynamics to generalization, pointing toward future work on practical surrogates for measuring temporal feature evolution.

LGDec 15, 2021
Rethinking Influence Functions of Neural Networks in the Over-parameterized Regime

Rui Zhang, Shihua Zhang

Understanding the black-box prediction for neural networks is challenging. To achieve this, early studies have designed influence function (IF) to measure the effect of removing a single training point on neural networks. However, the classic implicit Hessian-vector product (IHVP) method for calculating IF is fragile, and theoretical analysis of IF in the context of neural networks is still lacking. To this end, we utilize the neural tangent kernel (NTK) theory to calculate IF for the neural network trained with regularized mean-square loss, and prove that the approximation error can be arbitrarily small when the width is sufficiently large for two-layer ReLU networks. We analyze the error bound for the classic IHVP method in the over-parameterized regime to understand when and why it fails or not. In detail, our theoretical analysis reveals that (1) the accuracy of IHVP depends on the regularization term, and is pretty low under weak regularization; (2) the accuracy of IHVP has a significant correlation with the probability density of corresponding training points. We further borrow the theory from NTK to understand the IFs better, including quantifying the complexity for influential samples and depicting the variation of IFs during the training dynamics. Numerical experiments on real-world data confirm our theoretical results and demonstrate our findings.

LGFeb 28, 2021
Adversarial Information Bottleneck

Penglong Zhai, Shihua Zhang

The information bottleneck (IB) principle has been adopted to explain deep learning in terms of information compression and prediction, which are balanced by a trade-off hyperparameter. How to optimize the IB principle for better robustness and figure out the effects of compression through the trade-off hyperparameter are two challenging problems. Previous methods attempted to optimize the IB principle by introducing random noise into learning the representation and achieved state-of-the-art performance in the nuisance information compression and semantic information extraction. However, their performance on resisting adversarial perturbations is far less impressive. To this end, we propose an adversarial information bottleneck (AIB) method without any explicit assumptions about the underlying distribution of the representations, which can be optimized effectively by solving a Min-Max optimization problem. Numerical experiments on synthetic and real-world datasets demonstrate its effectiveness on learning more invariant representations and mitigating adversarial perturbations compared to several competing IB methods. In addition, we analyse the adversarial robustness of diverse IB methods contrasting with their IB curves, and reveal that IB models with the hyperparameter $β$ corresponding to the knee point in the IB curve achieve the best trade-off between compression and prediction, and has best robustness against various attacks.

LGFeb 18, 2021
A Mathematical Principle of Deep Learning: Learn the Geodesic Curve in the Wasserstein Space

Kuo Gai, Shihua Zhang

Recent studies revealed the mathematical connection of deep neural network (DNN) and dynamic system. However, the fundamental principle of DNN has not been fully characterized with dynamic system in terms of optimization and generalization. To this end, we build the connection of DNN and continuity equation where the measure is conserved to model the forward propagation process of DNN which has not been addressed before. DNN learns the transformation of the input distribution to the output one. However, in the measure space, there are infinite curves connecting two distributions. Which one can lead to good optimization and generaliztion for DNN? By diving the optimal transport theory, we find DNN with weight decay attempts to learn the geodesic curve in the Wasserstein space, which is induced by the optimal transport map. Compared with plain network, ResNet is a better approximation to the geodesic curve, which explains why ResNet can be optimized and generalize better. Numerical experiments show that the data tracks of both plain network and ResNet tend to be line-shape in term of line-shape score (LSS), and the map learned by ResNet is closer to the optimal transport map in term of optimal transport score (OTS). In a word, we conclude a mathematical principle of deep learning is to learn the geodesic curve in the Wasserstein space; and deep learning is a great engineering realization of continuous transformation in high-dimensional space.

LGOct 16, 2020
Learnable Graph-regularization for Matrix Decomposition

Penglong Zhai, Shihua Zhang

Low-rank approximation models of data matrices have become important machine learning and data mining tools in many fields including computer vision, text mining, bioinformatics and many others. They allow for embedding high-dimensional data into low-dimensional spaces, which mitigates the effects of noise and uncovers latent relations. In order to make the learned representations inherit the structures in the original data, graph-regularization terms are often added to the loss function. However, the prior graph construction often fails to reflect the true network connectivity and the intrinsic relationships. In addition, many graph-regularized methods fail to take the dual spaces into account. Probabilistic models are often used to model the distribution of the representations, but most of previous methods often assume that the hidden variables are independent and identically distributed for simplicity. To this end, we propose a learnable graph-regularization model for matrix decomposition (LGMD), which builds a bridge between graph-regularized methods and probabilistic matrix decomposition models. LGMD learns two graphical structures (i.e., two precision matrices) in real-time in an iterative manner via sparse precision matrix estimation and is more robust to noise and missing entries. Extensive numerical results and comparison with competing methods demonstrate its effectiveness.

MLMay 20, 2020
Tessellated Wasserstein Auto-Encoders

Kuo Gai, Shihua Zhang

Non-adversarial generative models such as variational auto-encoder (VAE), Wasserstein auto-encoders with maximum mean discrepancy (WAE-MMD), sliced-Wasserstein auto-encoder (SWAE) are relatively easy to train and have less mode collapse compared to Wasserstein auto-encoder with generative adversarial network (WAE-GAN). However, they are not very accurate in approximating the target distribution in the latent space because they don't have a discriminator to detect the minor difference between real and fake. To this end, we develop a novel non-adversarial framework called Tessellated Wasserstein Auto-encoders (TWAE) to tessellate the support of the target distribution into a given number of regions by the centroidal Voronoi tessellation (CVT) technique and design batches of data according to the tessellation instead of random shuffling for accurate computation of discrepancy. Theoretically, we demonstrate that the error of estimate to the discrepancy decreases when the numbers of samples $n$ and regions $m$ of the tessellation become larger with rates of $\mathcal{O}(\frac{1}{\sqrt{n}})$ and $\mathcal{O}(\frac{1}{\sqrt{m}})$, respectively. Given fixed $n$ and $m$, a necessary condition for the upper bound of measurement error to be minimized is that the tessellation is the one determined by CVT. TWAE is very flexible to different non-adversarial metrics and can substantially enhance their generative performance in terms of Fréchet inception distance (FID) compared to VAE, WAE-MMD, SWAE. Moreover, numerical results indeed demonstrate that TWAE is competitive to the adversarial model WAE-GAN, demonstrating its powerful generative ability.

LGFeb 10, 2020
Distributed Bayesian Matrix Decomposition for Big Data Mining and Clustering

Chihao Zhang, Yang Yang, Wei Zhang et al.

Matrix decomposition is one of the fundamental tools to discover knowledge from big data generated by modern applications. However, it is still inefficient or infeasible to process very big data using such a method in a single machine. Moreover, big data are often distributedly collected and stored on different machines. Thus, such data generally bear strong heterogeneous noise. It is essential and useful to develop distributed matrix decomposition for big data analytics. Such a method should scale up well, model the heterogeneous noise, and address the communication issue in a distributed system. To this end, we propose a distributed Bayesian matrix decomposition model (DBMD) for big data mining and clustering. Specifically, we adopt three strategies to implement the distributed computing including 1) the accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) the statistical inference. We investigate the theoretical convergence behaviors of these algorithms. To address the heterogeneity of the noise, we propose an optimal plug-in weighted average that reduces the variance of the estimation. Synthetic experiments validate our theoretical results, and real-world experiments show that our algorithms scale up well to big data and achieves superior or competing performance compared to other distributed methods.

LGDec 5, 2019
Towards Understanding Residual and Dilated Dense Neural Networks via Convolutional Sparse Coding

Zhiyang Zhang, Shihua Zhang

Convolutional neural network (CNN) and its variants have led to many state-of-art results in various fields. However, a clear theoretical understanding about them is still lacking. Recently, multi-layer convolutional sparse coding (ML-CSC) has been proposed and proved to equal such simply stacked networks (plain networks). Here, we think three factors in each layer of it including the initialization, the dictionary design and the number of iterations greatly affect the performance of ML-CSC. Inspired by these considerations, we propose two novel multi-layer models--residual convolutional sparse coding model (Res-CSC) and mixed-scale dense convolutional sparse coding model (MSD-CSC), which have close relationship with the residual neural network (ResNet) and mixed-scale (dilated) dense neural network (MSDNet), respectively. Mathematically, we derive the shortcut connection in ResNet as a special case of a new forward propagation rule on ML-CSC. We find a theoretical interpretation of the dilated convolution and dense connection in MSDNet by analyzing MSD-CSC, which gives a clear mathematical understanding about them. We implement the iterative soft thresholding algorithm (ISTA) and its fast version to solve Res-CSC and MSD-CSC, which can employ the unfolding operation for further improvements. At last, extensive numerical experiments and comparison with competing methods demonstrate their effectiveness using three typical datasets.

LGNov 25, 2019
Matrix Normal PCA for Interpretable Dimension Reduction and Graphical Noise Modeling

Chihao Zhang, Kuo Gai, Shihua Zhang

Principal component analysis (PCA) is one of the most widely used dimension reduction and multivariate statistical techniques. From a probabilistic perspective, PCA seeks a low-dimensional representation of data in the presence of independent identical Gaussian noise. Probabilistic PCA (PPCA) and its variants have been extensively studied for decades. Most of them assume the underlying noise follows a certain independent identical distribution. However, the noise in the real world is usually complicated and structured. To address this challenge, some variants of PCA for data with non-IID noise have been proposed. However, most of the existing methods only assume that the noise is correlated in the feature space while there may exist two-way structured noise. To this end, we propose a powerful and intuitive PCA method (MN-PCA) through modeling the graphical noise by the matrix normal distribution, which enables us to explore the structure of noise in both the feature space and the sample space. MN-PCA obtains a low-rank representation of data and the structure of noise simultaneously. And it can be explained as approximating data over the generalized Mahalanobis distance. We develop two algorithms to solve this model: one maximizes the regularized likelihood, the other exploits the Wasserstein distance, which is more robust. Extensive experiments on various data demonstrate their effectiveness.

MLJul 28, 2018
Group-sparse SVD Models and Their Applications in Biological Data

Wenwen Min, Juan Liu, Shihua Zhang

Sparse Singular Value Decomposition (SVD) models have been proposed for biclustering high dimensional gene expression data to identify block patterns with similar expressions. However, these models do not take into account prior group effects upon variable selection. To this end, we first propose group-sparse SVD models with group Lasso (GL1-SVD) and group L0-norm penalty (GL0-SVD) for non-overlapping group structure of variables. However, such group-sparse SVD models limit their applicability in some problems with overlapping structure. Thus, we also propose two group-sparse SVD models with overlapping group Lasso (OGL1-SVD) and overlapping group L0-norm penalty (OGL0-SVD). We first adopt an alternating iterative strategy to solve GL1-SVD based on a block coordinate descent method, and GL0-SVD based on a projection method. The key of solving OGL1-SVD is a proximal operator with overlapping group Lasso penalty. We employ an alternating direction method of multipliers (ADMM) to solve the proximal operator. Similarly, we develop an approximate method to solve OGL0-SVD. Applications of these methods and comparison with competing ones using simulated data demonstrate their effectiveness. Extensive applications of them onto several real gene expression data with gene prior group knowledge identify some biologically interpretable gene modules.

CVDec 9, 2017
Bayesian Joint Matrix Decomposition for Data Integration with Heterogeneous Noise

Chihao Zhang, Shihua Zhang

Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied into various fields. Most matrix decomposition methods focus on decomposing a data matrix from one single source. However, it is common that data are from different sources with heterogeneous noise. A few of matrix decomposition methods have been extended for such multi-view data integration and pattern discovery. While only few methods were designed to consider the heterogeneity of noise in such multi-view data for data integration explicitly. To this end, we propose a joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by Gaussian distribution in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; and another is a maximum a posterior algorithm, which is more scalable and can be easily paralleled. Extensive experiments on synthetic and real-world datasets demonstrate that BJMD considering the heterogeneity of noise is superior or competitive to the state-of-the-art methods.

LGOct 13, 2017
Sparse Weighted Canonical Correlation Analysis

Wenwen Min, Juan Liu, Shihua Zhang

Given two data matrices $X$ and $Y$, sparse canonical correlation analysis (SCCA) is to seek two sparse canonical vectors $u$ and $v$ to maximize the correlation between $Xu$ and $Yv$. However, classical and sparse CCA models consider the contribution of all the samples of data matrices and thus cannot identify an underlying specific subset of samples. To this end, we propose a novel sparse weighted canonical correlation analysis (SWCCA), where weights are used for regularizing different samples. We solve the $L_0$-regularized SWCCA ($L_0$-SWCCA) using an alternating iterative algorithm. We apply $L_0$-SWCCA to synthetic data and real-world data to demonstrate its effectiveness and superiority compared to related methods. Lastly, we consider also SWCCA with different penalties like LASSO (Least absolute shrinkage and selection operator) and Group LASSO, and extend it for integrating more than three data matrices.

CVJul 28, 2017
Sparse Deep Nonnegative Matrix Factorization

Zhenxing Guo, Shihua Zhang

Nonnegative matrix factorization is a powerful technique to realize dimension reduction and pattern recognition through single-layer data representation learning. Deep learning, however, with its carefully designed hierarchical structure, is able to combine hidden features to form more representative features for pattern recognition. In this paper, we proposed sparse deep nonnegative matrix factorization models to analyze complex data for more accurate classification and better feature interpretation. Such models are designed to learn localized features or generate more discriminative representations for samples in distinct classes by imposing $L_1$-norm penalty on the columns of certain factors. By extending one-layer model into multi-layer one with sparsity, we provided a hierarchical way to analyze big data and extract hidden features intuitively due to nonnegativity. We adopted the Nesterov's accelerated gradient algorithm to accelerate the computing process with the convergence rate of $O(1/k^2)$ after $k$ steps iteration. We also analyzed the computing complexity of our framework to demonstrate their efficiency. To improve the performance of dealing with linearly inseparable data, we also considered to incorporate popular nonlinear functions into this framework and explored their performance. We applied our models onto two benchmarking image datasets, demonstrating our models can achieve competitive or better classification performance and produce intuitive interpretations compared with the typical NMF and competing multi-layer models.

CVJul 25, 2017
A Unified Joint Matrix Factorization Framework for Data Integration

Lihua Zhang, Shihua Zhang

Nonnegative matrix factorization (NMF) is a powerful tool in data exploratory analysis by discovering the hidden features and part-based patterns from high-dimensional data. NMF and its variants have been successfully applied into diverse fields such as pattern recognition, signal processing, data mining, bioinformatics and so on. Recently, NMF has been extended to analyze multiple matrices simultaneously. However, a unified framework is still lacking. In this paper, we introduce a sparse multiple relationship data regularized joint matrix factorization (JMF) framework and two adapted prediction models for pattern recognition and data integration. Next, we present four update algorithms to solve this framework. The merits and demerits of these algorithms are systematically explored. Furthermore, extensive computational experiments using both synthetic data and real data demonstrate the effectiveness of JMF framework and related algorithms on pattern recognition and data mining.

GNSep 21, 2016
Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery

Wenwen Min, Juan Liu, Shihua Zhang

Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with regularized term $λ\|\bm{w}\|_1 + η\bm{w}^T\bm{M}\bm{w}$, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different $\bm{M}$. This framework can be easily solved in a unified manner by a cyclic coordinate descent algorithm which can avoid inverse matrix operation and accelerate the computing speed. However, if those estimated $\bm{w}_i$ and $\bm{w}_j$ have opposite signs, then the traditional network-regularized penalty may not perform well. To address it, we introduce a novel network-regularized sparse LR model with a new penalty $λ\|\bm{w}\|_1 + η|\bm{w}|^T\bm{M}|\bm{w}|$ to consider the difference between the absolute values of the coefficients. And we develop two efficient algorithms to solve it. Finally, we test our methods and compare them with the related ones using simulated and real data to show their efficiency.

LGMar 19, 2016
L0-norm Sparse Graph-regularized SVD for Biclustering

Wenwen Min, Juan Liu, Shihua Zhang

Learning the "blocking" structure is a central challenge for high dimensional data (e.g., gene expression data). Recently, a sparse singular value decomposition (SVD) has been used as a biclustering tool to achieve this goal. However, this model ignores the structural information between variables (e.g., gene interaction graph). Although typical graph-regularized norm can incorporate such prior graph information to get accurate discovery and better interpretability, it fails to consider the opposite effect of variables with different signs. Motivated by the development of sparse coding and graph-regularized norm, we propose a novel sparse graph-regularized SVD as a powerful biclustering tool for analyzing high-dimensional data. The key of this method is to impose two penalties including a novel graph-regularized norm ($|\pmb{u}|\pmb{L}|\pmb{u}|$) and $L_0$-norm ($\|\pmb{u}\|_0$) on singular vectors to induce structural sparsity and enhance interpretability. We design an efficient Alternating Iterative Sparse Projection (AISP) algorithm to solve it. Finally, we apply our method and related ones to simulated and real data to show its efficiency in capturing natural blocking structures.