Difei Cheng

LG
h-index3
5papers
30citations
Novelty59%
AI Score43

5 Papers

LGJul 6, 2022
Careful Seeding for k-Medois Clustering with Incremental k-Means++ Initialization

Difei Cheng, Yunfeng Zhang, Ruinan Jin

K-medoids clustering is a popular variant of k-means clustering and widely used in pattern recognition and machine learning. A main drawback of k-medoids clustering is that an improper initialization can cause it to get trapped in local optima. An improved k-medoids clustering algorithm, called INCKM algorithm, which is the first to apply incremental initialization to k-medoids clustering, was recently proposed to overcome this drawback. The INCKM algorithm requires the construction of a subset of candidate medoids determined by one hyperparameter for initialization, and meanwhile, it always fails when dealing with imbalanced datasets with an incorrect hyperparameter selection. In this paper, we propose a novel k-medoids clustering algorithm, called incremental k-means++ (INCKPP) algorithm, which initializes with a novel incremental manner, attempting to optimally add one new cluster center at each stage through a nonparametric and stochastic k-means++ initialization. The INCKPP algorithm overcomes the difficulty of hyperparameter selection in the INCKM algorithm, improves the clustering performance, and can deal with imbalanced datasets well. However, the INCKPP algorithm is not computationally efficient enough. To deal with this, we further propose an improved INCKPP algorithm, called INCKPPsample algorithm, which improves the clustering efficiency while maintaining the clustering performance of the INCKPP algorithm. Extensive results from experiments on both synthetic and real-world datasets, including imbalanced datasets, illustrate that the proposed algorithms outperforms than the other compared algorithms.

LGMar 1
A Decomposition Framework for Certifiably Optimal Orthogonal Sparse PCA

Difei Cheng, Qiao Hu

Sparse Principal Component Analysis (SPCA) is an important technique for high-dimensional data analysis, improving interpretability by imposing sparsity on principal components. However, existing methods often fail to simultaneously guarantee sparsity, orthogonality, and optimality of the principal components. To address this challenge, this work introduces a novel Sparse Principal Component Analysis (SPCA) algorithm called \textsc{GS-SPCA} (SPCA with Gram-Schmidt Orthogonalization), which simultaneously enforces sparsity, orthogonality, and optimality. However, the original GS-SPCA algorithm is computationally expensive due to the inherent $\ell_0$-norm constraint. To address this issue, we propose two acceleration strategies: First, we combine \textbf{Branch-and-Bound} with the GS-SPCA algorithm. By incorporating this strategy, we are able to obtain $\varepsilon$-optimal solutions with a trade-off between precision and efficiency, significantly improving computational speed. Second, we propose a \textbf{decomposition framework} for efficiently solving \textbf{multiple} principal components. This framework approximates the covariance matrix using a block-diagonal matrix through a thresholding method, reducing the original SPCA problem to a set of block-wise subproblems on approximately block-diagonal matrices.

LGOct 24, 2025
Online AUC Optimization Based on Second-order Surrogate Loss

JunRu Luo, Difei Cheng, Bo Zhang

The Area Under the Curve (AUC) is an important performance metric for classification tasks, particularly in class-imbalanced scenarios. However, minimizing the AUC presents significant challenges due to the non-convex and discontinuous nature of pairwise 0/1 losses, which are difficult to optimize, as well as the substantial memory cost of instance-wise storage, which creates bottlenecks in large-scale applications. To overcome these challenges, we propose a novel second-order surrogate loss based on the pairwise hinge loss, and develop an efficient online algorithm. Unlike conventional approaches that approximate each individual pairwise 0/1 loss term with an instance-wise surrogate function, our approach introduces a new paradigm that directly substitutes the entire aggregated pairwise loss with a surrogate loss function constructed from the first- and second-order statistics of the training data. Theoretically, while existing online AUC optimization algorithms typically achieve an $\mathcal{O}(\sqrt{T})$ regret bound, our method attains a tighter $\mathcal{O}(\ln T)$ bound. Furthermore, we extend the proposed framework to nonlinear settings through a kernel-based formulation. Extensive experiments on multiple benchmark datasets demonstrate the superior efficiency and effectiveness of the proposed second-order surrogate loss in optimizing online AUC performance.

LGApr 17, 2025
Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods

Ruinan Jin, Difei Cheng, Hong Qiao et al.

Stochastic Gradient Descent (SGD) is widely used in machine learning research. Previous convergence analyses of SGD under the vanishing step-size setting typically require Robbins-Monro conditions. However, in practice, a wider variety of step-size schemes are frequently employed, yet existing convergence results remain limited and often rely on strong assumptions. This paper bridges this gap by introducing a novel analytical framework based on a stopping-time method, enabling asymptotic convergence analysis of SGD under more relaxed step-size conditions and weaker assumptions. In the non-convex setting, we prove the almost sure convergence of SGD iterates for step-sizes $ \{ ε_t \}_{t \geq 1} $ satisfying $\sum_{t=1}^{+\infty} ε_t = +\infty$ and $\sum_{t=1}^{+\infty} ε_t^p < +\infty$ for some $p > 2$. Compared with previous studies, our analysis eliminates the global Lipschitz continuity assumption on the loss function and relaxes the boundedness requirements for higher-order moments of stochastic gradients. Building upon the almost sure convergence results, we further establish $L_2$ convergence. These significantly relaxed assumptions make our theoretical results more general, thereby enhancing their applicability in practical scenarios.

LGSep 23, 2021
Fast Density Estimation for Density-based Clustering Methods

Difei Cheng, Ruihang Xu, Bo Zhang et al.

Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning since they can deal with non-hyperspherical clusters and are robustness to handle outliers. However, the runtime of density-based algorithms are heavily dominated by finding fixed-radius near neighbors and calculating the density, which is time-consuming. Meanwhile, the traditional acceleration methods using indexing technique such as KD tree is not effective in processing high-dimensional data. In this paper, we propose a fast region query algorithm named fast principal component analysis pruning (called FPCAP) with the help of the fast principal component analysis technique in conjunction with geometric information provided by principal attributes of the data, which can process high-dimensional data and be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application in density-based clustering methods, FPCAP method was combined with the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. And then, an improved DBSCAN (called IDBSCAN) is obtained, which preserves the advantage of DBSCAN and meanwhile, greatly reduces the computation of redundant distances. Experiments on seven benchmark datasets demonstrate that the proposed algorithm improves the computational efficiency significantly.