OCJan 28
Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed NoiseGaku Omiya, Pierre-Louis Poirion, Akiko Takeda
Randomized subspace methods reduce per-iteration cost; however, in nonconvex optimization, most analyses are expectation-based, and high-probability bounds remain scarce even under sub-Gaussian noise. We first prove that randomized subspace SGD (RS-SGD) admits a high-probability convergence bound under sub-Gaussian noise, achieving the same order of oracle complexity as prior in-expectation results. Motivated by the prevalence of heavy-tailed gradients in modern machine learning, we then propose randomized subspace normalized SGD (RS-NSGD), which integrates direction normalization into subspace updates. Assuming the noise has bounded $p$-th moments, we establish both in-expectation and high-probability convergence guarantees, and show that RS-NSGD can achieve better oracle complexity than full-dimensional normalized SGD.
LGJun 8, 2025Code
Modified K-means Algorithm with Local Optimality GuaranteesMingyi Li, Michael R. Metel, Akiko Takeda
The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on its ability to achieve a globally optimal solution, there still lacks a rigorous analysis of its local optimality guarantees. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete sense, with the same computational complexity as the original K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice, while our proposed methods provide improved locally optimal solutions with reduced clustering loss. Our code is available at https://github.com/lmingyi/LO-K-means.
22.5OCMay 1
Randomized Subspace Nesterov Accelerated GradientGaku Omiya, Pierre-Louis Poirion, Akiko Takeda
Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is attractive in forward-mode automatic differentiation and communication-limited settings. While Nesterov acceleration is well understood for full-gradient and coordinate-based methods, obtaining accelerated methods for general subspace sketches that use only projected-gradient information and can improve over full-dimensional Nesterov acceleration in oracle complexity is technically nontrivial. We develop randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization under matrix smoothness and generic sketch moment assumptions. The key technical ingredient is a three-sequence formulation tailored to matrix smoothness, which recovers the corresponding classical Nesterov methods in the full-dimensional case. The resulting theory establishes accelerated oracle-complexity guarantees and makes explicit how matrix smoothness and the sketch distribution enter the complexity. It also provides a unified basis for comparing sketch families and identifying when randomized-subspace acceleration improves over full-dimensional Nesterov acceleration in oracle complexity.
OCFeb 6, 2024
A Framework for Bilevel Optimization on Riemannian ManifoldsAndi Han, Bamdev Mishra, Pratik Jawanpuria et al.
Bilevel optimization has gained prominence in various applications. In this study, we introduce a framework for solving bilevel optimization problems, where the variables in both the lower and upper levels are constrained on Riemannian manifolds. We present several hypergradient estimation strategies on manifolds and analyze their estimation errors. Furthermore, we provide comprehensive convergence and complexity analyses for the proposed hypergradient descent algorithm on manifolds. We also extend our framework to encompass stochastic bilevel optimization and incorporate the use of general retraction. The efficacy of the proposed framework is demonstrated through several applications.
MLMay 25, 2025
On the Role of Label Noise in the Feature Learning ProcessAndi Han, Wei Huang, Zhanpeng Zhou et al.
Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise, and rigorously analyze the training dynamics of a two-layer convolutional neural network under this data setup, along with the presence of label noise. Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples (i.e., samples without label noise) while ignoring the noisy ones (i.e., samples with noisy labels). During this stage, the model learns the signal from the clean samples, which generalizes well on unseen data. In Stage II, as the training loss converges, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples. Eventually, the model memorizes the noise present in the noisy samples and degrades its generalization ability. Furthermore, our analysis provides a theoretical basis for two widely used techniques for tackling label noise: early stopping and sample selection. Experiments on both synthetic and real-world setups validate our theory.
OCDec 29, 2024
Zeroth-Order Methods for Nonconvex Stochastic Problems with Decision-Dependent DistributionsYuya Hikima, Akiko Takeda
In this study, we consider an optimization problem with uncertainty dependent on decision variables, which has recently attracted attention due to its importance in machine learning and pricing applications. In this problem, the gradient of the objective function cannot be obtained explicitly because the decision-dependent distribution is unknown. Therefore, several zeroth-order methods have been proposed, which obtain noisy objective values by sampling and update the iterates. Although these existing methods have theoretical convergence for optimization problems with decision-dependent uncertainty, they require strong assumptions about the function and distribution or exhibit large variances in their gradient estimators. To overcome these issues, we propose two zeroth-order methods under mild assumptions. First, we develop a zeroth-order method with a new one-point gradient estimator including a variance reduction parameter. The proposed method updates the decision variables while adjusting the variance reduction parameter. Second, we develop a zeroth-order method with a two-point gradient estimator. There are situations where only one-point estimators can be used, but if both one-point and two-point estimators are available, it is more practical to use the two-point estimator. As theoretical results, we show the convergence of our methods to stationary points and provide the worst-case iteration and sample complexity analysis. Our simulation experiments with real data on a retail service application show that our methods output solutions with lower objective values than the conventional zeroth-order methods.
OCMay 18, 2025
Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold MethodAndi Han, Pierre-Louis Poirion, Akiko Takeda
Optimization with orthogonality constraints frequently arises in various fields such as machine learning. Riemannian optimization offers a powerful framework for solving these problems by equipping the constraint set with a Riemannian manifold structure and performing optimization intrinsically on the manifold. This approach typically involves computing a search direction in the tangent space and updating variables via a retraction operation. However, as the size of the variables increases, the computational cost of the retraction can become prohibitively high, limiting the applicability of Riemannian optimization to large-scale problems. To address this challenge and enhance scalability, we propose a novel approach that restricts each update on a random submanifold, thereby significantly reducing the per-iteration complexity. We introduce two sampling strategies for selecting the random submanifolds and theoretically analyze the convergence of the proposed methods. We provide convergence results for general nonconvex functions and functions that satisfy Riemannian Polyak-Lojasiewicz condition as well as for stochastic optimization settings. Additionally, we demonstrate how our approach can be generalized to quotient manifolds derived from the orthogonal manifold. Extensive experiments verify the benefits of the proposed method, across a wide variety of problems.
LGJun 4, 2024
SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretrainingAndi Han, Jiaxiang Li, Wei Huang et al.
Large language models (LLMs) have shown impressive capabilities across various tasks. However, training LLMs from scratch requires significant computational power and extensive memory capacity. Recent studies have explored low-rank structures on weights for efficient fine-tuning in terms of parameters and memory, either through low-rank adaptation or factorization. While effective for fine-tuning, low-rank structures are generally less suitable for pretraining because they restrict parameters to a low-dimensional subspace. In this work, we propose to parameterize the weights as a sum of low-rank and sparse matrices for pretraining, which we call SLTrain. The low-rank component is learned via matrix factorization, while for the sparse component, we employ a simple strategy of uniformly selecting the sparsity support at random and learning only the non-zero entries with the fixed support. While being simple, the random fixed-support sparse learning strategy significantly enhances pretraining when combined with low-rank learning. Our results show that SLTrain adds minimal extra parameters and memory costs compared to pretraining with low-rank parameterization, yet achieves substantially better performance, which is comparable to full-rank training. Remarkably, when combined with quantization and per-layer updates, SLTrain can reduce memory requirements by up to 73% when pretraining the LLaMA 7B model.
OCMay 28, 2021
A Gradient Method for Multilevel OptimizationRyo Sato, Mirai Tanaka, Akiko Takeda
Although application examples of multilevel optimization have already been discussed since the 1990s, the development of solution methods was almost limited to bilevel cases due to the difficulty of the problem. In recent years, in machine learning, Franceschi et al. have proposed a method for solving bilevel optimization problems by replacing their lower-level problems with the $T$ steepest descent update equations with some prechosen iteration number $T$. In this paper, we have developed a gradient-based algorithm for multilevel optimization with $n$ levels based on their idea and proved that our reformulation asymptotically converges to the original multilevel problem. As far as we know, this is one of the first algorithms with some theoretical guarantee for multilevel optimization. Numerical experiments show that a trilevel hyperparameter learning model considering data poisoning produces more stable prediction results than an existing bilevel hyperparameter learning model in noisy data settings.
LGMar 11, 2021
BODAME: Bilevel Optimization for Defense Against Model ExtractionYuto Mori, Atsushi Nitanda, Akiko Takeda
Model extraction attacks have become serious issues for service providers using machine learning. We consider an adversarial setting to prevent model extraction under the assumption that attackers will make their best guess on the service provider's model using query accesses, and propose to build a surrogate model that significantly keeps away the predictions of the attacker's model from those of the true model. We formulate the problem as a non-convex constrained bilevel optimization problem and show that for kernel models, it can be transformed into a non-convex 1-quadratically constrained quadratic program with a polynomial-time algorithm to find the global optimum. Moreover, we give a tractable transformation and an algorithm for more complicated models that are learned by using stochastic gradient descent-based algorithms. Numerical experiments show that the surrogate model performs well compared with existing defense models when the difference between the attacker's and service provider's distributions is large. We also empirically confirm the generalization ability of the surrogate model.
LGMay 31, 2020
Theory and Algorithms for Shapelet-based Multiple-Instance LearningDaiki Suehiro, Kohei Hatano, Eiji Takimoto et al.
We propose a new formulation of Multiple-Instance Learning (MIL), in which a unit of data consists of a set of instances called a bag. The goal is to find a good classifier of bags based on the similarity with a "shapelet" (or pattern), where the similarity of a bag with a shapelet is the maximum similarity of instances in the bag. In previous work, some of the training instances are chosen as shapelets with no theoretical justification. In our formulation, we use all possible, and thus infinitely many shapelets, resulting in a richer class of classifiers. We show that the formulation is tractable, that is, it can be reduced through Linear Programming Boosting (LPBoost) to Difference of Convex (DC) programs of finite (actually polynomial) size. Our theoretical result also gives justification to the heuristics of some of the previous work. The time complexity of the proposed algorithm highly depends on the size of the set of all instances in the training sample. To apply to the data containing a large number of instances, we also propose a heuristic option of the algorithm without the loss of the theoretical guarantee. Our empirical study demonstrates that our algorithm uniformly works for Shapelet Learning tasks on time-series classification and various MIL tasks with comparable accuracy to the existing methods. Moreover, we show that the proposed heuristics allow us to achieve the result with reasonable computational time.
MLFeb 16, 2020
Convex Fairness Constrained Model Using Causal Effect EstimatorsHikaru Ogura, Akiko Takeda
Recent years have seen much research on fairness in machine learning. Here, mean difference (MD) or demographic parity is one of the most popular measures of fairness. However, MD quantifies not only discrimination but also explanatory bias which is the difference of outcomes justified by explanatory features. In this paper, we devise novel models, called FairCEEs, which remove discrimination while keeping explanatory bias. The models are based on estimators of causal effect utilizing propensity score analysis. We prove that FairCEEs with the squared loss theoretically outperform a naive MD constraint model. We provide an efficient algorithm for solving FairCEEs in regression and binary classification tasks. In our experiment on synthetic and real-world data in these two tasks, FairCEEs outperformed an existing model that considers explanatory bias in specific cases.
LGNov 20, 2018
Multiple-Instance Learning by Boosting Infinitely Many Shapelet-based ClassifiersDaiki Suehiro, Kohei Hatano, Eiji Takimoto et al.
We propose a new formulation of Multiple-Instance Learning (MIL). In typical MIL settings, a unit of data is given as a set of instances called a bag and the goal is to find a good classifier of bags based on similarity from a single or finitely many "shapelets" (or patterns), where the similarity of the bag from a shapelet is the maximum similarity of instances in the bag. Classifiers based on a single shapelet are not sufficiently strong for certain applications. Additionally, previous work with multiple shapelets has heuristically chosen some of the instances as shapelets with no theoretical guarantee of its generalization ability. Our formulation provides a richer class of the final classifiers based on infinitely many shapelets. We provide an efficient algorithm for the new formulation, in addition to generalization bound. Our empirical study demonstrates that our approach is effective not only for MIL tasks but also for Shapelet Learning for time-series classification.
APJun 15, 2018
Robust Bayesian Model Selection for Variable Clustering with the Gaussian Graphical ModelDaniel Andrade, Akiko Takeda, Kenji Fukumizu
Variable clustering is important for explanatory analysis. However, only few dedicated methods for variable clustering with the Gaussian graphical model have been proposed. Even more severe, small insignificant partial correlations due to noise can dramatically change the clustering result when evaluating for example with the Bayesian Information Criteria (BIC). In this work, we try to address this issue by proposing a Bayesian model that accounts for negligible small, but not necessarily zero, partial correlations. Based on our model, we propose to evaluate a variable clustering result using the marginal likelihood. To address the intractable calculation of the marginal likelihood, we propose two solutions: one based on a variational approximation, and another based on MCMC. Experiments on simulated data shows that the proposed method is similarly accurate as BIC in the no noise setting, but considerably more accurate when there are noisy partial correlations. Furthermore, on real data the proposed method provides clustering results that are intuitively sensible, which is not always the case when using BIC or its extensions.
OCApr 19, 2018
A refined convergence analysis of pDCA$_e$ with applications to simultaneous sparse recovery and outlier detectionTianxiang Liu, Ting Kei Pong, Akiko Takeda
We consider the problem of minimizing a difference-of-convex (DC) function, which can be written as the sum of a smooth convex function with Lipschitz gradient, a proper closed convex function and a continuous possibly nonsmooth concave function. We refine the convergence analysis in [38] for the proximal DC algorithm with extrapolation (pDCA$_e$) and show that the whole sequence generated by the algorithm is convergent when the objective is level-bounded, {\em without} imposing differentiability assumptions in the concave part. Our analysis is based on a new potential function and we assume such a function is a Kurdyka-Łojasiewicz (KL) function. We also establish a relationship between our KL assumption and the one used in [38]. Finally, we demonstrate how the pDCA$_e$ can be applied to a class of simultaneous sparse recovery and outlier detection problems arising from robust compressed sensing in signal processing and least trimmed squares regression in statistics. Specifically, we show that the objectives of these problems can be written as level-bounded DC functions whose concave parts are {\em typically nonsmooth}. Moreover, for a large class of loss functions and regularizers, the KL exponent of the corresponding potential function are shown to be 1/2, which implies that the pDCA$_e$ is locally linearly convergent when applied to these problems. Our numerical experiments show that the pDCA$_e$ usually outperforms the proximal DC algorithm with nonmonotone linesearch [24, Appendix A] in both CPU time and solution quality for this particular application.
MLNov 20, 2017
Optimistic Robust Optimization With Applications To Machine LearningMatthew Norton, Akiko Takeda, Alexander Mafusalov
Robust Optimization has traditionally taken a pessimistic, or worst-case viewpoint of uncertainty which is motivated by a desire to find sets of optimal policies that maintain feasibility under a variety of operating conditions. In this paper, we explore an optimistic, or best-case view of uncertainty and show that it can be a fruitful approach. We show that these techniques can be used to address a wide variety of problems. First, we apply our methods in the context of robust linear programming, providing a method for reducing conservatism in intuitive ways that encode economically realistic modeling assumptions. Second, we look at problems in machine learning and find that this approach is strongly connected to the existing literature. Specifically, we provide a new interpretation for popular sparsity inducing non-convex regularization schemes. Additionally, we show that successful approaches for dealing with outliers and noise can be interpreted as optimistic robust optimization problems. Although many of the problems resulting from our approach are non-convex, we find that DCA or DCA-like optimization approaches can be intuitive and efficient.
OCOct 16, 2017
A successive difference-of-convex approximation method for a class of nonconvex nonsmooth optimization problemsTianxiang Liu, Ting Kei Pong, Akiko Takeda
We consider a class of nonconvex nonsmooth optimization problems whose objective is the sum of a smooth function and a finite number of nonnegative proper closed possibly nonsmooth functions (whose proximal mappings are easy to compute), some of which are further composed with linear maps. This kind of problems arises naturally in various applications when different regularizers are introduced for inducing simultaneous structures in the solutions. Solving these problems, however, can be challenging because of the coupled nonsmooth functions: the corresponding proximal mapping can be hard to compute so that standard first-order methods such as the proximal gradient algorithm cannot be applied efficiently. In this paper, we propose a successive difference-of-convex approximation method for solving this kind of problems. In this algorithm, we approximate the nonsmooth functions by their Moreau envelopes in each iteration. Making use of the simple observation that Moreau envelopes of nonnegative proper closed functions are continuous {\em difference-of-convex} functions, we can then approximately minimize the approximation function by first-order methods with suitable majorization techniques. These first-order methods can be implemented efficiently thanks to the fact that the proximal mapping of {\em each} nonsmooth function is easy to compute. Under suitable assumptions, we prove that the sequence generated by our method is bounded and any accumulation point is a stationary point of the objective. We also discuss how our method can be applied to concrete applications such as nonconvex fused regularized optimization problems and simultaneously structured matrix optimization problems, and illustrate the performance numerically for these two specific applications.
LGSep 5, 2017
Boosting the kernelized shapelets: Theory and algorithms for local featuresDaiki Suehiro, Kohei Hatano, Eiji Takimoto et al.
We consider binary classification problems using local features of objects. One of motivating applications is time-series classification, where features reflecting some local closeness measure between a time series and a pattern sequence called shapelet are useful. Despite the empirical success of such approaches using local features, the generalization ability of resulting hypotheses is not fully understood and previous work relies on a bunch of heuristics. In this paper, we formulate a class of hypotheses using local features, where the richness of features is controlled by kernels. We derive generalization bounds of sparse ensembles over the class which is exponentially better than a standard analysis in terms of the number of possible local features. The resulting optimization problem is well suited to the boosting approach and the weak learning problem is formulated as a DC program, for which practical algorithms exist. In preliminary experiments on time-series data sets, our method achieves competitive accuracy with the state-of-the-art algorithms with small parameter-tuning cost.
MLMar 9, 2017
Trimmed Density Ratio EstimationSong Liu, Akiko Takeda, Taiji Suzuki et al.
Density ratio estimation is a vital tool in both machine learning and statistical community. However, due to the unbounded nature of density ratio, the estimation procedure can be vulnerable to corrupted data points, which often pushes the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator under high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator.
MLSep 3, 2014
Breakdown Point of Robust Support Vector MachineTakafumi Kanamori, Shuhei Fujiwara, Akiko Takeda
The support vector machine (SVM) is one of the most successful learning methods for solving classification problems. Despite its popularity, SVM has a serious drawback, that is sensitivity to outliers in training samples. The penalty on misclassification is defined by a convex loss called the hinge loss, and the unboundedness of the convex loss causes the sensitivity to outliers. To deal with outliers, robust variants of SVM have been proposed, such as the robust outlier detection algorithm and an SVM with a bounded loss called the ramp loss. In this paper, we propose a robust variant of SVM and investigate its robustness in terms of the breakdown point. The breakdown point is a robustness measure that is the largest amount of contamination such that the estimated classifier still gives information about the non-contaminated data. The main contribution of this paper is to show an exact evaluation of the breakdown point for the robust SVM. For learning parameters such as the regularization parameter in our algorithm, we derive a simple formula that guarantees the robustness of the classifier. When the learning parameters are determined with a grid search using cross validation, our formula works to reduce the number of candidate search points. The robustness of the proposed method is confirmed in numerical experiments. We show that the statistical properties of the robust SVM are well explained by a theoretical analysis of the breakdown point.
LGJun 18, 2012
A Unified Robust Classification ModelAkiko Takeda, Hiroyuki Mitsugi, Takafumi Kanamori
A wide variety of machine learning algorithms such as support vector machine (SVM), minimax probability machine (MPM), and Fisher discriminant analysis (FDA), exist for binary classification. The purpose of this paper is to provide a unified classification model that includes the above models through a robust optimization approach. This unified model has several benefits. One is that the extensions and improvements intended for SVM become applicable to MPM and FDA, and vice versa. Another benefit is to provide theoretical results to above learning methods at once by dealing with the unified model. We give a statistical interpretation of the unified classification model and propose a non-convex optimization algorithm that can be applied to non-convex variants of existing learning methods.
MLApr 30, 2012
A Conjugate Property between Loss Functions and Uncertainty Sets in Classification ProblemsTakafumi Kanamori, Akiko Takeda, Taiji Suzuki
In binary classification problems, mainly two approaches have been proposed; one is loss function approach and the other is uncertainty set approach. The loss function approach is applied to major learning algorithms such as support vector machine (SVM) and boosting methods. The loss function represents the penalty of the decision function on the training samples. In the learning algorithm, the empirical mean of the loss function is minimized to obtain the classifier. Against a backdrop of the development of mathematical programming, nowadays learning algorithms based on loss functions are widely applied to real-world data analysis. In addition, statistical properties of such learning algorithms are well-understood based on a lots of theoretical works. On the other hand, the learning method using the so-called uncertainty set is used in hard-margin SVM, mini-max probability machine (MPM) and maximum margin MPM. In the learning algorithm, firstly, the uncertainty set is defined for each binary label based on the training samples. Then, the best separating hyperplane between the two uncertainty sets is employed as the decision function. This is regarded as an extension of the maximum-margin approach. The uncertainty set approach has been studied as an application of robust optimization in the field of mathematical programming. The statistical properties of learning algorithms with uncertainty sets have not been intensively studied. In this paper, we consider the relation between the above two approaches. We point out that the uncertainty set is described by using the level set of the conjugate of the loss function. Based on such relation, we study statistical properties of learning algorithms using uncertainty sets.