MLJun 3, 2019
Robust Gaussian Process Regression for Real-Time High Precision GPS Signal EnhancementMing Lin, Xiaomin Song, Qi Qian et al.
Satellite-based positioning system such as GPS often suffers from large amount of noise that degrades the positioning accuracy dramatically especially in real-time applications. In this work, we consider a data-mining approach to enhance the GPS signal. We build a large-scale high precision GPS receiver grid system to collect real-time GPS signals for training. The Gaussian Process (GP) regression is chosen to model the vertical Total Electron Content (vTEC) distribution of the ionosphere of the Earth. Our experiments show that the noise in the real-time GPS signals often exceeds the breakdown point of the conventional robust regression methods resulting in sub-optimal system performance. We propose a three-step approach to address this challenge. In the first step we perform a set of signal validity tests to separate the signals into clean and dirty groups. In the second step, we train an initial model on the clean signals and then reweigting the dirty signals based on the residual error. A final model is retrained on both the clean signals and the reweighted dirty signals. In the theoretical analysis, we prove that the proposed three-step approach is able to tolerate much higher noise level than the vanilla robust regression methods if two reweighting rules are followed. We validate the superiority of the proposed method in our real-time high precision positioning system against several popular state-of-the-art robust regression methods. Our method achieves centimeter positioning accuracy in the benchmark region with probability $78.4\%$ , outperforming the second best baseline method by a margin of $8.3\%$. The benchmark takes 6 hours on 20,000 CPU cores or 14 years on a single CPU.
MLJan 30, 2019
Which Factorization Machine Modeling is Better: A Theoretical Answer with Optimal GuaranteeMing Lin, Shuang Qiu, Jieping Ye et al.
Factorization machine (FM) is a popular machine learning model to capture the second order feature interactions. The optimal learning guarantee of FM and its generalized version is not yet developed. For a rank $k$ generalized FM of $d$ dimensional input, the previous best known sampling complexity is $\mathcal{O}[k^{3}d\cdot\mathrm{polylog}(kd)]$ under Gaussian distribution. This bound is sub-optimal comparing to the information theoretical lower bound $\mathcal{O}(kd)$. In this work, we aim to tighten this bound towards optimal and generalize the analysis to sub-gaussian distribution. We prove that when the input data satisfies the so-called $τ$-Moment Invertible Property, the sampling complexity of generalized FM can be improved to $\mathcal{O}[k^{2}d\cdot\mathrm{polylog}(kd)/τ^{2}]$. When the second order self-interaction terms are excluded in the generalized FM, the bound can be improved to the optimal $\mathcal{O}[kd\cdot\mathrm{polylog}(kd)]$ up to the logarithmic factors. Our analysis also suggests that the positive semi-definite constraint in the conventional FM is redundant as it does not improve the sampling complexity while making the model difficult to optimize. We evaluate our improved FM model in real-time high precision GPS signal calibration task to validate its superiority.
LGDec 5, 2018
RobustSTL: A Robust Seasonal-Trend Decomposition Algorithm for Long Time SeriesQingsong Wen, Jingkun Gao, Xiaomin Song et al.
Decomposing complex time series into trend, seasonality, and remainder components is an important task to facilitate time series anomaly detection and forecasting. Although numerous methods have been proposed, there are still many time series characteristics exhibiting in real-world data which are not addressed properly, including 1) ability to handle seasonality fluctuation and shift, and abrupt change in trend and reminder; 2) robustness on data with anomalies; 3) applicability on time series with long seasonality period. In the paper, we propose a novel and generic time series decomposition algorithm to address these challenges. Specifically, we extract the trend component robustly by solving a regression problem using the least absolute deviations loss with sparse regularization. Based on the extracted trend, we apply the the non-local seasonal filtering to extract the seasonality component. This process is repeated until accurate decomposition is obtained. Experiments on different synthetic and real-world time series datasets demonstrate that our method outperforms existing solutions.
OCJul 17, 2018
Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep LearningHao Yu, Sen Yang, Shenghuo Zhu
In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradient in parallel, aggregates all gradients in a single server to obtain the average, and update each worker's local model using a SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) (McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, tremendous experimental works have verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
LGMay 25, 2018
Large-scale Distance Metric Learning with UncertaintyQi Qian, Jiasheng Tang, Hao Li et al.
Distance metric learning (DML) has been studied extensively in the past decades for its superior performance with distance-based algorithms. Most of the existing methods propose to learn a distance metric with pairwise or triplet constraints. However, the number of constraints is quadratic or even cubic in the number of the original examples, which makes it challenging for DML to handle the large-scale data set. Besides, the real-world data may contain various uncertainty, especially for the image data. The uncertainty can mislead the learning procedure and cause the performance degradation. By investigating the image data, we find that the original data can be observed from a small set of clean latent examples with different distortions. In this work, we propose the margin preserving metric learning framework to learn the distance metric and latent examples simultaneously. By leveraging the ideal properties of latent examples, the training efficiency can be improved significantly while the learned metric also becomes robust to the uncertainty in the original data. Furthermore, we can show that the metric is learned from latent examples only, but it can preserve the large margin property even for the original data. The empirical study on the benchmark image data sets demonstrates the efficacy and efficiency of the proposed method.
LGMay 21, 2018
Learning with Non-Convex Truncated Losses by SGDYi Xu, Shenghuo Zhu, Sen Yang et al.
Learning with a {\it convex loss} function has been a dominating paradigm for many years. It remains an interesting question how non-convex loss functions help improve the generalization of learning with broad applicability. In this paper, we study a family of objective functions formed by truncating traditional loss functions, which is applicable to both shallow learning and deep learning. Truncating loss functions has potential to be less vulnerable and more robust to large noise in observations that could be adversarial. More importantly, it is a generic technique without assuming the knowledge of noise distribution. To justify non-convex learning with truncated losses, we establish excess risk bounds of empirical risk minimization based on truncated losses for heavy-tailed output, and statistical error of an approximate stationary point found by stochastic gradient descent (SGD) method. Our experiments for shallow and deep learning for regression with outliers, corrupted data and heavy-tailed noise further justify the proposed method.
LGMay 19, 2018
Robust Optimization over Multiple DomainsQi Qian, Shenghuo Zhu, Jiasheng Tang et al.
In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario where each domain can have the corresponding model, multiple domains (i.e., applications/users) may share the same machine learning model due to maintenance loads in cloud computing services. For example, a digit-recognition model should be applicable to hand-written digits, house numbers, car plates, etc. Therefore, an ideal model for cloud computing has to perform well at each applicable domain. To address this new challenge from cloud computing, we develop a framework of robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized to the adversarial distribution over multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with the stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To our best knowledge, we first study the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that the robustness of the framework and the convergence rate can be further enhanced by appropriate regularizers over the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digits recognition tasks verifies the effectiveness and efficiency of the proposed framework.
LGMay 8, 2018
Multinomial Logit Bandit with Linear Utility FunctionsMingdong Ou, Nan Li, Shenghuo Zhu et al.
Multinomial logit bandit is a sequential subset selection problem which arises in many applications. In each round, the player selects a $K$-cardinality subset from $N$ candidate items, and receives a reward which is governed by a {\it multinomial logit} (MNL) choice model considering both item utility and substitution property among items. The player's objective is to dynamically learn the parameters of MNL model and maximize cumulative reward over a finite horizon $T$. This problem faces the exploration-exploitation dilemma, and the involved combinatorial nature makes it non-trivial. In recent years, there have developed some algorithms by exploiting specific characteristics of the MNL model, but all of them estimate the parameters of MNL model separately and incur a regret no better than $\tilde{O}\big(\sqrt{NT}\big)$ which is not preferred for large candidate set size $N$. In this paper, we consider the {\it linear utility} MNL choice model whose item utilities are represented as linear functions of $d$-dimension item features, and propose an algorithm, titled {\bf LUMB}, to exploit the underlying structure. It is proven that the proposed algorithm achieves $\tilde{O}\big(dK\sqrt{T}\big)$ regret which is free of candidate set size. Experiments show the superiority of the proposed algorithm.
CVJul 24, 2017
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMMCong Leng, Hao Li, Shenghuo Zhu et al.
Although deep learning models are highly effective for various learning tasks, their high computational costs prohibit the deployment to scenarios where either memory or computational resources are limited. In this paper, we focus on compressing and accelerating deep models with network weights represented by very small numbers of bits, referred to as extremely low bit neural network. We model this problem as a discretely constrained optimization problem. Borrowing the idea from Alternating Direction Method of Multipliers (ADMM), we decouple the continuous parameters from the discrete constraints of network, and cast the original hard problem into several subproblems. We propose to solve these subproblems using extragradient and iterative quantization algorithms that lead to considerably faster convergency compared to conventional optimization methods. Extensive experiments on image recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches when coming to extremely low bit neural network.
LGDec 6, 2015
Similarity Learning via Adaptive Regression and Its Application to Image RetrievalQi Qian, Inci M. Baytas, Rong Jin et al.
We study the problem of similarity learning and its application to image retrieval with large-scale data. The similarity between pairs of images can be measured by the distances between their high dimensional representations, and the problem of learning the appropriate similarity is often addressed by distance metric learning. However, distance metric learning requires the learned metric to be a PSD matrix, which is computational expensive and not necessary for retrieval ranking problem. On the other hand, the bilinear model is shown to be more flexible for large-scale image retrieval task, hence, we adopt it to learn a matrix for estimating pairwise similarities under the regression framework. By adaptively updating the target matrix in regression, we can mimic the hinge loss, which is more appropriate for similarity learning problem. Although the regression problem can have the closed-form solution, the computational cost can be very expensive. The computational challenges come from two aspects: the number of images can be very large and image features have high dimensionality. We address the first challenge by compressing the data by a randomized algorithm with the theoretical guarantee. For the high dimensional issue, we address it by taking low rank assumption and applying alternating method to obtain the partial matrix, which has a global optimal solution. Empirical studies on real world image datasets (i.e., Caltech and ImageNet) demonstrate the effectiveness and efficiency of the proposed method.
LGSep 15, 2015
Towards Making High Dimensional Distance Metric Learning PracticalQi Qian, Rong Jin, Lijun Zhang et al.
In this work, we study distance metric learning (DML) for high dimensional data. A typical approach for DML with high dimensional data is to perform the dimensionality reduction first before learning the distance metric. The main shortcoming of this approach is that it may result in a suboptimal solution due to the subspace removed by the dimensionality reduction method. In this work, we present a dual random projection frame for DML with high dimensional data that explicitly addresses the limitation of dimensionality reduction for DML. The key idea is to first project all the data points into a low dimensional space by random projection, and compute the dual variables using the projected vectors. It then reconstructs the distance metric in the original space using the estimated dual variables. The proposed method, on one hand, enjoys the light computation of random projection, and on the other hand, alleviates the limitation of most dimensionality reduction methods. We verify both empirically and theoretically the effectiveness of the proposed algorithm for high dimensional DML.
NAMay 4, 2015
An Explicit Sampling Dependent Spectral Error Bound for Column Subset SelectionTianbao Yang, Lijun Zhang, Rong Jin et al.
In this paper, we consider the problem of column subset selection. We present a novel analysis of the spectral norm reconstruction for a simple randomized algorithm and establish a new bound that depends explicitly on the sampling probabilities. The sampling dependent error bound (i) allows us to better understand the tradeoff in the reconstruction error due to sampling probabilities, (ii) exhibits more insights than existing error bounds that exploit specific probability distributions, and (iii) implies better sampling distributions. In particular, we show that a sampling distribution with probabilities proportional to the square root of the statistical leverage scores is always better than uniform sampling and is better than leverage-based sampling when the statistical leverage scores are very nonuniform. And by solving a constrained optimization problem related to the error bound with an efficient bisection search we are able to achieve better performance than using either the leverage-based distribution or that proportional to the square root of the statistical leverage scores. Numerical simulations demonstrate the benefits of the new sampling distributions for low-rank matrix approximation and least square approximation compared to state-of-the art algorithms.
LGApr 15, 2015
Theory of Dual-sparse Regularized Randomized ReductionTianbao Yang, Lijun Zhang, Rong Jin et al.
In this paper, we study randomized reduction methods, which reduce high-dimensional features into low-dimensional space by randomized methods (e.g., random projection, random hashing), for large-scale high-dimensional classification. Previous theoretical results on randomized reduction methods hinge on strong assumptions about the data, e.g., low rank of the data matrix or a large separable margin of classification, which hinder their applications in broad domains. To address these limitations, we propose dual-sparse regularized randomized reduction methods that introduce a sparse regularizer into the reduced dual problem. Under a mild condition that the original dual solution is a (nearly) sparse vector, we show that the resulting dual solution is close to the original dual solution and concentrates on its support set. In numerical experiments, we present an empirical study to support the analysis and we also present a novel application of the dual-sparse regularized randomized reduction methods to reducing the communication cost of distributed learning from large-scale high-dimensional data.
NAAug 13, 2014
On Data Preconditioning for Regularized Loss MinimizationTianbao Yang, Rong Jin, Shenghuo Zhu et al.
In this work, we study data preconditioning, a well-known and long-existing technique, for boosting the convergence of first-order methods for regularized loss minimization. It is well understood that the condition number of the problem, i.e., the ratio of the Lipschitz constant to the strong convexity modulus, has a harsh effect on the convergence of the first-order optimization methods. Therefore, minimizing a small regularized loss for achieving good generalization performance, yielding an ill conditioned problem, becomes the bottleneck for big data problems. We provide a theory on data preconditioning for regularized loss minimization. In particular, our analysis exhibits an appropriate data preconditioner and characterizes the conditions on the loss function and on the data under which data preconditioning can reduce the condition number and therefore boost the convergence for minimizing the regularized loss. To make the data preconditioning practically useful, we endeavor to employ and analyze a random sampling approach to efficiently compute the preconditioned data. The preliminary experiments validate our theory.
LGMar 22, 2014
CUR Algorithm with Incomplete Matrix ObservationRong Jin, Shenghuo Zhu
CUR matrix decomposition is a randomized algorithm that can efficiently compute the low rank approximation for a given rectangle matrix. One limitation with the existing CUR algorithms is that they require an access to the full matrix A for computing U. In this work, we aim to alleviate this limitation. In particular, we assume that besides having an access to randomly sampled d rows and d columns from A, we only observe a subset of randomly sampled entries from A. Our goal is to develop a low rank approximation algorithm, similar to CUR, based on (i) randomly sampled rows and columns from A, and (ii) randomly sampled entries from A. The proposed algorithm is able to perfectly recover the target matrix A with only O(rn log n) number of observed entries. In addition, instead of having to solve an optimization problem involved trace norm regularization, the proposed algorithm only needs to solve a standard regression problem. Finally, unlike most matrix completion theories that hold only when the target matrix is of low rank, we show a strong guarantee for the proposed algorithm even when the target matrix is not low rank.
CVFeb 3, 2014
Fine-Grained Visual Categorization via Multi-stage Metric LearningQi Qian, Rong Jin, Shenghuo Zhu et al.
Fine-grained visual categorization (FGVC) is to categorize objects into subordinate classes instead of basic classes. One major challenge in FGVC is the co-occurrence of two issues: 1) many subordinate classes are highly correlated and are difficult to distinguish, and 2) there exists the large intra-class variation (e.g., due to object pose). This paper proposes to explicitly address the above two issues via distance metric learning (DML). DML addresses the first issue by learning an embedding so that data points from the same class will be pulled together while those from different classes should be pushed apart from each other; and it addresses the second issue by allowing the flexibility that only a portion of the neighbors (not all data points) from the same class need to be pulled together. However, feature representation of an image is often high dimensional, and DML is known to have difficulty in dealing with high dimensional feature vectors since it would require $\mathcal{O}(d^2)$ for storage and $\mathcal{O}(d^3)$ for optimization. To this end, we proposed a multi-stage metric learning framework that divides the large-scale high dimensional learning problem to a series of simple subproblems, achieving $\mathcal{O}(d)$ computational complexity. The empirical study with FVGC benchmark datasets verifies that our method is both effective and efficient compared to the state-of-the-art FGVC approaches.
DCDec 4, 2013
Analysis of Distributed Stochastic Dual Coordinate AscentTianbao Yang, Shenghuo Zhu, Rong Jin et al.
In \citep{Yangnips13}, the author presented distributed stochastic dual coordinate ascent (DisDCA) algorithms for solving large-scale regularized loss minimization. Extraordinary performances have been observed and reported for the well-motivated updates, as referred to the practical updates, compared to the naive updates. However, no serious analysis has been provided to understand the updates and therefore the convergence rates. In the paper, we bridge the gap by providing a theoretical analysis of the convergence rates of the practical DisDCA algorithm. Our analysis helped by empirical studies has shown that it could yield an exponential speed-up in the convergence by increasing the number of dual updates at each iteration. This result justifies the superior performances of the practical DisDCA as compared to the naive variant. As a byproduct, our analysis also reveals the convergence behavior of the one-communication DisDCA.
LGMay 9, 2013
Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence ratesShenghuo Zhu
With a weighting scheme proportional to t, a traditional stochastic gradient descent (SGD) algorithm achieves a high probability convergence rate of O(κ/T) for strongly convex functions, instead of O(κ ln(T)/T). We also prove that an accelerated SGD algorithm also achieves a rate of O(κ/T).
LGMay 7, 2013
One-Pass AUC OptimizationWei Gao, Rong Jin, Shenghuo Zhu et al.
AUC is an important performance measure and many algorithms have been devoted to AUC optimization, mostly by minimizing a surrogate convex loss on a training data set. In this work, we focus on one-pass AUC optimization that requires only going through the training data once without storing the entire training dataset, where conventional online learning algorithms cannot be applied directly because AUC is measured by a sum of losses defined over pairs of instances from different classes. We develop a regression-based algorithm which only needs to maintain the first and second order statistics of training data in memory, resulting a storage requirement independent from the size of training data. To efficiently handle high dimensional data, we develop a randomized algorithm that approximates the covariance matrices by low rank matrices. We verify, both theoretically and empirically, the effectiveness of the proposed algorithm.
LGApr 3, 2013
Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)Qi Qian, Rong Jin, Jinfeng Yi et al.
Distance metric learning (DML) is an important task that has found applications in many domains. The high computational cost of DML arises from the large number of variables to be determined and the constraint that a distance metric has to be a positive semi-definite (PSD) matrix. Although stochastic gradient descent (SGD) has been successfully applied to improve the efficiency of DML, it can still be computationally expensive because in order to ensure that the solution is a PSD matrix, it has to, at every iteration, project the updated distance metric onto the PSD cone, an expensive operation. We address this challenge by developing two strategies within SGD, i.e. mini-batch and adaptive sampling, to effectively reduce the number of updates (i.e., projections onto the PSD cone) in SGD. We also develop hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for DML. We prove the theoretical guarantees for both adaptive sampling and mini-batch based approaches for DML. We also conduct an extensive empirical study to verify the effectiveness of the proposed algorithms for DML.
SIMay 9, 2012
A Bayesian Framework for Community Detection Integrating Content and LinkTianbao Yang, Rong Jin, Yun Chi et al.
This paper addresses the problem of community detection in networked data that combines link and content analysis. Most existing work combines link and content information by a generative model. There are two major shortcomings with the existing approaches. First, they assume that the probability of creating a link between two nodes is determined only by the community memberships of the nodes; however other factors (e.g. popularity) could also affect the link pattern. Second, they use generative models to model the content of individual nodes, whereas these generative models are vulnerable to the content attributes that are irrelevant to communities. We propose a Bayesian framework for combining link and content information for community detection that explicitly addresses these shortcomings. A new link model is presented that introduces a random variable to capture the node popularity when deciding the link between two nodes; a discriminative model is used to determine the community membership of a node by its content. An approximate inference algorithm is presented for efficient Bayesian inference. Our empirical study shows that the proposed framework outperforms several state-of-theart approaches in combining link and content information for community detection.
LGJan 24, 2012
An Efficient Primal-Dual Prox Method for Non-Smooth OptimizationTianbao Yang, Mehrdad Mahdavi, Rong Jin et al.
We study the non-smooth optimization problems in machine learning, where both the loss function and the regularizer are non-smooth functions. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for non-smooth optimization. We develop a simple yet efficient method for a family of non-smooth optimization problems where the dual form of the loss function is bilinear in primal and dual variables. We cast a non-smooth optimization problem into a minimax optimization problem, and develop a primal dual prox method that solves the minimax optimization problem at a rate of $O(1/T)$ {assuming that the proximal step can be efficiently solved}, significantly faster than a standard subgradient descent method that has an $O(1/\sqrt{T})$ convergence rate. Our empirical study verifies the efficiency of the proposed method for various non-smooth optimization problems that arise ubiquitously in machine learning by comparing it to the state-of-the-art first order methods.