CLJun 9, 2023Code
Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge EvaluationZhouhong Gu, Xiaoxuan Zhu, Haoning Ye et al.
New Natural Langauge Process~(NLP) benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present Xiezhi, the most comprehensive evaluation suite designed to assess holistic domain knowledge. Xiezhi comprises multiple-choice questions across 516 diverse disciplines ranging from 13 different subjects with 249,587 questions and accompanied by Xiezhi-Specialty and Xiezhi-Interdiscipline, both with 15k questions. We conduct evaluation of the 47 cutting-edge LLMs on Xiezhi. Results indicate that LLMs exceed average performance of humans in science, engineering, agronomy, medicine, and art, but fall short in economics, jurisprudence, pedagogy, literature, history, and management. We anticipate Xiezhi will help analyze important strengths and shortcomings of LLMs, and the benchmark is released in~\url{https://github.com/MikeGu721/XiezhiBenchmark}.
LGOct 1, 2013
Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive SamplingShusen Wang, Zhihua Zhang
The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.
LGApr 6, 2022
Federated Reinforcement Learning with Environment HeterogeneityHao Jin, Yang Peng, Wenhao Yang et al.
We study a Federated Reinforcement Learning (FedRL) problem in which $n$ agents collaboratively learn a single policy without sharing the trajectories they collected during agent-environment interaction. We stress the constraint of environment heterogeneity, which means $n$ environments corresponding to these $n$ agents have different state transitions. To obtain a value function or a policy function which optimizes the overall performance in all environments, we propose two federated RL algorithms, \texttt{QAvg} and \texttt{PAvg}. We theoretically prove that these algorithms converge to suboptimal solutions, while such suboptimality depends on how heterogeneous these $n$ environments are. Moreover, we propose a heuristic that achieves personalization by embedding the $n$ environments into $n$ vectors. The personalization heuristic not only improves the training but also allows for better generalization to new environments.
NADec 18, 2015
Improved Analyses of the Randomized Power Method and Block Lanczos MethodShusen Wang, Zhihua Zhang, Tong Zhang
The power method and block Lanczos method are popular numerical algorithms for computing the truncated singular value decomposition (SVD) and eigenvalue decomposition problems. Especially in the literature of randomized numerical linear algebra, the power method is widely applied to improve the quality of randomized sketching, and relative-error bounds have been well established. Recently, Musco & Musco (2015) proposed a block Krylov subspace method that fully exploits the intermediate results of the power iteration to accelerate convergence. They showed spectral gap-independent bounds which are stronger than the power method by order-of-magnitude. This paper offers novel error analysis techniques and significantly improves the bounds of both the randomized power method and the block Lanczos method. This paper also establishes the first gap-independent bound for the warm-start block Lanczos method.
LGFeb 9, 2023
An End-to-End Framework for Marketing Effectiveness Optimization under Budget ConstraintZiang Yan, Shusen Wang, Guorui Zhou et al.
Online platforms often incentivize consumers to improve user engagement and platform revenue. Since different consumers might respond differently to incentives, individual-level budget allocation is an essential task in marketing campaigns. Recent advances in this field often address the budget allocation problem using a two-stage paradigm: the first stage estimates the individual-level treatment effects using causal inference algorithms, and the second stage invokes integer programming techniques to find the optimal budget allocation solution. Since the objectives of these two stages might not be perfectly aligned, such a two-stage paradigm could hurt the overall marketing effectiveness. In this paper, we propose a novel end-to-end framework to directly optimize the business goal under budget constraints. Our core idea is to construct a regularizer to represent the marketing goal and optimize it efficiently using gradient estimation techniques. As such, the obtained models can learn to maximize the marketing goal directly and precisely. We extensively evaluate our proposed method in both offline and online experiments, and experimental results demonstrate that our method outperforms current state-of-the-art methods. Our proposed method is currently deployed to allocate marketing budgets for hundreds of millions of users on a short video platform and achieves significant business goal improvements. Our code will be publicly available.
LGAug 4, 2022
Transferable Multi-Agent Reinforcement Learning with Dynamic Participating AgentsXuting Tang, Jia Xu, Shusen Wang
We study multi-agent reinforcement learning (MARL) with centralized training and decentralized execution. During the training, new agents may join, and existing agents may unexpectedly leave the training. In such situations, a standard deep MARL model must be trained again from scratch, which is very time-consuming. To tackle this problem, we propose a special network architecture with a few-shot learning algorithm that allows the number of agents to vary during centralized training. In particular, when a new agent joins the centralized training, our few-shot learning algorithm trains its policy network and value network using a small number of samples; when an agent leaves the training, the training process of the remaining agents is not affected. Our experiments show that using the proposed network architecture and algorithm, model adaptation when new agents join can be 100+ times faster than the baseline. Our work is applicable to any setting, including cooperative, competitive, and mixed.
CLMay 24, 2023Code
RefGPT: Dialogue Generation of GPT, by GPT, and for GPTDongjie Yang, Ruifeng Yuan, Yuantao Fan et al.
Large Language Models (LLMs) have attained the impressive capability to resolve a wide range of NLP tasks by fine-tuning high-quality instruction data. However, collecting human-written data of high quality, especially multi-turn dialogues, is expensive and unattainable for most people. Though previous studies have used powerful LLMs to generate the dialogues automatically, they all suffer from generating untruthful dialogues because of the model hallucination. Therefore, we propose a method called RefGPT to generate enormous truthful and customized dialogues without worrying about factual errors caused by the model hallucination. RefGPT solves the model hallucination in dialogue generation by restricting the LLMs to leverage the given reference instead of reciting their own knowledge to generate dialogues. Additionally, RefGPT adds detailed controls on every utterance to enable high customization capability, which previous studies have ignored. On the basis of RefGPT, we also propose two high-quality dialogue datasets generated by GPT-4, namely RefGPT-Fact and RefGPT-Code. RefGPT-Fact is a dataset with 100k multi-turn dialogues based on factual knowledge and RefGPT-Code has 76k multi-turn dialogues covering a wide range of coding scenarios. Our code and datasets are released in https://github.com/mutonix/RefGPT.
IRJul 21, 2023
Methodologies for Improving Modern Industrial Recommender SystemsShusen Wang
Recommender system (RS) is an established technology with successful applications in social media, e-commerce, entertainment, and more. RSs are indeed key to the success of many popular APPs, such as YouTube, Tik Tok, Xiaohongshu, Bilibili, and others. This paper explores the methodology for improving modern industrial RSs. It is written for experienced RS engineers who are diligently working to improve their key performance indicators, such as retention and duration. The experiences shared in this paper have been tested in some real industrial RSs and are likely to be generalized to other RSs as well. Most contents in this paper are industry experience without publicly available references.
CLJun 18, 2024
DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?Zhouhong Gu, Lin Zhang, Xiaoxuan Zhu et al.
Detecting evidence within the context is a key step in the process of reasoning task. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice questions, with an average of 994 tokens per question. Each question contains an average of 4.55 pieces of implicit evidence, and solving the problem typically requires 7.62 logical jumps to find the correct answer. To enhance the performance of LLMs in evidence detection, this paper proposes Detective Reasoning Prompt and Finetune. Experiments demonstrate that the existing LLMs' abilities to detect evidence in long contexts are far inferior to humans. However, the Detective Reasoning Prompt effectively enhances the capability of powerful LLMs in evidence detection, while the Finetuning method shows significant effects in enhancing the performance of weaker LLMs. Moreover, when the abilities of LLMs in evidence detection are improved, their final reasoning performance is also enhanced accordingly.
MLMar 1, 2021
FedPower: Privacy-Preserving Distributed Eigenspace EstimationXiao Guo, Xiang Li, Xiangyu Chang et al.
Eigenspace estimation is fundamental in machine learning and statistics, which has found applications in PCA, dimension reduction, and clustering, among others. The modern machine learning community usually assumes that data come from and belong to different organizations. The low communication power and the possible privacy breaches of data make the computation of eigenspace challenging. To address these challenges, we propose a class of algorithms called \textsf{FedPower} within the federated learning (FL) framework. \textsf{FedPower} leverages the well-known power method by alternating multiple local power iterations and a global aggregation step, thus improving communication efficiency. In the aggregation, we propose to weight each local eigenvector matrix with {\it Orthogonal Procrustes Transformation} (OPT) for better alignment. To ensure strong privacy protection, we add Gaussian noise in each iteration by adopting the notion of \emph{differential privacy} (DP). We provide convergence bounds for \textsf{FedPower} that are composed of different interpretable terms corresponding to the effects of Gaussian noise, parallelization, and random sampling of local machines. Additionally, we conduct experiments to demonstrate the effectiveness of our proposed algorithms.
MLFeb 19, 2020
Communication-Efficient Distributed SVD via Local Power IterationsXiang Li, Shusen Wang, Kun Chen et al.
We study distributed computing of the truncated singular value decomposition problem. We develop an algorithm that we call \texttt{LocalPower} for improving communication efficiency. Specifically, we uniformly partition the dataset among $m$ nodes and alternate between multiple (precisely $p$) local power iterations and one global aggregation. In the aggregation, we propose to weight each local eigenvector matrix with orthogonal Procrustes transformation (OPT). As a practical surrogate of OPT, sign-fixing, which uses a diagonal matrix with $\pm 1$ entries as weights, has better computation complexity and stability in experiments. We theoretically show that under certain assumptions \texttt{LocalPower} lowers the required number of communications by a factor of $p$ to reach a constant accuracy. We also show that the strategy of periodically decaying $p$ helps obtain high-precision solutions. We conduct experiments to demonstrate the effectiveness of \texttt{LocalPower}.
LGDec 27, 2019
Fast Generalized Matrix Regression with Applications in Machine LearningHaishan Ye, Shusen Wang, Zhihua Zhang et al.
Fast matrix algorithms have become the fundamental tools of machine learning in big data era. The generalized matrix regression problem is widely used in the matrix approximation such as CUR decomposition, kernel matrix approximation, and stream singular value decomposition (SVD), etc. In this paper, we propose a fast generalized matrix regression algorithm (Fast GMR) which utilizes sketching technique to solve the GMR problem efficiently. Given error parameter $0<ε<1$, the Fast GMR algorithm can achieve a $(1+ε)$ relative error with the sketching sizes being of order $\cO(ε^{-1/2})$ for a large group of GMR problems. We apply the Fast GMR algorithm to the symmetric positive definite matrix approximation and single pass singular value decomposition and they achieve a better performance than conventional algorithms. Our empirical study also validates the effectiveness and efficiency of our proposed algorithms.
LGDec 21, 2019
Graph Message Passing with Cross-location Attentions for Long-term ILI PredictionSonggaojun Deng, Shusen Wang, Huzefa Rangwala et al.
Forecasting influenza-like illness (ILI) is of prime importance to epidemiologists and health-care providers. Early prediction of epidemic outbreaks plays a pivotal role in disease intervention and control. Most existing work has either limited long-term prediction performance or lacks a comprehensive ability to capture spatio-temporal dependencies in data. Accurate and early disease forecasting models would markedly improve both epidemic prevention and managing the onset of an epidemic. In this paper, we design a cross-location attention based graph neural network (Cola-GNN) for learning time series embeddings and location aware attentions. We propose a graph message passing framework to combine learned feature embeddings and an attention matrix to model disease propagation over time. We compare the proposed method with state-of-the-art statistical approaches and deep learning models on real-world epidemic-related datasets from United States and Japan. The proposed method shows strong predictive performance and leads to interpretable results for long-term epidemic predictions.
MLOct 21, 2019
Communication-Efficient Local Decentralized SGD MethodsXiang Li, Wenhao Yang, Shusen Wang et al.
Recently, the technique of local updates is a powerful tool in centralized settings to improve communication efficiency via periodical communication. For decentralized settings, it is still unclear how to efficiently combine local updates and decentralized communication. In this work, we propose an algorithm named as LD-SGD, which incorporates arbitrary update schemes that alternate between multiple Local updates and multiple Decentralized SGDs, and provide an analytical framework for LD-SGD. Under the framework, we present a sufficient condition to guarantee the convergence. We show that LD-SGD converges to a critical point for a wide range of update schemes when the objective is non-convex and the training data are non-identically independent distributed. Moreover, our framework brings many insights into the design of update schemes for decentralized optimization. As examples, we specify two update schemes and show how they help improve communication efficiency. Specifically, the first scheme alternates the number of local and global update steps. From our analysis, the ratio of the number of local updates to that of decentralized SGD trades off communication and computation. The second scheme is to periodically shrink the length of local updates. We show that the decaying strategy helps improve communication efficiency both theoretically and empirically.
MLSep 24, 2019
Simple and Almost Assumption-Free Out-of-Sample Bound for Random Feature MappingShusen Wang
Random feature mapping (RFM) is a popular method for speeding up kernel methods at the cost of losing a little accuracy. We study kernel ridge regression with random feature mapping (RFM-KRR) and establish novel out-of-sample error upper and lower bounds. While out-of-sample bounds for RFM-KRR have been established by prior work, this paper's theories are highly interesting for two reasons. On the one hand, our theories are based on weak and valid assumptions. In contrast, the existing theories are based on various uncheckable assumptions, which makes it unclear whether their bounds are the nature of RFM-KRR or simply the consequence of strong assumptions. On the other hand, our analysis is completely based on elementary linear algebra and thereby easy to read and verify. Finally, our experiments lend empirical supports to the theories.
LGSep 24, 2019
Matrix Sketching for Secure Collaborative Machine LearningMengjiao Zhang, Shusen Wang
Collaborative learning allows participants to jointly train a model without data sharing. To update the model parameters, the central server broadcasts model parameters to the clients, and the clients send updating directions such as gradients to the server. While data do not leave a client device, the communicated gradients and parameters will leak a client's privacy. Attacks that infer clients' privacy from gradients and parameters have been developed by prior work. Simple defenses such as dropout and differential privacy either fail to defend the attacks or seriously hurt test accuracy. We propose a practical defense which we call Double-Blind Collaborative Learning (DBCL). The high-level idea is to apply random matrix sketching to the parameters (aka weights) and re-generate random sketching after each iteration. DBCL prevents clients from conducting gradient-based privacy inferences which are the most effective attacks. DBCL works because from the attacker's perspective, sketching is effectively random noise that outweighs the signal. Notably, DBCL does not much increase computation and communication costs and does not hurt test accuracy at all.
MLJul 4, 2019
On the Convergence of FedAvg on Non-IID DataXiang Li, Kaixuan Huang, Wenhao Yang et al.
Federated learning enables a large amount of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of \texttt{FedAvg} on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGDs. Importantly, our bound demonstrates a trade-off between communication-efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. Furthermore, we provide a necessary condition for \texttt{FedAvg} on non-iid data: the learning rate $η$ must decay, even if full-gradient is used; otherwise, the solution will be $Ω(η)$ away from the optimal.
MLFeb 13, 2019
Do Subsampled Newton Methods Work for High-Dimensional Data?Xiang Li, Shusen Wang, Zhihua Zhang
Subsampled Newton methods approximate Hessian matrices through subsampling techniques, alleviating the cost of forming Hessian matrices but using sufficient curvature information. However, previous results require $Ω(d)$ samples to approximate Hessians, where $d$ is the dimension of data points, making it less practically feasible for high-dimensional data. The situation is deteriorated when $d$ is comparably as large as the number of data points $n$, which requires to take the whole dataset into account, making subsampling useless. This paper theoretically justifies the effectiveness of subsampled Newton methods on high dimensional data. Specifically, we prove only $\widetildeΘ(d^γ_{\rm eff})$ samples are needed in the approximation of Hessian matrices, where $d^γ_{\rm eff}$ is the $γ$-ridge leverage and can be much smaller than $d$ as long as $nγ\gg 1$. Additionally, we extend this result so that subsampled Newton methods can work for high-dimensional data on both distributed optimization problems and non-smooth regularized problems.
MLMar 21, 2018
Error Estimation for Randomized Least-Squares Algorithms via the BootstrapMiles E. Lopes, Shusen Wang, Michael W. Mahoney
Over the course of the past decade, a variety of randomized algorithms have been proposed for computing approximate least-squares (LS) solutions in large-scale settings. A longstanding practical issue is that, for any given input, the user rarely knows the actual error of an approximate solution (relative to the exact solution). Likewise, it is difficult for the user to know precisely how much computation is needed to achieve the desired error tolerance. Consequently, the user often appeals to worst-case error bounds that tend to offer only qualitative guidance. As a more practical alternative, we propose a bootstrap method to compute a posteriori error estimates for randomized LS algorithms. These estimates permit the user to numerically assess the error of a given solution, and to predict how much work is needed to improve a "preliminary" solution. In addition, we provide theoretical consistency results for the method, which are the first such results in this context (to the best of our knowledge). From a practical standpoint, the method also has considerable flexibility, insofar as it can be applied to several popular sketching algorithms, as well as a variety of error metrics. Moreover, the extra step of error estimation does not add much cost to an underlying sketching algorithm. Finally, we demonstrate the effectiveness of the method with empirical results.
LGOct 11, 2017
Efficient Data-Driven Geologic Feature Detection from Pre-stack Seismic Measurements using Randomized Machine-Learning AlgorithmYouzuo Lin, Shusen Wang, Jayaraman Thiagarajan et al.
Conventional seismic techniques for detecting the subsurface geologic features are challenged by limited data coverage, computational inefficiency, and subjective human factors. We developed a novel data-driven geological feature detection approach based on pre-stack seismic measurements. Our detection method employs an efficient and accurate machine-learning detection approach to extract useful subsurface geologic features automatically. Specifically, our method is based on kernel ridge regression model. The conventional kernel ridge regression can be computationally prohibited because of the large volume of seismic measurements. We employ a data reduction technique in combination with the conventional kernel ridge regression method to improve the computational efficiency and reduce memory usage. In particular, we utilize a randomized numerical linear algebra technique, named Nyström method, to effectively reduce the dimensionality of the feature space without compromising the information content required for accurate detection. We provide thorough computational cost analysis to show efficiency of our new geological feature detection methods. We further validate the performance of our new subsurface geologic feature detection method using synthetic surface seismic data for 2D acoustic and elastic velocity models. Our numerical examples demonstrate that our new detection method significantly improves the computational efficiency while maintaining comparable accuracy. Interestingly, we show that our method yields a speed-up ratio on the order of $\sim10^2$ to $\sim 10^3$ in a multi-core computational environment.
LGSep 11, 2017
GIANT: Globally Improved Approximate Newton Method for Distributed OptimizationShusen Wang, Farbod Roosta-Khorasani, Peng Xu et al.
For distributed computing environment, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver, then, averages all the ANT directions received from workers to form a {\it Globally Improved ANT} (GIANT) direction. GIANT is highly communication efficient and naturally exploits the trade-offs between local computations and global communications in that more local computations result in fewer overall rounds of communications. Theoretically, we show that GIANT enjoys an improved convergence rate as compared with first-order methods and existing distributed Newton-type methods. Further, and in sharp contrast with many existing distributed Newton-type methods, as well as popular first-order methods, a highly advantageous practical feature of GIANT is that it only involves one tuning parameter. We conduct large-scale experiments on a computer cluster and, empirically, demonstrate the superior performance of GIANT.
MLAug 6, 2017
A Bootstrap Method for Error Estimation in Randomized Matrix MultiplicationMiles E. Lopes, Shusen Wang, Michael W. Mahoney
In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical relationship between cost and accuracy is typically unknown, and consequently, it may be difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much computation is needed to achieve a given level of accuracy. In the current paper, we study randomized matrix multiplication (sketching) as a prototype setting for addressing these general problems. As a solution, we develop a bootstrap method for \emph{directly estimating} the accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds on the accuracy in terms of the reduced dimension). From a computational standpoint, the proposed method does not substantially increase the cost of standard sketching methods, and this is made possible by an "extrapolation" technique. In addition, we provide both theoretical and empirical results to demonstrate the effectiveness of the proposed method.
LGJun 9, 2017
Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error BoundsShusen Wang, Alex Gittens, Michael W. Mahoney
Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nyström method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}ε (1 + o(1))$ features constructed using a so-called rank-restricted Nyström approximation results in cluster assignments that satisfy a $1 + ε$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nyström method. As part of the analysis, this work establishes a novel $1 + ε$ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nyström approximation. Empirical evaluations on the $8.1$ million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel $k$-means clustering with Nyström approximation. This work argues that spectral clustering using Nyström approximation---a popular and computationally efficient, but theoretically unsound approach to non-linear clustering---should be replaced with the efficient and theoretically sound combination of kernel $k$-means clustering with Nyström approximation. The superior performance of the latter approach is empirically verified.
MLFeb 16, 2017
Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model AveragingShusen Wang, Alex Gittens, Michael W. Mahoney
We address the statistical and optimization impacts of the classical sketch and Hessian sketch used to approximately solve the Matrix Ridge Regression (MRR) problem. Prior research has quantified the effects of classical sketch on the strictly simpler least squares regression (LSR) problem. We establish that classical sketch has a similar effect upon the optimization properties of MRR as it does on those of LSR: namely, it recovers nearly optimal solutions. By contrast, Hessian sketch does not have this guarantee, instead, the approximation error is governed by a subtle interplay between the "mass" in the responses and the optimal objective value. For both types of approximation, the regularization in the sketched MRR problem results in significantly different statistical properties from those of the sketched LSR problem. In particular, there is a bias-variance trade-off in sketched MRR that is not present in sketched LSR. We provide upper and lower bounds on the bias and variance of sketched MRR, these bounds show that classical sketch significantly increases the variance, while Hessian sketch significantly increases the bias. Empirically, sketched MRR solutions can have risks that are higher by an order-of-magnitude than those of the optimal MRR solutions. We establish theoretically and empirically that model averaging greatly decreases the gap between the risks of the true and sketched solutions to the MRR problem. Thus, in parallel or distributed settings, sketching combined with model averaging is a powerful technique that quickly obtains near-optimal solutions to the MRR problem while greatly mitigating the increased statistical risk incurred by sketching.
MSMay 28, 2015
A Practical Guide to Randomized Matrix Computations with MATLAB ImplementationsShusen Wang
Matrix operations such as matrix inversion, eigenvalue decomposition, singular value decomposition are ubiquitous in real-world applications. Unfortunately, many of these matrix operations so time and memory expensive that they are prohibitive when the scale of data is large. In real-world applications, since the data themselves are noisy, machine-precision matrix operations are not necessary at all, and one can sacrifice a reasonable amount of accuracy for computational efficiency. In recent years, a bunch of randomized algorithms have been devised to make matrix computations more scalable. Mahoney (2011) and Woodruff (2014) have written excellent but very technical reviews of the randomized algorithms. Differently, the focus of this manuscript is on intuition, algorithm derivation, and implementation. This manuscript should be accessible to people with knowledge in elementary matrix algebra but unfamiliar with randomized matrix computations. The algorithms introduced in this manuscript are all summarized in a user-friendly way, and they can be implemented in lines of MATLAB code. The readers can easily follow the implementations even if they do not understand the maths and algorithms.
LGMar 29, 2015
Towards More Efficient SPSD Matrix Approximation and CUR Matrix DecompositionShusen Wang, Zhihua Zhang, Tong Zhang
Symmetric positive semi-definite (SPSD) matrix approximation methods have been extensively used to speed up large-scale eigenvalue computation and kernel learning methods. The standard sketch based method, which we call the prototype model, produces relatively accurate approximations, but is inefficient on large square matrices. The Nyström method is highly efficient, but can only achieve low accuracy. In this paper we propose a novel model that we call the {\it fast SPSD matrix approximation model}. The fast model is nearly as efficient as the Nyström method and as accurate as the prototype model. We show that the fast model can potentially solve eigenvalue problems and kernel learning problems in linear time with respect to the matrix size $n$ to achieve $1+ε$ relative-error, whereas both the prototype model and the Nyström method cost at least quadratic time to attain comparable error bound. Empirical comparisons among the prototype model, the Nyström method, and our fast model demonstrate the superiority of the fast model. We also contribute new understandings of the Nyström method. The Nyström method is a special instance of our fast model and is approximation to the prototype model. Our technique can be straightforwardly applied to make the CUR matrix decomposition more efficiently computed without much affecting the accuracy.
LGDec 26, 2014
Adjusting Leverage Scores by Row Weighting: A Practical Approach to Coherent Matrix CompletionShusen Wang, Tong Zhang, Zhihua Zhang
Low-rank matrix completion is an important problem with extensive real-world applications. When observations are uniformly sampled from the underlying matrix entries, existing methods all require the matrix to be incoherent. This paper provides the first working method for coherent matrix completion under the standard uniform sampling model. Our approach is based on the weighted nuclear norm minimization idea proposed in several recent work, and our key contribution is a practical method to compute the weighting matrices so that the leverage scores become more uniform after weighting. Under suitable conditions, we are able to derive theoretical results, showing the effectiveness of our approach. Experiments on synthetic data show that our approach recovers highly coherent matrices with high precision, whereas the standard unweighted method fails even on noise-free data.
LGJun 22, 2014
SPSD Matrix Approximation vis Column Selection: Theories, Algorithms, and ExtensionsShusen Wang, Luo Luo, Zhihua Zhang
Symmetric positive semidefinite (SPSD) matrix approximation is an important problem with applications in kernel methods. However, existing SPSD matrix approximation methods such as the Nyström method only have weak error bounds. In this paper we conduct in-depth studies of an SPSD matrix approximation model and establish strong relative-error bounds. We call it the prototype model for it has more efficient and effective extensions, and some of its extensions have high scalability. Though the prototype model itself is not suitable for large-scale data, it is still useful to study its properties, on which the analysis of its extensions relies. This paper offers novel theoretical analysis, efficient algorithms, and a highly accurate extension. First, we establish a lower error bound for the prototype model and improve the error bound of an existing column selection algorithm to match the lower bound. In this way, we obtain the first optimal column selection algorithm for the prototype model. We also prove that the prototype model is exact under certain conditions. Second, we develop a simple column selection algorithm with a provable error bound. Third, we propose a so-called spectral shifting model to make the approximation more accurate when the eigenvalues of the matrix decay slowly, and the improvement is theoretically quantified. The spectral shifting method can also be applied to improve other SPSD matrix approximation models.
LGApr 1, 2014
Efficient Algorithms and Error Analysis for the Modified Nystrom MethodShusen Wang, Zhihua Zhang
Many kernel methods suffer from high time and space complexities and are thus prohibitive in big-data applications. To tackle the computational challenge, the Nyström method has been extensively used to reduce time and space complexities by sacrificing some accuracy. The Nyström method speedups computation by constructing an approximation of the kernel matrix using only a few columns of the matrix. Recently, a variant of the Nyström method called the modified Nyström method has demonstrated significant improvement over the standard Nyström method in approximation accuracy, both theoretically and empirically. In this paper, we propose two algorithms that make the modified Nyström method practical. First, we devise a simple column selection algorithm with a provable error bound. Our algorithm is more efficient and easier to implement than and nearly as accurate as the state-of-the-art algorithm. Second, with the selected columns at hand, we propose an algorithm that computes the approximation in lower time complexity than the approach in the previous work. Furthermore, we prove that the modified Nyström method is exact under certain conditions, and we establish a lower error bound for the modified Nyström method.
LGMar 30, 2014
Sharpened Error Bounds for Random Sampling Based $\ell_2$ RegressionShusen Wang
Given a data matrix $X \in R^{n\times d}$ and a response vector $y \in R^{n}$, suppose $n>d$, it costs $O(n d^2)$ time and $O(n d)$ space to solve the least squares regression (LSR) problem. When $n$ and $d$ are both large, exactly solving the LSR problem is very expensive. When $n \gg d$, one feasible approach to speeding up LSR is to randomly embed $y$ and all columns of $X$ into a smaller subspace $R^c$; the induced LSR problem has the same number of columns but much fewer number of rows, and it can be solved in $O(c d^2)$ time and $O(c d)$ space. We discuss in this paper two random sampling based methods for solving LSR more efficiently. Previous work showed that the leverage scores based sampling based LSR achieves $1+ε$ accuracy when $c \geq O(d ε^{-2} \log d)$. In this paper we sharpen this error bound, showing that $c = O(d \log d + d ε^{-1})$ is enough for achieving $1+ε$ accuracy. We also show that when $c \geq O(μd ε^{-2} \log d)$, the uniform sampling based LSR attains a $2+ε$ bound with positive probability.
LGMar 18, 2013
Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive SamplingShusen Wang, Zhihua Zhang
The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.
LGOct 4, 2012
A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter BoundShusen Wang, Zhihua Zhang, Jian Li
The CUR matrix decomposition is an important extension of Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.
MLApr 19, 2012
EP-GIG Priors and Applications in Bayesian Sparse LearningZhihua Zhang, Shusen Wang, Dehua Liu et al.
In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. These properties lead us to EM algorithms for Bayesian sparse learning. We show that these algorithms bear an interesting resemblance to iteratively re-weighted $\ell_2$ or $\ell_1$ methods. In addition, we present two extensions for grouped variable selection and logistic regression.