CVJul 25, 2022Code
Jigsaw-ViT: Learning Jigsaw Puzzles in Vision TransformerYingyi Chen, Xi Shen, Yahui Liu et al.
The success of Vision Transformer (ViT) in various computer vision tasks has promoted the ever-increasing prevalence of this convolution-free network. The fact that ViT works on image patches makes it potentially relevant to the problem of jigsaw puzzle solving, which is a classical self-supervised task aiming at reordering shuffled sequential image patches back to their natural form. Despite its simplicity, solving jigsaw puzzle has been demonstrated to be helpful for diverse tasks using Convolutional Neural Networks (CNNs), such as self-supervised feature representation learning, domain generalization, and fine-grained classification. In this paper, we explore solving jigsaw puzzle as a self-supervised auxiliary loss in ViT for image classification, named Jigsaw-ViT. We show two modifications that can make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Yet simple, we find that Jigsaw-ViT is able to improve both in generalization and robustness over the standard ViT, which is usually rather a trade-off. Experimentally, we show that adding the jigsaw puzzle branch provides better generalization than ViT on large-scale image classification on ImageNet. Moreover, the auxiliary task also improves robustness to noisy labels on Animal-10N, Food-101N, and Clothing1M as well as adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT/.
LGJun 27, 2022Code
Compressing Features for Learning with Noisy LabelsYingyi Chen, Shell Xu Hu, Xi Shen et al.
Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper, we focus on the problem of learning with noisy labels and introduce compression inductive bias to network architectures to alleviate this over-fitting problem. More precisely, we revisit one classical regularization named Dropout and its variant Nested Dropout. Dropout can serve as a compression constraint for its feature dropping mechanism, while Nested Dropout further learns ordered feature representations w.r.t. feature importance. Moreover, the trained models with compression regularization are further combined with Co-teaching for performance boost. Theoretically, we conduct bias-variance decomposition of the objective function under compression regularization. We analyze it for both single model and Co-teaching. This decomposition provides three insights: (i) it shows that over-fitting is indeed an issue for learning with noisy labels; (ii) through an information bottleneck formulation, it explains why the proposed feature compression helps in combating label noise; (iii) it gives explanations on the performance boost brought by incorporating compression regularization into Co-teaching. Experiments show that our simple approach can have comparable or even better performance than the state-of-the-art methods on benchmarks with real-world label noise including Clothing1M and ANIMAL-10N. Our implementation is available at https://yingyichen-cyy.github.io/CompressFeatNoisyLabels/.
LGAug 30, 2023Code
Low-Rank Multitask Learning based on Tensorized SVMs and LSSVMsJiani Liu, Qinghua Tao, Ce Zhu et al.
Multitask learning (MTL) leverages task-relatedness to enhance performance. With the emergence of multimodal data, tasks can now be referenced by multiple indices. In this paper, we employ high-order tensors, with each mode corresponding to a task index, to naturally represent tasks referenced by multiple indices and preserve their structural relations. Based on this representation, we propose a general framework of low-rank MTL methods with tensorized support vector machines (SVMs) and least square support vector machines (LSSVMs), where the CP factorization is deployed over the coefficient tensor. Our approach allows to model the task relation through a linear combination of shared factors weighted by task-specific factors and is generalized to both classification and regression problems. Through the alternating optimization scheme and the Lagrangian function, each subproblem is transformed into a convex problem, formulated as a quadratic programming or linear system in the dual form. In contrast to previous MTL frameworks, our decision function in the dual induces a weighted kernel function with a task-coupling term characterized by the similarities of the task-specific factors, better revealing the explicit relations across tasks in MTL. Experimental results validate the effectiveness and superiority of our proposed methods compared to existing state-of-the-art approaches in MTL. The code of implementation will be available at https://github.com/liujiani0216/TSVM-MTL.
LGFeb 22, 2023
Deep Kernel Principal Component Analysis for Multi-level Feature LearningFrancesco Tonin, Qinghua Tao, Panagiotis Patrinos et al.
Principal Component Analysis (PCA) and its nonlinear extension Kernel PCA (KPCA) are widely used across science and industry for data analysis and dimensionality reduction. Modern deep learning tools have achieved great empirical success, but a framework for deep principal component analysis is still lacking. Here we develop a deep kernel PCA methodology (DKPCA) to extract multiple levels of the most informative components of the data. Our scheme can effectively identify new hierarchical variables, called deep principal components, capturing the main characteristics of high-dimensional data through a simple and interpretable numerical optimization. We couple the principal components of multiple KPCA levels, theoretically showing that DKPCA creates both forward and backward dependency across levels, which has not been explored in kernel methods and yet is crucial to extract more informative features. Various experimental evaluations on multiple data types show that DKPCA finds more efficient and disentangled representations with higher explained variance in fewer principal components, compared to the shallow KPCA. We demonstrate that our method allows for effective hierarchical data exploration, with the ability to separate the key generative factors of the input data both for large datasets and when few training samples are available. Overall, DKPCA can facilitate the extraction of useful patterns from high-dimensional data by learning more informative features organized in different levels, giving diversified aspects to explore the variation factors in the data, while maintaining a simple mathematical formulation.
LGJun 18, 2022
Piecewise Linear Neural Networks and Deep LearningQinghua Tao, Li Li, Xiaolin Huang et al.
As a powerful modelling method, PieceWise Linear Neural Networks (PWLNNs) have proven successful in various fields, most recently in deep learning. To apply PWLNN methods, both the representation and the learning have long been studied. In 1977, the canonical representation pioneered the works of shallow PWLNNs learned by incremental designs, but the applications to large-scale data were prohibited. In 2010, the Rectified Linear Unit (ReLU) advocated the prevalence of PWLNNs in deep learning. Ever since, PWLNNs have been successfully applied to extensive tasks and achieved advantageous performances. In this Primer, we systematically introduce the methodology of PWLNNs by grouping the works into shallow and deep networks. Firstly, different PWLNN representation models are constructed with elaborated examples. With PWLNNs, the evolution of learning algorithms for data is presented and fundamental theoretical analysis follows up for in-depth understandings. Then, representative applications are introduced together with discussions and outlooks.
CVJul 5, 2023
Unbalanced Optimal Transport: A Unified Framework for Object DetectionHenri De Plaen, Pierre-François De Plaen, Johan A. K. Suykens et al.
During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.
LGJul 23, 2022
Tensor-based Multi-view Spectral Clustering via Shared Latent SpaceQinghua Tao, Francesco Tonin, Panagiotis Patrinos et al.
Multi-view Spectral Clustering (MvSC) attracts increasing attention due to diverse data sources. However, most existing works are prohibited in out-of-sample predictions and overlook model interpretability and exploration of clustering results. In this paper, a new method for MvSC is proposed via a shared latent space from the Restricted Kernel Machine framework. Through the lens of conjugate feature duality, we cast the weighted kernel principal component analysis problem for MvSC and develop a modified weighted conjugate feature duality to formulate dual variables. In our method, the dual variables, playing the role of hidden features, are shared by all views to construct a common latent space, coupling the views by learning projections from view-specific spaces. Such single latent space promotes well-separated clusters and provides straightforward data exploration, facilitating visualization and interpretation. Our method requires only a single eigendecomposition, whose dimension is independent of the number of views. To boost higher-order correlations, tensor-based modelling is introduced without increasing computational complexity. Our method can be flexibly applied with out-of-sample extensions, enabling greatly improved efficiency for large-scale data with fixed-size kernel schemes. Numerical experiments verify that our method is effective regarding accuracy, efficiency, and interpretability, showing a sharp eigenvalue decay and distinct latent variable distributions.
LGJan 24, 2023
Multi-view Kernel PCA for Time series ForecastingArun Pandey, Hannes De Meulemeester, Bart De Moor et al.
In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.
LGJun 12, 2023
Nonlinear SVD with Asymmetric Kernels: feature learning and asymmetric Nyström methodQinghua Tao, Francesco Tonin, Panagiotis Patrinos et al.
Asymmetric data naturally exist in real life, such as directed graphs. Different from the common kernel methods requiring Mercer kernels, this paper tackles the asymmetric kernel-based learning problem. We describe a nonlinear extension of the matrix Singular Value Decomposition through asymmetric kernels, namely KSVD. First, we construct two nonlinear feature mappings w.r.t. rows and columns of the given data matrix. The proposed optimization problem maximizes the variance of each mapping projected onto the subspace spanned by the other, subject to a mutual orthogonality constraint. Through Lagrangian duality, we show that it can be solved by the left and right singular vectors in the feature space induced by the asymmetric kernel. Moreover, we start from the integral equations with a pair of adjoint eigenfunctions corresponding to the singular vectors on an asymmetrical kernel, and extend the Nyström method to asymmetric cases through the finite sample approximation, which can be applied to speedup the training in KSVD. Experiments show that asymmetric KSVD learns features outperforming Mercer-kernel based methods that resort to symmetrization, and also verify the effectiveness of the asymmetric Nyström method.
LGJan 31, 2023
Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node ClassificationSonny Achten, Francesco Tonin, Panagiotis Patrinos et al.
We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. The method is built of two main types of blocks: (i) We introduce unsupervised kernel machine layers propagating the node features in a one-hop neighborhood, using implicit node feature mappings. (ii) We specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. We derive an effective initialization scheme and efficient end-to-end training algorithm in the dual variables for the full architecture. The main idea underlying GCKM is that, because of the unsupervised core, the final model can achieve higher performance in semi-supervised node classification when few labels are available for training. Experimental results demonstrate the effectiveness of the proposed framework.
LGJun 9, 2023
Extending Kernel PCA through Dualization: Sparsity, Robustness and Fast AlgorithmsFrancesco Tonin, Alex Lambert, Panagiotis Patrinos et al.
The goal of this paper is to revisit Kernel Principal Component Analysis (KPCA) through dualization of a difference of convex functions. This allows to naturally extend KPCA to multiple objective functions and leads to efficient gradient-based algorithms avoiding the expensive SVD of the Gram matrix. Particularly, we consider objective functions that can be written as Moreau envelopes, demonstrating how to promote robustness and sparsity within the same framework. The proposed method is evaluated on synthetic and real-world benchmarks, showing significant speedup in KPCA training time as well as highlighting the benefits in terms of robustness and sparsity.
LGOct 8, 2023
Enhancing Kernel Flexibility via Learning Asymmetric Locally-Adaptive KernelsFan He, Mingzhen He, Lei Shi et al.
The lack of sufficient flexibility is the key bottleneck of kernel-based learning that relies on manually designed, pre-given, and non-trainable kernels. To enhance kernel flexibility, this paper introduces the concept of Locally-Adaptive-Bandwidths (LAB) as trainable parameters to enhance the Radial Basis Function (RBF) kernel, giving rise to the LAB RBF kernel. The parameters in LAB RBF kernels are data-dependent, and its number can increase with the dataset, allowing for better adaptation to diverse data patterns and enhancing the flexibility of the learned function. This newfound flexibility also brings challenges, particularly with regards to asymmetry and the need for an efficient learning algorithm. To address these challenges, this paper for the first time establishes an asymmetric kernel ridge regression framework and introduces an iterative kernel learning algorithm. This novel approach not only reduces the demand for extensive support data but also significantly improves generalization by training bandwidths on the available training data. Experimental results on real datasets underscore the remarkable performance of the proposed algorithm, showcasing its superior capability in handling large-scale datasets compared to Nyström approximation-based algorithms. Moreover, it demonstrates a significant improvement in regression accuracy over existing kernel-based learning methods and even surpasses residual neural networks.
LGMar 4, 2023
Tensorized LSSVMs for Multitask RegressionJiani Liu, Qinghua Tao, Ce Zhu et al.
Multitask learning (MTL) can utilize the relatedness between multiple tasks for performance improvement. The advent of multimodal data allows tasks to be referenced by multiple indices. High-order tensors are capable of providing efficient representations for such tasks, while preserving structural task-relations. In this paper, a new MTL method is proposed by leveraging low-rank tensor analysis and constructing tensorized Least Squares Support Vector Machines, namely the tLSSVM-MTL, where multilinear modelling and its nonlinear extensions can be flexibly exerted. We employ a high-order tensor for all the weights with each mode relating to an index and factorize it with CP decomposition, assigning a shared factor for all tasks and retaining task-specific latent factors along each index. Then an alternating algorithm is derived for the nonconvex optimization, where each resulting subproblem is solved by a linear system. Experimental results demonstrate promising performances of our tLSSVM-MTL.
LGJul 19, 2023
A Dual Formulation for Probabilistic Principal Component AnalysisHenri De Plaen, Johan A. K. Suykens
In this paper, we characterize Probabilistic Principal Component Analysis in Hilbert spaces and demonstrate how the optimal solution admits a representation in dual space. This allows us to develop a generative framework for kernel methods. Furthermore, we show how it englobes Kernel Principal Component Analysis and illustrate its working on a toy and a real dataset.
LGJun 12, 2023
Combining Primal and Dual Representations in Deep Restricted Kernel Machines ClassifiersFrancesco Tonin, Panagiotis Patrinos, Johan A. K. Suykens
In the context of deep learning with kernel machines, the deep Restricted Kernel Machine (DRKM) framework allows multiple levels of kernel PCA (KPCA) and Least-Squares Support Vector Machines (LSSVM) to be combined into a deep architecture using visible and hidden units. We propose a new method for DRKM classification coupling the objectives of KPCA and classification levels, with the hidden feature matrix lying on the Stiefel manifold. The classification level can be formulated as an LSSVM or as an MLP feature map, combining depth in terms of levels and layers. The classification level is expressed in its primal formulation, as the deep KPCA levels, in their dual formulation, can embed the most informative components of the data in a much lower dimensional space. The dual setting is independent of the dimension of the inputs and the primal setting is parametric, which makes the proposed method computationally efficient for both high-dimensional inputs and large datasets. In the experiments, we show that our developed algorithm can effectively learn from small datasets, while using less memory than the convolutional neural network (CNN) with high-dimensional data. and that models with multiple KPCA levels can outperform models with a single level. On the tested larger-scale datasets, DRKM is more energy efficient than CNN while maintaining comparable performance.
CVApr 28, 2021Code
Boosting Co-teaching with Compression Regularization for Label NoiseYingyi Chen, Xi Shen, Shell Xu Hu et al.
In this paper, we study the problem of learning image classification models in the presence of label noise. We revisit a simple compression regularization named Nested Dropout. We find that Nested Dropout, though originally proposed to perform fast information retrieval and adaptive data compression, can properly regularize a neural network to combat label noise. Moreover, owing to its simplicity, it can be easily combined with Co-teaching to further boost the performance. Our final model remains simple yet effective: it achieves comparable or even better performance than the state-of-the-art approaches on two real-world datasets with label noise which are Clothing1M and ANIMAL-10N. On Clothing1M, our approach obtains 74.9% accuracy which is slightly better than that of DivideMix. On ANIMAL-10N, we achieve 84.1% accuracy while the best public result by PLC is 83.4%. We hope that our simple approach can be served as a strong baseline for learning with label noise. Our implementation is available at https://github.com/yingyichen-cyy/Nested-Co-teaching.
MLJan 5, 2024
Nonlinear functional regression by functional deep neural network with kernel embeddingZhongjie Shi, Jun Fan, Linhao Song et al.
Recently, deep learning has been widely applied in functional data analysis (FDA) with notable empirical success. However, the infinite dimensionality of functional data necessitates an effective dimension reduction approach for functional learning tasks, particularly in nonlinear functional regression. In this paper, we introduce a functional deep neural network with an adaptive and discretization-invariant dimension reduction method. Our functional network architecture consists of three parts: first, a kernel embedding step that features an integral transformation with an adaptive smooth kernel; next, a projection step that utilizes eigenfunction bases based on a projection Mercer kernel for the dimension reduction; and finally, a deep ReLU neural network is employed for the prediction. Explicit rates of approximating nonlinear smooth functionals across various input function spaces by our proposed functional network are derived. Additionally, we conduct a generalization analysis for the empirical risk minimization (ERM) algorithm applied to our functional net, by employing a novel two-stage oracle inequality and the established functional approximation results. Ultimately, we conduct numerical experiments on both simulated and real datasets to demonstrate the effectiveness and benefits of our functional net.
LGJun 9, 2025
Accelerating Spectral Clustering under Fairness ConstraintsFrancesco Tonin, Alex Lambert, Johan A. K. Suykens et al.
Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications.
LGFeb 2, 2024
Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian ProcessesYingyi Chen, Qinghua Tao, Francesco Tonin et al.
While the great capability of Transformers significantly boosts prediction accuracy, it could also yield overconfident predictions and require calibrated uncertainty estimation, which can be commonly tackled by Gaussian processes (GPs). Existing works apply GPs with symmetric kernels under variational inference to the attention kernel; however, omitting the fact that attention kernels are in essence asymmetric. Moreover, the complexity of deriving the GP posteriors remains high for large-scale data. In this work, we propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention where the asymmetry of attention kernels is tackled by Kernel SVD (KSVD) and a reduced complexity is acquired. Through KEP-SVGP, i) the SVGP pair induced by the two sets of singular vectors from KSVD w.r.t. the attention kernel fully characterizes the asymmetry; ii) using only a small set of adjoint eigenfunctions from KSVD, the derivation of SVGP posteriors can be based on the inversion of a diagonal matrix containing singular values, contributing to a reduction in time complexity; iii) an evidence lower bound is derived so that variational parameters and network weights can be optimized with it. Experiments verify our excellent performances and efficiency on in-distribution, distribution-shift and out-of-distribution benchmarks.
LGFeb 13, 2024
Sparsity via Sparse Group $k$-max RegularizationQinghua Tao, Xiangming Xi, Jun Xu et al.
For the linear inverse problem with sparsity constraints, the $l_0$ regularized problem is NP-hard, and existing approaches either utilize greedy algorithms to find almost-optimal solutions or to approximate the $l_0$ regularization with its convex counterparts. In this paper, we propose a novel and concise regularization, namely the sparse group $k$-max regularization, which can not only simultaneously enhance the group-wise and in-group sparsity, but also casts no additional restraints on the magnitude of variables in each group, which is especially important for variables at different scales, so that it approximate the $l_0$ norm more closely. We also establish an iterative soft thresholding algorithm with local optimality conditions and complexity analysis provided. Through numerical experiments on both synthetic and real-world datasets, we verify the effectiveness and flexibility of the proposed method.
LGFeb 4, 2025
Generative Kernel Spectral ClusteringDavid Winant, Sonny Achten, Johan A. K. Suykens
Modern clustering approaches often trade interpretability for performance, particularly in deep learning-based methods. We present Generative Kernel Spectral Clustering (GenKSC), a novel model combining kernel spectral clustering with generative modeling to produce both well-defined clusters and interpretable representations. By augmenting weighted variance maximization with reconstruction and clustering losses, our model creates an explorable latent space where cluster characteristics can be visualized through traversals along cluster directions. Results on MNIST and FashionMNIST datasets demonstrate the model's ability to learn meaningful cluster representations.
LGJun 13, 2024
Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström methodQinghua Tao, Francesco Tonin, Alex Lambert et al.
In contrast with Mercer kernel-based approaches as used e.g., in Kernel Principal Component Analysis (KPCA), it was previously shown that Singular Value Decomposition (SVD) inherently relates to asymmetric kernels and Asymmetric Kernel Singular Value Decomposition (KSVD) has been proposed. However, the existing formulation to KSVD cannot work with infinite-dimensional feature mappings, the variational objective can be unbounded, and needs further numerical evaluation and exploration towards machine learning. In this work, i) we introduce a new asymmetric learning paradigm based on coupled covariance eigenproblem (CCE) through covariance operators, allowing infinite-dimensional feature maps. The solution to CCE is ultimately obtained from the SVD of the induced asymmetric kernel matrix, providing links to KSVD. ii) Starting from the integral equations corresponding to a pair of coupled adjoint eigenfunctions, we formalize the asymmetric Nyström method through a finite sample approximation to speed up training. iii) We provide the first empirical evaluations verifying the practical utility and benefits of KSVD and compare with methods resorting to symmetrization or linear SVD across multiple tasks.
LGJun 3, 2024
Learning Analysis of Kernel Ridgeless Regression with Asymmetric Kernel LearningFan He, Mingzhen He, Lei Shi et al.
Ridgeless regression has garnered attention among researchers, particularly in light of the ``Benign Overfitting'' phenomenon, where models interpolating noisy samples demonstrate robust generalization. However, kernel ridgeless regression does not always perform well due to the lack of flexibility. This paper enhances kernel ridgeless regression with Locally-Adaptive-Bandwidths (LAB) RBF kernels, incorporating kernel learning techniques to improve performance in both experiments and theory. For the first time, we demonstrate that functions learned from LAB RBF kernels belong to an integral space of Reproducible Kernel Hilbert Spaces (RKHSs). Despite the absence of explicit regularization in the proposed model, its optimization is equivalent to solving an $\ell_0$-regularized problem in the integral space of RKHSs, elucidating the origin of its generalization ability. Taking an approximation analysis viewpoint, we introduce an $l_q$-norm analysis technique (with $0<q<1$) to derive the learning rate for the proposed model under mild conditions. This result deepens our theoretical understanding, explaining that our algorithm's robust approximation ability arises from the large capacity of the integral space of RKHSs, while its generalization ability is ensured by sparsity, controlled by the number of support vectors. Experimental results on both synthetic and real datasets validate our theoretical conclusions.
MLJan 24, 2024
Can overfitted deep neural networks in adversarial training generalize? -- An approximation viewpointZhongjie Shi, Fanghui Liu, Yuan Cao et al.
Adversarial training is a widely used method to improve the robustness of deep neural networks (DNNs) over adversarial perturbations. However, it is empirically observed that adversarial training on over-parameterized networks often suffers from the \textit{robust overfitting}: it can achieve almost zero adversarial training error while the robust generalization performance is not promising. In this paper, we provide a theoretical understanding of the question of whether overfitted DNNs in adversarial training can generalize from an approximation viewpoint. Specifically, our main results are summarized into three folds: i) For classification, we prove by construction the existence of infinitely many adversarial training classifiers on over-parameterized DNNs that obtain arbitrarily small adversarial training error (overfitting), whereas achieving good robust generalization error under certain conditions concerning the data quality, well separated, and perturbation level. ii) Linear over-parameterization (meaning that the number of parameters is only slightly larger than the sample size) is enough to ensure such existence if the target function is smooth enough. iii) For regression, our results demonstrate that there also exist infinitely many overfitted DNNs with linear over-parameterization in adversarial training that can achieve almost optimal rates of convergence for the standard generalization error. Overall, our analysis points out that robust overfitting can be avoided but the required model capacity will depend on the smoothness of the target function, while a robust generalization gap is inevitable. We hope our analysis will give a better understanding of the mathematical foundations of robustness in DNNs from an approximation view.
LGMay 31, 2023
Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal RepresentationYingyi Chen, Qinghua Tao, Francesco Tonin et al.
Recently, a new line of works has emerged to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply the methods for symmetric kernels to the asymmetric self-attention, resulting in a nontrivial gap between the analytical understanding and numerical implementation. In this paper, we provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention normally observed in deep layers. Through asymmetric KSVD, $i$) a primal-dual representation of self-attention is formulated, where the optimization objective is cast to maximize the projection variances in the attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; $iii$) with KKT conditions, we prove that the stationary solution to the KSVD optimization in Primal-Attention yields a zero-value objective. In this manner, KSVD optimization can be implemented by simply minimizing a regularization loss, so that low-rank property is promoted without extra decomposition. Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.
LGMay 26, 2023
Duality in Multi-View Restricted Kernel MachinesSonny Achten, Arun Pandey, Hannes De Meulemeester et al.
We propose a unifying setting that combines existing restricted kernel machine methods into a single primal-dual multi-view framework for kernel principal component analysis in both supervised and unsupervised settings. We derive the primal and dual representations of the framework and relate different training and inference algorithms from a theoretical perspective. We show how to achieve full equivalence in primal and dual formulations by rescaling primal variables. Finally, we experimentally validate the equivalence and provide insight into the relationships between different methods on a number of time series data sets by recursively forecasting unseen test data and visualizing the learned features.
LGFeb 3, 2022
Learning with Asymmetric Kernels: Least Squares and Feature InterpretationMingzhen He, Fan He, Lei Shi et al.
Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses the asymmetric kernel-based learning in the framework of the least squares support vector machine named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational burden of AsK-LS is as cheap as dealing with symmetric kernels. Experimental results on the Corel database, directed graphs, and the UCI database will show that in the case asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels.
LGOct 26, 2021
Tensor Network Kalman Filtering for Large-Scale LS-SVMsMaximilian Lucassen, Johan A. K. Suykens, Kim Batselier
Least squares support vector machines are a commonly used supervised learning method for nonlinear regression and classification. They can be implemented in either their primal or dual form. The latter requires solving a linear system, which can be advantageous as an explicit mapping of the data to a possibly infinite-dimensional feature space is avoided. However, for large-scale applications, current low-rank approximation methods can perform inadequately. For example, current methods are probabilistic due to their sampling procedures, and/or suffer from a poor trade-off between the ranks and approximation power. In this paper, a recursive Bayesian filtering framework based on tensor networks and the Kalman filter is presented to alleviate the demanding memory and computational complexities associated with solving large-scale dual problems. The proposed method is iterative, does not require explicit storage of the kernel matrix, and allows the formulation of early stopping conditions. Additionally, the framework yields confidence estimates of obtained models, unlike alternative methods. The performance is tested on two regression and three classification experiments, and compared to the Nyström and fixed size LS-SVM methods. Results show that our method can achieve high performance and is particularly useful when alternative methods are computationally infeasible due to a slowly decaying kernel matrix spectrum.
MLOct 13, 2021
On the Double Descent of Random Features Models Trained with SGDFanghui Liu, Johan A. K. Suykens, Volkan Cevher
We study generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD) in under-/over-parameterized regime. In this work, we derive precise non-asymptotic error bounds of RF regression under both constant and polynomial-decay step-size SGD setting, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimum-norm interpolator, as a theoretical justification of using SGD in practice.
LGMay 28, 2021
Latent Space Exploration Using Generative Kernel PCADavid Winant, Joachim Schreurs, Johan A. K. Suykens
Kernel PCA is a powerful feature extractor which recently has seen a reformulation in the context of Restricted Kernel Machines (RKMs). These RKMs allow for a representation of kernel PCA in terms of hidden and visible units similar to Restricted Boltzmann Machines. This connection has led to insights on how to use kernel PCA in a generative procedure, called generative kernel PCA. In this paper, the use of generative kernel PCA for exploring latent spaces of datasets is investigated. New points can be generated by gradually moving in the latent space, which allows for an interpretation of the components. Firstly, examples of this feature space exploration on three datasets are shown with one of them leading to an interpretable representation of ECG signals. Afterwards, the use of the tool in combination with novelty detection is shown, where the latent space around novel patterns in the data is explored. This helps in the interpretation of why certain points are considered as novel.
LGMay 28, 2021
Towards Deterministic Diverse Subset SamplingJoachim Schreurs, Michaël Fanuel, Johan A. K. Suykens
Determinantal point processes (DPPs) are well known models for diverse subset selection problems, including recommendation tasks, document summarization and image search. In this paper, we discuss a greedy deterministic adaptation of k-DPP. Deterministic algorithms are interesting for many applications, as they provide interpretability to the user by having no failure probability and always returning the same results. First, the ability of the method to yield low-rank approximations of kernel matrices is evaluated by comparing the accuracy of the Nyström approximation on multiple datasets. Afterwards, we demonstrate the usefulness of the model on an image search task.
LGApr 6, 2021
Leverage Score Sampling for Complete Mode Coverage in Generative Adversarial NetworksJoachim Schreurs, Hannes De Meulemeester, Michaël Fanuel et al.
Commonly, machine learning models minimize an empirical expectation. As a result, the trained models typically perform well for the majority of the data but the performance may deteriorate in less dense regions of the dataset. This issue also arises in generative modeling. A generative model may overlook underrepresented modes that are less frequent in the empirical data distribution. This problem is known as complete mode coverage. We propose a sampling procedure based on ridge leverage scores which significantly improves mode coverage when compared to standard methods and can easily be combined with any GAN. Ridge leverage scores are computed by using an explicit feature map, associated with the next-to-last layer of a GAN discriminator or of a pre-trained network, or by using an implicit feature map corresponding to a Gaussian kernel. Multiple evaluations against recent approaches of complete mode coverage show a clear improvement when using the proposed sampling strategy.
LGFeb 16, 2021
Unsupervised Energy-based Out-of-distribution Detection using Stiefel-Restricted Kernel MachineFrancesco Tonin, Arun Pandey, Panagiotis Patrinos et al.
Detecting out-of-distribution (OOD) samples is an essential requirement for the deployment of machine learning systems in the real world. Until now, research on energy-based OOD detectors has focused on the softmax confidence score from a pre-trained neural network classifier with access to class labels. In contrast, we propose an unsupervised energy-based OOD detector leveraging the Stiefel-Restricted Kernel Machine (St-RKM). Training requires minimizing an objective function with an autoencoder loss term and the RKM energy where the interconnection matrix lies on the Stiefel manifold. Further, we outline multiple energy function definitions based on the RKM framework and discuss their utility. In the experiments on standard datasets, the proposed method improves over the existing energy-based OOD detectors and deep generative models. Through several ablation studies, we further illustrate the merit of each proposed energy function on the OOD detection performance.
LGNov 25, 2020
Unsupervised learning of disentangled representations in deep restricted kernel machines with orthogonality constraintsFrancesco Tonin, Panagiotis Patrinos, Johan A. K. Suykens
We introduce Constr-DRKM, a deep kernel method for the unsupervised learning of disentangled data representations. We propose augmenting the original deep restricted kernel machine formulation for kernel PCA by orthogonality constraints on the latent variables to promote disentanglement and to make it possible to carry out optimization without first defining a stabilized objective. After illustrating an end-to-end training procedure based on a quadratic penalty optimization algorithm with warm start, we quantitatively evaluate the proposed method's effectiveness in disentangled feature learning. We demonstrate on four benchmark datasets that this approach performs similarly overall to $β$-VAE on a number of disentanglement metrics when few training points are available, while being less sensitive to randomness and hyperparameter selection than $β$-VAE. We also present a deterministic initialization of Constr-DRKM's training algorithm that significantly improves the reproducibility of the results. Finally, we empirically evaluate and discuss the role of the number of layers in the proposed methodology, examining the influence of each principal component in every layer and showing that components in lower layers act as local feature detectors capturing the broad trends of the data distribution, while components in deeper layers use the representation learned by previous layers and more accurately reproduce higher-level features.
LGNov 13, 2020
Determinantal Point Processes Implicitly Regularize Semi-parametric Regression ProblemsMichaël Fanuel, Joachim Schreurs, Johan A. K. Suykens
Semi-parametric regression models are used in several applications which require comprehensibility without sacrificing accuracy. Typical examples are spline interpolation in geophysics, or non-linear time series problems, where the system includes a linear and non-linear component. We discuss here the use of a finite Determinantal Point Process (DPP) for approximating semi-parametric models. Recently, Barthelmé, Tremblay, Usevich, and Amblard introduced a novel representation of some finite DPPs. These authors formulated extended L-ensembles that can conveniently represent partial-projection DPPs and suggest their use for optimal interpolation. With the help of this formalism, we derive a key identity illustrating the implicit regularization effect of determinantal sampling for semi-parametric regression and interpolation. Also, a novel projected Nyström approximation is defined and used to derive a bound on the expected risk for the corresponding approximation of semi-parametric regression. This work naturally extends similar results obtained for kernel ridge regression.
LGNov 3, 2020
Towards a Unified Quadrature Framework for Large-Scale Kernel MachinesFanghui Liu, Xiaolin Huang, Yudong Chen et al.
In this paper, we develop a quadrature framework for large-scale kernel machines via a numerical integration representation. Considering that the integration domain and measure of typical kernels, e.g., Gaussian kernels, arc-cosine kernels, are fully symmetric, we leverage deterministic fully symmetric interpolatory rules to efficiently compute quadrature nodes and associated weights for kernel approximation. The developed interpolatory rules are able to reduce the number of needed nodes while retaining a high approximation accuracy. Further, we randomize the above deterministic rules by the classical Monte-Carlo sampling and control variates techniques with two merits: 1) The proposed stochastic rules make the dimension of the feature mapping flexibly varying, such that we can control the discrepancy between the original and approximate kernels by tuning the dimnension. 2) Our stochastic rules have nice statistical properties of unbiasedness and variance reduction with fast convergence rate. In addition, we elucidate the relationship between our deterministic/stochastic interpolatory rules and current quadrature rules for kernel approximation, including the sparse grids quadrature and stochastic spherical-radial rules, thereby unifying these methods under our framework. Experimental results on several benchmark datasets show that our methods compare favorably with other representative kernel approximation based methods.
MLOct 6, 2020
Kernel regression in high dimensions: Refined analysis beyond double descentFanghui Liu, Zhenyu Liao, Johan A. K. Suykens
In this paper, we provide a precise characterization of generalization properties of high dimensional kernel ridge regression across the under- and over-parameterized regimes, depending on whether the number of training data n exceeds the feature dimension d. By establishing a bias-variance decomposition of the expected excess risk, we show that, while the bias is (almost) independent of d and monotonically decreases with n, the variance depends on n, d and can be unimodal or monotonically decreasing under different regularization schemes. Our refined analysis goes beyond the double descent theory by showing that, depending on the data eigen-profile and the level of regularization, the kernel regression risk curve can be a double-descent-like, bell-shaped, or monotonic function of n. Experiments on synthetic and real data are conducted to support our theoretical findings.
MLAug 5, 2020
Outlier detection in non-elliptical data by kernel MRCDJoachim Schreurs, Iwein Vranckx, Mia Hubert et al.
The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed Kernel Minimum Regularized Covariance Determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.
LGJun 25, 2020
A Theoretical Framework for Target PropagationAlexander Meulemans, Francesco S. Carzaniga, Johan A. K. Suykens et al.
The success of deep learning, a brain-inspired form of AI, has sparked interest in understanding how the brain could similarly learn across multiple layers of neurons. However, the majority of biologically-plausible learning algorithms have not yet reached the performance of backpropagation (BP), nor are they built on strong theoretical foundations. Here, we analyze target propagation (TP), a popular but not yet fully understood alternative to BP, from the standpoint of mathematical optimization. Our theory shows that TP is closely related to Gauss-Newton optimization and thus substantially differs from BP. Furthermore, our analysis reveals a fundamental limitation of difference target propagation (DTP), a well-known variant of TP, in the realistic scenario of non-invertible neural networks. We provide a first solution to this problem through a novel reconstruction loss that improves feedback weight training, while simultaneously introducing architectural flexibility by allowing for direct feedback connections from the output to each hidden layer. Our theory is corroborated by experimental results that show significant improvements in performance and in the alignment of forward weight updates with loss gradients, compared to DTP.
LGJun 24, 2020
Ensemble Kernel Methods, Implicit Regularization and Determinantal Point ProcessesJoachim Schreurs, Michaël Fanuel, Johan A. K. Suykens
By using the framework of Determinantal Point Processes (DPPs), some theoretical results concerning the interplay between diversity and regularization can be obtained. In this paper we show that sampling subsets with kDPPs results in implicit regularization in the context of ridgeless Kernel Regression. Furthermore, we leverage the common setup of state-of-the-art DPP algorithms to sample multiple small subsets and use them in an ensemble of ridgeless regressions. Our first empirical results indicate that ensemble of ridgeless regressors can be interesting to use for datasets including redundant information.
LGJun 16, 2020
The Bures Metric for Generative Adversarial NetworksHannes De Meulemeester, Joachim Schreurs, Michaël Fanuel et al.
Generative Adversarial Networks (GANs) are performant generative methods yielding high-quality samples. However, under certain circumstances, the training of GANs can lead to mode collapse or mode dropping, i.e. the generative models not being able to sample from the entire probability distribution. To address this problem, we use the last layer of the discriminator as a feature map to study the distribution of the real and the fake data. During training, we propose to match the real batch diversity to the fake batch diversity by using the Bures distance between covariance matrices in feature space. The computation of the Bures distance can be conveniently done in either feature space or kernel space in terms of the covariance and kernel matrix respectively. We observe that diversity matching reduces mode collapse substantially and has a positive effect on the sample quality. On the practical side, a very simple training procedure, that does not require additional hyperparameter tuning, is proposed and assessed on several datasets.
LGJun 12, 2020
Disentangled Representation Learning and Generation with Manifold OptimizationArun Pandey, Michael Fanuel, Joachim Schreurs et al.
Disentanglement is a useful property in representation learning which increases the interpretability of generative models such as Variational autoencoders (VAE), Generative Adversarial Models, and their many variants. Typically in such models, an increase in disentanglement performance is traded-off with generation quality. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement by encouraging orthogonal directions of variations. The proposed objective is the sum of an autoencoder error term along with a Principal Component Analysis reconstruction error in the feature space. This has an interpretation of a Restricted Kernel Machine with the eigenvector matrix-valued on the Stiefel manifold. Our analysis shows that such a construction promotes disentanglement by matching the principal directions in the latent space with the directions of orthogonal variation in data space. In an alternating minimization scheme, we use Cayley ADAM algorithm - a stochastic optimization method on the Stiefel manifold along with the ADAM optimizer. Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.
MLJun 1, 2020
Analysis of Regularized Least Squares in Reproducing Kernel Krein SpacesFanghui Liu, Lei Shi, Xiaolin Huang et al.
In this paper, we study the asymptotic properties of regularized least squares with indefinite kernels in reproducing kernel Krein spaces (RKKS). By introducing a bounded hyper-sphere constraint to such non-convex regularized risk minimization problem, we theoretically demonstrate that this problem has a globally optimal solution with a closed form on the sphere, which makes approximation analysis feasible in RKKS. Regarding to the original regularizer induced by the indefinite inner product, we modify traditional error decomposition techniques, prove convergence results for the introduced hypothesis error based on matrix perturbation theory, and derive learning rates of such regularized regression problem in RKKS. Under some conditions, the derived learning rates in RKKS are the same as that in reproducing kernel Hilbert spaces (RKHS), which is actually the first work on approximation analysis of regularized learning algorithms in RKKS.
LGMay 30, 2020
Fast Learning in Reproducing Kernel Krein Spaces via Signed MeasuresFanghui Liu, Xiaolin Huang, Yingyi Chen et al.
In this paper, we attempt to solve a long-lasting open question for non-positive definite (non-PD) kernels in machine learning community: can a given non-PD kernel be decomposed into the difference of two PD kernels (termed as positive decomposition)? We cast this question as a distribution view by introducing the \emph{signed measure}, which transforms positive decomposition to measure decomposition: a series of non-PD kernels can be associated with the linear combination of specific finite Borel measures. In this manner, our distribution-based framework provides a sufficient and necessary condition to answer this open question. Specifically, this solution is also computationally implementable in practice to scale non-PD kernels in large sample cases, which allows us to devise the first random features algorithm to obtain an unbiased estimator. Experimental results on several benchmark datasets verify the effectiveness of our algorithm over the existing methods.
MLApr 23, 2020
Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and BeyondFanghui Liu, Xiaolin Huang, Yudong Chen et al.
Random features is one of the most popular techniques to speed up kernel methods in large-scale problems. Related works have been recognized by the NeurIPS Test-of-Time award in 2017 and the ICML Best Paper Finalist in 2019. The body of work on random features has grown rapidly, and hence it is desirable to have a comprehensive overview on this topic explaining the connections among various algorithms and theoretical results. In this survey, we systematically review the work on random features from the past ten years. First, the motivations, characteristics and contributions of representative random features based algorithms are summarized according to their sampling schemes, learning procedures, variance reduction properties and how they exploit training data. Second, we review theoretical results that center around the following key question: how many random features are needed to ensure a high approximation quality or no loss in the empirical/expected risks of the learned estimator. Third, we provide a comprehensive evaluation of popular random features based algorithms on several large-scale benchmark datasets and discuss their approximation quality and prediction performance for classification. Last, we discuss the relationship between random features and modern over-parameterized deep neural networks (DNNs), including the use of high dimensional random features in the analysis of DNNs as well as the gaps between current theoretical and empirical results. This survey may serve as a gentle introduction to this topic, and as a users' guide for practitioners interested in applying the representative algorithms and understanding theoretical results under various technical assumptions. We hope that this survey will facilitate discussion on the open problems in this topic, and more importantly, shed light on future research directions.
LGFeb 20, 2020
Diversity sampling is an implicit regularization for kernel methodsMichaël Fanuel, Joachim Schreurs, Johan A. K. Suykens
Kernel methods have achieved very good performance on large scale regression and classification problems, by using the Nyström method and preconditioning techniques. The Nyström approximation -- based on a subset of landmarks -- gives a low rank approximation of the kernel matrix, and is known to provide a form of implicit regularization. We further elaborate on the impact of sampling diverse landmarks for constructing the Nyström approximation in supervised as well as unsupervised kernel methods. By using Determinantal Point Processes for sampling, we obtain additional theoretical results concerning the interplay between diversity and regularization. Empirically, we demonstrate the advantages of training kernel methods based on subsets made of diverse points. In particular, if the dataset has a dense bulk and a sparser tail, we show that Nyström kernel regression with diverse landmarks increases the accuracy of the regression in sparser regions of the dataset, with respect to a uniform landmark sampling. A greedy heuristic is also proposed to select diverse samples of significant size within large datasets when exact DPP sampling is not practically feasible.
LGFeb 5, 2020
Wasserstein Exponential KernelsHenri De Plaen, Michaël Fanuel, Johan A. K. Suykens
In the context of kernel methods, the similarity between data points is encoded by the kernel function which is often defined thanks to the Euclidean distance, a common example being the squared exponential kernel. Recently, other distances relying on optimal transport theory - such as the Wasserstein distance between probability distributions - have shown their practical relevance for different machine learning techniques. In this paper, we study the use of exponential kernels defined thanks to the regularized Wasserstein distance and discuss their positive definiteness. More specifically, we define Wasserstein feature maps and illustrate their interest for supervised learning problems involving shapes and images. Empirically, Wasserstein squared exponential kernels are shown to yield smaller classification errors on small training sets of shapes, compared to analogous classifiers using Euclidean distances.
MLFeb 4, 2020
Robust Generative Restricted Kernel Machines using Weighted Conjugate Feature DualityArun Pandey, Joachim Schreurs, Johan A. K. Suykens
Interest in generative models has grown tremendously in the past decade. However, their training performance can be adversely affected by contamination, where outliers are encoded in the representation of the model. This results in the generation of noisy data. In this paper, we introduce weighted conjugate feature duality in the framework of Restricted Kernel Machines (RKMs). The RKM formulation allows for an easy integration of methods from classical robust statistics. This formulation is used to fine-tune the latent space of generative RKMs using a weighting function based on the Minimum Covariance Determinant, which is a highly robust estimator of multivariate location and scatter. Experiments show that the weighted RKM is capable of generating clean images when contamination is present in the training data. We further show that the robust method also preserves uncorrelated feature learning through qualitative and quantitative experiments on standard datasets.
LGNov 20, 2019
Random Fourier Features via Fast Surrogate Leverage Weighted SamplingFanghui Liu, Xiaolin Huang, Yudong Chen et al.
In this paper, we propose a fast surrogate leverage weighted sampling strategy to generate refined random Fourier features for kernel approximation. Compared to the current state-of-the-art method that uses the leverage weighted scheme [Li-ICML2019], our new strategy is simpler and more effective. It uses kernel alignment to guide the sampling process and it can avoid the matrix inversion operator when we compute the leverage function. Given n observations and s random features, our strategy can reduce the time complexity from O(ns^2+s^3) to O(ns^2), while achieving comparable (or even slightly better) prediction performance when applied to kernel ridge regression (KRR). In addition, we provide theoretical guarantees on the generalization performance of our approach, and in particular characterize the number of random features required to achieve statistical guarantees in KRR. Experiments on several benchmark datasets demonstrate that our algorithm achieves comparable prediction performance and takes less time cost when compared to [Li-ICML2019].
LGJun 19, 2019
Generative Restricted Kernel Machines: A Framework for Multi-view Generation and Disentangled Feature LearningArun Pandey, Joachim Schreurs, Johan A. K. Suykens
This paper introduces a novel framework for generative models based on Restricted Kernel Machines (RKMs) with joint multi-view generation and uncorrelated feature learning, called Gen-RKM. To enable joint multi-view generation, this mechanism uses a shared representation of data from various views. Furthermore, the model has a primal and dual formulation to incorporate both kernel-based and (deep convolutional) neural network based models within the same setting. When using neural networks as explicit feature-maps, a novel training procedure is proposed, which jointly learns the features and shared subspace representation. The latent variables are given by the eigen-decomposition of the kernel matrix, where the mutual orthogonality of eigenvectors represents the learned uncorrelated features. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of generated samples on various standard datasets.