MLNov 15, 2018
The Trace Criterion for Kernel Bandwidth Selection for Support Vector Data DescriptionArin Chaudhuri, Carol Sadek, Deovrat Kakde et al.
Support vector data description (SVDD) is a popular anomaly detection technique. The SVDD classifier partitions the whole data space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, for which the Gaussian kernel is a common choice. The Gaussian kernel has a bandwidth parameter, and it is important to set the value of this parameter correctly for good results. A small bandwidth leads to overfitting such that the resulting SVDD classifier overestimates the number of anomalies, whereas a large bandwidth leads to underfitting and an inability to detect many anomalies. In this paper, we present a new unsupervised method for selecting the Gaussian kernel bandwidth. Our method exploits a low-rank representation of the kernel matrix to suggest a kernel bandwidth value. Our new technique is competitive with the current state of the art for low-dimensional data and performs extremely well for many classes of high-dimensional data. Because the mathematical formulation of SVDD is identical with the mathematical formulation of one-class support vector machines (OCSVM) when the Gaussian kernel is used, our method is equally applicable to Gaussian kernel bandwidth tuning for OCSVM.
MLSep 1, 2017
Fast Incremental SVDD Learning Algorithm with the Gaussian KernelHansi Jiang, Haoyu Wang, Wenhao Hu et al.
Support vector data description (SVDD) is a machine learning technique that is used for single-class classification and outlier detection. The idea of SVDD is to find a set of support vectors that defines a boundary around data. When dealing with online or large data, existing batch SVDD methods have to be rerun in each iteration. We propose an incremental learning algorithm for SVDD that uses the Gaussian kernel. This algorithm builds on the observation that all support vectors on the boundary have the same distance to the center of sphere in a higher-dimensional feature space as mapped by the Gaussian kernel function. Each iteration involves only the existing support vectors and the new data point. Moreover, the algorithm is based solely on matrix manipulations; the support vectors and their corresponding Lagrange multiplier $α_i$'s are automatically selected and determined in each iteration. It can be seen that the complexity of our algorithm in each iteration is only $O(k^2)$, where $k$ is the number of support vectors. Experimental results on some real data sets indicate that FISVDD demonstrates significant gains in efficiency with almost no loss in either outlier detection accuracy or objective function value.
LGJun 16, 2016
Sampling Method for Fast Training of Support Vector Data DescriptionArin Chaudhuri, Deovrat Kakde, Maria Jahja et al.
Support Vector Data Description (SVDD) is a popular outlier detection technique which constructs a flexible description of the input data. SVDD computation time is high for large training datasets which limits its use in big-data process-monitoring applications. We propose a new iterative sampling-based method for SVDD training. The method incrementally learns the training data description at each iteration by computing SVDD on an independent random sample selected with replacement from the training data set. The experimental results indicate that the proposed method is extremely fast and provides a good data description .
LGFeb 17, 2016
Peak Criterion for Choosing Gaussian Kernel Bandwidth in Support Vector Data DescriptionDeovrat Kakde, Arin Chaudhuri, Seunghyun Kong et al.
Support Vector Data Description (SVDD) is a machine-learning technique used for single class classification and outlier detection. SVDD formulation with kernel function provides a flexible boundary around data. The value of kernel function parameters affects the nature of the data boundary. For example, it is observed that with a Gaussian kernel, as the value of kernel bandwidth is lowered, the data boundary changes from spherical to wiggly. The spherical data boundary leads to underfitting, and an extremely wiggly data boundary leads to overfitting. In this paper, we propose empirical criterion to obtain good values of the Gaussian kernel bandwidth parameter. This criterion provides a smooth boundary that captures the essential geometric features of the data.
MLOct 19, 2015
Modularity Component Analysis versus Principal Component AnalysisHansi Jiang, Carl Meyer
In this paper the exact linear relation between the leading eigenvectors of the modularity matrix and the singular vectors of an uncentered data matrix is developed. Based on this analysis the concept of a modularity component is defined, and its properties are developed. It is shown that modularity component analysis can be used to cluster data similar to how traditional principal component analysis is used except that modularity component analysis does not require data centering.
MLMay 9, 2015
Relations Between Adjacency and Modularity Graph PartitioningHansi Jiang, Carl Meyer
This paper develops the exact linear relationship between the leading eigenvector of the unnormalized modularity matrix and the eigenvectors of the adjacency matrix. We propose a method for approximating the leading eigenvector of the modularity matrix, and we derive the error of the approximation. There is also a complete proof of the equivalence between normalized adjacency clustering and normalized modularity clustering. Numerical experiments show that normalized adjacency clustering can be as twice efficient as normalized modularity clustering.