16.3MLJun 2
Central Description Length (CDL) Clustering Validation IndexMahdi Shamsi, Soosan Beheshti
Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering machine learning pipelines that work with unsupervised analysis of sensor, image, or process data. Clustering validation indices (CVIs) provide internal scores for ranking candidate clusterings, but most popular CVIs are built from Euclidean compactness and separation terms and so tend to favour compact, convex partitions. Their performance is known to degrade on non convex, irregular, or variable density data, where kernel transformations or alternative distance measures are typically used at the cost of additional tuning and computation. This paper introduces the Central Description Length (CDL) clustering validation index. CDL uses the observed within cluster compactness, the estimated cluster centers, and the estimated cluster covariances to compute a probabilistic upper bound on the description length associated with the unobservable true cluster centers. The bound condenses intra cluster compactness and centroid displacement into a single computable quantity and is evaluated on the partition produced by any clustering algorithm. The implementation uses only observable quantities (the data, the partition, the estimated centers, and the estimated covariances) and does not use ground truth labels. On synthetic benchmarks with non convex and arbitrary shape clusters, CDL-CVI selected the reference number of clusters more often and reached higher Adjusted Rand Index (ARI) values than the conventional CVIs we tested, without an additional kernel preprocessing stage. On image benchmarks (MNIST, CIFAR-10, STL-10) clustered from frozen unsupervised embeddings, CDL-CVI returned cluster numbers close to the reference class counts across K-means, DBSCAN, and spectral clustering in the reported trials.
MLMar 28, 2023
Learnability, Sample Complexity, and Hypothesis Class Complexity for Regression ModelsSoosan Beheshti, Mahdi Shamsi
The goal of a learning algorithm is to receive a training data set as input and provide a hypothesis that can generalize to all possible data points from a domain set. The hypothesis is chosen from hypothesis classes with potentially different complexities. Linear regression modeling is an important category of learning algorithms. The practical uncertainty of the target samples affects the generalization performance of the learned model. Failing to choose a proper model or hypothesis class can lead to serious issues such as underfitting or overfitting. These issues have been addressed by alternating cost functions or by utilizing cross-validation methods. These approaches can introduce new hyperparameters with their own new challenges and uncertainties or increase the computational complexity of the learning algorithm. On the other hand, the theory of probably approximately correct (PAC) aims at defining learnability based on probabilistic settings. Despite its theoretical value, PAC does not address practical learning issues on many occasions. This work is inspired by the foundation of PAC and is motivated by the existing regression learning issues. The proposed approach, denoted by epsilon-Confidence Approximately Correct (epsilon CoAC), utilizes Kullback Leibler divergence (relative entropy) and proposes a new related typical set in the set of hyperparameters to tackle the learnability issue. Moreover, it enables the learner to compare hypothesis classes of different complexity orders and choose among them the optimum with the minimum epsilon in the epsilon CoAC framework. Not only the epsilon CoAC learnability overcomes the issues of overfitting and underfitting, but it also shows advantages and superiority over the well known cross-validation method in the sense of time consumption as well as in the sense of accuracy.
MLMay 17, 2023
Separability and Scatteredness (S&S) Ratio-Based Efficient SVM Regularization Parameter, Kernel, and Kernel Parameter SelectionMahdi Shamsi, Soosan Beheshti
Support Vector Machine (SVM) is a robust machine learning algorithm with broad applications in classification, regression, and outlier detection. SVM requires tuning the regularization parameter (RP) which controls the model capacity and the generalization performance. Conventionally, the optimum RP is found by comparison of a range of values through the Cross-Validation (CV) procedure. In addition, for non-linearly separable data, the SVM uses kernels where a set of kernels, each with a set of parameters, denoted as a grid of kernels, are considered. The optimal choice of RP and the grid of kernels is through the grid-search of CV. By stochastically analyzing the behavior of the regularization parameter, this work shows that the SVM performance can be modeled as a function of separability and scatteredness (S&S) of the data. Separability is a measure of the distance between classes, and scatteredness is the ratio of the spread of data points. In particular, for the hinge loss cost function, an S&S ratio-based table provides the optimum RP. The S&S ratio is a powerful value that can automatically detect linear or non-linear separability before using the SVM algorithm. The provided S&S ratio-based table can also provide the optimum kernel and its parameters before using the SVM algorithm. Consequently, the computational complexity of the CV grid-search is reduced to only one time use of the SVM. The simulation results on the real dataset confirm the superiority and efficiency of the proposed approach in the sense of computational complexity over the grid-search CV method.
CVMay 17, 2020
Hyperspectral Image Classification Based on Sparse Modeling of Spectral BlocksSaeideh Ghanbari Azar, Saeed Meshgini, Tohid Yousefi Rezaii et al.
Hyperspectral images provide abundant spatial and spectral information that is very valuable for material detection in diverse areas of practical science. The high-dimensions of data lead to many processing challenges that can be addressed via existent spatial and spectral redundancies. In this paper, a sparse modeling framework is proposed for hyperspectral image classification. Spectral blocks are introduced to be used along with spatial groups to jointly exploit spectral and spatial redundancies. To reduce the computational complexity of sparse modeling, spectral blocks are used to break the high-dimensional optimization problems into small-size sub-problems that are faster to solve. Furthermore, the proposed sparse structure enables to extract the most discriminative spectral blocks and further reduce the computational burden. Experiments on three benchmark datasets, i.e., Pavia University Image and Indian Pines Image verify that the proposed method leads to a robust sparse modeling of hyperspectral images and improves the classification accuracy compared to several state-of-the-art methods. Moreover, the experiments demonstrate that the proposed method requires less processing time.
APFeb 8, 2014
Efficient Low Dose X-ray CT Reconstruction through Sparsity-Based MAP ModelingSayedMasoud Hashemi, Soosan Beheshti, Patrick R. Gill et al.
Ultra low radiation dose in X-ray Computed Tomography (CT) is an important clinical objective in order to minimize the risk of carcinogenesis. Compressed Sensing (CS) enables significant reductions in radiation dose to be achieved by producing diagnostic images from a limited number of CT projections. However, the excessive computation time that conventional CS-based CT reconstruction typically requires has limited clinical implementation. In this paper, we first demonstrate that a thorough analysis of CT reconstruction through a Maximum a Posteriori objective function results in a weighted compressive sensing problem. This analysis enables us to formulate a low dose fan beam and helical cone beam CT reconstruction. Subsequently, we provide an efficient solution to the formulated CS problem based on a Fast Composite Splitting Algorithm-Latent Expected Maximization (FCSA-LEM) algorithm. In the proposed method we use pseudo polar Fourier transform as the measurement matrix in order to decrease the computational complexity; and rebinning of the projections to parallel rays in order to extend its application to fan beam and helical cone beam scans. The weight involved in the proposed weighted CS model, denoted by Error Adaptation Weight (EAW), is calculated based on the statistical characteristics of CT reconstruction and is a function of Poisson measurement noise and rebinning interpolation error. Simulation results show that low computational complexity of the proposed method made the fast recovery of the CT images possible and using EAW reduces the reconstruction error by one order of magnitude. Recovery of a high quality 512$\times$ 512 image was achieved in less than 20 sec on a desktop computer without numerical optimizations.
LGJan 9, 2014
Efficient unimodality test in clustering by signature testingMahdi Shahbaba, Soosan Beheshti
This paper provides a new unimodality test with application in hierarchical clustering methods. The proposed method denoted by signature test (Sigtest), transforms the data based on its statistics. The transformed data has much smaller variation compared to the original data and can be evaluated in a simple proposed unimodality test. Compared with the existing unimodality tests, Sigtest is more accurate in detecting the overlapped clusters and has a much less computational complexity. Simulation results demonstrate the efficiency of this statistic test for both real and synthetic data sets.