Gauge-optimal approximate learning for small data classification problems
This addresses small data learning challenges in fields like climate science and bioinformatics, offering a novel method that is incremental in combining existing ideas into a joint solution.
The paper tackles small data classification problems where limited observations and high feature dimensions hinder learning, proposing the GOAL algorithm that reduces and rotates the feature space to jointly address dimension reduction, feature segmentation, and classification, and shows it outperforms state-of-the-art methods in learning performance and computational cost on synthetic and real-world datasets like El Nino prediction and gene-activity inference.
Small data learning problems are characterized by a significant discrepancy between the limited amount of response variable observations and the large feature space dimension. In this setting, the common learning tools struggle to identify the features important for the classification task from those that bear no relevant information, and cannot derive an appropriate learning rule which allows to discriminate between different classes. As a potential solution to this problem, here we exploit the idea of reducing and rotating the feature space in a lower-dimensional gauge and propose the Gauge-Optimal Approximate Learning (GOAL) algorithm, which provides an analytically tractable joint solution to the dimension reduction, feature segmentation and classification problems for small data learning problems. We prove that the optimal solution of the GOAL algorithm consists in piecewise-linear functions in the Euclidean space, and that it can be approximated through a monotonically convergent algorithm which presents -- under the assumption of a discrete segmentation of the feature space -- a closed-form solution for each optimization substep and an overall linear iteration cost scaling. The GOAL algorithm has been compared to other state-of-the-art machine learning (ML) tools on both synthetic data and challenging real-world applications from climate science and bioinformatics (i.e., prediction of the El Nino Southern Oscillation and inference of epigenetically-induced gene-activity networks from limited experimental data). The experimental results show that the proposed algorithm outperforms the reported best competitors for these problems both in learning performance and computational cost.