MLApr 27, 2025
Test Set Sizing for the Ridge RegressionAlexander Dubbs
We derive the ideal train/test split for the ridge regression to high accuracy in the limit that the number of training rows m becomes large. The split must depend on the ridge tuning parameter, alpha, but we find that the dependence is weak and can asymptotically be ignored; all parameters vanish except for m and the number of features, n, which is held constant. This is the first time that such a split is calculated mathematically for a machine learning model in the large data limit. The goal of the calculations is to maximize "integrity," so that the measured error in the trained model is as close as possible to what it theoretically should be. This paper's result for the ridge regression split matches prior art for the plain vanilla linear regression split to the first two terms asymptotically.
MLDec 11, 2021
Test Set Sizing Via Random Matrix TheoryAlexander Dubbs
This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise, and thus fairly reflects the value or lack of same of the model. This paper is the first to solve for the training and test size for any model in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial Theorem 1 derives which depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulties were realizing that the problems herein were discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. Mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.
CVJun 19, 2015
moco: Fast Motion Correction for Calcium ImagingAlexander Dubbs, James Guevara, Darcy S. Peterka et al.
Motion correction is the first in a pipeline of algorithms to analyze calcium imaging videos and extract biologically relevant information, for example the network structure of the neurons therein. Fast motion correction would be especially critical for closed-loop activity triggered stimulation experiments, where accurate detection and targeting of specific cells in necessary. Our algorithm uses a Fourier-transform approach, and its efficiency derives from a combination of judicious downsampling and the accelerated computation of many $L_2$ norms using dynamic programming and two-dimensional, fft-accelerated convolutions. Its accuracy is comparable to that of established community-used algorithms, and it is more stable to large translational motions. It is programmed in Java and is compatible with ImageJ.