Unbiased estimates for linear regression via volume sampling
This work addresses the problem of reducing label costs in linear regression, where labels are expensive and only a small sampled subset can be obtained. It offers incremental improvements in algorithm speed and theoretical bounds.
The paper estimates the pseudo-inverse of a matrix and the optimal least-squares solution from only a subset of columns. It shows that volume sampling yields unbiased estimators and that the pseudo-inverse estimator has a closed-form covariance. The same techniques give a faster algorithm for volume sampling and bounds on the total loss, establishing a fundamental connection between linear least squares and volume sampling.
Given a full-rank matrix $X$ with more columns than rows, consider the task of estimating the pseudo-inverse $X^+$ based on the pseudo-inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen with probability proportional to the squared volume spanned by the rows of the chosen submatrix (i.e., volume sampling). The resulting estimator is unbiased and, surprisingly, its covariance also has a closed form: it equals a specific factor times $X^{+\top}X^+$. The pseudo-inverse plays an important part in solving the linear least-squares problem, where we try to predict a label for each column of $X$. We assume that labels are expensive and that we are given the labels only for the small subset of columns we sample from $X$. Using our methods, we show that the weight vector of the solution for the subproblem is an unbiased estimator of the optimal solution for the whole problem based on all column labels. We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than the state of the art and to obtain bounds for the total loss of the estimated least-squares solution on all labeled columns.
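The unbiasedness claims can be checked numerically on a tiny instance. The following is a minimal sketch under assumptions, not the paper's fast sampling algorithm. It enumerates every column subset by brute force, which is feasible only for very small matrices. It assumes $X$ is stored as a $d \times n$ array, that a subset $S$ of $s \ge d$ columns is drawn with probability proportional to $\det(X_S X_S^\top)$, and that the pseudo-inverse of the sampled columns is padded with zero rows so that it has the same shape as $X^+$. The function names are hypothetical and chosen only for illustration.

```python
import itertools

import numpy as np


def volume_sampling_distribution(X, s):
    """Enumerate all size-s column subsets with their volume-sampling probabilities."""
    n = X.shape[1]
    subsets = [list(S) for S in itertools.combinations(range(n), s)]
    # Unnormalized weight: squared volume det(X_S X_S^T) of the chosen columns.
    weights = np.array([np.linalg.det(X[:, S] @ X[:, S].T) for S in subsets])
    return subsets, weights / weights.sum()


def expected_padded_pinv(X, s):
    """Exact expectation of the zero-padded pseudo-inverse of the sampled columns."""
    d, n = X.shape
    subsets, probs = volume_sampling_distribution(X, s)
    expectation = np.zeros((n, d))
    for S, p in zip(subsets, probs):
        padded = np.zeros((n, d))
        padded[S, :] = np.linalg.pinv(X[:, S])  # rows outside S stay zero (assumed convention)
        expectation += p * padded
    return expectation


def expected_subproblem_weights(X, y, s):
    """Exact expectation of the least-squares weights fitted on the sampled labels only."""
    d = X.shape[0]
    subsets, probs = volume_sampling_distribution(X, s)
    expectation = np.zeros(d)
    for S, p in zip(subsets, probs):
        # Solve the subproblem that sees only the columns in S and their labels.
        w_S = np.linalg.pinv(X[:, S].T) @ y[S]
        expectation += p * w_S
    return expectation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2, 6))  # d = 2 rows, n = 6 columns
    y = rng.standard_normal(6)       # one label per column
    # Compare the exact expectations with the full-data quantities.
    print(np.allclose(expected_padded_pinv(X, 3), np.linalg.pinv(X)))
    print(np.allclose(expected_subproblem_weights(X, y, 3), np.linalg.pinv(X.T) @ y))
```

If the unbiasedness results hold as stated, both comparisons should agree up to floating-point error for any subset size between the number of rows and the number of columns. The enumeration grows combinatorially with the number of columns, so this sketch is useful only as a sanity check.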