Classification with Low Rank and Missing Data
This addresses the problem of handling missing data in classification tasks for machine learning practitioners, offering a provable solution that is incremental in improving efficiency over existing methods.
The paper tackles classification and regression with missing data by assuming the clean data lies in a low-rank subspace, and it provides an efficient agnostic algorithm that classifies as well as the best linear classifier combined with the optimal low-dimensional subspace, enabling performance comparable to classifiers with full data access.
We consider classification and regression tasks where we have missing data and assume that the (clean) data resides in a low rank subspace. Finding a hidden subspace is known to be computationally hard. Nevertheless, using a non-proper formulation we give an efficient agnostic algorithm that classifies as good as the best linear classifier coupled with the best low-dimensional subspace in which the data resides. A direct implication is that our algorithm can linearly (and non-linearly through kernels) classify provably as well as the best classifier that has access to the full data.