Accurate Coresets for Latent Variable Models and Regularized Regression
This work addresses the challenge of efficient data summarization for complex models like latent variable models and regularized regression, offering a unified solution that improves coreset sizes, though it is incremental in extending existing coreset methods to new model types.
The paper tackles the problem of constructing accurate coresets for a broader range of machine learning models, introducing a unified framework that yields coresets of size O(poly(k)) for latent variable models and smaller than d^p for ℓ_p-regularized ℓ_p-regression, with experimental validation on real datasets.
Accurate coresets are a weighted subset of the original dataset, ensuring a model trained on the accurate coreset maintains the same level of accuracy as a model trained on the full dataset. Primarily, these coresets have been studied for a limited range of machine learning models. In this paper, we introduce a unified framework for constructing accurate coresets. Using this framework, we present accurate coreset construction algorithms for general problems, including a wide range of latent variable model problems and $\ell_p$-regularized $\ell_p$-regression. For latent variable models, our coreset size is $O\left(\mathrm{poly}(k)\right)$, where $k$ is the number of latent variables. For $\ell_p$-regularized $\ell_p$-regression, our algorithm captures the reduction of model complexity due to regularization, resulting in a coreset whose size is always smaller than $d^{p}$ for a regularization parameter $λ> 0$. Here, $d$ is the dimension of the input points. This inherently improves the size of the accurate coreset for ridge regression. We substantiate our theoretical findings with extensive experimental evaluations on real datasets.