Coresets for Classification -- Simplified and Strengthened
This work provides a more efficient method for data reduction in machine learning, benefiting researchers and practitioners dealing with large datasets, though it is incremental as it builds on prior coreset techniques.
The paper tackles the problem of efficiently training linear classifiers by constructing coresets that achieve (1±ε) relative error for a broad class of loss functions, including logistic and hinge loss, using Õ(d·μ_y(X)²/ε²) points, which significantly improves existing theoretical bounds and outperforms uniform subsampling in practice.
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm ε)$ relative error with $\tilde O(d \cdot μ_y(X)^2/ε^2)$ points, where $μ_y(X)$ is a natural complexity measure of the data matrix $X \in \mathbb{R}^{n \times d}$ and label vector $y \in \{-1,1\}^n$, introduced in by Munteanu et al. 2018. Our result is based on subsampling data points with probabilities proportional to their $\ell_1$ $Lewis$ $weights$. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.