Faster PAC Learning and Smaller Coresets via Smoothed Analysis
This work addresses the challenge of computational efficiency in machine learning for researchers and practitioners by providing smaller coresets, though it is incremental as it builds on smoothed analysis and existing coreset frameworks.
The paper tackles the problem of reducing the size of coresets and ε-samples in PAC learning by approximating average error instead of worst-case error, resulting in deterministic and randomized algorithms that produce subsets with size independent of the number of items, with applications in streaming vector summarization and k-PCA.
PAC-learning usually aims to compute a small subset ($\varepsilon$-sample/net) from $n$ items, that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $\varepsilon\in(0,1)$. Coresets generalize this idea to support multiplicative error $1\pm\varepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the \emph{average} (instead of the worst-case) error over the queries, in the hope of getting smaller subsets. The dependency between errors of different queries implies that we may no longer apply the Chernoff-Hoeffding inequality for a fixed query, and then use the VC-dimension or union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $\varepsilon$-samples of size independent of $n$, for any finite set of queries and loss function. Example applications include new and improved coreset constructions for e.g. streaming vector summarization [ICML'17] and $k$-PCA [NIPS'16]. Experimental results with open source code are provided.