On the Self-Penalization Phenomenon in Feature Selection
This paper addresses feature selection in machine learning, offering a novel approach that could improve model interpretability and efficiency, though it appears incremental relative to existing sparsity methods.
The paper introduces an implicit sparsity-inducing mechanism based on minimization over a family of kernels. Gradient descent on the resulting objective reaches exactly sparse stationary points with high probability, without explicit sparsification techniques such as penalization, early stopping, or post-processing.
We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: \begin{equation*} \min_{β, f}~\widehat{\mathbb{E}}[L(Y, f(β^{1/q} \odot X))] + λ_n \|f\|_{\mathcal{H}_q}^2~~\text{subject to}~~β\ge 0, \end{equation*} where $L$ is the loss, $\odot$ is coordinate-wise multiplication and $\mathcal{H}_q$ is the reproducing kernel Hilbert space based on the kernel $k_q(x, x') = h(\|x-x'\|_q^q)$, where $\|\cdot\|_q$ is the $\ell_q$ norm. Using gradient descent to optimize this objective with respect to $β$ leads to exactly sparse stationary points with high probability. The sparsity is achieved without using any of the well-known explicit sparsification techniques such as penalization (e.g., $\ell_1$), early stopping, or post-processing (e.g., clipping). As an application, we use this sparsity-inducing mechanism to build algorithms that are consistent for feature selection.
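To make the objective concrete, the following is a minimal sketch of projected gradient descent on $β$, not the authors' implementation. It rests on several assumptions that the text above does not fix: squared loss $L(y, \hat y) = (y - \hat y)^2$, $q = 1$, and $h(t) = e^{-t}$, so that $k(x, x') = \exp(-\sum_l β_l |x_l - x'_l|)$. Under these assumptions the inner minimization over $f$ is kernel ridge regression, and the profiled objective has the closed form $J(β) = λ\, y^\top (K_β + nλ I)^{-1} y$. The function names, step size, and iteration count are hypothetical choices made for illustration.

```python
import numpy as np


def kernel_matrix(X, beta):
    """Kernel matrix for q = 1 and h(t) = exp(-t) (both assumed here).

    K[i, j] = exp(-sum_l beta[l] * |X[i, l] - X[j, l]|).
    Also returns the coordinate-wise distances D of shape (n, n, d).
    """
    D = np.abs(X[:, None, :] - X[None, :, :])
    return np.exp(-D @ beta), D


def objective_and_grad(X, y, beta, lam):
    """Profiled objective J(beta) and its gradient, assuming squared loss.

    Minimizing over f gives alpha = (K + n*lam*I)^{-1} y and
    J(beta) = lam * y^T alpha.
    """
    n = X.shape[0]
    K, D = kernel_matrix(X, beta)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    J = lam * (y @ alpha)
    # dJ/dbeta_l = -lam * alpha^T (dK/dbeta_l) alpha,
    # with dK_ij/dbeta_l = -K_ij * D_ijl for this kernel.
    grad = lam * np.einsum("i,ij,ijl,j->l", alpha, K, D, alpha)
    return J, grad


def projected_gradient_descent(X, y, beta0, lam, step, n_iter):
    """Gradient steps on beta, each followed by projection onto beta >= 0."""
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        _, grad = objective_and_grad(X, y, beta, lam)
        beta = np.maximum(beta - step * grad, 0.0)
    return beta
```

A call such as `projected_gradient_descent(X, y, np.full(d, 0.5), lam=0.1, step=0.5, n_iter=200)` returns a nonnegative weight vector. The only constraint handling is the projection `np.maximum(..., 0.0)` required by $β \ge 0$: the sketch adds no $\ell_1$ penalty and no post-hoc clipping of small entries. Whether coordinates of irrelevant features land exactly at zero depends on the data, $λ$, and the step size, and this sketch does not guarantee it.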