ML · Mar 24, 2017

Binarsity: a penalization for one-hot encoded features in linear supervised learning

Mokhtar Z. Alaya, Simon Bussy, Stéphane Gaïffas, Agathe Guilloux

arXiv:1703.08619v43.432 citations

Originality: Incremental advance

AI Analysis

This paper addresses the problem of efficient and interpretable linear modeling for high-dimensional continuous data. It is an incremental improvement built on a novel penalization technique.

The paper tackles large-scale linear supervised learning with many continuous features by introducing 'binarsity', a penalization method for one-hot encoded features that yields piecewise constant and block sparse model weights. It provides non-asymptotic oracle inequalities, matches state-of-the-art under sparse additive models, and shows good performance in numerical experiments with complexity comparable to standard L1 penalization.

This paper deals with the problem of large-scale linear supervised learning in settings where a large number of continuous features are available. We propose to combine the well-known trick of one-hot encoding of continuous features with a new penalization called binarsity. In each group of binary features coming from the one-hot encoding of a single raw continuous feature, this penalization uses total-variation regularization together with an extra linear constraint. This induces two interesting properties on the model weights of the one-hot encoded features: they are piecewise constant, and are eventually block sparse. Non-asymptotic oracle inequalities for generalized linear models are established. Moreover, under a sparse additive model assumption, we prove that our procedure matches the state-of-the-art in this setting. Numerical experiments illustrate the good performance of our approach on several datasets. It is also noteworthy that our method has a numerical complexity comparable to standard $\ell_1$ penalization.
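The two ingredients described in the abstract, one-hot encoding of a continuous feature and a per-group total-variation penalty with a linear constraint, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `one_hot_binarize` and `binarsity_penalty`, the quantile-based binning, and the specific choice of a sum-to-zero constraint within each block are assumptions made for illustration; the abstract only states that an "extra linear constraint" is used.

```python
import numpy as np


def one_hot_binarize(x, n_bins):
    """One-hot encode a 1-D continuous feature (hypothetical helper).

    Assumption: bins are delimited by empirical quantiles of x.
    """
    # Interior quantile cut points; np.unique guards against repeated edges.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    # Bin index of each sample, in 0..len(edges).
    bin_idx = np.searchsorted(edges, x, side="right")
    # Row i is the indicator vector of the bin containing x[i].
    return np.eye(len(edges) + 1)[bin_idx]


def binarsity_penalty(theta_blocks, weights=None, tol=1e-10):
    """Sum of per-block total variation, with a linear constraint per block.

    theta_blocks: list of 1-D weight vectors, one per raw continuous feature.
    weights: optional list of per-block TV weights (scalar or array of
             length d_j - 1); defaults to 1.
    Assumption: the linear constraint is sum(theta_j) == 0; a block that
    violates it makes the penalty infinite.
    """
    total = 0.0
    for j, theta in enumerate(theta_blocks):
        theta = np.asarray(theta, dtype=float)
        if abs(theta.sum()) > tol:
            return np.inf
        w = 1.0 if weights is None else weights[j]
        # Total variation: weighted sum of absolute consecutive differences.
        total += np.sum(w * np.abs(np.diff(theta)))
    return total
```

In this sketch, a block that is constant and satisfies the sum-to-zero constraint must be identically zero, which is one way to see how the two properties mentioned in the abstract (piecewise constant weights and block sparsity) can arise together.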

