LG MLFeb 10, 2020

Supervised Learning: No Loss No Cry

arXiv:2002.03555v18.516 citations

Originality Incremental advance

AI Analysis

This work addresses the challenge of principled loss function selection in supervised learning, offering a method with theoretical guarantees, though it is incremental as it builds on existing algorithms.

The paper tackles the problem of learning the loss function in supervised learning, which is typically chosen ad hoc, by generalizing the SLIsotron algorithm using Bregman divergences to create BregmanTron, which jointly learns the loss and classifier with convergence guarantees and outperforms SLIsotron in experiments.

Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an \emph{ad hoc} manner. In hopes of making this procedure more principled, the problem of \emph{learning the loss function} for a downstream task (e.g., classification) has garnered recent interest. However, works in this area have been generally empirical in nature. In this paper, we revisit the {\sc SLIsotron} algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. In detail, we cast {\sc SLIsotron} as learning a loss from a family of composite square losses. By interpreting this through the lens of \emph{proper losses}, we derive a generalisation of {\sc SLIsotron} based on Bregman divergences. The resulting {\sc BregmanTron} algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the {\sc BregmanTron} substantially outperforms the {\sc SLIsotron}, and that the loss it learns can be minimized by other algorithms for different tasks, thereby opening the interesting problem of \textit{loss transfer} between domains.

View on arXiv PDF

Similar