LG · ML · Jan 30, 2025

Loss Functions and Operators Generated by f-Divergences

arXiv:2501.18537v2 · 10 citations · h-index: 9 · ICML
AI Analysis

This work proposes a theoretical framework for generating loss functions from f-divergences. It could benefit machine learning practitioners working on classification and language modeling, though it appears incremental because it builds on existing divergence concepts.

The authors construct new convex loss functions for multiclass classification and language modeling by generalizing the logistic loss with f-divergences and non-uniform reference measures. The framework yields the f-softargmax operator and a bisection algorithm for computing it. Empirically, the α-divergence loss with α=1.5 performs well across several tasks.

The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator that we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including pre-training, post-training (SFT), and distillation. We show that the loss function generated by the $α$-divergence (which is equivalent to Tsallis $α$-negentropy in the case of unit reference measures) with $α=1.5$ performs well across several tasks.
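To make the bisection idea concrete, the following is a minimal sketch for one special case only: the Tsallis $α$-negentropy with a uniform (unit) reference measure and $α > 1$. It is not the paper's general $f$-softargmax algorithm. It assumes the entmax-style parameterization in which the solution has the thresholded form $p_i = [(α-1)θ_i - τ]_+^{1/(α-1)}$, with the scalar $τ$ chosen so that the probabilities sum to one. The function name `alpha_softargmax_bisect` and the parameter `n_iter` are hypothetical names introduced for illustration.

```python
import numpy as np


def alpha_softargmax_bisect(theta, alpha=1.5, n_iter=60):
    """Sketch: Tsallis alpha-softargmax (alpha > 1) via bisection on tau.

    Assumes the thresholded form p_i = max((alpha-1)*theta_i - tau, 0)**(1/(alpha-1))
    with a uniform reference measure. Operates on the last axis, so every row
    of a batch is solved in parallel by the same vectorized loop.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only covers alpha > 1")
    theta = np.asarray(theta, dtype=np.float64)
    z = (alpha - 1.0) * theta
    d = z.shape[-1]
    z_max = z.max(axis=-1, keepdims=True)

    # Bracket for tau: at lo the largest entry alone has mass 1 (sum >= 1);
    # at hi every entry has mass at most 1/d (sum <= 1).
    lo = z_max - 1.0
    hi = z_max - d ** (1.0 - alpha)

    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        too_much_mass = p.sum(axis=-1, keepdims=True) >= 1.0
        # The total mass decreases as tau grows, so raise lo when mass >= 1.
        lo = np.where(too_much_mass, tau, lo)
        hi = np.where(too_much_mass, hi, tau)

    tau = 0.5 * (lo + hi)
    p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
    # Renormalize to absorb the remaining bisection error.
    return p / p.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    logits = np.array([[2.0, 1.0, -1.0], [5.0, 0.0, 0.0]])
    print(alpha_softargmax_bisect(logits, alpha=1.5))
```

Unlike the classical softargmax, this operator can assign exactly zero probability to low-scoring classes. Because the loop only uses elementwise operations and a row-wise sum, a whole batch of logit vectors shares the same fixed number of iterations.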

