The Tree Loss: Improving Generalization with Many Classes
The paper addresses generalization in multi-class classification for domains such as image and text analysis. The contribution is incremental, since it builds on existing loss functions.
The paper tackles multi-class classification with many semantically similar classes. It introduces the tree loss as a drop-in replacement for the cross entropy loss; the tree loss enforces similar parameter vectors for similar classes. The authors show that its generalization error is asymptotically better than that of the cross entropy loss, and they validate this result on synthetic, image, and text datasets.
Multi-class classification problems often have many semantically similar classes. For example, 90 of ImageNet's 1000 classes are for different breeds of dog. We should expect that these semantically similar classes will have similar parameter vectors, but the standard cross entropy loss does not enforce this constraint. We introduce the tree loss as a drop-in replacement for the cross entropy loss. The tree loss re-parameterizes the parameter matrix in order to guarantee that semantically similar classes will have similar parameter vectors. Using simple properties of stochastic gradient descent, we show that the tree loss's generalization error is asymptotically better than the cross entropy loss's. We then validate these theoretical results on synthetic data, image data (CIFAR100, ImageNet), and text data (Twitter).
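The re-parameterization idea can be illustrated with a minimal sketch. The abstract does not say how the class tree is built or how the parameters are combined, so everything below is an assumption made for illustration: a small hand-written hierarchy, one free vector per tree node, and a class's parameter vector defined as the sum of the node vectors on its root-to-leaf path. The names `ancestor_matrix` and `tree_cross_entropy` are hypothetical and do not come from the paper.

```python
import numpy as np


def ancestor_matrix(parent, leaves):
    """Return A with A[i, n] = 1 if node n is leaf i or one of its ancestors.

    `parent[n]` is the parent of node n, with -1 marking the root.
    """
    A = np.zeros((len(leaves), len(parent)))
    for i, leaf in enumerate(leaves):
        node = leaf
        while node != -1:
            A[i, node] = 1.0
            node = parent[node]
    return A


def tree_cross_entropy(V, A, x, y):
    """Cross entropy for one example, with class weights W = A @ V.

    V holds one vector per tree node (assumed parameterization), so classes
    that share ancestors share those ancestors' vectors.
    """
    W = A @ V                      # shape: (num_classes, dim)
    logits = W @ x
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y]


# Hypothetical hierarchy: 0 = root, 1 = "dog", 2 = "cat",
# leaves 3 and 4 are dog breeds, leaf 5 is a cat breed.
parent = [-1, 0, 0, 1, 1, 2]
leaves = [3, 4, 5]

A = ancestor_matrix(parent, leaves)
rng = np.random.default_rng(0)
V = rng.normal(size=(len(parent), 4))  # one 4-dimensional vector per node
W = A @ V                              # induced per-class parameter vectors

x = rng.normal(size=4)
loss = tree_cross_entropy(V, A, x, y=0)
```

In this sketch the two dog-breed classes differ only through their own leaf vectors (`W[0] - W[1]` equals `V[3] - V[4]`), because the root and "dog" vectors are shared. That shared part is what ties the parameter vectors of similar classes together, while the loss itself remains an ordinary cross entropy over the induced weights.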