LG · Jan 15, 2025

Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications

Jin Chen, Jin Zhang, Xu Huang, Yi Yang, Defu Lian, Enhong Chen

arXiv:2501.08563v17.12 citations · h-index: 16 · Has Code

Originality: Highly original

AI Analysis

This paper addresses a critical bottleneck in machine learning applications such as large language models and recommender systems: the cost of computing softmax over very large class sets. It offers a more accurate and efficient sampler for sampled softmax.

The paper tackles the computational inefficiency of softmax in multi-class classification with many classes. It proposes the MIDX Sampler, an adaptive sampling strategy based on an inverted multi-index, which makes the sampling cost scale with the number of codewords rather than the number of classes. Experiments on large-scale language models, sequential recommenders, and extreme multi-class classification tasks show better effectiveness and efficiency than existing approaches.

The softmax function is a cornerstone of multi-class classification, integral to a wide range of machine learning applications, from large-scale retrieval and ranking models to advanced large language models. However, its computational cost grows linearly with the number of classes, which becomes prohibitively expensive in scenarios with millions or even billions of classes. The sampled softmax, which relies on self-normalized importance sampling, has emerged as a powerful alternative, significantly reducing computational complexity. Yet, its estimator remains unbiased only when the sampling distribution matches the true softmax distribution. To improve both approximation accuracy and sampling efficiency, we propose the MIDX Sampler, a novel adaptive sampling strategy based on an inverted multi-index approach. Concretely, we decompose the softmax probability into several multinomial probabilities, each associated with a specific set of codewords and the last associated with the residual score of queries, so that the sampling time complexity scales with the number of codewords rather than the number of classes. To further boost efficiency, we replace the query-specific residual probability with a simple uniform distribution, simplifying the computation while retaining high performance. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds. The results demonstrate that a smaller divergence from the ideal softmax distribution leads to faster convergence and improved generalization. Extensive experiments on large-scale language models, sequential recommenders, and extreme multi-class classification tasks confirm that the MIDX Sampler delivers superior effectiveness and efficiency compared to existing approaches.
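
The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a minimal illustration under stated assumptions: each class embedding is assumed to be already quantized into two codeword indices (a two-codebook product quantization), the residual term is replaced by a uniform choice inside each bucket, and the loss uses the common convention of correcting each sampled logit by log(M * Q). All names here (build_index, midx_uniform_sample, sampled_softmax_loss, C1, C2, codes1, codes2) are hypothetical.

```python
import numpy as np

def build_index(codes1, codes2, K):
    """Inverted multi-index: bucket (k1, k2) -> class ids, plus a K x K table of bucket sizes."""
    buckets = {}
    for i, (a, b) in enumerate(zip(codes1, codes2)):
        buckets.setdefault((int(a), int(b)), []).append(i)
    counts = np.zeros((K, K))
    for (a, b), ids in buckets.items():
        counts[a, b] = len(ids)
    return {k: np.array(v) for k, v in buckets.items()}, counts

def midx_uniform_sample(z, C1, C2, buckets, counts, n_samples, rng):
    """Sample class ids for query z and return them with their log proposal probabilities."""
    d = C1.shape[1]
    s1 = C1 @ z[:d]                      # scores of the first half of z against codebook 1
    s2 = C2 @ z[d:]                      # scores of the second half of z against codebook 2
    # Bucket weight = bucket size * exp(s1[k1] + s2[k2]).
    # Building this table costs O(K^2) and never touches the N classes.
    logw = s1[:, None] + s2[None, :]
    logw = logw - logw.max()             # shift for numerical stability
    w = counts * np.exp(logw)
    total = w.sum()
    p1 = w.sum(axis=1) / total           # multinomial over the first codeword
    ids = np.empty(n_samples, dtype=int)
    logq = np.empty(n_samples)
    for t in range(n_samples):
        k1 = int(rng.choice(len(p1), p=p1))
        p2 = w[k1] / w[k1].sum()         # multinomial over the second codeword given the first
        k2 = int(rng.choice(len(p2), p=p2))
        members = buckets[(k1, k2)]      # empty buckets have zero weight, so this key exists
        ids[t] = rng.choice(members)     # assumption: uniform residual inside the bucket
        logq[t] = np.log(w[k1, k2] / total / len(members))
    return ids, logq

def sampled_softmax_loss(pos_logit, neg_logits, neg_logq):
    """Sampled softmax loss with importance-corrected negatives: logit - log(M * Q)."""
    m_samples = len(neg_logits)
    corrected = np.concatenate([[pos_logit], neg_logits - neg_logq - np.log(m_samples)])
    shift = corrected.max()
    log_norm = shift + np.log(np.exp(corrected - shift).sum())
    return -(pos_logit - log_norm)

# Toy usage with random codebooks and random code assignments (illustrative data only).
rng = np.random.default_rng(0)
N, K, d = 1000, 8, 4
C1, C2 = rng.normal(size=(K, d)), rng.normal(size=(K, d))
codes1, codes2 = rng.integers(0, K, size=N), rng.integers(0, K, size=N)
buckets, counts = build_index(codes1, codes2, K)
z = rng.normal(size=2 * d)
ids, logq = midx_uniform_sample(z, C1, C2, buckets, counts, 16, rng)
```

Under these assumptions, the proposal probability of a class is proportional to the exponentiated score of its quantized embedding, so the proposal approaches the true softmax distribution as the quantization error shrinks.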
