On the accuracy of self-normalized log-linear models
This work addresses a computational bottleneck in machine learning applications such as natural language processing and offers theoretical insight into self-normalization. It is incremental, however: it analyzes an existing technique rather than introducing a new paradigm.
The paper tackles the cost of computing log-normalizers in log-linear models with large output spaces by analyzing self-normalization, a technique that regularizes training to keep log-normalizers near zero so that unnormalized scores can be used as approximate probabilities. The authors prove generalization bounds on the estimated variance of normalizers and upper bounds on the resulting loss in accuracy, identify classes of input distributions that self-normalize easily, construct high-variance counterexamples, and validate their theoretical predictions empirically.
Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as "self-normalization", which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove generalization bounds on the estimated variance of normalizers and upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions.
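As a rough illustration of the training objective described above, the following sketch adds a penalty on log-normalizers to the usual negative log-likelihood of a log-linear model. The squared form of the penalty, the weight `alpha`, and the function names `log_normalizers` and `self_normalized_loss` are assumptions made for this example; the abstract only states that log-normalizers are penalized for deviating from zero.

```python
import numpy as np

def log_normalizers(theta, X):
    """Return scores and log Z(x) = log sum_y exp(theta_y . x) for each row of X.

    theta: (k, d) weight matrix, one row per output class (hypothetical layout).
    X:     (n, d) matrix of input feature vectors.
    """
    scores = X @ theta.T                       # (n, k) unnormalized scores
    m = scores.max(axis=1, keepdims=True)      # shift for numerical stability
    log_z = m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
    return scores, log_z.ravel()

def self_normalized_loss(theta, X, y, alpha):
    """Mean negative log-likelihood plus alpha * mean(log Z(x)^2).

    The squared penalty is an assumed choice of regularizer, not
    necessarily the exact form analyzed in the paper.
    """
    scores, log_z = log_normalizers(theta, X)
    nll = -(scores[np.arange(len(y)), y] - log_z).mean()
    penalty = alpha * (log_z ** 2).mean()
    return nll + penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    y = rng.integers(0, 4, size=5)
    theta = rng.normal(size=(4, 3))
    print(self_normalized_loss(theta, X, y, alpha=0.1))
```

If the penalty succeeds in driving log Z(x) close to zero on typical inputs, the raw score `theta[y] @ x` can be read as an approximate log-probability at prediction time without summing over all outputs, which is the saving the technique is after.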