Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing
This work addresses a training difficulty in VAEs for NLP and offers a simple solution that improves performance on tasks such as language modeling and dialog generation, though the contribution is incremental because it builds on existing scheduling methods.
The paper tackles the KL vanishing problem in variational autoencoders (VAEs) for NLP tasks by proposing a cyclical annealing schedule for the hyper-parameter β, which improves latent code learning and achieves state-of-the-art results, such as a 2.5 BLEU score gain in dialog response generation.
Variational autoencoders (VAEs) with an auto-regressive decoder have been applied to many natural language processing (NLP) tasks. The VAE objective consists of two terms, (i) reconstruction and (ii) KL regularization, balanced by a weighting hyper-parameter β. One notorious training difficulty is that the KL term tends to vanish. In this paper we study scheduling schemes for β, and show that KL vanishing is caused by the lack of good latent codes in training the decoder at the beginning of optimization. To remedy this, we propose a cyclical annealing schedule, which repeats the process of increasing β multiple times. This new procedure allows the progressive learning of more meaningful latent codes, by leveraging the informative representations of previous cycles as warm restarts. The effectiveness of cyclical annealing is validated on a broad range of NLP tasks, including language modeling, dialog response generation and unsupervised language pre-training.
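To make the idea of repeatedly increasing β concrete, here is a minimal sketch of one possible cyclical schedule. The abstract does not specify the shape of the ramp, so the linear increase, the hold at the maximum value, and the names `cyclical_beta`, `n_cycles`, `ramp_ratio`, and `beta_max` are all illustrative assumptions rather than the paper's exact formulation.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_ratio=0.5, beta_max=1.0):
    """Return the KL weight beta for a given training step.

    Assumed shape (not taken from the abstract): within each cycle, beta
    rises linearly from 0 to beta_max over the first `ramp_ratio` fraction
    of the cycle, then stays at beta_max until the cycle ends and beta
    resets to 0.
    """
    cycle_len = total_steps / n_cycles
    # Position inside the current cycle, in [0, 1).
    pos = (step % cycle_len) / cycle_len
    return beta_max * min(1.0, pos / ramp_ratio)


def weighted_objective(recon_loss, kl, beta):
    """Combine the two VAE terms: reconstruction plus beta-weighted KL."""
    return recon_loss + beta * kl


if __name__ == "__main__":
    # Print beta at a few steps of a hypothetical 100-step run with 4 cycles.
    for step in (0, 6, 12, 24, 25, 50, 99):
        print(step, round(cyclical_beta(step, total_steps=100), 3))
```

In a training loop, one would call `cyclical_beta` once per step and pass the result to the loss. Each reset of β to 0 lets the decoder train with little KL pressure while starting from the latent codes learned in the previous cycle, which is the warm-restart effect the abstract describes.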