CL · Jul 13, 2022

Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation

Tsinghua
arXiv:2207.06130v2638 citations · h-index: 98
Originality: Incremental advance
AI Analysis

The paper addresses a key bottleneck in text generation, KL vanishing, and offers an incremental improvement over existing variational Transformer methods.

The paper tackles the KL vanishing problem in variational auto-encoders for text generation, where auto-regressive decoders ignore the latent variables. It proposes DELLA, a variational Transformer framework that uses layer-wise latent variables to reach higher non-zero KL values and to improve both quality and diversity in generation tasks.
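
To make KL vanishing concrete: the VAE objective contains a KL term between the posterior and the prior, and when the decoder ignores the latent variables that term collapses to zero. The following is a minimal Python sketch, assuming diagonal Gaussian posteriors and priors with one (mean, log-variance) pair per layer. The function names and toy numbers are illustrative assumptions, not taken from the paper's code.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def layerwise_kl(posteriors, priors):
    """Sum the per-layer KL terms; each entry is a (mu, logvar) pair for one layer."""
    return sum(
        gaussian_kl(mu_q, lv_q, mu_p, lv_p)
        for (mu_q, lv_q), (mu_p, lv_p) in zip(posteriors, priors)
    )


# Toy setup (assumed): two layers, two latent dimensions, standard normal priors.
prior = ([0.0, 0.0], [0.0, 0.0])
priors = [prior, prior]

# A collapsed posterior equals the prior at every layer, so it carries no information.
collapsed = layerwise_kl([prior, prior], priors)

# A posterior whose means move away from the prior yields a non-zero KL.
informative = layerwise_kl(
    [([1.0, -1.0], [0.0, 0.0]), ([0.5, 0.0], [0.0, 0.0])],
    priors,
)

print(collapsed)    # 0.0
print(informative)  # 1.125
```

A KL of exactly zero means the latent variables encode nothing about the input, which is the failure mode the summary describes. A non-zero KL only shows that the posterior differs from the prior; it does not by itself guarantee better generation.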

The past several years have witnessed Variational Auto-Encoder's superiority in various text generation tasks. However, due to the sequential nature of the text, auto-regressive decoders tend to ignore latent variables and then reduce to simple language models, known as the KL vanishing problem, which would further deteriorate when VAE is combined with Transformer-based structures. To ameliorate this problem, we propose DELLA, a novel variational Transformer framework. DELLA learns a series of layer-wise latent variables with each inferred from those of lower layers and tightly coupled with the hidden states by low-rank tensor product. In this way, DELLA forces these posterior latent variables to be fused deeply with the whole computation path and hence incorporate more information. We theoretically demonstrate that our method can be regarded as entangling latent variables to avoid posterior information decrease through layers, enabling DELLA to get higher non-zero KL values even without any annealing or thresholding tricks. Experiments on four unconditional and three conditional generation tasks show that DELLA could better alleviate KL vanishing and improve both quality and diversity compared to several strong baselines.
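
The abstract says each latent variable is coupled with the hidden states by a low-rank tensor product, but this page does not give the exact parameterization. As a hedged illustration only, the sketch below shows one common low-rank form: a full bilinear tensor contracted with a latent vector `z` and a hidden state `h` is replaced by a sum of `rank` element-wise products of two linear projections. All names, shapes, and the random weights are assumptions for this example and may differ from DELLA's actual layers.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def low_rank_fuse(z, h, Wz_factors, Wh_factors):
    """Low-rank fusion: out[k] = sum_r (Wz_r z)[k] * (Wh_r h)[k]."""
    d_out = len(Wz_factors[0])
    out = [0.0] * d_out
    for Wz, Wh in zip(Wz_factors, Wh_factors):
        pz, ph = matvec(Wz, z), matvec(Wh, h)
        for k in range(d_out):
            out[k] += pz[k] * ph[k]
    return out


def full_tensor_fuse(z, h, Wz_factors, Wh_factors):
    """Reference: build the implied full tensor T[k][i][j] and contract it with z and h."""
    d_out, d_z, d_h = len(Wz_factors[0]), len(z), len(h)
    out = []
    for k in range(d_out):
        acc = 0.0
        for i in range(d_z):
            for j in range(d_h):
                t_kij = sum(Wz[k][i] * Wh[k][j] for Wz, Wh in zip(Wz_factors, Wh_factors))
                acc += t_kij * z[i] * h[j]
        out.append(acc)
    return out


def rand_matrix(rng, rows, cols):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_z, d_h, d_out, rank = 3, 4, 4, 2  # toy sizes, chosen arbitrarily
Wz_factors = [rand_matrix(rng, d_out, d_z) for _ in range(rank)]
Wh_factors = [rand_matrix(rng, d_out, d_h) for _ in range(rank)]
z = [rng.gauss(0.0, 1.0) for _ in range(d_z)]  # latent variable for one layer
h = [rng.gauss(0.0, 1.0) for _ in range(d_h)]  # hidden state for one position

fused = low_rank_fuse(z, h, Wz_factors, Wh_factors)
reference = full_tensor_fuse(z, h, Wz_factors, Wh_factors)
```

The two functions compute the same values. The low-rank version stores `rank * d_out * (d_z + d_h)` weights and never materializes the `d_out * d_z * d_h` tensor. Because the output is multiplicative in `z`, the latent variable enters every output dimension of the layer rather than being added once as a bias.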

Code Implementations: 1 repo
