DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder
This work addresses the challenge of generating varied and contextually appropriate responses in conversational AI, which is crucial for applications such as chatbots and virtual assistants. It represents an incremental improvement over existing VAE-based methods.
The authors tackle the problem of generating diverse and coherent multimodal responses in dialogue modeling by proposing DialogWAE, a conditional Wasserstein autoencoder that replaces the simple prior used in VAEs with a latent space trained by a GAN and a Gaussian mixture prior. The model achieves state-of-the-art performance on two datasets, with improved coherence, informativeness, and diversity.
Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as a standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs, which impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks, and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two popular datasets show that DialogWAE outperforms state-of-the-art approaches, generating more coherent, informative, and diverse responses.
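To make the two central ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes a prior network has already produced mixture logits, means, and log-variances for one dialogue context (the values below are hypothetical), draws noise from that Gaussian mixture, and compares samples with an empirical one-dimensional Wasserstein-1 distance. The helper names `sample_mixture_noise` and `wasserstein_1d` are illustrative. The sorted-sample distance stands in for the adversarially trained critic described above, which is what a GAN in latent space would use in practice.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mixture_noise(logits, means, log_vars, rng):
    """Draw one noise vector from a K-component diagonal Gaussian mixture."""
    weights = softmax(logits)
    # Pick a mixture component according to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Reparameterized draw from component k: mu + sigma * standard normal noise.
    return [mu + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for mu, lv in zip(means[k], log_vars[k])]


def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In one dimension the optimal coupling matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)

# Hypothetical prior-network outputs for one context (K = 2 components, 1-D noise).
logits = [0.0, 0.0]
means = [[-3.0], [3.0]]
log_vars = [[-2.0], [-2.0]]

n = 2000
mixture_a = [sample_mixture_noise(logits, means, log_vars, rng)[0] for _ in range(n)]
mixture_b = [sample_mixture_noise(logits, means, log_vars, rng)[0] for _ in range(n)]
unimodal = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Two draws from the same bimodal prior are close; a standard normal is far away.
print("mixture vs mixture:", wasserstein_1d(mixture_a, mixture_b))
print("mixture vs N(0, 1):", wasserstein_1d(mixture_a, unimodal))
```

In this toy setting the distance between the bimodal mixture and a standard normal stays large, which illustrates why matching a multimodal posterior to a unimodal prior is restrictive. Training would reduce such a distance between prior and posterior samples rather than merely measure it.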