An Adversarially-Learned Turing Test for Dialog Generation Models
This work addresses the risk of adversarial attacks on conversational AI evaluation metrics, offering researchers and developers a more robust metric.
The paper tackles the problem of adversarial vulnerability in trainable dialogue evaluation metrics by proposing an adversarial training approach called ATT, which achieves high accuracy against strong attackers like DialoGPT and GPT-3.
The design of better automated dialogue evaluation metrics has the potential to accelerate research on conversational AI. However, existing trainable dialogue evaluation models are generally restricted to classifiers trained in a purely supervised manner, which leaves them at significant risk from adversarial attacks (e.g., a nonsensical response that receives a high classification score). To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies. In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples using reinforcement learning. The key benefit of this unrestricted adversarial training approach is that it allows the discriminator to improve its robustness through an iterative attack-defense game. Our discriminator shows high accuracy against strong attackers including DialoGPT and GPT-3.
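The iterative attack-defense game can be illustrated with a deliberately tiny sketch. It is not the ATT implementation: the abstract does not specify the models, the reward, or the optimizer, so everything below is an assumption made for illustration. Responses are reduced to a single hypothetical scalar feature, the discriminator is a logistic regression, the attacker is a softmax policy over a fixed list of candidate responses updated with plain REINFORCE, and the reward is the discriminator's "human" probability. The function name `attack_defense_game` and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attack_defense_game(candidates, rounds=2000, lr_d=0.1, lr_g=0.1, seed=0):
    """Toy attack-defense loop (illustrative assumption, not the paper's method).

    candidates: scalar features of the responses the attacker can emit.
    Human responses are assumed to have a feature near 1.0.
    Returns the discriminator parameters (w, b) and the attacker's policy.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0                    # discriminator: P(human | x) = sigmoid(w*x + b)
    logits = [0.0] * len(candidates)   # attacker policy parameters

    for _ in range(rounds):
        # Attack: sample a response; the reward is the discriminator's "human" score.
        probs = softmax(logits)
        a = rng.choices(range(len(candidates)), weights=probs)[0]
        x_fake = candidates[a]
        reward = sigmoid(w * x_fake + b)

        # REINFORCE update: d log pi(a) / d logits = one_hot(a) - probs.
        for j in range(len(logits)):
            indicator = 1.0 if j == a else 0.0
            logits[j] += lr_g * reward * (indicator - probs[j])

        # Defense: one gradient-ascent step on the log-likelihood of a
        # (human, label 1) example and the sampled (machine, label 0) example.
        x_real = rng.gauss(1.0, 0.1)
        p_real = sigmoid(w * x_real + b)
        p_fake = reward  # same parameters as when the reward was computed
        w += lr_d * ((1.0 - p_real) * x_real - p_fake * x_fake)
        b += lr_d * ((1.0 - p_real) - p_fake)

    return (w, b), softmax(logits)


if __name__ == "__main__":
    (w, b), policy = attack_defense_game([-1.0, 0.0, 0.5])
    print("discriminator:", w, b)
    print("attacker policy:", policy)
    print("score of a human-like response:", sigmoid(w * 1.0 + b))
```

In this toy setting the attacker is rewarded for whichever candidate the current discriminator scores as most human-like, and the discriminator is then updated on exactly those samples, which is the sense in which the two sides improve against each other. A real system would replace the scalar feature and candidate list with a neural response generator and a neural classifier.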