CL · Apr 16, 2021

An Adversarially-Learned Turing Test for Dialog Generation Models

arXiv:2104.08231v1 · 1 citation
Originality: Highly original
AI Analysis

This paper addresses the risk of adversarial attacks on automated evaluation metrics for conversational AI, offering researchers and developers a more robust metric.

The paper addresses the adversarial vulnerability of trainable dialogue evaluation metrics by proposing an adversarial training approach, ATT, which achieves high accuracy against strong attackers such as DialoGPT and GPT-3.

The design of better automated dialogue evaluation metrics has the potential to accelerate research on conversational AI. However, existing trainable dialogue evaluation models are generally restricted to classifiers trained in a purely supervised manner, which leaves them highly vulnerable to adversarial attacks (e.g., a nonsensical response that receives a high classification score). To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies. In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples using reinforcement learning. The key benefit of this unrestricted adversarial training approach is that it allows the discriminator to improve its robustness through an iterative attack-defense game. Our discriminator shows high accuracy against strong attackers, including DialoGPT and GPT-3.
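
The attack-defense loop described in the abstract can be illustrated with a deliberately small toy. This is a sketch under strong assumptions and is not the paper's implementation: responses are hypothetical 2-d feature vectors rather than text, the discriminator is a logistic regression rather than a neural model, and the attacker is a softmax policy over three fixed candidate replies updated with a REINFORCE-style rule. The names `HUMAN`, `CANDIDATES`, `human_prob`, and `attack_defense_game` are made up for illustration.

```python
import math
import random

# Hypothetical toy data: each "response" is a 2-d feature vector, not real text.
HUMAN = [(1.0, 0.2), (0.9, 0.1), (0.8, 0.3)]       # stand-ins for human replies
CANDIDATES = [(0.1, 0.9), (0.5, 0.5), (0.7, 0.4)]  # replies the attacker can emit


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def human_prob(w, b, x):
    """Discriminator: estimated probability that response x is human-written."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def attack_defense_game(rounds=200, lr_d=0.5, lr_a=0.5, seed=0):
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0            # discriminator parameters
    logits = [0.0] * len(CANDIDATES)  # attacker policy parameters
    for _ in range(rounds):
        # Attack step: sample a reply; the reward is the discriminator's
        # "human" score, so the attacker is paid for fooling it.
        probs = softmax(logits)
        a = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = human_prob(w, b, CANDIDATES[a])
        baseline = sum(p * human_prob(w, b, c)
                       for p, c in zip(probs, CANDIDATES))
        for i in range(len(logits)):
            # Gradient of log pi(a) with respect to logit i for a softmax policy.
            grad_logp = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr_a * (reward - baseline) * grad_logp
        # Defense step: one logistic-regression update on a human reply
        # (label 1) and on the sampled adversarial reply (label 0).
        for x, y in ((rng.choice(HUMAN), 1.0), (CANDIDATES[a], 0.0)):
            err = y - human_prob(w, b, x)
            w[0] += lr_d * err * x[0]
            w[1] += lr_d * err * x[1]
            b += lr_d * err
    return w, b, softmax(logits)


if __name__ == "__main__":
    w, b, policy = attack_defense_game()
    print("discriminator weights:", w, "bias:", b)
    print("attacker policy:", policy)
```

The point of the sketch is only the alternation: the attacker is rewarded for replies the current discriminator mistakes for human, and the discriminator is then retrained on exactly those replies. The real system would replace the fixed candidate set with a generative dialogue model and the feature vectors with context-response pairs.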

Code Implementations: 1 repo