CL | Dec 4, 2022

Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation

Tsinghua
arXiv:2212.01810v1 | 296 citations | h-index: 74 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses safety concerns in dialogue AI for practical applications, offering a method to improve detection and mitigation of toxic content, though it is incremental as it builds on existing datasets and methods.

The paper tackles the problem of detecting toxic or biased content in dialogue models by constructing adversarial contexts that are more likely to induce unsafe responses, resulting in a new dataset BAD+ with over 120K contexts that exposes safety issues in three popular models and enhances generation safety.

Large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. In order to detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. However, what type of context is more likely to induce unsafe responses is still under-explored. In this paper, we identify that context toxicity and context category (e.g., profanity, insult, drugs) are two important factors that cause safety issues in response generation. Hence, we propose a method called reverse generation to construct adversarial contexts conditioned on a given response, with the flexibility to control category, toxicity level, and inductivity of the generated contexts. Via reverse generation, we augment the existing BAD dataset and construct a new dataset BAD+ which contains more than 120K diverse and highly inductive contexts in 12 categories. We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems. Furthermore, we show that BAD+ can greatly enhance the safety of generation and reveal the key factors of safety improvement. Our code and dataset are available at https://github.com/thu-coai/Reverse_Generation.

Code Implementations: 1 repo