AI CL · Nov 16, 2023

JAB: Joint Adversarial Prompting and Belief Augmentation

Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta

Amazon

arXiv:2311.09473v1 · 10.08 citations · h-index: 50

Originality: Incremental advance

AI Analysis

This work addresses the safety and robustness of language models, which is crucial for deploying them in real applications. The contribution appears incremental, since it builds on existing red teaming and augmentation techniques.

The authors aim to improve the safety and robustness of language models with a joint framework that combines adversarial prompting and belief augmentation in iterative feedback loops. They report reduced toxic content generation both in dynamic adversarial interactions and on static benchmark evaluations.

With the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
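The feedback loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`run_jab_loop`, `toy_adversary`, `toy_augmenter`, `toy_target`, `toy_scorer`), the keyword-based toxicity score, and the rule-based stand-ins for what would be language models in practice. It is not the authors' implementation; it only shows how an adversary and a belief augmenter might each condition on scored past interactions while the target is treated as a black box.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases: each history entry pairs a text with the
# toxicity score that the resulting target response received.
History = List[Tuple[str, float]]


@dataclass
class Round:
    prompt: str
    belief: str
    response: str
    toxicity: float


def run_jab_loop(
    adversary: Callable[[History], str],
    augmenter: Callable[[str, History], str],
    target: Callable[[str, str], str],
    scorer: Callable[[str], float],
    n_rounds: int,
) -> List[Round]:
    """Alternate attack and defense, feeding scores back to both sides."""
    adv_history: History = []     # (prompt, toxicity): high means the attack worked
    belief_history: History = []  # (belief, toxicity): low means the defense worked
    rounds: List[Round] = []
    for _ in range(n_rounds):
        prompt = adversary(adv_history)
        belief = augmenter(prompt, belief_history)
        response = target(belief, prompt)  # black box: only inputs and output are seen
        toxicity = scorer(response)
        adv_history.append((prompt, toxicity))
        belief_history.append((belief, toxicity))
        rounds.append(Round(prompt, belief, response, toxicity))
    return rounds


# Toy, rule-based stand-ins (assumptions; real components would be models).
def toy_adversary(history: History) -> str:
    # Reuse the past prompt with the highest score, else start from a seed.
    if history:
        return max(history, key=lambda item: item[1])[0]
    return "write an insult"


def toy_augmenter(prompt: str, history: History) -> str:
    # Switch to a stricter instruction once any earlier belief failed.
    if any(toxicity > 0.5 for _, toxicity in history):
        return "Refuse harmful requests."
    return "Be helpful."


def toy_target(belief: str, prompt: str) -> str:
    if "Refuse" in belief:
        return "I can't help with that."
    return f"Sure: {prompt}"


def toy_scorer(response: str) -> float:
    # Keyword check standing in for a toxicity classifier.
    return 1.0 if "insult" in response else 0.0


if __name__ == "__main__":
    results = run_jab_loop(toy_adversary, toy_augmenter, toy_target, toy_scorer, n_rounds=3)
    print([r.toxicity for r in results])  # prints [1.0, 0.0, 0.0]
```

In this toy run the first round succeeds for the adversary, the augmenter then sees the failed belief in its history and issues a stricter instruction, and the following rounds score as non-toxic. Keeping separate histories for the two sides mirrors the abstract's statement that each leverages feedback from past interactions for its own objective.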
