CLMay 23, 2023

Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting

arXiv:2305.13733v240 citations
Originality Incremental advance
AI Analysis

This addresses a safety issue for users of LLMs by making them more resistant to misleading or malicious inputs, though it is incremental as it builds on existing alignment efforts.

The paper tackles the problem of large language models (LLMs) being vulnerable to inductive instructions that incorporate false knowledge, and it proposes Dual-critique prompting to enhance their robustness, showing significant improvements across diverse LLMs.

Numerous works are proposed to align large language models (LLMs) with human intents to better fulfill instructions, ensuring they are trustful and helpful. Nevertheless, some human instructions are often malicious or misleading and following them will lead to untruthful and unsafe responses. Previous work rarely focused on understanding how LLMs manage instructions based on counterfactual premises, referred to here as \textit{inductive instructions}, which may stem from users' false beliefs or malicious intents. In this paper, we aim to reveal the behaviors of LLMs towards \textit{inductive instructions} and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of \underline{\textbf{Indu}}ctive {In\underline{\textbf{st}}ruct}ions (\textsc{\textbf{INDust}}), where the false knowledge is incorporated into instructions in multiple different styles. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. Additionally, we identified that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance. Motivated by these results, we propose \textsc{Dual-critique} prompting to improve LLM robustness against inductive instructions. Our experiments demonstrate that \textsc{Dual-critique} prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes