Mitigating the Influence of Distractor Tasks in LMs with Prior-Aware Decoding
This work addresses unreliable language model outputs, which matters to users who need robust AI systems, though it is an incremental improvement that builds on existing contrastive decoding methods.
The paper tackles the sensitivity of language models to distractor tasks, which can cause unwanted outputs such as those triggered by prompt injection attacks, and demonstrates that prior-aware decoding improves task completion in 41 out of 44 task-model combinations, with a median increase in task completion proportion of 40%.
The broad capabilities of Language Models (LMs) can be limited by their sensitivity to distractor tasks: LMs can infer secondary tasks from the prompt in addition to the intended one, leading to unwanted outputs. For example, prompt injection attacks can cause models to deviate from explicit directives. In some 'inverse scaling' cases, this unwanted behaviour actually worsens as models scale up to at least 540B parameters. We present a theoretical framework that interprets LMs as a product of experts that combine multiple data generation processes. Based on this framework, we demonstrate prior-aware decoding (PAD) - a simple contrastive inference method to reduce the influence of distractor tasks. We apply PAD to eleven models, across four datasets, and find improvements in 41 out of 44 task-model combinations, with a median increase in task completion proportion of 40%. The results suggest a promising direction for further development towards more reliable language models.
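To make the idea of a contrastive inference method concrete, the following is a minimal sketch and not the paper's exact formulation, which the abstract does not give. It assumes two sets of next-token logits are available: one from the full prompt, and one from a "prior" context intended to capture the distractor behaviour (for example, the prompt with the task instruction removed, which is an assumption here). It extrapolates away from the prior in log-probability space, which fits a product-of-experts reading because multiplying expert probabilities corresponds to adding their log-probabilities. The function names, the `alpha` weight, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_logprobs(full_logits, prior_logits, alpha=1.0):
    """Generic contrastive combination (an assumed form, not the paper's).

    combined_i = full_i + alpha * (full_i - prior_i), then renormalised.
    alpha = 0 recovers ordinary decoding from the full prompt.
    """
    if len(full_logits) != len(prior_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    lp_full = log_softmax(full_logits)
    lp_prior = log_softmax(prior_logits)
    combined = [f + alpha * (f - p) for f, p in zip(lp_full, lp_prior)]
    return log_softmax(combined)


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy two-token vocabulary (hypothetical numbers).
vocab = ["intended", "distractor"]
full_logits = [1.0, 1.5]   # full prompt: distractor slightly preferred
prior_logits = [0.0, 2.0]  # prior context: distractor strongly preferred

plain = contrastive_logprobs(full_logits, prior_logits, alpha=0.0)
adjusted = contrastive_logprobs(full_logits, prior_logits, alpha=1.0)
print("plain choice:   ", vocab[argmax(plain)])
print("adjusted choice:", vocab[argmax(adjusted)])
```

In this toy case the full prompt alone ranks the distractor token first, while the adjusted distribution ranks the intended token first, because the prior context favours the distractor even more strongly than the full prompt does. How the prior context is constructed and how `alpha` is chosen are the design decisions that matter most in practice, and this sketch does not settle either.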