CL | AI | LG | Dec 13, 2024

Too Big to Fool: Resisting Deception in Language Models

MILA
arXiv:2412.10558v1 | 1 citation | h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses the reliability of large language models when given deceptive inputs, a concern for anyone who uses them. The advance is incremental because it builds on the existing understanding of model scaling.

The paper examines how language models balance their internal knowledge against misleading in-context prompts. Its experiments find that larger models within the same family are more resilient to deception and also better at following legitimate instructions.

Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.

