CL | Feb 6, 2025

The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs

arXiv:2502.04134v2 | 12 citations | h-index: 16
Originality: Synthesis-oriented
AI Analysis

The paper addresses a reliability problem for users of LLMs in high-stakes applications, but its contribution is incremental because it builds on known issues of order sensitivity.

The paper investigates how input order affects the reliability of large language models (LLMs) in tasks like paraphrasing and multiple-choice questions, finding that shuffled inputs lead to measurable declines in accuracy, with few-shot prompting offering only partial mitigation.

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in the input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in LLMs whose internal components are hidden from users (such as closed-source models or those accessed via API calls). We conduct experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation; however, it fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
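One way to picture the kind of order-sensitivity check the abstract describes for multiple-choice questions is the minimal sketch below. It is an illustration under stated assumptions, not the paper's evaluation protocol: `order_consistency`, `position_biased`, and `order_invariant` are hypothetical names, and the two toy functions stand in for a black-box model that returns the text of its chosen option.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


def order_consistency(
    question: str,
    options: Sequence[str],
    answer_fn: Callable[[str, Sequence[str]], str],
) -> float:
    """Fraction of option orderings on which answer_fn gives its most common answer.

    answer_fn is assumed to return the text of the chosen option, so answers
    stay comparable across orderings. 1.0 means the choice never changes.
    """
    picks = [
        answer_fn(question, list(perm))
        for perm in itertools.permutations(options)
    ]
    most_common_count = Counter(picks).most_common(1)[0][1]
    return most_common_count / len(picks)


# Hypothetical stand-ins for a black-box model (no real LLM is called).
def position_biased(question: str, options: Sequence[str]) -> str:
    return options[0]  # always picks whatever is listed first


def order_invariant(question: str, options: Sequence[str]) -> str:
    return max(options, key=len)  # choice depends only on option content


opts = ["Paris", "Lyon", "Marseille"]
biased_score = order_consistency("Capital of France?", opts, position_biased)
stable_score = order_consistency("Capital of France?", opts, order_invariant)
```

A score well below 1.0, as with the position-biased stand-in, indicates that the answer depends on where an option appears rather than on what it says. In practice `answer_fn` would wrap an API call, and enumerating all permutations would be replaced by sampling a few shuffles once the option count grows.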

