CL · Jun 2

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

arXiv:2606.04109 · 13.6

Predicted impact: top 98% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners using context-augmented LMs (e.g., RAG), this work reveals that wrapper labels are a confound that must be controlled in benchmarks, though the effect is bounded to specific label types.

The paper shows that discourse-role labels (e.g., Instruction:, Example:) wrapped around content supplied to a language model shift the Misleading Adoption Rate by 56-84 percentage points across the four models tested. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: suppresses it. Presentation choices therefore change measured context reliance even when the supplied content is identical.

Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
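The paired fixed-content design described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the prompt layout in `build_prompt`, the record format, and the percentile bootstrap in `paired_bootstrap_gap` are all assumptions made for the example, and the toy records are invented purely to show the computation.

```python
import random
from collections import defaultdict

# Labels named in the abstract; the exact label set used per experiment is an assumption.
LABELS = ["Reference:", "Evidence:", "Instruction:", "Note:", "Example:"]


def build_prompt(question, options, label, assertion):
    """Wrap the same misleading assertion under a given discourse-role label.

    The layout (label line, blank line, question, options, 'Answer:') is a
    hypothetical template, not the one used in the paper.
    """
    option_lines = [f"{letter}. {text}" for letter, text in options.items()]
    return "\n".join([f"{label} {assertion}", "", question, *option_lines, "Answer:"])


def misleading_adoption_rate(records):
    """Fraction of items, per label, where the model output the injected wrong option.

    records: iterable of (label, predicted_option, injected_option) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, predicted, injected in records:
        totals[label] += 1
        hits[label] += int(predicted == injected)
    return {label: hits[label] / totals[label] for label in totals}


def paired_bootstrap_gap(adopt_a, adopt_b, n_boot=2000, seed=0):
    """Percentile bootstrap interval for the adoption gap between two labels.

    adopt_a, adopt_b: equal-length 0/1 lists indexed by the same items, so
    resampling items keeps each pair together.
    Returns (point_estimate, lower_95, upper_95).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(adopt_a, adopt_b)]
    n = len(diffs)
    gaps = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = gaps[int(0.025 * n_boot)]
    upper = gaps[int(0.975 * n_boot) - 1]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Invented toy records, for illustration only.
    toy = [
        ("Instruction:", "B", "B"),
        ("Instruction:", "B", "B"),
        ("Example:", "A", "B"),
        ("Example:", "B", "B"),
    ]
    print(misleading_adoption_rate(toy))
```

Because every item appears under every label with identical content, differences in the per-label rate can be attributed to the label rather than to the assertion, which is why the bootstrap resamples per-item differences instead of the two conditions independently.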
