CL AI Nov 17, 2025

What Works for 'Lost-in-the-Middle' in LLMs? A Study on GM-Extract and Mitigations

Mihir Gupte, Eshan Dixit, Muhammad Tayyab, Arun Adiththan

arXiv:2511.13900v1, 1 citation, h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses a key challenge in retrieval-based LLM applications and is relevant to AI researchers and practitioners, though the advance is incremental because it builds on existing work on context utilization.

The study examines the 'lost-in-the-middle' phenomenon in LLMs by introducing GM-Extract, a benchmark for evaluating retrieval of control variables. It finds that performance varies significantly with how data is represented in the context window, and that mitigation methods have mixed effects, including a negative impact in some cases.

The diminishing ability of large language models (LLMs) to effectively utilize long-range context, known as the "lost-in-the-middle" phenomenon, poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset designed to evaluate LLM performance on retrieval of control variables. To diagnose failure modes accurately, we propose a simple evaluation system using two distinct metrics: one for spatial retrieval capability (Document Metric) and the other for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply by altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we perform a literature survey of mitigation methods, which we categorize into two distinct approaches: black-box and white-box methods. We then apply these techniques to our benchmark, finding that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies successfully improve performance, as well as surprising cases where they have a negative impact, providing a comprehensive understanding of their utility in a practical context.
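
The abstract does not define the two metrics, so the following is only a minimal sketch of how a position-wise evaluation of this kind might be tallied. It assumes a document-level hit (did the model point to the right document?) and a value-level hit (did it extract the right value?), grouped by where the gold document sits in the context. The function name, record keys, and the case-insensitive exact-match rule are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def positional_accuracy(records):
    """Group per-example results by the position of the gold document.

    Each record is a dict with hypothetical keys (assumptions, not the
    paper's schema):
      position   - index of the gold document in the context (0 = first)
      gold_doc   - identifier of the document holding the answer
      pred_doc   - document identifier the model pointed to
      gold_value - expected value of the variable
      pred_value - value the model extracted

    Returns {position: (document_accuracy, value_accuracy)}.
    """
    buckets = defaultdict(list)
    for r in records:
        # Document-level hit: the model located the correct document.
        doc_hit = r["pred_doc"] == r["gold_doc"]
        # Value-level hit: assumed here to be a case-insensitive exact match.
        gold = str(r["gold_value"]).strip().lower()
        pred = str(r["pred_value"]).strip().lower()
        buckets[r["position"]].append((doc_hit, pred == gold))

    result = {}
    for pos in sorted(buckets):
        hits = buckets[pos]
        n = len(hits)
        doc_acc = sum(d for d, _ in hits) / n
        val_acc = sum(v for _, v in hits) / n
        result[pos] = (doc_acc, val_acc)
    return result
```

Plotting either accuracy against position would show whether the middle positions score lower than the edges, which is the U-shaped pattern the abstract says was not consistently observed. Keeping the two scores separate makes it possible to distinguish a model that finds the right document but misreads the value from one that looks in the wrong place entirely.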
