LG AI CL

Jun 27, 2024

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

arXiv:2406.19292v218.220 citations

Has Code

Originality: Incremental advance

AI Analysis

The work addresses a key limitation of LLMs in applications that require long-context processing, offering an incremental but practical improvement over existing methods.

The paper addresses LLMs' difficulty with information retrieval and reasoning over long-context inputs by finetuning them on a synthetic dataset of numerical key-value retrieval tasks. Reported results include a 10.5% improvement on 20-document MDQA at position 10 for GPT-3.5 Turbo and no performance drop on TriviaQA for Mistral 7B.

Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves their information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., a $10.5\%$ improvement on $20$-document MDQA at position $10$ for GPT-3.5 Turbo). We also find that the finetuned LLMs' performance on general benchmarks remains almost constant, while LLMs finetuned on other baseline long-context augmentation data can be encouraged to hallucinate (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data shows no performance drop, while other baseline data can cause a drop that ranges from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
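
To make the idea of a "numerical key-value retrieval task" concrete, here is a minimal Python sketch of how one such synthetic example might be generated. The abstract does not specify the prompt template, dictionary sizes, or value ranges, so the function name `make_kv_retrieval_example`, its parameters, and the prompt wording are all illustrative assumptions rather than the authors' actual format.

```python
import json
import random


def make_kv_retrieval_example(num_dicts=5, keys_per_dict=3, max_value=9999, seed=None):
    """Build one synthetic prompt/answer pair.

    Hypothetical format: the layout and wording are assumptions,
    not the template used in the paper.
    """
    rng = random.Random(seed)

    # Draw unique integer keys so the target key appears in exactly one dictionary.
    all_keys = rng.sample(range(max_value + 1), num_dicts * keys_per_dict)

    dicts = []
    for i in range(num_dicts):
        chunk = all_keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randint(0, max_value) for k in chunk})

    # Pick the dictionary (the "position" in the context) and the key to query.
    target_index = rng.randrange(num_dicts)
    target_key = rng.choice(list(dicts[target_index]))
    answer = dicts[target_index][target_key]

    # json.dumps renders integer keys as strings, e.g. {"123": 456}.
    context = "\n".join(
        f"Dictionary [{i}] {json.dumps(d)}" for i, d in enumerate(dicts)
    )
    prompt = (
        f"{context}\n\n"
        f"Report the value of key {target_key} in the dictionaries above."
    )
    return {
        "prompt": prompt,
        "key": target_key,
        "answer": str(answer),
        "position": target_index,
    }


if __name__ == "__main__":
    example = make_kv_retrieval_example(seed=0)
    print(example["prompt"])
    print("Expected answer:", example["answer"])
```

Because every key and value is a random integer, the task carries no factual content, which is one plausible reason finetuning on such data would leave general-knowledge benchmarks largely unaffected. Varying `num_dicts` and `target_index` would let a dataset cover different context lengths and answer positions.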
