LG, AI, IR | Aug 24, 2025

Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

arXiv:2508.17400v1 | 2 citations | h-index: 5
Originality: Synthesis-oriented
AI Analysis

The paper offers insights for developing LLM-based retrievers, though its contribution is incremental: it analyzes scaling behavior rather than introducing new methods.

The study measures how retrieval performance scales with pretraining FLOPs across LLMs from 125 million to 7 billion parameters. It finds that retrieval performance scales predictably with model size, training duration, and estimated FLOPs, and that In-Context Learning scores correlate strongly with retrieval scores.

How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.
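The abstract refers to "estimated FLOPs" without saying how the estimate is made. As a rough illustration only, the sketch below assumes the widely used approximation C ≈ 6 · N · D (N parameters, D training tokens); whether the paper uses this rule is an assumption. The function name `estimate_pretraining_flops` is hypothetical, and pairing the smallest model with the smallest dataset (and the largest with a 2-trillion-token dataset) is an assumption made just to show the span of compute the stated ranges could cover.

```python
def estimate_pretraining_flops(n_params: float, n_tokens: float) -> float:
    """Approximate pretraining compute with the common 6 * N * D rule.

    Assumption: this rule is a standard heuristic, not necessarily the
    estimate used in the paper.
    """
    return 6.0 * n_params * n_tokens


# Illustrative endpoints built from the ranges quoted in the abstract.
# The specific (model size, token count) pairings are assumptions.
smallest = estimate_pretraining_flops(125e6, 1e9)   # 125M params, 1B tokens
largest = estimate_pretraining_flops(7e9, 2e12)     # 7B params, 2T tokens

print(f"{smallest:.2e}")  # 7.50e+17
print(f"{largest:.2e}")   # 8.40e+22
```

Under these assumptions the two endpoints differ by roughly five orders of magnitude in compute, which is why scaling trends of this kind are usually examined on a logarithmic FLOPs axis.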
