CL · Mar 12

Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

arXiv:2603.11513v1


AI Analysis

The study reveals a fundamental utilization bottleneck for small models in RAG systems and shows that deploying RAG at this scale can be counterproductive under standard conditions.

The study investigated whether small language models (7B parameters or less) can effectively use retrieved information in retrieval-augmented generation (RAG). It found that even with perfect retrieval, these models fail to extract the correct answer 85-100% of the time on questions they cannot answer alone, and that adding retrieval context destroys 42-100% of the answers they previously knew.

Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models (7B parameters or less) can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First, even with oracle retrieval, models of size 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2,588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that, for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.
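To make the parametric knowledge split concrete, here is a minimal Python sketch of how the two headline rates could be computed from per-question results. The `Record` schema, the function name `knowledge_split_rates`, and the toy data are assumptions for illustration only; the abstract does not specify how correctness is scored, and the numbers below are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One question's outcome for one model (hypothetical schema)."""
    question_id: str
    correct_closed_book: bool   # answered correctly with no retrieval
    correct_with_context: bool  # answered correctly when given a passage


def knowledge_split_rates(records):
    """Return (utilization_failure_rate, knowledge_destruction_rate).

    utilization failure: share of questions the model could NOT answer
        alone that it still gets wrong when given the passage.
    knowledge destruction: share of questions the model COULD answer
        alone that it gets wrong once the passage is added.
    Either rate is None when its subset is empty.
    """
    known = [r for r in records if r.correct_closed_book]
    unknown = [r for r in records if not r.correct_closed_book]

    def failure_rate(subset):
        if not subset:
            return None
        return sum(not r.correct_with_context for r in subset) / len(subset)

    return failure_rate(unknown), failure_rate(known)


# Toy data, invented for illustration (not results from the paper).
records = [
    Record("q1", True, True),
    Record("q2", True, False),
    Record("q3", False, False),
    Record("q4", False, False),
    Record("q5", False, True),
    Record("q6", False, False),
]
util_fail, destroyed = knowledge_split_rates(records)
print(f"utilization failure: {util_fail:.0%}, knowledge destroyed: {destroyed:.0%}")
```

Running the same computation under the oracle condition is what would separate utilization failure from retrieval failure: the passage is guaranteed to contain the answer, so any remaining errors on the "unknown" subset cannot be blamed on the retriever.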
