CL · May 25, 2023

Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models

arXiv:2305.16243v3224 citations
Originality: Incremental advance
AI Analysis

This work addresses the efficiency and performance of retrieval-augmented models for natural language processing, though it is incremental as it modifies an existing method.

The paper tackles the problem of improving retrieval-augmented language models. It shows that surface-level retrieval methods, such as BM25, reduce perplexity more than semantic retrieval does.

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
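To make the re-ranking idea in the abstract concrete, here is a minimal sketch of BM25 scoring applied to a small candidate set. It is not the paper's implementation. The function names (`bm25_scores`, `rerank`), the parameter defaults `k1=1.2` and `b=0.75`, the whitespace tokenization, and the choice to compute document frequencies over the candidate set alone are all assumptions made for illustration. In a real pipeline the candidates would come from a dense retriever such as the one in Retro.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Assumption: IDF statistics are computed over docs_tokens only,
    not over a full retrieval corpus.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def rerank(query_tokens, candidates, top_k=2):
    """Return indices of the top_k candidates, highest BM25 score first."""
    scores = bm25_scores(query_tokens, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Hypothetical candidates, standing in for neighbors from a dense retriever.
query = "the cat sat".split()
candidates = [
    "a dog barked loudly".split(),
    "the cat sat on the mat".split(),
    "cat videos".split(),
]
top = rerank(query, candidates, top_k=2)
```

Because only the candidate set is scored, the cost grows with the number of candidates rather than with the size of the retrieval corpus. This matches the abstract's motivation for re-ranking as a cheaper alternative to full BM25 retrieval.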

Code Implementations: 1 repo
