LG
Dec 30, 2024

EdgeRAG: Online-Indexed RAG for Edge Devices

arXiv:2412.21023v2
17 citations
h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the memory and latency constraints of deploying RAG on edge devices. It is an incremental improvement whose impact is specific to that domain.

The paper tackles the challenge of deploying Retrieval Augmented Generation on edge devices with limited memory. It proposes EdgeRAG, which prunes embeddings within clusters, generates them on demand during retrieval, and applies adaptive caching. On the BEIR suite this reduces latency significantly relative to a baseline IVF index while keeping generation quality similar and letting every evaluated dataset fit into memory.

Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge devices is challenging due to limited memory and processing power. In this work, we propose EdgeRAG, which addresses the memory constraint by pruning embeddings within clusters and generating embeddings on-demand during retrieval. To avoid the latency of generating embeddings for large tail clusters, EdgeRAG pre-computes and stores embeddings for these clusters, while adaptively caching the remaining embeddings to minimize redundant computation and further reduce latency. Results on the BEIR suite show that EdgeRAG offers a significant latency reduction over the baseline IVF index with similar generation quality, while allowing all of our evaluated datasets to fit into memory.
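
The retrieval flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`OnlineIndex`, `toy_embed`, `precompute_min_size`, `cache_size`) is hypothetical. It assumes the chunks are already grouped into non-empty clusters, uses a letter-frequency vector as a stand-in for a real embedding model, uses a plain cluster-size threshold to decide which clusters are pre-computed, and substitutes a simple LRU cache for the paper's adaptive caching policy.

```python
import math
from collections import OrderedDict


def toy_embed(text):
    """Stand-in for a real embedding model: L2-normalised letter counts."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class OnlineIndex:
    """Two-level (IVF-style) index that keeps centroids but prunes most embeddings."""

    def __init__(self, clusters, embed=toy_embed, precompute_min_size=3, cache_size=2):
        self.clusters = clusters          # list of non-empty lists of text chunks
        self.embed = embed
        self.centroids = []               # always kept in memory
        self.stored = {}                  # cluster id -> embeddings kept for large clusters
        self.cache = OrderedDict()        # LRU cache (assumption, not the paper's policy)
        self.cache_size = cache_size
        self.online_embed_calls = 0       # chunk embeddings generated at query time

        for cid, chunks in enumerate(clusters):
            embs = [embed(c) for c in chunks]
            self.centroids.append([sum(col) / len(embs) for col in zip(*embs)])
            if len(chunks) >= precompute_min_size:
                self.stored[cid] = embs   # expensive to regenerate: store
            # otherwise the embeddings are discarded (pruned) after indexing

    def _cluster_embeddings(self, cid):
        if cid in self.stored:            # pre-computed large cluster
            return self.stored[cid]
        if cid in self.cache:             # cache hit: mark as recently used
            self.cache.move_to_end(cid)
            return self.cache[cid]
        # Cache miss: regenerate the pruned embeddings on demand.
        embs = [self.embed(c) for c in self.clusters[cid]]
        self.online_embed_calls += len(embs)
        self.cache[cid] = embs
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used cluster
        return embs

    def search(self, query, nprobe=1, k=3):
        q = self.embed(query)
        # First level: pick the nprobe clusters whose centroids best match the query.
        ranked = sorted(range(len(self.centroids)),
                        key=lambda cid: dot(q, self.centroids[cid]),
                        reverse=True)
        hits = []
        # Second level: score individual chunks inside the selected clusters.
        for cid in ranked[:nprobe]:
            for text, emb in zip(self.clusters[cid], self._cluster_embeddings(cid)):
                hits.append((dot(q, emb), text))
        hits.sort(key=lambda h: h[0], reverse=True)
        return hits[:k]
```

Calling `OnlineIndex(clusters).search("some query", nprobe=2)` returns `(score, text)` pairs. The `online_embed_calls` counter shows how often pruned embeddings had to be regenerated, which is the cost that pre-computing large clusters and caching are meant to reduce.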
