DB · LG · Jul 12, 2025

HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving

arXiv:2507.09138v1 · 4 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses efficiency problems in RAG serving. It is an incremental advance that contributes new runtime optimizations.

The paper tackles system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving by introducing HedraRAG, a runtime system that optimizes execution through graph-based transformations, achieving speedups of 1.5x to 5x over existing frameworks.

This paper addresses emerging system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving, where complex multi-stage workflows and diverse request patterns complicate efficient execution. We present HedraRAG, a runtime system built on a graph-based abstraction that exposes optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness. These opportunities are realized through dynamic graph transformations, such as node splitting, reordering, edge addition, and dependency rewiring, applied to wavefronts of subgraphs spanning concurrent requests. The resulting execution plans are mapped onto hybrid CPU-GPU pipelines to improve resource utilization and reduce latency. Evaluations across a wide range of RAG workflows demonstrate speedups exceeding 1.5x and reaching up to 5x over existing frameworks, showcasing the effectiveness of coordinated generation and retrieval in serving environments.
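To make the graph abstraction more concrete, here is a minimal Python sketch of one of the transformations the abstract names, node splitting, applied to two concurrent requests. It is an illustration under stated assumptions, not HedraRAG's implementation: the `Node` class, the `split_node` and `wavefronts` helpers, and the `r1`/`r2` stage names are all hypothetical, and a "wavefront" is modelled simply as the set of stages whose dependencies have already finished.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One stage of a RAG workflow (hypothetical model, not the paper's)."""
    name: str
    kind: str                                # "retrieval" or "generation"
    deps: set = field(default_factory=set)   # names of stages that must finish first


def split_node(graph, name, parts):
    """Replace stage `name` with a chain of `parts` smaller sub-stages."""
    node = graph.pop(name)
    deps = set(node.deps)
    last = None
    for i in range(parts):
        last = f"{name}#{i}"
        graph[last] = Node(last, node.kind, set(deps))
        deps = {last}                        # the next sub-stage waits on this one
    # Rewire: anything that waited on the original now waits on the final sub-stage.
    for other in graph.values():
        if name in other.deps:
            other.deps.discard(name)
            other.deps.add(last)
    return graph


def wavefronts(graph):
    """Group stages into levels whose dependencies are all in earlier levels."""
    done, levels = set(), []
    remaining = dict(graph)
    while remaining:
        ready = sorted(n for n, node in remaining.items() if node.deps <= done)
        if not ready:
            raise ValueError("dependency cycle")
        levels.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return levels


# Two concurrent requests, each a retrieve -> generate pipeline.
graph = {
    "r1.retrieve": Node("r1.retrieve", "retrieval"),
    "r1.generate": Node("r1.generate", "generation", {"r1.retrieve"}),
    "r2.retrieve": Node("r2.retrieve", "retrieval"),
    "r2.generate": Node("r2.generate", "generation", {"r2.retrieve"}),
}

# Assume r1's retrieval is the expensive one and split it into two sub-stages.
split_node(graph, "r1.retrieve", 2)
levels = wavefronts(graph)
print(levels)
# [['r1.retrieve#0', 'r2.retrieve'], ['r1.retrieve#1', 'r2.generate'], ['r1.generate']]
```

In this toy schedule, the second wavefront pairs a retrieval sub-stage of one request with a generation stage of another. That pairing is the kind of overlap a hybrid CPU-GPU pipeline could exploit, assuming retrieval runs on the CPU and generation on the GPU.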
