DS SERVE: A Framework for Efficient and Scalable Neural Retrieval
DS-Serve addresses the need for scalable retrieval in applications such as RAG and training data attribution, though the contribution appears incremental because it builds on existing neural retrieval concepts.
The paper tackles the problem of building efficient and scalable neural retrieval systems by introducing DS-Serve, a framework that turns text datasets totaling half a trillion tokens into a high-performance retrieval system with low latency and modest memory overhead on a single node.
We present DS-Serve, a framework that transforms large-scale text datasets, comprising half a trillion tokens, into a high-performance neural retrieval system. DS-Serve offers both a web interface and API endpoints, achieving low latency with modest memory overhead on a single node. The framework also supports inference-time trade-offs between latency, accuracy, and result diversity. We anticipate that DS-Serve will be broadly useful for a range of applications, including large-scale retrieval-augmented generation (RAG), training data attribution, training search agents, and beyond.
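To make the accuracy versus diversity trade-off concrete, the sketch below reranks retrieved candidates with Maximal Marginal Relevance (MMR), a standard technique for trading relevance against redundancy at query time. The abstract does not say how DS-Serve implements its diversity control, so MMR is used here only as an illustrative stand-in; the names `cosine`, `mmr_rerank`, and `lam`, and the toy vectors, are hypothetical and are not part of the DS-Serve API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query, candidates, k, lam=0.5):
    """Pick k candidate indices by Maximal Marginal Relevance.

    lam is a hypothetical knob: 1.0 ranks purely by relevance to the
    query, while lower values penalise candidates that resemble
    results already selected, favouring diversity.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            # Redundancy = similarity to the closest already-selected result.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 differs.
query = [1.0, 1.0, 0.0]
candidates = [[1.0, 0.8, 0.0], [1.0, 0.7, 0.0], [0.5, 1.0, 0.0]]

relevance_only = mmr_rerank(query, candidates, k=2, lam=1.0)  # [0, 1]
diversified = mmr_rerank(query, candidates, k=2, lam=0.5)     # [0, 2]
```

With `lam=1.0` the two near-duplicate passages are returned together; lowering `lam` to 0.5 swaps the second one for the less similar candidate. A per-request parameter of this kind is one plausible way a retrieval service could expose a diversity setting without rebuilding its index.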