LG · AR · IR · Jan 15

FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems

arXiv:2601.09985v1 · h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses a critical performance bottleneck in retrieval-augmented generation for large language models. Its domain-specific solution is an incremental advance, but it delivers strong gains.

The paper tackles the latency bottleneck in ANNS systems caused by fetching full-precision vectors from slow storage during refinement. It proposes FaTRQ, which uses tiered residual quantization and a progressive distance estimator to eliminate those fetches, yielding a 2.4x improvement in storage efficiency and up to 9x higher throughput than state-of-the-art GPU ANNS systems.

Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed, vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage such as SSDs. For modern text and multimodal embeddings, these reads now dominate the latency of the entire query. We propose FaTRQ, a far-memory-aware refinement system using tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory. Refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator is deployed in a CXL Type-2 device to perform low-latency refinement locally. Together, these techniques improve storage efficiency by 2.4$\times$ and throughput by up to 9$\times$ over a SOTA GPU ANNS system.
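
The abstract does not spell out the encoding or the estimator, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes (1) a coarse approximation of each vector is already available in fast memory (here a simple rounding stands in for a vector-quantized code), (2) each tier encodes the remaining residual as codes in {-1, 0, +1} with one float scale, using a heuristic threshold, and (3) the norm of the remaining error after each tier is stored, so the triangle inequality gives provable lower and upper bounds on the true distance. The names `ternary_encode`, `build_tiers`, and `progressive_topk` are hypothetical.

```python
import math
import random

def ternary_encode(r):
    """Encode residual r as codes in {-1, 0, +1} plus one float scale."""
    thr = 0.7 * sum(abs(v) for v in r) / len(r)  # heuristic cutoff (assumption)
    codes = [(1 if v > 0 else -1) if abs(v) > thr else 0 for v in r]
    kept = [abs(v) for v, c in zip(r, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale

def build_tiers(x, coarse, n_tiers):
    """Return per-tier (codes, scale) and the error norm ||x - approx|| after each stage."""
    approx = list(coarse)
    errs = [math.dist(x, approx)]  # error of the coarse code alone
    tiers = []
    for _ in range(n_tiers):
        codes, scale = ternary_encode([a - b for a, b in zip(x, approx)])
        approx = [a + scale * c for a, c in zip(approx, codes)]
        tiers.append((codes, scale))
        errs.append(math.dist(x, approx))
    return tiers, errs

def progressive_topk(q, coarse, tiers, errs, k):
    """Refine candidates tier by tier, dropping those provably outside the top-k.

    coarse[i], tiers[i], errs[i] describe candidate i; requires k <= len(coarse).
    Returns (top-k ids by final estimate, surviving ids, number of tier reads).
    """
    approx = {i: list(c) for i, c in enumerate(coarse)}
    alive = list(approx)
    n_tiers = len(tiers[0])
    reads = 0
    for t in range(n_tiers + 1):
        est = {i: math.dist(q, approx[i]) for i in alive}
        # Triangle inequality: est - err <= true distance <= est + err.
        tau = sorted(est[i] + errs[i][t] for i in alive)[k - 1]  # k-th best upper bound
        alive = [i for i in alive if est[i] - errs[i][t] <= tau + 1e-9]
        if t == n_tiers or len(alive) == k:
            break
        for i in alive:  # stream the next ternary tier for survivors only
            codes, scale = tiers[i][t]
            approx[i] = [a + scale * c for a, c in zip(approx[i], codes)]
        reads += len(alive)
    return sorted(alive, key=est.get)[:k], alive, reads

# Illustrative usage on random data (all sizes are arbitrary).
random.seed(0)
dim, n, k = 16, 200, 5
X = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(dim)]
coarse = [[float(round(v)) for v in x] for x in X]  # stand-in for a VQ code
built = [build_tiers(x, c, 3) for x, c in zip(X, coarse)]
tiers, errs = [b[0] for b in built], [b[1] for b in built]
topk, survivors, reads = progressive_topk(q, coarse, tiers, errs, k)
print(len(survivors), "survivors;", reads, "tier reads out of", 3 * n)
```

In this sketch a candidate is dropped only when its lower bound exceeds the k-th smallest upper bound, so the true top-k always remains among the survivors. The final ordering uses the last estimate and is therefore still approximate. A real system would pack the ternary codes (about 1.58 bits per dimension in the limit) rather than store them as Python integers.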

