Context-Aware Search and Retrieval Under Token Erasure
For developers of retrieval-augmented generation systems, this work provides an information-theoretic analysis and practical principles to improve retrieval reliability under token loss.
The paper analyzes retrieval reliability in RAG-like systems under token erasures, showing that assigning higher redundancy to semantically important query features improves retrieval reliability. Numerical results and a data-driven evaluation on real-world data support the analysis.
This paper introduces and analyzes a search and retrieval model for RAG-like systems under token erasures. We provide an information-theoretic analysis of remote document retrieval when query representations are only partially preserved. The query is represented using term-frequency-based features, and semantically adaptive redundancy is assigned according to feature importance. Retrieval is performed using TF-IDF-weighted similarity. We characterize the retrieval error probability by showing that the vector of similarity margins converges to a multivariate Gaussian distribution, yielding an explicit approximation and computable upper bounds. Numerical results support the analysis, while a separate data-driven evaluation using embedding-based retrieval on real-world data shows that the same importance-aware redundancy principles extend to modern retrieval pipelines. Overall, the results show that assigning higher redundancy to semantically important query features improves retrieval reliability.
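The importance-aware redundancy idea can be illustrated with a small Monte Carlo sketch. It is not the paper's model or code. The toy corpus, the independent per-copy erasure channel, the smoothed IDF formula, the greedy allocation rule, and all function names (`idf_weights`, `allocate_redundancy`, `surviving_terms`, `scores`, `error_rate`) are assumptions made for illustration only. Each query term is sent as several copies, each copy is erased independently with probability `eps`, and a term reaches the retriever if at least one copy survives.

```python
import math
import random
from collections import Counter


def idf_weights(docs):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1, for every corpus term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def allocate_redundancy(terms, idf, budget):
    """Hypothetical greedy rule: one copy per term, then each spare copy
    goes to the term with the largest importance-per-copy ratio.
    Assumes budget >= len(terms)."""
    copies = {t: 1 for t in terms}
    for _ in range(budget - len(terms)):
        best = max(terms, key=lambda t: idf.get(t, 0.0) / copies[t])
        copies[best] += 1
    return copies


def surviving_terms(copies, eps, rng):
    """A term is received if at least one of its copies is not erased."""
    return [t for t, r in copies.items()
            if any(rng.random() >= eps for _ in range(r))]


def scores(query_terms, docs, idf):
    """TF-IDF-weighted overlap between the received query and each document."""
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append(sum(idf.get(t, 0.0) * tf[t] / len(doc) for t in query_terms))
    return out


def error_rate(copies, docs, idf, target, eps, trials=20000, seed=0):
    """Fraction of trials in which the target is not the strict top score."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        s = scores(surviving_terms(copies, eps, rng), docs, idf)
        best_rival = max(v for i, v in enumerate(s) if i != target)
        errors += s[target] <= best_rival  # ties count as errors
    return errors / trials


# Toy corpus (assumed): only "erasure" separates document 1 from the rest.
DOCS = [
    ["neural", "network", "training", "loss"],
    ["neural", "network", "erasure", "channel"],
    ["neural", "network", "image", "pixels"],
    ["neural", "network", "speech", "audio"],
    ["neural", "network", "graph", "nodes"],
]
QUERY = ["neural", "network", "erasure"]
IDF = idf_weights(DOCS)

uniform = {t: 2 for t in QUERY}                      # same budget, spread evenly
aware = allocate_redundancy(QUERY, IDF, budget=6)    # same budget, IDF-driven

for name, copies in (("uniform", uniform), ("importance-aware", aware)):
    print(name, copies, error_rate(copies, DOCS, IDF, target=1, eps=0.5))
```

In this toy corpus retrieval succeeds exactly when the term "erasure" survives, so the error probability is `eps ** r` for `r` copies of that term. With `eps = 0.5` the estimates should land near 0.25 for the uniform allocation (two copies) and near 0.125 for the importance-aware allocation (three copies), at the same total budget of six copies.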