TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL
This work addresses user-facing latency in Text-to-SQL systems, but it is incremental in that it optimizes existing KV cache methods rather than introducing a new mechanism.
The paper tackles the problem of high prefilling latency in LLM-based Text-to-SQL systems caused by redundant KV cache generation for database schemas, achieving up to a 3.62x speedup in Time to First Token with minimal performance loss.
In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. User queries typically focus on recurrent table sets, which offers an opportunity for KV cache sharing across queries. However, current inference engines such as SGLang and vLLM generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is that table caches are computed while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline that parallelizes model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
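The trie-based lookup described above can be illustrated with a minimal sketch. This is not the paper's implementation: the class names (`TrieNode`, `TableTrie`), the alphabetical canonical ordering, and the string cache handles are all assumptions made for illustration. The abstract says only that table order is guided by primary foreign key relationships, so the sorting step below is a stand-in for whatever ordering the authors actually use. The point is that once table sets are put into one canonical order, queries listing the same tables in different orders resolve to the same cached prefix.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class TrieNode:
    # Child nodes keyed by table name.
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Hypothetical stand-in for a precomputed KV cache (e.g. a file path or tensor id).
    cache_handle: Optional[str] = None


class TableTrie:
    """Illustrative trie mapping canonical table sequences to cache handles."""

    def __init__(self) -> None:
        self.root = TrieNode()

    @staticmethod
    def canonical(tables: Sequence[str]) -> List[str]:
        # Assumption: alphabetical order stands in for the paper's
        # primary foreign key guided ordering, which is not specified here.
        return sorted(set(tables))

    def insert(self, tables: Sequence[str], cache_handle: str) -> None:
        # Walk (and create) one node per table, then attach the handle.
        node = self.root
        for name in self.canonical(tables):
            node = node.children.setdefault(name, TrieNode())
        node.cache_handle = cache_handle

    def longest_prefix(self, tables: Sequence[str]) -> Tuple[List[str], Optional[str]]:
        # Return the longest cached prefix of the canonical table sequence.
        node = self.root
        path: List[str] = []
        best: Tuple[List[str], Optional[str]] = ([], None)
        for name in self.canonical(tables):
            child = node.children.get(name)
            if child is None:
                break
            node = child
            path.append(name)
            if node.cache_handle is not None:
                best = (list(path), node.cache_handle)
        return best


trie = TableTrie()
trie.insert(["customers"], "kv_customers")
trie.insert(["orders", "customers"], "kv_customers_orders")

# Both table orders resolve to the same cached entry.
hit_a = trie.longest_prefix(["orders", "customers"])
hit_b = trie.longest_prefix(["customers", "orders"])

# A partially cached table set falls back to the longest cached prefix.
partial = trie.longest_prefix(["products", "customers"])
```

In this sketch, tables not covered by the returned prefix would still need to be prefilled at query time. How TableCache handles that remainder, and how it evicts or reranks entries, is not described in the abstract and is left out here.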