Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data
This work addresses efficiency and accuracy challenges in semantic caching for practical LLM applications and represents an incremental advance.
The paper aims to improve semantic caching for LLMs by fine-tuning compact, domain-specific embedding models on real-world and synthetic data. After just one epoch of fine-tuning, these models significantly outperform state-of-the-art open-source and proprietary alternatives in precision and recall.
This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
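To make the lookup mechanism concrete, the following is a minimal sketch of a semantic cache that matches queries by embedding similarity instead of exact keys. It is an illustration under stated assumptions, not the pipeline evaluated in the report: the `SemanticCache` class, the `toy_embed` function, its five-word vocabulary, and the 0.9 threshold are all hypothetical. In practice `toy_embed` would be replaced by a fine-tuned embedding model.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache: returns a stored response when a new query is similar enough."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed          # any text -> vector function
        self.threshold = threshold  # assumed value; trades precision against recall
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        best_score, best_response = -1.0, None
        # Linear scan for clarity; a real deployment would likely use a vector index.
        for vec, response in self.entries:
            score = cosine_similarity(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: word counts over a tiny fixed vocabulary (assumption)."""
    vocab = ["reset", "password", "refund", "order", "cancel"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]


cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("How do I reset my password?", "Use the reset link.")
hit = cache.get("reset password please")  # same vocabulary words, so a cache hit
miss = cache.get("cancel my order")       # no shared vocabulary words, so a miss
```

The threshold is where the precision and latency trade-off shows up: raising it reduces false hits (higher precision) but sends more queries to the LLM, while lowering it does the opposite. A better embedding model widens the gap between true paraphrases and unrelated queries, which makes a single threshold workable.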