Hybrid-RACA: Hybrid Retrieval-Augmented Composition Assistance for Real-time Text Prediction
This work addresses the computational cost of real-time text prediction and offers a hybrid cloud-client solution for composition assistance, though it appears incremental because it builds on existing retrieval-augmented methods.
The authors tackled the challenge of applying retrieval-augmented LLMs to real-time tasks like composition assistance by proposing Hybrid-RACA, a system that combines a cloud-based LLM with a client-side model via retrieval-augmented memory, achieving strong performance with low latency across five datasets.
Large language models (LLMs) enhanced with retrieval augmentation have shown strong performance in many applications. However, the computational demands of these models pose a challenge when applying them to real-time tasks such as composition assistance. To address this, we propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA), a novel system for real-time text prediction that efficiently combines a cloud-based LLM with a smaller client-side model through retrieval-augmented memory. This integration enables the client model to generate better responses, benefiting from the LLM's capabilities and cloud-based data. Meanwhile, via a novel asynchronous memory update mechanism, the client model can deliver real-time completions to user inputs without waiting for responses from the cloud. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
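The asynchronous memory update idea can be illustrated with a minimal sketch, assuming Python's asyncio. The client answers immediately from whatever its local memory currently holds, while a simulated cloud call refreshes that memory in the background. `HybridAssistant`, `on_user_input`, `_cloud_update`, the string-based memory, and the fixed delay are all hypothetical placeholders; no language models are involved, and this is not the authors' implementation.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class HybridAssistant:
    # Hypothetical stand-in for the client-side retrieval-augmented memory:
    # a bounded list of text snippets the client model can condition on.
    memory: list[str] = field(default_factory=list)
    max_memory: int = 3
    # Keeps references to in-flight background tasks so they are not garbage-collected.
    _pending: set = field(default_factory=set, repr=False)

    def complete(self, prefix: str) -> str:
        # Client-side completion: reads only the local memory, so it never
        # blocks on the cloud. A real system would run a small LM here.
        context = " | ".join(self.memory) if self.memory else "<empty>"
        return f"[client completion of {prefix!r} using memory: {context}]"

    async def _cloud_update(self, prefix: str) -> None:
        # Hypothetical cloud call standing in for the LLM that produces memory content.
        await asyncio.sleep(0.05)  # simulated network + LLM latency
        self.memory.append(f"cloud note for {prefix!r}")
        self.memory = self.memory[-self.max_memory:]  # keep memory bounded

    def on_user_input(self, prefix: str) -> str:
        # Schedule the memory update without awaiting it ...
        task = asyncio.create_task(self._cloud_update(prefix))
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        # ... and answer right away from the current (possibly stale) memory.
        return self.complete(prefix)


async def main() -> tuple[str, str]:
    assistant = HybridAssistant()
    first = assistant.on_user_input("Dear team,")
    await asyncio.sleep(0.1)  # give the background update time to finish
    second = assistant.on_user_input("Dear team, please")
    return first, second


first, second = asyncio.run(main())
print(first)
print(second)
```

The first completion sees an empty memory because its cloud update has not run yet; the second benefits from the update triggered by the earlier input. Accepting this slightly stale memory is the trade-off that lets the client respond without waiting on the cloud.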