LG CL DC Mar 3, 2025

Alchemist: Towards the Design of Efficient Online Continual Learning System

Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang

arXiv:2503.01066v24.11 citations h-index: 31

Originality: Incremental advance

AI Analysis

This work addresses a performance bottleneck for researchers and practitioners deploying online continual learning systems. It is incremental in that it optimizes an existing process rather than introducing a new paradigm.

The paper tackles the inefficiency of redundant computations in online continual learning for large language models, where existing systems recompute intermediate results during training, accounting for 30%-42% of training time. The result is Alchemist, a system that reuses serving activations to increase training throughput by up to 1.72x, reduce memory usage by up to 47%, and support up to 2x more training tokens with negligible impact on serving latency.
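As a back-of-the-envelope check on how these figures relate, the following sketch computes the ideal speedup obtained if a given fraction of training time were removed entirely and nothing else changed. This is an Amdahl-style upper bound, not an analysis taken from the paper, and the function name `max_speedup` is a hypothetical helper introduced only for illustration.

```python
def max_speedup(redundant_fraction: float) -> float:
    """Ideal speedup if `redundant_fraction` of run time is eliminated.

    Assumption: the remaining work is unchanged and no new overhead is added.
    """
    if not 0.0 <= redundant_fraction < 1.0:
        raise ValueError("redundant_fraction must be in [0, 1)")
    return 1.0 / (1.0 - redundant_fraction)


# Fractions quoted in the summary: 30% to 42% of training time.
low = max_speedup(0.30)
high = max_speedup(0.42)
print(f"ideal speedup range: {low:.2f}x to {high:.2f}x")
```

Under these assumptions the bound for the 42% case comes out at about 1.72x, which is consistent with the reported peak throughput gain, although the summary itself does not state that the two numbers are derived this way.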

Continual learning has become a promising solution to refine large language models incrementally by leveraging user feedback. In particular, online continual learning - iteratively training the model with small batches of user feedback - has demonstrated notable performance improvements. However, the existing practice of separating the training and serving processes forces the online trainer to recompute intermediate results that were already computed during serving. Such redundant computations can account for 30%-42% of total training time. In this paper, we propose Alchemist, to the best of our knowledge the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist introduces two key techniques: (1) recording and storing activations and KV cache only during the prefill phase to minimize latency and memory overhead; and (2) smart activation offloading and hedging. Evaluations with inputs of varied token lengths sampled from the ShareGPT dataset show that, compared with a separate training cluster, Alchemist increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens - all while maintaining negligible impact on serving latency.
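The core idea of reusing serving activations can be illustrated with a toy example. The sketch below is not Alchemist's implementation: it uses a two-layer ReLU network in plain Python instead of an LLM, omits the KV cache, offloading, and hedging, and every name in it (`forward`, `backward`, `stored`) is hypothetical. It also assumes the weights are unchanged between the serving request and the training step. It only shows that gradients computed from activations recorded at serving time match those obtained by running the forward pass again in a separate trainer.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(W1, W2, x):
    """Forward pass; also returns the activations that backward needs."""
    z = matvec(W1, x)                # hidden pre-activations
    h = [max(0.0, zi) for zi in z]   # ReLU
    y = matvec(W2, h)
    return y, {"x": x, "z": z, "h": h}


def backward(W2, acts, grad_y):
    """Weight gradients from stored activations only (no forward pass)."""
    grad_W2 = [[g * hj for hj in acts["h"]] for g in grad_y]
    grad_h = [sum(W2[i][j] * grad_y[i] for i in range(len(grad_y)))
              for j in range(len(acts["h"]))]
    grad_z = [gh if zj > 0 else 0.0 for gh, zj in zip(grad_h, acts["z"])]
    grad_W1 = [[g * xk for xk in acts["x"]] for g in grad_z]
    return grad_W1, grad_W2


W1 = [[0.5, -1.0], [1.5, 2.0]]
W2 = [[1.0, -2.0]]
x = [1.0, 2.0]

# Serving: answer the request and record the activations.
y_served, stored = forward(W1, W2, x)

# Feedback arrives later; loss = 0.5 * (y - target)^2, so dL/dy = y - target.
target = [-10.0]
grad_y = [yi - ti for yi, ti in zip(y_served, target)]

# Path A (reuse): backward directly on the recorded activations.
reused = backward(W2, stored, grad_y)

# Path B (separate trainer): run the forward pass again, then backward.
_, recomputed_acts = forward(W1, W2, x)
recomputed = backward(W2, recomputed_acts, grad_y)

print(reused == recomputed)
```

Path A skips the second call to `forward`, which is the redundant computation the abstract refers to. The price is that the recorded activations must be kept in memory until feedback arrives.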

