LG AI CL DC | Sep 28, 2025

MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

arXiv:2510.03283v1 | 2 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the challenge of deploying LLMs with continual learning on edge platforms for latency-sensitive applications like personalized assistants, representing an incremental improvement in scheduling methods.

The paper tackles the problem of balancing inference latency and model accuracy for large language models on edge servers by proposing MACE, a hybrid system that colocates inference and fine-tuning with iteration-level scheduling, resulting in up to 63% reduction in inference latency while maintaining throughput and high GPU utilization.

Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. However, because user data is non-stationary, models must be retrained frequently, which creates a fundamental tension between inference latency and model accuracy under constrained GPU resources. Existing retraining strategies delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity. In this paper, we identify iteration-level scheduling as crucial for adapting retraining frequency to model drift without violating service-level objectives (SLOs). We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while preserving inference throughput. MACE leverages the insight that not all model updates equally affect output alignment and allocates GPU cycles accordingly to balance throughput, latency, and update freshness. Our trace-driven evaluation shows that MACE matches or exceeds continuous retraining while reducing inference latency by up to 63% and maintaining throughput under resource constraints. Compared to periodic retraining, MACE improves the latency breakdown across the prefill, decode, and fine-tuning stages, and sustains GPU utilization above 85% on NVIDIA AGX Orin. These results demonstrate that iteration-level hybrid scheduling is a promising direction for deploying LLMs with continual learning capabilities on edge platforms.
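
The abstract does not spell out the scheduling policy, so the following is only a minimal sketch of what an iteration-level, SLO-aware decision could look like, not the authors' algorithm. The `Request` type, the `schedule_iteration` function, the scalar `drift` score, the `drift_threshold` value, and the earliest-deadline-first feasibility check are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A pending inference step with a latency deadline (hypothetical type)."""
    kind: str           # "prefill" or "decode"
    cost_ms: float      # assumed duration of one iteration of this request
    deadline_ms: float  # absolute time by which the step must finish (SLO)


def schedule_iteration(now_ms, pending, finetune_cost_ms, drift, drift_threshold=0.5):
    """Choose the work for the next GPU iteration.

    Returns "finetune" only when the drift score justifies an update and every
    pending inference step would still meet its deadline afterwards, assuming
    the steps then run back to back in earliest-deadline-first order.
    """
    if not pending:
        # No inference work is waiting: fine-tune if the model has drifted.
        return "finetune" if drift >= drift_threshold else "idle"

    if drift >= drift_threshold:
        # Simulate running one fine-tuning iteration first, then all requests.
        finish = now_ms + finetune_cost_ms
        feasible = True
        for req in sorted(pending, key=lambda r: r.deadline_ms):
            finish += req.cost_ms
            if finish > req.deadline_ms:
                feasible = False
                break
        if feasible:
            return "finetune"

    # Either drift is low or fine-tuning now would violate an SLO.
    return "inference"
```

Because the decision is re-evaluated before every iteration, fine-tuning work in this sketch is only slotted in when there is enough deadline slack, which is one simple way to express the trade-off between update freshness and inference latency described above.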

