Stratos: An End-to-End Distillation Pipeline for Customized LLMs under Distributed Cloud Environments
Stratos addresses the industrial need for cost-efficient, customized LLMs in vertical domains, though the contribution appears incremental: it is primarily an automation pipeline built around existing distillation techniques.
The authors automate knowledge distillation for customized large language models under user-defined constraints in distributed cloud environments. The resulting student model achieves four times the accuracy of its GPT-4o teacher baseline on a domain-specific Mahjong reasoning task while reducing latency and cost.
The growing industrial demand for customized and cost-efficient large language models (LLMs) is fueled by the rise of vertical, domain-specific tasks and the need to optimize performance under constraints such as latency and budget. Knowledge distillation, as an efficient model compression and transfer technique, offers a feasible solution. However, existing distillation frameworks often require manual intervention and struggle to meet such complex user-defined distillation requirements. To bridge this gap, we propose Stratos, an end-to-end LLM distillation pipeline that automates server and model selection, knowledge distillation, and deployment in distributed cloud environments. Given user-defined constraints on model performance and system budget, Stratos automatically selects Pareto-optimal servers, dynamically matches teacher-student pairs, and adapts distillation strategies based on task complexity to optimize cloud hosting. Experiments show that Stratos produces a student model that achieves four times the accuracy of its GPT-4o teacher baseline on a rare, domain-specific Mahjong reasoning task with reverse synthetic data and knowledge injection. Moreover, it achieves reduced latency and cost without compromising accuracy. These results highlight its promise for vertical-domain LLM deployment.
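To make the phrase "Pareto-optimal servers" concrete, the following is a minimal sketch of Pareto filtering under user-defined constraints. It is not the Stratos implementation: the `Server` fields, the two metrics (latency and hourly cost), the candidate values, and the function names are all assumptions chosen for illustration, since the abstract does not specify which metrics or algorithm the pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical server description; the real pipeline may track other metrics.
    name: str
    latency_ms: float      # lower is better
    cost_per_hour: float   # lower is better


def dominates(a: Server, b: Server) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.cost_per_hour <= b.cost_per_hour
    strictly_better = a.latency_ms < b.latency_ms or a.cost_per_hour < b.cost_per_hour
    return no_worse and strictly_better


def pareto_front(servers, max_latency_ms, max_cost_per_hour):
    """Drop servers that violate the constraints, then keep the non-dominated ones."""
    feasible = [
        s for s in servers
        if s.latency_ms <= max_latency_ms and s.cost_per_hour <= max_cost_per_hour
    ]
    return [s for s in feasible if not any(dominates(o, s) for o in feasible)]


# Illustrative candidates (made-up numbers, not from the paper).
candidates = [
    Server("a", 120.0, 2.0),
    Server("b", 80.0, 3.5),
    Server("c", 130.0, 2.5),  # dominated by "a" on both metrics
    Server("d", 40.0, 9.0),   # violates the cost constraint below
]

front = pareto_front(candidates, max_latency_ms=150.0, max_cost_per_hour=5.0)
front_names = [s.name for s in front]
print(front_names)
```

In this sketch, servers "a" and "b" survive because neither is better than the other on both metrics at once; choosing between them would require an additional preference, such as weighting cost against latency.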