Zhexiang Zhang

2papers

2 Papers

96.6DCApr 28
Janus: Disaggregating Attention and Experts for Scalable MoE Inference

Zhexiang Zhang, Ye Wang, Yumiao Zhao et al.

Serving large Mixture-of-Experts (MoE) models is challenging because of their large memory footprints, heterogeneous resource demands, and highly dynamic inference workloads. Most existing MoE inference systems deploy the entire model as a monolithic unit, forcing attention and MoE layers to share the same resource configuration despite their different scaling behaviors and resource bottlenecks. Such coarse-grained provisioning leads to resource inefficiency and suboptimal performance. We present JANUS, a scalable and resource-efficient MoE inference system built around three key principles. First, JANUS disaggregates attention and MoE layers onto separate GPU worker pools, enabling independent resource provisioning for the two layer types, and uses an adaptive two-phase communication mechanism for low-latency data exchange. Second, because MoE-layer execution is often memory-bound and highly sensitive to activated-expert imbalance, JANUS introduces a lightweight, microsecond-scale activation scheduler that balances per-layer activated experts across MoE instances to reduce inference latency. Third, JANUS employs a fine-grained, SLO-aware resource scaling scheme that jointly selects attention resources, MoE resources, and expert placement to minimize GPU cost under token-level SLOs. Evaluation shows that JANUS improves per-GPU throughput by up to 4.7x over state-of-the-art MoE inference baselines while satisfying token-level latency SLOs.

50.3CLApr 2
DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Liang Zhu, Feiteng Fang, Yuelin Bai et al.

Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.