LG, Mar 27, 2025
MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
Zihao Zheng, Xiuping Cui, Size Zheng et al.
With the advances in artificial intelligence, Mixture-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses models but also significantly accelerates their inference. Existing quantization methods have gradually shifted their focus from parameter scaling to the analysis of data distributions. However, this analysis is designed for dense LLMs and is suboptimal for MoE quantization because of MoEs' complex data-model distribution. To address this problem, we decouple the complexity of MoEs' data-model distribution into a multi-stage analysis and reveal MoEs' inherent dynamics. The analysis shows that the expert performance of an MoE varies dynamically both within and across data distributions. Based on these findings, we design two quantization strategies with data-model distribution awareness and integrate them into an end-to-end framework for MoE quantization, named MoQa. MoQa uses an expert-level mixed-precision base quantization with distribution awareness. Moreover, MoQa uses a channel-level quantization adjustment to dynamically adjust expert performance and adapt to novel distributions. Experiments show that MoQa's base quantization achieves a 0.49 to 8.51 PPL decrease on known distributions. With the adjustments, MoQa achieves a 2.74 to 6.44 PPL decrease and 1.85% to 3.77% average accuracy improvements on novel distributions. We believe MoQa will play a role in future MoE construction, optimization, and compression.
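As a rough illustration of what expert-level mixed-precision quantization can look like in general, the sketch below gives frequently routed experts a higher bit-width and quantizes each expert's weights on a symmetric uniform grid. This is not MoQa's algorithm: the frequency-based rule, the 8/4-bit choice, and the names `assign_bits`, `quantize_symmetric`, and `quantize_experts` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def quantize_symmetric(weights: List[float], bits: int) -> List[float]:
    """Fake-quantize: snap weights to a signed `bits`-bit grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax  # one scale per expert (per-tensor)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def assign_bits(
    routing_freq: Dict[str, float],
    high_bits: int = 8,
    low_bits: int = 4,
    top_fraction: float = 0.5,
) -> Dict[str, int]:
    """Hypothetical rule: the most frequently routed experts keep more bits."""
    ranked = sorted(routing_freq, key=routing_freq.get, reverse=True)
    n_high = max(1, int(len(ranked) * top_fraction))
    return {
        name: (high_bits if rank < n_high else low_bits)
        for rank, name in enumerate(ranked)
    }


def quantize_experts(
    experts: Dict[str, List[float]], routing_freq: Dict[str, float]
) -> Tuple[Dict[str, List[float]], Dict[str, int]]:
    """Quantize every expert at the bit-width chosen by `assign_bits`."""
    bits = assign_bits(routing_freq)
    quantized = {
        name: quantize_symmetric(weights, bits[name])
        for name, weights in experts.items()
    }
    return quantized, bits
```

In a real pipeline the routing frequencies would be measured on calibration data, and the allocation rule would follow whatever distribution analysis the method prescribes; the threshold rule here is only a placeholder.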
CV, May 27, 2025
EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models
Feng Jiang, Zihao Zheng, Xiuping Cui et al.
With the development of embodied artificial intelligence, end-to-end control policies such as Vision-Language-Action (VLA) models have become mainstream. Existing VLA models incur high computing and storage costs, which need to be reduced. Quantization is considered the most effective method, as it can not only reduce memory cost but also accelerate computation. However, we find that the token alignment of VLA models hinders the application of existing quantization methods. To address this, we propose an optimized framework called EaqVLA, which applies encoding-aligned quantization to VLA models. Specifically, we propose a complete analysis method to find the misalignment at various granularities. Based on the analysis results, we propose a mixed-precision quantization with awareness of encoding alignment. Experiments show that the proposed EaqVLA achieves better quantization performance (with the minimal quantization loss for end-to-end action control and xxx times acceleration) than existing quantization methods.
LG, May 17, 2025
FedHQ: Hybrid Runtime Quantization for Federated Learning
Zihao Zheng, Ziyao Wang, Xiuping Cui et al.
Federated Learning (FL) is a decentralized model training approach that preserves data privacy but struggles with low efficiency. Quantization, a powerful training optimization technique, has been widely explored for integration into FL. However, many studies fail to consider the distinct performance characteristics of particular quantization strategies, such as post-training quantization (PTQ) and quantization-aware training (QAT). As a result, existing FL quantization methods rely solely on either PTQ or QAT, optimizing for speed or accuracy while compromising the other. To efficiently accelerate FL and maintain distributed convergence accuracy across various FL settings, this paper proposes a hybrid quantization approach combining PTQ and QAT for FL systems. We conduct case studies to validate the effectiveness of using hybrid quantization in FL. To address the difficulty of modeling speed and accuracy caused by device and data heterogeneity, we propose a hardware-related analysis and a data-distribution-related analysis to help identify the trade-off boundaries for strategy selection. Based on these, we propose a novel framework named FedHQ that automatically adopts the optimal hybrid strategy allocation for FL systems. Specifically, FedHQ develops a coarse-grained global initialization and a fine-grained ML-based adjustment to ensure efficiency and robustness. Experiments show that FedHQ achieves up to 2.47x training acceleration and up to 11.15% accuracy improvement with negligible extra overhead.