LG · Sep 1, 2025

Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks

Xinlu Zhang, Na Yan, Yang Su, Yansha Deng, Toktam Mahmoodi

arXiv:2509.01750v27.11 citations · h-index: 32

Originality: Incremental advance

AI Analysis

This work addresses communication bottlenecks for bandwidth-limited clients in privacy-preserving federated learning of LLMs, representing an incremental improvement over existing methods.

The paper tackles the high communication overhead of federated distillation for large language models (LLMs) over wireless networks by proposing an adaptive Top-k logit selection and aggregation scheme. The scheme reduces communication overhead by approximately 50% while outperforming baseline methods.

Federated learning (FL) for large language models (LLMs) offers a privacy-preserving scheme, enabling clients to collaboratively fine-tune locally deployed LLMs or smaller language models (SLMs) without exchanging raw data. While parameter-sharing methods in traditional FL solve a number of technical challenges, they still incur high communication overhead and struggle to adapt to heterogeneous model architectures. Federated distillation, a framework for mutual knowledge transfer via shared logits, typically offers lower communication overhead than parameter-sharing methods. However, transmitting logits from LLMs remains challenging for bandwidth-limited clients because of their high dimensionality. In this work, we focus on communication-efficient federated LLM distillation. To achieve this, we first propose an adaptive Top-k logit selection mechanism that dynamically sparsifies logits according to real-time communication conditions. Then, to tackle the dimensional inconsistency introduced by the adaptive sparsification, we design an adaptive logits aggregation scheme that alleviates the artificial and uninformative inputs introduced by conventional zero-padding methods. Finally, to enhance the distillation effect, we incorporate a LoRA-adapted hidden-layer projection from the LLM into the distillation loss, further reducing the communication overhead while providing a richer representation. Experimental results demonstrate that our scheme outperforms baseline methods while reducing communication overhead by approximately 50%.
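The abstract does not give the exact rules for choosing k or for aggregating sparsified logits, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical budget rule (`choose_k`, with an assumed cost of 48 bits per transmitted index-value pair) and a simple stand-in aggregation that averages each token only over the clients that actually sent it, contrasted with the zero-padding baseline the abstract mentions. All function names and parameters are illustrative.

```python
def choose_k(bandwidth_bps, deadline_s, bits_per_entry=48, k_min=1, k_max=1000):
    """Hypothetical rule: the largest k whose (index, value) pairs fit
    into the bits a client can upload before the deadline."""
    budget_bits = bandwidth_bps * deadline_s
    k = int(budget_bits // bits_per_entry)
    return max(k_min, min(k_max, k))


def top_k_sparsify(logits, k):
    """Keep the k largest logits as a {token_index: value} mapping."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {i: logits[i] for i in order[:k]}


def aggregate_zero_padding(sparse_logits, vocab_size):
    """Baseline: treat every missing entry as 0 and average over all clients."""
    n_clients = len(sparse_logits)
    dense = [0.0] * vocab_size
    for client in sparse_logits:
        for i, value in client.items():
            dense[i] += value / n_clients
    return dense


def aggregate_support_aware(sparse_logits):
    """Stand-in for an adaptive aggregation: average each token only over
    the clients that reported it, so no artificial zeros enter the mean."""
    sums, counts = {}, {}
    for client in sparse_logits:
        for i, value in client.items():
            sums[i] = sums.get(i, 0.0) + value
            counts[i] = counts.get(i, 0) + 1
    return {i: sums[i] / counts[i] for i in sums}


# Two clients with different link quality over a toy 5-token vocabulary.
client_a = [4.0, 1.0, 3.0, 0.5, 2.0]
client_b = [3.0, 2.5, 0.0, 1.0, 0.2]
k_a = choose_k(bandwidth_bps=144, deadline_s=1.0)  # better link, larger k
k_b = choose_k(bandwidth_bps=48, deadline_s=1.0)   # weaker link, smaller k
uploads = [top_k_sparsify(client_a, k_a), top_k_sparsify(client_b, k_b)]

print(aggregate_zero_padding(uploads, vocab_size=5))
print(aggregate_support_aware(uploads))
```

In this toy run, the zero-padding baseline halves the logits that only one client reported, while the support-aware average leaves them unchanged. This is the kind of distortion from uninformative padded inputs that the abstract attributes to zero-padding.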

