CV · Dec 14, 2023

CLIP-guided Federated Learning on Heterogeneous and Long-Tailed Data

arXiv:2312.08648v1 · 15 citations · h-index: 40
Originality: Incremental advance
AI Analysis

This work addresses data heterogeneity and class imbalance in federated learning, offering an incremental improvement by integrating a pre-trained CLIP model.

The paper tackles federated learning on heterogeneous and long-tailed data by proposing CLIP2FL, a method that uses CLIP's vision-language supervision to improve client-side feature representation and server-side classifier training. It reports improved performance on benchmark datasets such as CIFAR-10-LT and CIFAR-100-LT, with accuracy gains of up to 5.2%.

Federated learning (FL) provides a decentralized machine learning paradigm in which a server collaborates with a group of clients to learn a global model without accessing the clients' data. User heterogeneity is a significant challenge for FL, and class-distribution imbalance makes FL harder still. Great progress has been made in large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), which opens a new path for image classification and object recognition. Inspired by the success of CLIP on few-shot and zero-shot learning, we use CLIP to optimize federated learning between server and client models under its vision-language supervision. CLIP's powerful cross-modality representation and rich open-vocabulary prior knowledge make it a promising way to mitigate user heterogeneity and class-distribution imbalance. In this paper, we propose the CLIP-guided FL (CLIP2FL) method for heterogeneous and long-tailed data. In CLIP2FL, the knowledge of the off-the-shelf CLIP model is transferred to the client and server models, and a bridge is built between the client and the server. Specifically, for client-side learning, knowledge distillation is conducted between the client models and CLIP to improve client-side feature representation. For server-side learning, to mitigate heterogeneity and class-distribution imbalance, we generate federated features to retrain the server model. Prototype contrastive learning, supervised by the text encoder of CLIP, is introduced to generate federated features from the client-side gradients, and these features are used to retrain a balanced server classifier.
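The two loss terms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature values, the mixing weight `alpha`, and the use of plain Python lists in place of tensors are all assumptions made for illustration. The first function shows a standard knowledge-distillation loss in which CLIP's logits act as the teacher for a client model; the second shows a prototype contrastive loss in which a feature is pulled toward the CLIP text embedding of its class. The gradient-matching step that the paper uses to generate federated features is not covered here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def client_distillation_loss(client_logits, clip_logits, label,
                             temperature=2.0, alpha=0.5):
    """Hypothetical client objective: cross-entropy plus distillation from CLIP.

    alpha weights the distillation term; temperature softens both
    distributions (both values are illustrative assumptions).
    """
    teacher = softmax(clip_logits, temperature)
    student = softmax(client_logits, temperature)
    # The T^2 factor is the usual rescaling for temperature-softened KD.
    kd = kl_divergence(teacher, student) * temperature ** 2
    ce = -math.log(softmax(client_logits)[label])
    return (1.0 - alpha) * ce + alpha * kd


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prototype_contrastive_loss(feature, text_prototypes, label, tau=0.1):
    """Hypothetical prototype contrastive loss.

    text_prototypes[c] stands in for the CLIP text embedding of class c.
    The loss is low when the feature is closest to its own class prototype.
    """
    sims = [cosine(feature, proto) / tau for proto in text_prototypes]
    return -math.log(softmax(sims)[label])
```

In a real system the logits and embeddings would come from a client network and a frozen CLIP model; here they are plain lists so the arithmetic is easy to follow. With `alpha=1.0` and identical client and CLIP logits, the distillation loss is zero, which is a quick sanity check on the KL term.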

Code Implementations: 1 repo