LG · Jan 21

FedUMM: A General Framework for Federated Learning with Unified Multimodal Models

Zhaolong Su, Leheng Zhao, Xiaoying Wu, Ziyue Xu, Jindong Wang

arXiv:2601.15390v1 · 1.4

Originality: Incremental advance

AI Analysis

This work addresses privacy and deployment issues for unified multimodal models in distributed scenarios, though it is incremental as it builds on existing federated learning and adapter techniques.

The paper tackles the challenge of training unified multimodal models in privacy-sensitive distributed settings. It proposes FedUMM, a federated learning framework that uses parameter-efficient fine-tuning with LoRA adapters. The framework remains competitive with centralized training while reducing per-round communication by over an order of magnitude compared to full fine-tuning.

Unified multimodal models (UMMs) are emerging as strong foundation models that can perform both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered on a central server, which limits deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while keeping the foundation model frozen, and the server aggregates only the adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmarks under Dirichlet-controlled heterogeneity with up to 16 clients. Results show slight degradation as client count and heterogeneity increase, while performance remains competitive with centralized training. We further analyze computation-communication trade-offs and demonstrate that adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work provides empirical findings for future research on privacy-preserving federated unified multimodal models.
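
The abstract names three mechanisms: Dirichlet-controlled client heterogeneity, server-side aggregation of adapter updates only, and the resulting drop in per-round communication. The sketch below illustrates each in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it does not use NVIDIA FLARE or BLIP3o, the function names are hypothetical, sample-count-weighted averaging (FedAvg) is assumed as the aggregation rule, and the layer size and LoRA rank in the usage note are made-up values rather than figures from the paper.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, drawing each class's client
    shares from Dirichlet(alpha). Smaller alpha gives more skewed
    (more non-IID) clients. Hypothetical helper, not from the paper."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # A Dirichlet sample is a set of independent Gamma(alpha, 1)
        # draws normalized to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total == 0.0:  # guard against numerical underflow
            draws, total = [1.0] * num_clients, float(num_clients)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += draws[c] / total
            if c == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = min(len(idxs), round(cumulative * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


def fedavg_adapters(client_updates, client_sizes):
    """Average adapter tensors only, weighted by client sample count.
    Each update maps a tensor name to a flat list of floats; the frozen
    backbone weights are never sent. FedAvg weighting is an assumption."""
    total = sum(client_sizes)
    merged = {}
    for name in client_updates[0]:
        length = len(client_updates[0][name])
        merged[name] = [
            sum(u[name][i] * n for u, n in zip(client_updates, client_sizes)) / total
            for i in range(length)
        ]
    return merged


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is (rank x d_in) and
    B is (d_out x rank), versus d_in * d_out for the full weight."""
    return rank * (d_in + d_out)
```

As a usage example with hypothetical sizes: for a single 4096 x 4096 projection with rank 16, `lora_param_count(4096, 4096, 16)` is 131,072 parameters against 16,777,216 for the full matrix, a 128x reduction for that layer. This is consistent in spirit with the "over an order of magnitude" claim, although the actual ratio depends on which layers carry adapters and on the rank the authors chose.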
