DCLG · Feb 10, 2025

Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach

arXiv:2502.06355v31 citations · h-index: 37
Originality: Incremental advance
AI Analysis

This paper addresses the resource constraints that edge computing imposes on multimodal AI, though the advance is incremental because it builds on existing split learning.

The paper tackles the problem of deploying multimodal transformers on edge devices by proposing MPSL, a parallel split learning approach that reduces client-side computations by 250x and matches or outperforms federated learning across 7 datasets.

Multimodal transformers integrate diverse data types such as images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameter counts limit deployment on resource-constrained edge devices. Split Learning (SL), which partitions a model at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner that eliminates label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computations by 250x, and scales better in communication cost as models grow. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.
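To make the cut-layer idea concrete, here is a minimal pure-Python sketch of one possible parallel split learning loop. It is not the paper's implementation. Everything in it is an assumption made for illustration: the `Client` and `Server` classes, the frozen linear "tokenizer", the single tanh layer standing in for the shared encoder, the client-side head, and the squared-error loss. The abstract says only that labels are not shared, so keeping the head and loss on the client is one way to satisfy that, not necessarily the way MPSL does it.

```python
import math
import random

random.seed(0)
D_IN, D_MODEL = 4, 3  # toy sizes, chosen only for illustration


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class Client:
    """Hypothetical edge device: raw data, label, frozen tokenizer, tiny head."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self.tokenizer = rand_matrix(D_MODEL, D_IN)  # assumed frozen
        self.head = [random.uniform(-0.5, 0.5) for _ in range(D_MODEL)]

    def tokenize(self):
        # Only these activations cross the cut-layer, never x or y.
        return matvec(self.tokenizer, self.x)

    def head_step(self, h, lr):
        pred = sum(w * hi for w, hi in zip(self.head, h))
        err = pred - self.y
        grad_h = [2 * err * w for w in self.head]  # gradient sent back to server
        self.head = [w - lr * 2 * err * hi for w, hi in zip(self.head, h)]
        return err * err, grad_h


class Server:
    """Hypothetical server: one shared encoder layer for all clients."""

    def __init__(self):
        self.W = rand_matrix(D_MODEL, D_MODEL)

    def forward(self, token_batch):
        # All clients' activations are processed together in one pass.
        self.tokens = token_batch
        self.h = [[math.tanh(z) for z in matvec(self.W, t)] for t in token_batch]
        return self.h

    def backward(self, grad_batch, lr):
        n = len(grad_batch)
        for t, h, g in zip(self.tokens, self.h, grad_batch):
            for i in range(D_MODEL):
                dz = g[i] * (1 - h[i] ** 2)  # derivative of tanh
                for j in range(D_MODEL):
                    self.W[i][j] -= lr * dz * t[j] / n


def train_round(clients, server, lr=0.1):
    tokens = [c.tokenize() for c in clients]          # client side
    hidden = server.forward(tokens)                   # server side
    results = [c.head_step(h, lr) for c, h in zip(clients, hidden)]
    server.backward([g for _, g in results], lr)      # server side
    return sum(loss for loss, _ in results) / len(results)


clients = [Client([random.uniform(-1, 1) for _ in range(D_IN)], y)
           for y in (0.5, -0.5, 1.0)]
server = Server()
history = [train_round(clients, server) for _ in range(200)]
```

In this toy setup the server keeps a single encoder for every client and the client tokenizers are never updated, so there is no per-client sub-model on the server and no weight averaging between clients. A real system would replace the toy layers with transformer blocks and would have to account for the communication cost of the activations and gradients.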
