DC AI Apr 23

Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning

arXiv:2604.22072 25.6 h-index: 6
AI Analysis

For federated learning practitioners using serverless infrastructure, this work removes the hard scalability ceiling imposed by per-function memory limits.

GradsSharding partitions the gradient tensor into M shards for independent averaging, enabling federated aggregation of arbitrarily large models on serverless platforms. It achieves a 2.7x cost reduction at VGG-16 scale and remains deployable beyond the 10 GB memory ceiling.

Federated learning (FL) aggregation on serverless platforms faces a hard scalability ceiling: existing architectures (lambda-FL, LIFL) partition clients across aggregators, but every aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit (e.g., 10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor. We propose GradsSharding, which instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives contributions from all clients. Because FedAvg averaging is element-wise, this produces bit-identical results to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), independent of client count, enabling aggregation of arbitrarily large models. We evaluate GradsSharding against lambda-FL and LIFL through HPC experiments and real AWS Lambda deployments across model sizes from 43 MB to 5 GB. Results show a cost crossover at approximately 500 MB gradient size, 2.7x cost reduction at VGG-16 scale, and that GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.
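The element-wise argument in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_into_shards`, `average_shard`, `sharded_fedavg`, `full_fedavg`, `min_shards`) are hypothetical, gradients are modeled as flat Python lists, clients are assumed to be equally weighted, and each call to `average_shard` merely stands in for one serverless function. The `min_shards` helper is a rough lower bound on M that ignores runtime overhead and any constant factors hidden by the O(|θ|/M) bound.

```python
import math
import random


def split_into_shards(grad, num_shards):
    """Split a flat gradient (list of floats) into num_shards contiguous slices."""
    shard_len = math.ceil(len(grad) / num_shards)
    return [grad[i * shard_len:(i + 1) * shard_len] for i in range(num_shards)]


def average_shard(client_shards):
    """Element-wise mean of one shard across all clients.

    Stands in for a single serverless function: it only ever sees
    slices of length about len(grad) / num_shards.
    """
    n = len(client_shards)
    return [sum(vals) / n for vals in zip(*client_shards)]


def sharded_fedavg(client_grads, num_shards):
    """Average each shard independently, then concatenate the results."""
    per_client = [split_into_shards(g, num_shards) for g in client_grads]
    averaged = [
        average_shard([shards[m] for shards in per_client])
        for m in range(num_shards)
    ]
    return [x for shard in averaged for x in shard]


def full_fedavg(client_grads):
    """Reference: average the complete gradient in a single aggregator."""
    n = len(client_grads)
    return [sum(vals) / n for vals in zip(*client_grads)]


def min_shards(grad_bytes, memory_limit_bytes):
    """Smallest M such that one shard fits in the limit (overhead ignored)."""
    return math.ceil(grad_bytes / memory_limit_bytes)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical toy setup: 5 clients, 10 gradient elements, 4 shards.
    grads = [[random.random() for _ in range(10)] for _ in range(5)]
    print(sharded_fedavg(grads, 4) == full_fedavg(grads))
    # Hypothetical sizes: a 25 GB gradient against a 10 GB per-function limit.
    print(min_shards(25 * 10**9, 10 * 10**9))
```

In this sketch both paths add the same client values for each element in the same order, so the results compare equal exactly. Each shard function holds only a slice of the gradient at a time, which is the property the abstract relies on for the O(|θ|/M) memory bound.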

