Model Parallelism With Subnetwork Data Parallelism
This work addresses memory constraints faced by researchers and engineers training large neural networks, offering an incremental improvement over existing distributed training methods.
The paper tackles the memory demands of pre-training large neural networks by introducing Subnetwork Data Parallelism (SDP), a distributed training framework that partitions models into subnetworks trained across workers without exchanging activations. Experiments on CNNs, transformers, and LLM pre-training show SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance.
Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies: neuron level and block level, applied across both CNNs and transformers. In experiments spanning CNNs and transformers on CIFAR and ImageNet, as well as LLM pre-training on FineWeb, SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance. Notably, in FLOP-matched settings, forward masking can sometimes achieve better performance.
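To make the two masking regimes concrete, here is a minimal NumPy sketch of neuron-level masking on a toy two-layer network. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how subnetworks are assigned or how worker gradients are combined, so the cyclic window assignment, the per-neuron coverage normalization, the tanh network, and every function name below are hypothetical.

```python
import numpy as np


def neuron_masks(num_hidden, num_workers, keep):
    """Give each worker a cyclic window of `keep` hidden neurons (0/1 mask).

    Assumption: this assignment scheme is illustrative, not taken from the paper.
    """
    masks = np.zeros((num_workers, num_hidden))
    stride = num_hidden // num_workers
    for k in range(num_workers):
        masks[k, (k * stride + np.arange(keep)) % num_hidden] = 1.0
    return masks


def worker_grads(W1, W2, x, t, mask, forward_masking):
    """Gradients of 0.5 * ||W2 tanh(W1 x) - t||^2 restricted to one subnetwork."""
    h_full = np.tanh(W1 @ x)
    # Forward masking drops the masked-out neurons from the forward pass too;
    # backward masking keeps the full forward pass.
    h = h_full * mask if forward_masking else h_full
    err = W2 @ h - t
    # In both regimes, only parameters of the worker's own neurons get gradients.
    g_W2 = np.outer(err, h_full) * mask[None, :]
    g_h = (W2.T @ err) * mask
    g_W1 = np.outer(g_h * (1.0 - h_full**2), x)
    return g_W1, g_W2


def aggregate(per_worker, masks):
    """Average each neuron's gradient over the workers that own it (assumed rule)."""
    coverage = np.maximum(masks.sum(axis=0), 1.0)
    g_W1 = sum(g[0] for g in per_worker) / coverage[:, None]
    g_W2 = sum(g[1] for g in per_worker) / coverage[None, :]
    return g_W1, g_W2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(6, 4)), rng.normal(size=(3, 6))
    x, t = rng.normal(size=4), rng.normal(size=3)
    masks = neuron_masks(num_hidden=6, num_workers=3, keep=4)

    full = worker_grads(W1, W2, x, t, np.ones(6), forward_masking=False)
    bwd = aggregate([worker_grads(W1, W2, x, t, m, False) for m in masks], masks)
    fwd = aggregate([worker_grads(W1, W2, x, t, m, True) for m in masks], masks)
    print("backward masking matches full gradient:", np.allclose(bwd[1], full[1]))
    print("forward masking matches full gradient:", np.allclose(fwd[1], full[1]))
```

In this toy setting every worker sees the same batch, so the coverage-normalized aggregate under backward masking reproduces the dense gradient exactly. With different batches per worker, the same normalization would give an estimate that is unbiased in expectation. Forward masking changes the network output on each worker, so its aggregated gradient generally differs from the dense one. In either regime a worker only needs gradient storage for the neurons its mask keeps, which is where a per-device memory saving would come from in this sketch.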