Scaling Language Model Size in Cross-Device Federated Learning
This work addresses the challenge of scaling language model size in cross-device federated learning, and represents an incremental improvement over existing methods.
The authors address the problem of training larger language models in cross-device federated learning by applying partial model training, quantization, efficient transfer learning, and communication-efficient optimizers to mitigate communication and computation bottlenecks. They train a 21M parameter Transformer and a 20.2M parameter Conformer that match or improve on the perplexity of a similarly sized LSTM with ~10x smaller client-to-server communication cost, and that reach 11% lower perplexity than the smaller LSTMs commonly studied in the literature.
Most studies in cross-device federated learning focus on small models because of server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic application of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a $21$M parameter Transformer and a $20.2$M parameter Conformer that achieve perplexity the same as or better than that of a similarly sized LSTM with $\sim10\times$ smaller client-to-server communication cost, and $11\%$ lower perplexity than the smaller LSTMs commonly studied in the literature.
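
To make the communication argument concrete, the following is a minimal sketch of how quantization and partial model training can each shrink a client-to-server upload. It is not the authors' implementation: the uniform 8-bit quantizer, the helper names (`quantize`, `dequantize`, `upload_bits`), the 50% trainable fraction, and the 32-bit baseline are all assumptions chosen for illustration, and only the 21M parameter count comes from the abstract.

```python
import random


def quantize(values, num_bits=8):
    """Uniformly quantize floats to integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << num_bits) - 1
    # Guard against a constant vector, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values on the server."""
    return [lo + c * scale for c in codes]


def upload_bits(num_params, trainable_fraction, num_bits):
    """Bits in one client upload, ignoring the small (lo, scale) overhead."""
    return int(num_params * trainable_fraction) * num_bits


# A toy client update standing in for a slice of model deltas.
random.seed(0)
update = [random.gauss(0.0, 0.01) for _ in range(1000)]

codes, lo, scale = quantize(update, num_bits=8)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(update, restored))

# Hypothetical settings: full 32-bit upload versus an upload where only
# half of the parameters are trained and each is sent with 8 bits.
baseline = upload_bits(21_000_000, 1.0, 32)
reduced = upload_bits(21_000_000, 0.5, 8)
ratio = baseline / reduced  # 8.0 under these assumed settings
```

Under these assumed settings the two reductions multiply: a 4x saving from fewer bits per value and a 2x saving from uploading only the trained half of the model. The quantization error per value is bounded by half a quantization step (`scale / 2`), which is the accuracy cost traded for the smaller upload.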