Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction
The work enables electronic structure prediction for materials with extended defects or disordered phases. The contribution is incremental: it builds on existing eGNN methods by addressing their scalability bottlenecks.
The paper tackles the memory limitations of training equivariant graph neural networks for large-scale electronic structure prediction by developing a distributed implementation that combines direct GPU communication with graph partitioning. It achieves strong scaling up to 128 GPUs and weak scaling up to 512 GPUs with 87% parallel efficiency, for structures with up to 190,000 atoms.
Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales, enabling investigation of the electronic properties of materials with extended defects, interfaces, or disordered phases. However, because interactions between atomic orbitals typically extend over 10 angstroms or more, the graph representations required for this task tend to be densely connected, and the memory required for training and inference on these large structures can exceed the limits of modern GPUs. Here we present a distributed eGNN implementation that leverages direct GPU communication, and we introduce a partitioning strategy for the input graph that reduces the number of embedding exchanges between GPUs. Our implementation shows strong scaling up to 128 GPUs, and weak scaling up to 512 GPUs with 87% parallel efficiency, for structures with 3,000 to 190,000 atoms on the Alps supercomputer.
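To illustrate why the choice of graph partition affects the number of embedding exchanges, the following minimal sketch builds a cutoff-radius graph on a toy chain of atoms and counts how many remote node embeddings each partition would have to receive. It is not the partitioning strategy of the paper, which the abstract does not specify; the slab and round-robin assignments, the function names `radius_graph` and `halo_exchanges`, and the toy geometry are all assumptions made for illustration.

```python
from itertools import combinations


def radius_graph(positions, cutoff):
    """Return undirected edges (i, j) for atom pairs within `cutoff` of each other."""
    edges = []
    for i, j in combinations(range(len(positions)), 2):
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
        if d2 <= cutoff ** 2:
            edges.append((i, j))
    return edges


def halo_exchanges(edges, owner):
    """Count distinct (receiving rank, remote atom) pairs.

    Each pair stands for one node embedding that a rank must receive
    from another rank before it can update its own atoms.
    """
    halo = set()
    for i, j in edges:
        if owner[i] != owner[j]:
            halo.add((owner[i], j))  # owner of i needs the embedding of j
            halo.add((owner[j], i))  # owner of j needs the embedding of i
    return len(halo)


# Hypothetical toy system: 64 atoms on a line, 1 angstrom apart, 3 angstrom cutoff.
n_atoms, n_ranks = 64, 4
positions = [(float(x), 0.0, 0.0) for x in range(n_atoms)]
edges = radius_graph(positions, cutoff=3.0)

# Spatially contiguous slabs versus an assignment that ignores geometry.
slab_owner = [i * n_ranks // n_atoms for i in range(n_atoms)]
round_robin_owner = [i % n_ranks for i in range(n_atoms)]

slab_halo = halo_exchanges(edges, slab_owner)
round_robin_halo = halo_exchanges(edges, round_robin_owner)
print("slab partition halo embeddings:       ", slab_halo)
print("round-robin partition halo embeddings:", round_robin_halo)
```

In this toy case the spatially contiguous partition only exchanges embeddings for atoms near slab boundaries, whereas the geometry-agnostic assignment requires every atom's embedding on every other rank. With a long interaction cutoff the boundary region grows, which is why a partition that keeps neighbouring atoms on the same GPU matters for densely connected graphs.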