Distributed Cross-Channel Hierarchical Aggregation for Foundation Models
This work addresses efficiency bottlenecks for researchers and practitioners who use large-scale vision transformers in domains such as hyperspectral imaging and weather forecasting. It represents an incremental improvement over existing distributed methods.
The paper tackles the compute-intensive challenge of tokenizing and aggregating images in vision-based scientific foundation models by introducing the Distributed Cross-Channel Hierarchical Aggregation (D-CHAG) approach, which achieved up to a 75% reduction in memory usage and more than doubled sustained throughput on up to 1,024 GPUs.
Vision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources, such as varying physical groundings or data acquisition systems, and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-intensive, a challenge not fully addressed by current distributed methods. In this work, we introduce the Distributed Cross-Channel Hierarchical Aggregation (D-CHAG) approach, designed for datasets with a large number of channels across image modalities. Our method is compatible with any model-parallel strategy and any type of vision transformer architecture, and it significantly improves computational efficiency. We evaluated D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated with tensor parallelism and model sharding, our approach achieved up to a 75% reduction in memory usage and more than doubled sustained throughput on up to 1,024 AMD GPUs on the Frontier supercomputer.
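The abstract does not specify the aggregation operator or how channels are partitioned. As a rough illustration only of what aggregating many channels hierarchically can look like, the sketch below tokenizes each channel into patches, reduces the tokens within each channel group, and then combines the per-group results. Everything here is an assumption made for illustration: the function names are hypothetical, the weighted mean is a placeholder for whatever learned aggregation a real model would use, and the channel groups merely stand in for the workers that would each hold a subset of channels in a distributed setting. It is not the D-CHAG implementation.

```python
def tokenize_channel(image, patch):
    """Split one channel (a list of pixel rows) into flattened,
    non-overlapping patch tokens of length patch * patch."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    return [
        [image[r * patch + i][c * patch + j]
         for i in range(patch) for j in range(patch)]
        for r in range(rows) for c in range(cols)
    ]


def weighted_mean(token_sets, weights):
    """Weighted average of equally shaped token sets.
    Placeholder for a learned aggregation step (assumption)."""
    total = sum(weights)
    num_tokens, dim = len(token_sets[0]), len(token_sets[0][0])
    return [
        [sum(w * ts[t][k] for ts, w in zip(token_sets, weights)) / total
         for k in range(dim)]
        for t in range(num_tokens)
    ]


def hierarchical_aggregate(channels, patch, num_groups):
    """Two-level aggregation over channels (illustrative only).

    Level 1: each group reduces the tokens of its own channels.
    Level 2: the per-group results are combined into one token set.
    """
    if not 1 <= num_groups <= len(channels):
        raise ValueError("num_groups must be between 1 and the channel count")
    tokens = [tokenize_channel(ch, patch) for ch in channels]
    # Round-robin assignment of channels to groups (hypothetical choice).
    groups = [tokens[g::num_groups] for g in range(num_groups)]
    partial = [weighted_mean(g, [1] * len(g)) for g in groups]
    # Weight each group by its channel count so uneven groups stay unbiased.
    return weighted_mean(partial, [len(g) for g in groups])
```

With a mean as the placeholder operator, the two-level result equals a flat average over all channels, so the only thing that changes is where the work happens: no single group ever has to hold the tokens of every channel at once. A learned aggregator would not generally have this equivalence.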