ChipLight: Cross-Layer Optimization of Chiplet Design with Optical Interconnects for LLM Training
This work addresses the communication bottleneck in large-scale LLM training by co-optimizing chiplet architecture, training parallel strategy, and optical interconnect topology.
ChipLight is a cross-layer optimization method for chiplet design with optical interconnects that significantly improves training efficiency in large-scale LLM training clusters.
In large-scale distributed LLM training, communication between devices becomes the key performance bottleneck. Chiplet technology integrates multiple dies into one package, scaling up node performance with higher bandwidth. Meanwhile, optical interconnect (OI) technology offers long-reach, high-bandwidth links, making it well suited for scale-out networks. Combined, the two technologies have the potential to overcome communication bottlenecks both within and across packages. In this work, we present ChipLight, a cross-layer, multi-objective design and optimization method for training clusters built on chiplets and OI. We first abstract an architecture model for such complex clusters, which lets us co-optimize chiplet architecture, training parallel strategy, and OI network topology. Based on this model, we tailor the design space exploration flow by combining black-box and white-box methodologies. Our experimental results show that ChipLight achieves significantly improved training efficiency and provides valuable design insights for the development of future training clusters.
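To make the idea of mixing black-box and white-box exploration concrete, the following is a minimal sketch and not ChipLight's actual algorithm. It assumes a toy setting: hardware knobs (dies per package, optical links per package) are sampled randomly as the black-box step, the parallel strategy is enumerated exhaustively under an analytical cost model as the white-box step, and the result is a Pareto front over step time and hardware cost. The `Design` fields, the constants in `evaluate`, and all function names are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    dies_per_package: int  # hypothetical chiplet-level knob
    tp: int                # tensor-parallel degree
    dp: int                # data-parallel degree
    oi_links: int          # hypothetical optical links per package


def parallel_strategies(num_devices):
    """White-box step: enumerate every (tp, dp) with tp * dp == num_devices."""
    return [(tp, num_devices // tp)
            for tp in range(1, num_devices + 1)
            if num_devices % tp == 0]


def evaluate(d):
    """Toy analytical model (assumed constants); returns (step_time, cost)."""
    compute = 1.0 / (d.tp * d.dp)
    # Assume tensor-parallel traffic is faster when the group fits in one package.
    intra_bw = 4.0 if d.tp <= d.dies_per_package else 1.0
    tp_comm = 0.05 * (d.tp - 1) / d.tp / intra_bw
    # Assume data-parallel traffic crosses packages over the optical links.
    dp_comm = 0.1 * (d.dp - 1) / d.dp / d.oi_links
    cost = 1.0 * d.dies_per_package + 0.5 * d.oi_links
    return (compute + tp_comm + dp_comm, cost)


def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the (design, objectives) pairs that no other pair dominates."""
    return [p for p in points
            if not any(dominates(q[1], p[1]) for q in points)]


def explore(num_devices=16, samples=20, seed=0):
    rng = random.Random(seed)
    evaluated = []
    for _ in range(samples):
        # Black-box step: sample hardware knobs without using model structure.
        dies = rng.choice([1, 2, 4, 8])
        links = rng.choice([1, 2, 4, 8])
        # White-box step: pick the fastest parallel strategy for this hardware.
        best = min(
            (Design(dies, tp, dp, links)
             for tp, dp in parallel_strategies(num_devices)),
            key=lambda d: evaluate(d)[0],
        )
        evaluated.append((best, evaluate(best)))
    return pareto_front(evaluated)


if __name__ == "__main__":
    for design, (step_time, cost) in explore():
        print(design, round(step_time, 4), cost)
```

In this sketch the split is deliberate: the parallel-strategy space is small and well structured, so it can be enumerated under a closed-form model, while the hardware knobs are treated as opaque inputs. A real flow would likely replace the random sampler with a stronger black-box optimizer and the toy model with a calibrated simulator.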