Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
For researchers and engineers designing geo-distributed AI training systems, this work provides quantitative guidance on the impact of fiber latency and the benefits of hollow-core fiber.
This paper uses discrete-event simulation to quantify how fiber latency affects compute-communication overlap in geo-distributed AI training with data parallelism, finding that optimal inter-cluster distances are 10-100 km, where hollow-core fiber achieves 25% higher overlap.
We use discrete-event simulation to quantify the impact of fiber latency on the efficacy of geo-distributed AI model training with data parallelism. We conclude that the optimum distance between two AI clusters is 10-100 km, a range over which hollow-core fiber enables 25% higher compute-communication overlap.
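To give a sense of the latency scale involved, the following minimal sketch computes one-way propagation delay over the 10-100 km range for the two fiber types. It is not the discrete-event simulation used in this work and does not reproduce the 25% overlap figure. The group indices (about 1.468 for standard single-mode fiber, about 1.0003 for hollow-core fiber) are typical textbook values assumed here, and the function name `one_way_latency_us` is hypothetical.

```python
# Speed of light in vacuum, in km/s.
C_VACUUM_KM_PER_S = 299_792.458

# Assumed group indices (typical values, not taken from this paper).
N_SMF = 1.468   # standard solid-core single-mode fiber
N_HCF = 1.0003  # hollow-core fiber (light travels mostly in air)


def one_way_latency_us(distance_km: float, group_index: float) -> float:
    """Return one-way propagation delay in microseconds.

    Only propagation delay is modeled; switching, serialization,
    and queueing delays are ignored.
    """
    return distance_km * group_index / C_VACUUM_KM_PER_S * 1e6


for distance_km in (10, 100):
    smf_us = one_way_latency_us(distance_km, N_SMF)
    hcf_us = one_way_latency_us(distance_km, N_HCF)
    saving = 1 - hcf_us / smf_us
    print(f"{distance_km:>4} km: SMF {smf_us:7.1f} us, "
          f"HCF {hcf_us:7.1f} us, saving {saving:.1%}")
```

Under these assumptions, hollow-core fiber cuts propagation delay by roughly a third at any distance. How much of that saving turns into additional compute-communication overlap depends on the training workload, which is what the simulation quantifies.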