DC · NI · May 7

CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure

Eric Ding, Byungsoo Oh, Bhaskar Kataria, Kaiwen Guo, Jelena Gvero, Abhishek Vijaya Kumar, Arjun Devraj, Lindsey Bowen, Atharv Sonwane, Emaad Manzoor, Rachee Singh

arXiv:2605.06544

AI Analysis

For researchers and engineers evaluating LLM infrastructure, CCL-Bench provides reusable evidence and fine-grained metrics that expose insights summary-statistic benchmarks cannot, addressing the need for explainable performance comparisons.

CCL-Bench is a trace-based benchmark for LLM infrastructure that records execution traces, workload cards, and launch scripts, and computes fine-grained efficiency metrics from them. It reports three findings: (i) higher compute-communication overlap can coincide with longer training step time; (ii) doubling TPU interconnect bandwidth yields a larger step-time improvement than doubling GPU interconnect bandwidth on small and medium workloads; and (iii) the best-tuned configuration on one framework can be up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.

Evaluative claims about LLM infrastructure ("workload X is fastest on hardware Y with software Z") depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3× slower than the best-tuned configuration on a peer framework on identical hardware.
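To make claim (i) concrete, the following is a minimal sketch of how a compute-communication overlap metric could be derived from trace intervals. It is not the CCL-Bench toolkit: the function names (`merge`, `intersect`, `overlap_metrics`), the interval representation, and the two toy "plans" are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one (start, end) pair per kernel, in microseconds.
Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals: List[Interval]) -> float:
    """Sum of interval lengths (assumes the intervals are disjoint)."""
    return sum(end - start for start, end in intervals)


def intersect(a: List[Interval], b: List[Interval]) -> float:
    """Total length of the intersection of two sorted, disjoint interval lists."""
    i = j = 0
    shared = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def overlap_metrics(compute: List[Interval], comm: List[Interval]) -> Dict[str, float]:
    """Overlap ratio, exposed communication time, and step time for one step."""
    compute_m, comm_m = merge(compute), merge(comm)
    comm_time = total(comm_m)
    hidden = intersect(compute_m, comm_m)  # communication hidden behind compute
    events = compute + comm
    step_time = max(e for _, e in events) - min(s for s, _ in events) if events else 0.0
    return {
        "overlap_ratio": hidden / comm_time if comm_time > 0 else 0.0,
        "exposed_comm": comm_time - hidden,
        "step_time": step_time,
    }


# Toy data (invented): plan B overlaps more communication but moves far more data.
plan_a = overlap_metrics(compute=[(0, 100)], comm=[(100, 120)])
plan_b = overlap_metrics(compute=[(0, 100)], comm=[(20, 100), (100, 160)])
```

In this toy setup, plan B hides most of its communication behind compute yet still finishes later than plan A, because its parallelization choice generates more communication in total. A single end-to-end number would show only the step times; the per-interval view shows why they differ.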
