DC LG Sep 3, 2025

Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training

Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, Yang Bai, Shuguang Wang

arXiv:2509.03018v18.07 citations h-index: 7 SOSP

Originality: Incremental advance

AI Analysis

This work addresses reliability problems in large-scale LLM training systems, such as those at ByteDance, by providing tools for debugging collective communication issues. It represents an incremental improvement over existing black-box collective communication libraries.

The paper tackles hidden reliability issues in collective communication during LLM training by proposing Mycroft, a lightweight distributed tracing and root cause analysis system. In deployment, Mycroft detected anomalies within 15 seconds in 90% of cases and identified root causes within 20 seconds in 60% of cases.

Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective communication related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft's capability and efficiency.
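
As a rough illustration of why per-rank communication state helps with diagnosis, the following minimal Python sketch flags ranks whose state has stopped advancing and then picks the ranks that are furthest behind as suspects. It is not Mycroft's implementation: the `RankState` record, its fields, the `stalled_ranks` and `suspect_ranks` helpers, and the timeout value are all hypothetical names and choices made for this example. The heuristic rests on an assumption stated here, not taken from the abstract: in a collective operation every rank waits on its peers, so the rank with the least progress is a plausible origin of a hang.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankState:
    """Hypothetical per-rank trace record (illustrative only)."""
    rank: int
    op_seq: int           # index of the last collective op this rank entered
    chunks_sent: int      # data chunks sent within that op
    last_update_s: float  # timestamp of the last state change, in seconds


def stalled_ranks(states: List[RankState], now_s: float,
                  timeout_s: float = 5.0) -> List[int]:
    """Return ranks whose state has not changed for more than timeout_s."""
    return [s.rank for s in states if now_s - s.last_update_s > timeout_s]


def suspect_ranks(states: List[RankState]) -> List[int]:
    """Return the ranks with the least progress.

    Assumption: peers block on the slowest participant, so the rank on the
    oldest op with the fewest chunks sent is the most likely origin.
    """
    min_seq = min(s.op_seq for s in states)
    behind = [s for s in states if s.op_seq == min_seq]
    min_sent = min(s.chunks_sent for s in behind)
    return [s.rank for s in behind if s.chunks_sent == min_sent]


if __name__ == "__main__":
    # Four ranks; rank 2 never entered op 7, so the others wait on it.
    trace = [
        RankState(rank=0, op_seq=7, chunks_sent=4, last_update_s=90.0),
        RankState(rank=1, op_seq=7, chunks_sent=3, last_update_s=90.5),
        RankState(rank=2, op_seq=6, chunks_sent=8, last_update_s=89.0),
        RankState(rank=3, op_seq=7, chunks_sent=3, last_update_s=90.2),
    ]
    print("stalled:", stalled_ranks(trace, now_s=100.0))
    print("suspects:", suspect_ranks(trace))
```

In this toy trace all four ranks appear stalled, which is what a black-box view would show; only the recorded progress distinguishes rank 2 from the ranks that are merely waiting on it.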
