PL · DC · LG · Sep 5, 2025

veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD

arXiv:2509.07003v1 · 1 citation · h-index: 7
Originality: Highly original
AI Analysis

This work aims to simplify and improve distributed tensor programming for researchers and engineers training large-scale models. It offers a novel method for known bottlenecks rather than a foundational shift.

The paper tackles two challenges in eager-mode Single Program Multiple Data (SPMD) programming for distributed training of large language models: ensuring consistency with single-device execution and achieving high performance. It reports up to 2.2x speedup over state-of-the-art systems and a 78.4% reduction in code complexity while preserving single-device-equivalent results.

Large Language Models (LLMs) have scaled rapidly in size and complexity, requiring increasingly intricate parallelism for distributed training, such as 3D parallelism. This sophistication motivates a shift toward a simpler, more debuggable programming paradigm such as Single Program Multiple Data (SPMD). However, SPMD in eager execution introduces two key challenges: ensuring consistency with single-device execution and achieving high performance at scale. In this paper, we introduce veScale, an eager-mode training system that fully embraces the SPMD paradigm to democratize distributed tensor programming. veScale addresses the prevalent issue of inconsistent results in systems like PyTorch by introducing a novel distributed Random Number Generation (RNG) algorithm compatible with arbitrary sharded operators. veScale also significantly boosts training performance by reducing the overhead of PyTorch primitives and improving communication efficiency. Evaluations show that veScale delivers up to 2.2x speedup over state-of-the-art training systems such as TorchTitan and cuts code complexity by 78.4%, while preserving single-device-equivalent results.
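
To make the consistency goal concrete, the sketch below shows one simple way a sharded random tensor can match its single-device counterpart: each element's value depends only on the seed and the element's global index, so any rank can generate its own shard independently. This is an illustration of the property, not veScale's algorithm, which the abstract does not describe. The names `uniform_at` and `local_shard`, the SplitMix64-style mixer, and the contiguous sharding scheme are all assumptions chosen for the example.

```python
# Illustrative only: index-based random generation so that sharded results
# equal the single-device result. Not the algorithm used by veScale.

MASK64 = (1 << 64) - 1


def splitmix64(x: int) -> int:
    """One round of the SplitMix64 mixing function on a 64-bit integer."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def uniform_at(seed: int, index: int) -> float:
    """Uniform float in [0, 1) that depends only on (seed, global index)."""
    bits = splitmix64(splitmix64(seed & MASK64) ^ index)
    return (bits >> 11) / float(1 << 53)  # keep the top 53 bits


def full_tensor(seed: int, rows: int, cols: int) -> list:
    """Reference result: the whole tensor generated on a single device."""
    return [[uniform_at(seed, r * cols + c) for c in range(cols)]
            for r in range(rows)]


def _bounds(size: int, rank: int, world_size: int) -> tuple:
    """Contiguous [start, stop) range owned by `rank` (last shards may be short)."""
    per_rank = -(-size // world_size)  # ceiling division
    start = min(rank * per_rank, size)
    return start, min(start + per_rank, size)


def local_shard(seed: int, rows: int, cols: int,
                rank: int, world_size: int, dim: int) -> list:
    """Generate only this rank's shard, split along dim 0 (rows) or 1 (cols)."""
    r0, r1 = _bounds(rows, rank, world_size) if dim == 0 else (0, rows)
    c0, c1 = _bounds(cols, rank, world_size) if dim == 1 else (0, cols)
    # Each element uses its *global* flat index, never a rank-local counter.
    return [[uniform_at(seed, r * cols + c) for c in range(c0, c1)]
            for r in range(r0, r1)]


if __name__ == "__main__":
    reference = full_tensor(seed=42, rows=5, cols=4)
    shards = [local_shard(42, 5, 4, rank, 3, dim=0) for rank in range(3)]
    gathered = [row for shard in shards for row in shard]
    print("row-sharded result matches single device:", gathered == reference)
```

Because every value is a pure function of the seed and the global index, the result is independent of how the tensor is sharded, which is the single-device-equivalence property the abstract refers to. A production system would also need to handle multiple random operators per step and GPU generator state, which this sketch does not attempt.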

