Ziji Shi

LG
h-index3
6papers
140citations
Novelty57%
AI Score30

6 Papers

LGFeb 1, 2023
TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Ziji Shi, Le Jiang, Ang Wang et al.

Tensor parallelism is an essential technique for distributed training of large neural networks. However, automatically determining an optimal tensor parallel strategy is challenging due to the gigantic search space, which grows exponentially with model size and tensor dimension. This prohibits the adoption of auto-parallel systems on larger models. We observe that neural networks usually contain repeated substructures, and build an automatic parallelism framework named TAPAS that eliminates redundant search efforts. TAPAS employs a divide-and-conquer approach that efficiently folds the search space by identifying those unique substructures. As a result, it runs at sub-linear complexity concerning the model size, making it a scalable solution for training large-scale networks. Our evaluations demonstrate that TAPAS outperforms the state-of-the-art automatic parallelism frameworks by up to $160\times$ in search speed on a wide range of models, and the performance of derived strategies is competitive or even better compared with the expert-engineered Megatron-LM library.

DCFeb 16, 2023
Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

Shiwei Zhang, Lansong Diao, Siyu Wang et al.

We present Rhino, a system for accelerating tensor programs with automatic parallelization on AI platform for real production environment. It transforms a tensor program written for a single device into an equivalent distributed program that is capable of scaling up to thousands of devices with no user configuration. Rhino firstly works on a semantically independent intermediate representation of tensor programs, which facilitates its generalization to unprecedented applications. Additionally, it implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores on a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), in addition to strided partitioning and pipeline parallelism on non-linear models. Aiming to efficiently search for a near-optimal parallel execution plan, our analysis of production clusters reveals general heuristics to speed up the strategy search. On top of it, two optimization levels are designed to offer users flexible trade-offs between the search time and strategy quality. Our experiments demonstrate that Rhino can not only re-discover the expert-crafted strategies of classic, research and production DL models, but also identify novel parallelization strategies which surpass existing systems for novel models.

LGOct 30, 2023
ROAM: memory-efficient large DNN training via optimized operator ordering and memory layout

Huiyao Shu, Ang Wang, Ziji Shi et al.

As deep learning models continue to increase in size, the memory requirements for training have surged. While high-level techniques like offloading, recomputation, and compression can alleviate memory pressure, they also introduce overheads. However, a memory-efficient execution plan that includes a reasonable operator execution order and tensor memory layout can significantly increase the models' memory efficiency and reduce overheads from high-level techniques. In this paper, we propose ROAM which operates on computation graph level to derive memory-efficient execution plan with optimized operator order and tensor memory layout for models. We first propose sophisticated theories that carefully consider model structure and training memory load to support optimization for large complex graphs that have not been well supported in the past. An efficient tree-based algorithm is further proposed to search task divisions automatically, along with delivering high performance and effectiveness to solve the problem. Experiments show that ROAM achieves a substantial memory reduction of 35.7%, 13.3%, and 27.2% compared to Pytorch and two state-of-the-art methods and offers a remarkable 53.7x speedup. The evaluation conducted on the expansive GPT2-XL further validates ROAM's scalability.

DCNov 6, 2024
ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks

Ziji Shi, Jialin Li, Yang You

Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.

LGJul 25, 2021
Go Wider Instead of Deeper

Fuzhao Xue, Ziji Shi, Futao Wei et al.

More transformer blocks with residual connections have recently achieved impressive results on various tasks. To achieve better performance with fewer trainable parameters, recent methods are proposed to go shallower by parameter sharing or model compressing along with the depth. However, weak modeling capacity limits their performance. Contrastively, going wider by inducing more trainable matrixes and parameters would produce a huge model requiring advanced parallelism to train and inference. In this paper, we propose a parameter-efficient framework, going wider instead of deeper. Specially, following existing works, we adapt parameter sharing to compress along depth. But, such deployment would limit the performance. To maximize modeling capacity, we scale along model width by replacing feed-forward network (FFN) with mixture-of-experts (MoE). Across transformer blocks, instead of sharing normalization layers, we propose to use individual layernorms to transform various semantic representations in a more parameter-efficient way. To evaluate our plug-and-run framework, we design WideNet and conduct comprehensive experiments on popular computer vision and natural language processing benchmarks. On ImageNet-1K, our best model outperforms Vision Transformer (ViT) by $1.5\%$ with $0.72 \times$ trainable parameters. Using $0.46 \times$ and $0.13 \times$ parameters, our WideNet can still surpass ViT and ViT-MoE by $0.8\%$ and $2.1\%$, respectively. On four natural language processing datasets, WideNet outperforms ALBERT by $1.8\%$ on average and surpass BERT using factorized embedding parameterization by $0.8\%$ with fewer parameters.

SYMay 9, 2018
A Collision-Free Path Planning Algorithm for Unmanned Aerial Vehicle Delivery

Ziji Shi, Wee Keong Ng

Path planning is important for the autonomy of Unmanned Aerial Vehicle (UAV), especially for scheduling UAV delivery. However, the operating environment of UAVs is usually uncertain and dynamic. Without proper planning, collisions may happen where multiple UAVs are congested. Besides, there may also be temporary no-fly zone setup by authorities that makes airspace unusable. Thus, proper pre-departure planning that avoids such places is needed. In this paper, we formulate this problem into a Constraint Satisfaction Problem to find a collision-free shortest path on a dynamic graph. We propose a collision-free path planning algorithm that is based on A* algorithm. The main novelty is that we invent a heuristic function that also considers waiting time. We later show that, with added waiting penalty, the proposed algorithm is optimal because the heuristic is admissible. Implementation of this algorithm simulates UAV delivery using Singapore's airspace structure. Our simulation exhibits desirable runtime performance. Using the proposed algorithm, the percentage of collision-free routes decreases as number of requests per unit area increases, and this percentage drops significantly at boundary value. Our empirical analysis could aid the decision-making of no-fly zone policy and infrastructure of UAV delivery.