6.7PLMay 14
Mat2Boundary: Treating User-Defined Boundary Condition as SpMV for Distributed PDE Solvers on Block-Structured GridsYanzheng Cai, Mingzhe Zhang, Shengqi Chen et al.
Boundary-condition (BC) handling is a major source of complexity in PDE solvers on structured and block-structured grids, especially for high-order methods and distributed-memory execution. We present Mat2Boundary, a DSL and compiler for boundary computations that models a broad class of boundary-conditions as affine sparse linear operators. This abstraction unifies halo copying, circular and symmetric mappings, zero padding, block-edge synchronization, and user-defined interpolation, while exposing a modular basic sub-matrix interface for declarative composition. To make this representation efficient, Mat2Boundary combines multi-stage programming and polyhedral analysis to generate matrix-free kernels for structured cases, support user-defined sparse matrices for irregular cases, eliminate redundant boundary work, and synthesize reusable communication schedules for distributed execution. Evaluated on two shallow-water equation solvers on cubed-sphere grids and HPCG, Mat2Boundary achieves up to 7.6$\times$ BC-kernel speedup, reduces BC code by over 70%, and scales to 1,344 CPU cores with 72%-88% efficiency.
CLJun 20, 2021Code
CPM-2: Large-scale Cost-effective Pre-trained Language ModelsZhengyan Zhang, Yuxian Gu, Xu Han et al.
In recent years, the size of pre-trained language models (PLMs) has grown by leaps and bounds. However, efficiency issues of these large-scale PLMs limit their utilization in real-world scenarios. We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. (1) We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. (2) We explore the best practice of prompt tuning with large-scale PLMs. Compared with conventional fine-tuning, prompt tuning significantly reduces the number of task-specific parameters. (3) We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources. Based on our cost-effective pipeline, we pre-train two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks. Experimental results show that CPM-2 has excellent general language intelligence. Moreover, we validate the efficiency of InfMoE when conducting inference of large-scale models having tens of billions of parameters on a single GPU. All source code and model parameters are available at https://github.com/TsinghuaAI/CPM.