Jesper Larsson Träff

2papers

2 Papers

7.5DCMay 28
Effective MPI: User-defined Datatypes and Cartesian Communicators for Zero-copy All-to-all Communication in Multidimensional Tori

Jesper Larsson Träff

We present and show how to implement a non-trivial all-to-all communication algorithm for arbitrary $d$-dimensional tori effectively in MPI. Given a factorization of the number of processes $p$ into $d$ factors that can be mapped onto a $d$-dimensional torus, we first utilize a Cartesian communicator to split a given $p$-process MPI communicator into, for each MPI process, $d$ smaller communicators spanning each of the dimensions of the torus to which the process belongs, and cache these communicators in order to avoid expensive splitting at each all-to-all operation. The all-to-all operation itself is decomposed into a sequence of $d$ MPI_Alltoall operations on the dimension-wise communicators. The non-trivial data rearrangement before and after each MPI_Alltoall call is implicit only and effected by MPI derived datatypes. This makes the implementation of the algorithm formally \emph{zero-copy}, meaning that no explicit process-local reordering of data blocks ever has to be performed. In order to achieve this, the algorithm employs a double-buffering scheme with modest temporary buffer requirements. By choosing the factorization of $p$ and selecting appropriate implementations for the component MPI_Alltoall operations, the presented implementation gives ample opportunities for algorithm tuning and adaptation to the particular high-performance system. A few, select experimental results show competitive performance with native MPI_Alltoall implementations and illustrate problems that common MPI_Alltoall implementations may have.

8.1DSApr 28
Two Efficient Message-passing Exclusive Scan Algorithms

Jesper Larsson Träff

Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, possibly expensive, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ or $\lceil\log_2 (p-1)\rceil$ send-receive communication rounds are required to perform the scans. While there are well-known, simple algorithms for the inclusive scan that solve the problem in $\lceil\log_2 p\rceil$ send-receive communication rounds with $\lceil\log_2 p\rceil$ applications of the $\oplus$ operator, the exclusive scan is different and has been much less addressed. By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the $\oplus$ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the $\oplus$ operator. The smallest number of inclusive scan rounds with $q=\lceil\log_2 p\rceil$ rounds in total is $q'\geq q-\log_2(2^q-p+1)$. The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the $\oplus$ operator is dependent on the number of bits set (popcount of) in $p-1$. Both algorithms are relevant for small(er) input vectors where performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.