John Bachan

CE Nov 28, 2017

ExaGridPF: A Parallel Power Flow Solver for Transmission and Unbalanced Distribution Systems

Bin Wang, John Bachan, Cy Chan

This paper investigates parallelization strategies for solving power flow problems in both transmission and unbalanced, three-phase distribution systems by developing a scalable power flow solver, ExaGridPF, which is compatible with existing high-performance computing platforms. Newton-Raphson (NR) and Newton-Krylov (NK) algorithms have been implemented to verify the performance improvement on both standard IEEE test cases and synthesized grid topologies. For three-phase, unbalanced systems, we adapt the current injection method (CIM) to model the power flow and use SuperLU to parallelize the computing load across multiple threads. The experimental results indicate that a speedup of more than 5x can be achieved for synthesized large-scale transmission topologies, and significant efficiency improvements are observed over existing methods for the distribution networks.
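As a rough illustration of the Newton-Raphson iteration the abstract mentions, the sketch below solves a balanced two-bus power flow in per-unit. It is not ExaGridPF's implementation: the function names, the line impedance, and the load values are hypothetical, the Jacobian is approximated by finite differences rather than derived analytically, and the dense `numpy.linalg.solve` call stands in for the sparse SuperLU factorization a real solver would use.

```python
import numpy as np

def mismatch(x, Y, v_slack, s_spec):
    """Power mismatch at the PQ buses; x = [angles, magnitudes] of those buses."""
    n = len(s_spec)
    theta, vm = x[:n], x[n:]
    # Bus 0 is the slack bus with a fixed complex voltage.
    v = np.concatenate(([v_slack], vm * np.exp(1j * theta)))
    s_calc = v * np.conj(Y @ v)          # complex power injected at each bus
    ds = s_calc[1:] - s_spec             # only PQ buses contribute equations
    return np.concatenate((ds.real, ds.imag))

def newton_raphson(Y, v_slack, s_spec, tol=1e-8, max_iter=20, h=1e-7):
    """Newton-Raphson with a finite-difference Jacobian (illustrative only)."""
    n = len(s_spec)
    x = np.concatenate((np.zeros(n), np.ones(n)))   # flat start: 1.0 pu, 0 rad
    for iteration in range(max_iter):
        f = mismatch(x, Y, v_slack, s_spec)
        if np.max(np.abs(f)) < tol:
            return x, iteration
        jac = np.empty((2 * n, 2 * n))
        for k in range(2 * n):
            step = np.zeros(2 * n)
            step[k] = h
            jac[:, k] = (mismatch(x + step, Y, v_slack, s_spec) - f) / h
        # Dense solve here; a large-scale solver would factor a sparse Jacobian.
        x = x - np.linalg.solve(jac, f)
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical two-bus case: slack bus feeding one load over a single line.
y_line = 1.0 / (0.01 + 0.05j)                        # assumed line admittance, pu
Y = np.array([[y_line, -y_line], [-y_line, y_line]])
v_slack = 1.0 + 0.0j
s_spec = np.array([-(0.5 + 0.2j)])                   # a load is a negative injection

x, iterations = newton_raphson(Y, v_slack, s_spec)
theta, vm = x[0], x[1]
v2 = vm * np.exp(1j * theta)
s2 = v2 * np.conj(y_line * (v2 - v_slack))           # recomputed injection at bus 2
print(f"|V2| = {vm:.4f} pu, angle = {np.degrees(theta):.3f} deg, iterations = {iterations}")
```

The linear solve inside the loop dominates the cost as the network grows, which is the step the abstract describes parallelizing.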

DC Nov 19, 2025

GPU-Initiated Networking for NCCL

Khaled Hamidouche, John Bachan, Pak Markthub et al.

Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations, a characteristic of the CUDA runtime. Although this model is robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, and semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) device-side APIs for remote memory operations callable from CUDA kernels; and iii) a network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.
