Elena Agostini

2papers

2 Papers

DCNov 19, 2025
GPU-Initiated Networking for NCCL

Khaled Hamidouche, John Bachan, Pak Markthub et al.

Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations - a characteristic of the CUDA runtime. Although robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) Device-side APIs for remote memory operations callable from CUDA kernels; and iii) A network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.

CRJan 4, 2019
BitCracker: BitLocker meets GPUs

Elena Agostini, Massimo Bernaschi

BitLocker is a full-disk encryption feature available in recent Windows versions. It is designed to protect data by providing encryption for entire volumes and it makes use of a number of different authentication methods. In this paper we present a solution, named BitCracker, to attempt the decryption, by means of a dictionary attack, of memory units encrypted by BitLocker with a user supplied password or the recovery password. To that purpose, we resort to GPU (Graphics Processing Units) that are, by now, widely used as general-purpose coprocessors in high performance computing applications. BitLocker decryption process requires the computation of a very large number of SHA- 256 hashes and also AES, so we propose a very fast solution, highly tuned for Nvidia GPU, for both of them. We analyze the performance of our CUDA implementation on several Nvidia GPUs and we carry out a comparison of our SHA-256 hash with the Hashcat password cracker tool. Finally, we present our OpenCL version, recently released as a plugin of the John The Ripper tool.