Peter Pietzuch

h-index41

10papers

562citations

Novelty60%

AI Score36

Ranked #99,911 of 194,257 authors (top 51%)#2,460 in CR (top 36%)

10 Papers

6.9LGOct 3, 2022

MSRL: Distributed Reinforcement Learning with Dataflow Fragments

Huanzhou Zhu, Bo Zhao, Gang Chen et al.

Reinforcement learning (RL) trains many agents, which is resource-intensive and must scale to large GPU clusters. Different RL training algorithms offer different opportunities for distributing and parallelising the computation. Yet, current distributed RL systems tie the definition of RL algorithms to their distributed execution: they hard-code particular distribution strategies and only accelerate specific parts of the computation (e.g. policy network updates) on GPU workers. Fundamentally, current systems lack abstractions that decouple RL algorithms from their execution. We describe MindSpore Reinforcement Learning (MSRL), a distributed RL training system that supports distribution policies that govern how RL training computation is parallelised and distributed on cluster resources, without requiring changes to the algorithm implementation. MSRL introduces the new abstraction of a fragmented dataflow graph, which maps Python functions from an RL algorithm's training loop to parallel computational fragments. Fragments are executed on different devices by translating them to low-level dataflow representations, e.g. computational graphs as supported by deep learning engines, CUDA implementations or multi-threaded CPU processes. We show that MSRL subsumes the distribution strategies of existing systems, while scaling RL training to 64 GPUs.

13.0DCDec 8, 2023Code

Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections

Marcel Wagenländer, Guo Li, Bo Zhao et al.

Deep learning (DL) jobs use multi-dimensional parallelism, i.e. combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource elasticity during training adds or removes GPUs; (ii) hardware maintenance may require redeployment on different GPUs; and (iii) GPU failures force jobs to run with fewer devices. Current DL frameworks tie jobs to a set of GPUs and thus lack support for these scenarios. In particular, they cannot change the multi-dimensional parallelism of an already-running job in an efficient and model-independent way. We describe Scalai, a state management library for DL systems that enables jobs to change their parallelism dynamically after the GPU allocation is updated at runtime. Scalai achieves this through a new abstraction, a parallelizable tensor collection (PTC), that externalizes the job state during training. After a GPU change, Scalai uses the PTC to transform the job state: the PTC repartitions the dataset state under data parallelism and exposes it to DL workers through a virtual file system; and the PTC obtains the model state as partitioned checkpoints and transforms them to reflect the new parallelization configuration. For efficiency, Scalai executes PTC transformations in parallel with minimum data movement between workers. Our experiments show that Scalai enables DL jobs to support dynamic parallelization with low overhead.

2.3DCJan 9, 2025

Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs

Pedro F. Silvestre, Peter Pietzuch

Deep learning (DL) algorithms are often defined in terms of temporal relationships: a tensor at one timestep may depend on tensors from earlier or later timesteps. Such dynamic dependencies (and corresponding dynamic tensor shapes) are difficult to express and optimize: while eager DL systems support such dynamism, they cannot apply compiler-based optimizations; graph-based systems require static tensor shapes, which forces users to pad tensors or break-up programs into multiple static graphs. We describe Tempo, a new DL system that combines the dynamism of eager execution with the whole-program optimizations of graph-based compilation. Tempo achieves this through a declarative programming model with recurrent tensors, which include explicit temporal dimensions. Temporal dimensions can be indexed using symbolic expressions to express dynamic dependencies on past and future tensors. Based on this, Tempo constructs a symbolic dependence graph, which concisely encodes dynamic dependencies between operators, and applies whole-program optimizations, such as algebraic simplifications, vectorization, tiling, and fusion. By tiling dynamic dependencies into static-size blocks, Tempo can also reuse existing static code-generators. It then uses a polyhedral model to find a feasible execution schedule, which includes memory management operations. We show that Tempo achieves a 7$\times$ speedup over JAX for Llama-3.2-3B decoding; for reinforcement learning algorithms, Tempo achieves a 54$\times$ speedup, with 16$\times$ lower peak memory usage.

2.3CRDec 13, 2024

VerifiableFL: Verifiable Claims for Federated Learning using Exclaves

Jinnan Guo, Kapil Vaswani, Andrew Paverd et al.

In federated learning (FL), data providers jointly train a machine learning model without sharing their training data. This makes it challenging to provide verifiable claims about properties of the final trained FL model, e.g., related to the employed training data, the used data sanitization, or the correct training algorithm -- a malicious data provider can simply deviate from the correct training protocol without being detected. While prior FL training systems have explored the use of trusted execution environments (TEEs) to combat such attacks, existing approaches struggle to link attestation proofs from TEEs robustly and effectively with claims about the trained FL model. TEEs have also been shown to suffer from a wide range of attacks, including side-channel attacks. We describe VerifiableFL, a system for training FL models that provides verifiable claims about trained models with the help of runtime attestation proofs. VerifiableFL generates such proofs using the new abstraction of exclaves, which are integrity-only execution environments without any secrets, thus making them immune to data leakage attacks. Whereas previous approaches only attested whole TEEs statically, i.e., at deployment time, VerifiableFL uses exclaves to attest individual data transformations during FL training. These runtime attestation proofs then form an attested dataflow graph of the entire FL model training computation. The graph can be checked by an auditor to ensure that the trained FL model satisfies its verifiable claims, such as the use of particular data sanitization by data providers or aggregation strategy by the model provider. We implement VerifiableFL by extending NVIDIA's NVFlare FL framework to use exclaves, and show that VerifiableFL introduces less than 10% overhead compared to unprotected FL model training.

9.7DCMay 18, 2023Code

Quiver: Supporting GPUs for Low-Latency, High-Throughput GNN Serving with Workload Awareness

Zeyuan Tan, Xiulong Yuan, Congjie He et al.

Systems for serving inference requests on graph neural networks (GNN) must combine low latency with high throughout, but they face irregular computation due to skew in the number of sampled graph nodes and aggregated GNN features. This makes it challenging to exploit GPUs effectively: using GPUs to sample only a few graph nodes yields lower performance than CPU-based sampling; and aggregating many features exhibits high data movement costs between GPUs and CPUs. Therefore, current GNN serving systems use CPUs for graph sampling and feature aggregation, limiting throughput. We describe Quiver, a distributed GPU-based GNN serving system with low-latency and high-throughput. Quiver's key idea is to exploit workload metrics for predicting the irregular computation of GNN requests, and governing the use of GPUs for graph sampling and feature aggregation: (1) for graph sampling, Quiver calculates the probabilistic sampled graph size, a metric that predicts the degree of parallelism in graph sampling. Quiver uses this metric to assign sampling tasks to GPUs only when the performance gains surpass CPU-based sampling; and (2) for feature aggregation, Quiver relies on the feature access probability to decide which features to partition and replicate across a distributed GPU NUMA topology. We show that Quiver achieves up to 35 times lower latency with an 8 times higher throughput compared to state-of-the-art GNN approaches (DGL and PyG).

13.4OSAug 29, 2019

SGX-LKL: Securing the Host OS Interface for Trusted Execution

Christian Priebe, Divya Muthukumaran, Joshua Lind et al.

Hardware support for trusted execution in modern CPUs enables tenants to shield their data processing workloads in otherwise untrusted cloud environments. Runtime systems for the trusted execution must rely on an interface to the untrusted host OS to use external resources such as storage, network, and other functions. Attackers may exploit this interface to leak data or corrupt the computation. We describe SGX-LKL, a system for running Linux binaries inside of Intel SGX enclaves that only exposes a minimal, protected and oblivious host interface: the interface is (i) minimal because SGX-LKL uses a complete library OS inside the enclave, including file system and network stacks, which requires a host interface with only 7 calls; (ii) protected because SGX-LKL transparently encrypts and integrity-protects all data passed via low-level I/O operations; and (iii) oblivious because SGX-LKL performs host operations independently of the application workload. For oblivious disk I/O, SGX-LKL uses an encrypted ext4 file system with shuffled disk blocks. We show that SGX-LKL protects TensorFlow training with a 21% overhead.

13.0CRJun 17, 2019

Using Trusted Execution Environments for Secure Stream Processing of Medical Data

Carlos Segarra, Ricard Delgado-Gonzalo, Mathieu Lemay et al.

Processing sensitive data, such as those produced by body sensors, on third-party untrusted clouds is particularly challenging without compromising the privacy of the users generating it. Typically, these sensors generate large quantities of continuous data in a streaming fashion. Such vast amount of data must be processed efficiently and securely, even under strong adversarial models. The recent introduction in the mass-market of consumer-grade processors with Trusted Execution Environments (TEEs), such as Intel SGX, paves the way to implement solutions that overcome less flexible approaches, such as those atop homomorphic encryption. We present a secure streaming processing system built on top of Intel SGX to showcase the viability of this approach with a system specifically fitted for medical data. We design and fully implement a prototype system that we evaluate with several realistic datasets. Our experimental results show that the proposed system achieves modest overhead compared to vanilla Spark while offering additional protection guarantees under powerful attackers and threat models.

13.8DCJan 8, 2019Code

CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers

Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich et al.

Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific. We describe CROSSBOW, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size - however small - while scaling to multiple GPUs. CROSSBOW uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model. CROSSBOW achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Our experiments show that CROSSBOW improves the training time of deep learning models on an 8-GPU server by 1.3-4x compared to TensorFlow.

27.0CRJul 18, 2017

Teechain: A Secure Payment Network with Asynchronous Blockchain Access

Joshua Lind, Oded Naor, Ittay Eyal et al.

Blockchains such as Bitcoin and Ethereum execute payment transactions securely, but their performance is limited by the need for global consensus. Payment networks overcome this limitation through off-chain transactions. Instead of writing to the blockchain for each transaction, they only settle the final payment balances with the underlying blockchain. When executing off-chain transactions in current payment networks, parties must access the blockchain within bounded time to detect misbehaving parties that deviate from the protocol. This opens a window for attacks in which a malicious party can steal funds by deliberately delaying other parties' blockchain access and prevents parties from using payment networks when disconnected from the blockchain. We present Teechain, the first layer-two payment network that executes off-chain transactions asynchronously with respect to the underlying blockchain. To prevent parties from misbehaving, Teechain uses treasuries, protected by hardware trusted execution environments (TEEs), to establish off-chain payment channels between parties. Treasuries maintain collateral funds and can exchange transactions efficiently and securely, without interacting with the underlying blockchain. To mitigate against treasury failures and to avoid having to trust all TEEs, Teechain replicates the state of treasuries using committee chains, a new variant of chain replication with threshold secret sharing. Teechain achieves at least a 33x higher transaction throughput than the state-of-the-art Lightning payment network. A 30-machine Teechain deployment can handle over 1 million Bitcoin transactions per second.

23.1CRDec 22, 2016

Teechan: Payment Channels Using Trusted Execution Environments

Joshua Lind, Ittay Eyal, Peter Pietzuch et al.

Blockchain protocols are inherently limited in transaction throughput and latency. Recent efforts to address performance and scale blockchains have focused on off-chain payment channels. While such channels can achieve low latency and high throughput, deploying them securely on top of the Bitcoin blockchain has been difficult, partly because building a secure implementation requires changes to the underlying protocol and the ecosystem. We present Teechan, a full-duplex payment channel framework that exploits trusted execution environments. Teechan can be deployed securely on the existing Bitcoin blockchain without having to modify the protocol. It: (i) achieves a higher transaction throughput and lower transaction latency than prior solutions; (ii) enables unlimited full-duplex payments as long as the balance does not exceed the channel's credit; (iii) requires only a single message to be sent per payment in any direction; and (iv) places at most two transactions on the blockchain under any execution scenario. We have built and deployed the Teechan framework using Intel SGX on the Bitcoin network. Our experiments show that, not counting network latencies, Teechan can achieve 2,480 transactions per second on a single channel, with sub-millisecond latencies.