LGJul 16, 2025
BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM TrainingRui Li, Xiaoyun Zhi, Jinxin Chi et al.
Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
CRDec 19, 2021
An Architecture for Exploiting Native User-Land Checkpoint-Restart to Improve FuzzingPrashant Singh Chouhan, Gregory Price, Gene Cooperman
Fuzzing is one of the most popular and widely used techniques to find vulnerabilities in any application. Fuzzers are fast enough, but they still spend a good portion of time to restart a crashed application and then fuzz it from the beginning. Fuzzing an application from a point deeper in the execution is also important. To do this, a user needs to take a snapshot of the program while fuzzing it on top of an emulator, virtual machine, or by utilizing a special kernel module to enable checkpointing. Even with this ability, it can be difficult to attach a fuzzer after restoring a checkpoint. As a result, most fuzzers leverage a form of fork-server design. We propose a novel testing architecture that allows users to attach a fuzzer after the program has started running. We do this by natively checkpointing the target application at a point of interest, and attaching the fuzzer after restoring the checkpoint. A fork-server may even be engaged at the point of restoration. This not only improves the throughput of the fuzzing campaign by minimizing startup time, but opens up a new way to fuzz applications. With this architecture, a user can take a series of checkpoints at points of interest, and run parallel tests to reduce the overall state-complexity of an individual test. Checkpoints allow us to begin fuzzing from a deeper point in the execution path, omitting prior execution from the required coverage path. This and other checkpointing techniques are described in the paper to help improve fuzzing.
DCJul 13, 2019
A Secure Cloud with Minimal Provider TrustAmin Mosayyebzadeh, Gerardo Ravago, Apoorve Mohan et al.
Bolted is a new architecture for a bare metal cloud with the goal of providing security-sensitive customers of a cloud the same level of security and control that they can obtain in their own private data centers. It allows tenants to elastically allocate secure resources within a cloud while being protected from other previous, current, and future tenants of the cloud. The provisioning of a new server to a tenant isolates a bare metal server, only allowing it to communicate with other tenant's servers once its critical firmware and software have been attested to the tenant. Tenants, rather than the provider, control the tradeoffs between security, price, and performance. A prototype demonstrates scalable end-to-end security with small overhead compared to a less secure alternative.
DCJul 13, 2019
Supporting Security Sensitive Tenants in a Bare-Metal CloudAmin Mosayyebzadeh, Apoorve Mohan, Sahil Tikale et al.
Bolted is a new architecture for bare-metal clouds that enables tenants to control tradeoffs between security, price, and performance. Security-sensitive tenants can minimize their trust in the public cloud provider and achieve similar levels of security and control that they can obtain in their own private data centers. At the same time, Bolted neither imposes overhead on tenants that are security insensitive nor compromises the flexibility or operational efficiency of the provider. Our prototype exploits a novel provisioning system and specialized firmware to enable elasticity similar to virtualized clouds. Experimentally we quantify the cost of different levels of security for a variety of workloads and demonstrate the value of giving control to the tenant.