75.4 DC Apr 8
Nexus: Transparent I/O Offloading for High-Density Serverless Computing
JooYoung Park, Kevin Nguetchouang, Jovan Stojkovic et al.
Serverless computing relies on extreme multi-tenancy to remain economically viable, driving providers to rely on virtual machines (VMs) that ensure strong isolation and seamless ecosystem compatibility with the FaaS programming model. However, current architectures tightly couple application processing logic with I/O processing, forcing every VM to duplicate a heavy communication fabric (cloud SDK, RPC, and TCP/IP). Our analysis reveals this duplication consumes over 25% of a function's memory footprint, and may double the CPU cycles in VMs compared to bare-metal execution. While prior systems attempt to solve this using WebAssembly or library OSes, they naively sacrifice ecosystem compatibility, forcing developers to migrate code and dependencies to new languages. We introduce Nexus, a serverless-native KVM-based hypervisor that transparently decouples compute from I/O. Nexus shifts the execution model by intercepting communication fabric at the API boundary and offloading it to an always-on host shared backend via zero-copy shared memory. This removes the heavyweight communication fabric from the guest VM, while preserving the conventional serverless programming model. By structurally separating these domains, Nexus unlocks asynchronous I/O optimizations: overlapping input payload prefetching with VM restoration from a snapshot and writing output payloads back to storage off the critical path. Compared to the production baseline, Nexus reduces overall node-level CPU and memory consumption by up to 44% and 31%, respectively, thus increasing deployment density by 37%. Also, Nexus reduces warm- and cold-start latency by 39% and 10%, respectively, bringing the response time within 20% of that of a WASM-based, ecosystem-incompatible hypervisor.
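The asynchronous I/O pattern described above (prefetching the input payload while the VM is restored, and writing the output back off the critical path) can be illustrated with a minimal sketch. This is not Nexus code: `restore_vm`, `prefetch_input`, `write_back`, and the in-memory `store` are hypothetical stand-ins, and the sleeps only simulate latency.

```python
import asyncio


async def restore_vm(snapshot_id):
    # Hypothetical stand-in for restoring a guest VM from a snapshot.
    await asyncio.sleep(0.05)
    return f"vm:{snapshot_id}"


async def prefetch_input(key):
    # Hypothetical stand-in for the host backend fetching the input payload.
    await asyncio.sleep(0.05)
    return f"payload:{key}"


async def write_back(store, key, value):
    # Hypothetical stand-in for persisting the output payload to storage.
    await asyncio.sleep(0.05)
    store[key] = value


async def invoke(snapshot_id, in_key, out_key, store):
    # Overlap VM restoration with input prefetching instead of running them serially.
    vm, payload = await asyncio.gather(restore_vm(snapshot_id), prefetch_input(in_key))
    result = f"{vm} processed {payload}"
    # Start the write-back but do not wait for it before responding.
    pending = asyncio.create_task(write_back(store, out_key, result))
    return result, pending


async def main():
    store = {}
    result, pending = await invoke("snap-1", "in", "out", store)
    responded_before_write = "out" not in store  # response is ready before the write lands
    await pending  # drain the background write before shutting down
    return result, responded_before_write, store


result, responded_before_write, store = asyncio.run(main())
```

In this sketch the response is available as soon as the function finishes, while the write completes in the background; a real system would also need to handle write failures and durability guarantees, which are not modeled here.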
91.0 DC Apr 7
CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics
Yulin Zou, Yan Chen, Wenyan Chen et al.
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
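As a rough illustration of codec-guided patch pruning, the sketch below keeps only the patches whose codec-reported motion exceeds a threshold, so that static patches would be skipped before ViT encoding. The abstract does not say which codec metadata or selection rule CoStream uses; the per-patch motion magnitudes, the `select_patches` helper, and the threshold are assumptions made for illustration.

```python
def select_patches(motion, threshold):
    """Return (row, col) indices of patches whose motion magnitude exceeds threshold.

    `motion` is a hypothetical 2D grid of per-patch motion magnitudes,
    assumed to be derived from codec metadata such as motion vectors.
    """
    return [
        (r, c)
        for r, row in enumerate(motion)
        for c, magnitude in enumerate(row)
        if magnitude > threshold
    ]


# Toy 3x4 patch grid: most of the frame is static.
motion = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 2.5, 3.1, 0.0],
    [0.0, 0.2, 1.8, 0.0],
]

kept = select_patches(motion, threshold=1.0)
total_patches = sum(len(row) for row in motion)
pruned_fraction = 1 - len(kept) / total_patches
```

Here only three of twelve patches would be passed on, so the visual encoder would process a quarter of the frame in this toy case.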
LG Oct 24, 2021
Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance
XinYu Piao, DoangJoo Synn, JooYoung Park et al.
Recent deep learning models are difficult to train with a large batch size because commodity machines may not have enough memory to hold both the model and a large data batch. The batch size is a training hyper-parameter, and it is limited by the memory capacity of the target machine: a batch must fit into the memory that remains after the model is loaded. The size of each data item also matters, because the larger each item is, the smaller the batch that fits into the remaining memory. This paper proposes a method called Micro-Batch Processing (MBP) to address this problem. MBP splits a batch into micro-batches that fit in the remaining memory and processes them sequentially. After the micro-batches are processed individually, a loss normalization algorithm based on gradient accumulation is used to maintain model performance. The purpose of our method is to allow deep learning models to train with batch sizes that exceed the memory capacity of a system, without increasing the memory size or using multiple devices (GPUs).
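The idea of splitting a batch and accumulating normalized gradients can be sketched on a toy linear model. This is a minimal pure-Python illustration under stated assumptions, not the paper's algorithm: the model `w * x + b`, the mean-squared-error loss, and the helper names are hypothetical. The key point shown is that each micro-batch loss is divided by the full batch size, so the accumulated gradient matches the full-batch gradient even when the last micro-batch is smaller.

```python
def mse_grad(w, b, xs, ys, normalizer):
    """Gradient of sum((w*x + b - y)^2) / normalizer with respect to (w, b)."""
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / normalizer
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / normalizer
    return gw, gb


def micro_batch_grad(w, b, xs, ys, micro_size):
    """Process the batch in sequential micro-batches and accumulate gradients."""
    n = len(xs)
    gw_acc, gb_acc = 0.0, 0.0
    for start in range(0, n, micro_size):
        mx = xs[start:start + micro_size]
        my = ys[start:start + micro_size]
        # Normalize by the FULL batch size, not the micro-batch size,
        # so the sum over micro-batches equals the full-batch mean loss gradient.
        gw, gb = mse_grad(w, b, mx, my, normalizer=n)
        gw_acc += gw
        gb_acc += gb
    return gw_acc, gb_acc


# Batch of 7 samples split into micro-batches of 3, 3, and 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

full = mse_grad(0.5, 0.0, xs, ys, normalizer=len(xs))
accumulated = micro_batch_grad(0.5, 0.0, xs, ys, micro_size=3)
```

Only one micro-batch needs to be resident at a time, which is what lets the effective batch size exceed what fits in memory; the optimizer step would be applied once, after all micro-batches have been accumulated.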
CR Jul 2, 2016
Identifying ECUs Using Inimitable Characteristics of Signals in Controller Area Networks
Wonsuk Choi, Hyo Jin Jo, Samuel Woo et al.
In the last several decades, the automotive industry has come to incorporate the latest information and communications technology (ICT), increasingly replacing mechanical components of vehicles with electronic components. These electronic control units (ECUs) communicate with each other in an in-vehicle network that makes the vehicle both safer and easier to drive. Controller Area Networks (CANs) are the current standard for such high-quality in-vehicle communication. However, CANs do not currently offer protection against security attacks. In particular, they do not allow for message authentication and hence are open to attacks that replay ECU messages for malicious purposes. Applying the classic cryptographic method of a message authentication code (MAC) is not feasible, since the CAN data frame is not long enough to include a MAC of sufficient length to provide effective authentication. In this paper, we propose a novel identification method that works in the physical layer of an in-vehicle CAN network. Our method identifies ECUs using inimitable characteristics of signals, enabling detection of a compromised or alien ECU being used in a replay attack. Unlike previous attempts to address security issues in the in-vehicle CAN network, our method works by simply adding a monitoring unit to the existing network, making it deployable in current systems and compliant with required CAN standards. Our experimental results show that the bit string and classification algorithm that we utilized yielded more accurate identification of compromised ECUs than any other method proposed to date. The false positive rate is more than 2 times lower than that of the method proposed by P.-S. Murvay et al. This paper is also the first to identify potential attack models that systems should be able to detect.