ROMay 27
Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous DrivingQitao Weng, Heechul Yun
Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.
ARMar 27Code
Per-Bank Memory Bandwidth Regulation for Predictable and Performant Real-Time SystemConnor Rudy Sullivan, Amin Mamandipoor, Cole Ridge Strickler et al.
Modern multicore system-on-chips (SoCs) share off-chip DRAM across cores, where bank-level interference can significantly degrade performance and threaten real-time guarantees. While prior work has focused on per-core bandwidth regulation, these approaches treat main memory as a monolithic resource and overlook DRAM's inherent bank-level parallelism. We show that DRAM interference is fundamentally a bank-level phenomenon. We characterize the guaranteed bandwidth of modern DRAM, demonstrate that it remains effectively constant across generations, and show how this limitation can be exploited by single-bank attacks. These results highlight the need for bank-aware memory management for predictable and efficient real-time systems. We design and implement a novel per-bank memory bandwidth regulator in an open-source RISC-V SoC and evaluate it using FireSim with both synthetic and real-world workloads. Our evaluation demonstrates that per-bank regulation effectively mitigates adversarial bank contention and achieves a 5.74x average throughput improvement for best-effort workloads over traditional bank-oblivious approaches while providing the same-level of performance isolation guarantees for real-time workloads.
ARApr 14
Tensor Memory Engine: On-the-fly Data Reorganization for Ideal LocalityDenis Hoornaert, Cole Strickler, Manos Athanassoulis et al. · harvard
The shift to data-intensive processing from the cloud to the edge has introduced new challenges and expectations for the next generation of intelligent computing systems. As the memory wall continues to grow, modern systems can only meet these performance expectations by displaying data access patterns that exhibit ideal layouts in memory and ideal spatiotemporal locality in caches. However, only a few data-intensive applications are characterized by ideal locality. Instead, most applications exhibit either (i) poor locality when naively implemented and must undergo costly redesigns and tuning or (ii) inflated memory footprint to offer proper locality. To address the aforementioned challenges, we propose a hardware/software co-designed approach that can be implemented on commercially available SoC/FPGA platforms. Our approach seamlessly inserts in the CPUs' data path a Tensor Memory Engine that provides data with an ideal cache locality to running applications by (i) accessing the memory on behalf of the CPUs and (ii) composing a re-organized view of the memory layout. Unlike in- and near-memory computing approaches, it sets itself apart by clearly decoupling computing and memory accesses; computation is still performed on CPUs while the data re-organization is delegated to the Tensor Memory Engine.
OSApr 12, 2011
Deterministic Real-time Thread SchedulingHeechul Yun, Cheolgi Kim, Lui Sha
Race condition is a timing sensitive problem. A significant source of timing variation comes from nondeterministic hardware interactions such as cache misses. While data race detectors and model checkers can check races, the enormous state space of complex software makes it difficult to identify all of the races and those residual implementation errors still remain a big challenge. In this paper, we propose deterministic real-time scheduling methods to address scheduling nondeterminism in uniprocessor systems. The main idea is to use timing insensitive deterministic events, e.g, an instruction counter, in conjunction with a real-time clock to schedule threads. By introducing the concept of Worst Case Executable Instructions (WCEI), we guarantee both determinism and real-time performance.
CVAug 25, 2022
Anytime-Lidar: Deadline-aware 3D Object DetectionAhmet Soyyigit, Shuochao Yao, Heechul Yun
In this work, we present a novel scheduling framework enabling anytime perception for deep neural network (DNN) based 3D object detection pipelines. We focus on computationally expensive region proposal network (RPN) and per-category multi-head detector components, which are common in 3D object detection pipelines, and make them deadline-aware. We propose a scheduling algorithm, which intelligently selects the subset of the components to make effective time and accuracy trade-off on the fly. We minimize accuracy loss of skipping some of the neural network sub-components by projecting previously detected objects onto the current scene through estimations. We apply our approach to a state-of-art 3D object detection network, PointPillars, and evaluate its performance on Jetson Xavier AGX using nuScenes dataset. Compared to the baselines, our approach significantly improve the network's accuracy under various deadline constraints.
LGAug 23, 2022
DeepPicarMicro: Applying TinyML to Autonomous Cyber Physical SystemsMichael Bechtel, QiTao Weng, Heechul Yun
Running deep neural networks (DNNs) on tiny Micro-controller Units (MCUs) is challenging due to their limitations in computing, memory, and storage capacity. Fortunately, recent advances in both MCU hardware and machine learning software frameworks make it possible to run fairly complex neural networks on modern MCUs, resulting in a new field of study widely known as TinyML. However, there have been few studies to show the potential for TinyML applications in cyber physical systems (CPS). In this paper, we present DeepPicarMicro, a small self-driving RC car testbed, which runs a convolutional neural network (CNN) on a Raspberry Pi Pico MCU. We apply a state-of-the-art DNN optimization to successfully fit the well-known PilotNet CNN architecture, which was used to drive NVIDIA's real self-driving car, on the MCU. We apply a state-of-art network architecture search (NAS) approach to find further optimized networks that can effectively control the car in real-time in an end-to-end manner. From an extensive systematic experimental evaluation study, we observe an interesting relationship between the accuracy, latency, and control performance of a system. From this, we propose a joint optimization strategy that takes both accuracy and latency of a model in the network architecture search process for AI enabled CPS.
CVSep 17, 2024Code
VALO: A Versatile Anytime Framework for LiDAR-based Object Detection Deep Neural NetworksAhmet Soyyigit, Shuochao Yao, Heechul Yun
This work addresses the challenge of adapting dynamic deadline requirements for LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. Therefore, a tradeoff between detection accuracy and latency should be dynamically managed at runtime to achieve optimum results. In this paper, we introduce VALO (Versatile Anytime algorithm for LiDAR Object detection), a novel data-centric approach that enables anytime computing of 3D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on state-of-the-art 3D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. Code is available athttps://github.com/CSL-KU/VALO}{github.com/CSL-KU/VALO.
DCMar 5, 2019Code
Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSimFarzad Farshchi, Qijing Huang, Heechul Yun
NVDLA is an open-source deep neural network (DNN) accelerator which has received a lot of attention by the community since its introduction by Nvidia. It is a full-featured hardware IP and can serve as a good reference for conducting research and development of SoCs with integrated accelerators. However, an expensive FPGA board is required to do experiments with this IP in a real SoC. Moreover, since NVDLA is clocked at a lower frequency on an FPGA, it would be hard to do accurate performance analysis with such a setup. To overcome these limitations, we integrate NVDLA into a real RISC-V SoC on the Amazon cloud FPGA using FireSim, a cycle-exact FPGA-accelerated simulator. We then evaluate the performance of NVDLA by running YOLOv3 object-detection algorithm. Our results show that NVDLA can sustain 7.5 fps when running YOLOv3. We further analyze the performance by showing that sharing the last-level cache with NVDLA can result in up to 1.56x speedup. We then identify that sharing the memory system with the accelerator can result in unpredictable execution time for the real-time tasks running on this platform. We believe this is an important issue that must be addressed in order for on-chip DNN accelerators to be incorporated in real-time embedded systems.
NIJul 9, 2025
DAF: An Efficient End-to-End Dynamic Activation Framework for on-Device DNN TrainingRenyuan Liu, Yuyang Leng, Kaiyan Liu et al.
Recent advancements in on-device training for deep neural networks have underscored the critical need for efficient activation compression to overcome the memory constraints of mobile and edge devices. As activations dominate memory usage during training and are essential for gradient computation, compressing them without compromising accuracy remains a key research challenge. While existing methods for dynamic activation quantization promise theoretical memory savings, their practical deployment is impeded by system-level challenges such as computational overhead and memory fragmentation. To address these challenges, we introduce DAF, a Dynamic Activation Framework that enables scalable and efficient on-device training through system-level optimizations. DAF achieves both memory- and time-efficient dynamic quantization training by addressing key system bottlenecks. It develops hybrid reduction operations tailored to the memory hierarchies of mobile and edge SoCs, leverages collaborative CPU-GPU bit-packing for efficient dynamic quantization, and implements an importance-aware paging memory management scheme to reduce fragmentation and support dynamic memory adjustments. These optimizations collectively enable DAF to achieve substantial memory savings and speedup without compromising model training accuracy. Evaluations on various deep learning models across embedded and mobile platforms demonstrate up to a $22.9\times$ reduction in memory usage and a $3.2\times$ speedup, making DAF a scalable and practical solution for resource-constrained environments.
CRMay 21, 2020
Memory-Aware Denial-of-Service Attacks on Shared Cache in Multicore Real-Time SystemsMichael Bechtel, Heechul Yun
In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness for performing denial-of-service attacks on shared cache. Based on this insight, we introduce new cache DoS attacks, which can be mounted from the user-space and can cause extreme worst-case execution time (WCET) impacts to cross-core victims -- even if the shared cache is partitioned -- by taking advantage of the platform's memory address mapping information and HugePage support. We deploy these enhanced attacks on two popular embedded out-of-order multicore platforms using both synthetic and real-world benchmarks. The proposed DoS attacks achieve up to 111X WCET increases on the tested platforms.
CRMar 27, 2020
SpectreRewind: Leaking Secrets to Past InstructionsJacob Fustos, Michael Bechtel, Heechul Yun
Transient execution attacks utilize micro-architectural covert channels to leak secrets that should not have been accessible during logical program execution. Commonly used micro-architectural covert channels are those that leave lasting footprints in the microarchitectural state, for example, a cache state change, from which the secret is recovered after the transient execution is completed. In this paper, we present SpectreRewind, a new approach to create contention based covert channels for transient execution attacks. In our approach, a covert channel is established by issuing the necessary instructions logically before the transiently executed victim code. Unlike prior contention based covert channels, which require simultaneous multi-threading (SMT), SpectreRewind supports single hardware thread based covert channels, making it viable on systems where attacker cannot utilize SMT. We show that contention on the floating point division unit on commodity processors can be used to create a high-performance (~100 KB/s), low-noise covert channel for transient execution attacks instead of commonly used flush+reload based cache covert channels. We implement a Meltdown attack utilizing the proposed covert channel showing competitive performance compared to the stateof-the-art cache based covert channel implementation. We also show that the covert channel works in the JavaScript engine of a Chrome browser.
CRFeb 26, 2012
S3A: Secure System Simplex Architecture for Enhanced Security of Cyber-Physical SystemsSibin Mohan, Stanley Bak, Emiliano Betti et al.
Until recently, cyber-physical systems, especially those with safety-critical properties that manage critical infrastructure (e.g. power generation plants, water treatment facilities, etc.) were considered to be invulnerable against software security breaches. The recently discovered 'W32.Stuxnet' worm has drastically changed this perception by demonstrating that such systems are susceptible to external attacks. Here we present an architecture that enhances the security of safety-critical cyber-physical systems despite the presence of such malware. Our architecture uses the property that control systems have deterministic execution behavior, to detect an intrusion within 0.6 μs while still guaranteeing the safety of the plant. We also show that even if an attack is successful, the overall state of the physical system will still remain safe. Even if the operating system's administrative privileges have been compromised, our architecture will still be able to protect the physical system from coming to harm.