AR · Sep 1, 2022 · Code
Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction
Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran et al.
Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.
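The off-chip predictor described above can be pictured as a small hashed perceptron. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the three features (load PC, PC XOR cache-line index within the page, and a hash of recent PCs), the table size, the thresholds, and the weight range are all hypothetical values chosen for illustration.

```python
from collections import deque


class OffChipLoadPredictor:
    """Illustrative hashed-perceptron predictor (all parameters are assumptions)."""

    def __init__(self, table_size=1024, history_len=2,
                 activation_threshold=0, training_threshold=8, weight_limit=15):
        self.table_size = table_size
        self.activation_threshold = activation_threshold
        self.training_threshold = training_threshold
        self.weight_limit = weight_limit
        self.pc_history = deque(maxlen=history_len)
        # One weight table per program feature.
        self.tables = [[0] * table_size for _ in range(3)]

    def _indices(self, pc, addr):
        # Assumes 64-byte cache lines and 4 KiB pages (6-bit line index).
        line_in_page = (addr >> 6) & 0x3F
        hist = 0
        for old_pc in self.pc_history:
            hist = (hist * 31 + old_pc) & 0xFFFFFFFF
        features = (pc, pc ^ line_in_page, hist)
        return [f % self.table_size for f in features]

    def predict(self, pc, addr):
        """Return (predicted_off_chip, state); pass state to train() later."""
        indices = self._indices(pc, addr)
        total = sum(table[i] for table, i in zip(self.tables, indices))
        self.pc_history.append(pc)
        return total >= self.activation_threshold, (indices, total)

    def train(self, state, went_off_chip):
        """Update weights once the true outcome of the load is known."""
        indices, total = state
        predicted = total >= self.activation_threshold
        # Train on a misprediction or when the sum is not yet confident.
        if predicted != went_off_chip or abs(total) < self.training_threshold:
            step = 1 if went_off_chip else -1
            for table, i in zip(self.tables, indices):
                table[i] = max(-self.weight_limit,
                               min(self.weight_limit, table[i] + step))
```

In a simulator, `predict` would be called once the load's physical address is known; a positive prediction would trigger the speculative request to the memory controller, and `train` would be called when the load's actual hit or miss outcome resolves.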
ET · May 28
Uncertainty-triggered wake-up enables energy-efficient, error-resilient edge AI with memristor front ends
Théo Ballet, Aymen Romdhane, Bruno Lovison-Franco et al.
Memristor computing offers a route to low-energy edge AI, but device variability, sensitivity to operating conditions, and system-integration challenges can hinder deployment. Here we show that these limitations can be mitigated by using memristor AI not as the final decision maker but as the ultra-low-power, always-on front end of a heterogeneous inference system. We implement this architecture by coupling a fabricated memristor Bayesian machine to a programmable CPU running a higher-power, higher-accuracy software neural network. The memristor front end acts as a probabilistic screener. When it predicts an abnormal event or produces an ambiguous or invalid output, a dedicated hardware wake-up path activates the CPU, which produces the final decision. We validate this architecture on a heartbeat-classification benchmark by interfacing the fabricated Bayesian machine with an FPGA-based wake-up platform and CPU back end. The resulting uncertainty-triggered wake-up system achieves high final classification accuracy under nominal operation and maintains this accuracy even when the memristor front end is degraded by voltage scaling or reduced programming margins, because unreliable outputs are converted into recoverable wake-up events instead of becoming silent errors. Post-layout analysis of an ASIC implementation shows that average energy is governed primarily by wake-up frequency, providing practical design rules for choosing front-end operating points. These results establish uncertainty-triggered wake-up as a strategy for energy-efficient, error-resilient edge AI.
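The wake-up rule described above (wake the CPU on an abnormal, ambiguous, or invalid front-end output) can be sketched in a few lines. This is an illustrative sketch only: the function names, the confidence threshold of 0.9, the choice of class 0 as "normal", and the simple additive energy model are assumptions, not details taken from the paper.

```python
import math


def needs_wake_up(posterior, normal_class=0, min_confidence=0.9):
    """Decide whether the front-end output should wake the CPU.

    posterior: list of non-negative class scores from the front end.
    normal_class and min_confidence are hypothetical parameters.
    """
    # Invalid output: empty, non-finite, negative, or all-zero scores.
    if not posterior or any(not math.isfinite(p) or p < 0 for p in posterior):
        return True
    total = sum(posterior)
    if total <= 0:
        return True
    probs = [p / total for p in posterior]
    best = max(range(len(probs)), key=probs.__getitem__)
    # Ambiguous output: the winning class is not confident enough.
    if probs[best] < min_confidence:
        return True
    # Abnormal event: a confident prediction of any non-normal class.
    return best != normal_class


def average_energy(front_end_energy, cpu_energy, wake_rate):
    """Assumed model: the front end always runs, the CPU runs only on wake-up."""
    return front_end_energy + wake_rate * cpu_energy
```

Under this assumed energy model, the CPU term scales linearly with the wake-up rate, which is consistent with the abstract's observation that average energy is governed primarily by wake-up frequency.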
AR · May 15, 2022
Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning
Gagandeep Singh, Rakesh Nadig, Jisung Park et al.
Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a "best-fit" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns while incurring a very modest storage overhead of only 124.4 KiB.
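The observe, decide, reward loop described above can be illustrated with a much simpler learner than the one the paper evaluates. The sketch below swaps in plain tabular Q-learning with epsilon-greedy exploration; the class name, the state encoding, the hyperparameters, and the reward signal are all hypothetical and are not Sibyl's actual design.

```python
import random


class PlacementAgent:
    """Tabular Q-learning stand-in for an online data placement policy."""

    def __init__(self, num_devices, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
        self.num_devices = num_devices
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor for long-term impact
        self.epsilon = epsilon    # exploration probability
        self.q = {}               # state -> list of per-device values
        self.rng = random.Random(seed)

    def _values(self, state):
        return self.q.setdefault(state, [0.0] * self.num_devices)

    def best_device(self, state):
        """Greedy choice: the device with the highest learned value."""
        values = self._values(state)
        return max(range(self.num_devices), key=values.__getitem__)

    def choose_device(self, state):
        """Epsilon-greedy choice used while serving requests online."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_devices)
        return self.best_device(state)

    def learn(self, state, device, reward, next_state):
        """Standard Q-learning update from the reward the system returns."""
        values = self._values(state)
        target = reward + self.gamma * max(self._values(next_state))
        values[device] += self.alpha * (target - values[device])
```

A state here could be any hashable summary of workload and device features (for example, a tuple of bucketed access count and remaining fast-device capacity), and the reward could be derived from the observed request latency. Adding a third storage device only changes `num_devices`, which hints at why a learned policy is easier to extend than a hand-tuned heuristic.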
LG · Feb 2, 2021
Fast Exploration of Weight Sharing Opportunities for CNN Compression
Etienne Dupuis, David Novo, Ian O'Connor et al.
The computational workload involved in Convolutional Neural Networks (CNNs) is typically out of reach for low-power embedded devices. A large number of approximation techniques address this problem. These methods have hyper-parameters that need to be optimized for each CNN using design space exploration (DSE). The goal of this work is to demonstrate that the DSE phase time can easily explode for state-of-the-art CNNs. We thus propose the use of an optimized exploration process to drastically reduce the exploration time without sacrificing the quality of the output.