14.1 AR Apr 17
Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0
F. Nisa Bostanci, Haocong Luo, Ataberk Olgun et al.
A MICRO 2024 best paper runner-up publication (the Mess paper) with all three artifact badges awarded (including "Reproducible") proposes a new benchmark to evaluate real and simulated memory system performance. The publication contends that Ramulator 2.0 and DAMOV (ZSim+Ramulator), along with other existing memory system simulators, "poorly resemble the actual system performance" and asserts that their simulator is better. In this paper, we show that the Mess paper has 1) demonstrable technical misconfigurations, 2) methodological errors in interpreting simulation statistics, and 3) an incomplete artifact that makes its key results irreproducible. We demonstrate that the Ramulator 2.0 simulation results reported in the Mess paper are incorrect due to multiple configuration errors, rather than the inherent simulation inaccuracy claimed by the Mess paper. We show that, when correctly configured, Ramulator 2.0's simulated memory system performance actually resembles real system characteristics well, and thus a key claimed contribution of the Mess paper is factually incorrect. We also identify that the DAMOV simulation results in the Mess paper use wrong simulation statistics that are unrelated to the simulated DRAM performance. We show that DAMOV's simulated DRAM latency is not constant, in contrast to the Mess paper's claim. Moreover, the Mess paper's artifact repository lacks the necessary sources to fully reproduce all of the Mess paper's results. We find that the experiment scripts use simulator executables and other resources that are neither described in the Mess paper nor found in the artifact repository. We strongly encourage the computer architecture community to consider our corrections to the Ramulator 2.0 and DAMOV results of the Mess paper to prevent the propagation of inaccurate and misleading results and to maintain the reliability of the scientific record.
AR Apr 10, 2024 Code
PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System
Steve Rhyner, Haocong Luo, Juan Gómez-Luna et al.
Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance. Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern ML training workloads based on SGD are bottlenecked by data movement between the processor and memory units due to the poor data locality in accessing large datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Our goal is to understand the capabilities of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms on the real-world UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and highlight the need for a shift to an algorithm-hardware codesign. Our results demonstrate three major findings: 1) The UPMEM PIM system can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, especially when operations and datatypes are natively supported by PIM hardware, 2) it is important to carefully choose the optimization algorithms that best fit PIM, and 3) the UPMEM PIM system does not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. We open source all our code to facilitate future research.
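The centralized parallel SGD pattern mentioned in this abstract can be illustrated with a minimal sketch: each worker computes a minibatch gradient on its own data shard, and a central server averages the gradients and updates the model. This is not the PIM-Opt implementation. The 1-D linear model, the function names (`local_gradient`, `centralized_sgd`, `make_shards`), and all parameter values are assumptions chosen for illustration, and the workers are simulated sequentially.

```python
import random

def local_gradient(w, minibatch):
    """Gradient of the mean squared error of y = w * x over one minibatch."""
    return sum(2.0 * (w * x - y) * x for x, y in minibatch) / len(minibatch)

def centralized_sgd(shards, lr=0.1, steps=200, batch=8, seed=0):
    """Hypothetical centralized SGD: workers send gradients, a server averages them."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        grads = []
        for shard in shards:  # stands in for workers that would run in parallel
            minibatch = rng.sample(shard, min(batch, len(shard)))
            grads.append(local_gradient(w, minibatch))
        w -= lr * sum(grads) / len(grads)  # server-side model update
    return w

def make_shards(num_workers=4, per_worker=64, true_w=3.0, seed=1):
    """Synthetic, noise-free data y = true_w * x, split across workers."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(per_worker)]
        shards.append([(x, true_w * x) for x in xs])
    return shards

if __name__ == "__main__":
    print(centralized_sgd(make_shards()))
```

In this pattern, every step moves one gradient per worker to the server and one model update back, which is the kind of communication cost that a scalability evaluation would need to account for.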
53.1 AR Mar 12
DiscoRD: An Experimental Methodology for Quickly Discovering the Reliable Read Disturbance Threshold of Real DRAM Chips
Ataberk Olgun, F. Nisa Bostanci, Ismail Emir Yuksel et al.
State-of-the-art DRAM read disturbance mitigations rely on the read disturbance threshold (RDT) (e.g., the number of aggressor row activations needed to induce the first read disturbance bitflip) to securely and performance- and energy-efficiently prevent read disturbance bitflips. However, accurately and exhaustively characterizing the RDT of every DRAM row in a chip is time-intensive. Rapidly determining RDT is important for enabling secure, performance- and energy-efficient systems. Our goal is to develop and evaluate a reliable and rapid read disturbance testing methodology. To that end, we develop DiscoRD, building on the key results of an extensive experimental characterization study using 212 real DDR4 chips whereby we measure the RDT of hundreds of thousands of DRAM rows millions of times. We develop an empirical model for read disturbance bitflips and evaluate the probability of read-disturbance-induced uncorrectable errors when a read disturbance mechanism is configured using a single $RDT_{min}$ measurement. Using this model, we demonstrate that 1) relying on a lightweight error-correcting code (ECC) alone yields relatively high uncorrectable error probability and 2) combining ECC, infrequent memory scrubbing, and configurable read disturbance mitigation mechanisms can greatly reduce the error probability. Building on our observations and analyses, we discuss how the RDT of each individual row can be identified more precisely. Our results show that error tolerance, memory scrubbing, online profiling, and run-time configurable read disturbance mitigation techniques are important to enable secure and energy-efficient spatial-variation-aware read disturbance mitigations. We hope that DiscoRD drives research that enables us to quantitatively navigate the performance/cost versus reliability tradeoff space for read disturbance mitigation techniques.
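To make the notion of an uncorrectable error probability concrete, the following sketch computes the chance that a codeword contains more bitflips than an ECC can correct. It is not the empirical model from the DiscoRD paper: it assumes independent bitflips with a fixed per-bit probability, a 72-bit codeword, and single-error correction, and the function name `uncorrectable_probability` and its parameters are hypothetical.

```python
from math import comb

def uncorrectable_probability(p_bitflip, codeword_bits=72, correctable=1):
    """Probability that more than `correctable` bits flip in one codeword.

    Assumes each bit flips independently with probability p_bitflip
    (a simplifying assumption, not the paper's model).
    """
    p_correctable = sum(
        comb(codeword_bits, k)
        * p_bitflip ** k
        * (1.0 - p_bitflip) ** (codeword_bits - k)
        for k in range(correctable + 1)
    )
    return 1.0 - p_correctable

if __name__ == "__main__":
    for p in (1e-6, 1e-4, 1e-2):
        print(p, uncorrectable_probability(p))
```

Under these assumptions, lowering the per-bit flip probability (for example, through a more conservative mitigation configuration) or raising `correctable` reduces the result, which mirrors the qualitative point that ECC alone can leave a higher residual error probability than ECC combined with other mechanisms.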
CR Oct 19, 2021
A Deeper Look into RowHammer's Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses
Lois Orosa, Abdullah Giray Yağlıkçı, Haocong Luo et al.
RowHammer is a circuit-level DRAM vulnerability where repeatedly accessing (i.e., hammering) a DRAM row can cause bit flips in physically nearby rows. The RowHammer vulnerability worsens as DRAM cell size and cell-to-cell spacing shrink. Recent studies demonstrate that modern DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older chips such that the required hammer count to cause a bit flip has reduced by more than 10X in the last decade. Therefore, it is essential to develop a better understanding and in-depth insights into the RowHammer vulnerability of modern DRAM chips to more effectively secure current and future systems. Our goal in this paper is to provide insights into fundamental properties of the RowHammer vulnerability that are not yet rigorously studied by prior works, but can potentially be i) exploited to develop more effective RowHammer attacks or ii) leveraged to design more effective and efficient defense mechanisms. To this end, we present an experimental characterization using 248 DDR4 and 24 DDR3 modern DRAM chips from four major DRAM manufacturers demonstrating how the RowHammer effects vary with three fundamental properties: 1) DRAM chip temperature, 2) aggressor row active time, and 3) victim DRAM cell's physical location. Among our 16 new observations, we highlight that a RowHammer bit flip 1) is very likely to occur in a bounded range, specific to each DRAM cell (e.g., 5.4% of the vulnerable DRAM cells exhibit errors in the range 70°C to 90°C), 2) is more likely to occur if the aggressor row is active for longer time (e.g., RowHammer vulnerability increases by 36% if we keep a DRAM row active for 15 column accesses), and 3) is more likely to occur in certain physical regions of the DRAM module under attack (e.g., 5% of the rows are 2x more vulnerable than the remaining 95% of the rows).
AR Jun 10, 2021
CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations
Lois Orosa, Yaohua Wang, Mohammad Sadrosadati et al.
DRAM is the dominant main memory technology used in modern computing systems. Computing systems implement a memory controller that interfaces with DRAM via DRAM commands. DRAM executes the given commands using internal components (e.g., access transistors, sense amplifiers) that are orchestrated by DRAM internal timings, which are fixed for each DRAM command. Unfortunately, the use of fixed internal timings limits the types of operations that DRAM can perform and hinders the implementation of new functionalities and custom mechanisms that improve DRAM reliability, performance, and energy. To overcome these limitations, we propose enabling programmable DRAM internal timings for controlling in-DRAM components. To this end, we design CODIC, a new low-cost DRAM substrate that enables fine-grained control over four previously fixed internal DRAM timings that are key to many DRAM operations. We implement CODIC with only minimal changes to the DRAM chip and the DDRx interface. To demonstrate the potential of CODIC, we propose two new CODIC-based security mechanisms that outperform state-of-the-art mechanisms in several ways: (1) a new DRAM Physical Unclonable Function (PUF) that is more robust and has significantly higher throughput than state-of-the-art DRAM PUFs, and (2) the first cold boot attack prevention mechanism that does not introduce any performance or energy overheads at runtime.
AR May 19, 2021
QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips
Ataberk Olgun, Minesh Patel, A. Giray Yağlıkçı et al.
True random number generators (TRNG) sample random physical processes to create large amounts of random numbers for various use cases, including security-critical cryptographic primitives, scientific simulations, machine learning applications, and even recreational entertainment. Unfortunately, not every computing system is equipped with dedicated TRNG hardware, limiting the application space and security guarantees for such systems. To open the application space and enable security guarantees for the overwhelming majority of computing systems that do not necessarily have dedicated TRNG hardware, we develop QUAC-TRNG. QUAC-TRNG exploits the new observation that a carefully-engineered sequence of DRAM commands activates four consecutive DRAM rows in rapid succession. This QUadruple ACtivation (QUAC) causes the bitline sense amplifiers to non-deterministically converge to random values when we activate four rows that store conflicting data because the net deviation in bitline voltage fails to meet reliable sensing margins. We experimentally demonstrate that QUAC reliably generates random values across 136 commodity DDR4 DRAM chips from one major DRAM manufacturer. We describe how to develop an effective TRNG (QUAC-TRNG) based on QUAC. We evaluate the quality of our TRNG using NIST STS and find that QUAC-TRNG successfully passes each test. Our experimental evaluations show that QUAC-TRNG generates true random numbers with a throughput of 3.44 Gb/s (per DRAM channel), outperforming the state-of-the-art DRAM-based TRNG by 15.08x and 1.41x for basic and throughput-optimized versions, respectively. We show that QUAC-TRNG utilizes DRAM bandwidth better than the state-of-the-art, achieving up to 2.03x the throughput of a throughput-optimized baseline when scaling bus frequencies to 12 GT/s.