Ataberk Olgun

AR
h-index20
7papers
420citations
Novelty49%
AI Score47

7 Papers

ARSep 1, 2022Code
Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran et al.

Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.

15.0ARApr 17
Cleaning up the Mess: Re-Evaluating the Real-System Modeling Accuracy of Ramulator 2.0

F. Nisa Bostanci, Haocong Luo, Ataberk Olgun et al.

A MICRO 2024 best paper runner-up publication (the Mess paper) with all three artifact badges awarded (including ``Reproducible'') proposes a new benchmark to evaluate real and simulated memory system performance. The publication contends that Ramulator 2.0 and DAMOV (ZSim+Ramulator) (along with other existing memory system simulators) ``poorly resemble the actual system performance'' and asserts that their simulator is better. In this paper, we show that the Mess paper has 1) demonstrable technical misconfigurations, 2) methodological errors in interpreting simulation statistics, and 3) an incomplete artifact that makes its key results irreproducible. We demonstrate that the Ramulator 2.0 simulation results reported in the Mess paper are incorrect due to multiple configuration errors instead of inherent simulation inaccuracy claimed by the Mess paper. We show that by correctly configuring Ramulator 2.0, Ramulator 2.0's simulated memory system performance actually resembles real system characteristics well, and thus a key claimed contribution of the Mess paper is factually incorrect. We also identify that the DAMOV simulation results in the Mess paper use wrong simulation statistics that are unrelated to the simulated DRAM performance. We show that DAMOV's simulated DRAM latency is not constant, in contrast to the Mess paper's claim. Moreover, the Mess paper's artifact repository lacks the necessary sources to fully reproduce all the Mess paper's results. We find that the experiment scripts use simulator executables and other resources that are neither described in the Mess paper nor found in the artifact repository. We strongly encourage the computer architecture community to consider our corrections to the Ramulator 2.0 and DAMOV results of the Mess paper to prevent the propagation of inaccurate and misleading results and to maintain the reliability of the scientific record.

ARApr 10, 2024Code
PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System

Steve Rhyner, Haocong Luo, Juan Gómez-Luna et al.

Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance. Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern ML training workloads based on SGD are bottlenecked by data movement between the processor and memory units due to the poor data locality in accessing large datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Our goal is to understand the capabilities of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms on the real-world UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and highlight the need for a shift to an algorithm-hardware codesign. Our results demonstrate three major findings: 1) The UPMEM PIM system can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, especially when operations and datatypes are natively supported by PIM hardware, 2) it is important to carefully choose the optimization algorithms that best fit PIM, and 3) the UPMEM PIM system does not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. We open source all our code to facilitate future research.

53.3ARMar 12
DiscoRD: An Experimental Methodology for Quickly Discovering the Reliable Read Disturbance Threshold of Real DRAM Chips

Ataberk Olgun, F. Nisa Bostanci, Ismail Emir Yuksel et al.

State-of-the-art DRAM read disturbance mitigations rely on the read disturbance threshold (RDT) (e.g., the number of aggressor row activations needed to induce the first read disturbance bitflip) to securely and performance- and energy-efficiently prevent read disturbance bitflips. However, accurately and exhaustively characterizing the RDT of every DRAM row in a chip is time intensive. Rapidly determining RDT is important for enabling secure, performance- and energy-efficient systems. Our goal is to develop and evaluate a reliable and rapid read disturbance testing methodology. To that end, we develop DiscoRD building on the key results of an extensive experimental characterization study using 212 real DDR4 chips whereby we measure the RDT of hundreds of thousands of DRAM rows millions of times. We develop an empirical model for read disturbance bitflips and evaluate the probability of read-disturbance-induced uncorrectable errors when a read disturbance mechanism is configured using a single $RDT_{min}$ measurement. Using this model we demonstrate that 1) relying on a lightweight error-correcting code (ECC) alone yields relatively high uncorrectable error probability and 2) combining ECC, infrequent memory scrubbing, and configurable read disturbance mitigation mechanisms can greatly reduce the error probability. Building on our observations and analyses, we discuss the RDT of each individual row can be identified more precisely. Our results show that error tolerance, memory scrubbing, online profiling, and run-time configurable read disturbance mitigation techniques are important to enable secure and energy-efficient spatial-variation aware read disturbance mitigations. We hope that DiscoRD drives research that enables us to quantitatively navigate the performance/cost - reliability tradeoff space for read disturbance mitigation techniques.

CROct 19, 2021
A Deeper Look into RowHammer`s Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses

Lois Orosa, Abdullah Giray Yağlıkçı, Haocong Luo et al.

RowHammer is a circuit-level DRAM vulnerability where repeatedly accessing (i.e., hammering) a DRAM row can cause bit flips in physically nearby rows. The RowHammer vulnerability worsens as DRAM cell size and cell-to-cell spacing shrink. Recent studies demonstrate that modern DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older chips such that the required hammer count to cause a bit flip has reduced by more than 10X in the last decade. Therefore, it is essential to develop a better understanding and in-depth insights into the RowHammer vulnerability of modern DRAM chips to more effectively secure current and future systems. Our goal in this paper is to provide insights into fundamental properties of the RowHammer vulnerability that are not yet rigorously studied by prior works, but can potentially be $i$) exploited to develop more effective RowHammer attacks or $ii$) leveraged to design more effective and efficient defense mechanisms. To this end, we present an experimental characterization using 248~DDR4 and 24~DDR3 modern DRAM chips from four major DRAM manufacturers demonstrating how the RowHammer effects vary with three fundamental properties: 1)~DRAM chip temperature, 2)~aggressor row active time, and 3)~victim DRAM cell's physical location. Among our 16 new observations, we highlight that a RowHammer bit flip 1)~is very likely to occur in a bounded range, specific to each DRAM cell (e.g., 5.4% of the vulnerable DRAM cells exhibit errors in the range 70C to 90C), 2)~is more likely to occur if the aggressor row is active for longer time (e.g., RowHammer vulnerability increases by 36% if we keep a DRAM row active for 15 column accesses), and 3)~is more likely to occur in certain physical regions of the DRAM module under attack (e.g., 5% of the rows are 2x more vulnerable than the remaining 95% of the rows).

ARMay 19, 2021
QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips

Ataberk Olgun, Minesh Patel, A. Giray Yağlıkçı et al.

True random number generators (TRNG) sample random physical processes to create large amounts of random numbers for various use cases, including security-critical cryptographic primitives, scientific simulations, machine learning applications, and even recreational entertainment. Unfortunately, not every computing system is equipped with dedicated TRNG hardware, limiting the application space and security guarantees for such systems. To open the application space and enable security guarantees for the overwhelming majority of computing systems that do not necessarily have dedicated TRNG hardware, we develop QUAC-TRNG. QUAC-TRNG exploits the new observation that a carefully-engineered sequence of DRAM commands activates four consecutive DRAM rows in rapid succession. This QUadruple ACtivation (QUAC) causes the bitline sense amplifiers to non-deterministically converge to random values when we activate four rows that store conflicting data because the net deviation in bitline voltage fails to meet reliable sensing margins. We experimentally demonstrate that QUAC reliably generates random values across 136 commodity DDR4 DRAM chips from one major DRAM manufacturer. We describe how to develop an effective TRNG (QUAC-TRNG) based on QUAC. We evaluate the quality of our TRNG using NIST STS and find that QUAC-TRNG successfully passes each test. Our experimental evaluations show that QUAC-TRNG generates true random numbers with a throughput of 3.44 Gb/s (per DRAM channel), outperforming the state-of-the-art DRAM-based TRNG by 15.08x and 1.41x for basic and throughput-optimized versions, respectively. We show that QUAC-TRNG utilizes DRAM bandwidth better than the state-of-the-art, achieving up to 2.03x the throughput of a throughput-optimized baseline when scaling bus frequencies to 12 GT/s.

CRFeb 11, 2021
BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows

Abdullah Giray Yağlıkçı, Minesh Patel, Jeremie S. Kim et al.

Aggressive memory density scaling causes modern DRAM devices to suffer from RowHammer, a phenomenon where rapidly activating a DRAM row can cause bit-flips in physically-nearby rows. Recent studies demonstrate that modern DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older chips. Many works show that attackers can exploit RowHammer bit-flips to reliably mount system-level attacks to escalate privilege and leak private data. Therefore, it is critical to ensure RowHammer-safe operation on all DRAM-based systems. Unfortunately, state-of-the-art RowHammer mitigation mechanisms face two major challenges. First, they incur increasingly higher performance and/or area overheads when applied to more vulnerable DRAM chips. Second, they require either proprietary information about or modifications to the DRAM chip design. In this paper, we show that it is possible to efficiently and scalably prevent RowHammer bit-flips without knowledge of or modification to DRAM internals. We introduce BlockHammer, a low-cost, effective, and easy-to-adopt RowHammer mitigation mechanism that overcomes the two key challenges by selectively throttling memory accesses that could otherwise cause RowHammer bit-flips. The key idea of BlockHammer is to (1) track row activation rates using area-efficient Bloom filters and (2) use the tracking data to ensure that no row is ever activated rapidly enough to induce RowHammer bit-flips. By doing so, BlockHammer (1) makes it impossible for a RowHammer bit-flip to occur and (2) greatly reduces a RowHammer attack's impact on the performance of co-running benign applications. Compared to state-of-the-art RowHammer mitigation mechanisms, BlockHammer provides competitive performance and energy when the system is not under a RowHammer attack and significantly better performance and energy when the system is under attack.