Hwayong Nam

h-index10
2papers

2 Papers

17.4CRApr 22
PVAC: A RowHammer Mitigation Architecture Exploiting Per-victim-row Counting

Jumin Kim, Seungmin Baek, Hwayong Nam et al.

As DRAM scaling exacerbates RowHammer, DDR5 introduces per-row activation counting (PRAC) to track aggressor activity. However, PRAC indiscriminately increments counters on every activation -- including benign refreshes -- while relying solely on explicit RFM operations for resets. Consequently, counters saturate even in an idle bank, triggering cascading mitigations and degrading performance. This vulnerability arises from a fundamental mismatch: PRAC tracks the aggressor but aims to protect the victim. We present Per-Victim-row hAmmered Counting (PVAC), a victim-based counting mechanism that aligns the counter semantics with the physical disturbance mechanism of RowHammer. PVAC increments the counters of victim rows, resets the activated row, and naturally bounds counter values under normal refresh. To enable efficient victim-based updates, PVAC employs a dedicated counter subarray (CSA) that performs all counter resets and increments concurrently with normal accesses, without timing overhead. We further devise an energy-efficient CSA layout that minimizes refresh-induced counter accesses. Through victim-based counting, PVAC supports higher hammering tolerance than PRAC while maintaining the same worst-case safety guarantee. Across benign workloads and adversarial attack patterns, PVAC avoids spurious Alerts, eliminates PRAC timing penalties, and achieves higher performance and lower energy consumption than prior PRAC-based defenses.

ARJul 21, 2025
The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts

Sungmin Yun, Seonyong Park, Hwayong Nam et al.

Computational workloads composing traditional Transformer models are starkly bifurcated. Multi-Head Attention (MHA) is memory-bound, with low arithmetic intensity, while feedforward layers are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the MHA bottleneck. This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude greater than that of MHA, shifting it close to a compute-bound regime well-suited for modern accelerators like GPUs. Second, by distributing MoE experts across a pool of accelerators, their arithmetic intensity can be tuned through batching to match that of the dense layers, creating a more balanced computational profile. These findings reveal a diminishing need for specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models.