89.2 ET Apr 15
DTCO Exploration of NOR-Type IGZO FeFETs for Read-Dominated Memories
Yang Xiang, Zhuo Chen, Nicolo Ronchi et al.
InGaZnO (IGZO) channel FeFETs have attracted notable interest thanks to their advances in endurance. This work evaluates the viability of NOR-type IGZO FeFETs for read-centric AI inference workloads via design-technology co-optimization (DTCO). We demonstrate the cross-node bitcell footprint scalability of back-end-of-line (BEOL) IGZO FeFETs, which can deliver 10-A SRAM-equivalent area (0.016 µm²) with 7-nm ground rules and reach sub-5 ns random access latency despite writability challenges. We further identify a sensing margin penalty in NOR FeFET arrays that arises from sneak current associated with the negative program-state Vt. Eliminating the otherwise required negative-voltage read inhibition calls for positive-Vt engineering, for example by thinning the ferroelectric layer. Finally, we elucidate the read margin implications for 3D FeNOR in storage-class memories (SCMs), where the 3D stacking density is limited by additional sneak current from neighbor channel shunting.
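The sneak-current penalty described above can be illustrated with a toy calculation. This is a minimal sketch and not the authors' model: it assumes that every unselected cell on a shared bitline adds the same leakage current when its program-state Vt is negative and no read inhibition is applied. The function name `sensing_ratio` and all current values are hypothetical.

```python
def sensing_ratio(i_on, i_off, i_sneak_per_cell, n_cells):
    """Ratio of bitline current for a selected ON cell vs. a selected OFF cell.

    Assumption (illustrative only): each of the (n_cells - 1) unselected cells
    on the bitline contributes the same sneak current, which adds to both the
    ON and the OFF read currents and so compresses the ratio between them.
    """
    sneak = (n_cells - 1) * i_sneak_per_cell
    return (i_on + sneak) / (i_off + sneak)


if __name__ == "__main__":
    # Hypothetical currents in amperes, chosen only to show the trend.
    for n in (1, 32, 256, 1024):
        ratio = sensing_ratio(i_on=1e-6, i_off=1e-9, i_sneak_per_cell=1e-8, n_cells=n)
        print(f"cells per bitline = {n:5d}  ON/OFF read ratio = {ratio:8.2f}")
```

In this toy picture the ratio falls toward 1 as the number of cells per bitline grows, which is the qualitative reason a positive program-state Vt (negligible per-cell sneak current at zero gate bias) helps the read margin.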
19.6 AR Apr 13
Technology solutions targeting the performance of gen-AI inference in resource constrained platforms
Joyjit Kundu, Joshua Klein, Aakash Patel et al.
The rise of generative AI workloads, particularly language model inference, is intensifying on/off-chip memory pressure. Multimodal inputs such as video streams or images, and downstream applications like Question Answering (QA) and analysis over large documents, incur long context lengths and require caching massive Key and Value states of the previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, such as mobile phones, can further add to memory capacity pressure and runtime memory management complexity. In this paper, we use a hierarchical roofline-based analytical performance model to evaluate the performance implications of two emerging technology solutions that alleviate memory pressure in terms of both capacity and bandwidth. For large models (e.g., 13B parameters) and long context lengths, we investigate the performance implications of High Bandwidth Storage (HBS) and outline the bandwidth/latency requirements to achieve an acceptable throughput for interactivity. For small models (e.g., 1B parameters), we evaluate the merit of a bonded global buffer memory chiplet and propose how to best utilize it.
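To make the memory-pressure argument concrete, the sketch below estimates Key/Value cache size and applies a simple hierarchical roofline bound to one decode step. It is a rough illustration under stated assumptions, not the paper's performance model: the function names, the model shape (40 layers, 40 KV heads, head dimension 128, FP16) and all bandwidth and compute figures are hypothetical, and the roofline assumes compute and every memory level overlap perfectly.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Bytes needed to cache Keys and Values for one sequence.

    The leading factor 2 accounts for storing both K and V per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def step_time(flops, peak_flops, traffic_bytes, bandwidth_bytes_per_s):
    """Hierarchical roofline bound on the time of one step, in seconds.

    traffic_bytes and bandwidth_bytes_per_s are dicts keyed by memory level
    (hypothetical names such as "dram" or "storage"). The step is bounded by
    the slowest of compute and each level's transfer time.
    """
    t_compute = flops / peak_flops
    t_levels = [traffic_bytes[lvl] / bandwidth_bytes_per_s[lvl] for lvl in traffic_bytes]
    return max([t_compute] + t_levels)


if __name__ == "__main__":
    # Hypothetical 13B-class model in FP16 with a 4096-token context.
    params = 13e9
    kv = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128,
                        context_len=4096, bytes_per_elem=2)
    print(f"KV cache: {kv / 1e9:.2f} GB")

    # One decode step: about 2 FLOPs per parameter; weights streamed from a
    # storage-like level, KV cache read from DRAM (illustrative placement).
    t = step_time(
        flops=2 * params,
        peak_flops=20e12,
        traffic_bytes={"storage": params * 2, "dram": kv},
        bandwidth_bytes_per_s={"storage": 100e9, "dram": 50e9},
    )
    print(f"Bound on decode throughput: {1 / t:.2f} tokens/s")
```

Sweeping the per-level bandwidth in a model like this shows which level limits throughput, the kind of question a roofline-style analysis is meant to answer when outlining bandwidth requirements.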