ZettaLith: An Architectural Exploration of Extreme-Scale AI Inference Acceleration
The paper addresses the challenge of scaling AI deployment for widespread use by proposing a more efficient and cost-effective inference architecture; the contribution is incremental in that it builds on existing technologies.
The paper tackles the high computational cost and power consumption of AI systems by introducing ZettaLith, a scalable architecture designed to reduce AI inference cost and power by over 1,000x compared to current GPU-based systems, with a single rack projected to achieve 1.507 zettaFLOPS in 2027.
The high computational cost and power consumption of current and anticipated AI systems present a major challenge for widespread deployment and further scaling. Current hardware approaches face fundamental efficiency limits. This paper introduces ZettaLith, a scalable computing architecture designed to reduce the cost and power of AI inference by over 1,000x compared to current GPU-based systems. Based on architectural analysis and technology projections, a single ZettaLith rack could potentially achieve 1.507 zettaFLOPS in 2027, representing a theoretical 1,047x improvement in inference performance, 1,490x better power efficiency, and 2,325x better cost-effectiveness than current leading GPU racks for FP4 transformer inference. The ZettaLith architecture achieves these gains by abandoning general-purpose GPU applications and through the multiplicative effect of numerous co-designed architectural innovations built on established digital electronic technologies, as detailed in this paper. ZettaLith's core architectural principles scale down efficiently to exaFLOPS desktop systems and petaFLOPS mobile chips, maintaining their roughly 1,000x advantage. ZettaLith presents a simpler system architecture than the complex hierarchy of current GPU clusters. ZettaLith is optimized exclusively for AI inference and is not applicable to AI training.
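The headline ratios can be cross-checked with simple arithmetic. The sketch below is illustrative only: it takes the four figures quoted in the abstract and assumes that "power efficiency" means performance per watt and "cost-effectiveness" means performance per unit cost, which the abstract does not state explicitly. Under those assumptions it derives the GPU-rack throughput implied by the comparison and the ZettaLith rack's power and cost relative to that baseline. The function name `implied_baseline` and the constant names are hypothetical and do not come from the paper.

```python
# Figures quoted in the abstract (projections, not measurements).
ZETTALITH_RACK_FLOPS = 1.507e21  # 1.507 zettaFLOPS, FP4 transformer inference
PERF_RATIO = 1047                # inference performance vs. leading GPU rack
POWER_EFF_RATIO = 1490           # power efficiency vs. leading GPU rack
COST_EFF_RATIO = 2325            # cost-effectiveness vs. leading GPU rack


def implied_baseline(rack_flops, perf_ratio, power_eff_ratio, cost_eff_ratio):
    """Return (baseline FLOPS, relative rack power, relative rack cost).

    Assumption (not stated in the abstract): each efficiency ratio equals
    the performance ratio divided by the corresponding resource ratio, so
    resource ratio = performance ratio / efficiency ratio.
    """
    baseline_flops = rack_flops / perf_ratio
    relative_power = perf_ratio / power_eff_ratio
    relative_cost = perf_ratio / cost_eff_ratio
    return baseline_flops, relative_power, relative_cost


baseline_flops, relative_power, relative_cost = implied_baseline(
    ZETTALITH_RACK_FLOPS, PERF_RATIO, POWER_EFF_RATIO, COST_EFF_RATIO
)

print(f"Implied GPU rack throughput: {baseline_flops / 1e18:.2f} exaFLOPS")
print(f"ZettaLith rack power relative to GPU rack: {relative_power:.2f}x")
print(f"ZettaLith rack cost relative to GPU rack: {relative_cost:.2f}x")
```

Read this way, the quoted ratios imply a baseline GPU rack of roughly 1.44 exaFLOPS, with the ZettaLith rack drawing about 0.70x the power and costing about 0.45x as much. If the paper defines the efficiency metrics differently, the last two figures would change.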