ET · Apr 21

Homodyne Photonic Tensor Processor exceeds 1,000-TOPS

Lian Zhou, Kaiwen Xue, Yun-Jhu Lee, Chun-Ho Lee, Yuan Li, Kiwon Kwon, Weipeng Zhang, Songlin Zhao, Jason Moraes, Niranjan Bhatia, Ryan Hamerly, Mengjie Yu

arXiv:2604.18496 · 7.4 · h-index: 27

Predicted impact: top 82% in ET · last 90 days
Originality: Highly original

AI Analysis

This work provides a near-term pathway for photonic accelerators in large-scale AI training and low-latency inference, addressing the energy and speed bottlenecks of electronic processors.

The authors demonstrate a homodyne photonic tensor processor that achieves over 1,000 TOPS of throughput for general matrix multiplication. Time multiplexing reduces the modulator count, and wafer-scale-fabricated TFLN transmitters encode the data. The system reaches 7-bit accuracy at 120 Gbaud and up to 6,000 TOPS of total throughput, with an efficiency of 330 TOPS/W.
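The time-multiplexing idea can be illustrated with a small numerical model. This is a minimal sketch, not the authors' implementation: it assumes real-valued field amplitudes, ideal 50:50 beamsplitters, and one modulator per row plus one per column whose light is fanned out to a grid of homodyne units that each accumulate over time. The function names `homodyne_unit`, `time_multiplexed_gemm`, and `modulator_counts` are hypothetical.

```python
import math


def homodyne_unit(a, b):
    """Ideal balanced homodyne detection of two real field amplitudes.

    Assumption: a lossless 50:50 beamsplitter followed by two photodiodes.
    The difference of the two detected intensities is 2*a*b, so it is
    halved here to return the product a*b.
    """
    plus = (a + b) / math.sqrt(2)    # field at one beamsplitter output
    minus = (a - b) / math.sqrt(2)   # field at the other output
    return (plus ** 2 - minus ** 2) / 2


def time_multiplexed_gemm(A, B):
    """Compute A @ B (A is M x K, B is K x N) one time step at a time.

    At step t, row channel i carries A[i][t] and column channel j carries
    B[t][j]; unit (i, j) adds their product to its running total.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0.0] * N for _ in range(M)]
    for t in range(K):               # K time steps replace K physical weights
        for i in range(M):
            for j in range(N):
                acc[i][j] += homodyne_unit(A[i][t], B[t][j])
    return acc


def modulator_counts(n):
    """Modulators for an n x n grid: shared per row/column vs one per unit."""
    return {"time_multiplexed": 2 * n, "one_per_unit": n * n}


if __name__ == "__main__":
    print(time_multiplexed_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(modulator_counts(256))
```

Under these assumptions the modulator count grows linearly with the grid size (512 for a 256 x 256 grid) rather than quadratically (65,536), which is the scaling benefit the summary attributes to time multiplexing.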

High-performance computing underpins modern artificial intelligence (AI), enabling foundation models, real-time inference and perception in autonomous systems, and data-intensive scientific simulations. Recent advances in quantization, which enable low-precision computation without degrading model accuracy, create new opportunities for analog photonic computing, with its ultra-high clock rates and low energy consumption. Here we propose and demonstrate a coherent homodyne integrated circuit capable of general matrix multiplication (GEMM) with an aggregate throughput that exceeds 1,000 TOPS (tera-operations per second), enabled by massive on-chip optical fanout and parallelism. By leveraging time multiplexing, the required modulator count is reduced from $O(N^2)$ to $O(N)$, allowing dense integration of record-scale 256 $\times$ 256 homodyne units (each <0.0064 mm$^2$) within a single reticle. We employ 64 wafer-scale-fabricated thin-film lithium niobate (TFLN) transmitters (each with over 40 GHz of bandwidth and a propagation loss of 0.2 dB/cm) to encode data; these are chip-to-chip coupled to Si/SiN computing circuits (64 channels). Our system achieves up to 7-bit computational accuracy across 8 $\times$ 8 parallel channels at a record computing clock rate of 120 Gbaud, and 6-bit statistical accuracy across 256 $\times$ 100 channels at 20-128 Gbaud, representing a total throughput of 1,000-6,000 TOPS. Massive parallelism amortizes the cost of optoelectronic (OE) conversion, allowing an efficiency of 330 TOPS/W with foundry-available packaging technology. The system throughput is benchmarked with Qwen2.5 0.5-billion-parameter models that generate accurate tokens. High throughput and energy efficiency establish a near-term pathway toward light-based accelerators for large-scale training and low-latency inference from datacenters to the edge, accelerating new models toward artificial general intelligence.
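The throughput and efficiency figures quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope check, not the paper's accounting: it assumes every channel completes one multiply-accumulate per symbol and that each multiply-accumulate counts as two operations. The helper names `gemm_throughput_tops` and `energy_per_op_fj` are hypothetical.

```python
def gemm_throughput_tops(rows, cols, gbaud, ops_per_mac=2):
    """Aggregate throughput in TOPS for a rows x cols grid of channels.

    Assumption: one multiply-accumulate per channel per symbol, counted
    as `ops_per_mac` operations. Giga-symbols/s times ops gives giga-ops/s,
    so dividing by 1000 converts to tera-ops/s.
    """
    return rows * cols * ops_per_mac * gbaud / 1000.0


def energy_per_op_fj(tops_per_watt):
    """Energy per operation in femtojoules implied by a TOPS/W figure."""
    joules_per_op = 1.0 / (tops_per_watt * 1e12)
    return joules_per_op * 1e15


if __name__ == "__main__":
    # 256 x 100 channels at the two ends of the quoted 20-128 Gbaud range
    print(gemm_throughput_tops(256, 100, 20))
    print(gemm_throughput_tops(256, 100, 128))
    # 8 x 8 channels at 120 Gbaud (the 7-bit accuracy demonstration)
    print(gemm_throughput_tops(8, 8, 120))
    # Energy per operation at 330 TOPS/W
    print(round(energy_per_op_fj(330), 2))
```

Under these assumptions the 256 x 100 configuration gives roughly 1,024 TOPS at 20 Gbaud and roughly 6,554 TOPS at 128 Gbaud, in line with the quoted 1,000-6,000 TOPS range, and 330 TOPS/W corresponds to about 3 fJ per operation.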

