Jeff Zhang

AR
h-index50
14papers
1,331citations
Novelty53%
AI Score54

14 Papers

86.2ARMay 30Code
MACO: A Multi-Agent LLM Framework for Automated CGRA Hardware/Software Co-Design

Zesong Jiang, Yuqi Sun, Qing Zhong et al.

Designing optimal Coarse-Grained Reconfigurable Arrays (CGRAs) requires navigating a vast, interdependent hardware/software space bottlenecked by costly manual iteration. We present MACO, an open-source, multi-agent LLM framework that automates CGRA HW/SW co-design. MACO decomposes the design loop into four collaborative stages, HW/SW Co-design, Error Correction, Best-Design Selection, and Evaluation & Feedback, to iteratively optimize power, performance, and area (PPA). To accelerate convergence and efficiently traverse the design space, MACO introduces an exponentially decaying exploration strategy, EDA-guided LLM self-learning, and robust rule-based error correction. Evaluated against state-of-the-art baselines, MACO reduces power consumption by 25.9%, improves performance by 20.0%, and accelerates the search process by 5x. Finally, we validate MACO's physical design through a complete 7nm ASIC design flow.

LGMar 21, 2024
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han, Chao Gao, Jinyang Liu et al.

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. Especially, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, particularly over the hardware platforms constrained by computational capabilities. Parameter Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adjusting the large models over the various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the supporting system platform design. In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to providing an extensive survey from an algorithmic standpoint, we also examine various real-world system designs to investigate the implementation costs associated with different PEFT approaches. This survey serves as a valuable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed ......

ARJul 7, 2024
SCATTER: Algorithm-Circuit Co-Sparse Photonic Accelerator with Thermal-Tolerant, Power-Efficient In-situ Light Redistribution

Ziang Yin, Nicholas Gangi, Meng Zhang et al.

Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads. However, limited reconfigurability, high electrical-optical conversion cost, and thermal sensitivity limit the deployment of current optical analog computing engines to support power-restricted, performance-sensitive AI workloads at scale. Sparsity provides a great opportunity for hardware-efficient AI accelerators. However, current dense photonic accelerators fail to fully exploit the power-saving potential of algorithmic sparsity. It requires sparsity-aware hardware specialization with a fundamental re-design of photonic tensor core topology and cross-layer device-circuit-architecture-algorithm co-optimization aware of hardware non-ideality and power bottleneck. To trim down the redundant power consumption while maximizing robustness to thermal variations, we propose SCATTER, a novel algorithm-circuit co-sparse photonic accelerator featuring dynamically reconfigurable signal path via thermal-tolerant, power-efficient in-situ light redistribution and power gating. A power-optimized, crosstalk-aware dynamic sparse training framework is introduced to explore row-column structured sparsity and ensure marginal accuracy loss and maximum power efficiency. The extensive evaluation shows that our cross-stacked optimized accelerator SCATTER achieves a 511X area reduction and 12.4X power saving with superior crosstalk tolerance that enables unprecedented circuit layout compactness and on-chip power efficiency.

OPTICSNov 20, 2024Code
SimPhony: A Device-Circuit-Architecture Cross-Layer Modeling and Simulation Framework for Heterogeneous Electronic-Photonic AI System

Ziang Yin, Meng Zhang, Amir Begovic et al.

Electronic-photonic integrated circuits (EPICs) offer transformative potential for next-generation high-performance AI but require interdisciplinary advances across devices, circuits, architecture, and design automation. The complexity of hybrid systems makes it challenging even for domain experts to understand distinct behaviors and interactions across design stack. The lack of a flexible, accurate, fast, and easy-to-use EPIC AI system simulation framework significantly limits the exploration of hardware innovations and system evaluations on common benchmarks. To address this gap, we propose SimPhony, a cross-layer modeling and simulation framework for heterogeneous electronic-photonic AI systems. SimPhony offers a platform that enables (1) generic, extensible hardware topology representation that supports heterogeneous multi-core architectures with diverse photonic tensor core designs; (2) optics-specific dataflow modeling with unique multi-dimensional parallelism and reuse beyond spatial/temporal dimensions; (3) data-aware energy modeling with realistic device responses, layout-aware area estimation, link budget analysis, and bandwidth-adaptive memory modeling; and (4) seamless integration with model training framework for hardware/software co-simulation. By providing a unified, versatile, and high-fidelity simulation platform, SimPhony enables researchers to innovate and evaluate EPIC AI hardware across multiple domains, facilitating the next leap in emerging AI hardware. We open-source our codes at https://github.com/ScopeX-ASU/SimPhony

OPTICSNov 8, 2024Code
Multi-Dimensional Reconfigurable, Physically Composable Hybrid Diffractive Optical Neural Network

Ziang Yin, Yu Yao, Jeff Zhang et al.

Diffractive optical neural networks (DONNs), leveraging free-space light wave propagation for ultra-parallel, high-efficiency computing, have emerged as promising artificial intelligence (AI) accelerators. However, their inherent lack of reconfigurability due to fixed optical structures post-fabrication hinders practical deployment in the face of dynamic AI workloads and evolving applications. To overcome this challenge, we introduce, for the first time, a multi-dimensional reconfigurable hybrid diffractive ONN system (MDR-HDONN), a physically composable architecture that unlocks a new degree of freedom and unprecedented versatility in DONNs. By leveraging full-system learnability, MDR-HDONN repurposes fixed fabricated optical hardware, achieving exponentially expanded functionality and superior task adaptability through the differentiable learning of system variables. Furthermore, MDR-HDONN adopts a hybrid optical/photonic design, combining the reconfigurability of integrated photonics with the ultra-parallelism of free-space diffractive systems. Extensive evaluations demonstrate that MDR-HDONN has digital-comparable accuracy on various task adaptations with 74x faster speed and 194x lower energy. Compared to prior DONNs, MDR-HDONN shows exponentially larger functional space with 5x faster training speed, paving the way for a new paradigm of versatile, composable, hybrid optical/photonic AI computing. We will open-source our codes.

OPTICSDec 31, 2025
Toward Large-Scale Photonics-Empowered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration

Ziang Yin, Hongjian Zhou, Nicholas Gangi et al.

In this work, we identify three considerations that are essential for realizing practical photonic AI systems at scale: (1) dynamic tensor operation support for modern models rather than only weight-static kernels, especially for attention/Transformer-style workloads; (2) systematic management of conversion, control, and data-movement overheads, where multiplexing and dataflow must amortize electronic costs instead of letting ADC/DAC and I/O dominate; and (3) robustness under hardware non-idealities that become more severe as integration density grows. To study these coupled tradeoffs quantitatively, and to ensure they remain meaningful under real implementation constraints, we build a cross-layer toolchain that supports photonic AI design from early exploration to physical realization. SimPhony provides implementation-aware modeling and rapid cross-layer evaluation, translating physical costs into system-level metrics so architectural decisions are grounded in realistic assumptions. ADEPT and ADEPT-Z enable end-to-end circuit and topology exploration, connecting system objectives to feasible photonic fabrics under practical device and circuit constraints. Finally, Apollo and LiDAR provide scalable photonic physical design automation, turning candidate circuits into manufacturable layouts while accounting for routing, thermal, and crosstalk constraints.

CLNov 17, 2025
Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Lavender Y. Jiang, Angelica Chen, Xu Han et al.

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

AIFeb 26, 2025
CNS-Obsidian: A Neurosurgical Vision-Language Model Built From Scientific Publications

Anton Alyakin, Jaden Stryker, Daniel Alexander Alber et al.

General-purpose vision-language models (VLMs) demonstrate impressive capabilities, but their opaque training on uncurated internet data posse critical limitations for high-stakes decision-making, such as in neurosurgery. We present CNS-Obsidian, a neurosurgical VLM trained on peer-reviewed neurosurgical literature, and demonstrate its clinical utility compared with GPT-4o in a real-world setting. We compiled 23,984 articles from Neurosurgery Publications journals, yielding 78,853 figures and captions. Using GPT-4o and Claude Sonnet-3.5, we converted these image-text pairs into 263,064 training samples across three formats: instruction fine-tuning, multiple-choice questions, and differential diagnosis. We trained CNS-Obsidian, a fine-tune of the 34-billion parameter LLaVA-Next model. In a blinded, randomized deployment trial at NYU Langone Health (Aug 30-Nov 30, 2024), neurosurgeons were assigned to use either CNS-Obsidian or GPT-4o as a diagnostic co-pilot after patient consultations. Primary outcomes were diagnostic helpfulness and accuracy. CNS-Obsidian matched GPT-4o on synthetic questions (76.13% vs 77.54%, p=0.235), but only achieved 46.81% accuracy on human-generated questions versus GPT-4o's 65.70% (p<10-15). In the randomized trial, 70 consultations were evaluated (32 CNS-Obsidian, 38 GPT-4o) from 959 total consults. CNS-Obsidian received positive ratings in 40.62% of cases versus 57.89% for GPT-4o (p=0.230). Both models included correct diagnosis in approximately 60% of cases (59.38% vs 65.79%, p=0.626). Domain-specific VLMs trained on curated scientific literature can approach frontier model performance in specialized medical domains despite being orders of magnitude smaller and less expensive to train. However, low clinical utilization suggests chatbot interfaces may not align with specialist workflows, indicating need for alternative AI integration strategies.

ROOct 27, 2021
Millimeter Wave Wireless Assisted Robot Navigation with Link State Classification

Mingsheng Yin, Akshaj Veldanda, Amee Trivedi et al.

The millimeter wave (mmWave) bands have attracted considerable attention for high precision localization applications due to the ability to capture high angular and temporal resolution measurements. This paper explores mmWave-based positioning for a target localization problem where a fixed target broadcasts mmWave signals and a mobile robotic agent attempts to capture the signals to locate and navigate to the target. A three-stage procedure is proposed: First, the mobile agent uses tensor decomposition methods to detect the multipath channel components and estimate their parameters. Second, a machine-learning trained classifier is then used to predict the link state, meaning if the strongest path is line-of-sight (LOS) or non-LOS (NLOS). For the NLOS case, the link state predictor also determines if the strongest path arrived via one or more reflections. Third, based on the link state, the agent either follows the estimated angles or uses computer vision or other sensor to explore and map the environment. The method is demonstrated on a large dataset of indoor environments supplemented with ray tracing to simulate the wireless propagation. The path estimation and link state classification are also integrated into a state-of-the-art neural simultaneous localization and mapping (SLAM) module to augment camera and LIDAR-based navigation. It is shown that the link state classifier can successfully generalize to completely new environments outside the training set. In addition, the neural-SLAM module with the wireless path estimation and link state classifier provides rapid navigation to the target, close to a baseline that knows the target location.

ARMay 18, 2021
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance

Udit Gupta, Samuel Hsia, Jeff Zhang et al.

Deep learning recommendation systems must provide high quality, personalized content under strict tail-latency targets and high system loads. This paper presents RecPipe, a system to jointly optimize recommendation quality and inference performance. Central to RecPipe is decomposing recommendation models into multi-stage pipelines to maintain quality while reducing compute complexity and exposing distinct parallelism opportunities. RecPipe implements an inference scheduler to map multi-stage recommendation engines onto commodity, heterogeneous platforms (e.g., CPUs, GPUs).While the hardware-aware scheduling improves ranking efficiency, the commodity platforms suffer from many limitations requiring specialized hardware. Thus, we design RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail-latency, and system throughput. RPAc-cel is designed specifically to exploit the distinct design space opened via RecPipe. In particular, RPAccel processes queries in sub-batches to pipeline recommendation stages, implements dual static and dynamic embedding caches, a set of top-k filtering units, and a reconfigurable systolic array. Com-pared to prior-art and at iso-quality, we demonstrate that RPAccel improves latency and throughput by 3x and 6x.

LGJul 2, 2018
FATE: Fast and Accurate Timing Error Prediction Framework for Low Power DNN Accelerator Design

Jeff Zhang, Siddharth Garg

Deep neural networks (DNN) are increasingly being accelerated on application-specific hardware such as the Google TPU designed especially for deep learning. Timing speculation is a promising approach to further increase the energy efficiency of DNN accelerators. Architectural exploration for timing speculation requires detailed gate-level timing simulations that can be time-consuming for large DNNs that execute millions of multiply-and-accumulate (MAC) operations. In this paper we propose FATE, a new methodology for fast and accurate timing simulations of DNN accelerators like the Google TPU. FATE proposes two novel ideas: (i) DelayNet, a DNN based timing model for MAC units; and (ii) a statistical sampling methodology that reduces the number of MAC operations for which timing simulations are performed. We show that FATE results in between 8 times-58 times speed-up in timing simulations, while introducing less than 2% error in classification accuracy estimates. We demonstrate the use of FATE by comparing to conventional DNN accelerator that uses 2's complement (2C) arithmetic with an alternative implementation that uses signed magnitude representations (SMR). We show that that the SMR implementation provides 18% more energy savings for the same classification accuracy than 2C, a result that might be of independent interest.

NEFeb 11, 2018
ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Neural Network Accelerators

Jeff Zhang, Kartheek Rangineni, Zahra Ghodsi et al.

Hardware accelerators are being increasingly deployed to boost the performance and energy efficiency of deep neural network (DNN) inference. In this paper we propose Thundervolt, a new framework that enables aggressive voltage underscaling of high-performance DNN accelerators without compromising classification accuracy even in the presence of high timing error rates. Using post-synthesis timing simulations of a DNN accelerator modeled on the Google TPU, we show that Thundervolt enables between 34%-57% energy savings on state-of-the-art speech and image recognition benchmarks with less than 1% loss in classification accuracy and no performance loss. Further, we show that Thundervolt is synergistic with and can further increase the energy efficiency of commonly used run-time DNN pruning techniques like Zero-Skip.

LGFeb 11, 2018
Analyzing and Mitigating the Impact of Permanent Faults on a Systolic Array Based Neural Network Accelerator

Jeff Zhang, Tianyu Gu, Kanad Basu et al.

Due to their growing popularity and computational cost, deep neural networks (DNNs) are being targeted for hardware acceleration. A popular architecture for DNN acceleration, adopted by the Google Tensor Processing Unit (TPU), utilizes a systolic array based matrix multiplication unit at its core. This paper deals with the design of fault-tolerant, systolic array based DNN accelerators for high defect rate technologies. To this end, we empirically show that the classification accuracy of a baseline TPU drops significantly even at extremely low fault rates (as low as $0.006\%$). We then propose two novel strategies, fault-aware pruning (FAP) and fault-aware pruning+retraining (FAP+T), that enable the TPU to operate at fault rates of up to $50\%$, with negligible drop in classification accuracy (as low as $0.1\%$) and no run-time performance overhead. The FAP+T does introduce a one-time retraining penalty per TPU chip before it is deployed, but we propose optimizations that reduce this one-time penalty to under 12 minutes. The penalty is then amortized over the entire lifetime of the TPU's operation.

CVJun 27, 2017
Hierarchical Model for Long-term Video Prediction

Peter Wang, Zhongxia Yan, Jeff Zhang

Video prediction has been an active topic of research in the past few years. Many algorithms focus on pixel-level predictions, which generates results that blur and disintegrate within a few frames. In this project, we use a hierarchical approach for long-term video prediction. We aim at estimating high-level structure in the input frame first, then predict how that structure grows in the future. Finally, we use an image analogy network to recover a realistic image from the predicted structure. Our method is largely adopted from the work by Villegas et al. The method is built with a combination of LSTMs and analogy-based convolutional auto-encoder networks. Additionally, in order to generate more realistic frame predictions, we also adopt adversarial loss. We evaluate our method on the Penn Action dataset, and demonstrate good results on high-level long-term structure prediction.