Chenyang Zhou

AI
h-index4
9papers
49citations
Novelty54%
AI Score57

9 Papers

MAMay 13Code
ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Zhongkai Yu, Yichen Lin, Chenyang Zhou et al.

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0\% and 80.1\% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.

AIJan 29Code
ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

Zhongkai Yu, Chenyang Zhou, Yichen Lin et al.

While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state-of-the-art Claude-4.5-opus achieving only 30.74\% on Verilog generation and 13.33\% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95\% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high-quality training data generation, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.

AIJul 19, 2024
The Cardinality of Identifying Code Sets for Soccer Ball Graph with Application to Remote Sensing

Anna L. D. Latour, Arunabha Sen, Kaustav Basu et al.

In the context of satellite monitoring of the earth, we can assume that the surface of the earth is divided into a set of regions. We assume that the impact of a big social/environmental event spills into neighboring regions. Using Identifying Code Sets (ICSes), we can deploy sensors in such a way that the region in which an event takes place can be uniquely identified, even with fewer sensors than regions. As Earth is almost a sphere, we use a soccer ball as a model. We construct a Soccer Ball Graph (SBG), and provide human-oriented, analytical proofs that 1) the SBG has at least 26 ICSes of cardinality ten, implying that there are at least 26 different ways to deploy ten satellites to monitor the Earth and 2) that the cardinality of the minimum Identifying Code Set (MICS) for the SBG is at least nine. We then provide a machine-oriented formal proof that the cardinality of the MICS for the SBG is in fact ten, meaning that one must deploy at least ten satellites to monitor the Earth in the SBG model. We also provide machine-oriented proof that there are exactly 26 ICSes of cardinality ten for the SBG.

AIMar 15
Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Tiantian Mi, Dongming Shan, Zhen Huang et al.

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.

DCOct 7, 2025Code
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting

Zhongkai Yu, Yue Guan, Zihao Yu et al.

Large Language Models (LLMs) with Mixture of Experts (MoE) architectures achieve remarkable performance improvements, but their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit serving systems. To forecast the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across three state-of-the-art large-scale MoE models (200B- 671B) using over 24,000 requests spanning diverse workloads. With the resulting 150GB+ trace files, we perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Taking wafer-scale GPUs as a case study, we demonstrate that minor architectural modifications leveraging our insights achieve substantial performance gains, delivering 6.3X and 4.0X average speedups on DeepSeek V3 and Qwen3, respectively. Our work provides the first comprehensive data-centric analysis of MoE models at scale. Our profiling traces and analysis results are publicly available at {https://huggingface.co/datasets/core12345/MoE_expert_selection_trace. We will also release our simulation framework shortly to facilitate future research in this area.

ARApr 28
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Zhongkai Yu, Haotian Ye, Chenyang Zhou et al.

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.

AIJul 25, 2018
Decentralized Cooperative Planning for Automated Vehicles with Hierarchical Monte Carlo Tree Search

Karl Kurzer, Chenyang Zhou, J. Marius Zöllner

Today's automated vehicles lack the ability to cooperate implicitly with others. This work presents a Monte Carlo Tree Search (MCTS) based approach for decentralized cooperative planning using macro-actions for automated vehicles in heterogeneous environments. Based on cooperative modeling of other agents and Decoupled-UCT (a variant of MCTS), the algorithm evaluates the state-action-values of each agent in a cooperative and decentralized manner, explicitly modeling the interdependence of actions between traffic participants. Macro-actions allow for temporal extension over multiple time steps and increase the effective search depth requiring fewer iterations to plan over longer horizons. Without predefined policies for macro-actions, the algorithm simultaneously learns policies over and within macro-actions. The proposed method is evaluated under several conflict scenarios, showing that the algorithm can achieve effective cooperative planning with learned macro-actions in heterogeneous environments.

RONov 19, 2017
CPG-Based Control Scheme for Quadruped Robot to Withstand the Lateral Impact

Qingsheng Luo, Chenyang Zhou, Yan Jia et al.

This paper aims to present a stability control strategy for quadruped robot under lateral impact with the help of lateral trot. We firstly propose five necessary conditions for keeping balance. The classical four-neuron Central Pattern Generator (CPG) network with Hopf oscillators is then extended to eight-neuron network with four more trigger-enabled neurons, which controls the lateral trot. With proper adjustment of network's parameters, such network can coordinate the lateral and longitudinal trot gait. Based on Zero Movement Point (ZMP) theory, the robot is modeled as an inverted pendulum to plan the Center of Gravity (CoG) position and calculate the needed lateral step length. The simulation shows that the lateral acceleration of the quadruped robot after lateral impact regains to the normal range in a short time. Comparison shows that the maximal lateral impact that robot can resist increases about 125% from 0.72g to 1.55g.

NIJan 24, 2017
On Robustness in Multilayer Interdependent Network

Joydeep Banerjee, Chenyang Zhou, Arunabha Sen

Critical Infrastructures like power and communication networks are highly interdependent on each other for their full functionality. Many significant research have been pursued to model the interdependency and failure analysis of these interdependent networks. However, most of these models fail to capture the complex interdependencies that might actually exist between the infrastructures. The \emph{Implicative Interdependency Model} that utilizes Boolean Logic to capture complex interdependencies was recently proposed which overcome the limitations of the existing models. A number of problems were studies based on this model. In this paper we study the \textit{Robustness} problem in Interdependent Power and Communication Network. The robustness is defined with respect to two parameters $K \in I^{+} \cup \{0\}$ and $ρ\in (0,1]$. We utilized the \emph{Implicative Interdependency Model} model to capture the complex interdependency between the two networks. The model classifies the interdependency relations into four cases. Computational complexity of the problem is analyzed for each of these cases. A polynomial time algorithm is designed for the first case that outputs the optimal solution. All the other cases are proved to be NP-complete. An in-approximability bound is provided for the third case. For the general case we formulate an Integer Linear Program to get the optimal solution and a polynomial time heuristic. The applicability of the heuristic is evaluated using power and communication network data of Maricopa County, Arizona. The experimental results showed that the heuristic almost always produced near optimal value of parameter $K$ for $ρ< 0.42$.