Ziyu Kong

h-index: 4

4 papers

7 citations

Novelty: 65%

AI Score: 56

Ranked #21,206 of 205,806 authors (top 10%); #782 in AI (top 5%)

4 Papers

99.4 · CL · May 9 · Code

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang, Jiabao Zhuang, Wenqing Jing et al.

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly-cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via novel metrics, namely Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent). Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: capability-side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference; alignment-side, all 12 LLMs converge to Sem-Path 28-29%, well below the 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench
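
As a rough illustration of the retrieval side of the evaluation described above, the following sketch computes set-based precision, recall, and F1 between the papers an agent retrieves and the papers an expert survey cites. It is not taken from the TaxoBench repository: the function name `retrieval_scores` and the paper identifiers are hypothetical, and the benchmark's own matching rules (for example, how paper titles are normalized) may differ.

```python
def retrieval_scores(retrieved, expert):
    """Set-based precision, recall, and F1 over paper identifiers.

    Assumption: a retrieved paper counts as a hit only when its identifier
    exactly matches one in the expert-cited set.
    """
    retrieved, expert = set(retrieved), set(expert)
    hits = len(retrieved & expert)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert) if expert else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical identifiers, for illustration only.
expert_papers = {"p1", "p2", "p3", "p4", "p5"}
agent_papers = {"p1", "p2", "x9", "x10"}

precision, recall, f1 = retrieval_scores(agent_papers, expert_papers)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading, the 20.92% figure in the abstract corresponds to the recall term: the share of expert-cited papers that the best agent recovers.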

AI · Jan 12 · Code

Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning

Jiaxuan Lu, Ziyu Kong, Yemin Wang et al.

The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.

57.2 · CR · Apr 22

zkCraft: Prompt-Guided LLM as a Zero-Shot Mutation Pattern Oracle for TCCT-Powered ZK Fuzzing

Rong Fu, Jia Yee Tan, Youjin Wang et al.

Zero-knowledge circuits enable privacy-preserving and scalable systems but are difficult to implement correctly due to the tight coupling between witness computation and circuit constraints. We present zkCraft, a practical framework that combines deterministic, R1CS-aware localization with proof-bearing search to detect semantic inconsistencies. zkCraft encodes candidate constraint edits into a single Row-Vortex polynomial and replaces repeated solver queries with a Violation IOP that certifies the existence of edits together with a succinct proof. Deterministic LLM-driven mutation templates bias exploration toward edge cases while preserving auditable algebraic verification. Evaluation on real Circom code shows that proof-bearing localization detects diverse under- and over-constrained faults with low false positives and reduces costly solver interaction. Our approach bridges formal verification and automated debugging, offering a scalable path for robust ZK circuit development.

NI · Feb 13

Chimera: Neuro-Symbolic Attention Primitives for Trustworthy Dataplane Intelligence

Rong Fu, Xiaowen Ma, Kun Liu et al.

Deploying expressive learning models directly on programmable dataplanes promises line-rate, low-latency traffic analysis but remains hindered by strict hardware constraints and the need for predictable, auditable behavior. Chimera introduces a principled framework that maps attention-oriented neural computations and symbolic constraints onto dataplane primitives, enabling trustworthy inference within the match-action pipeline. Chimera combines a kernelized, linearized attention approximation with a two-layer key-selection hierarchy and a cascade fusion mechanism that enforces hard symbolic guarantees while preserving neural expressivity. The design includes a hardware-aware mapping protocol and a two-timescale update scheme that together permit stable, line-rate operation under realistic dataplane budgets. The paper presents the Chimera architecture, a hardware mapping strategy, and empirical evidence showing that neuro-symbolic attention primitives can achieve high-fidelity inference within the resource envelope of commodity programmable switches.
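
The abstract does not spell out Chimera's attention approximation, so the sketch below shows only the general idea behind kernelized, linearized attention, under stated assumptions. It uses the common feature map `phi(x) = elu(x) + 1` (an assumption, not confirmed by the source) and plain Python lists; the function names are hypothetical, and nothing here models the key-selection hierarchy, cascade fusion, or dataplane mapping. The point is that the key/value summaries are accumulated once, so each query costs a fixed amount of work independent of the number of keys.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (assumed here for illustration)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Kernelized attention computed from running key/value summaries.

    out_i = phi(q_i)^T (sum_j phi(k_j) v_j^T) / (phi(q_i)^T sum_j phi(k_j))
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = [phi(x) for x in k]
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]

    outputs = []
    for q in queries:
        fq = [phi(x) for x in q]
        # phi is strictly positive, so the normalizer is never zero.
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        outputs.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return outputs


# Toy inputs, for illustration only.
Q = [[0.3, -1.0], [1.5, 0.2]]
K = [[1.0, 2.0], [-0.5, 0.4], [0.0, -2.0]]
V = [[5.0, 7.0], [1.0, 0.0], [-3.0, 2.0]]
out = linear_attention(Q, K, V)
```

Because the summaries `kv` and `k_sum` have fixed size, this form is the kind of computation that can plausibly fit a bounded per-packet budget, which is the constraint the abstract emphasizes.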