Zhengxu Yan

LG · May 29, 2025 · Code

VERINA: Benchmarking Verifiable Code Generation

Zhe Ye, Zhengxu Yan, Jingxuan He et al.

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4% code correctness rate, 51.0% for specification soundness and completeness, and a mere 3.6% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
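To make the three artifacts named in the abstract concrete, here is a minimal Lean 4 sketch of a code, specification, and proof triple. It is not taken from the Verina dataset and does not follow its task format; the names `myMax`, `myMaxSpec`, and `myMax_satisfies_spec` are hypothetical and chosen only for illustration.

```lean
-- Code: a hypothetical implementation (not a Verina task).
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result is at least as large as both inputs
-- and is equal to one of them.
def myMaxSpec (a b result : Nat) : Prop :=
  a ≤ result ∧ b ≤ result ∧ (result = a ∨ result = b)

-- Proof: the implementation satisfies the specification.
theorem myMax_satisfies_spec (a b : Nat) : myMaxSpec a b (myMax a b) := by
  unfold myMaxSpec myMax
  -- Case-split on the `if` condition, then close each case
  -- with linear arithmetic over the naturals.
  split <;> omega
```

In this toy case the proof is two tactic lines. The abstract's 3.6% proof success rate suggests that the corresponding step is the hard part on realistic tasks.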

LG · Dec 12, 2025

NoveltyRank: A Retrieval-Augmented Framework for Conceptual Novelty Estimation in AI Research

Zhengxu Yan, Han Li, Yuming Feng

The accelerating pace of scientific publication makes it difficult to identify truly original research among incremental work. We propose a framework for estimating the conceptual novelty of research papers by combining semantic representation learning with retrieval-based comparison against prior literature. We model novelty as both a binary classification task (novel vs. non-novel) and a pairwise ranking task (comparative novelty), enabling absolute and relative assessments. Experiments benchmark three model scales, ranging from compact domain-specific encoders to a zero-shot frontier model. Results show that fine-tuned lightweight models outperform larger zero-shot models despite their smaller parameter count, indicating that task-specific supervision matters more than scale for conceptual novelty estimation. We further deploy the best-performing model as an online system for public interaction and real-time novelty scoring.

2 Papers