Zhefan Wang

CL
h-index8
6papers
14citations
Novelty42%
AI Score51

6 Papers

CLMay 18Code
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

Zhonghang Yuan, Zhefan Wang, Fang Hu et al.

Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its applications on knowledge-intensive domains have not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, leading to the limitations of flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLM in knowledge-intensive domains without significantly compromising the model's general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at https://github.com/SeedScientist/K2V.

CLMay 28
Personalized Turn-Level User Conversation Satisfaction Benchmark

Zhefan Wang, Zhiqiang Guo, Weizhi Ma et al.

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.

CLMay 19, 2025Code
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science

Jie Ying, Zihong Chen, Zhefan Wang et al.

Seed science is essential for modern agriculture, directly influencing crop yields and global food security. However, challenges such as interdisciplinary complexity and high costs with limited returns hinder progress, leading to a shortage of experts and insufficient technological support. While large language models (LLMs) have shown promise across various fields, their application in seed science remains limited due to the scarcity of digital resources, complex gene-trait relationships, and the lack of standardized benchmarks. To address this gap, we introduce SeedBench -- the first multi-task benchmark specifically designed for seed science. Developed in collaboration with domain experts, SeedBench focuses on seed breeding and simulates key aspects of modern breeding processes. We conduct a comprehensive evaluation of 26 leading LLMs, encompassing proprietary, open-source, and domain-specific fine-tuned models. Our findings not only highlight the substantial gaps between the power of LLMs and the real-world seed science problems, but also make a foundational step for research on LLMs for seed design.

IRMar 9, 2025Code
ROGRAG: A Robustly Optimized GraphRAG Framework

Zhefan Wang, Huanjun Kong, Jie Ying et al.

Large language models (LLMs) commonly struggle with specialized or emerging topics which are rarely seen in the training corpus. Graph-based retrieval-augmented generation (GraphRAG) addresses this by structuring domain knowledge as a graph for dynamic retrieval. However, existing pipelines involve complex engineering workflows, making it difficult to isolate the impact of individual components. It is also challenging to evaluate the retrieval effectiveness due to the overlap between the pretraining and evaluation datasets. In this work, we introduce ROGRAG, a Robustly Optimized GraphRAG framework. Specifically, we propose a multi-stage retrieval mechanism that integrates dual-level with logic form retrieval methods to improve retrieval robustness without increasing computational cost. To further refine the system, we incorporate various result verification methods and adopt an incremental database construction approach. Through extensive ablation experiments, we rigorously assess the effectiveness of each component. Our implementation includes comparative experiments on SeedBench, where Qwen2.5-7B-Instruct initially underperformed. ROGRAG significantly improves the score from 60.0% to 75.0% and outperforms mainstream methods. Experiments on domain-specific datasets reveal that dual-level retrieval enhances fuzzy matching, while logic form retrieval improves structured reasoning, highlighting the importance of multi-stage retrieval.ROGRAG is released as an open-source resource and supports installation with pip.

LGMay 15, 2024
An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding

Renqi Chen, Wenwei Han, Haohao Zhang et al.

Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently revolve around employing statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers by deep learning. However, as crop datasets are commonly long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of attention mechanism for the task of interest, we propose a simple yet effective Transformer-based framework that enables end-to-end training of the whole sequence. Via experiments on rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, Transformer can achieve overall superior performance against seminal methods on GS tasks of interest.

MAOct 6, 2021
Scalable Multi-Agent Reinforcement Learning for Residential Load Scheduling under Data Governance

Zhaoming Qin, Nanqing Dong, Di Liu et al.

As a data-driven approach, multi-agent reinforcement learning (MARL) has made remarkable advances in solving cooperative residential load scheduling problems. However, centralized training, the most common paradigm for MARL, limits large-scale deployment in communication-constrained cloud-edge environments. As a remedy, distributed training shows unparalleled advantages in real-world applications but still faces challenge with system scalability, e.g., the high cost of communication overhead during coordinating individual agents, and needs to comply with data governance in terms of privacy. In this work, we propose a novel MARL solution to address these two practical issues. Our proposed approach is based on actor-critic methods, where the global critic is a learned function of individual critics computed solely based on local observations of households. This scheme preserves household privacy completely and significantly reduces communication cost. Simulation experiments demonstrate that the proposed framework achieves comparable performance to the state-of-the-art actor-critic framework without data governance and communication constraints.