Chunyan Li

AI
h-index20
4papers
12citations
Novelty43%
AI Score36

4 Papers

CLMar 20, 2024
Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

Peng Zhou, Jianmin Wang, Chunyan Li et al.

While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratio of 82.58%, 68.03%, and 67.48%, respectively. The model also exhibits adaptability through zero-shot testing, creating molecules that satisfy combinations of properties that have not been encountered. It can comprehend text inputs with various language styles, extending beyond the confines of outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in the domains of drug discovery and materials science.

AIOct 28, 2025
BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data

Bingsen Qiu, Zijian Liu, Xiao Liu et al.

Building training-ready multi-hop question answering (QA) datasets that truly stress a model's retrieval and reasoning abilities remains highly challenging recently. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems -- requiring the integration of ambiguous, indirect, and cross-domain cues -- these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions -- where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence -- incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.

CVSep 16, 2025
SHREC 2025: Protein surface shape retrieval including electrostatic potential

Taher Yacoub, Camille Depenveiller, Atsushi Tatsuma et al.

This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.

NAJul 13, 2025
Energy Dissipation Rate Guided Adaptive Sampling for Physics-Informed Neural Networks: Resolving Surface-Bulk Dynamics in Allen-Cahn Systems

Chunyan Li, Wenkai Yu, Qi Wang

We introduce the Energy Dissipation Rate guided Adaptive Sampling (EDRAS) strategy, a novel method that substantially enhances the performance of Physics-Informed Neural Networks (PINNs) in solving thermodynamically consistent partial differential equations (PDEs) over arbitrary domains. EDRAS leverages the local energy dissipation rate density as a guiding metric to identify and adaptively re-sample critical collocation points from both the interior and boundary of the computational domain. This dynamical sampling approach improves the accuracy of residual-based PINNs by aligning the training process with the underlying physical structure of the system. In this study, we demonstrate the effectiveness of EDRAS using the Allen-Cahn phase field model in irregular geometries, achieving up to a sixfold reduction in the relative mean square error compared to traditional residual-based adaptive refinement (RAR) methods. Moreover, we compare EDRAS with other residual-based adaptive sampling approaches and show that EDRAS is not only computationally more efficient but also more likely to identify high-impact collocation points. Through numerical solutions of the Allen-Cahn equation with both static (Neumann) and dynamic boundary conditions in 2D disk- and ellipse-shaped domains solved using PINN coupled with EDRAS, we gain significant insights into how dynamic boundary conditions influence bulk phase evolution and thermodynamic behavior. The proposed approach offers an effective, physically informed enhancement to PINN frameworks for solving thermodynamically consistent models, making PINN a robust and versatile computational tool for investigating complex thermodynamic processes in arbitrary geometries.