Yiran Cheng

SE
h-index: 20
4 papers
3 citations
Novelty: 38%
AI Score: 43

4 Papers

53.3 SE May 18
Mapping NVD Records to Their Vulnerability-fixing Commits: How Hard is It?

Huu Hung Nguyen, Ting Zhang, Duc Manh Tran et al.

Mapping National Vulnerability Database (NVD) records to vulnerability-fixing commits (VFCs) is crucial for vulnerability analysis but challenging due to sparse explicit links in NVD references. This study explores this mapping's feasibility through an empirical approach. Manual analysis of NVD references showed Git references enable over 86% success, while non-Git references achieve under 14%. Using these findings, we built an automated pipeline extracting 31,942 VFCs from 20,360 NVD records (8.7% of 235,341) with 87% precision, mainly from Git references. To fill gaps, we mined six external security databases, yielding 29,254 VFCs for 18,985 records (8.1%) at 88.4% precision, and GitHub repositories, adding 3,686 VFCs for 2,795 records (1.2%) at 73% precision. Combining these, we mapped 26,710 unique records (11.3% coverage) from 7,634 projects, with overlap between NVD and external databases, plus unique GitHub contributions. Despite success with Git references, 88.7% of records remain unmapped, highlighting the difficulty without Git links. This study offers insights for enhancing vulnerability datasets and guiding future automated security research.
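As a rough illustration of the Git-reference step described in this abstract, the sketch below pulls commit identifiers out of reference URLs and recomputes a coverage percentage. It is not the authors' pipeline: the regular expression (GitHub-style commit URLs only), the function names `extract_commits` and `coverage`, and the example URLs are all assumptions made for illustration. Only the record counts come from the abstract.

```python
import re

# Assumed pattern: matches GitHub-style commit URLs only.
# Other Git hosts would need their own patterns.
COMMIT_URL = re.compile(
    r"https?://github\.com/(?P<owner>[^/\s]+)/(?P<repo>[^/\s]+)"
    r"/commit/(?P<sha>[0-9a-f]{7,40})\b"
)


def extract_commits(reference_urls):
    """Return (owner, repo, sha) for each reference that links directly to a commit."""
    commits = []
    for url in reference_urls:
        match = COMMIT_URL.search(url)
        if match:
            commits.append((match["owner"], match["repo"], match["sha"]))
    return commits


def coverage(mapped_records, total_records):
    """Percentage of records that have at least one mapped commit."""
    return 100.0 * mapped_records / total_records


# Hypothetical references: one direct commit link, one non-Git advisory page.
refs = [
    "https://github.com/example-org/example-repo/commit/"
    "0123456789abcdef0123456789abcdef01234567",
    "https://example.com/advisories/12345",
]
print(extract_commits(refs))

# Record counts quoted in the abstract: 20,360 mapped out of 235,341.
print(round(coverage(20_360, 235_341), 1))
```

Non-Git references such as the second URL yield nothing here, which is consistent with the abstract's point that mapping is hard without explicit Git links.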

CR Jan 9 Code
Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs

Honghao Liu, Xuhui Jiang, Chengjin Xu et al.

Preserving privacy in sensitive data while pretraining large language models on small, domain-specific corpora presents a significant challenge. In this work, we take an exploratory step toward privacy-preserving continual pretraining by proposing an entity-based framework that synthesizes encrypted training data to protect personally identifiable information (PII). Our approach constructs a weighted entity graph to guide data synthesis and applies deterministic encryption to PII entities, enabling LLMs to encode new knowledge through continual pretraining while granting authorized access to sensitive data through decryption keys. Our results on limited-scale datasets demonstrate that our pretrained models outperform base models and ensure PII security, while exhibiting a modest performance gap compared to models trained on unencrypted synthetic data. We further show that increasing the number of entities and leveraging graph-based synthesis improve model performance, and that encrypted models retain instruction-following capabilities with long retrieved contexts. We discuss the security implications and limitations of deterministic encryption, positioning this work as an initial investigation into the design space of encrypted data pretraining for privacy-preserving LLMs. Our code is available at https://github.com/DataArcTech/SoE.
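The key property of the deterministic step described here is that the same PII entity always maps to the same opaque token, so a model can still learn consistent relations between entities. The sketch below shows only that property, using a simplified stand-in rather than the paper's scheme: it derives tokens with a keyed HMAC, which is one-way, and reverses them through a lookup table kept by the key holder instead of true decryption. The class `DeterministicPIIMasker`, the helper `mask_text`, and the `<PII_...>` token format are hypothetical names chosen for illustration.

```python
import hashlib
import hmac
import re


class DeterministicPIIMasker:
    """Toy stand-in for deterministic encryption: same key and entity give the same token."""

    def __init__(self, key: bytes):
        self._key = key
        self._vault = {}  # token -> original entity, held only by the key owner

    def mask(self, entity: str) -> str:
        digest = hmac.new(self._key, entity.encode("utf-8"), hashlib.sha256).hexdigest()
        token = f"<PII_{digest[:16]}>"
        self._vault[token] = entity
        return token

    def unmask(self, token: str) -> str:
        # Authorized reversal via lookup; HMAC itself cannot be decrypted.
        return self._vault[token]


def mask_text(text, entities, masker):
    """Replace every listed entity in a single pass, longest entities first."""
    if not entities:
        return text
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    return pattern.sub(lambda m: masker.mask(m.group(0)), text)


masker = DeterministicPIIMasker(key=b"demo-key-not-for-production")
masked = mask_text("Alice met Bob. Alice left early.", ["Alice", "Bob"], masker)
print(masked)
print(masker.unmask(masker.mask("Alice")))
```

Determinism is also the main weakness the abstract alludes to: identical tokens reveal that two mentions refer to the same entity, which allows frequency analysis.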

CV Dec 25, 2025
Resolving compositional and conformational heterogeneity in cryo-EM with deformable 3D Gaussian representations

Bintao He, Yiran Cheng, Hongjia Li et al.

Understanding protein flexibility and its dynamic interactions with other molecules is essential for studying protein function. Although cryogenic electron microscopy (cryo-EM) provides an opportunity to observe macromolecular dynamics directly, computational analysis of datasets mixing continuous and discrete structural states remains a formidable challenge. Here we introduce GaussianEM, a Gaussian-based pseudo-atomic framework that simultaneously resolves compositional and conformational heterogeneity from cryo-EM images. GaussianEM employs a dual-encoder-single-decoder architecture to decompose images into learnable Gaussian components, with variability encoded through modulated parameters. This explicit parameterization yields a continuous, intuitive representation of conformational dynamics that inherently preserves local structural integrity. By modeling displacements in Gaussian space, we capture atomic-scale conformational landscapes, bridging density maps and all-atom models. In comprehensive experiments, GaussianEM successfully reconstructs complex compositional and conformational variability, and resolves previously unobserved details in public datasets. Quantitative evaluations further confirm its ability to capture broader conformational diversity without sacrificing structural fidelity.

SE Mar 2, 2025
Towards Reliable LLM-Driven Fuzz Testing: Vision and Road Ahead

Yiran Cheng, Hong Jin Kang, Lwin Khin Shar et al.

Fuzz testing is a crucial component of software security assessment, yet its effectiveness heavily relies on valid fuzz drivers and diverse seed inputs. Recent advancements in Large Language Models (LLMs) offer transformative potential for automating fuzz testing (LLM4Fuzz), particularly in generating drivers and seeds. However, current LLM4Fuzz solutions face critical reliability challenges, including low driver validity rates and seed quality trade-offs, hindering their practical adoption. This paper examines the reliability bottlenecks of LLM-driven fuzzing and explores potential research directions to address these limitations. It begins with an overview of the current development of LLM4SE and emphasizes the necessity of developing reliable LLM4Fuzz solutions. Following this, the paper presents a vision in which reliable LLM4Fuzz transforms the landscape of software testing and security for industry, software development practitioners, and economic accessibility. It then outlines a road ahead for future research, identifying key challenges and offering specific suggestions for researchers to consider. This work strives to spark innovation in the field, positioning reliable LLM4Fuzz as a fundamental component of modern software testing.