SEMay 6Code
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for CodeKaifeng He, Xiaojun Zhang, Peiliang Cai et al.
Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From-Data-to-Code.
SEJan 10, 2023
Understanding the Complexity and Its Impact on Testing in ML-Enabled SystemsJunming Cao, Bihuan Chen, Longjie Hu et al.
Machine learning (ML) enabled systems are emerging with recent breakthroughs in ML. A model-centric view is widely taken by the literature to focus only on the analysis of ML models. However, only a small body of work takes a system view that looks at how ML components work with the system and how they affect software engineering for MLenabled systems. In this paper, we adopt this system view, and conduct a case study on Rasa 3.0, an industrial dialogue system that has been widely adopted by various companies around the world. Our goal is to characterize the complexity of such a largescale ML-enabled system and to understand the impact of the complexity on testing. Our study reveals practical implications for software engineering for ML-enabled systems.
CVNov 26, 2025
The Age-specific Alzheimer 's Disease Prediction with Characteristic Constraints in Nonuniform Time SpanXin Hong, Kaifeng Huang
Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.
CVFeb 23, 2025Code
Learning from Rendering: Realistic and Controllable Extreme Rainy Image Synthesis for Autonomous Driving SimulationKaibin Zhou, Kaifeng Huang, Hao Deng et al.
Autonomous driving simulators provide an effective and low-cost alternative for evaluating or enhancing visual perception models. However, the reliability of evaluation depends on the diversity and realism of the generated scenes. Extreme weather conditions, particularly extreme rainfalls, are rare and costly to capture in real-world settings. While simulated environments can help address this limitation, existing rainy image synthesizers often suffer from poor controllability over illumination and limited realism, which significantly undermines the effectiveness of the model evaluation. To that end, we propose a learning-from-rendering rainy image synthesizer, which combines the benefits of the realism of rendering-based methods and the controllability of learning-based methods. To validate the effectiveness of our extreme rainy image synthesizer on semantic segmentation task, we require a continuous set of well-labeled extreme rainy images. By integrating the proposed synthesizer with the CARLA driving simulator, we develop CARLARain an extreme rainy street scene simulator which can obtain paired rainy-clean images and labels under complex illumination conditions. Qualitative and quantitative experiments validate that CARLARain can effectively improve the accuracy of semantic segmentation models in extreme rainy scenes, with the models' accuracy (mIoU) improved by 5% - 8% on the synthetic dataset and significantly enhanced in real extreme rainy scenarios under complex illuminations. Our source code and datasets are available at https://github.com/kb824999404/CARLARain/.
SEDec 4, 2021Code
Tracking Patches for Open Source Software VulnerabilitiesCongying Xu, Bihuan Chen, Chenhao Lu et al.
Open source software (OSS) vulnerabilities threaten the security of software systems that use OSS. Vulnerability databases provide valuable information (e.g., vulnerable version and patch) to mitigate OSS vulnerabilities. There arises a growing concern about the information quality of vulnerability databases. However, it is unclear what the quality of patches in existing vulnerability databases is; and existing manual or heuristic-based approaches for patch tracking are either too expensive or too specific to apply to all OSS vulnerabilities.
SEFeb 25, 2020Code
An Empirical Study of Usages, Updates and Risks of Third-Party Libraries in Java ProjectsYing Wang, Bihuan Chen, Kaifeng Huang et al.
Third-party libraries are a central building block to develop software systems. However, outdated third-party libraries are commonly used, and developers are usually less aware of the potential risks. Therefore, a quantitative and holistic study on usages, updates and risks of third-party libraries can provide practical insights to improve the ecosystem sustainably. In this paper, we conduct such a study in the Java ecosystem. Specifically, we conduct a library usage analysis (e.g., usage intensity and outdatedness) and a library update analysis (e.g., update intensity and delay) using 806 open-source projects. The two analyses aim to quantify usage and update practices holistically from the perspective of both open-source projects and third-party libraries. Then, we conduct a library risk analysis (e.g., potential risk and developer response) in terms of bugs with 15 popularly-used third-party libraries. This analysis aims to quantify the potential risk of using outdated libraries and the developer response to the risk. Our findings from the three analyses provide practical insights to developers and researchers on problems and potential solutions in maintaining third-party libraries (e.g., smart alerting and automated updating of outdated libraries). To demonstrate the usefulness of our findings, we propose a bug-driven alerting system for assisting developers to make confident decisions in updating third-party library versions. We have released our dataset to foster valuable applications and improve the ecosystem.
CRNov 3, 2025
Panther: A Cost-Effective Privacy-Preserving Framework for GNN Training and Inference Services in Cloud EnvironmentsCongcong Chen, Xinyu Liu, Kaifeng Huang et al.
Graph Neural Networks (GNNs) have marked significant impact in traffic state prediction, social recommendation, knowledge-aware question answering and so on. As more and more users move towards cloud computing, it has become a critical issue to unleash the power of GNNs while protecting the privacy in cloud environments. Specifically, the training data and inference data for GNNs need to be protected from being stolen by external adversaries. Meanwhile, the financial cost of cloud computing is another primary concern for users. Therefore, although existing studies have proposed privacy-preserving techniques for GNNs in cloud environments, their additional computational and communication overhead remain relatively high, causing high financial costs that limit their widespread adoption among users. To protect GNN privacy while lowering the additional financial costs, we introduce Panther, a cost-effective privacy-preserving framework for GNN training and inference services in cloud environments. Technically, Panther leverages four-party computation to asynchronously executing the secure array access protocol, and randomly pads the neighbor information of GNN nodes. We prove that Panther can protect privacy for both training and inference of GNN models. Our evaluation shows that Panther reduces the training and inference time by an average of 75.28% and 82.80%, respectively, and communication overhead by an average of 52.61% and 50.26% compared with the state-of-the-art, which is estimated to save an average of 55.05% and 59.00% in financial costs (based on on-demand pricing model) for the GNN training and inference process on Google Cloud Platform.
SEApr 2
Automated Functional Testing for Malleable Mobile Application Driven from User IntentYuying Wang, Kaifeng Huang, Hao Deng et al.
Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose \tool, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that \tool effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.
AIFeb 11
See, Plan, Snap: Evaluating Multimodal GUI Agents in ScratchXingyi Zhang, Yulei Ye, Kaifeng Huang et al.
Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning--acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
SEJul 8, 2025
TigAug: Data Augmentation for Testing Traffic Light Detection in Autonomous Driving SystemsYou Lu, Dingji Wang, Kaifeng Huang et al.
Autonomous vehicle technology has been developed in the last decades with recent advances in sensing and computing technology. There is an urgent need to ensure the reliability and robustness of autonomous driving systems (ADSs). Despite the recent achievements in testing various ADS modules, little attention has been paid on the automated testing of traffic light detection models in ADSs. A common practice is to manually collect and label traffic light data. However, it is labor-intensive, and even impossible to collect diverse data under different driving environments. To address these problems, we propose and implement TigAug to automatically augment labeled traffic light images for testing traffic light detection models in ADSs. We construct two families of metamorphic relations and three families of transformations based on a systematic understanding of weather environments, camera properties, and traffic light properties. We use augmented images to detect erroneous behaviors of traffic light detection models by transformation-specific metamorphic relations, and to improve the performance of traffic light detection models by retraining. Large-scale experiments with four state-of-the-art traffic light detection models and two traffic light datasets have demonstrated that i) TigAug is effective in testing traffic light detection models, ii) TigAug is efficient in synthesizing traffic light images, and iii) TigAug generates traffic light images with acceptable naturalness.
SEFeb 25, 2020
Interactive, Effort-Aware Library Version HarmonizationKaifeng Huang, Bihuan Chen, Bowen Shi et al.
As a mixed result of intensive dependency on third-party libraries, flexible mechanism to declare dependencies, and increased number of modules in a project, multiple versions of the same third-party library are directly depended in different modules of a project. Such library version inconsistencies can increase dependency maintenance cost, or even lead to dependency conflicts when modules are inter-dependent. Although automated build tools (e.g., Maven's enforcer plugin) provide partial support to detect library version inconsistencies, they do not provide any support to harmonize inconsistent library versions. We first conduct a survey with 131 Java developers from GitHub to retrieve first-hand information about the root causes, detection methods, reasons for fixing or not fixing, fixing strategies, fixing efforts, and tool expectations on library version inconsistencies. Then, based on the insights from our survey, we propose LibHarmo, an interactive, effort-aware library version harmonization technique, to detect library version inconsistencies, interactively suggest a harmonized version with the least harmonization efforts based on library API usage analysis, and refactor build configuration files. LibHarmo is currently developed for Java Maven projects. Our experimental study on 443 highly-starred Java Maven projects from GitHub indicates that i) LibHarmo identifies 621 library version inconsistencies covering 152 (34.3%) of projects, and ii) the average harmonization efforts are that 1 and 12 library API calls are affected, respectively due to the deleted and changed library APIs in the harmonized version. 5 library version inconsistencies have been confirmed, and 1 of them has been already harmonized by developers.