SEMay 25, 2022
Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code CorporaAnh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung et al.
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples from neural code models in classification-based tasks. This approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
CLAug 5, 2024
XMainframe: A Large Language Model for Mainframe ModernizationAnh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen et al.
Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.
SEApr 5Code
COBOL-Coder: Domain-Adapted Large Language Models for COBOL Code Generation and TranslationAnh T. V. Dau, Shin Hwei Tan, Jinqiu Yang et al.
COBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL-Java translation). In our experiments, COBOL-Coder achieves up to a 73.95 percent compilation success rate and 49.33 Pass-1 on COBOLEval, compared to 41.8 percent and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches 34.93 Pass-1, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, has more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.
SEApr 5
COBOLAssist: Analyzing and Fixing Compilation Errors for LLM-Powered COBOL Code GenerationAnh T. V. Dau, Shin Hwei Tan, Jinqiu Yang et al.
Legacy programming languages such as COBOL (Common Business-Oriented Language) remain critical in business computing. However, maintaining legacy COBOL systems is increasingly challenging due to a declining pool of skilled developers and the persistence of COBOL errors that require deep domain expertise to resolve. This paper investigates the challenges of COBOL compilation errors and introduces a framework leveraging large language models (LLMs) to address these issues. We first categorize the common compilation errors in LLM-generated COBOL code into three groups: incomplete code errors, syntax errors, and type-related errors. We further propose COBOLAssist, a technique to enhance code correctness through iterative repairs guided by compilation feedback. Our evaluation using five LLMs including GPT variants and mAInframer, shows a high prevalence of incorrect program structures and function usage in COBOL programs and demonstrates the effectiveness of COBOLAssist, with the compilation success rates increasing from 29.5\% to 64.38\% for GPT-4o-mini and from 41.8\% to 95.89\% for GPT-4o. It also improves pass@1 significantly, for example from 9.1 to 22.6 for GPT-4. Notably, while mAInframer-34B achieves the highest compilation success rate, its functional correctness remains limited. This research not only highlights the limitations in current LLMs for COBOL but also demonstrates a practical path forward for automated debugging in legacy systems.
CLMay 9, 2023
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and GenerationDung Nguyen Manh, Nam Le Hai, Anh T. V. Dau et al.
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
CLMay 2, 2023
Class based Influence Functions for Error DetectionThang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran et al.
Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.