Filipe R. Cogo

SE
h-index18
8papers
92citations
Novelty34%
AI Score37

8 Papers

SEMar 25, 2024Code
Output Format Biases in the Evaluation of Large Language Models for Code Translation

Marcos Macedo, Yuan Tian, Filipe R. Cogo et al.

Code translation between programming languages (PLs) is a critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. Most existing studies instruct LLMs to perform code translation and evaluate their performance by either running the generated outputs through test suites or comparing them to reference outputs (ground truth). These outputs, however, may contain not only executable source code but also additional non-code elements, such as natural language explanations or formatting tokens. We refer to the combination of source code and non-code elements as the output format. It is crucial to understand and address variations in output format, as non-code elements can interfere with evaluation metrics, resulting in biased assessments of model performance and comparisons. We conduct an empirical analysis of the outputs from eleven instruct-tuned open-source LLMs, across five PLs: C, C++, Go, Java, and Python. The results show that between 26.4% and 73.7% of outputs produced by our evaluated LLMs necessitate post-processing. To mitigate output format bias, we propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format outputs, enabling the eleven open-source models to achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our empirical study confirms that output format bias affects widely used execution-based metrics, i.e., Computational Accuracy (CA), and text-based metrics, i.e., BLEU, CodeBLEU and CrystalBLEU. Additionally, we test five closed-source LLMs and observe that they also generate varying distributions of output formats, which could lead to output format biases. Our results highlight the need to mitigate the output format bias to enable reliable evaluations in LLMs for code translation.

SENov 1, 2024Code
InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation

Marcos Macedo, Yuan Tian, Pengyu Nie et al.

Code translation aims to convert a program from one programming language (PL) to another. This long-standing software engineering task is crucial for modernizing legacy systems, ensuring cross-platform compatibility, enhancing performance, and more. However, automating this process remains challenging due to many syntactic and semantic differences between PLs. Recent studies show that even advanced techniques such as large language models (LLMs), especially open-source LLMs, still struggle with the task. Currently, code LLMs are trained with source code from multiple programming languages, thus presenting multilingual capabilities. In this paper, we investigate whether such multilingual capabilities can be harnessed to enhance code translation. To achieve this goal, we introduce InterTrans, an LLM-based automated code translation approach that, in contrast to existing approaches, leverages intermediate translations across PLs to bridge the syntactic and semantic gaps between source and target PLs. InterTrans contains two stages. It first utilizes a novel Tree of Code Translation (ToCT) algorithm to plan transitive intermediate translation sequences between a given source and target PL, then validates them in a specific order. We evaluate InterTrans with three open LLMs on three benchmarks (i.e., CodeNet, HumanEval-X, and TransCoder) involving six PLs. Results show an absolute improvement between 18.3% to 43.3% in Computation Accuracy (CA) for InterTrans over Direct Translation with 10 attempts. The best-performing variant of InterTrans (with Magicoder LLM) achieved an average CA of 87.3%-95.4% on three benchmarks.

SESep 15, 2025Code
Understanding Prompt Management in GitHub Repositories: A Call for Best Practices

Hao Li, Hicham Masri, Filipe R. Cogo et al.

The rapid adoption of foundation models (e.g., large language models) has given rise to promptware, i.e., software built using natural language prompts. Effective management of prompts, such as organization and quality assurance, is essential yet challenging. In this study, we perform an empirical analysis of 24,800 open-source prompts from 92 GitHub repositories to investigate prompt management practices and quality attributes. Our findings reveal critical challenges such as considerable inconsistencies in prompt formatting, substantial internal and external prompt duplication, and frequent readability and spelling issues. Based on these findings, we provide actionable recommendations for developers to enhance the usability and maintainability of open-source prompts within the rapidly evolving promptware ecosystem.

SEFeb 11, 2022Code
Towards Build Verifiability for Java-based Systems

Jiawen Xiong, Yong Shi, Boyuan Chen et al.

Build verifiability refers to the property that the build of a software system can be verified by independent third parties and it is crucial for the trustworthiness of a software system. Various efforts towards build verifiability have been made to C/C++-based systems, yet the techniques for Java-based systems are not systematic and are often specific to a particular build tool (e.g., Maven). In this study, we present a systematic approach towards build verifiability on Java-based systems. Our approach consists of three parts: a unified build process, a tool that dynamically controls non-determinism during the build process, and another tool that eliminates non-equivalences by post-processing the build artifacts. We apply our approach on 46 unverified open source projects from Reproducible Central and 13 open source projects that are widely used by Huawei commercial products. As a result, 91% of the unverified Reproducible Central projects and 100% of the commercially adopted OSS projects are successfully verified with our approach. In addition, based on our experience in analyzing thousands of builds for both commercial and open source Java-based systems, we present 14 patterns that introduce non-equivalences in generated build artifacts and their respective mitigation strategies. Among these patterns, 11 (78%) are unique for Java-based system, whereas the remaining 3 (22%) are common with C/C++-based systems. The approach and the findings of this paper are useful for both practitioners and researchers who are interested in build verifiability.

SEFeb 25, 2024
Rethinking Software Engineering in the Foundation Model Era: A Curated Catalogue of Challenges in the Development of Trustworthy FMware

Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur et al.

Foundation models (FMs), such as Large Language Models (LLMs), have revolutionized software development by enabling new use cases and business models. We refer to software built using FMs as FMware. The unique properties of FMware (e.g., prompts, agents, and the need for orchestration), coupled with the intrinsic limitations of FMs (e.g., hallucination) lead to a completely new set of software engineering challenges. Based on our industrial experience, we identified 10 key SE4FMware challenges that have caused enterprise FMware development to be unproductive, costly, and risky. In this paper, we discuss these challenges in detail and state the path for innovation that we envision. Next, we present FMArts, which is our long-term effort towards creating a cradle-to-grave platform for the engineering of trustworthy FMware. Finally, we (i) show how the unique properties of FMArts enabled us to design and develop a complex FMware for a large customer in a timely manner and (ii) discuss the lessons that we learned in doing so. We hope that the disclosure of the aforementioned challenges and our associated efforts to tackle them will not only raise awareness but also promote deeper and further discussions, knowledge sharing, and innovative solutions across the software engineering discipline.

SEMay 15, 2025
The Hitchhikers Guide to Production-ready Trustworthy Foundation Model powered Software (FMware)

Kirill Vasilevski, Benjamin Rombaut, Gopi Krishnan Rajbahadur et al.

Foundation Models (FMs) such as Large Language Models (LLMs) are reshaping the software industry by enabling FMware, systems that integrate these FMs as core components. In this KDD 2025 tutorial, we present a comprehensive exploration of FMware that combines a curated catalogue of challenges with real-world production concerns. We first discuss the state of research and practice in building FMware. We further examine the difficulties in selecting suitable models, aligning high-quality domain-specific data, engineering robust prompts, and orchestrating autonomous agents. We then address the complex journey from impressive demos to production-ready systems by outlining issues in system testing, optimization, deployment, and integration with legacy software. Drawing on our industrial experience and recent research in the area, we provide actionable insights and a technology roadmap for overcoming these challenges. Attendees will gain practical strategies to enable the creation of trustworthy FMware in the evolving technology landscape.

SEFeb 8, 2022
Assessing the alignment between the information needs of developers and the documentation of programming languages: A case study on Rust

Filipe R. Cogo, Xin Xia, Ahmed E. Hassan

Programming language documentation refers to the set of technical documents that provide application developers with a description of the high-level concepts of a language. Such documentation is essential to support application developers in the effective use of a programming language. One of the challenges faced by documenters (i.e., personnel that produce documentation) is to ensure that documentation has relevant information that aligns with the concrete needs of developers. In this paper, we present an automated approach to support documenters in evaluating the differences and similarities between the concrete information need of developers and the current state of documentation (a problem that we refer to as the topical alignment of a programming language documentation). Our approach leverages semi-supervised topic modelling to assess the similarities and differences between the topics of Q&A posts and the official documentation. To demonstrate the application of our approach, we perform a case study on the documentation of Rust. Our results show that there is a relatively high level of topical alignment in Rust documentation. Still, information about specific topics is scarce in both the Q&A websites and the documentation, particularly related topics with programming niches such as network, game, and database development. For other topics (e.g., related topics with language features such as structs, patterns and matchings, and foreign function interface), information is only available on Q&A websites while lacking in the official documentation. Finally, we discuss implications for programming language documenters, particularly how to leverage our approach to prioritize topics that should be added to the documentation.

SEJan 27, 2022
An Empirical Study of Yanked Releases in the Rust Package Registry

Hao Li, Filipe R. Cogo, Cor-Paul Bezemer

Cargo, the software packaging manager of Rust, provides a yank mechanism to support release-level deprecation, which can prevent packages from depending on yanked releases. Most prior studies focused on code-level (i.e., deprecated APIs) and package-level deprecation (i.e., deprecated packages). However, few studies have focused on release-level deprecation. In this study, we investigate how often and how the yank mechanism is used, the rationales behind its usage, and the adoption of yanked releases in the Cargo ecosystem. Our study shows that 9.6% of the packages in Cargo have at least one yanked release, and the proportion of yanked releases kept increasing from 2014 to 2020. Package owners yank releases for other reasons than withdrawing a defective release, such as fixing a release that does not follow semantic versioning or indicating a package is removed or replaced. In addition, we found that 46% of the packages directly adopted at least one yanked release and the yanked releases propagated through the dependency network, which leads to 1.4% of the releases in the ecosystem having unresolved dependencies.