Wesley K. G. Assunção

4 Papers

SE · Mar 16 · Code
Test Code Review in the Era of GitHub Actions: A Replication Study

Hui Sun, Yinan Wu, Wesley K. G. Assunção et al.

Test code is indispensable in software development, ensuring the correctness of production code and supporting maintainability. Nonetheless, errors or omissions in the test code can conceal production defects. While code review is widely adopted to assess code quality and correctness, little research has examined how test code is reviewed. Spadini et al.'s research on Gerrit (a pre-commit review model) found that test code receives significantly less discussion than production code. However, the most popular review model is currently based on pull requests (PRs), in which contributors propose changes for discussion and approval, a more negotiable and flexible model than Gerrit's. Furthermore, GitHub Actions (GHA) has become widely used to automate pre-checks and testing, potentially impacting review practices. This leads us to ask whether Spadini et al.'s findings still hold for the PR model in the era of GHA. We replicate and extend their study, focusing on GitHub PRs and analyzing six open-source projects to investigate the impact of the PR model and GHA on test code review. Our results show that GitHub's PR model fosters more balanced discussions between test and production files than Gerrit, albeit with lower overall comment density. However, despite cross-project heterogeneity, GHA adoption triggered a sharp pivot toward production code. Post-GHA, for PRs involving tests, both review probability and comment density reached a median of zero. These findings reveal how evolving continuous integration pipelines can marginalize test code review. The observed decline in test-centric discussion under GHA warrants concern regarding long-term software quality. Our work also presents recommendations for stakeholders involved in the software development life cycle.

SE · Apr 24, 2025
Seamless Data Migration between Database Schemas with DAMI-Framework: An Empirical Study on Developer Experience

Delfina Ramos-Vidal, Alejandro Cortiñas, Miguel R. Luaces et al.

Many businesses depend on legacy systems, which often use outdated technology that complicates maintenance and updates. Therefore, software modernization is essential, particularly data migration between different database schemas. Established methodologies, like model transformation and ETL tools, facilitate this migration; however, they require deep knowledge of database languages and of both the source and target schemas. This requirement makes data migration an error-prone and cognitively demanding task. Our objective is to alleviate developers' workloads during schema evolution through our DAMI-Framework. This framework incorporates a domain-specific language (DSL) and a parser to facilitate data migration between database schemas. DAMI-DSL simplifies schema mapping while the parser automates SQL script generation. We evaluate developer experience in an empirical study with 21 developers, comparing their experiences using our DSL versus traditional SQL. The study allows us to measure their perceptions of the DSL properties and user experience. The participants praised DAMI-DSL for its readability and ease of use. The findings indicate that our DSL reduces data migration effort compared to SQL scripts.

SE · Mar 19
Where are the Hidden Gems? Applying Transformer Models for Design Discussion Detection

Lawrence Arkoh, Daniel Feitosa, Wesley K. G. Assunção

Design decisions are at the core of software engineering and appear in Q&A forums, mailing lists, pull requests, issue trackers, and commit messages. Design discussions spanning a project's history provide valuable information for informed decision-making, such as refactoring and software modernization. Machine learning techniques have been used to detect design decisions in natural language discussions; however, their effectiveness is limited by the scarcity of labeled data and the high cost of annotation. Prior work adopted cross-domain strategies with traditional classifiers, training on one domain and testing on another. Despite the success of these strategies, transformer-based models, which often outperform traditional methods, remain largely unexplored in this setting. The goal of this work is to investigate the performance of transformer-based models (i.e., BERT, RoBERTa, XLNet, LaMini-Flan-T5-77M, and ChatGPT-4o-mini) for detecting design-related discussions. To this end, we conduct a conceptual replication of prior cross-domain studies while extending them with modern transformer architectures and addressing methodological issues in earlier work. The models were fine-tuned on Stack Overflow and evaluated on GitHub artifacts (i.e., pull requests, issues, and commits). BERT and RoBERTa show strong recall across domains, while XLNet achieves higher precision but lower recall. ChatGPT-4o-mini yields the highest recall and competitive overall performance, whereas LaMini-Flan-T5-77M provides a lightweight alternative with stronger precision but less balanced performance. We also evaluated similar-word injection for data augmentation, but unlike prior findings, it did not yield meaningful improvements. Overall, these results highlight both the opportunities and trade-offs of using modern language models for detecting design discussions.

SE · Apr 9
Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs

Kevin Lira, Baldoino Fonseca, Davy Baía et al.

Large Language Models (LLMs) have emerged as a promising approach to automated vulnerability detection. However, most prior studies have explored the use of LLMs to detect vulnerabilities only within single functions, and therefore overlook vulnerabilities that arise from data and control flows spanning multiple functions. Leveraging the context provided by callers and callees may help identify such vulnerabilities. This study empirically investigates the detection effectiveness, the inference cost, and the quality of explanations of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To that end, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function code-only, target function + callers, and target function + callees) and evaluating the four LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, and that Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that can generalize across codebases in multiple programming languages.