Amjed Tahir

SE
h-index: 24
19 papers
183 citations
Novelty: 27%
AI Score: 51

19 Papers

87.2 SE Apr 7 Code
Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

Syed Mohammad Kashif, Ruiyin Li, Peng Liang et al.

A new generation of AI coding tools, including AI-powered IDEs equipped with agentic capabilities, can generate code within the context of the project. These AI IDEs are increasingly perceived as capable of producing project-level code at scale. However, there is limited empirical evidence on the extent to which they can generate large-scale software systems and what design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large-scale projects and to evaluate the design quality of projects generated by Cursor. First, we propose a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated project descriptions. We generated 10 projects using Cursor with the FD-HITL framework across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects using two static analysis tools, CodeScene and SonarQube, to detect design issues. We identified 1,305 design issues categorized into 9 categories by CodeScene and 3,193 issues in 11 categories by SonarQube. Our findings show that (1) when used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues and Accessibility Issues; (4) these design issues violate design principles such as SRP, SoC, and DRY. The replication package is at https://github.com/Kashifraz/DIinAGP

SE Dec 28, 2025 Code
FasterPy: An LLM-based Code Execution Efficiency Optimization Framework

Yue Wu, Minghao Han, Ruiyin Li et al.

Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the rise of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we propose FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code. FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at https://github.com/WuYue22/fasterpy.

53.7 SE May 7
On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies

Ali Soltanian Fard Jahromi, Amjed Tahir, Peng Liang et al.

The security of AI-generated code remains a major obstacle to its widespread adoption. Although code generation models achieve strong performance on functional benchmarks, their outputs frequently contain bugs and security weaknesses that undermine their trustworthiness. Prior work has explored a range of approaches to mitigate security issues in AI-generated code, e.g., using static analysis-guided generation and prompt engineering. However, their effectiveness varies widely across models and settings. This paper presents a systematic investigation of strategies for hardening model-generated code against a list of Common Weakness Enumeration (CWE). We assess the extent to which these strategies improve security across models and programming languages, using fine-tuning and prompting approaches for model output refinement. Beyond the prevalence of security weaknesses, we analyse the severity of identified CWEs, their co-occurrence, and the unintended consequences of remediation (i.e., whether fixing certain weaknesses introduces new weaknesses elsewhere in the same code). Our results show that security improvements are highly strategy- and model-dependent. Although some approaches reduce specific classes of weaknesses, they often introduce new weaknesses as side effects of the fixes. Moreover, no strategy consistently eliminates weaknesses across all models and scenarios, highlighting the absence of a universally effective "bulletproof" solution for secure AI-generated code.

SE Dec 4, 2025
A Survey of Bugs in AI-Generated Code

Ruofan Gao, Amjed Tahir, Peng Liang et al.

Developers are widely using AI code-generation models, aiming to increase productivity and efficiency. However, there are also quality concerns regarding the AI-generated code. The generated code is produced by models trained on publicly available code, which is known to contain bugs and quality issues. Those issues can cause trust and maintenance challenges during the development process. Several quality issues associated with AI-generated code have been reported, including bugs and defects. However, these findings are often scattered and lack a systematic summary. A comprehensive review is currently lacking to reveal the types and distribution of these errors, possible remediation strategies, as well as their correlation with the specific models. In this paper, we systematically analyze the existing AI-generated code literature to establish an overall understanding of bugs and defects in generated code, providing a reference for future model improvement and quality assessment. We aim to understand the nature and extent of bugs in AI-generated code, and provide a classification of bug types and patterns present in code generated by different models. We also discuss possible fixes and mitigation strategies adopted to eliminate bugs from the generated code.

41.6 SE May 23
From Prompting to Verification: How Experience Shapes Vibe Coding Practices

Ahmed Fawzy, Amjed Tahir, Kelly Blincoe

AI code generation tools have expanded software creation beyond professional developers, giving rise to vibe coding, a practice in which users generate software via natural-language prompts and evaluate outputs primarily by execution. Prior work has examined how AI code generation tools support programming tasks within specific user groups, typically professional developers, leaving open the question of how vibe coding practices differ across experience levels. We address this gap by surveying 162 vibe coders belonging to three user experience groups: non-coders, novices, and professional developers. Our results show that experience selectively shapes vibe coding. Reported experiences and perceptions of code quality are broadly similar across groups, with all three recognising both the strengths and limitations of vibe coding. In contrast, motivations, interaction styles, and quality assurance practices diverge with experience. Non-coders are most motivated by accessibility, novices emphasise learning and experimentation, and professionals use vibe coding more frequently in work-related contexts. We synthesise these findings as a perception-action gap: a general awareness of risks in AI-generated code is broadly distributed, but the capacity to evaluate, debug, and verify remains experience-dependent. We show that vibe coding is partially democratising as it broadens access to software creation without equally distributing the expertise to evaluate it.

49.5 SE Mar 10
The Future of Software Engineering Conferences: A New Zealand Perspective

Kelly Blincoe, Sherlock A. Licorish, Judith Fuchs et al.

Software engineering (SE) conferences are vital for knowledge exchange and collaboration, yet can also involve significant barriers for researchers in geographically distant regions such as New Zealand. We identify barriers such as high travel costs, misaligned academic calendars, and limited representation, and propose strategies including hybrid participation, cost-conscious venues, and governance reforms. We make recommendations to promote equitable global participation and strengthen the SE research community.

SE Apr 2, 2021 Code
Feature Evolution and Reuse -- An Exploratory Study of Eclipse

Amjed Tahir, Sherlock A. Licorish, Stephen G. MacDonell

One of the purported ways to increase productivity and reduce development time is to reuse existing features and modules. If reuse is adopted, logically then, it will have a direct impact on a system's evolution. However, the evidence in the literature is not clear on the extent to which reuse is practiced in real-world projects, nor how it is practiced. In this paper we report the results of an investigation of reuse and evolution of software features in one of the largest open-source ecosystems - Eclipse. Eclipse provides a leading example of how a system can grow dramatically in size and number of features while maintaining its quality. Our results demonstrate the extent of feature reuse and evolution and also patterns of reuse across ten different Eclipse releases (from Europa to Neon).

SE Mar 27, 2021 Code
An empirical study into the relationship between class features and test smells

Amjed Tahir, Steve Counsell, Stephen G. MacDonell

While a substantial body of prior research has investigated the form and nature of production code, comparatively little attention has been paid to the characteristics of test code, and, in particular, test smells in that code. In this paper, we explore the relationship between production code properties (at the class level) and a set of test smells, in five open source systems. Specifically, we examine whether complexity properties of a production class can be used as predictors of the presence of test smells in the associated unit test. Our results, derived from the analysis of 975 production class-unit test pairs, show that the Cyclomatic Complexity (CC) and Weighted Methods per Class (WMC) of production classes are strong indicators of the presence of smells in their associated unit tests. The Lack of Cohesion of Methods in a production class (LCOM) also appears to be a good indicator of the presence of test smells. Perhaps more importantly, all three metrics appear to be good indicators of particular test smells, especially Eager Test and Duplicated Code. The Depth of the Inheritance Tree (DIT), on the other hand, was not found to be significantly related to the incidence of test smells. The results have important implications for large-scale software development, particularly in a context where organizations are increasingly using, adopting or adapting open source code as part of their development strategy and need to ensure that classes and methods are kept as simple as possible.

SE Mar 12, 2021 Code
Combining Dynamic Analysis and Visualization to Explore the Distribution of Unit Test Suites

Amjed Tahir, Stephen G. MacDonell

As software systems have grown in scale and complexity the test suites built alongside those systems have also become increasingly complex. Understanding key aspects of test suites, such as their coverage of production code, is important when maintaining or reengineering systems. This work investigates the distribution of unit tests in Open Source Software (OSS) systems through the visualization of data obtained from both dynamic and static analysis. Our long-term aim is to support developers in their understanding of test distribution and the relationship of tests to production code. We first obtain dynamic coupling information from five selected OSS systems and we then map the test and production code results. The mapping is shown in graphs that depict both the dependencies between classes and static test information. We analyze these graphs using Centrality metrics derived from graph theory and SNA. Our findings suggest that, for these five systems at least, unit test and dynamic coupling information 'do not match', in that unit tests do not appear to be distributed in line with the systems' dynamic coupling. We contend that, by mapping dynamic coupling data onto unit test information, and through the use of software metrics and visualization, we can locate central system classes and identify to which classes unit testing effort has (or has not) been dedicated.
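The mapping of coupling data onto test information described in this abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the class names, the edge list, the set of tested classes, and the choice of plain degree centrality over an undirected view of the graph are all assumptions made for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality over an undirected view of a dependency graph."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-calls
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical runtime call edges (caller class, callee class).
dynamic_edges = [
    ("OrderService", "Repository"),
    ("OrderService", "Validator"),
    ("InvoiceService", "Repository"),
    ("ReportJob", "Repository"),
]
# Hypothetical set of classes that have a dedicated unit test.
tested_classes = {"Validator", "ReportJob"}

centrality = degree_centrality(dynamic_edges)
# Rank classes from most to least central, breaking ties by name.
ranked = sorted(centrality, key=lambda c: (-centrality[c], c))
# The most central classes that no unit test targets directly.
untested_central = [c for c in ranked if c not in tested_classes][:2]
print(untested_central)  # ['Repository', 'OrderService']
```

In this toy graph the two most central classes have no dedicated test, which is the kind of mismatch between coupling and test distribution the abstract reports.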

SE Jan 29, 2024
An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors

Jiaxin Yu, Peng Liang, Yujia Fu et al.

Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. In this work, we conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of six LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses for the best-performing LLM to identify quality problems in its responses and factors influencing its performance. Our findings show that: (1) existing pre-trained LLMs have limited capability in security code review but significantly outperform the state-of-the-art static analysis tools; (2) GPT-4 performs best among all LLMs when provided with a CWE list for reference; (3) GPT-4 frequently generates responses that are verbose or do not comply with the task requirements given in the prompts; (4) GPT-4 is more adept at identifying security defects in code files with fewer tokens, containing functional logic, or written by developers with less involvement in the project.

SE Apr 23, 2025
On Developers' Self-Declaration of AI-Generated Code: An Analysis of Practices

Syed Mohammad Kashif, Peng Liang, Amjed Tahir

AI code generation tools have gained significant popularity among developers, who use them to assist in software development due to their capability to generate code. Existing studies mainly explored the quality, e.g., correctness and security, of AI-generated code, while in real-world software development, the prerequisite is to distinguish AI-generated code from human-written code, which emphasizes the need for developers to explicitly declare AI-generated code. To this end, this study intends to understand how developers self-declare AI-generated code and to explore the reasons why developers choose to self-declare or not. We conducted a mixed-methods study consisting of two phases. In the first phase, we mined GitHub repositories and collected 613 instances of AI-generated code snippets. In the second phase, we conducted a follow-up practitioners' survey, which received 111 valid responses. Our research revealed the practices followed by developers to self-declare AI-generated code. Most practitioners (76.6%) always or sometimes self-declare AI-generated code. In contrast, other practitioners (23.4%) noted that they never self-declare AI-generated code. The reasons for self-declaring AI-generated code include the need to track and monitor the code for future review and debugging, and ethical considerations. The reasons for not self-declaring AI-generated code include extensive modifications to AI-generated code and the developers' perception that self-declaration is an unnecessary activity. Finally, we provide guidelines for practitioners to self-declare AI-generated code, addressing ethical and code quality concerns.

SE Jun 8, 2021
Does class size matter? An in-depth assessment of the effect of class size in software defect prediction

Amjed Tahir, Kwabena E. Bennin, Xun Xiao et al.

In the past 20 years, defect prediction studies have generally acknowledged the effect of class size on software prediction performance. To quantify the relationship between object-oriented (OO) metrics and defects, modelling has to take into account the direct, and potentially indirect, effects of class size on defects. However, some studies have shown that size cannot be simply controlled or ignored, when building prediction models. As such, there remains a question whether, and when, to control for class size. This study provides a new in-depth examination of the impact of class size on the relationship between OO metrics and software defects or defect-proneness. We assess the impact of class size on the number of defects and defect-proneness in software systems by employing a regression-based mediation (with bootstrapping) and moderation analysis to investigate the direct and indirect effect of class size in count and binary defect prediction. Our results show that the size effect is not always significant for all metrics. Of the seven OO metrics we investigated, size consistently has significant mediation impact only on the relationship between Coupling Between Objects (CBO) and defects/defect-proneness, and a potential moderation impact on the relationship between Fan-out and defects/defect-proneness. Based on our results we make three recommendations. One, we encourage researchers and practitioners to examine the impact of class size for the specific data they have in hand and through the use of the proposed statistical mediation/moderation procedures. Two, we encourage empirical studies to investigate the indirect effect of possible additional variables in their models when relevant. Three, the statistical procedures adopted in this study could be used in other empirical software engineering research to investigate the influence of potential mediators/moderators.
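The regression-based mediation with bootstrapping mentioned in this abstract can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes a single OO metric `x`, class size `m` as the mediator, and a continuous defect measure `y`, fits both regressions by ordinary least squares, and uses a percentile bootstrap. The data are synthetic and all function names are hypothetical.

```python
import random


def _cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def indirect_effect(x, m, y):
    """a*b estimate: a from M ~ X, b is the coefficient of M in Y ~ X + M (OLS)."""
    vx, vm, cxm = _cov(x, x), _cov(m, m), _cov(x, m)
    a = cxm / vx
    b = (_cov(m, y) * vx - _cov(x, y) * cxm) / (vx * vm - cxm ** 2)
    return a * b


def bootstrap_ci(x, m, y, reps=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample classes with replacement
        estimates.append(indirect_effect(
            [x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    return estimates[int(alpha / 2 * reps)], estimates[int((1 - alpha / 2) * reps) - 1]


# Synthetic data (an assumption, not from the study): metric -> size -> defects.
rng = random.Random(42)
metric = [rng.gauss(0, 1) for _ in range(200)]
size = [2.0 * v + rng.gauss(0, 1) for v in metric]
defects = [0.5 * s + 0.1 * v + rng.gauss(0, 0.5) for s, v in zip(size, metric)]

point = indirect_effect(metric, size, defects)
low, high = bootstrap_ci(metric, size, defects)
print(f"indirect effect = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

A confidence interval that excludes zero would suggest that size mediates the metric-defect relationship in this synthetic data. Count or binary defect outcomes, as used in the study, would call for a different regression model than the linear one assumed here.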

SE Apr 26, 2021
Revisiting the size effect in software fault prediction models

Amjed Tahir, Kwabena E. Bennin, Stephen G. MacDonell et al.

BACKGROUND: In object-oriented (OO) software systems, class size has been acknowledged as having an indirect effect on the relationship between certain artifact characteristics, captured via metrics, and fault-proneness, and therefore it is recommended to control for size when designing fault prediction models. AIM: To use robust statistical methods to assess whether there is evidence of any true effect of class size on fault prediction models. METHOD: We examine the potential mediation and moderation effects of class size on the relationships between OO metrics and number of faults. We employ regression analysis and bootstrapping-based methods to investigate the mediation and moderation effects in two widely-used datasets comprising seventeen systems. RESULTS: We find no strong evidence of a significant mediation or moderation effect of class size on the relationships between OO metrics and faults. In particular, size appears to have a more significant mediation effect on CBO and Fan-out than other metrics, although the evidence is not consistent in all examined systems. On the other hand, size does appear to have a significant moderation effect on WMC and CBO in most of the systems examined. Again, the evidence provided is not consistent across all examined systems. CONCLUSION: We are unable to confirm if class size has a significant mediation or moderation effect on the relationships between OO metrics and the number of faults. We contend that class size does not fully explain the relationships between OO metrics and the number of faults, and it does not always affect the strength/magnitude of these relationships. We recommend that researchers consider the potential mediation and moderation effect of class size when building their prediction models, but this should be examined independently for each system.

SE Apr 4, 2021
Assert Use and Defectiveness in Industrial Code

Steve Counsell, Tracy Hall, Thomas Shippey et al.

The use of asserts in code has received increasing attention in the software engineering community in the past few years, even though it has been a recognized programming construct for many decades. A previous empirical study by Casalnuovo showed that methods containing asserts had fewer defects than those that did not. In this paper, we analyze the test classes of two industrial telecom Java systems to lend support to, or refute, that finding. We also analyze the physical position of asserts in methods to determine if there is a relationship between assert placement and method defect-proneness. Finally, we explore the role of test method size and the relationship it has with asserts. In terms of the previous study by Casalnuovo, we found only limited evidence to support the earlier results. We did, however, find that defective methods with one assert tended to be located at significantly lower levels of the class position-wise than non-defective methods. Finally, method size seemed to correlate strongly with asserts, but surprisingly less so when we excluded methods with just one assert. The work described highlights the need for more studies into this aspect of code, one which has strong links with code comprehension.

SE Mar 21, 2021
Understanding Code Smell Detection via Code Review: A Study of the OpenStack Community

Xiaofeng Han, Amjed Tahir, Peng Liang et al.

Code review plays an important role in software quality control. A typical review process would involve a careful check of a piece of code in an attempt to find defects and other quality issues/violations. One type of issue that may impact the quality of the software is code smells - i.e., bad programming practices that may lead to defects or maintenance issues. Yet, little is known about the extent to which code smells are identified during code reviews. To investigate the concept behind code smells identified in code reviews and what actions reviewers suggest and developers take in response to the identified smells, we conducted an empirical study of code smells in code reviews using the two most active OpenStack projects (Nova and Neutron). We manually checked 19,146 review comments obtained by keyword search and random selection, and obtained 1,190 smell-related reviews to study the causes of code smells and actions taken against the identified smells. Our analysis found that 1) code smells were not commonly identified in code reviews, 2) smells were usually caused by violation of coding conventions, 3) reviewers usually provided constructive feedback, including fixing (refactoring) recommendations to help developers remove smells, and 4) developers generally followed those recommendations and actioned the changes. Our results suggest that 1) developers should closely follow coding conventions in their projects to avoid introducing code smells, and 2) review-based detection of code smells is perceived to be a trustworthy approach by developers, mainly because reviews are context-sensitive (as reviewers are more aware of the context of the code given that they are part of the project's development team).

SE Mar 12, 2021
On Satisfying the Android OS Community: User Feedback Still Central to Developers' Portfolios

Sherlock A. Licorish, Amjed Tahir, Michael Franklin Bosu et al.

End-users play an integral role in identifying requirements, validating software features' usefulness, locating defects, and in software product evolution in general. Their role in these activities is especially prominent in online application distribution platforms (OADPs), where software is developed for many potential users, and for which the traditional processes of requirements gathering and negotiation with a single group of end-users do not apply. With such vast access to end-users, however, comes the challenge of how to prioritize competing requirements in order to satisfy previously unknown user groups, especially with early releases of a product. One highly successful product that has managed to overcome this challenge is the Android Operating System (OS). While the requirements of early versions of the Android OS likely benefited from market research, new features in subsequent releases appear to have benefitted extensively from user reviews. Thus, lessons learned about how Android developers have managed to satisfy the user community over time could usefully inform other software products. We have used data mining and natural language processing (NLP) techniques to investigate the issues that were logged by the Android community, and how Google's remedial efforts correlated with users' requests. We found very strong alignment between end-users' top feature requests and Android developers' responses, particularly for the more recent Android releases. Our findings suggest that effort spent responding to end-users' loudest calls may be integral to software systems' survival, and a product's overall success.

SE Jan 11, 2021
A Systematic Mapping Study on Dynamic Metrics and Software Quality

Amjed Tahir, Stephen G. MacDonell

Several important aspects of software product quality can be evaluated using dynamic metrics that effectively capture and reflect the software's true runtime behavior. While the extent of research in this field is still relatively limited, particularly when compared to research on static metrics, the field is growing, given the inherent advantages of dynamic metrics. The aim of this work is to systematically investigate the body of research on dynamic software metrics to identify issues associated with their selection, design and implementation. Mapping studies are being increasingly used in software engineering to characterize an emerging body of research and to identify gaps in the field under investigation. In this study we identified and evaluated 60 works based on a set of defined selection criteria. These studies were further classified and analyzed to identify their relevance to future dynamic metrics research. The classification was based on three different facets: research focus, research type and contribution type. We found a strong body of research related to dynamic coupling and cohesion metrics, with most works also addressing the abstract notion of software complexity. Specific opportunities for future work relate to a much broader range of quality dimensions.

SE Jul 24, 2019
Appsent: A Tool That Analyzes App Reviews

Saurabh Malgaonkar, Chan Won Lee, Sherlock A. Licorish et al.

Enterprises are always on the lookout for tools that analyze end-users' perspectives on their products. In particular, app reviews have been assessed as useful for guiding improvement efforts and software evolution; however, developers find reading app reviews to be a labor-intensive exercise. If such a barrier is eliminated, evidence shows that responding to reviews enhances end-users' satisfaction and contributes towards the success of products. In this paper, we present Appsent, a mobile analytics tool as an app, to facilitate the analysis of app reviews. This development was led by a literature review on the problem and subsequent evaluation of currently available solutions to this challenge. Our investigation found that there was scope to extend currently available tools that analyze app reviews. These gaps thus informed the design and development of Appsent. We subsequently performed an empirical evaluation to validate Appsent's usability and the helpfulness of its analytics features from the users' perspective. Outcomes of this evaluation reveal that Appsent provides user-friendly interfaces, helpful functionalities and meaningful analytics. Appsent extracts and visualizes important perceptions from end-users' feedback, identifying insights into end-users' opinions about various aspects of software features. Although Appsent was developed as a prototype for analyzing app reviews, this tool may be of utility for analyzing product reviews more generally.

SE Oct 5, 2014
Understanding Class-level Testability Through Dynamic Analysis

Amjed Tahir, Stephen G. MacDonell, Jim Buchan

It is generally acknowledged that software testing is both challenging and time-consuming. Understanding the factors that may positively or negatively affect testing effort will point to possibilities for reducing this effort. Consequently there is a significant body of research that has investigated relationships between static code properties and testability. The work reported in this paper complements this body of research by providing an empirical evaluation of the degree of association between runtime properties and class-level testability in object-oriented (OO) systems. The motivation for the use of dynamic code properties comes from the success of such metrics in providing a more complete insight into the multiple dimensions of software quality. In particular, we investigate the potential relationships between the runtime characteristics of production code, represented by Dynamic Coupling and Key Classes, and internal class-level testability. Testability of a class is considered here at the level of unit tests and two different measures are used to characterise those unit tests. The selected measures relate to test scope and structure: one is intended to measure the unit test size, represented by test lines of code, and the other is designed to reflect the intended design, represented by the number of test cases. In this research we found that Dynamic Coupling and Key Classes have significant correlations with class-level testability measures. We therefore suggest that these properties could be used as indicators of class-level testability. These results enhance our current knowledge and should help researchers in the area to build on previous results regarding factors believed to be related to testability and testing. Our results should also benefit practitioners in future class testability planning and maintenance activities.