68.0SEMay 19
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent TechnologyZeeshan Rasheed, Muhammad Waseem, Kai-Kristian Kemell et al.
Context: LLM-based multi-agent systems enable automation and decision support in software development, yet existing studies rely on benchmark datasets offering only binary pass-or-fail results, limiting insight into real-world applicability. Objective: This study empirically investigates the potential and limitations of LLM-based agents in autonomous software development tasks. Method: A two-phase approach was employed: developing a multi-agent system, CodePori, for automated code generation, and conducting participant-based evaluation to assess practical performance. Results: Participant feedback reveals key strengths, challenges, and areas for improvement in LLM-based multi-agent systems, highlighting aspects missed by standard code-generation benchmarks. Conclusions: While LLM-based multi-agent systems show potential for large-scale software development, successful integration requires addressing challenges such as memory limitations, hallucinations, and code smells, alongside a practitioner-centric perspective.
62.2AIApr 17
Agentic Frameworks for Reasoning Tasks: An Empirical StudyZeeshan Rasheed, Abdul Malik Sami, Muhammad Waseem et al.
Recent advances in agentic frameworks have enabled AI agents to perform complex reasoning and decision-making. However, evidence comparing their reasoning performance, efficiency, and practical suitability remains limited. To address this gap, we empirically evaluate 22 widely used agentic frameworks across three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from 1,200 GitHub repositories collected between January 2023 and July 2025 and organized into a taxonomy based on architectural design. We evaluated them under a unified setting, measuring reasoning accuracy, execution time, computational cost, and cross-benchmark consistency. Our results show that 19 of the 22 frameworks completed all three benchmarks. Among these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and cost of 0.14-0.18 cents per task. Poorer results were mainly caused by orchestration problems rather than reasoning limits. For example, Camel failed to complete BBH after 11 days because of uncontrolled context growth, while Upsonic consumed USD 1,434 in one day because repeated extraction failures triggered costly retries. AutoGen and Mastra also exhausted API quotas through iterative interactions that increased prompt length without improving results. We also found a sharp drop in mathematical reasoning. Mean accuracy on GSM8K was 44.35%, compared with 89.80% on BBH and 89.56% on ARC. Overall, this study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management.
42.0CYMar 12
The Landscape of Generative AI in Information Systems: A Synthesis of Secondary Reviews and Research AgendasAleksander Jarzębowicz, Adam Przybyłek, Jacinto Estima et al.
As organizations grapple with the rapid adoption of Generative AI (GenAI), this study synthesizes the state of knowledge through a systematic literature review of secondary studies and research agendas. Analyzing 28 papers published since 2023, we find that while GenAI offers transformative potential for productivity and innovation, its adoption is constrained by multiple interrelated challenges, including technical unreliability (hallucinations, performance drift), societal-ethical risks (bias, misuse, skill erosion), and a systemic governance vacuum (privacy, accountability, intellectual property). Interpreted through a socio-technical lens, these findings reveal a persistent misalignment between GenAI's fast-evolving technical subsystem and the slower-adapting social subsystem, positioning IS research as critical for achieving joint optimization. To bridge this gap, we discuss a research agenda that reorients IS scholarship from analyzing impacts toward actively shaping the co-evolution of technical capabilities with organizational procedures, societal values, and regulatory institutions--emphasizing hybrid human--AI ensembles, situated validation, design principles for probabilistic systems, and adaptive governance.
CLFeb 27, 2025
Mapping Trustworthiness in Large Language Models: A Bibliometric Analysis Bridging Theory to PracticeJosé Siqueira de Cerqueira, Kai-Kristian Kemell, Rebekah Rousi et al.
The rapid proliferation of Large Language Models (LLMs) has raised significant trustworthiness and ethical concerns. Despite the widespread adoption of LLMs across domains, there is still no clear consensus on how to define and operationalise trustworthiness. This study aims to bridge the gap between theoretical discussion and practical implementation by analysing research trends, definitions of trustworthiness, and practical techniques. We conducted a bibliometric mapping analysis of 2,006 publications from Web of Science (2019-2025) using the Bibliometrix, and manually reviewed 68 papers. We found a shift from traditional AI ethics discussion to LLM trustworthiness frameworks. We identified 18 different definitions of trust/trustworthiness, with transparency, explainability and reliability emerging as the most common dimensions. We identified 20 strategies to enhance LLM trustworthiness, with fine-tuning and retrieval-augmented generation (RAG) being the most prominent. Most of the strategies are developer-driven and applied during the post-training phase. Several authors propose fragmented terminologies rather than unified frameworks, leading to the risks of "ethics washing," where ethical discourse is adopted without a genuine regulatory commitment. Our findings highlight: persistent gaps between theoretical taxonomies and practical implementation, the crucial role of the developer in operationalising trust, and call for standardised frameworks and stronger regulatory measures to enable trustworthy and ethical deployment of LLMs.
SEJun 25, 2025
Engineering RAG Systems for Real-World Applications: Design, Development, and EvaluationMd Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell et al.
Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics. Each system incorporates multilingual OCR, semantic retrieval via vector embeddings, and domain-adapted LLMs, deployed through local servers or cloud APIs to meet distinct user needs. A web-based evaluation involving a total of 100 participants assessed the systems across six dimensions: (i) Ease of Use, (ii) Relevance, (iii) Transparency, (iv) Responsiveness, (v) Accuracy, and (vi) Likelihood of Recommendation. Based on user feedback and our development experience, we documented twelve key lessons learned, highlighting technical, operational, and ethical challenges affecting the reliability and usability of RAG systems in practice.
SEDec 11, 2025
Vibe Coding in Practice: Flow, Technical Debt, and Guidelines for Sustainable UseMuhammad Waseem, Aakash Ahmad, Kai-Kristian Kemell et al.
Vibe Coding (VC) is a form of software development assisted by generative AI, in which developers describe the intended functionality or logic via natural language prompts, and the AI system generates the corresponding source code. VC can be leveraged for rapid prototyping or developing the Minimum Viable Products (MVPs); however, it may introduce several risks throughout the software development life cycle. Based on our experience from several internally developed MVPs and a review of recent industry reports, this article analyzes the flow-debt tradeoffs associated with VC. The flow-debt trade-off arises when the seamless code generation occurs, leading to the accumulation of technical debt through architectural inconsistencies, security vulnerabilities, and increased maintenance overhead. These issues originate from process-level weaknesses, biases in model training data, a lack of explicit design rationale, and a tendency to prioritize quick code generation over human-driven iterative development. Based on our experiences, we identify and explain how current model, platform, and hardware limitations contribute to these issues, and propose countermeasures to address them, informing research and practice towards more sustainable VC approaches.
SEAug 28, 2025
AI and Agile Software Development: A Research Roadmap from the XP2025 WorkshopZheying Zhang, Tomas Herda, Victoria Pichler et al.
This paper synthesizes the key findings from a full-day XP2025 workshop on "AI and Agile: From Frustration to Success", held in Brugg-Windisch, Switzerland. The workshop brought together over 30 interdisciplinary academic researchers and industry practitioners to tackle the concrete challenges and emerging opportunities at the intersection of Generative Artificial Intelligence (GenAI) and agile software development. Through structured, interactive breakout sessions, participants identified shared pain points like tool fragmentation, governance, data quality, and critical skills gaps in AI literacy and prompt engineering. These issues were further analyzed, revealing underlying causes and cross-cutting concerns. The workshop concluded by collaboratively co-creating a multi-thematic research roadmap, articulating both short-term, implementable actions and visionary, long-term research directions. This cohesive agenda aims to guide future investigation and drive the responsible, human-centered integration of GenAI into agile practices.
CYJan 12, 2024
Business and ethical concerns in domestic Conversational Generative AI-empowered multi-robot systemsRebekah Rousi, Hooman Samani, Niko Mäkitalo et al.
Business and technology are intricately connected through logic and design. They are equally sensitive to societal changes and may be devastated by scandal. Cooperative multi-robot systems (MRSs) are on the rise, allowing robots of different types and brands to work together in diverse contexts. Generative artificial intelligence has been a dominant topic in recent artificial intelligence (AI) discussions due to its capacity to mimic humans through the use of natural language and the production of media, including deep fakes. In this article, we focus specifically on the conversational aspects of generative AI, and hence use the term Conversational Generative artificial intelligence (CGI). Like MRSs, CGIs have enormous potential for revolutionizing processes across sectors and transforming the way humans conduct business. From a business perspective, cooperative MRSs alone, with potential conflicts of interest, privacy practices, and safety concerns, require ethical examination. MRSs empowered by CGIs demand multi-dimensional and sophisticated methods to uncover imminent ethical pitfalls. This study focuses on ethics in CGI-empowered MRSs while reporting the stages of developing the MORUL model.
SEFeb 10, 2022
Work-from-home and its implication for project management, resilience and innovation -- a global survey on software companiesAnh Nguyen-Duc, Dron Khanna, Des Greer et al.
[Context] The COVID-19 pandemic has had a disruptive impact on how people work and collaborate across all global economic sectors, including the software business. While remote working is not new for software engineers, forced Work-from-home situations to come with both constraints, limitations, and opportunities for individuals, software teams and software companies. As the "new normal" for working might be based on the current state of Work From Home (WFH), it is useful to understand what has happened and learn from that. [Objective] The goal of this study is to gain insights on how their WFH environment impacts software projects and software companies. We are also interested in understanding if the impact differs between software startups and established companies. [Method] We conducted a global-scale, cross-sectional survey during spring and summer 2021. Our results are based on quantitative and qualitative analysis of 297 valid responses. [Results] We observed a mixed perception of the impact of WFH on software project management, resilience, and innovation. Certain patterns on WFH, control and coordination mechanisms and collaborative tools are observed globally. We find that team, agility and leadership are the three most important factors for achieving resilience during the pandemic. Although startups do not perceive the impact of WFH differently, there is a difference between engineers who work in a small team context and those who work in a large team context. [Conclusion] The result suggests a contingency approach in studying and improving WFH practices and environment in the future software industry.
SEMar 14, 2021
The entrepreneurial logic of startup software development: A study of 40 software startupsAnh Nguyen-Duc, Kai-Kristian Kemell, Pekka Abrahamsson
Context: Software startups are an essential source of innovation and software-intensive products. The need to understand product development in startups and to provide relevant support are highlighted in software research. While state-of-the-art literature reveals how startups develop their software, the reasons why they adopt these activities are underexplored. Objective: This study investigates the tactics behind software engineering (SE) activities by analyzing key engineering events during startup journeys. We explore how entrepreneurial mindsets may be associated with SE knowledge areas and with each startup case. Method: Our theoretical foundation is based on causation and effectuation models. We conducted semi-structured interviews with 40 software startups. We used two-round open coding and thematic analysis to describe and identify entrepreneurial software development patterns. Additionally, we calculated an effectuation index for each startup case. Results: We identified 621 events merged into 32 codes of entrepreneurial logic in SE from the sample. We found a systemic occurrence of the logic in all areas of SE activities. Minimum Viable Product (MVP), Technical Debt (TD), and Customer Involvement (CI) tend to be associated with effectual logic, while testing activities at different levels are associated with causal logic. The effectuation index revealed that startups are either effectuation-driven or mixed-logics-driven. Conclusions: Software startups fall into two types that differentiate between how traditional SE approaches may apply to them. Effectuation seems the most relevant and essential model for explaining and developing suitable SE practices for software startups.
SEFeb 11, 2021
Business Model Canvas Should Pay More Attention to the Software Startup TeamKai-Kristian Kemell, Atte Elonen, Mari Suoranta et al.
Business Model Canvas (BMC) is a tool widely used to describe startup business models. Despite the various business aspects described, BMC pays a little emphasis on team-related factors. The importance of team-related factors in software development has been acknowledged widely in literature. While not as extensively studied, the importance of teams in software startups is also known in both literature and among practitioners. In this paper, we propose potential changes to BMC to have the tool better reflect the importance of the team, especially in a software startup environment. Based on a literature review, we identify various components related to the team, which we then further support with empirical data. We do so by means of a qualitative case study of five startups.
SEFeb 11, 2021
Amidst Uncertainty -- or Not? Decision-Making in Early-Stage Software StartupsKai-Kristian Kemell, Eveliina Ventilä, Petri Kettunen et al.
It is commonly claimed that the initial stages of any startup business are dominated by continuous, extended uncertainty, in an environment that has even been described as chaotic. Consequently, decisions are made in uncertain circumstances, so making the right decision is crucial to successful business. However, little currently exists in the way of empirical studies into this supposed uncertainty. In this paper, we study decision-making in early-stage software startups by means of a single, in-depth case study. Based on our data, we argue that software startups do not work in a chaotic environment, nor are they characterized by unique uncertainty unlike that experienced by other firms.
SEFeb 11, 2021
Software Startup Practices -- Software Development in Startups through the Lens of the Essence Theory of Software EngineeringKai-Kristian Kemell, Ville Ravaska, Anh Nguyen-Duc et al.
Software startups continue to be important drivers of economy globally. As the initial investment required to found a new software company becomes smaller and smaller resulting from technological advances such as cloud technology, increasing numbers of new software startups are born. Typically, the main argument for studying software startups is that they differ from mature software organizations in various ways, thus making the findings of many existing studies not directly applicable to them. How, exactly, software startups really differ from other types of software organizations as an on-going debate. In this paper, we seek to better understand how software startups differ from mature software organizations in terms of development practices. Past studies have primarily studied method use, and in comparison, we take on a more atomic approach by focusing on practices. Utilizing the Essence Theory of Software Engineering as a framework, we split these practices into categories for analysis while simultaneously evaluating the suitability of the theory for the context of software startups. Based on the results, we propose changes to the Essence Theory of Software Engineering for it to better fit the startup context.
SESep 24, 2018
The Essence Theory of Software Engineering - Large-Scale Classroom Experiences from 450+ Software Engineering BSc StudentsKai-Kristian Kemell, Anh Nguyen-Duc, Xiaofeng Wang et al.
Software Engineering as an industry is highly diverse in terms of development methods and practices. Practitioners employ a myriad of methods and tend to further tailor them by e.g. omitting some practices or rules. This diversity in development methods poses a challenge for software engineering education, creating a gap between education and industry. General theories such as the Essence Theory of Software Engineering can help bridge this gap by presenting software engineering students with higher-level frameworks upon which to build an understanding of software engineering methods and practical project work. In this paper, we study Essence in an educational setting to evaluate its usefulness for software engineering students while also investigating barriers to its adoption in this context. To this end, we observe 102 student teams utilize Essence in practical software engineering projects during a semester long, project-based course.
SESep 23, 2018
Gamifying the Escape from the Engineering Method Prison - An Innovative Board Game to Teach the Essence Theory to Future Project Managers and Software EngineersKai-Kristian Kemell, Juhani Risku, Arthur Evensen et al.
Software Engineering is an engineering discipline but lacks a solid theoretical foundation. One effort in remedying this situation has been the SEMAT Essence specification. Essence consists of a language for modeling Software Engineering (SE) practices and methods and a kernel containing what its authors describe as being elements that are present in every software development project. In practice, it is a method agnostic project management tool for SE Projects. Using the language of the specification, Essence can be used to model any software development method or practice. Thus, the specification can potentially be applied to any software development context, making it a powerful tool. However, due to the manual work and the learning process involved in modeling practices with Essence, its initial adoption can be tasking for development teams. Due to the importance of project management in SE projects, new project management tools such as Essence are valuable, and facilitating their adoption is consequently important. To tackle this issue in the case of Essence, we present a game-based approach to teaching the use Essence. In this paper, we gamify the learning process by means of an innovative board game. The game is empirically validated in a study involving students from the IT faculty of University of Jyväskylä (n=61). Based on the results, we report the effectiveness of the game-based approach to teaching both Essence and SE project work.
SEAug 8, 2018
Essencery - A Tool for Essentializing Software Engineering PracticesArthur Evensen, Kai-Kristian Kemell, Xiaofeng Wang et al.
Software Engineering practitioners work using highly diverse methods and practices, and general theories in software engineering are lacking. One attempt at creating a common ground in the area of software engineering methodologies has been the Essence Theory of Software Engineering, which can be considered a method-agnostic project management tool for software engineering. Essence supports the use of any development practices and provides a framework for building a suitable method for any software engineering context. However, Essence presently suffers from low practitioner adoption that is partially considered to be caused by a lack of proper tooling. In this paper, we present Essencery, a tool for essentializing software engineering methods and practices using the Essence graphical syntax. Essencery aims to facilitate adoption of Essence among potential future users. We present an empirical evaluation of the tool by means of a qualitative, quasi-formal experiment and, based on the experiment, confirm that the tool is easy to use and useful for its intended purpose.