CV · Jul 21, 2024 · Code
VideoGameBunny: Towards vision assistants for video games
Mohammad Reza Taesiri, Cor-Paul Bezemer
Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements of 136,974 images. Our experiments show that our high quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
CV · Apr 13 · Code
RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games
Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos et al.
Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at https://github.com/PipiZong/RESP_code.git
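The aggregation of noisy frame predictions into a video-level decision can be illustrated with a simple voting rule. The abstract does not state which rule RESP uses, so the fraction-based vote below is an assumption, as are the function name `aggregate_frame_predictions` and the `min_fraction` threshold.

```python
from typing import Sequence


def aggregate_frame_predictions(
    frame_flags: Sequence[bool], min_fraction: float = 0.3
) -> bool:
    """Return True if the video should be flagged as glitchy.

    frame_flags holds one noisy per-frame decision (True = glitch) for each
    reference/test pair. min_fraction is a hypothetical threshold, not a
    value taken from the paper.
    """
    if not frame_flags:
        # No evidence at all: do not flag the video.
        return False
    glitch_fraction = sum(frame_flags) / len(frame_flags)
    return glitch_fraction >= min_fraction
```

A fraction-based vote tolerates a few spurious frame-level positives, which is one plausible way to stabilize noisy per-frame output.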
CV · Apr 11, 2023
ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification
Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi et al.
Image classifiers are information-discarding machines, by design. Yet, how these models discard information remains mysterious. We hypothesize that one way for image classifiers to reach high accuracy is to first zoom to the most discriminative region in the image and then extract features from there to predict image labels, discarding the rest of the image. Studying six popular networks ranging from AlexNet to CLIP, we find that proper framing of the input image can lead to the correct classification of 98.91% of ImageNet images. Furthermore, we uncover positional biases in various datasets, especially a strong center bias in two popular datasets: ImageNet-A and ObjectNet. Finally, leveraging our insights into the potential of zooming, we propose a test-time augmentation (TTA) technique that improves classification accuracy by forcing models to explicitly perform zoom-in operations before making predictions. Our method is more interpretable, accurate, and faster than MEMO, a state-of-the-art (SOTA) TTA method. We introduce ImageNet-Hard, a new benchmark that challenges SOTA classifiers including large vision-language models even when optimal zooming is allowed.
CL · Oct 5, 2022
Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors
Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang et al.
Video game testing requires game-specific knowledge as well as common sense reasoning about the events in the game. While AI-driven agents can satisfy the first requirement, it is not yet possible to meet the second requirement automatically. Therefore, video game testing often still relies on manual testing, and human testers are required to play the game thoroughly to detect bugs. As a result, it is challenging to fully automate game testing. In this study, we explore the possibility of leveraging the zero-shot capabilities of large language models for video game bug detection. By formulating the bug detection problem as a question-answering task, we show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game. To this end, we introduce the GameBugDescriptions benchmark dataset, which consists of 167 buggy gameplay videos and a total of 334 question-answer pairs across 8 games. We extensively evaluate the performance of six models across the OPT and InstructGPT large language model families on our benchmark dataset. Our results are promising for employing language models to detect video game bugs. With the proper prompting technique, we could achieve an accuracy of 70.66%, and on some video games, up to 78.94%. Our code, evaluation data and the benchmark can be found at https://asgaardlab.github.io/LLMxBugs
CV · Mar 21, 2022
CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning
Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
Gameplay videos contain rich information about how players interact with the game and how the game responds. Sharing gameplay videos on social media platforms, such as Reddit, has become a common practice for many players. Often, players will share gameplay videos that showcase video game bugs. Such gameplay videos are software artifacts that can be utilized for game testing, as they provide insight for bug analysis. Although large repositories of gameplay videos exist, parsing and mining them in an effective and structured fashion remains a major challenge. In this paper, we propose a search method that accepts any English text query as input to retrieve relevant videos from large repositories of gameplay videos. Our approach does not rely on any external information (such as video metadata); it works solely based on the content of the video. By leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training. To evaluate our approach, we present the GamePhysics dataset consisting of 26,954 videos from 1,873 games that were collected from the GamePhysics section on the Reddit website. Our approach shows promising results in our extensive analysis of simple queries, compound queries, and bug queries, indicating that our approach is useful for object and event detection in gameplay videos. An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs. Please visit the following link for the code and the data: https://asgaardlab.github.io/CLIPxGamePhysics/
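Content-based retrieval of this kind can be sketched as a ranking of videos by the similarity between a text-query embedding and frame embeddings. The sketch below assumes the embeddings have already been computed by a CLIP-style model and are passed in as plain lists of floats. Scoring each video by its best-matching frame is an assumption, and `cosine` and `rank_videos` are hypothetical helper names, not the paper's code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_videos(
    query_emb: Sequence[float],
    video_frame_embs: Dict[str, List[Sequence[float]]],
) -> List[str]:
    """Rank video ids by their best-matching frame (assumed scoring rule)."""
    scores = {
        video_id: max(cosine(query_emb, frame) for frame in frames)
        for video_id, frames in video_frame_embs.items()
        if frames  # skip videos without any embedded frames
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Because both the query and the frames live in the same embedding space, no labels or extra training are needed for this ranking step.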
SE · Jul 7, 2024
Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality
Hao Li, Gopi Krishnan Rajbahadur, Cor-Paul Bezemer
Bindings for machine learning frameworks (such as TensorFlow and PyTorch) allow developers to integrate a framework's functionality using a programming language different from the framework's default language (usually Python). In this paper, we study the impact of using TensorFlow and PyTorch bindings in C#, Rust, Python and JavaScript on the software quality in terms of correctness (training and test accuracy) and time cost (training and inference time) when training and performing inference on five widely used deep learning models. Our experiments show that a model can be trained in one binding and used for inference in another binding for the same framework without losing accuracy. Our study is the first to show that using a non-default binding can help improve machine learning software quality from the time cost perspective compared to the default Python binding while still achieving the same level of correctness.
CV · Mar 24
How Far Can VLMs Go for Visual Bug Detection? Studying 19,738 Keyframes from 41 Hours of Gameplay Videos
Wentao Lu, Alexander Senchenko, Alan Sayle et al.
Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across 100 videos totaling 41 hours and 19,738 keyframes, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.
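For readers less familiar with the two metrics reported above, the following sketch shows how precision and accuracy are computed from confusion-matrix counts of per-keyframe decisions. The counts in the usage note are made up for illustration and are not data from the study.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of keyframes flagged as buggy that really contain a bug."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all keyframes that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0
```

With hypothetical counts of 3 true positives, 1 false positive, 4 true negatives and 2 false negatives, `precision(3, 1)` is 0.75 and `accuracy(3, 1, 4, 2)` is 0.7. Accuracy can exceed precision when most keyframes are bug-free and are correctly left unflagged.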
CV · May 20
TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos
Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos et al.
Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.
SE · Oct 11, 2024 · Code
Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models
Hao Li, Cor-Paul Bezemer, Ahmed E. Hassan
Foundation models (FMs) such as large language models (LLMs) have significantly impacted many fields, including software engineering (SE). The interaction between SE and FMs has led to the integration of FMs into SE practices (FM4SE) and the application of SE methodologies to FMs (SE4FM). While several literature surveys exist on academic contributions to these trends, we are the first to provide a practitioner's view. We analyze 155 FM4SE and 997 SE4FM blog posts from leading technology companies, leveraging an FM-powered surveying approach to systematically label and summarize the discussed activities and tasks. We observed that while code generation is the most prominent FM4SE task, FMs are leveraged for many other SE activities such as code understanding, summarization, and API recommendation. The majority of blog posts on SE4FM are about model deployment & operation, and system architecture & orchestration. Although the emphasis is on cloud deployments, there is a growing interest in compressing FMs and deploying them on smaller devices such as edge or mobile devices. We outline eight future research directions inspired by our gained insights, aiming to bridge the gap between academic findings and real-world applications. Our study not only enriches the body of knowledge on practical applications of FM4SE and SE4FM but also demonstrates the utility of FMs as a powerful and efficient approach in conducting literature surveys within technical and grey literature domains. Our dataset, results, code and used prompts can be found in our online replication package at https://github.com/SAILResearch/fmse-blogs.
SE · Jan 18, 2022 · Code
A Taxonomy of Testable HTML5 Canvas Issues
Finlay Macklon, Markos Viggiato, Natalia Romanova et al.
The HTML5 <canvas> is widely used to display high quality graphics in web applications. However, the combination of web, GUI, and visual techniques that are required to build <canvas> applications, together with the lack of testing and debugging tools, makes developing such applications very challenging. To help direct future research on testing <canvas> applications, in this paper we present a taxonomy of testable <canvas> issues. First, we extracted 2,403 <canvas>-related issue reports from 123 open-source GitHub projects that use the HTML5 <canvas>. Second, we constructed our taxonomy by manually classifying a random sample of 332 issue reports. Our manual classification identified five broad categories of testable <canvas> issues, such as Visual and Performance issues. We found that Visual issues are the most frequent (35%), while Performance issues are relatively infrequent (5%). We also found that many testable <canvas> issues that present themselves visually on the <canvas> are actually caused by other components of the web application. Our taxonomy of testable <canvas> issues can be used to steer future research into <canvas> issues and testing.
SE · Jan 18, 2022 · Code
Bridging the Language Gap: An Empirical Study of Bindings for Open Source Machine Learning Libraries Across Software Package Ecosystems
Hao Li, Cor-Paul Bezemer
Open source machine learning (ML) libraries enable developers to integrate advanced ML functionality into their own applications. However, popular ML libraries, such as TensorFlow, are not available natively in all programming languages and software package ecosystems. Hence, developers who wish to use an ML library which is not available in their programming language or ecosystem of choice, may need to resort to using a so-called binding library (or binding). Bindings provide support across programming languages and package ecosystems for reusing a host library. For example, the Keras .NET binding provides support for the Keras library in the NuGet (.NET) ecosystem even though the Keras library was written in Python. In this paper, we collect 2,436 cross-ecosystem bindings for 546 ML libraries across 13 software package ecosystems by using an approach called BindFind, which can automatically identify bindings and link them to their host libraries. Furthermore, we conduct an in-depth study of 133 cross-ecosystem bindings and their development for 40 popular open source ML libraries. Our findings reveal that the majority of ML library bindings are maintained by the community, with npm being the most popular ecosystem for these bindings. Our study also indicates that most bindings cover only a limited range of the host library's releases, often experience considerable delays in supporting new releases, and have widespread technical lag. Our findings highlight key factors to consider for developers integrating bindings for ML libraries and open avenues for researchers to further investigate bindings in software package ecosystems.
SE · Apr 4, 2019 · Code
Bounties in Open Source Development on GitHub: A Case Study of Bountysource Bounties
Jiayuan Zhou, Shaowei Wang, Cor-Paul Bezemer et al.
Due to the voluntary nature of open source software, it can be hard to find a developer to work on a particular task. For example, some issue reports may be too cumbersome and unexciting for someone to volunteer to do them, yet these issue reports may be of high priority to the success of a project. To provide an incentive for implementing such issue reports, one can propose a monetary reward, i.e., a bounty, to the developer who completes that particular task. In this paper, we study bounties in open source projects on GitHub to better understand how bounties can be leveraged to evolve such projects in terms of addressing issue reports. We investigated 5,445 bounties for GitHub projects. These bounties were proposed through the Bountysource platform with a total bounty value of $406,425. We find that 1) in general, the timing of proposing bounties and the bounty-usage frequency are the most important factors that impact the likelihood of an issue being addressed. More specifically, issue reports are more likely to be addressed if they are for projects in which bounties are used more frequently and if they are proposed earlier. 2) The bounty value that an issue report has is the most important factor that impacts the issue-addressing likelihood in the projects in which no bounties were used before. Backers in such projects proposed higher bounty values to get issues addressed. 3) There is a risk of wasting money for backers who invest money on long-standing issue reports.
CV · Dec 8, 2023
GlitchBench: Can large multimodal models detect video game glitches?
Mohammad Reza Taesiri, Tianjun Feng, Anh Nguyen et al.
Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/
CV · May 21, 2025
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj et al.
With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/
SE · Jan 18, 2024
Keeping Deep Learning Models in Check: A History-Based Approach to Mitigate Overfitting
Hao Li, Gopi Krishnan Rajbahadur, Dayi Lin et al.
In software engineering, deep learning models are increasingly deployed for critical tasks such as bug detection and code review. However, overfitting remains a challenge that affects the quality, reliability, and trustworthiness of software systems that utilize deep learning models. Overfitting can be (1) prevented (e.g., using dropout or early stopping) or (2) detected in a trained model (e.g., using correlation-based approaches). Both overfitting detection and prevention approaches that are currently used have constraints (e.g., requiring modification of the model structure, and high computing resources). In this paper, we propose a simple, yet powerful approach that can both detect and prevent overfitting based on the training history (i.e., validation losses). Our approach first trains a time series classifier on training histories of overfit models. This classifier is then used to detect if a trained model is overfit. In addition, our trained classifier can be used to prevent overfitting by identifying the optimal point to stop a model's training. We evaluate our approach on its ability to identify and prevent overfitting in real-world samples. We compare our approach against correlation-based detection approaches and the most commonly used prevention approach (i.e., early stopping). Our approach achieves an F1 score of 0.91 which is at least 5% higher than the current best-performing non-intrusive overfitting detection approach. Furthermore, our approach can stop training to avoid overfitting at least 32% of the times earlier than early stopping and has the same or a better rate of returning the best model.
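Early stopping, the prevention baseline named above, is commonly implemented as a patience rule over the validation-loss history. The sketch below shows that baseline only. It is not the paper's time series classifier, and the function name and the `patience` default are assumptions.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(
    val_losses: Sequence[float], patience: int = 3
) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch) for a patience-based early stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran through the whole history.
    return len(val_losses) - 1, best_epoch
```

Note that a patience rule can stop before a later improvement appears. This illustrates why the choice of stopping point matters when comparing prevention approaches.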
SE · Jan 27, 2022
An Empirical Study of Yanked Releases in the Rust Package Registry
Hao Li, Filipe R. Cogo, Cor-Paul Bezemer
Cargo, the software packaging manager of Rust, provides a yank mechanism to support release-level deprecation, which can prevent packages from depending on yanked releases. Most prior studies focused on code-level (i.e., deprecated APIs) and package-level deprecation (i.e., deprecated packages). However, few studies have focused on release-level deprecation. In this study, we investigate how often and how the yank mechanism is used, the rationales behind its usage, and the adoption of yanked releases in the Cargo ecosystem. Our study shows that 9.6% of the packages in Cargo have at least one yanked release, and the proportion of yanked releases kept increasing from 2014 to 2020. Package owners yank releases for other reasons than withdrawing a defective release, such as fixing a release that does not follow semantic versioning or indicating a package is removed or replaced. In addition, we found that 46% of the packages directly adopted at least one yanked release and the yanked releases propagated through the dependency network, which leads to 1.4% of the releases in the ecosystem having unresolved dependencies.
LG · Oct 22, 2021
Applications of Generative Adversarial Networks in Anomaly Detection: A Systematic Literature Review
Mikael Sabuhi, Ming Zhou, Cor-Paul Bezemer et al.
Anomaly detection has become an indispensable tool for modern society, applied in a wide range of applications, from detecting fraudulent transactions to malignant brain tumours. Over time, many anomaly detection techniques have been introduced. However, in general, they all suffer from the same problem: a lack of data that represents anomalous behaviour. As anomalous behaviour is usually costly (or dangerous) for a system, it is difficult to gather enough data that represents such behaviour. This, in turn, makes it difficult to develop and evaluate anomaly detection techniques. Recently, generative adversarial networks (GANs) have attracted a great deal of attention in anomaly detection research, due to their unique ability to generate new data. In this paper, we present a systematic literature review of the applications of GANs in anomaly detection, covering 128 papers on the subject. The goal of this review paper is to analyze and summarize: (1) which anomaly detection techniques can benefit from certain types of GANs, and how, (2) in which application domains GAN-assisted anomaly detection techniques have been applied, and (3) which datasets and performance metrics have been used to evaluate these techniques. Our study helps researchers and practitioners to find the most suitable GAN-assisted anomaly detection technique for their application. In addition, we present a research roadmap for future studies in this area.
SE · Oct 14, 2021
Identifying Similar Test Cases That Are Specified in Natural Language
Markos Viggiato, Dale Paas, Chris Buzon et al.
Software testing is still a manual process in many industries, despite the recent improvements in automated testing techniques. As a result, test cases are often specified in natural language by different employees and many redundant test cases might exist in the test suite. This increases the (already high) cost of test execution. Manually identifying similar test cases is a time-consuming and error-prone task. Therefore, in this paper, we propose an unsupervised approach to identify similar test cases. Our approach uses a combination of text embedding, text similarity and clustering techniques to identify similar test cases. We evaluate five different text embedding techniques, two text similarity metrics, and two clustering techniques to cluster similar test steps and four techniques to identify similar test cases from the test step clusters. Through an evaluation in an industrial setting, we show that our approach achieves high performance in clustering test steps (an F-score of 87.39%) and identifying similar test cases (an F-score of 83.47%). Furthermore, a validation with developers indicates several practical uses of our approach (such as identifying redundant and legacy test cases), which help to reduce manual testing effort and time.
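The idea of comparing test steps by text similarity can be illustrated with a deliberately simple stand-in. The sketch below swaps the paper's text embedding and clustering techniques for bag-of-words cosine similarity with a fixed threshold. The function names and the 0.8 threshold are assumptions.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two test steps."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(
    steps: Sequence[str], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Return index pairs of test steps whose similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(steps), 2)
        if bow_cosine(a, b) >= threshold
    ]
```

For the steps "open the settings menu", "open the settings menu now" and "close the game", only the first two are paired. A learned embedding would additionally catch paraphrases that share few words, which bag-of-words vectors miss.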
DC · Jul 28, 2021
A Case Study on the Stability of Performance Tests for Serverless Applications
Simon Eismann, Diego Elias Costa, Lizhi Liao et al.
Context. While in serverless computing, application resource management and operational concerns are generally delegated to the cloud provider, ensuring that serverless applications meet their performance requirements is still a responsibility of the developers. Performance testing is a commonly used performance assessment practice; however, it traditionally requires visibility of the resource environment. Objective. In this study, we investigate whether performance tests of serverless applications are stable, that is, if their results are reproducible, and what implications the serverless paradigm has for performance tests. Method. We conduct a case study where we collect two datasets of performance test results: (a) repetitions of performance tests for varying memory size and load intensities and (b) three repetitions of the same performance test every day for ten months. Results. We find that performance tests of serverless applications are comparatively stable if conducted on the same day. However, we also observe short-term performance variations and frequent long-term performance changes. Conclusion. Performance tests for serverless applications can be stable; however, the serverless model impacts the planning, execution, and analysis of performance tests.
SE · Mar 26, 2021
An Empirical Study of the Characteristics of Popular Minecraft Mods
Daniel Lee, Gopi Krishnan Rajbahadur, Dayi Lin et al.
It is becoming increasingly difficult for game developers to manage the cost of developing a game, while meeting the high expectations of gamers. One way to balance the increasing gamer expectation and development stress is to build an active modding community around the game. There exist several examples of games with an extremely active and successful modding community, with the Minecraft game being one of the most notable ones. This paper reports on an empirical study of 1,114 popular and 1,114 unpopular Minecraft mods from the CurseForge mod distribution platform, one of the largest distribution platforms for Minecraft mods. We analyzed the relationship between 33 features across 5 dimensions of mod characteristics and the popularity of mods (i.e., mod category, mod documentation, environmental context of the mod, remuneration for the mod, and community contribution for the mod), to understand the characteristics of popular Minecraft mods. We firstly verify that the studied dimensions have significant explanatory power in distinguishing the popularity of the studied mods. Then we evaluated the contribution of each of the 33 features across the 5 dimensions. We observed that popular mods tend to have a high quality description and promote community contribution.
SE · Mar 12, 2021
Building the perfect game -- an empirical study of game modifications
Daniel Lee, Dayi Lin, Cor-Paul Bezemer et al.
Game developers cannot always meet the growing and changing needs of the gaming community, due to the often already overloaded schedules of developers. So-called modders can potentially assist game developers with addressing gamers' needs. Modders are enthusiasts who provide modifications or completely new content for a game. By supporting modders, game developers can meet the rapidly growing and varying needs of their gamer base. Modders have the potential to play a role in extending the life expectancy of a game, thereby saving game developers time and money, and leading to a better overall gaming experience for their gamer base. In this paper, we empirically study the metadata of 9,521 mods of the 20 most-modded games on the Nexus Mods distribution platform. Our goal is to provide useful insights into the modding community of the Nexus Mods distribution platform from a quantitative perspective, and to provide researchers with a solid foundation for future exploration of game mods. In doing so, game developers can potentially reduce development time and cost due to the increased replayability of their games through mods. We find that providing official support for mods can be beneficial for the perceived quality of the mods of a game. In addition, mod users are willing to submit bug reports for a mod. However, they often fail to do this in a systematic manner using the bug reporting tool of the Nexus Mods platform, resulting in low-quality bug reports which are difficult to resolve. Based on our findings, we recommend that game developers who desire an active modding community for their own games provide the modding community with an officially-supported modding tool. In addition, we recommend that mod distribution platforms, such as Nexus Mods, improve their bug reporting system to receive higher quality bug reports.
SE · Aug 21, 2018
How is Performance Addressed in DevOps? A Survey on Industrial Practices
Cor-Paul Bezemer, Simon Eismann, Vincenzo Ferme et al.
DevOps is a modern software engineering paradigm that is gaining widespread adoption in industry. The goal of DevOps is to bring software changes into production with a high frequency and fast feedback cycles. This conflicts with software quality assurance activities, particularly with respect to performance. For instance, performance evaluation activities -- such as load testing -- require a considerable amount of time to get statistically significant results. We conducted an industrial survey to get insights into how performance is addressed in industrial DevOps settings. In particular, we were interested in the frequency of executing performance evaluations, the tools being used, the granularity of the obtained performance data, and the use of model-based techniques. The survey responses, which come from a wide variety of participants from different industry sectors, indicate that the complexity of performance engineering approaches and tools is a barrier for wide-spread adoption of performance analysis in DevOps. The implication of our results is that performance analysis tools need to have a short learning curve, and should be easy to integrate into the DevOps pipeline.