Kevin Moran

h-index56

54papers

3,067citations

Novelty40%

AI Score51

Ranked #40,983 of 201,326 authors (top 20%)#402 in SE (top 12%)

54 Papers

SEJan 3, 2023Code

An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation

Kevin Moran, Ali Yachnes, George Purnell et al.

Existing automated techniques for software documentation typically attempt to reason between two main sources of information: code and natural language. However, this reasoning process is often complicated by the lexical gap between more abstract natural language and more structured programming languages. One potential bridge for this gap is the Graphical User Interface (GUI), as GUIs inherently encode salient information about underlying program functionality into rich, pixel-based data representations. This paper offers one of the first comprehensive empirical investigations into the connection between GUIs and functional, natural language descriptions of software. First, we collect, analyze, and open source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications. The descriptions were obtained from human labelers and underwent several quality control mechanisms. To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input. We evaluate these models quantitatively, using common machine translation metrics, and qualitatively through a large-scale user study. Finally, we offer learned lessons and a discussion of the potential shown by multimodal models to enhance future techniques for automated software documentation.

27.4SEMay 29

Reassessing Code Authorship Attribution in the Era of Language Models

Atish Kumar Dipongkor, Ziyu Yao, Kevin Moran

The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been illustrated to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. In addition, CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism. Past techniques for CAA tend to leverage hand-crafted code-related features typically carry limitations that prevent proper authorship characterization and lead to sensitivities to adversarial attacks. Recently, transformer-based Language Models (LMs) have shown remarkable efficacy across a range of SE tasks, and in authorship attribution for natural language in the NLP domain. However, their effectiveness in CAA is not well understood. As such, we conduct the first extensive empirical study applying two larger state-of-the-art code LMs, and five smaller code LMs to the task of CAA on six diverse datasets that encompass 12k code snippets written by 463 developers. Furthermore, we perform an in-depth quantitative and qualitative analysis of our studied models' performance on CAA using established interpretability techniques. Our results illustrate important aspects of the behavior of LMs in understanding stylometric code patterns.

SEFeb 7, 2023

Toward a Theory of Causation for Interpreting Neural Code Models

David N. Palacio, Alejandro Velasco, Nathan Cooper et al.

Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.

SEAug 7, 2023

Evaluating and Explaining Large Language Models for Code Using Syntactic Structures

David N Palacio, Alejandro Velasco, Daniel Rodriguez-Cardenas et al.

Large Language Models (LLMs) for code are a family of high-parameter, transformer-based neural networks pre-trained on massive datasets of both natural and programming languages. These models are rapidly being employed in commercial AI-based developer tools, such as GitHub CoPilot. However, measuring and explaining their effectiveness on programming tasks is a challenging proposition, given their size and complexity. The methods for evaluating and explaining LLMs for code are inextricably linked. That is, in order to explain a model's predictions, they must be reliably mapped to fine-grained, understandable concepts. Once this mapping is achieved, new methods for detailed model evaluations are possible. However, most current explainability techniques and evaluation benchmarks focus on model robustness or individual task performance, as opposed to interpreting model predictions. To this end, this paper introduces ASTxplainer, an explainability method specific to LLMs for code that enables both new methods for LLM evaluation and visualizations of LLM predictions that aid end-users in understanding model predictions. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes, by extracting and aggregating normalized model logits within AST structures. To demonstrate the practical benefit of ASTxplainer, we illustrate the insights that our framework can provide by performing an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects. Additionally, we perform a user study examining the usefulness of an ASTxplainer-derived visualization of model predictions aimed at enabling model users to explain predictions. The results of these studies illustrate the potential for ASTxplainer to provide insights into LLM effectiveness, and aid end-users in understanding predictions.

SEMar 12, 2023

AidUI: Toward Automated Recognition of Dark Patterns in User Interfaces

SM Hasan Mansur, Sabiha Salma, Damilola Awofisayo et al.

Past studies have illustrated the prevalence of UI dark patterns, or user interfaces that can lead end-users toward (unknowingly) taking actions that they may not have intended. Such deceptive UI designs can result in adverse effects on end users, such as oversharing personal information or financial loss. While significant research progress has been made toward the development of dark pattern taxonomies, developers and users currently lack guidance to help recognize, avoid, and navigate these often subtle design motifs. However, automated recognition of dark patterns is a challenging task, as the instantiation of a single type of pattern can take many forms, leading to significant variability. In this paper, we take the first step toward understanding the extent to which common UI dark patterns can be automatically recognized in modern software applications. To do this, we introduce AidUI, a novel automated approach that uses computer vision and natural language processing techniques to recognize a set of visual and textual cues in application screenshots that signify the presence of ten unique UI dark patterns, allowing for their detection, classification, and localization. To evaluate our approach, we have constructed ContextDP, the current largest dataset of fully-localized UI dark patterns that spans 175 mobile and 83 web UI screenshots containing 301 dark pattern instances. The results of our evaluation illustrate that \AidUI achieves an overall precision of 0.66, recall of 0.67, F1-score of 0.65 in detecting dark pattern instances, reports few false positives, and is able to localize detected patterns with an IoU score of ~0.84. Furthermore, a significant subset of our studied dark patterns can be detected quite reliably (F1 score of over 0.82), and future research directions may allow for improved detection of additional patterns.

SEOct 10, 2023Code

A Comparative Study of Transformer-based Neural Text Representation Techniques on Bug Triaging

Atish Kumar Dipongkor, Kevin Moran

Often, the first step in managing bug reports is related to triaging a bug to the appropriate developer who is best suited to understand, localize, and fix the target bug. Additionally, assigning a given bug to a particular part of a software project can help to expedite the fixing process. However, despite the importance of these activities, they are quite challenging, where days can be spent on the manual triaging process. Past studies have attempted to leverage the limited textual data of bug reports to train text classification models that automate this process -- to varying degrees of success. However, the textual representations and machine learning models used in prior work are limited by their expressiveness, often failing to capture nuanced textual patterns that might otherwise aid in the triaging process. Recently, large, transformer-based, pre-trained neural text representation techniques such as BERT have achieved greater performance in several natural language processing tasks. However, the potential for using these techniques to improve upon prior approaches for automated bug triaging is not well studied or understood. Therefore, in this paper we offer one of the first investigations that fine-tunes transformer-based language models for the task of bug triaging on four open source datasets, spanning a collective 53 years of development history with over 400 developers and over 150 software project components. Our study includes both a quantitative and qualitative analysis of effectiveness. Our findings illustrate that DeBERTa is the most effective technique across the triaging tasks of developer and component assignment, and the measured performance delta is statistically significant compared to other techniques. However, through our qualitative analysis, we also observe that each technique possesses unique abilities best suited to certain types of bug reports.

SEJul 12, 2024

Towards More Trustworthy and Interpretable LLMs for Code through Syntax-Grounded Explanations

David N. Palacio, Daniel Rodriguez-Cardenas, Alejandro Velasco et al.

Trustworthiness and interpretability are inextricably linked concepts for LLMs. The more interpretable an LLM is, the more trustworthy it becomes. However, current techniques for interpreting LLMs when applied to code-related tasks largely focus on accuracy measurements, measures of how models react to change, or individual task performance instead of the fine-grained explanations needed at prediction time for greater interpretability, and hence trust. To improve upon this status quo, this paper introduces ASTrust, an interpretability method for LLMs of code that generates explanations grounded in the relationship between model confidence and syntactic structures of programming languages. ASTrust explains generated code in the context of syntax categories based on Abstract Syntax Trees and aids practitioners in understanding model predictions at both local (individual code snippets) and global (larger datasets of code) levels. By distributing and assigning model confidence scores to well-known syntactic structures that exist within ASTs, our approach moves beyond prior techniques that perform token-level confidence mapping by offering a view of model confidence that directly aligns with programming language concepts with which developers are familiar. To put ASTrust into practice, we developed an automated visualization that illustrates the aggregated model confidence scores superimposed on sequence, heat-map, and graph-based visuals of syntactic structures from ASTs. We examine both the practical benefit that ASTrust can provide through a data science study on 12 popular LLMs on a curated set of GitHub repos and the usefulness of ASTrust through a human study.

SEJul 11, 2024

Semantic GUI Scene Learning and Video Alignment for Detecting Duplicate Video-based Bug Reports

Yanfu Yan, Nathan Cooper, Oscar Chaparro et al.

Video-based bug reports are increasingly being used to document bugs for programs centered around a graphical user interface (GUI). However, developing automated techniques to manage video-based reports is challenging as it requires identifying and understanding often nuanced visual patterns that capture key information about a reported bug. In this paper, we aim to overcome these challenges by advancing the bug report management task of duplicate detection for video-based reports. To this end, we introduce a new approach, called JANUS, that adapts the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens - which is key to differentiating between similar screens for accurate duplicate report detection. JANUS also makes use of a video alignment technique capable of adaptive weighting of video frames to account for typical bug manifestation patterns. In a comprehensive evaluation on a benchmark containing 7,290 duplicate detection tasks derived from 270 video-based bug reports from 90 Android app bugs, the best configuration of our approach achieves an overall mRR/mAP of 89.8%/84.7%, and for the large majority of duplicate detection tasks, outperforms prior work by around 9% to a statistically significant degree. Finally, we qualitatively illustrate how the scene-learning capabilities provided by Janus benefits its performance.

CRJul 15, 2021Code

Mutation-based Evaluation of Cryptographic API Misuse Detectors

Amit Seal Ami, Scott Marsden, Kevin Moran et al.

The correct use of cryptography is central to ensuring data security in modern software systems. Hence, several academic and commercial static analysis tools have been developed for detecting and mitigating crypto-API misuse. While developers are optimistically adopting these crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of their effectiveness at finding crypto-API misuse in practice. This paper describes the MASC framework, which enables a systematic and data-driven evaluation of crypto-detectors using mutation testing. We ground MASC in a comprehensive view of the problem space by developing a data-driven taxonomy of existing crypto-API misuse, containing 107 misuse cases organized among nine semantic clusters. We develop 19 generalizable usage-based mutation operators and three mutation scopes that can expressively instantiate thousands of compilable variants of the misuse cases for thoroughly evaluating crypto-detectors. Using MASC, in a previous study, we evaluated nine major crypto-detectors and discovered 19 unique, undocumented flaws that severely impact the ability of crypto-detectors to discover misuses in practice. This paper substantially extends our MASC framework and offers updated evaluation of the crypto-detectors in our 2022 study, in addition to 5 more, major crypto-detectors. Through this work, we find 6 new, undocumented flaws, and demonstrate that these flaws affect the crypto-detectors regardless of their origin; open-source community, industry, and/or research. We conclude with a discussion on the diverse perspectives that influence the design of crypto-detectors and future directions towards building security-focused crypto-detectors by design.

SEAug 1, 2019Code

Learning to Identify Security-Related Issues Using Convolutional Neural Networks

David N. Palacio, Daniel McCrystal, Kevin Moran et al.

Software security is becoming a high priority for both large companies and start-ups alike due to the increasing potential for harm that vulnerabilities and breaches carry with them. However, attaining robust security assurance while delivering features requires a precarious balancing act in the context of agile development practices. One path forward to help aid development teams in securing their software products is through the design and development of security-focused automation. Ergo, we present a novel approach, called SecureReqNet, for automatically identifying whether issues in software issue tracking systems describe security-related content. Our approach consists of a two-phase neural net architecture that operates purely on the natural language descriptions of issues. The first phase of our approach learns high dimensional word embeddings from hundreds of thousands of vulnerability descriptions listed in the CVE database and issue descriptions extracted from open source projects. The second phase then utilizes the semantic ontology represented by these embeddings to train a convolutional neural network capable of predicting whether a given issue is security-related. We evaluated SecureReqNet by applying it to identify security-related issues from a dataset of thousands of issues mined from popular projects on GitLab and GitHub. In addition, we also applied our approach to identify security-related requirements from a commercial software project developed by a major telecommunication company. Our preliminary results are encouraging, with SecureReqNet achieving an accuracy of 96% on open source issues and 71.6% on industrial requirements.

SEFeb 13, 2018Code

Automated Reporting of GUI Design Violations for Mobile Apps

Kevin Moran, Boyang Li, Carlos Bernal-Cárdenas et al.

The inception of a mobile app often takes form of a mock-up of the Graphical User Interface (GUI), represented as a static image delineating the proper layout and style of GUI widgets that satisfy requirements. Following this initial mock-up, the design artifacts are then handed off to developers whose goal is to accurately implement these GUIs and the desired functionality in code. Given the sizable abstraction gap between mock-ups and code, developers often introduce mistakes related to the GUI that can negatively impact an app's success in highly competitive marketplaces. Moreover, such mistakes are common in the evolutionary context of rapidly changing apps. This leads to the time-consuming and laborious task of design teams verifying that each screen of an app was implemented according to intended design specifications. This paper introduces a novel, automated approach for verifying whether the GUI of a mobile app was implemented according to its intended design. Our approach resolves GUI-related information from both implemented apps and mock-ups and uses computer vision techniques to identify common errors in the implementations of mobile GUIs. We implemented this approach for Android in a tool called GVT and carried out both a controlled empirical evaluation with open-source apps as well as an industrial evaluation with designers and developers from Huawei. The results show that GVT solves an important, difficult, and highly practical problem with remarkable efficiency and accuracy and is both useful and scalable from the point of view of industrial designers and developers. The tool is currently used by over one-thousand industrial designers and developers at Huawei to improve the quality of their mobile apps.

SEJan 19, 2018Code

Mining Android App Usages for Generating Actionable GUI-based Execution Scenarios

Mario Linares-Vasquez, Martin White, Carlos Bernal-Cardenas et al.

GUI-based models extracted from Android app execution traces, events, or source code can be extremely useful for challenging tasks such as the generation of scenarios or test cases. However, extracting effective models can be an expensive process. Moreover, existing approaches for automatically deriving GUI-based models are not able to generate scenarios that include events which were not observed in execution (nor event) traces. In this paper, we address these and other major challenges in our novel hybrid approach, coined as MonkeyLab. Our approach is based on the Record-Mine-Generate-Validate framework, which relies on recording app usages that yield execution (event) traces, mining those event traces and generating execution scenarios using statistical language modeling, static and dynamic analyses, and validating the resulting scenarios using an interactive execution of the app on a real device. The framework aims at mining models capable of generating feasible and fully replayable (i.e., actionable) scenarios reflecting either natural user behavior or uncommon usages (e.g., corner cases) for a given app. We evaluated MONKEYLAB in a case study involving several medium-to-large open-source Android apps. Our results demonstrate that MonkeyLab is able to mine GUI-based models that can be used to generate actionable execution scenarios for both natural and unnatural sequences of events on Google Nexus 7 tablets.

SEJan 19, 2018Code

How do Developers Test Android Applications?

Mario Linares Vasquez, Carlos Bernal-Cardenas, Kevin Moran et al.

Enabling fully automated testing of mobile applications has recently become an important topic of study for both researchers and practitioners. A plethora of tools and approaches have been proposed to aid mobile developers both by augmenting manual testing practices and by automating various parts of the testing process. However, current approaches for automated testing fall short in convincing developers about their benefits, leading to a majority of mobile testing being performed manually. With the goal of helping researchers and practitioners - who design approaches supporting mobile testing - to understand developer's needs, we analyzed survey responses from 102 open source contributors to Android projects about their practices when performing testing. The survey focused on questions regarding practices and preferences of developers/testers in-the-wild for (i) designing and generating test cases, (ii) automated testing practices, and (iii) perceptions of quality metrics such as code coverage for determining test quality. Analyzing the information gleaned from this survey, we compile a body of knowledge to help guide researchers and professionals toward tailoring new automated testing approaches to the need of a diverse set of open source developers.

SEJan 18, 2018Code

Enhancing Bug Reports for Mobile Apps

Kevin Moran

The modern software development landscape has seen a shift in focus toward mobile applications as "smart" devices near ubiquitous adoption. Due to this trend, the complexity of mobile applications has been increasing, making development and maintenance particularly challenging. However, it is clear that current bug tracking systems are not able effectively support construction of reports with actionable information that will directly lead to a bug's resolution. To address the need for an improved reporting system, we introduce a novel solution, called FUSION, that helps users auto-complete reproduction steps in bug reports for mobile apps. FUSION links information, that users provide, to program artifacts extracted through static and dynamic analysis performed before testing or release. The approach that FUSION employs is generalizable to other current mobile software platforms, and constitutes a new method by which off-device bug reporting can be conducted for mobile software projects. We evaluate FUSION by conducting a study that quantitatively and qualitatively measures the user experience of the system for both reporting and reproducing bugs, as well as the quality of the bug reports it produces. In a study involving 28 participants we apply FUSION to support the maintenance tasks of reporting and reproducing defects on 15 real-world bugs found in 14 open source Android apps. Our results demonstrate that FUSION allows for more reliable reproduction of bugs from reports by aiding users in reporting more detailed application-specific information compared to traditional bug tracking systems.

SEMay 11, 2017Code

Auto-completing Bug Reports for Android Applications

Kevin Moran, Mario Linares-Vásquez, Carlos Bernal-Cárdenas et al.

The modern software development landscape has seen a shift in focus toward mobile applications as tablets and smartphones near ubiquitous adoption. Due to this trend, the complexity of these apps has been increasing, making development and maintenance challenging. Additionally, current bug tracking systems are not able to effectively support construction of reports with actionable information that directly lead to a bug's resolution. To address the need for an improved reporting system, we introduce a novel solution, called FUSION, that helps users auto complete reproduction steps in bug reports for mobile apps. FUSION links user provided information to program artifacts extracted through static and dynamic analysis performed before testing or release. The approach that FUSION employs is generalizable to other current mobile software platforms, and constitutes a new method by which off device bug reporting can be conducted for mobile software projects. In a study involving 28 participants we applied FUSION to support the maintenance tasks of reporting and reproducing defects from 15 real world bugs found in 14 open source Android apps while qualitatively and qualitatively measuring the user experience of the system. Our results demonstrate that FUSION both effectively facilitates reporting and allows for more reliable reproduction of bugs from reports compared to traditional issue tracking systems by presenting more detailed contextual app information.

SEMar 20, 2024

MotorEase: Automated Detection of Motor Impairment Accessibility Issues in Mobile App UIs

Arun Krishnavajjala, SM Hasan Mansur, Justin Jose et al.

Recent research has begun to examine the potential of automatically finding and fixing accessibility issues that manifest in software. However, while recent work makes important progress, it has generally been skewed toward identifying issues that affect users with certain disabilities, such as those with visual or hearing impairments. However, there are other groups of users with different types of disabilities that also need software tooling support to improve their experience. As such, this paper aims to automatically identify accessibility issues that affect users with motor-impairments. To move toward this goal, this paper introduces a novel approach, called MotorEase, capable of identifying accessibility issues in mobile app UIs that impact motor-impaired users. Motor-impaired users often have limited ability to interact with touch-based devices, and instead may make use of a switch or other assistive mechanism -- hence UIs must be designed to support both limited touch gestures and the use of assistive devices. MotorEase adapts computer vision and text processing techniques to enable a semantic understanding of app UI screens, enabling the detection of violations related to four popular, previously unexplored UI design guidelines that support motor-impaired users, including: (i) visual touch target size, (ii) expanding sections, (iii) persisting elements, and (iv) adjacent icon visual distance. We evaluate MotorEase on a newly derived benchmark, called MotorCheck, that contains 555 manually annotated examples of violations to the above accessibility guidelines, across 1599 screens collected from 70 applications via a mobile app testing tool. Our experiments illustrate that MotorEase is able to identify violations with an average accuracy of ~90%, and a false positive rate of less than 9%, outperforming baseline techniques.

SEFeb 6, 2025

Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps

Junayed Mahmud, Antu Saha, Oscar Chaparro et al.

Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most common components of reports that are problematic is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging, and requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language to program connections - due to issues in language variability and limitations of information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and we found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).

SEApr 1, 2024

AURORA: Navigating UI Tarpits via Automated Neural Screen Understanding

Safwat Ali Khan, Wenyu Wang, Yiran Ren et al.

Nearly a decade of research in software engineering has focused on automating mobile app testing to help engineers in overcoming the unique challenges associated with the software platform. Much of this work has come in the form of Automated Input Generation tools (AIG tools) that dynamically explore app screens. However, such tools have repeatedly been demonstrated to achieve lower-than-expected code coverage - particularly on sophisticated proprietary apps. Prior work has illustrated that a primary cause of these coverage deficiencies is related to so-called tarpits, or complex screens that are difficult to navigate. In this paper, we take a critical step toward enabling AIG tools to effectively navigate tarpits during app exploration through a new form of automated semantic screen understanding. We introduce AURORA, a technique that learns from the visual and textual patterns that exist in mobile app UIs to automatically detect common screen designs and navigate them accordingly. The key idea of AURORA is that there are a finite number of mobile app screen designs, albeit with subtle variations, such that the general patterns of different categories of UI designs can be learned. As such, AURORA employs a multi-modal, neural screen classifier that is able to recognize the most common types of UI screen designs. After recognizing a given screen, it then applies a set of flexible and generalizable heuristics to properly navigate the screen. We evaluated AURORA both on a set of 12 apps with known tarpits from prior work, and on a new set of five of the most popular apps from the Google Play store. Our results indicate that AURORA is able to effectively navigate tarpit screens, outperforming prior approaches that avoid tarpits by 19.6% in terms of method coverage. The improvements can be attributed to AURORA's UI design classification and heuristic navigation techniques.

36.6SEApr 1

Automated Generation of High-Quality Bug Reports for Android Applications

Antu Saha, Atish Kumar Dipongkor, Sam Bennett et al.

Most defects in mobile applications are visually observable on the device screen. To track these defects, users, testers, and developers must manually submit bug reports, especially in the absence of crashes. However, these reports are frequently ambiguous or inaccurate, often omitting essential components such as the Observed Behavior (OB), Expected Behavior (EB), or Steps to Reproduce (S2Rs). Low-quality reports hinder developers' ability to understand and reproduce defects, delaying resolution and leading to incorrect or unresolvable fixes. In this paper, we posit that providing specific app-related information (e.g., GUI interactions or specific screens where bugs appear) to LLMs as key points of context can assist in automatically generating clear, detailed, and accurate OB, EB, and S2Rs. We built and evaluated a novel approach, BugScribe, that generates bug reports in this way. To support the evaluation, we introduce a unified quality framework that defines correctness and completeness dimensions for OB, EB, and S2Rs. Using 48 bug reports from 26 Android apps, we show that BugScribe produces higher-quality and more accurate components than the original reports and outperforms recent LLM-based baselines. We envision that BugScribe can serve as a practical assistant for testers and developers by enhancing incomplete bug reports with reliable and accurate OB, EB, and S2Rs, thereby streamlining bug resolution and improving mobile app quality.

CLJun 30, 2025

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Daking Rai, Samuel Miller, Kevin Moran et al.

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms''), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms''). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from $0$% to around $100$% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around $20$%.

CLApr 15, 2025

Automated Python Translation

Joshua Otten, Antonios Anastasopoulos, Kevin Moran

Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However, those who do not speak English may not experience these advantages. In fact, they may even be hindered in their ability to understand Python code, as the English nature of its terms creates an additional layer of overhead. To that end, we introduce the task of automatically translating Python's natural modality (keywords, error types, identifiers, etc.) into other human languages. This presents a unique challenge, considering the abbreviated nature of these forms, as well as potential untranslatability of advanced mathematical/programming concepts across languages. We therefore create an automated pipeline to translate Python into other human languages, comparing strategies using machine translation and large language models. We then use this pipeline to acquire translations from five common Python libraries (pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a quality test on a subset of these terms in French, Greek, and Bengali. We hope this will provide a clearer path forward towards creating a universal Python, accessible to anyone regardless of nationality or language background.

SEDec 6, 2024

On Interpreting the Effectiveness of Unsupervised Software Traceability with Information Theory

David N. Palacio, Daniel Rodriguez-Cardenas, Denys Poshyvanyk et al.

Traceability is a cornerstone of modern software development, ensuring system reliability and facilitating software maintenance. While unsupervised techniques leveraging Information Retrieval (IR) and Machine Learning (ML) methods have been widely used for predicting trace links, their effectiveness remains underexplored. In particular, these techniques often assume traceability patterns are present within textual data - a premise that may not hold universally. Moreover, standard evaluation metrics such as precision, recall, accuracy, or F1 measure can misrepresent the model performance when underlying data distributions are not properly analyzed. Given that automated traceability techniques tend to struggle to establish links, we need further insight into the information limits related to traceability artifacts. In this paper, we propose an approach, TraceXplainer, for using information theory metrics to evaluate and better understand the performance (limits) of unsupervised traceability techniques. Specifically, we introduce self-information, cross-entropy, and mutual information (MI) as metrics to measure the informativeness and reliability of traceability links. Through a comprehensive replication and analysis of well-studied datasets and techniques, we investigate the effectiveness of unsupervised techniques that predict traceability links using IR/ML. This application of TraceXplainer illustrates an imbalance in typical traceability datasets where the source code has on average 1.48 more information bits (i.e., entropy) than the linked documentation. Additionally, we demonstrate that an average MI of 4.81 bits, loss of 1.75, and noise of 0.28 bits signify that there are information-theoretic limits on the effectiveness of unsupervised traceability techniques. We hope these findings spur additional research on understanding the limits and progress of traceability research.

SEJun 15, 2021

Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob et al.

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy that can be used to drive future research efforts

SEJun 15, 2021

AndroR2: A Dataset of Manually Reproduced Bug Reports for Android Applications

Tyler Wendland, Jingyang Sun, Junayed Mahmud et al.

Software maintenance constitutes a large portion of the software development lifecycle. To carry out maintenance tasks, developers often need to understand and reproduce bug reports. As such, there has been increasing research activity coalescing around the notion of automating various activities related to bug reporting. A sizable portion of this research interest has focused on the domain of mobile apps. However, as research around mobile app bug reporting progresses, there is a clear need for a manually vetted and reproducible set of real-world bug reports that can serve as a benchmark for future work. This paper presents ANDROR2: a dataset of 90 manually reproduced bug reports for Android apps listed on Google Play and hosted on GitHub, systematically collected via an in-depth analysis of 459 reports extracted from the GitHub issue tracker. For each reproduced report, ANDROR2 includes the original bug report, an apk file for the buggy version of the app, an executable reproduction script, and metadata regarding the quality of the reproduction steps associated with the original report. We believe that the ANDROR2 dataset can be used to facilitate research in automatically analyzing, understanding, reproducing, localizing, and fixing bugs for mobile applications as well as other software maintenance activities more broadly.

SEMar 8, 2021

V2S: A Tool for Translating Video Recordings of Mobile App Usages into Replayable Scenarios

Madeleine Havranek, Carlos Bernal-Cárdenas, Nathan Cooper et al.

Screen recordings are becoming increasingly important as rich software artifacts that inform mobile application development processes. However, the amount of manual effort required to extract information from these graphical artifacts can hinder resource-constrained mobile developers. This paper presents Video2Scenario (V2S), an automated tool that processes video recordings of Android app usages, utilizes neural object detection and image classification techniques to classify the depicted user actions, and translates these actions into a replayable scenario. We conducted a comprehensive evaluation to demonstrate V2S's ability to reproduce recorded scenarios across a range of devices and a diverse set of usage cases and applications. The results indicate that, based on its performance with 175 videos depicting 3,534 GUI-based actions, V2S is accurate in reproducing $\approx$89\% of actions from collected videos.

CRFeb 13, 2021

Systematic Mutation-based Evaluation of the Soundness of Security-focused Android Static Analysis Techniques

Amit Seal Ami, Kaushal Kafle, Kevin Moran et al.

Mobile application security has been a major area of focus for security research over the course of the last decade. Numerous application analysis tools have been proposed in response to malicious, curious, or vulnerable apps. However, existing tools, and specifically, static analysis tools, trade soundness of the analysis for precision and performance and are hence soundy. Unfortunately, the specific unsound choices or flaws in the design of these tools is often not known or well-documented, leading to misplaced confidence among researchers, developers, and users. This paper describes the Mutation-based Soundness Evaluation ($μ$SE) framework, which systematically evaluates Android static analysis tools to discover, document, and fix flaws, by leveraging the well-founded practice of mutation analysis. We implemented $μ$SE and applied it to a set of prominent Android static analysis tools that detect private data leaks in apps. In a study conducted previously, we used $μ$SE to discover $13$ previously undocumented flaws in FlowDroid, one of the most prominent data leak detectors for Android apps. Moreover, we discovered that flaws also propagated to other tools that build upon the design or implementation of FlowDroid or its components. This paper substantially extends our $μ$SE framework and offers an new in-depth analysis of two more major tools in our 2020 study, we find $12$ new, undocumented flaws and demonstrate that all $25$ flaws are found in more than one tool, regardless of any inheritance-relation among the tools. Our results motivate the need for systematic discovery and documentation of unsound choices in soundy tools and demonstrate the opportunities in leveraging mutation testing in achieving this goal.

SEFeb 12, 2021

$μ$SE: Mutation-based Evaluation of Security-focused Static Analysis Tools for Android

Amit Seal Ami, Kaushal Kafle, Kevin Moran et al.

This demo paper presents the technical details and usage scenarios of $μ$SE: a mutation-based tool for evaluating security-focused static analysis tools for Android. Mutation testing is generally used by software practitioners to assess the robustness of a given test-suite. However, we leverage this technique to systematically evaluate static analysis tools and uncover and document soundness issues. $μ$SE's analysis has found 25 previously undocumented flaws in static data leak detection tools for Android. $μ$SE offers four mutation schemes, namely Reachability, Complex-reachability, TaintSink, and ScopeSink, which determine the locations of seeded mutants. Furthermore, the user can extend $μ$SE by customizing the API calls targeted by the mutation analysis. $μ$SE is also practical, as it makes use of filtering techniques based on compilation and execution criteria that reduces the number of ineffective mutations.

SEJan 22, 2021

It Takes Two to Tango: Combining Visual and Textual Information for Detecting Duplicate Video-Based Bug Reports

Nathan Cooper, Carlos Bernal-Cárdenas, Oscar Chaparro et al.

When a bug manifests in a user-facing application, it is likely to be exposed through the graphical user interface (GUI). Given the importance of visual information to the process of identifying and understanding such bugs, users are increasingly making use of screenshots and screen-recordings as a means to report issues to developers. However, when such information is reported en masse, such as during crowd-sourced testing, managing these artifacts can be a time-consuming process. As the reporting of screen-recordings in particular becomes more popular, developers are likely to face challenges related to manually identifying videos that depict duplicate bugs. Due to their graphical nature, screen-recordings present challenges for automated analysis that preclude the use of current duplicate bug report detection techniques. To overcome these challenges and aid developers in this task, this paper presents Tango, a duplicate detection technique that operates purely on video-based bug reports by leveraging both visual and textual information. Tango combines tailored computer vision techniques, optical character recognition, and text retrieval. We evaluated multiple configurations of Tango in a comprehensive empirical evaluation on 4,860 duplicate detection tasks that involved a total of 180 screen-recordings from six Android apps. Additionally, we conducted a user study investigating the effort required for developers to manually detect duplicate video-based bug reports and compared this to the effort required to use Tango. The results reveal that Tango's optimal configuration is highly effective at detecting duplicate video-based bug reports, accurately ranking target duplicate videos in the top-2 returned results in 83% of the tasks. Additionally, our user study shows that, on average, Tango can reduce developer effort by over 60%, illustrating its practicality.

SESep 17, 2020

Deep Learning & Software Engineering: State of Research and Future Directions

Prem Devanbu, Matthew Dwyer, Sebastian Elbaum et al.

Given the current transformative potential of research that sits at the intersection of Deep Learning (DL) and Software Engineering (SE), an NSF-sponsored community workshop was conducted in co-location with the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE'19) in San Diego, California. The goal of this workshop was to outline high priority areas for cross-cutting research. While a multitude of exciting directions for future work were identified, this report provides a general summary of the research areas representing the areas of highest priority which were discussed at the workshop. The intent of this report is to serve as a potential roadmap to guide future work that sits at the intersection of SE & DL.

SESep 14, 2020

A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research

Cody Watson, Nathan Cooper, David Nader Palacio et al.

An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL). The popularity of such techniques largely stems from their automated feature engineering capabilities, which aid in modeling software artifacts. However, due to the rapid pace at which DL techniques have been adopted, it is difficult to distill the current successes, failures, and opportunities of the current research landscape. In an effort to bring clarity to this crosscutting area of work, from its modern inception to the present, this paper presents a systematic literature review of research at the intersection of SE & DL. The review canvases work appearing in the most prominent SE and DL conferences and journals and spans 128 papers across 23 unique SE tasks. We center our analysis around the components of learning, a set of principles that govern the application of machine learning techniques (ML) to a given problem domain, discussing several aspects of the surveyed work at a granular level. The end result of our analysis is a research roadmap that both delineates the foundations of DL techniques applied to SE research, and highlights likely areas of fertile exploration for the future.

SEMay 18, 2020

Translating Video Recordings of Mobile App Usages into Replayable Scenarios

Carlos Bernal-Cárdenas, Nathan Cooper, Kevin Moran et al.

Screen recordings of mobile applications are easy to obtain and capture a wealth of information pertinent to software developers (e.g., bugs or feature requests), making them a popular mechanism for crowdsourced app feedback. Thus, these videos are becoming a common artifact that developers must manage. In light of unique mobile development constraints, including swift release cycles and rapidly evolving platforms, automated techniques for analyzing all types of rich software artifacts provide benefit to mobile developers. Unfortunately, automatically analyzing screen recordings presents serious challenges, due to their graphical nature, compared to other types of (textual) artifacts. To address these challenges, this paper introduces V2S, a lightweight, automated approach for translating video recordings of Android app usages into replayable scenarios. V2S is based primarily on computer vision techniques and adapts recent solutions for object detection and image classification to detect and classify user actions captured in a video, and convert these into a replayable test scenario. We performed an extensive evaluation of V2S involving 175 videos depicting 3,534 GUI-based actions collected from users exercising features and reproducing bugs from over 80 popular Android apps. Our results illustrate that V2S can accurately replay scenarios from screen recordings, and is capable of reproducing $\approx$ 89% of our collected videos with minimal overhead. A case study with three industrial partners illustrates the potential usefulness of V2S from the viewpoint of developers.

SEMay 18, 2020

Improving the Effectiveness of Traceability Link Recovery using Hierarchical Bayesian Networks

Kevin Moran, David N. Palacio, Carlos Bernal-Cárdenas et al.

Traceability is a fundamental component of the modern software development process that helps to ensure properly functioning, secure programs. Due to the high cost of manually establishing trace links, researchers have developed automated approaches that draw relationships between pairs of textual software artifacts using similarity measures. However, the effectiveness of such techniques are often limited as they only utilize a single measure of artifact similarity and cannot simultaneously model (implicit and explicit) relationships across groups of diverse development artifacts. In this paper, we illustrate how these limitations can be overcome through the use of a tailored probabilistic model. To this end, we design and implement a HierarchiCal PrObabilistic Model for SoftwarE Traceability (Comet) that is able to infer candidate trace links. Comet is capable of modeling relationships between artifacts by combining the complementary observational prowess of multiple measures of textual similarity. Additionally, our model can holistically incorporate information from a diverse set of sources, including developer feedback and transitive (often implicit) relationships among groups of software artifacts, to improve inference accuracy. We conduct a comprehensive empirical evaluation of Comet that illustrates an improvement over a set of optimally configured baselines of $\approx$14% in the best case and $\approx$5% across all subjects in terms of average precision. The comparative effectiveness of Comet in practice, where optimal configuration is typically not possible, is likely to be higher. Finally, we illustrate Comets potential for practical applicability in a survey with developers from Cisco Systems who used a prototype Comet Jenkins plugin.

SEFeb 13, 2020

On Learning Meaningful Assert Statements for Unit Test Cases

Cody Watson, Michele Tufano, Kevin Moran et al.

Software testing is an essential part of the software lifecycle and requires a substantial amount of time and effort. It has been estimated that software developers spend close to 50% of their time on testing the code they write. For these reasons, a long standing goal within the research community is to (partially) automate software testing. While several techniques and tools have been proposed to automatically generate test methods, recent work has criticized the quality and usefulness of the assert statements they generate. Therefore, we employ a Neural Machine Translation (NMT) based approach called Atlas(AuTomatic Learning of Assert Statements) to automatically generate meaningful assert statements for test methods. Given a test method and a focal method (i.e.,the main method under test), Atlas can predict a meaningful assert statement to assess the correctness of the focal method. We applied Atlas to thousands of test methods from GitHub projects and it was able to predict the exact assert statement manually written by developers in 31% of the cases when only considering the top-1 predicted assert. When considering the top-5 predicted assert statements, Atlas is able to predict exact matches in 50% of the cases. These promising results hint to the potential usefulness ofour approach as (i) a complement to automatic test case generation techniques, and (ii) a code completion support for developers, whocan benefit from the recommended assert statements while writing test code.

CRJun 29, 2019

Helion: Enabling a Natural Perspective of Home Automation

Sunil Manandhar, Kevin Moran, Kaushal Kafle et al.

Security researchers have recently discovered significant security and safety issues related to home automation and developed approaches to address them. Such approaches often face design and evaluation challenges which arise from their restricted perspective of home automation that is bounded by the IoT apps they analyze. The challenges of past work can be overcome by relying on a deeper understanding of realistic home automation usage. More specifically, the availability of natural home automation scenarios, i.e., sequences of home automation events that may realistically occur in an end-user's home, could help security researchers design better security/safety systems. This paper presents Helion, a framework for building a natural perspective of home automation. Helion identifies the regularities in user-driven home automation, i.e., from user-driven routines that are increasingly being created by users through intuitive platform UIs. Our intuition for designing Helion is that smart home event sequences created by users exhibit an inherent set of semantic patterns, or naturalness that can be modeled and used to generate valid and useful scenarios. To evaluate our approach, we first empirically demonstrate that this naturalness hypothesis holds, with a corpus of 30,518 home automation events, constructed from 273 routines collected from 40 users. We then demonstrate that the scenarios generated by Helion are reasonable and valid from an end-user perspective, through an evaluation with 16 external evaluators. We further show the usefulness of Helion's scenarios by generating 17 home security/safety policies with significantly less effort than existing approaches. We conclude by discussing key takeaways and future research challenges enabled by Helion's natural perspective of home automation.

SEJun 17, 2019

Assessing the Quality of the Steps to Reproduce in Bug Reports

Oscar Chaparro, Carlos Bernal-Cardenas, Jing Lu et al.

A major problem with user-written bug reports, indicated by developers and documented by researchers, is the (lack of high) quality of the reported steps to reproduce the bugs. Low-quality steps to reproduce lead to excessive manual effort spent on bug triage and resolution. This paper proposes Euler, an approach that automatically identifies and assesses the quality of the steps to reproduce in a bug report, providing feedback to the reporters, which they can use to improve the bug report. The feedback provided by Euler was assessed by external evaluators and the results indicate that Euler correctly identified 98% of the existing steps to reproduce and 58% of the missing ones, while 73% of its quality annotations are correct.

SEJan 3, 2019

Guigle: A GUI Search Engine for Android Apps

Carlos Bernal-Cardenas, Kevin Moran, Michele Tufano et al.

The process of developing a mobile application typically starts with the ideation and conceptualization of its user interface. This concept is then translated into a set of mock-ups to help determine how well the user interface embodies the intended features of the app. After the creation of mock-ups developers then translate it into an app that runs in a mobile device. In this paper we propose an approach, called GUIGLE, that aims to facilitate the process of conceptualizing the user interface of an app through GUI search. GUIGLE indexes GUI images and metadata extracted using automated dynamic analysis on a large corpus of apps extracted from Google Play. To perform a search, our approach uses information from text displayed on a screen, user interface components, the app name, and screen color palettes to retrieve relevant screens given a query. Furthermore, we provide a lightweight query language that allows for intuitive search of screens. We evaluate GUIGLE with real users and found that, on average, 68.8% of returned screens were relevant to the specified query. Additionally, users found the various different features of GUIGLE useful, indicating that our search engine provides an intuitive user experience. Finally, users agree that the information presented by GUIGLE is useful in conceptualizing the design of new screens for applications.

CRDec 4, 2018

A Study of Data Store-based Home Automation

Kaushal Kafle, Kevin Moran, Sunil Manandhar et al.

Home automation platforms provide a new level of convenience by enabling consumers to automate various aspects of physical objects in their homes. While the convenience is beneficial, security flaws in the platforms or integrated third-party products can have serious consequences for the integrity of a user's physical environment. In this paper we perform a systematic security evaluation of two popular smart home platforms, Google's Nest platform and Philips Hue, that implement home automation "routines" (i.e., trigger-action programs involving apps and devices) via manipulation of state variables in a centralized data store. Our semi-automated analysis examines, among other things, platform access control enforcement, the rigor of non-system enforcement procedures, and the potential for misuse of routines. This analysis results in ten key findings with serious security implications. For instance, we demonstrate the potential for the misuse of smart home routines in the Nest platform to perform a lateral privilege escalation, illustrate how Nest's product review system is ineffective at preventing multiple stages of this attack that it examines, and demonstrate how emerging platforms may fail to provide even bare-minimum security by allowing apps to arbitrarily add/remove other apps from the user's smart home. Our findings draw attention to the unique security challenges of platforms that execute routines via centralized data stores and highlight the importance of enforcing security by design in emerging home automation platforms.

SEJul 25, 2018

Detecting and Summarizing GUI Changes in Evolving Mobile Apps

Kevin Moran, Cody Watson, John Hoskins et al.

Mobile applications have become a popular software development domain in recent years due in part to a large user base, capable hardware, and accessible platforms. However, mobile developers also face unique challenges, including pressure for frequent releases to keep pace with rapid platform evolution, hardware iteration, and user feedback. Due to this rapid pace of evolution, developers need automated support for documenting the changes made to their apps in order to aid in program comprehension. One of the more challenging types of changes to document in mobile apps are those made to the graphical user interface (GUI) due to its abstract, pixel-based representation. In this paper, we present a fully automated approach, called GCAT, for detecting and summarizing GUI changes during the evolution of mobile apps. GCAT leverages computer vision techniques and natural language generation to accurately and concisely summarize changes made to the GUI of a mobile app between successive commits or releases. We evaluate the performance of our approach in terms of its precision and recall in detecting GUI changes compared to developer specified changes, and investigate the utility of the generated change reports in a controlled user study. Our results indicate that GCAT is capable of accurately detecting and classifying GUI changes - outperforming developers - while providing useful documentation.

SEJul 23, 2018

Assessing Test Case Prioritization on Real Faults and Mutants

Qi Luo, Kevin Moran, Denys Poshyvanyk et al.

Test Case Prioritization (TCP) is an important component of regression testing, allowing for earlier detection of faults or helping to reduce testing time and cost. While several TCP approaches exist in the research literature, a growing number of studies have evaluated them against synthetic software defects, called mutants. Hence, it is currently unclear to what extent TCP performance on mutants would be representative of the performance achieved on real faults. To answer this fundamental question, we conduct the first empirical study comparing the performance of TCP techniques applied to both real-world and mutation faults. The context of our study includes eight well-studied TCP approaches, 35k+ mutation faults, and 357 real-world faults from five Java systems in the Defects4J dataset. Our results indicate that the relative performance of the studied TCP techniques on mutants may not strongly correlate with performance on real faults, depending upon attributes of the subject programs. This suggests that, in certain contexts, the best performing technique on a set of mutants may not be the best technique in practice when applied to real faults. We also illustrate that these correlations vary for mutants generated by different operators depending on whether chosen operators reflect typical faults of a subject program. This highlights the importance, particularly for TCP, of developing mutation operators tailored for specific program domains.

SEJul 18, 2018

Automating Software Development for Mobile Computing Platforms (Doctoral Symposium)

Kevin Moran

Mobile devices such as smartphones and tablets have become ubiquitous in today's modern computing landscape. The applications that run on these mobile devices (often referred to as "apps") have become a primary means of computing for millions of users and, as such, have garnered immense developer interest. These apps allow for unique, personal software experiences through touch-based UIs and a complex assortment of sensors. However designing and implementing high quality mobile apps can be a difficult process. This is primarily due to challenges unique to mobile development including change-prone APIs and platform fragmentation, just to name a few. This paper presents the motivation and an overview of a dissertation which presents new approaches for automating and improving mobile app design and development practices. Additionally, this paper discusses potential avenues for future research based upon the work conducted, as well as general lessons learned during the author's tenure as a doctoral student in the general areas of software engineering, maintenance, and evolution.

SEJul 18, 2018

Overcoming Language Dichotomies: Toward Effective Program Comprehension for Mobile App Development

Kevin Moran, Carlos Bernal Cardenas, Mario Linares Vasquez et al.

Mobile devices and platforms have become an established target for modern software developers due to performant hardware and a large and growing user base numbering in the billions. Despite their popularity, the software development process for mobile apps comes with a set of unique, domain-specific challenges rooted in program comprehension. Many of these challenges stem from developer difficulties in reasoning about different representations of a program, a phenomenon we define as a "language dichotomy". In this paper, we reflect upon the various language dichotomies that contribute to open problems in program comprehension and development for mobile apps. Furthermore, to help guide the research community towards effective solutions for these problems, we provide a roadmap of directions for future work.

SEJun 26, 2018

How Do Static and Dynamic Test Case Prioritization Techniques Perform on Modern Software Systems? An Extensive Study on GitHub Projects

Qi Luo, Kevin Moran, Lingming Zhang et al.

Test Case Prioritization (TCP) is an increasingly important regression testing technique for reordering test cases according to a pre-defined goal, particularly as agile practices gain adoption. To better understand these techniques, we perform the first extensive study aimed at empirically evaluating four static TCP techniques, comparing them with state-of-research dynamic TCP techniques across several quality metrics. This study was performed on 58 real-word Java programs encompassing 714 KLoC and results in several notable observations. First, our results across two effectiveness metrics (the Average Percentage of Faults Detected APFD and the cost cognizant APFDc) illustrate that at test-class granularity, these metrics tend to correlate, but this correlation does not hold at test-method granularity. Second, our analysis shows that static techniques can be surprisingly effective, particularly when measured by APFDc. Third, we found that TCP techniques tend to perform better on larger programs, but that program size does not affect comparative performance measures between techniques. Fourth, software evolution does not significantly impact comparative performance results between TCP techniques. Fifth, neither the number nor type of mutants utilized dramatically impact measures of TCP effectiveness under typical experimental settings. Finally, our similarity analysis illustrates that highly prioritized test cases tend to uncover dissimilar faults.

CRJun 26, 2018

Discovering Flaws in Security-Focused Static Analysis Tools for Android using Systematic Mutation

Richard Bonett, Kaushal Kafle, Kevin Moran et al.

Mobile application security has been one of the major areas of security research in the last decade. Numerous application analysis tools have been proposed in response to malicious, curious, or vulnerable apps. However, existing tools, and specifically, static analysis tools, trade soundness of the analysis for precision and performance, and are hence soundy. Unfortunately, the specific unsound choices or flaws in the design of these tools are often not known or well-documented, leading to a misplaced confidence among researchers, developers, and users. This paper proposes the Mutation-based soundness evaluation ($μ$SE) framework, which systematically evaluates Android static analysis tools to discover, document, and fix, flaws, by leveraging the well-founded practice of mutation analysis. We implement $μ$SE as a semi-automated framework, and apply it to a set of prominent Android static analysis tools that detect private data leaks in apps. As the result of an in-depth analysis of one of the major tools, we discover 13 undocumented flaws. More importantly, we discover that all 13 flaws propagate to tools that inherit the flawed tool. We successfully fix one of the flaws in cooperation with the tool developers. Our results motivate the urgent need for systematic discovery and documentation of unsound choices in soundy tools, and demonstrate the opportunities in leveraging mutation testing in achieving this goal.

SEFeb 13, 2018

MDroid+: A Mutation Testing Framework for Android

Kevin Moran, Michele Tufano, Carlos Bernal-Cárdenas et al.

Mutation testing has shown great promise in assessing the effectiveness of test suites while exhibiting additional applications to test-case generation, selection, and prioritization. Traditional mutation testing typically utilizes a set of simple language specific source code transformations, called operators, to introduce faults. However, empirical studies have shown that for mutation testing to be most effective, these simple operators must be augmented with operators specific to the domain of the software under test. One challenging software domain for the application of mutation testing is that of mobile apps. While mobile devices and accompanying apps have become a mainstay of modern computing, the frameworks and patterns utilized in their development make testing and verification particularly difficult. As a step toward helping to measure and ensure the effectiveness of mobile testing practices, we introduce MDroid+, an automated framework for mutation testing of Android apps. MDroid+ includes 38 mutation operators from ten empirically derived types of Android faults and has been applied to generate over 8,000 mutants for more than 50 apps.

SEFeb 7, 2018

Machine Learning-Based Prototyping of Graphical User Interfaces for Mobile Apps

Kevin Moran, Carlos Bernal-Cárdenas, Michael Curcio et al.

It is common practice for developers of user-facing software to transform a mock-up of a graphical user interface (GUI) into code. This process takes place both at an application's inception and in an evolutionary context as GUI changes keep pace with evolving features. Unfortunately, this practice is challenging and time-consuming. In this paper, we present an approach that automates this process by enabling accurate prototyping of GUIs via three tasks: detection, classification, and assembly. First, logical components of a GUI are detected from a mock-up artifact using either computer vision techniques or mock-up metadata. Then, software repository mining, automated dynamic analysis, and deep convolutional neural networks are utilized to accurately classify GUI-components into domain-specific types (e.g., toggle-button). Finally, a data-driven, K-nearest-neighbors algorithm generates a suitable hierarchical GUI structure from which a prototype application can be automatically assembled. We implemented this approach for Android in a system called ReDraw. Our evaluation illustrates that ReDraw achieves an average GUI-component classification accuracy of 91% and assembles prototype applications that closely mirror target mock-ups in terms of visual affinity while exhibiting reasonable code structure. Interviews with industrial practitioners illustrate ReDraw's potential to improve real development workflows.

SEJan 19, 2018

Continuous, Evolutionary and Large-Scale: A New Perspective for Automated Mobile App Testing

Mario Linares Vasquez, Kevin Moran, Denys Poshyvanyk

Mobile app development involves a unique set of challenges including device fragmentation and rapidly evolving platforms, making testing a difficult task. The design space for a comprehensive mobile testing strategy includes features, inputs, potential contextual app states, and large combinations of devices and underlying platforms. Therefore, automated testing is an essential activity of the development process. However, current state of the art of automated testing tools for mobile apps poses limitations that has driven a preference for manual testing in practice. As of today, there is no comprehensive automated solution for mobile testing that overcomes fundamental issues such as automated oracles, history awareness in test cases, or automated evolution of test cases. In this perspective paper we survey the current state of the art in terms of the frameworks, tools, and services available to developers to aid in mobile testing, highlighting present shortcomings. Next, we provide commentary on current key challenges that restrict the possibility of a comprehensive, effective, and practical automated testing solution. Finally, we offer our vision of a comprehensive mobile app testing framework, complete with research agenda, that is succinctly summarized along three principles: Continuous, Evolutionary and Large-scale (CEL).

SEJan 18, 2018

FUSION: A Tool for Facilitating and Augmenting Android Bug Reporting

Kevin Moran, Mario Linares-Vasquez, Carlos Bernal-Cardenas et al.

As the popularity of mobile smart devices continues to climb the complexity of "apps" continues to increase, making the development and maintenance process challenging. Current bug tracking systems lack key features to effectively support construction of reports with actionable information that directly lead to a bug's resolution. In this demo we present the implementation of a novel bug reporting system, called Fusion, that facilitates users including reproduction steps in bug reports for mobile apps. Fusion links user-provided information to program artifacts extracted through static and dynamic analysis performed before testing or release. Results of preliminary studies demonstrate that Fusion both effectively facilitates reporting and allows for more reliable reproduction of bugs from reports compared to traditional issue tracking systems by presenting more detailed contextual app information. Tool website: www.fusion-android. com Video url: https://youtu.be/AND9h0ElxRg

SEJan 18, 2018

Fixing Bug Reporting for Mobile and GUI-Based Applications

Kevin Moran

Smartphones and tablets have established themselves as mainstays in the modern computing landscape. It is conceivable that in the near future such devices may supplant laptops and desktops, becoming many users primary means of carrying out typical computer assisted tasks. In turn, this means that mobile applications will continue on a trajectory to becoming more complex, and the primary focus of millions of developers worldwide. In order to properly create and maintain these apps developers will need support, especially with regard to the prompt confirmation and resolution of bug reports. Unfortunately, current issue tracking systems typically only implement collection of coarse grained natural language descriptions, and lack features to facilitate reporters including important information in their reports. This illustrates the lexical information gap that exists in current bug reporting systems for mobile and GUI-based apps. This paper outlines promising preliminary work towards addressing this problem and proposes a comprehensive research program which aims to implement new bug reporting mechanisms and examine the impact that they might have on related software maintenance tasks.

SEJan 18, 2018

CrashScope: A Practical Tool for Automated Testing of Android Applications

Kevin Moran, Mario Linares-Vasquez, Carlos Bernal-Cardenas et al.

Unique challenges arise when testing mobile applications due to their prevailing event-driven nature and complex contextual features (e.g. sensors, notifications). Current automated input generation approaches for Android apps are typically not practical for developers to use due to required instrumentation or platform dependence and generally do not effectively exercise contextual features. To better support developers in mobile testing tasks, in this demo we present a novel, automated tool called CrashScope. This tool explores a given Android app using systematic input generation, according to several strategies informed by static and dynamic analyses, with the intrinsic goal of triggering crashes. When a crash is detected, CrashScope generates an augmented crash report containing screenshots, detailed crash reproduction steps, the captured exception stack trace, and a fully replayable script that automatically reproduces the crash on a target device(s). Results of preliminary studies show that CrashScope is able to uncover about as many crashes as other state of the art tools, while providing detailed useful crash reports and test scripts to developers. Website: www.crashscope-android.com/crashscope-home Video url: https://youtu.be/ii6S1JF6xDw

SEJan 18, 2018

On-Device Bug Reporting for Android Applications

Kevin Moran, Richard Bonett, Carlos Bernal-Cardenas et al.

Bugs that surface in mobile applications can be difficult to reproduce and fix due to several confounding factors including the highly GUI-driven nature of mobile apps, varying contextual states, differing platform versions and device fragmentation. It is clear that developers need support in the form of automated tools that allow for more precise reporting of application defects in order to facilitate more efficient and effective bug fixes. In this paper, we present a tool aimed at supporting application testers and developers in the process of On-Device Bug Reporting. Our tool, called ODBR, leverages the uiautomator framework and low-level event stream capture to offer support for recording and replaying a series of input gesture and sensor events that describe a bug in an Android application.