Xiaoyin Wang

SE
5papers
194citations
Novelty50%
AI Score44

5 Papers

23.4SEMay 15Code
TARIPlay: A Test Framework for AR Applications based on Interactive Area Tracking in Playback Videos

Seyed Amir Mousavi, Xiaoyin Wang

As Augmented Reality (AR) becomes more and more embedded in daily life, ensuring the quality, safety, and reliability of AR applications is increasingly important. However, AR apps present unique challenges for automated testing. Unlike static GUI layouts in traditional mobile apps, AR apps acquire their interaction interface from the surrounding environment, which is volatile and non-deterministic. Recent advancements like ARCore Playback and ARKit Replay allow developers to reuse real-world scenarios by recording and playing back enriched videos, enabling more feasible automated AR testing. However, using playback videos introduces two major challenges: test inputs must be timed precisely, and interactive areas in the video are dynamic, irregular, and difficult to identify. To address these challenges, we propose TARIPlay, a framework that analyzes playback videos to detect, track, and filter proper interactive areas over time for automated testing. In particular, TARIPlay identifies viable test opportunities based on criteria like stability and visibility, then feeds this information to an automated testing engine to simulate user interactions. We perform an experiment with four open-source AR apps and nine playback videos. Evaluation results show that TARIPlay significantly outperforms the existing tool Monkey in test coverage (55.8% over 41.98% on branch coverage) of AR-related code, and can also be used to assess the quality of playback videos for testing suitability.

SEJun 11, 2024
Validating LLM-Generated Programs with Metamorphic Prompt Testing

Xiaoyin Wang, Dakai Zhu

The latest paradigm shift in software development brings in the innovation and automation afforded by Large Language Models (LLMs), showcased by Generative Pre-trained Transformer (GPT), which has shown remarkable capacity to generate code autonomously, significantly reducing the manual effort required for various programming tasks. Although, the potential benefits of LLM-generated code are vast, most notably in efficiency and rapid prototyping, as LLMs become increasingly integrated into the software development lifecycle and hence the supply chain, complex and multifaceted challenges arise as the code generated from these language models carry profound questions on quality and correctness. Research is required to comprehensively explore these critical concerns surrounding LLM-generated code. In this paper, we propose a novel solution called metamorphic prompt testing to address these challenges. Our intuitive observation is that intrinsic consistency always exists among correct code pieces but may not exist among flawed code pieces, so we can detect flaws in the code by detecting inconsistencies. Therefore, we can vary a given prompt to multiple prompts with paraphrasing, and to ask the LLM to acquire multiple versions of generated code, so that we can validate whether the semantic relations still hold in the acquired code through cross-validation. Our evaluation on HumanEval shows that metamorphic prompt testing is able to detect 75 percent of the erroneous programs generated by GPT-4, with a false positive rate of 8.6 percent.

IRFeb 24, 2020
Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning

Wei Ye, Rui Xie, Jinglei Zhang et al.

Code summarization generates brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have yet been able to effectively leverage the intrinsic connection between the two tasks as they train these tasks in a separate or pipeline manner, which means their performance can not be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over the-state-of-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task.

SEJan 4, 2019
How Reliable is the Crowdsourced Knowledge of Security Implementation?

Mengsu Chen, Felix Fischer, Na Meng et al.

Stack Overflow (SO) is the most popular online Q&A site for developers to share their expertise in solving programming issues. Given multiple answers to certain questions, developers may take the accepted answer, the answer from a person with high reputation, or the one frequently suggested. However, researchers recently observed exploitable security vulnerabilities in popular SO answers. This observation inspires us to explore the following questions: How much can we trust the security implementation suggestions on SO? If suggested answers are vulnerable, can developers rely on the community's dynamics to infer the vulnerability and identify a secure counterpart? To answer these highly important questions, we conducted a study on SO posts by contrasting secure and insecure advices with the community-given content evaluation. We investigated whether SO incentive mechanism is effective in improving security properties of distributed code examples. Moreover, we also traced duplicated answers to assess whether the community behavior facilitates propagation of secure and insecure code suggestions. We compiled 953 different groups of similar security-related code examples and labeled their security, identifying 785 secure answer posts and 644 insecure ones. Compared with secure suggestions, insecure ones had higher view counts (36,508 vs. 18,713), received a higher score (14 vs. 5), and had significantly more duplicates (3.8 vs. 3.0) on average. 34% of the posts provided by highly reputable so-called trusted users were insecure. Our findings show that there are lots of insecure snippets on SO, while the community-given feedback does not allow differentiating secure from insecure choices. Moreover, the reputation mechanism fails in indicating trustworthy users with respect to security questions, ultimately leaving other users wandering around alone in a software security minefield.

SEJul 30, 2018
Automatic Clone Recommendation for Refactoring Based on the Present and the Past

Ruru Yue, Zhe Gao, Na Meng et al.

When many clones are detected in software programs, not all clones are equally important to developers. To help developers refactor code and improve software quality, various tools were built to recommend clone-removal refactorings based on the past and the present information, such as the cohesion degree of individual clones or the co-evolution relations of clone peers. The existence of these tools inspired us to build an approach that considers as many factors as possible to more accurately recommend clones. This paper introduces CREC, a learning-based approach that recommends clones by extracting features from the current status and past history of software projects. Given a set of software repositories, CREC first automatically extracts the clone groups historically refactored (R-clones) and those not refactored (NR-clones) to construct the training set. CREC extracts 34 features to characterize the content and evolution behaviors of individual clones, as well as the spatial, syntactical, and co-change relations of clone peers. With these features, CREC trains a classifier that recommends clones for refactoring. We designed the largest feature set thus far for clone recommendation, and performed an evaluation on six large projects. The results show that our approach suggested refactorings with 83% and 76% F-scores in the within-project and cross-project settings. CREC significantly outperforms a state-of-the-art similar approach on our data set, with the latter one achieving 70% and 50% F-scores. We also compared the effectiveness of different factors and different learning algorithms.